Data Analysis in a multi-omic world

In STATegra we aim to develop tools for the new biomedical research environment: a multi-omics world. We have moved from a situation where data generation required large efforts to one where many data types are generated on a daily basis. These new data provide insights into different regulatory aspects and, more importantly, are deposited in public repositories that all researchers can access. The bottleneck in this new situation, as discussed in (Gomez-Cabrero et al, 2014), is the need for tools that handle large and heterogeneous data sets in an integrative manner.

STATegra aims to develop such tools. Present developments can be grouped into four types: (i) Explorative approaches, (ii) Pathway-based approaches, (iii) Variable selection and (iv) Network-based approaches.

(i) Explorative Approaches

Data Fusion Methods are a strategy to jointly analyse the common and distinct variability of different omics data types measured over the same set of samples. The analysis yields multiple PCA models in which the clustering of samples can be visualised both for the common variability component and for the omics-specific variability. The approach is valuable for exploring relationships across multi-omics datasets.

OmicsClustering is a clustering method based on combined, weighted distances between genes, calculated from several omics measurements. The algorithm requires a mapping strategy to assign non-gene features (such as ChIP-seq peaks) to genes. The approach is useful for revealing gene associations that arise from different regulatory characteristics of genes.
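The combination step described above can be sketched as follows. This is a minimal illustration and not the STATegRa implementation: the function names, the per-layer scaling to [0, 1], the Euclidean metric and the example weights are all assumptions made for the sketch, and non-gene features are assumed to have already been mapped to genes.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(profiles):
    """Gene-gene distances within one omics layer, scaled to [0, 1].

    Assumption: scaling by the largest distance makes layers comparable.
    Needs at least two genes in the layer.
    """
    genes = sorted(profiles)
    raw = {(g, h): euclidean(profiles[g], profiles[h])
           for i, g in enumerate(genes) for h in genes[i + 1:]}
    top = max(raw.values()) or 1.0  # avoid dividing by zero
    return {pair: d / top for pair, d in raw.items()}

def combined_distance(layers, weights):
    """Weighted average of the per-layer distances (hypothetical helper).

    Every layer must describe the same set of genes.
    """
    total = sum(weights[name] for name in layers)
    combined = {}
    for name, profiles in layers.items():
        for pair, d in pairwise(profiles).items():
            combined[pair] = combined.get(pair, 0.0) + weights[name] * d / total
    return combined

# Toy data: expression over three samples, gene-mapped ChIP-seq signal over two.
layers = {
    "expression": {"geneA": [1.0, 2.0, 3.0], "geneB": [1.1, 2.1, 2.9],
                   "geneC": [3.0, 2.0, 1.0]},
    "chip_signal": {"geneA": [0.2, 0.8], "geneB": [0.3, 0.7],
                    "geneC": [0.9, 0.1]},
}
weights = {"expression": 2.0, "chip_signal": 1.0}

dist = combined_distance(layers, weights)
closest = min(dist, key=dist.get)
print(closest, round(dist[closest], 3))
```

The resulting distance dictionary can be handed to any standard clustering routine; changing the weights shifts how much each omics layer drives the gene grouping.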

Both DataFusion and OmicsClustering are implemented in the STATegRa Bioconductor package, which has been submitted to the Bioconductor repository (see also software section).

(ii) Pathway-based approaches

Paintomics. Paintomics is a data integration approach based on the joint visualisation of different omics data types on the template of KEGG pathways. The tool displays gene expression, protein, methylation or metabolic changes at the protein or metabolite positions of the pathway map. The published version allowed integration of metabolomics and gene expression data, while the version currently under development integrates any type of omics data that can be associated with genes. The application also incorporates a multi-omic functional enrichment test to identify significant pathways while accounting for different types of omics data.
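To make the idea of a multi-omic enrichment test concrete, here is one possible sketch. The text does not state which statistic Paintomics uses, so the choices here are assumptions: a one-sided hypergeometric test per omics layer, with the per-layer p-values combined by Fisher's method. The function names and the toy counts are hypothetical.

```python
import math

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N features are drawn from M, of which n are in the pathway."""
    upper = min(n, N)
    tail = sum(math.comb(n, i) * math.comb(M - n, N - i)
               for i in range(k, upper + 1))
    return tail / math.comb(M, N)

def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k df under H0.

    For even degrees of freedom the chi-square survival function has a
    closed form, so no external statistics library is needed.
    """
    k = len(pvalues)
    half = -sum(math.log(max(p, 1e-300)) for p in pvalues)  # statistic / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

# Toy counts for one pathway (all numbers are made up for illustration).
omics = {
    "transcriptomics": {"hits_in_pathway": 8, "background": 2000,
                        "pathway_size": 40, "hits": 100},
    "metabolomics": {"hits_in_pathway": 3, "background": 200,
                     "pathway_size": 10, "hits": 20},
}

pvals = [hypergeom_sf(o["hits_in_pathway"], o["background"],
                      o["pathway_size"], o["hits"]) for o in omics.values()]
combined = fisher_combine(pvals)
print(pvals, combined)
```

Fisher's method assumes the per-layer tests are independent, which is rarely fully true for omics layers measured on the same samples, so the combined value is best read as a ranking score.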

Pathway Network Analysis. This is a methodology to create a global network of pathway interactions from gene expression data. We are currently extending the approach to include metabolomics and proteomics data. The method reveals functional relationships between pathways and the key genes involved in these interconnections.

(iii) Variable selection

Integration of omics time course datasets with maSigPro. We have applied this approach to the integration of RNA-seq and DNase-seq data. We model each omics time course using the Next-maSigPro approach, developed at CIPF, and then create a classification matrix in which the profile relationship between each gene and its associated DNase hypersensitive (DHS) regions is categorized according to their patterns of change; for example, gene expression goes up while the associated DHS regions remain flat. The genes in each expression-DHS pair category are further analyzed in terms of functional enrichment, regulatory motifs and network properties. The method is useful for exploring patterns of chromatin change associated with gene expression changes. The approach could be generalized to model pair-wise relationships between other types of time-course omics data.
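The classification matrix can be illustrated with a small sketch. In the approach described above the patterns come from fitted maSigPro models; here, purely as a stand-in assumption, each time course is labelled by its net change between the first and last time point. The function names, the tolerance and the toy data are hypothetical.

```python
from collections import Counter

def trend(profile, tolerance=0.5):
    """Label a time course 'up', 'down' or 'flat' from its net change.

    Assumption: a simple threshold replaces the model-based pattern call.
    """
    change = profile[-1] - profile[0]
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "flat"

def classification_matrix(expression, dhs_signal, gene_to_dhs):
    """Count gene-DHS pairs per (expression pattern, DHS pattern) category."""
    counts = Counter()
    for gene, regions in gene_to_dhs.items():
        for region in regions:
            counts[(trend(expression[gene]), trend(dhs_signal[region]))] += 1
    return counts

# Toy time courses (three time points) and a gene-to-DHS mapping.
expression = {"geneA": [1.0, 2.0, 3.5], "geneB": [4.0, 3.0, 1.0],
              "geneC": [2.0, 2.1, 2.2]}
dhs_signal = {"dhs1": [0.5, 0.6, 0.7], "dhs2": [0.2, 1.0, 2.0],
              "dhs3": [3.0, 2.0, 0.5], "dhs4": [1.0, 1.0, 1.1]}
gene_to_dhs = {"geneA": ["dhs1", "dhs2"], "geneB": ["dhs3"], "geneC": ["dhs4"]}

matrix = classification_matrix(expression, dhs_signal, gene_to_dhs)
for (gene_pattern, dhs_pattern), n in sorted(matrix.items()):
    print(f"expression {gene_pattern:>4} / DHS {dhs_pattern:>4}: {n}")
```

Each cell of the matrix then defines a gene set (for instance "expression up, DHS flat") that can be passed on to enrichment or motif analysis.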

Machine learning approaches. In this approach we first apply classification trees to identify clusters of genes with similar “regulatory programmes”, understood as the set of regulatory variables (methylation, microRNAs, DHS regions, etc.) that predict the expression of their associated gene. We then use structural equations to model the expression of each gene as a function of its candidate regulators.

Mens ex Machina (MXM). This package, developed by FORTH, implements the Statistically Equivalent Signatures (SES) algorithm, a feature selection method that aims to discover the minimally-sized set(s) of variables needed to optimally predict a given output.

(iv) Network-based approaches

Dynamic Network Modelling. The method analyzes ChIP-seq and DNase-seq data to infer the binding of TFs to specific gene-related hypersensitivity regions and to create a time-resolved network of TF-target binding interactions. The approach incorporates gene expression data to monitor which genes are actually expressed and sets them as seeds of new chromatin interactions at successive time points.
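The seeding logic can be sketched as below. This is a simplified reading of the description and not the project's implementation: candidate TF-target pairs per time point are assumed to have been inferred beforehand from the ChIP-seq and DNase-seq data, and the function name, the propagation rule and the toy identifiers are hypothetical.

```python
def build_dynamic_network(binding_by_time, expressed_by_time, initial_seeds):
    """Return one set of active TF -> target edges per time point.

    Assumed rule: an edge is active only if its TF is a current seed and is
    expressed; expressed targets become seeds for the following time points.
    """
    seeds = set(initial_seeds)
    network = []
    for candidate_edges, expressed in zip(binding_by_time, expressed_by_time):
        active_tfs = seeds & expressed
        edges = {(tf, target) for tf, target in candidate_edges
                 if tf in active_tfs}
        network.append(edges)
        # Expressed targets of this time point seed the next one.
        seeds = active_tfs | {target for _, target in edges
                              if target in expressed}
    return network

# Toy input: candidate binding events and expressed genes at three time points.
binding_by_time = [
    {("TF1", "TF2"), ("TF9", "geneX")},
    {("TF2", "TF3"), ("TF1", "geneZ")},
    {("TF2", "TF3"), ("TF3", "geneW")},
]
expressed_by_time = [
    {"TF1", "TF2"},
    {"TF1", "TF2", "TF3"},
    {"TF2", "TF3"},
]

network = build_dynamic_network(binding_by_time, expressed_by_time, {"TF1"})
for t, edges in enumerate(network):
    print(t, sorted(edges))
```

In the toy run, TF9 never enters the network because it is neither a seed nor expressed, while TF3 only starts regulating targets once it has been bound by TF2 and is itself expressed.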

 

Contact

Website: http://stategra.eu/
Email: info@stategra.eu

All images, figures and contents are copyrighted to Stategra Project Partners.