Causality in Genomics Studies: Time is ripe for a new Paradigm

Several paradigms have occurred in successive waves during the last two centuries and they have moved life sciences from empiric to experimental (typically hypothesisdriven) and then to data-driven, concretized by the recent entry in the “big data” era. The value of large, complex and linkable information generated by genomics and other -omics fi elds and large scale epidemiologic studies has resulted in the domination of associational reasoning, where scientists look for association/correlation between a large number of diverse and heterogeneous attributes (genes, proteins, polymorphisms, phenotypes). This associational approach based on observational data has dominated genomics studies because of the scarcity of experimental data, where some variables are controlled (fi xed to some values) or manipulated and their effects on other variables are measured. The lack of ‘experimental’ interventional data in genomics is mainly due to the logistical feasibility and high cost of their production, which requires genes and proteins experimental perturbations (gene knockout for example).


Introduction
Biological sciences have been for so long dominated by observational approaches as shown by the scarcity of studies trying to infer causality from association, starting from the historical and very instructive studies of Pierre Louis (1835) on the effi cacy of the standard treatments for pneumonia and John Snow (1855) on the causal relationship between cholera and water contamination [1].
Several paradigms have occurred in successive waves during the last two centuries and they have moved life sciences from empiric to experimental (typically hypothesisdriven) and then to data-driven, concretized by the recent entry in the "big data" era. The value of large, complex and linkable information generated by genomics and other -omics fi elds and large scale epidemiologic studies has resulted in the domination of associational reasoning, where scientists look for association/correlation between a large number of diverse and heterogeneous attributes (genes, proteins, polymorphisms, phenotypes). This associational approach based on observational data has dominated genomics studies because of the scarcity of experimental data, where some variables are controlled (fi xed to some values) or manipulated and their effects on other variables are measured. The lack of 'experimental' interventional data in genomics is mainly due to the logistical feasibility and high cost of their production, which requires genes and proteins experimental perturbations (gene knockout for example).
The analyses of data from observational genomics studies are associational by nature, when one looks to make some kind of inference and knowledge discovery from patterns of association using statistical and data mining techniques. Most of these techniques, even those that model relationships between dependent and independent variables (such linear/ logistic regression), have limited interest in inferring causality which means establishing a cause-effect relationship among variables [2], for a general and wide review on the subject). In fact, to make causal inference, we must assume that equations of correlations linking variables are invariant under proposed interventions and it is problematic to verify such assumptions without making interventions. Moreover, if the model changes when variables are subject to intervention (rather than being simply measured) then this model is of poor utility for predicting results of interventions. Although this conceptual hurdle is well understood by most investigators in life sciences, the use of such models is still widespread despite the high risks of misinterpretation [1].
Although huge efforts have been made to separate 'signal' from 'noise' in associational analyses of genomics data by requiring replication and imposing more stringent criteria for statistical signifi cance, the major weakness of this approach is that it has not allowed yet to show which of these associations have meaning, and particularly a causal meaning. Observational data are sensitive to many biases, such as selection, confounding, latent unobserved variables and lack of generalizability. To overcome these problems a new reasoning framework was needed, providing an iterative process of interpreting what we know and what we need to know. This management, synthesis and translation of data to knowledge need methods and algorithms that can use observational data to extract some information on causal relationships that can be thereafter confi rmed by experimental approaches [3,4].
A outlined by Pearl [2], there are two mental barriers in causal inference that have slowed the development of an appropriate reasoning framework for this purpose; the fi rst is that causality cannot be tested from observational data alone and the second is that it cannot be inferred based on classical probability calculus (and the statistical approaches based on it). To overcome these barriers, he developed the do-calculus framework for the identifi cation of causal effects in nonparametric models.
Using this framework, Maathuis et al. [3], in a seminal paper showed that when making some assumptions on the distribution of the data, one can estimate bounds on causal effects of variables on a target variable using graphical models.
In this short review paper I present the rationale behind this method and give examples of its application for inferring causality in gene expression data. I then discuss its possible application to microbiome data.

Theoretical Issues
What is meant by causality in biology? Mayr [5], gave the following pragmatic defi nition of cause in functional biology: "a member of a set of jointly suffi cient reasons without which the event would not happen". A typical causal question in genomics is whether a given gene acts causally on the expression of another gene or whether a given gene mutation or a genotype is a cause of a phenotype. One way to check this causal relationship is to make the gene 'disappear' (this can be done by what is called knock-out experiments) or fi x it to some genotype and see how the phenotype changes. However, since most of these experiments cannot be done on humans and are very heavy in models organisms, establishment of causality in biology is a big challenge. Even if these experiments could be done, causality is always diffi cult to demonstrate due to the high complexity of living systems and as Mayr [5] noted "in view of the high number of multiple pathways possible for most biological processes and in view of randomness of many of the biological processes particularly on the molecular level, causality in biological systems is not predictive or at best is only statistically predictive".

What is the difference between observational and experimental data?
We call observational data , data that are gathered by observing/measuring variables of a biological system (tissues, cells, bacteria..) under some conditions, for example expression or methylation levels of genes, genotypes for a set of markers in diseased and control individuals.
Experimental data, also called interventional data, arise when the researchers apply some treatments to a biological system in a randomized controlled experiment and look to the effect of such treatments on a response variable.

Why estimating causal effect from observations is very useful?
In genomics studies, experimental data are sometimes infeasible, unethical and often time consuming and expensive while observational data are cheaper, more abundant and available. Thus developing methods that can answer causal questions, at least partially, about biological systems from observational data only is very useful. However, moving from the observational world to the experimental world, i.e. from pre-interventional to post-interventional distribution of data, is impossible without some assumptions.
One of these assumptions is to consider that data were generated from a causal structure that can be modeled by a directed acyclic graph (DAG) either it is known or to be learned from the data. In the following paragraphs, I gave brief presentation of the methods that can be used to achieve this task.

What is DAG?
A graph is a set of nodes (variables) connected by a set of, directed or indirected, edges; in expression data nodes are genes whose levels of expression are measured while edges might mean regulation relationships (regulation means that a gene inhibits, stimulate or moderate the expression of another gene). A graph is called complete if every pair of nodes is connected by an edge and we call the parents of any node X (denoted Pa(X)) all those nodes that are connected to X with a directed edge pointing to X. A Directed Acyclic Graph (DAG) is a graph where there exist no path (a set of edges that go from one node to another) linking any node X to any node Y that can go from X to Y by a directed set of edges and then back from Y to X by another set of directed edges.

What is a structural equation model (SEM) and what are its properties?
A SEM is a model that describes how any variable X (observed) is generated from the set of all or part of the other variables i.e. how changes in these variables lead to a change in X, added to some random noise. The SEM can be represented as a DAG, where the edges are drawn such that a set of variables other than X is the parent set of X. This graph is called the causal graph since an edge that goes from Y to X means that Y is a direct cause of X. Assuming that the causal structure is a DAG means that we do not allow feedback loops or unmeasured confounding variables. We also assume that the errors are jointly independent.
Given these assumptions one can show [2] that the joint density function of all nodes f (x 1 ,x 2 ,.. x p ) can be factorized as the product over all nodes of the conditional densities of each node X i given its parents : f i (x i /Pa (x i )).

How the SEM framework allows to move from observational to experimental worlds?
Given the previous, an intervention at some variable X i is simply equivalent to changing the generating mechanism of X i that is to change the corresponding structural equation relative to X i without changing the others. More clearly, setting the variable X i to some value x' i (expressed as do (X i =x' i ) in the language of the do-calculus is interpreted as an outside intervention [2]. For example if X i and X j denote the expression level of genes i and j then the expectation of X j given that X i is fi xed to some value x' i (E(X j /do(Xi=x' i )) represents the average expression level of gene j after setting the expression level of gene i to the value x' i by an outside intervention.
So a do-intervention on X i means that X i no longer depends on its former parents in the DAG, so that incoming edges into X i can be removed and we get what is called a truncated DAG, leading to a truncated factorization of the post-intervention distribution: F (x 1 ,.. x p /do(X i =x' i ))=∏ j≠i f j (x j /pa(x j )) if x i =x' i and 0 otherwise How can we defi ne a causal effect?
If we have discrete variables, like an expression status of a gene (over expressed versus under expressed), we can defi ne the causal effect of gene i on gene j by P (X j = 1/do (X i =1)) -P (X j = 1/do (X i =0)). If the variables are measured on a quantitative scale then the total causal effect of X i on X j can be defi ned as: Based on the factorization formula and assuming that the joint distribution of all variables is Gaussian means that E (X j /x' i ,pa(X i )) is linear in x' i and pa(X i ) and thus the causal effect of X i on X j is the regression coeffi cient of X i in the linear regression of X j on X i and pa(X i ) when X j is not a parent of X i and 0 otherwise. Note that the causal effect computed by the formula above does not depend on x' i and can be interpreted (as any linear regression coeffi cient) as the average increase of the variable X j when X i increases by one unit: E (X j /do (X i = x' i +1)) -P (X j /do (X i = x' i )).

How can we compute practically the causal effect when the causal DAG is not known?
In the previous developments we assumed that the DAG was known. However in most applications this is not the case, so that we need to learn the DAG from the observational data. This can be done by several structure learning methods [6,7]. Since the causal DAG cannot be uniquely determined, what we can learn is in fact what is called a completed partially DAG or a CPDAG. A CPDAG is in fact a single representation of a family of many possible and equally likely (we call these Markov equivalent) DAGs; if an edge between two nodes is always directed in the same direction in all DAGs then it is directed in the CPDAG, and if not it is undirected in the CPDAG. One of the most used and fast algorithms to fi nd the CPDAG is the PC algorithm [8].
Given a CPDAG one can perform simple linear regression analysis of each response node of interest over any other node and its parents to estimate its causal effect on the response node for all DAGs contained in the CPDAG. This gives a set of causal effect estimates, so by taking the minimal absolute value of those estimates we obtain a consistent estimate of the lower bound of the causal effect. This approach, developed by Maathuis et al. [3], was named IDA (Intervention-calculus when de DAG is absent).
Thus IDA proceeds in four step to estimate the causal effect of every node X on a response node of interest, Y: (1) First estimate the CPDAG of the underlying causal DAG using the PC algorithm [8]. Since step (2) is computationally intensive and that the estimation of the effect of X on Y only needs to know the parents of X in each DAG, we can only extract this information quite easily from the CPDAG.
The IDA method assumes, besides the two previously stated assumptions (no feedback loops in the causal model and that the variables have a joint multivariate Gaussian distribution), that there is no hidden variables.

Is there any computer program to use IDA?
IDA has been implemented in an R package named pcalg [9].

Can we use IDA to estimate joint causal effects of many variables at a time?
A generalization of the IDA method to estimate the effect of multiple simultaneous interventions (such as knocking out many genes in expression analyses) has been developed and named JointIDA [10]. There are also some other methods based on Gaussian Bayesian networks that can jointly estimate causal effect from observational and interventional data [11].

Is there any extensions to IDA?
Stekhoven et al. [12], presented an extension of IDA that they called Causal Stability Ranking (CStaR), where they estimate the stability of estimates of causal effects using resampling from the data and ranking of the variables according to their effects. They showed that their method improves the true positive rate as compared to IDA and largely outperforms classical high-dimensional regression methods. Teramoto et al. [13], proposed an extension of IDA to deal with non-Gaussian distribution of the data that they call the non-paranormal method (NPN-IDA). In this method a non-paranormal correlation coeffi cient is used within the PC algorithm (instead of the Pearson correlation coeffi cient) to learn the CPDAG. They showed, based on application to real data, that this improves learning of the DAG, thus leading to more accurate estimation of causal effects.
Several other extensions of the IDA method have been proposed to deal with the presence of hidden variables and feedback loops using appropriate algorithms for the learning of the causal graphical [14][15][16]. Some of these algorithms are available in the pcalg R package. There are also extensions of IDA to deal with time series data [17], heteregeous data, like when one is using a mix of observational and experimental data sets [18], different datasets with overlapping sets of variables [19] or a combination of both [20].

Gene expression
In their seminal paper Maathuis et al. [3], gave an application of IDA on evaluating the causal effect of the expression level of 4088 genes (continuous variables) on ribofl avin production by Bacillus subtilis. Since the number of observations (71 strains) is very small compared to the number of variables (what is called small n, big p problem), the basic IDA method for causal DAG learning cannot be applied so a local variant of IDA was proposed [3], based on many bootstrapping rounds from the original dataset. They showed that within the top ten most causal genes identifi ed by IDA only one was identifi ed by standard regression techniques in high-dimensional problems, like Lasso [21], which are not able to estimate causal effect but only statistical association effects. In their second paper [4] applied IDA to the classical dataset of Hughes et al.

Microbiome data
Recent studies have suggested the role of some bacterial species in the gut microbiota in colon cancer. Two recent studies have reported a highly signifi cant association between the rate of some bacterial species (evaluated by 16sRNA next generation sequencing) and colon cancer risk [27,28]. In a recent work (submitted), we used IDA to calculate causal effects of the bacterial species rate on cancer risk. Since, the response variable here is the patients status (cancer versus control), which is a binary variable, we used linear regression as an approximation considering the response as a dummy variable. Another solution (currently under consideration) is to use binary logistic regression (a generalized linear regression model) but we need to demonstrate that, in this case, the good properties of IDA estimate still hold.
We found that Fusobacterium nucleatum species have the highest causal effect on cancer risk, which is in good agreement with several reports that described the possible mechanisms by which these species can enhance cancer. However, differently from gene expression studies where knockout experiments can allow to validate causal effects, experiments with microbiota, even in model organisms, are very diffi cult since the microbiota itself is an extremely complex and dynamic system and its enrichment with some specifi c species might not be possible to validate their effect on some response variable.

Conclusion
Estimating causal effects is a very important issue in modern genomics since it will allow us to move from the current paradigm of associational studies to a new paradigm, where we can predict with high accuracy and confi dence, The IDA method (or family of methods) , based on the docalculus and causal graphs, offer a fi rst step on this long road that will allow us to infer, with high precision and signifi cant, biological meaning, causal effects from observational data. Further efforts are needed to develop new conceptual frameworks and effi cient algorithms for causality inference from genomics data because without this the clinical translation of the fi ndings will still remain far from reach.