Application of biclustering of gene expression data and gene set enrichment analysis methods to identify potentially disease causing nanomaterials

Background: The presence of diverse types of nanomaterials (NMs) in commerce is growing at an exponential pace. As a result, human exposure to these materials in the environment is inevitable, necessitating the need for rapid and reliable toxicity testing methods to accurately assess the potential hazards associated with NMs. In this study, we applied biclustering and gene set enrichment analysis methods to derive essential features of altered lung transcriptome following exposure to NMs that are associated with lung-specific diseases. Several datasets from public microarray repositories describing pulmonary diseases in mouse models following exposure to a variety of substances were examined and functionally related biclusters of genes showing similar expression profiles were identified. The identified biclusters were then used to conduct a gene set enrichment analysis on pulmonary gene expression profiles derived from mice exposed to nano-titanium dioxide (nano-TiO2), carbon black (CB) or carbon nanotubes (CNTs) to determine the disease significance of these data-driven gene sets. Results: Biclusters representing inflammation (chemokine activity), DNA binding, cell cycle, apoptosis, reactive oxygen species (ROS) and fibrosis processes were identified. All of the NM studies were significant with respect to the bicluster related to chemokine activity (DAVID; FDR p-value = 0.032). The bicluster related to pulmonary fibrosis was enriched in studies where toxicity induced by CNT and CB studies was investigated, suggesting the potential for these materials to induce lung fibrosis. The pro-fibrogenic potential of CNTs is well established. Although CB has not been shown to induce fibrosis, it induces stronger inflammatory, oxidative stress and DNA damage responses than nano-TiO2 particles. Conclusion: The results of the analysis correctly identified all NMs to be inflammogenic and only CB and CNTs as potentially fibrogenic. In addition to identifying several previously defined, functionally relevant gene sets, the present study also identified two novel genes sets: a gene set associated with pulmonary fibrosis and a gene set associated with ROS, underlining the advantage of using a data-driven approach to identify novel, functionally related gene sets. The results can be used in future gene set enrichment analysis studies involving NMs or as features for clustering and classifying NMs of diverse properties.

two novel genes sets: a gene set associated with pulmonary fibrosis and a gene set associated with ROS, underlining the advantage of using a data-driven approach to identify novel, functionally related gene sets. The results can be used in future gene set enrichment analysis studies involving NMs or as features for clustering and classifying NMs of diverse properties.
Conventional molecular approaches for the study of organismal response to toxicant exposures or diseases involve the study of one gene or a few genes at a time, whereas biological response is driven by a group of genes. Thus, when normal function of a specific biological process is perturbed, alterations and enrichment in the expression of a subset of co-functioning genes associated with that biological process are observed. Toxicogenomic tools such as gene expression profiling have become a widely used strategy for investigating the genome-wide changes relating to molecular mechanisms underlying many complex responses and diseases. The fact that genes interact with each other and are expressed in functionally relevant patterns implies that gene-expression data can be grouped into functionally meaningful gene sets across a subset of conditions [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. The analysis of such predefined gene sets is a powerful alternative to individual gene analysis [13]. However, derivation of meaningful and relevant gene sets from the thousands of genes showing expression changes following exposure to toxicants is challenging.
Gene set data analysis, a computational technique which determines if a predefined set of genes exhibit statistically significant differential expression between two or more experimental conditions (time, dose, tissue, etc.), relies on the knowledge of annotated pathways relevant to the underlying physiology or biology being investigated. A survey conducted by Huang et al. [33] identified 68 different gene set enrichment tools. These methods are applied to manually and computationally curated [29] gene sets to identify enriched functional groupings of genes. These gene set enrichment tools include DAVID [21,22], EASE [34], GoMiner [35], MAPPFinder [36], Onto-express [37] and others, which consist of controlled descriptions of gene functions that are frequently used to define gene sets. Other tools, such as pathway databases including Gene Ontology [38], KEGG [39], BioCyc [40], TfactS [41], CTD [42], and BioCarta (http://www.biocarta.com), have also been applied in gene set analysis. Despite the number of tools available, the effective identification of functional groups of genes relevant to the underlying physiology across several conditions still remains a challenge. As a result, these tools continue to be refined and improved.
Nanomaterials (NMs) are materials manufactured on the nanoscale (1-100 nm) and are the building blocks of nanotechnology. On the nanoscale, materials exhibit unique size-associated properties (optical, magnetic, mechanical, thermodynamic, electrical, etc.), which are harnessed for use in various commercial applications [43]. Current applications of NMs include therapeutic applications (e.g., nanomedicine, drug delivery, diagnostics), agriculture, manufacturing, electronics, cosmetics, textiles, and environmental remediation and protection. Although NMs are synthesized from their corresponding, known, bulk chemical substances, owing to their distinct sizeassociated properties, their biological or toxicological behavior are often different from their analogous bulk compound. Because of their smaller size and large surface area, NMs are known to have increased ability to interact with cellular membranes, they can easily cross cellular barriers and penetrate deeper regions of tissue (such as the highly vascularized alveolar regions of lungs), and they exhibit increased toxicological activity as compared to the corresponding bulk material or comparatively large particles [43]. A variety of conventional toxicology tools have been assessed using both in vitro and in vivo models for their suitability and applicability for toxicity testing of NMs. However, these tools are single-endpoint-based or targeted in nature, investigate only one type of response at a time, and lack detailed mechanistic information [44]. Given the rate at which nanotechnology is growing, and the limitations of the currently available toxicological testing tools, it is estimated that it will take several decades and millions of dollars to complete the assessment of NMs of various sizes, shapes and surface coatings that require immediate assessment [45]. Therefore, more efficient toxicity testing and prediction tools are needed to provide a comprehensive overview of the biological activities of NMs to rapidly screen the toxicological potential of NMs.
Over the last few years, genome-wide expression analysis tools have been used as an alternative approach to comprehensively investigate the toxicological response induced by various classes of NMs and to identify the properties of NMs that are responsible for eliciting adverse effects. We have previously used transcriptomics profiling tools to investigate the underlying mechanisms of toxicity induced by nanoparticles of titanium dioxide (nano-TiO 2 ) [46][47][48] and carbon nanotubes (CNTs) [49,50] of various sizes and properties. This work identified the properties of nano-TiO 2 that influence their inflammogenic potential [51]. These studies have generated a large repository of gene expression data that reflect the diversity of NM-induced biological response across a variety of experimental conditions. However, the challenge lies in the effective use of these data to discern individual or networks of genes conferring adverse outcomes of regulatory importance or disease phenotypes.
In the present study, we used a meta-analysis approach like that described by Turcan et al. [20] to identify functionally related biclusters of genes showing similar expression profiles, derived from publicly available gene expression data sets describing specific lung diseases (Table 1). One advantage of biclustering is that genes in the same cluster do not have to behave similarly over all experimental conditions. Unlike classical clustering techniques, biclusters can overlap with each other. This is ideal for mining functionally related gene sets as genes can be asso-ciated with more than one biological process. Several studies [3,[52][53][54][55] have shown that biclustering is a useful methodology to uncover processes that are active only over some but not all experimental conditions [56].
In this study, experiments investigating lung diseases (including lung inflammation, emphysema, chronic obstructive pulmonary disease (COPD) or lung cancer) in mice using the whole genome gene expression tools were obtained from GEO. For each study, raw data were downloaded from GEO and normalized as described in the methods below. Biological replicates for each of the experimental conditions were averaged. All studies were merged together and biclustering was employed. Through this analysis, ten biclusters representing ten functional gene sets were identified. Using DAVID [21,22], the biological functions associated with these biclusters were identified. Next, we applied these candidate gene sets/biclusters to nine, publically available, toxicogenomic gene expression studies ( Table 1, published studies from our laboratory) to examine the toxicity induced by a variety of NMs (nano-TiO 2 , CB and CNTs) to determine the disease significance of the altered gene expression profiles following exposure to NMs. The analysis was restricted to lung disease models since pulmonary response following NM exposure is well characterized.

Results and Discussion
Identification of biclusters of genes from lung disease models To develop a data-driven view of the mouse lung response following exposure to NMs, publicly available genomic data from GEO that describe characteristic features of select lung diseases were leveraged. Eleven studies encompassing 52 experimental conditions with 8752 common gene symbols were assembled and specific gene sets were extracted using the repeated Bimax [69] biclustering method. A total of ten distinct biclusters were identified. The results of the biclustering are visually summarized in Figure 1.   The eighth bicluster consisted of models for bacterial infection, Th2 response (GSE4231), asthma (GSE6858) and pulmonary fibrosis (GSE25640) with sixteen gene symbols (C1qb, Ch25h, Clec4a2, Ctss, F7, Fcgr2b, Itgam, Itgb2, Lgmn, Lpxn, Ly86, S100a4, Serpina3g, Serpina3n, Slc7a2, and Tbxas1). These gene symbols resulted in three significant GOs: response to wounding (FDR p-value = 0.0037), defense response (FDR p-value = 0.0063) and inflammatory response (FDR p-value = 0.0045).
The tenth bicluster resulting from the analysis of the genes that were 2-fold down-regulated consisted of lung inflammation and disease models such as the bacterial infection model, a model for human small cell lung carcinoma, the study on spontaneous lung tumor, an asthma model and pulmonary fibrosis. This bicluster consisted of seventeen gene symbols (Aldh3a1, Bmp6, Cyp1a1, Cyp4b1, Eng, Fmo1, Fmo2, Gpr155, Igfbp6, Mapt, Ndrg2, Omd, Pcolce2, Pgam2, Scube2, Slc7a10, and Tnxb). These genes were associated with a variety of functions including fatty acid metabolism; however, DAVID functional annotation analysis of these gene symbols resulted in no statistically significant results to known annotated gene sets. However, several of these genes are associated with reactive oxygen species (ROS), which may not be a well-established gene set.

Application of biclusters to classify NM-induced lung response
Next, gene set enrichment analysis (GSEA) [29] using the bicluster-method-derived genes sets was conducted on the nine publically available studies [47][48][49][50][51]68] that examined NM-induced pulmonary toxicity. These results are presented in Figure 2. Bicluster-3 (genes associated with chemokine activity reflecting pulmonary inflammation) was enriched for most of the NMs. These results are in alignment with other studies in the literature that have shown pulmonary inflammation to be the predominant response following exposure to a variety of NMs. Bicluster-7 was the other significant cluster that was enriched in most of the experiments related to CNTs and CB. This cluster consisted of gene symbols showing strong association with pulmonary fibrosis. CNTs are well known to induce pulmonary fibrosis [50]. Although exposure to CB was not shown to cause lung fibrosis at the tested doses [48], studies have shown that CB exposure enhances bleomycin-induced lung fibrosis [70]. These results suggest that both carbon-based NMs may perturb similar biological processes and functions and factors in addition to the altered expression of a few genes in the gene set may contribute to the initiation of lung fibrosis.

Conclusion
In this study, we examined the applicability of a data-driven approach to identify gene sets from the comprehensive gene expression data using a biclustering method. The results showed that the lung response to NM exposure predominantly reflects responses observed following bacterial infections and bleomycin injury models that involve acute inflammation. The combined biclustering and gene set enrichment analysis also identified CNT and CB as potentially fibrogenic NMs.
Although several genes sets associated with acute DNA binding, cell cycle, apoptosis, and ROS response that were specific to different disease models were also observed to be perturbed following exposure to NMs, the implication of such perturbation was not clear from this analysis. In addition the identification of several previously defined, functionally relevant gene sets, the present study also identified two novel genes sets: Bicluster-7 (consisting of genes associated with pulmonary fibrosis) and Bicluster-10 (consisting of genes associated with ROS), underlining the advantage of using a data-driven approach to identify novel, functionally related gene sets. The results can be used in future gene set enrichment analysis studies involving NMs or as features for clustering and classifying NMs of diverse properties. While powerful, data-driven meta-analysis approaches have several limitations. One important limitation is that the analysis is conditional on the subset of studies selected from the public data repositories such as GEO and EBI. Also, the experiments available in these repositories may not be representative of the population. For example, there are other mouse models of lung diseases that were not included in the present study due to lack of publicly available data or failure to meet the criteria set by the present study (time points, mouse strain, microarray platforms used).
The analysis is also limited to the gene symbols that were consistently investigated across the various microarray plat-forms from the different studies included in the analyses. Furthermore, the bicluster analysis is conditional to the two-fold change cut-off employed to create the binary matrix for the Bimax algorithm and the choice of the Bimax parameters.
Modifying the fold cut-off to 1.75-and 1.5-fold, an additional 28 (23 up and 5 down) and 100 (89 up and 11 down) biclusters were identified. However, the interpretations were derived from the 2-fold cut-off as it provides the most conservative approach. The biclusters were stable when varying the minimum number of rows and when varying the minimum number of columns. Here, additional clusters were identified when these parameters were reduced and clusters were eliminated when these parameters were increased. Changes to any of the above could impact the final results and therefore the interpretation of the data.

Experimental
Lung disease models The data were obtained from the GEO. The accession numbers for the studies [11,[57][58][59][60][61][62][63][64][65][66][67] used in the exploration of novel gene sets are presented in Table 1. These data sets cover a variety of lung diseases and lung injury outcomes, including different lung inflammation models, emphysema, chronic obstructive pulmonary disease and experiments studying lung cancer and lung tumors. Several different microarray platforms including the Illumina expression beadchip were used in these studies. The analysis was restricted to lung disease models since pulmonary responses following exposure to NMs are well characterized.

Data processing and normalization
The log 2 transformation was applied to all signal intensity measurements. For the two color microarray studies, the LOWESS normalization method [71] using the R statistical software environment [72] was applied. For studies using the Affymetrix GeneChips ® , the RMA normalization was applied using the justRMA function in the affy [73] R package. Quantile normalization was applied for studies that utilized the Illumina beadchip. This was done using the lumiN function in the lumi [74] R package.
Probes with technical replicates were then averaged using the median. The data for each study was then merged to its appropriate annotation file to obtain the gene symbol. Probes with the same gene symbol were then averaged using the median. The experimental conditions with biological replicates were averaged using the median. The median was used as it is a robust estimate of the central tendency.
For each experimental condition, the data was further normalized by centering to the matched control. The control samples were then removed from the data set. The remaining data is presented relative to the control, equivalently the log 2 of the fold change (estimated using medians) for all the studies. The data were then merged across studies using the gene symbol. The mining the log 2 of the fold changes was done in an attempt to minimize the cross-platform differences. However, platform differences may exist through compression of the fold-change values [75].

Biclustering
The biclustering data analysis was conducted in R using the biclust [69] package. The repeated Bimax [56] method was selected for this analysis. Bimax uses a simple data model that assumes two possible states for each expression level, no change and change with respect to a control experiment. For this analysis, two binary matrices were constructed: one matrix, consisting of zeros and ones, where the ones indicated genes that were 2-fold up-regulated and a second matrix, where the ones identify genes that were 2-fold down-regulated.
The option for the minimum number of rows for the Bimax method was set at 15. The minimum number of columns (which represent the experimental conditions) was set as 5 and the maximum number of columns was set as 15. This resulted in 8 biclusters from the binary matrix representing the up-regulated genes and 2 biclusters were identified for the matrix representing the down-regulated genes.

NM-induced lung response data sets
The data sets examining differential gene expression in mouse lung exposed to CB, nano-TiO 2 or CNTs were compiled from GEO. Since this is a proof-of-concept study, the investigation was limited to those NMs for which lung toxicological response is well characterized. Also, the genomics datasets with multiple doses and post-exposure time points were considered in the analysis. The GEO accession numbers for these studies are presented in Table 1. These studies utilized the two color Agilent microarray reference design [76]. The data were LOWESS normalized and probes with technical replicates were averaged. The annotation file containing the gene symbol was merged with the expression data and probes with multiple gene symbols were averaged using the median expression.

Gene set enrichment
As the NM-induced lung response data sets contained multiple doses, the test statistic from the Attract [19] approach was used. Using this method, the overall F-statistic for the dose effect was estimated for each gene. The F-statistics were then log 2 -transformed. A two sample t-test (assuming unequal variances) was then conducted, comparing the mean of the log 2 F-statistics within the bicluster to the mean of the log 2 F-statistics for all genes. The observed t-statistics and p-values are reported in Figure 2.