Researchers at Johns Hopkins University and the University of Hawaii have created a model, based on epigenomics data, that can predict differential gene expression in lung cancer. The study was recently published in the journal BMC Bioinformatics and is entitled “Using epigenomics data to predict gene expression in lung cancer.”
Epigenetics refers to modifications to DNA that do not change the DNA sequence but can control gene expression, that is, whether genes are switched “on” or “off.” Epigenetic alterations can be influenced by factors such as age, environment and lifestyle, and aberrant modifications can contribute to diseases such as cancer and neurodevelopmental disorders.
DNA methylation is a key mechanism of epigenetic regulation in which a methyl group is added to a cytosine (C) or adenine (A) nucleotide in the DNA molecule; in humans, methylation occurs most commonly at CpG dinucleotides, where a cytosine is immediately followed by a guanine. Histone modification is another key mechanism and refers to chemical alterations (such as methylation or acetylation) of histones, the proteins that form the core of nucleosomes, around which DNA is wrapped.
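To make the idea of a CpG dinucleotide concrete, the short Python sketch below counts CpG sites in a DNA string and computes two common sequence statistics, GC content and the observed/expected CpG ratio. It is an illustration only: the function name `cpg_stats` and the example sequence are made up here, and nothing in the article indicates that the study computed its nucleotide-composition features this way.

```python
def cpg_stats(sequence: str) -> dict:
    """Count CpG dinucleotides and summarize nucleotide composition.

    Hypothetical helper for illustration; not taken from the study.
    """
    seq = sequence.upper()
    length = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact number
    # of positions where a C is immediately followed by a G.
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / length if length else 0.0
    # Expected CpG count if C and G occurred independently: (C * G) / length.
    expected = (c_count * g_count) / length if length else 0.0
    obs_exp_ratio = cpg_count / expected if expected else 0.0
    return {
        "cpg_count": cpg_count,
        "gc_content": gc_content,
        "obs_exp_ratio": obs_exp_ratio,
    }


# Toy sequence chosen only for this example.
stats = cpg_stats("ACGTCGCG")
print(stats)
```

A ratio well above what is typical for the surrounding genome is one conventional sign of a CpG-rich region, which is often found near gene promoters.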
Many tools have been developed to assess the role of epigenetic regulation in gene expression, notably high-throughput methods such as methylation arrays, ChIP-sequencing, gene expression microarrays and RNA-sequencing. However, quantitative models based on epigenetic information are still needed to accurately predict the up- or down-regulation of gene expression.
In this study, the team developed a new machine-learning-based model that predicts gene expression as a consequence of epigenetic modification in lung cancer. The model analyzed a large set of data on histone modification, CpG methylation and genomic information, allowing the accurate prediction of differential RNA expression in lung cancers. The team used publicly available data from The Cancer Genome Atlas (TCGA) Project (Illumina Infinium HumanMethylation450K BeadChip CpG methylation array data from paired lung cancer and adjacent normal tissues) and the ENCODE project (ChIP-Seq data for histone modification markers). A comprehensive list of 1,424 features was analyzed, including nucleotide composition and conservation, histone H3 methylation modification and CpG methylation.
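The general idea of predicting up- or down-regulation from epigenomic features can be sketched as a small binary classifier. The example below is a toy under loud assumptions: the article does not say which learning algorithm the team used, so plain logistic regression stands in for it; the three feature names are hypothetical; and the data are synthetic, generated so that higher promoter methylation makes down-regulation more likely. None of it reproduces the study's data or results.

```python
import math
import random

# Hypothetical feature names, invented for this sketch.
FEATURES = ["promoter_cpg_methylation_delta", "h3k4me3_signal", "gc_content"]


def make_synthetic_genes(n, rng):
    """Generate toy genes; label 1 = up-regulated, 0 = down-regulated."""
    rows, labels = [], []
    for _ in range(n):
        meth = rng.uniform(-1.0, 1.0)     # tumor minus normal methylation
        h3k4me3 = rng.uniform(-1.0, 1.0)  # histone mark signal
        gc = rng.uniform(-1.0, 1.0)       # uninformative in this toy setup
        score = -2.0 * meth + 1.5 * h3k4me3 + rng.gauss(0.0, 0.3)
        rows.append([meth, h3k4me3, gc])
        labels.append(1 if score > 0 else 0)
    return rows, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit logistic regression with batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (up-regulated) or 0 (down-regulated) for one gene."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(z) >= 0.5 else 0


rng = random.Random(0)  # fixed seed so the toy run is repeatable
train_x, train_y = make_synthetic_genes(400, rng)
test_x, test_y = make_synthetic_genes(200, rng)
weights, bias = train_logistic(train_x, train_y)
hits = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y))
accuracy = hits / len(test_y)

for name, w in zip(FEATURES, weights):
    print(f"{name}: weight {w:+.2f}")
print(f"held-out accuracy: {accuracy:.2f}")
```

In this toy setup the learned weight for the methylation feature comes out negative and the weight for the histone mark comes out positive, mirroring how the synthetic labels were generated. A real analysis would involve far more features, careful feature selection and proper cross-validation.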
The research team found that the best model included 67 features covering all four types of data (CpG methylation, histone H3 modification, nucleotide sequence and conservation). Histone H3 methylation modification (32 features) and CpG methylation (15 features) were the most frequent, together accounting for 47 of the 67 features, which suggests their importance for predicting gene expression. The promoter regions of genes (sequences that act as switches controlling whether genes are turned on or off) were also found to be major contributors to the accurate prediction of gene expression.
The researchers concluded that, using a broad list of genomic and epigenomic features, they had created an accurate model to predict differential RNA expression in lung cancer, in which CpG methylation features are the most important for the prediction. The team developed the model using data on lung cancer, the leading cause of cancer death worldwide in both men and women. They suggest that future studies should investigate whether general epigenetic predictors of differential gene expression can also be found in other types of cancer.