Epigenetics refers to external modifications to DNA that do not change the DNA sequence but can control gene expression, that is, whether genes are switched “on” or “off”. Epigenetic alterations can be influenced by factors such as age, environment, and lifestyle, and aberrant modifications can contribute to diseases such as cancer and neurodevelopmental disorders.
DNA methylation is a key mechanism of epigenetic regulation, in which a methyl group is added to cytosine (C) or adenine (A) nucleotides in the DNA molecule; in humans, methylation occurs most commonly at CpG dinucleotides, where a cytosine is immediately followed by a guanine. Histone modification is another key mechanism: it consists of chemical alterations (such as methylation or acetylation) of histones, the core proteins of nucleosomes around which DNA is wrapped.
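To make the notion of a CpG dinucleotide concrete, the following minimal sketch locates every position in a DNA string where a C is immediately followed by a G. The function name `cpg_sites` and the example sequence are illustrative assumptions, not part of any tool mentioned here.

```python
def cpg_sites(sequence: str) -> list[int]:
    """Return 0-based positions of each C that is immediately followed by a G."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i] == "C" and seq[i + 1] == "G"]


# Made-up example sequence, for illustration only.
example = "ATCGCGTTACGGCA"
print(cpg_sites(example))  # [2, 4, 9]
```

Each reported position is a candidate methylation site on the given strand; whether it is actually methylated has to be measured, for example with a methylation array.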
Many bioinformatics tools have been developed to assess the role of epigenetic regulation in gene expression, drawing on high-throughput methods such as methylation arrays, ChIP-sequencing, gene expression microarrays, and RNA-sequencing. However, quantitative models based on epigenetic information are still needed to accurately predict up- or down-regulation of gene expression.
In this study, a new machine learning-based model was developed to predict changes in gene expression resulting from epigenetic modification in lung cancer. The model analyzed a large data set covering histone modification, CpG methylation, and genomic information, allowing accurate prediction of differential RNA expression in lung cancers. The team used publicly available data from The Cancer Genome Atlas (TCGA) project (Illumina Infinium HumanMethylation450K BeadChip CpG methylation array data from paired lung cancer and adjacent normal tissues) and the ENCODE project (ChIP-Seq data for histone modification markers). A comprehensive set of 1,424 features was analyzed, including nucleotide composition and conservation, histone H3 methylation modifications, and CpG methylation.
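As an illustration of what a "nucleotide composition" feature group could look like, the sketch below turns a DNA sequence into single-base and dinucleotide frequencies. This is an assumption made for clarity: the study's actual feature definitions are not given here, and the function name `composition_features` and the feature names are hypothetical.

```python
from collections import Counter
from itertools import product


def composition_features(sequence: str) -> dict[str, float]:
    """Compute simple nucleotide-composition features for one DNA sequence.

    Hypothetical feature set: 4 base frequencies, GC content, and
    16 dinucleotide frequencies (overlapping pairs).
    """
    seq = sequence.upper()
    n = len(seq)
    features: dict[str, float] = {}

    # Single-base frequencies.
    base_counts = Counter(seq)
    for base in "ACGT":
        features[f"freq_{base}"] = base_counts[base] / n if n else 0.0
    features["gc_content"] = features["freq_G"] + features["freq_C"]

    # Overlapping dinucleotide frequencies (includes the CpG pair "CG").
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = max(n - 1, 0)
    for a, b in product("ACGT", repeat=2):
        key = f"freq_{a}{b}"
        features[key] = pair_counts[a + b] / total_pairs if total_pairs else 0.0

    return features


features = composition_features("ATCGCGTTACGGCA")
print(len(features), round(features["gc_content"], 3))
```

In a model of this kind, such per-gene vectors would be combined with the other feature groups (conservation, histone marks, CpG methylation) before being passed to a learning algorithm.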