In genetics, enhancers are short non-coding sequences of DNA belonging to a class of elements known as cis-regulatory elements. These regions, typically located in intergenic DNA, contain binding sites for proteins (transcription factors) that regulate gene expression. Changes to the DNA sequence of these enhancers are thought to provide a mechanism for changes in gene expression across species, even though the genes themselves may display some sequence conservation.
The challenge in identifying these enhancers is that, unlike protein-coding genes, they do not follow a clear genetic code. Sequence conservation across species is often used as a proxy to identify regulatory regions, with the rationale that functional elements are under evolutionary pressure to maintain biological function and are hence less subject to mutation. However, there is emerging evidence of functional non-coding sequences that conserve biological function through their combination of transcription factor binding sites but show little to no conservation at the sequence level when assessed with traditional alignment tools. Hence, our ultimate goal is to identify conserved non-coding sequences by methods other than “traditional” sequence alignment.
In this work, we obtain an alignment between the human (hg38) and mouse (mm9) genome assemblies around the gene tbx5 and perform a cursory statistical analysis to identify and classify non-coding functional regions. The model is developed in a Bayesian framework, and we obtain a segmentation using the segmentation-classification algorithm changept. The algorithm samples segmentations using a Markov chain Monte Carlo method that generalises the Gibbs sampler. We aim to incorporate a diverse set of data types into the model to increase its accuracy and applicability.
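
To make the idea of sampling segmentations concrete, the following is a minimal sketch of a plain Gibbs sampler over change-point indicators. It is not changept itself, which uses a generalised Gibbs sampler and also classifies segments. The sketch assumes, purely for illustration, that the alignment has been reduced to a binary sequence (1 for a matching column, 0 for a mismatch or gap), that each segment has its own match probability with a Beta(1, 1) prior, and that each position is a boundary with prior probability `p_change`. The names `log_marginal`, `gibbs_changepoints` and `p_change` are hypothetical.

```python
import math
import random


def log_marginal(matches, length, a=1.0, b=1.0):
    """Log marginal likelihood of one segment under a Beta(a, b)-Bernoulli model."""
    mismatches = length - matches
    return (math.lgamma(a + matches) + math.lgamma(b + mismatches)
            - math.lgamma(a + b + length)
            - math.lgamma(a) - math.lgamma(b) + math.lgamma(a + b))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gibbs_changepoints(seq, n_sweeps=300, burn_in=100, p_change=0.01, seed=0):
    """Estimate the posterior probability of a boundary before each position.

    Entry k of the result refers to a boundary between seq[k-1] and seq[k].
    """
    rng = random.Random(seed)
    n = len(seq)
    # Prefix sums give the number of matches in seq[i:j] in constant time.
    prefix = [0]
    for x in seq:
        prefix.append(prefix[-1] + x)

    def seg(i, j):
        return log_marginal(prefix[j] - prefix[i], j - i)

    # cp[0] and cp[n] are fixed sentinels marking the ends of the sequence.
    cp = [False] * (n + 1)
    cp[0] = cp[n] = True
    counts = [0] * (n + 1)
    prior_log_odds = math.log(p_change) - math.log(1.0 - p_change)

    for sweep in range(n_sweeps):
        left = 0  # nearest boundary to the left of the current position
        for k in range(1, n):
            right = k + 1
            while not cp[right]:
                right += 1
            # Conditional log odds of a boundary at k given all other boundaries.
            log_odds = (prior_log_odds
                        + seg(left, k) + seg(k, right) - seg(left, right))
            cp[k] = rng.random() < sigmoid(log_odds)
            if cp[k]:
                left = k
        if sweep >= burn_in:
            for k in range(1, n):
                counts[k] += cp[k]

    kept = n_sweeps - burn_in
    return [c / kept for c in counts]


# Toy input: a well-conserved block followed by a poorly conserved one.
toy = [1] * 60 + [0] * 60
posterior = gibbs_changepoints(toy)
```

In this toy run the boundary probability should concentrate near index 60, where the match rate changes. Integrating out the per-segment match probability keeps the sampler's state to the boundary indicators alone, which is a common simplification and not necessarily the parameterisation used by changept.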