Author
Liu, Ge - George
WEIRAUCH, MATTHEW - UNIV OF CA SANTA CRUZ
Van Tassell, Curtis - Curt
Li, Robert
Sonstegard, Tad
MATUKUMALLI, LAKSHMI - GEORGE MASON UNIVERSITY
Connor, Erin
HANSON, RICHARD - CASE WESTERN UNIVERSITY
YANG, JIANQI - CASE WESTERN UNIVERSITY
Submitted to: Genome Biology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/25/2007
Publication Date: 3/27/2009
Citation: Liu, G., Weirauch, M., Van Tassell, C.P., Li, R.W., Sonstegard, T.S., Matukumalli, L.K., Connor, E.E., Hanson, R.W., Yang, J. 2008. Identification of conserved regulatory elements in mammalian promoter regions: a case study using the PCK1 promoter. Genomics, Proteomics and Bioinformatics. 6(3-4):129-143.
Interpretive Summary: Comparative genomics, which identifies conserved genetic sequences through cross-species genome comparison, is the primary method for discovering regulatory elements. Except for the most conserved and prominent transcription factor binding sites (TFBS), in silico predictions and experimental results generally do not agree for most TFBS. This is particularly true of less conserved but biologically active elements, which may be relevant to tissue-specific and temporal-specific transcription regulation. Detailed quality control and benchmarking of in silico predictions are currently missing. We designed a systematic approach that combines position weight matrices (PWM, from JASPAR) with a phylogenetic footprinting algorithm (TFLOC) to identify less conserved but biologically active TFBS in mammalian promoter regions. Using human, mouse and rat promoter sequence alignments as input, we applied this approach to the upstream 1 kb promoter regions of all available RefSeq genes. Computational predictions were compared with previously known sites of PEPCK (Phosphoenolpyruvate Carboxykinase, Cytosolic isoform, pck1). This approach produced a sensitivity of over 75% and a true-positive rate of about 32%. In addition to correctly predicting previously known TFBS, it revealed some novel candidate sites. The newly discovered sites were confirmed experimentally, including by gel shift and in vitro reporter assays.
This approach provides an accessible resource for developing transcription research hypotheses, and the TFBS dataset for all available RefSeq genes is freely available at http://bfgl.anri.barc.usda.gov/tfbsConsSites.
Technical Abstract:
Background: Comparative genomics is the primary method for discovering regulatory elements: cross-species genome comparison identifies sequences that are conserved because of evolutionary constraints. Except for the most conserved and prominent transcription factor binding sites (TFBS), in silico predictions and experimental results have generally not been cross-referenced for most TFBS. In particular, detailed quality control and benchmarking of in silico predictions are currently missing for less conserved but biologically active elements, which may be relevant to tissue-specific and temporal-specific transcription regulation.
Results: A systematic approach that combines position weight matrices (PWM, from JASPAR) with a phylogenetic footprinting algorithm (TFLOC) was implemented to identify less conserved but biologically active TFBS in mammalian promoter regions. Using human, mouse and rat promoter sequence alignments as input, this approach was applied to the upstream 1 kb promoter regions of all available RefSeq genes. Computational predictions were compared with previously known sites of PEPCK (Phosphoenolpyruvate Carboxykinase, Cytosolic isoform, pck1). This approach produced a sensitivity of over 75% and a true-positive rate of about 32%. In addition to correctly predicting previously known TFBS, it revealed some novel candidate sites. The newly discovered sites were confirmed experimentally, including by gel shift and in vitro reporter assays.
Conclusions: This approach features an expandable TFBS matrix set and an adjustable threshold, and it is compatible with whole-genome analysis.
It provides an accessible resource for developing transcription research hypotheses, and the TFBS dataset for all available RefSeq genes is freely available at http://bfgl.anri.barc.usda.gov/tfbsConsSites.
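To make the general idea of combining a position weight matrix with cross-species conservation more concrete, the following is a minimal Python sketch. It is not the TFLOC algorithm and does not use real JASPAR data: the count matrix, the three aligned sequences, the score threshold, and all function names are hypothetical and chosen only for illustration. The sketch converts counts to log-odds scores, keeps only alignment positions where every species scores at or above the threshold, and then compares predictions with a list of known sites. The summaries above do not define how the reported rates were calculated, so the "precision" returned here (the fraction of predictions that match known sites) is an assumption of this sketch.

```python
import math

# Hypothetical toy count matrix for a 4-bp motif (NOT a real JASPAR profile).
TOY_COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 9, 1],
    "G": [1, 10, 0, 0],
    "T": [1, 0, 1, 8],
}

# Hypothetical aligned promoter fragments; '-' marks an alignment gap.
TOY_ALIGNMENT = {
    "human": "TTAGCTCCAGCTGA",
    "mouse": "TTAGCTCCAGATGA",
    "rat":   "TTAGCTCC-GCTGA",
}


def log_odds_pwm(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[base][i] for base in "ACGT") + 4 * pseudocount
        pwm.append({
            base: math.log2((counts[base][i] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Score one window; return None if it contains a gap or ambiguous base."""
    if any(ch not in "ACGT" for ch in window):
        return None
    return sum(column[ch] for column, ch in zip(pwm, window))


def conserved_hits(pwm, aligned, threshold):
    """Return (alignment_start, lowest_score) where every species passes the threshold."""
    width = len(pwm)
    length = len(next(iter(aligned.values())))
    hits = []
    for start in range(length - width + 1):
        scores = [
            score_window(pwm, seq[start:start + width].upper())
            for seq in aligned.values()
        ]
        if all(s is not None and s >= threshold for s in scores):
            hits.append((start, min(scores)))
    return hits


def sensitivity_and_precision(predicted, known):
    """Sensitivity = TP / known sites; precision = TP / predicted sites."""
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    sensitivity = true_positives / len(known) if known else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    pwm = log_odds_pwm(TOY_COUNTS)
    hits = conserved_hits(pwm, TOY_ALIGNMENT, threshold=5.0)
    starts = [start for start, _ in hits]
    print("conserved hit starts:", starts)
    # Hypothetical "known" sites at alignment positions 2 and 8.
    print("sensitivity, precision:", sensitivity_and_precision(starts, [2, 8]))
```

In this toy alignment the motif occurs twice in the human sequence, but the second occurrence is disrupted by a substitution in mouse and a gap in rat, so only the first survives the conservation filter. Lowering the threshold, or requiring a hit in only a subset of species, would trade precision for sensitivity, which is the kind of adjustment the adjustable threshold mentioned above allows.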