Site-specific transcription factors (TFs) play a critical role in regulating gene expression during mammalian development and disease. TFs recognise and bind specific DNA sequences, then recruit transcriptional cofactors to regulate expression of their target genes. However, genome-wide studies of TF occupancy, such as ChIP-seq, have revealed that TFs bind relatively few of their cognate sequences, while many sites that are bound in vivo lack identifiable binding sequences. This discrepancy is likely due to cooperative binding with other transcriptional cofactors, to recognition of additional, previously unidentified binding sequences, or to both.
To test these possibilities, we have examined members of the Krüppel-like Factor (KLF), Specificity Protein (SP), and Early Growth Response (EGR) families, all well-known members of the C2H2-ZF protein superfamily, for their ability to recognise previously unrecognised sequences, that is, an expanded lexicon of binding sequences. Using SPEC-seq, a combination of electrophoretic mobility shift assay (EMSA) and next-generation sequencing, we have defined the relative binding affinities and specificities of KLF, SP and EGR family members for more than a million variants of the 10 bp binding motif.
We find a strong correlation between in vivo and in vitro relative binding affinities. Our results suggest that TFs bind to motifs with a much broader spectrum of affinities than previously thought. This is likely because intermediate-affinity sites are not adequately represented by SELEX motif enrichment, by phage display, or by de novo motif discovery from ChIP-seq data, since these approaches are biased towards discovery of the very highest-affinity sites. SPEC-seq offers a new way to ascertain the full repertoire of potential binding sites for any TF, and thus provides new insights into the interactions between TF binding and epigenetic regulation of gene expression.
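
As an illustration of how relative binding affinities can be derived once the bound and unbound EMSA fractions have been sequenced, the sketch below normalises each variant's bound-to-unbound read ratio to that of a reference sequence. This is a minimal sketch under stated assumptions and not the analysis pipeline used in this work: the function name `relative_affinities`, the `min_reads` threshold, the example sequences and all read counts are hypothetical, and the ratio-of-ratios estimate is assumed to be an adequate approximation.

```python
from typing import Dict


def relative_affinities(
    bound: Dict[str, int],
    unbound: Dict[str, int],
    reference: str,
    min_reads: int = 1,  # hypothetical threshold, not taken from the text
) -> Dict[str, float]:
    """Estimate the affinity of each sequence relative to a reference.

    Assumption: relative affinity is approximated by
    (bound_i / unbound_i) / (bound_ref / unbound_ref).
    """
    if bound.get(reference, 0) < min_reads or unbound.get(reference, 0) < min_reads:
        raise ValueError("reference sequence lacks reads in one of the fractions")

    ref_ratio = bound[reference] / unbound[reference]

    result: Dict[str, float] = {}
    for seq, n_bound in bound.items():
        n_unbound = unbound.get(seq, 0)
        # Skip variants whose ratio cannot be estimated from the available reads.
        if n_bound < min_reads or n_unbound < min_reads:
            continue
        result[seq] = (n_bound / n_unbound) / ref_ratio
    return result


# Hypothetical read counts for three 10 bp variants (illustrative only).
bound_counts = {"GGGGCGGGGC": 800, "GGGGTGGGGC": 200, "GGGACGGGGC": 50}
unbound_counts = {"GGGGCGGGGC": 200, "GGGGTGGGGC": 400}

rel = relative_affinities(bound_counts, unbound_counts, reference="GGGGCGGGGC")
for seq, value in sorted(rel.items(), key=lambda kv: -kv[1]):
    print(seq, round(value, 3))
```

By construction the reference sequence has a relative affinity of 1, and variants with no reads in either fraction are omitted rather than assigned an undefined ratio.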