Defining a list of “transcription factors” is not as straightforward as one could imagine. If we browse the literature we can actually find different lists of TFs out there, and they do not always overlap. In this post, I compared a few of the main ones (most-cited and most-used) out there.
What is a Transcription Factor (TF)?
First of all, let’s start with the basics. What is a TF? TFs are proteins that regulate gene activity by binding to specific DNA sequences termed motifs or regulatory elements (Ptashne and Gann, 1997, Lambert et al., 2018). These DNA-binding sequences can be found proximal to the target genes, such as upstream (promoters), downstream, or even in intronic regions, or more distant to the target genes (e.g. enhancers or silencers). A defining feature of TFs is that they contain at least one specific domain that can bind DNA, which is termed a DNA-binding domain (DBD). TFs use a variety of DNA-binding structural motifs to recognize their target DNA sequences, these include homeodomain (HD), helix-turn-helix (HTH) and high-mobility group box (HMG) (Mistis et al., 2020). Such DBDs can be used to classify TFs. Several other proteins participate in the process of regulation of gene expression (coactivators, chromatin remodelers, histone acetyltransferases, histone deacetylases, kinases, and methylases). These other proteins are also essential to gene regulation, but lack DNA-binding domains, and therefore are not considered TFs (Brivanlou and Darnell, 2002).
The regulation of gene transcription orchestrated by complex transcriptional regulatory networks is a fundamental part of both tissue-specific gene expression and gene activity in response to stimuli (Latchman 1997, Wilkinson et al., 2017). Thus, the analysis of TF-networks is a central issue in bioinformatics and systems biology.
How many TFs exist?
When I started working in the inference of TF-networks one of the first issues I faced was simply the annotation of TFs themselves. Is this protein a TF? Or a co-factor? Or is it another kind of transcriptional regulator (that is not a TF)? Coming up with a list of TFs is still not that trivial.
So, I started by collecting the lists of Human TFs as annotated in different papers or databases, I selected the ones I think are amongst the most commonly used and mentioned:
- Vaquerizas et al. published a paper in Nature Reviews Genetics in 2009 entitled “A census of human transcription factors: function, expression and evolution”. This is a highly cited paper from Nick Luscomb’s group, and the data herein is often used as a source list of TFs. I downloaded 1987 TFs from Supplementary Table 3.
Carro et al., 2010 is a 2010 paper published in Nature from Andrea Califano group. Califano’s lab is well known for their work in the inference of TF-networks, and master regulator analysis. This is the first list of TFs I used. It contains 928 TFs published in Supplementary Table 2.
The TcoF-DB is a Database of transcription co-factors and transcription factor interactions. TFs were taken from a list of proteins published in a 2014 Nature publication by Forrest et al. 2014. The TcoF-DB was initially developed in 2010 (Schaefer et al., 2010) and since then it has been updated frequently, with the latest update published in NAR in 2017 Schmeier et a., 2017. The list of 1758 human TFs can be downloaded from TcoF-DB website.
Lambert et al., 2018 is a fairly recent paper published in Cell in 2018. In this paper, the authors curated a list of 1639 human TFs (quoting) which they published in Supplementary Table 1.
The AnimalTFDB3.0 is a comprehensive database including classification and annotation of genome-wide TFs and cofactors in 97 animal genomes. The first AnimalTFDB was constructed and published in NAR in 2012 (Zhang et al., 2012), and it has since then been updated frequently, the last version came out in 2019 (Hu et al. 2019). To identify TFs, the authors used Hidden Markov Model (HMM) profiles of DBDs, these were either available in Pfam database or rebuilt. Then, by applying the hmmsearch program in HMMER package, protein sequences were searched against the DBD HMM profiles to predict TFs. This latest version contains, as of December 2020, 1665 TFs and can be downloaded from AnimalTFDB3.0 website – Downloads section.
DoRothEA is a gene set resource containing signed transcription factor (TF) – target interactions first described in Garcia-Alonso et al., 2019. This is one of the most recent, and impressive, efforts to collect TF-regulon data coming from Julio-Saez Rodrigues group. As of 5th of December 2020, this collection contains 1355 TFs and can be downloaded directly from the BioC package dorothea (see code below).
Finally, I collected a list of 1605 Ensemble gene IDs associated with GO:0003700 (‘DNA-binding transcription factor activity’). This list contains genes that may act as transcription factors. The idea of using this GO Id comes from many different posts and online forums (e.g. here and here) but also recently used in Schubert et al., 2020. Actually one of my favourite recent papers, it argues that inference of regulatory networks in cancer via co-expression analysis can be biased by aneuploidies – but that is a topic for another post! This list can be accessed directed through R (see code below).
Get all TF lists
# Set up options(stringsAsFactors = FALSE) library(data.table) library(annotables) data(grch38) # Vaquerizas_2009, Carro_2010, Forrest_2014, Lampber_2018: Make sure you download the corresponding files from the papers Vaquerizas_2009 <- unique(read.delim("http://inesdesantiago.github.io/SeqQC.blog/TFlists/Vaquerizas_2009_SupTable3.txt")$EnsemblID) Carro_2010 <- read.delim("http://inesdesantiago.github.io/SeqQC.blog/TFlists/Carro_2010_SupTable2.txt", head=FALSE)$V1 Carro_2010 <- unique(na.omit(grch38$ensgene[grch38$symbol %in% Carro_2010])) Forrest_2014 <- read.delim("http://inesdesantiago.github.io/SeqQC.blog/TFlists/Forrest_2014.txt")$GeneID Forrest_2014 <- unique(na.omit(grch38$ensgene[grch38$entrez %in% Forrest_2014])) Lambert_2018 <- unique(read.delim("http://inesdesantiago.github.io/SeqQC.blog/TFlists/Lambert_2018_SupTable1.txt")$ENSGid) # AnimalTFDB3 (2019) AnimalTFDB3_2019 <- unique(fread("http://bioinfo.life.hust.edu.cn/static/AnimalTFDB3/download/Homo_sapiens_TF")$Ensembl) # DoRothEA (2019) library(dorothea) data("entire_database", package = "dorothea") Dorothea_2019 <- unique(na.omit(grch38$ensgene[grch38$symbol %in% unique(entire_database$tf)])) # GO0003700 library(org.Hs.eg.db) GO0003700 <- data.frame(mget("GO:0003700", org.Hs.egGO2ALLEGS)[])[,1] GO0003700 <- unique(na.omit(grch38$ensgene[grch38$entrez %in% GO0003700])) # Arrange all in a list TFs_lists <- list("Vaquerizas_2009"=Vaquerizas_2009, "Carro_2010"=Carro_2010, "Lambert_2018"=Lambert_2018, "AnimalTFDB3_2019"=AnimalTFDB3_2019, "Dorothea_2019"=Dorothea_2019, "Forrest_2014"=Forrest_2014, "GO0003700"=GO0003700)
Intersect and Analyse TF lists
Some TFs appear in only one list, others in several lists.
venn library allows us to plot up to 7 lists. The plot is not very informative because there are so many numbers and combinations to consider, but here it goes:
The UpSet plot can show a better overview of how lists of TFs overlap. The first two bigger bars tell us that around 600+ TFs appear in all 7 lists, and another group just below 600 appears in all 6 lists except Carro et al., 2010. The larger bar near the right end side of the plot tells us that there is a group of about ~380 TFs that appear solely in Vaquerizas et al., 2009 list.
library(ComplexHeatmap) gene_matrix <- list_to_matrix(TFs_lists) m = make_comb_mat(gene_matrix) UpSet(m)
The overlap of the TF lists can also be conveniently visualized using a Jaccard index. Seems like the list of TFs from Carro et al., 2010 and Vaquerizas et al 2009 are the ones most dissimilar:
library(IntClust) library(pheatmap) jaccard_similarity = 1 - Distance(t(gene_matrix) , "tanimoto") pheatmap(jaccard_similarity)
Looking at TFs that appear in only one out of seven TF lists
Having realised that there are a few TFs that appear only in 1 of the 7 TF lists, I felt compelled into looking at them in more detail. Are those totally nonsensical?
There are 608 TFs that appear solely in one TF list, and most of these come from Vaquerizas et al., 2009:
rs1 <- gene_matrix[rowSums(gene_matrix) == 1,] nrow(rs1) #608 combinations <- apply(rs1,1, function(x) paste(names(x)[x==1], collapse=",")) barplot(sort(table(combinations)), las=2, cex.names=.5, main = "TFs appearing solely in one TF list", ylab = "Number of TFs", density=20, angle=45, col="brown" )
PANTHER.db is a protein database with proteins annotated with Gene Ontology terms and organised into pathways and classes (See the 2013 paper in NAR). I really like to use PANTHER.db protein classes to have a general idea of the protein classes that exist in my dataset. I had to do this step manually, but if anyone knows how to get this informatically directly in R – please let me know!
So, first, I save the list of genes on interest (those appearing solely in 1 TF list) and looked at them in the Panther database. HGNC IDs are a pretty robust way to map IDs and retrieve data from the Panther Database
# load libraries library(biomaRt) library(data.table) # combine data data <- data.table("ENSG"=names(combinations), "DB" = combinations) # any(duplicated(data$ENSG)) #FALSE # Annotate HGNC IDs ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host="www.ensembl.org") genemap <- getBM( attributes = c("ensembl_gene_id", "hgnc_id"), filters = "ensembl_gene_id", values = data$ENSG, mart = ensembl ) idx <- match(data$ENSG, genemap$ensembl_gene_id) data$HGNC <- genemap$hgnc_id[idx] data <- data[!is.na(data$HGNC),] data <- data[data$HGNC != "",] # Annotate gene symbols data[, "GeneSymbol" := grch38$symbol[match(ENSG, grch38$ensgene)] ] data <- data[!is.na(GeneSymbol),] # write data[, ENSG := NULL] data <- unique(data) fwrite(data, "Genes_in_solely1_TFlist.txt", sep="\t")
Then, I enter tthis list of HGNC symbols into PANTHER.db. I select the organism: “Homo Sapiens” and then “Functional classification viewed in gene list”. A table with multiple annotations per gene appears. You can download it by clicking “Send the list to” –> “File”.
# load Panther results panther_results <- fread("pantherGeneList.txt", head = FALSE) colnames(panther_results) <- c("GeneID","HGNC","GeneName","PantherFamily","PantherProteinClass","Species") panther_results <- panther_results[, .(HGNC, PantherProteinClass)] # merge with previous list data <- merge(data, panther_results, by = "HGNC") data <- data[PantherProteinClass != "",] summaryData <- data[, .N, by = c("DB", "PantherProteinClass")] # order factor for plot orderClass <- data[, .N, by = "PantherProteinClass"][order(N)]$PantherProteinClass summaryData$PantherProteinClass <- factor(summaryData$PantherProteinClass, levels = orderClass) # plot barplot library(ggplot2) library(jcolors) ggplot(summaryData, aes(PantherProteinClass, N, fill = DB)) + geom_bar(stat="identity") + coord_flip() + scale_fill_jcolors(palette = "pal7") + ggtitle("Protein Classes of Eliminated TFs")
Lets look at a few classes in detail:
- Some are chromatin-binding proteins such as RBL2, RB1, HIRA, SIN3A, which actually one may argue are also TFs. But within this class w also find SUZ12 and PHF19 which are known chromatin-modifying enzymes. Other proteins seem to be tricky, for instance, PAX binding protein 1 (PAXBP1) is an adaptor protein linking the transcription factor PAX3 and PAX7 to the histone methylation machinery:
data[PantherProteinClass == "chromatin/chromatin-binding, or -regulatory protein(PC00077)", list(DB,GeneSymbol)] %>% unique()
## DB GeneSymbol ## 1: Carro_2010 SCML1 ## 2: Carro_2010 SCML2 ## 3: Forrest_2014 PAXBP1 ## 4: Forrest_2014 NCOA6 ## 5: Vaquerizas_2009 SUZ12 ## 6: Forrest_2014 L3MBTL2 ## 7: Lambert_2018 SCMH1 ## 8: Carro_2010 AATF ## 9: GO0003700 SIN3A ## 10: Vaquerizas_2009 PDS5B ## 11: Lambert_2018 PHF19 ## 12: Vaquerizas_2009 BRD9 ## 13: AnimalTFDB3_2019 SAMD11 ## 14: Vaquerizas_2009 RNASEK-C17orf49 ## 15: Carro_2010 HIRA ## 16: GO0003700 MDM2 ## 17: GO0003700 MDM4 ## 18: Lambert_2018 PHF1 ## 19: Carro_2010 RB1 ## 20: Carro_2010 RBL2
- Many proteins from the La-related protein (LARP) family are classified as “RNA binding proteins” by Panther and were labelled as TFs solely by the Vaquerizas publication. Known to be canonical components of transcription via its effects on tRNA maturation, polymerase III transcript stability, mRNA translation, and mRNA metabolism (Stavraka1 and Blagden, 2015), their label as ‘transcription factors’ can be questioned:
data[PantherProteinClass == "RNA binding protein(PC00031)", list(DB,GeneSymbol)] %>% unique()
## DB GeneSymbol ## 1: Vaquerizas_2009 SSB ## 2: Vaquerizas_2009 ZFP36 ## 3: Vaquerizas_2009 ZFR ## 4: Forrest_2014 PSPC1 ## 5: Vaquerizas_2009 RBM26 ## 6: Vaquerizas_2009 ADAR ## 7: Vaquerizas_2009 LARP6 ## 8: Vaquerizas_2009 LARP4 ## 9: Vaquerizas_2009 LARP1B ## 10: Vaquerizas_2009 LARP7 ## 11: Vaquerizas_2009 LARP4B ## 12: Vaquerizas_2009 ZFR2 ## 13: Vaquerizas_2009 RBM27 ## 14: Vaquerizas_2009 LARP1 ## 15: Forrest_2014 HNRNPK ## 16: Forrest_2014 NONO ## 17: Vaquerizas_2009 NUFIP1 ## 18: Forrest_2014 PCBP2 ## 19: Forrest_2014 PCBP4
- Proteins labelled as “ubiquitin-protein ligase”. Here I immediately spot BRCA1. BRCA1 is a human tumour suppressor gene (also known as a caretaker gene) and is responsible for repairing DNA, I wouldn’t see it as a TF, but it is annotated as such in the TFcoFDB. Then we have a few subunits of E3 ubiquitin ligase complexes such as the cullin ubiquitin ligases (CUL1, CUL2, CUL3, etc) and the Makorin RING finger proteins (MKRN1, MKRN2) which I believe should not be labelled as TFs either:
data[PantherProteinClass == "ubiquitin-protein ligase(PC00234)", list(DB,GeneSymbol)] %>% unique()
## DB GeneSymbol ## 1: Vaquerizas_2009 TRIM3 ## 2: Forrest_2014 BRCA1 ## 3: Carro_2010 TRIM22 ## 4: Vaquerizas_2009 CBLL1 ## 5: Vaquerizas_2009 CUL1 ## 6: Vaquerizas_2009 CUL2 ## 7: Vaquerizas_2009 CUL3 ## 8: Vaquerizas_2009 CUL4A ## 9: Vaquerizas_2009 CUL4B ## 10: Vaquerizas_2009 CUL5 ## 11: Vaquerizas_2009 UBR4 ## 12: Vaquerizas_2009 UBE2K ## 13: Vaquerizas_2009 MKRN1 ## 14: Vaquerizas_2009 MKRN2 ## 15: Vaquerizas_2009 MKRN3 ## 16: Forrest_2014 CNOT4
Final list of TFs
How to come up with a reliable list of TFs?
Here we can continue the manual curation of the list of TFs a bit further, and really examine all the inconsistent ones one by one. But I guess, as a quick decision rule, we can decide to trust any protein that appears in at least 2 out of the 7 lists considered. This rule leads to a list of 1924 TFs.
gene_matrix <- gene_matrix[rowSums(gene_matrix) >= 2,] final_TF_list <- data.table("ENSG" = rownames(gene_matrix), "evidence" = apply(gene_matrix,1, function(x) paste(names(x)[x==1]))) dim(final_TF_list)
##  1924 2
I will get back to this to hopefully do something useful with it.
In the meantime, I made the list available in this link, I hope it can be useful to someone else!