Where to find a comprehensive list of potential human transcription factors?

Where to find a comprehensive list of potential human transcription factors?

Defining a list of “transcription factors” is not as straightforward as one could imagine. If we browse the literature we can actually find different lists of TFs out there, and they do not always overlap. In this post, I compared a few of the main ones (most-cited and most-used) out there.

What is a Transcription Factor (TF)?

First of all, let’s start with the basics. What is a TF? TFs are proteins that regulate gene activity by binding to specific DNA sequences termed motifs or regulatory elements (Ptashne and Gann, 1997, Lambert et al., 2018). These DNA-binding sequences can be found proximal to the target genes, such as upstream (promoters), downstream, or even in intronic regions, or more distant to the target genes (e.g. enhancers or silencers). A defining feature of TFs is that they contain at least one specific domain that can bind DNA, which is termed a DNA-binding domain (DBD). TFs use a variety of DNA-binding structural motifs to recognize their target DNA sequences, these include homeodomain (HD), helix-turn-helix (HTH) and high-mobility group box (HMG) (Mistis et al., 2020). Such DBDs can be used to classify TFs. Several other proteins participate in the process of regulation of gene expression (coactivators, chromatin remodelers, histone acetyltransferases, histone deacetylases, kinases, and methylases). These other proteins are also essential to gene regulation, but lack DNA-binding domains, and therefore are not considered TFs (Brivanlou and Darnell, 2002).

The regulation of gene transcription orchestrated by complex transcriptional regulatory networks is a fundamental part of both tissue-specific gene expression and gene activity in response to stimuli (Latchman 1997, Wilkinson et al., 2017). Thus, the analysis of TF-networks is a central issue in bioinformatics and systems biology.

How many TFs exist?

When I started working in the inference of TF-networks one of the first issues I faced was simply the annotation of TFs themselves. Is this protein a TF? Or a co-factor? Or is it another kind of transcriptional regulator (that is not a TF)? Coming up with a list of TFs is still not that trivial.

So, I started by collecting the lists of Human TFs as annotated in different papers or databases, I selected the ones I think are amongst the most commonly used and mentioned:

  • Vaquerizas et al. published a paper in Nature Reviews Genetics in 2009 entitled “A census of human transcription factors: function, expression and evolution”. This is a highly cited paper from Nick Luscomb’s group, and the data herein is often used as a source list of TFs. I downloaded 1987 TFs from Supplementary Table 3.

  • Carro et al., 2010 is a 2010 paper published in Nature from Andrea Califano group. Califano’s lab is well known for their work in the inference of TF-networks, and master regulator analysis. This is the first list of TFs I used. It contains 928 TFs published in Supplementary Table 2.

  • The TcoF-DB is a Database of transcription co-factors and transcription factor interactions. TFs were taken from a list of proteins published in a 2014 Nature publication by Forrest et al. 2014. The TcoF-DB was initially developed in 2010 (Schaefer et al., 2010) and since then it has been updated frequently, with the latest update published in NAR in 2017 Schmeier et a., 2017. The list of 1758 human TFs can be downloaded from TcoF-DB website.

  • Lambert et al., 2018 is a fairly recent paper published in Cell in 2018. In this paper, the authors curated a list of 1639 human TFs (quoting) which they published in Supplementary Table 1.

  • The AnimalTFDB3.0 is a comprehensive database including classification and annotation of genome-wide TFs and cofactors in 97 animal genomes. The first AnimalTFDB was constructed and published in NAR in 2012 (Zhang et al., 2012), and it has since then been updated frequently, the last version came out in 2019 (Hu et al. 2019). To identify TFs, the authors used Hidden Markov Model (HMM) profiles of DBDs, these were either available in Pfam database or rebuilt. Then, by applying the hmmsearch program in HMMER package, protein sequences were searched against the DBD HMM profiles to predict TFs. This latest version contains, as of December 2020, 1665 TFs and can be downloaded from AnimalTFDB3.0 website – Downloads section.

  • DoRothEA is a gene set resource containing signed transcription factor (TF) – target interactions first described in Garcia-Alonso et al., 2019. This is one of the most recent, and impressive, efforts to collect TF-regulon data coming from Julio-Saez Rodrigues group. As of 5th of December 2020, this collection contains 1355 TFs and can be downloaded directly from the BioC package dorothea (see code below).

  • Finally, I collected a list of 1605 Ensemble gene IDs associated with GO:0003700 (‘DNA-binding transcription factor activity’). This list contains genes that may act as transcription factors. The idea of using this GO Id comes from many different posts and online forums (e.g. here and here) but also recently used in Schubert et al., 2020. Actually one of my favourite recent papers, it argues that inference of regulatory networks in cancer via co-expression analysis can be biased by aneuploidies – but that is a topic for another post! This list can be accessed directed through R (see code below).

Get all TF lists

# Set up
options(stringsAsFactors = FALSE)
library(data.table)
library(annotables)
data(grch38)

# Vaquerizas_2009, Carro_2010, Forrest_2014, Lampber_2018: Make sure you download the corresponding files from the papers
Vaquerizas_2009 <- unique(read.delim("http://inesdesantiago.github.io/SeqQC.blog/TFlists/Vaquerizas_2009_SupTable3.txt")$EnsemblID)

Carro_2010 <- read.delim("http://inesdesantiago.github.io/SeqQC.blog/TFlists/Carro_2010_SupTable2.txt", head=FALSE)$V1
Carro_2010 <- unique(na.omit(grch38$ensgene[grch38$symbol %in% Carro_2010]))

Forrest_2014 <- read.delim("http://inesdesantiago.github.io/SeqQC.blog/TFlists/Forrest_2014.txt")$GeneID
Forrest_2014 <- unique(na.omit(grch38$ensgene[grch38$entrez %in% Forrest_2014]))

Lambert_2018 <- unique(read.delim("http://inesdesantiago.github.io/SeqQC.blog/TFlists/Lambert_2018_SupTable1.txt")$ENSGid)

# AnimalTFDB3 (2019)
AnimalTFDB3_2019 <- unique(fread("http://bioinfo.life.hust.edu.cn/static/AnimalTFDB3/download/Homo_sapiens_TF")$Ensembl)

# DoRothEA (2019)
library(dorothea)
data("entire_database", package = "dorothea")
Dorothea_2019 <- unique(na.omit(grch38$ensgene[grch38$symbol %in% unique(entire_database$tf)]))

# GO0003700
library(org.Hs.eg.db)
GO0003700 <- data.frame(mget("GO:0003700", org.Hs.egGO2ALLEGS)[[1]])[,1]
GO0003700 <- unique(na.omit(grch38$ensgene[grch38$entrez %in% GO0003700]))

# Arrange all in a list
TFs_lists <- list("Vaquerizas_2009"=Vaquerizas_2009, 
                  "Carro_2010"=Carro_2010, 
                  "Lambert_2018"=Lambert_2018,
                  "AnimalTFDB3_2019"=AnimalTFDB3_2019, 
                  "Dorothea_2019"=Dorothea_2019,
                  "Forrest_2014"=Forrest_2014,
                  "GO0003700"=GO0003700)

Intersect and Analyse TF lists

Some TFs appear in only one list, others in several lists.
Luckily venn library allows us to plot up to 7 lists. The plot is not very informative because there are so many numbers and combinations to consider, but here it goes:

library(venn)
venn(TFs_lists[1:7])

The UpSet plot can show a better overview of how lists of TFs overlap. The first two bigger bars tell us that around 600+ TFs appear in all 7 lists, and another group just below 600 appears in all 6 lists except Carro et al., 2010. The larger bar near the right end side of the plot tells us that there is a group of about ~380 TFs that appear solely in Vaquerizas et al., 2009 list.

library(ComplexHeatmap)
gene_matrix <- list_to_matrix(TFs_lists)
m = make_comb_mat(gene_matrix)
UpSet(m)

The overlap of the TF lists can also be conveniently visualized using a Jaccard index. Seems like the list of TFs from Carro et al., 2010 and Vaquerizas et al 2009 are the ones most dissimilar:

library(IntClust)
library(pheatmap)
jaccard_similarity = 1 - Distance(t(gene_matrix) , "tanimoto")
pheatmap(jaccard_similarity)

Looking at TFs that appear in only one out of seven TF lists

Having realised that there are a few TFs that appear only in 1 of the 7 TF lists, I felt compelled into looking at them in more detail. Are those totally nonsensical?

There are 608 TFs that appear solely in one TF list, and most of these come from Vaquerizas et al., 2009:

rs1 <- gene_matrix[rowSums(gene_matrix) == 1,]
nrow(rs1) #608
combinations <- apply(rs1,1, function(x) paste(names(x)[x==1], collapse=","))
barplot(sort(table(combinations)), las=2, cex.names=.5, main = "TFs appearing solely in one TF list", ylab = "Number of TFs", density=20, angle=45, col="brown" )

PANTHER.db is a protein database with proteins annotated with Gene Ontology terms and organised into pathways and classes (See the 2013 paper in NAR). I really like to use PANTHER.db protein classes to have a general idea of the protein classes that exist in my dataset. I had to do this step manually, but if anyone knows how to get this informatically directly in R – please let me know!

So, first, I save the list of genes on interest (those appearing solely in 1 TF list) and looked at them in the Panther database. HGNC IDs are a pretty robust way to map IDs and retrieve data from the Panther Database

# load libraries
library(biomaRt)
library(data.table)

# combine data
data <- data.table("ENSG"=names(combinations), "DB" = combinations)
# any(duplicated(data$ENSG)) #FALSE

# Annotate HGNC IDs
ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", 
                    host="www.ensembl.org")
genemap <- getBM( attributes = c("ensembl_gene_id", "hgnc_id"),
                    filters = "ensembl_gene_id",
                    values = data$ENSG,
                    mart = ensembl )
idx <- match(data$ENSG, genemap$ensembl_gene_id)
data$HGNC <- genemap$hgnc_id[idx]
data <- data[!is.na(data$HGNC),]
data <- data[data$HGNC != "",]

# Annotate gene symbols
data[, "GeneSymbol" := grch38$symbol[match(ENSG, grch38$ensgene)] ]
data <- data[!is.na(GeneSymbol),]

# write
data[, ENSG := NULL]
data <- unique(data)
fwrite(data, "Genes_in_solely1_TFlist.txt", sep="\t")

Then, I enter tthis list of HGNC symbols into PANTHER.db. I select the organism: “Homo Sapiens” and then “Functional classification viewed in gene list”. A table with multiple annotations per gene appears. You can download it by clicking “Send the list to” –> “File”.

# load Panther results
panther_results <- fread("pantherGeneList.txt", head = FALSE)
colnames(panther_results) <- c("GeneID","HGNC","GeneName","PantherFamily","PantherProteinClass","Species")
panther_results <- panther_results[, .(HGNC, PantherProteinClass)]

# merge with previous list
data <- merge(data, panther_results, by = "HGNC")
data <- data[PantherProteinClass != "",]
summaryData <- data[, .N, by = c("DB", "PantherProteinClass")]

# order factor for plot
orderClass <- data[, .N, by = "PantherProteinClass"][order(N)]$PantherProteinClass
summaryData$PantherProteinClass <- factor(summaryData$PantherProteinClass, levels = orderClass)

# plot barplot
library(ggplot2)
library(jcolors)
ggplot(summaryData, aes(PantherProteinClass, N, fill = DB)) + geom_bar(stat="identity") +  coord_flip() + scale_fill_jcolors(palette = "pal7") + ggtitle("Protein Classes of Eliminated TFs")

Lets look at a few classes in detail:

  • Some are chromatin-binding proteins such as RBL2, RB1, HIRA, SIN3A, which actually one may argue are also TFs. But within this class w also find SUZ12 and PHF19 which are known chromatin-modifying enzymes. Other proteins seem to be tricky, for instance, PAX binding protein 1 (PAXBP1) is an adaptor protein linking the transcription factor PAX3 and PAX7 to the histone methylation machinery:
data[PantherProteinClass == "chromatin/chromatin-binding, or -regulatory protein(PC00077)", list(DB,GeneSymbol)] %>% unique()
##                   DB      GeneSymbol
##  1:       Carro_2010           SCML1
##  2:       Carro_2010           SCML2
##  3:     Forrest_2014          PAXBP1
##  4:     Forrest_2014           NCOA6
##  5: Vaquerizas_2009           SUZ12
##  6:     Forrest_2014         L3MBTL2
##  7:     Lambert_2018           SCMH1
##  8:       Carro_2010            AATF
##  9:        GO0003700           SIN3A
## 10: Vaquerizas_2009           PDS5B
## 11:     Lambert_2018           PHF19
## 12: Vaquerizas_2009            BRD9
## 13: AnimalTFDB3_2019          SAMD11
## 14: Vaquerizas_2009 RNASEK-C17orf49
## 15:       Carro_2010            HIRA
## 16:        GO0003700            MDM2
## 17:        GO0003700            MDM4
## 18:     Lambert_2018            PHF1
## 19:       Carro_2010             RB1
## 20:       Carro_2010            RBL2
  • Many proteins from the La-related protein (LARP) family are classified as “RNA binding proteins” by Panther and were labelled as TFs solely by the Vaquerizas publication. Known to be canonical components of transcription via its effects on tRNA maturation, polymerase III transcript stability, mRNA translation, and mRNA metabolism (Stavraka1 and Blagden, 2015), their label as ‘transcription factors’ can be questioned:
data[PantherProteinClass == "RNA binding protein(PC00031)", list(DB,GeneSymbol)] %>% unique()
##                   DB GeneSymbol
##  1: Vaquerizas_2009        SSB
##  2: Vaquerizas_2009      ZFP36
##  3: Vaquerizas_2009        ZFR
##  4:     Forrest_2014      PSPC1
##  5: Vaquerizas_2009      RBM26
##  6: Vaquerizas_2009       ADAR
##  7: Vaquerizas_2009      LARP6
##  8: Vaquerizas_2009      LARP4
##  9: Vaquerizas_2009     LARP1B
## 10: Vaquerizas_2009      LARP7
## 11: Vaquerizas_2009     LARP4B
## 12: Vaquerizas_2009       ZFR2
## 13: Vaquerizas_2009      RBM27
## 14: Vaquerizas_2009      LARP1
## 15:     Forrest_2014     HNRNPK
## 16:     Forrest_2014       NONO
## 17: Vaquerizas_2009     NUFIP1
## 18:     Forrest_2014      PCBP2
## 19:     Forrest_2014      PCBP4
  • Proteins labelled as “ubiquitin-protein ligase”. Here I immediately spot BRCA1. BRCA1 is a human tumour suppressor gene (also known as a caretaker gene) and is responsible for repairing DNA, I wouldn’t see it as a TF, but it is annotated as such in the TFcoFDB. Then we have a few subunits of E3 ubiquitin ligase complexes such as the cullin ubiquitin ligases (CUL1, CUL2, CUL3, etc) and the Makorin RING finger proteins (MKRN1, MKRN2) which I believe should not be labelled as TFs either:
data[PantherProteinClass == "ubiquitin-protein ligase(PC00234)", list(DB,GeneSymbol)] %>% unique()
##                   DB GeneSymbol
##  1: Vaquerizas_2009      TRIM3
##  2:     Forrest_2014      BRCA1
##  3:       Carro_2010     TRIM22
##  4: Vaquerizas_2009      CBLL1
##  5: Vaquerizas_2009       CUL1
##  6: Vaquerizas_2009       CUL2
##  7: Vaquerizas_2009       CUL3
##  8: Vaquerizas_2009      CUL4A
##  9: Vaquerizas_2009      CUL4B
## 10: Vaquerizas_2009       CUL5
## 11: Vaquerizas_2009       UBR4
## 12: Vaquerizas_2009      UBE2K
## 13: Vaquerizas_2009      MKRN1
## 14: Vaquerizas_2009      MKRN2
## 15: Vaquerizas_2009      MKRN3
## 16:     Forrest_2014      CNOT4

Final list of TFs

How to come up with a reliable list of TFs?

Here we can continue the manual curation of the list of TFs a bit further, and really examine all the inconsistent ones one by one. But I guess, as a quick decision rule, we can decide to trust any protein that appears in at least 2 out of the 7 lists considered. This rule leads to a list of 1924 TFs.

gene_matrix <- gene_matrix[rowSums(gene_matrix) >= 2,]
final_TF_list <- data.table("ENSG" = rownames(gene_matrix), "evidence" = apply(gene_matrix,1, function(x) paste(names(x)[x==1])))
dim(final_TF_list)
## [1] 1924    2

I will get back to this to hopefully do something useful with it.
In the meantime, I made the list available in this link, I hope it can be useful to someone else!

2 thoughts on “Where to find a comprehensive list of potential human transcription factors?

  1. Thank you so much for this entry! I was actually looking lately for some TF-database but was not much happy with the results at the end. This approach sounds really good. Thank you again!

    Like

  2. Great information about transcription. Keep posting. DelSh Business Consultancy provides best professional translation services. DBC is a cultural integration firm headquartered in New Delhi, India and branch office in Shanghai, China. We have expertise in providing professional language translations with our team of in-house native foreign language Translators, Business Analysts, Market Research Specialists. Click to know more about Delsh Business Consultancy

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s