Defining a list of "transcription factors" is not as straightforward as one might imagine. Browsing the literature turns up several different lists of TFs, and they do not always overlap. In this post, I compare a few of the main ones (the most cited and most used). What is a Transcription … Continue reading Where to find a comprehensive list of potential human transcription factors?
This post is mainly a list of tricks for building design matrices. I learned most of them along the way by reading the limma documentation and posts on StackOverflow and Bioconductor support, and often just from trial and error. I will assume you are familiar with the DE workflow, which might be different if you … Continue reading 10 Tips & Tricks for complex model.matrix designs in DGE analysis
The filtering of low-expression genes is a common practice in the analysis of RNA-seq data. There are several reasons for this. For the detection of differentially expressed genes (DEGs) and from a biological point of view, genes that are not expressed at a biologically meaningful level in any condition are not of interest and are therefore … Continue reading Removing low count genes for RNA-seq downstream analysis
Here I will show how principal component analysis (PCA) and singular value decomposition (SVD) are related to each other. As an example, let's create some data: Using prcomp To perform a typical principal component analysis on the samples, we need to transpose the data such that the samples are rows of the data matrix. … Continue reading SVD vs PRCOMP in R
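The relationship between the two can be summarized in a few lines. This is a sketch only, assuming X denotes the column-centered data matrix with the n samples as rows; the symbols X, U, D, V and n are introduced here for illustration and are not taken from the post.

```latex
X = U D V^{\top}
\qquad\Longrightarrow\qquad
\frac{1}{n-1}\, X^{\top} X = V \,\frac{D^{2}}{n-1}\, V^{\top},
\qquad
X V = U D
```

Under these assumptions, and up to the sign of each component, the columns of V correspond to the `rotation` returned by prcomp, the values d_i / sqrt(n - 1) to `sdev`, and the columns of UD to the sample scores in `x`.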
Data visualization has already become an essential skill for scientists, and for anyone else who needs to understand data. As scientists, we have access to increasingly more data, from genomics to digital health and AI, and it is increasingly clear that the problem will only grow. Knowing how to plot data is an essential part of … Continue reading Data visualization ideas and libraries for bioinformatics
Hypergeometric tests are useful for enrichment analysis. For instance, we can use the hypergeometric test to model the association between genes and a GO class. The classical example for the hypergeometric distribution is the urn problem when sampling without replacement. You start with N balls in the urn, of which K1 are white. The distribution … Continue reading How to use phyper in R
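The urn problem can be sketched in a few lines of code. This is a minimal illustration in Python using only the standard library, not the code from the post. The function name `hypergeom_upper_tail` and the symbols k, K and n are assumptions made for this example. It computes the probability of drawing at least k white balls, which in R should correspond to `phyper(k - 1, K, N - K, n, lower.tail = FALSE)`.

```python
from math import comb


def hypergeom_upper_tail(k, K, N, n):
    """P(X >= k) when n balls are drawn without replacement
    from an urn of N balls, K of which are white.

    All names here are hypothetical and chosen for illustration.
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    total = comb(N, n)       # number of ways to draw n balls from N
    upper = min(K, n)        # cannot draw more white balls than exist or are drawn
    # Sum the probability mass for k, k+1, ..., upper white balls.
    # comb(N - K, n - i) is 0 whenever there are too few non-white balls.
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favourable / total


# Toy example: 10 balls, 4 white, draw 3, ask for at least 2 white.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120, i.e. one third.
p = hypergeom_upper_tail(k=2, K=4, N=10, n=3)
print(p)
```

In an enrichment setting, N would be the number of background genes, K the genes annotated to the GO class, n the size of the gene list and k the overlap. A small value of p then suggests the overlap is larger than expected by chance.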
The GTEx consortium has just published a collection of papers in a special issue of Nature that together provide an unprecedented view of the human transcriptome across dozens of tissues. The work is based on a large-scale RNA-Seq experiment of postmortem tissue from hundreds of human donors, illustrated in Figure 1 of the overview by Ward and Gilad 2017:
The data provide a powerful new opportunity for several analyses, highlighted (at least for me) by the discovery of 673 trans-eQTLs at 10% genome-wide FDR. Undoubtedly more discoveries will be published when the sequencing data, available via dbGAP, is analyzed in future studies. As a result, the GTEx project is likely to garner many citations, both for specific results, but also drive-by-citations that highlight the scope and innovation of the project. Hopefully, these citations will include the key GTEx paper:
Carithers, Latarsha J, Ardlie, Kristin, Barcus, Mary, Branton…
I had to re-blog this blog post by Lior Pachter.
It helped me better memorize the FDRs, FPRs, and all of that!
I found Lior’s explanation easy to follow, and he also published a ‘cheat’ table (which I have already printed and keep on my desk!).
Happy new year to all!
What are confusion matrices?
In the 1904 book Mathematical Contributions to the Theory of Evolution, Karl Pearson introduced the notion of contingency tables. Sometime around the 1950s the term “confusion matrix” started to be used for such tables, specifically for 2×2 tables arising in the evaluation of algorithms for binary classification.
Example: Suppose there are 11 items labeled A,B,C,D,E,F,G,H,I,J,K four of which are of the category blue (also to be called “positive”) and seven of which are of the category red (also to be called “negative”). An algorithm called BEST receives as input the objects without their category labels, i.e. just A,B,C,D,E,F,G,H,I,J,K and must rank them so that the top of the ranking is enriched, as much as possible, for blue items. Suppose BEST produces as output the ranking:
t-SNE stands for t-Distributed Stochastic Neighbor Embedding and is a popular technique for dimensionality reduction. The technique was introduced by van der Maaten and Hinton in 2008. t-SNE is particularly well suited for the visualization of high-dimensional genomic or proteomic datasets (e.g. gene expression, mass spectrometry, etc). The most widely used method in genomics/proteomics … Continue reading How to Use t-SNE Effectively
We have all heard about those "breakthrough biomarkers" that... simply... do not work! There have been numerous reports in the literature that investigate why certain biomarkers don't work after all. But before discussing the reasons for failed biomarkers, let me give some examples of disease biomarkers that have been successfully used for diagnosis, prognosis or … Continue reading Your Biomarker doesn’t work (and a case for reproducible research)!