Science is carried out by testing hypotheses through experimentation in the ‘wet lab’ and in the ‘dry lab’. It often starts out with a hypothesis that we want to test, and then, through a series of experiments and bioinformatic analyses, new hypotheses are generated; some are validated and others discarded. In the ‘dry lab’ this iterative process often relies on the use of published datasets. When I am testing a new concept or a new algorithm, or just want to compare my results with other related data, I go to public databases to search for what is out there. Besides the obvious huge data portals from big consortia (TCGA, ICGC, ENCODE, etc.), there are many other databases, such as GEO and ArrayExpress, that host datasets from individual labs and publications.
Hackday with DNAdigest
So, faced with the increasing need for better tools to search through large databases in an intelligent way, we organised a hackathon to come up with ‘recommender systems for scientific datasets’.
The Hackday was organised by DNAdigest, a non-profit organisation that aims to tackle the challenges of genomic data sharing. You can read more about them here.
Recommender systems for scientific datasets
The idea was simple, and perhaps inspired by the Amazon and Netflix recommender systems: help scientists find datasets from various sources (different experiments, studies, etc.).
Below, there’s a good example of this concept. When I search for GSE1379 on Google Scholar, it finds 41 papers with “GSE1379” in the text. Alongside this, you can see how other related datasets are often co-cited:
A huge matrix of a dataset-citation-network:
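To make the co-citation idea a bit more concrete, here is a minimal sketch in Python of how such a recommender might work. It is only an illustration, not the code from the hackday: the paper IDs and the `DATASET_X`/`DATASET_Y`/`DATASET_Z` accessions are made-up placeholders (only GSE1379 comes from the example above), and the function names `build_cocitation_counts` and `recommend` are hypothetical. It assumes we already know which datasets each paper mentions, counts how often two datasets appear in the same paper, and ranks the datasets most often co-cited with a query.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: paper ID -> set of dataset accessions mentioned in it.
# Only GSE1379 is taken from the post; every other name is a placeholder.
paper_to_datasets = {
    "paper_A": {"GSE1379", "DATASET_X", "DATASET_Y"},
    "paper_B": {"GSE1379", "DATASET_X"},
    "paper_C": {"GSE1379", "DATASET_Z"},
    "paper_D": {"DATASET_Y", "DATASET_Z"},
}


def build_cocitation_counts(paper_to_datasets):
    """Count, for every unordered pair of datasets, how many papers cite both."""
    counts = Counter()
    for datasets in paper_to_datasets.values():
        # Sorting gives each pair a single canonical (a, b) ordering.
        for a, b in combinations(sorted(datasets), 2):
            counts[(a, b)] += 1
    return counts


def recommend(dataset, counts, top_n=3):
    """Return up to top_n (dataset, co-citation count) pairs for a query dataset."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == dataset:
            scores[b] += n
        elif b == dataset:
            scores[a] += n
    # Highest count first; ties broken alphabetically so the order is stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


counts = build_cocitation_counts(paper_to_datasets)
recommendations = recommend("GSE1379", counts)
for name, n in recommendations:
    print(f"{name}: co-cited in {n} paper(s)")
```

Raw co-citation counts favour datasets that are cited everywhere, so a real system would probably want to normalise them (for example by how often each dataset is cited overall), but the counting step above is the core of the ‘people who used this dataset also used…’ idea.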
We are open to more ideas and contributors to this project! If you are interested, have some ideas and would like to contribute, join the discussion at http://dnadigest.hackpad.com, or just add a comment below!