A new tool developed at ETH Zurich, MetaGraph, allows scientists to search vast public DNA and RNA databases in seconds, like a “Google for DNA”.
DNA sequencing has transformed biomedical research, making it possible to identify rare inherited diseases in patients and to detect specific mutations within tumor cells. In recent years, new techniques (next-generation sequencing) have enabled remarkable scientific progress. In 2020 and 2021, for example, these methods allowed researchers to quickly decode the SARS-CoV-2 genome and monitor it worldwide.
At the same time, a growing number of scientists are sharing their sequencing results publicly. This openness has led to the accumulation of enormous data sets stored in major repositories such as the American SRA (Sequence Read Archive) and the European ENA (European Nucleotide Archive). These databases now contain approximately 100 petabytes of information (one petabyte equals one million gigabytes), which is comparable to the total amount of text available on the Internet.
Until recently, searching these immense archives to compare DNA sequences required enormous computing resources, making effective analysis almost impossible. Researchers from ETH Zurich have developed a solution to overcome this challenge.
Full-text search instead of downloading entire datasets
Scientists have developed a method that significantly speeds up and simplifies this search. The digital tool “MetaGraph” searches the raw data of all DNA or RNA sequences stored in the databases, much like a classic Internet search engine. After entering a sequence of interest as full text into a search field, researchers can find out within seconds or minutes, depending on the query, where it has already appeared.
“It’s a kind of Google for DNA,” summarizes Professor Gunnar Rätsch, data scientist at the Department of Computer Science at ETH Zurich. Until now, researchers could only search the databases by descriptive metadata. To access the raw data, they had to download the respective datasets. Such searches were incomplete, time-consuming and expensive.
According to the researchers’ study, “MetaGraph” is also comparatively cost-effective. The representation of all public biological sequences would fit on a few computer hard drives, and larger queries would cost no more than $0.74 per megabase.
Because the DNA search engine developed by the ETH Zurich researchers is both precise and efficient, it can help accelerate genetic research, for example on poorly studied pathogens or in new pandemics. The tool could thus become a catalyst for research into antibiotic resistance, for example by identifying resistance genes in the databases, or useful viruses capable of destroying bacteria, known as bacteriophages.
Compression by a factor of 300
In the study published on October 8 in the journal Nature, ETH researchers demonstrate how MetaGraph works: the tool indexes the data and presents it in compressed form. This is achieved through complex mathematical graphs that structure the data more effectively, similar to spreadsheets such as Excel. “Mathematically speaking, it is a huge matrix with millions of columns and billions of rows,” Rätsch explains.
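The matrix picture Rätsch describes can be illustrated with a toy example. The sketch below is not MetaGraph's code and performs no compression; it assumes, for illustration only, that sequences are split into short overlapping fragments of fixed length (often called k-mers), that each fragment acts as a row and each dataset as a column, and that a query matches a dataset when all of its fragments occur there. The function names, the fragment length and the sample data are all made up.

```python
from collections import defaultdict

K = 4  # hypothetical fragment length, kept small for readability


def kmers(sequence, k=K):
    """Return every overlapping substring of length k in the sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(datasets, k=K):
    """Map each fragment (a matrix row) to the dataset labels (columns) containing it."""
    index = defaultdict(set)
    for label, sequences in datasets.items():
        for sequence in sequences:
            for kmer in kmers(sequence, k):
                index[kmer].add(label)
    return index


def search(index, query, k=K, min_fraction=1.0):
    """Return labels of datasets holding at least min_fraction of the query's fragments."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return []  # query shorter than k: nothing to look up
    hits = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    threshold = min_fraction * len(query_kmers)
    return sorted(label for label, count in hits.items() if count >= threshold)


# Made-up toy data, not real sequencing reads.
datasets = {
    "sample_A": ["ACGTACGTGA", "TTGACCA"],
    "sample_B": ["GGGACGTACC"],
    "sample_C": ["TTTTTTTT"],
}
index = build_index(datasets)
print(search(index, "ACGTAC"))  # ['sample_A', 'sample_B']
```

The point of the sketch is that a query only touches the index, never the original reads, which is why nothing has to be downloaded. A real system additionally has to store such a matrix in a heavily compressed form, which this example does not attempt.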
Making large amounts of data searchable by means of indexes is common practice in computer science. What is new in the work of the ETH researchers is the complex combination of raw data and metadata, together with compression by a factor of around 300. The result resembles a book summary: it no longer contains all the words, but all the main storylines and connections remain intact, more compact but without significant loss of information.
“We are pushing the boundaries of what is possible in order to keep the data sets as compact as possible without losing the necessary information,” explains Dr. André Kahles, who, like Rätsch, is a member of the Biomedical Informatics Group at ETH Zurich. Unlike other DNA search engines currently being researched, the ETH researchers’ approach is scalable. This means that the greater the amount of data queried, the less additional computing power the tool requires.
Half of the data is already available now
ETH researchers first presented MetaGraph in 2020 and have been improving it ever since. The tool is already publicly available for queries. It provides a full-text search engine for millions of sets of DNA and RNA sequences, as well as proteins from viruses, bacteria, fungi, plants, animals and humans. Currently, just under half of the world’s available sequence datasets are indexed. According to Gunnar Rätsch, the rest should follow by the end of the year. Since MetaGraph is available as open source, it could also be of interest to pharmaceutical companies that have large amounts of internal research data.
Kahles even believes that the DNA search engine could one day be used by private individuals: “At first, even Google did not know exactly what a search engine was for. If the rapid development of DNA sequencing continues, it could become commonplace to more precisely identify the plants on your balcony.”
Reference: “Efficient and accurate search in petabase-scale sequence repositories” by Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Oleksandr Kulkov, Marc Zimmermann, Christopher Barber, Gunnar Rätsch and André Kahles, October 8, 2025, Nature.
DOI: 10.1038/s41586-025-09603-w