
Searching all public metagenomes with sourmash

1 Share

In preparation for an NIH/DOE workshop I'm attending today on "Emerging Solutions in Petabyte Scale Sequence Search", I thought I'd write down what we're currently doing with sourmash for public metagenome search. I'm writing this blog post in a hurry, and I may revise it later as I receive comments and feedback; I'll point to a diff if I do.

This is based largely on work that was done by Dr. Luiz Irber last year, as part of his PhD work with me.

sourmash itself is available (see sourmash.readthedocs.io/), and we just released v4.1.2 yesterday! It's under the BSD 3-clause license and is fully available via conda and pip.

In brief - lightweight metagenome search with MAGsearch

Today, we can use MAGsearch to robustly find matches to 10kb+ sequences (or collections of 10,000 or more k-mers) across all publicly available metagenomes, out to about 93% ANI.
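A back-of-the-envelope calculation shows why ~93% ANI is roughly the detection floor: at sequence identity p, a k-mer of length k survives unmutated with probability about p**k, so at the default k=31 only a small fraction of a query's k-mers are expected to match a 93%-ANI relative. This is a simplification of the full containment-to-ANI relationship, sketched here in plain Python:

```python
# At average nucleotide identity (ANI) p, assuming independent point
# mutations, a k-mer survives unmutated with probability ~p**k.
def expected_kmer_survival(ani, ksize=31):
    return ani ** ksize

# At 93% ANI and k=31, only ~10% of query k-mers are expected to
# match -- near the practical limit for reliable detection.
print(f"{expected_kmer_survival(0.93):.3f}")
```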

It's particularly useful for -

  • gathering candidates from public metagenomes, e.g. for outbreak detection;
  • finding matches to a particular species or genus in order to study its ecological distribution;
  • gathering data sets to expand our knowledge of a species' pangenome.

A search with ~100 query genomes currently takes about 17 hours, and covers 580,000 metagenomes representing 530 TB of original sequence data.

How it works underneath

We use sourmash to support metagenome containment search with scaled signatures.

sourmash scaled signatures are derived from MinHash techniques. They are compressed representations of k-mer collections, and can reliably be used to find exact matches of ~10kb segments of DNA between any two collections; larger matches can be found out to about 93% ANI.

One key aspect here is that search can be done without access to the original data.
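As a rough illustration of the scaled-MinHash idea — not sourmash's actual implementation (sourmash uses MurmurHash3 on canonical k-mers, and the function names here are invented for this sketch):

```python
import hashlib

MAX_HASH = 2**64  # hash values live in a 64-bit space

def hash_kmer(kmer):
    # sourmash hashes the canonical k-mer (the lexicographically smaller
    # of forward and reverse complement) with MurmurHash3; a truncated
    # SHA-1 stands in here purely for illustration.
    rc = kmer[::-1].translate(str.maketrans("ACGT", "TGCA"))
    canonical = min(kmer, rc)
    return int.from_bytes(hashlib.sha1(canonical.encode()).digest()[:8], "big")

def scaled_sketch(sequence, ksize=31, scaled=1000):
    # Keep only hashes below MAX_HASH / scaled: on average 1 in every
    # `scaled` k-mers is retained, so sketch size tracks input size.
    threshold = MAX_HASH // scaled
    return {
        h for i in range(len(sequence) - ksize + 1)
        if (h := hash_kmer(sequence[i:i + ksize])) < threshold
    }

def containment(query, subject):
    # Fraction of the query's hashes found in the subject -- an estimate
    # of the fraction of query k-mers present in the subject.
    if not query:
        return 0.0
    return len(query & subject) / len(query)
```

Because the retention threshold is a fixed function of the hash value, two sketches built independently from different data sets are directly comparable — which is what makes searching precomputed signatures, without the original reads, possible.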

We maintain a collection of signatures for ~580,000 public metagenomes with the SRA for k=21, 31, and 51. A search with about 100 genome-sized queries currently takes about 17 hours using 32 threads with 48 GB of RAM (on our HPC).

Our complete collection of signatures is approximately 10 TB total, although this includes far more than the metagenome data: it holds 3.7m signatures, representing 1.3 PB of total data (SRA metagenomes + SRA non-plant/animal data + Genbank/RefSeq microbial genomes).

This collection of signatures is automatically updated by wort, which coordinates a distributed collection of workers to compute signatures as new data arrives at NCBI.

Simple opportunities for improvement

MAGsearch is a robust prototype, with many straightforward opportunities for improvement. I would guess that with a few weeks of focused investment, we could get down to roughly one hour per search.

First, the MAGsearch code doesn't do anything special in terms of loading: it uses the default sourmash signature format, which is JSON. Binary encodings, for example, would decrease the collection size considerably while also speeding up search (by decreasing load time).
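To see why a binary encoding helps, compare the size of the same hash values serialized as JSON text versus packed 8-byte integers (a standalone illustration, not the actual sourmash format):

```python
import json
import struct

# 10,000 synthetic 64-bit hash values standing in for a signature.
hashes = list(range(0, 10_000_000_000_000, 1_000_000_000))

# JSON spells each hash out in ASCII digits (up to ~20 bytes plus a
# comma per value); a packed little-endian uint64 is exactly 8 bytes.
as_json = json.dumps(hashes).encode()
as_binary = struct.pack(f"<{len(hashes)}Q", *hashes)

print(len(as_json), len(as_binary))
```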

Second, the signatures are searched linearly, in parallel, using the same Rust code that underlies sourmash (though several versions behind the latest). Making use of recent improvements to the sourmash Rust code would probably speed this up several fold.

Third, we can now add protein signatures to our collection of DNA signatures, which would enable much more sensitive search. (We'd have to sketch a lot of data, though. :)

Broader limitations

The internal data structures we use in sourmash are optimized for relatively small collections of k-mers, because sourmash is built around downsampling k-mer collections. We're slowly improving our internal structures, but supporting all k-mers is not straightforward and is not something on our current roadmap.

Our sketching techniques only support individual k-mer sizes/molecule types. So while we can compute, store and search multiple k-mer sizes for DNA, protein, Dayhoff encodings, etc., they are stored separately and don't "compress" together. This means that signature collections grow quickly in size as we provide more k-mer sizes and molecule types!

We're not quite sure how to provide our current databases to people. Personally I'm not really ready to support MAGsearch as a service, either, but that's partly because of a lack of funding.

What else does sourmash offer?

sourmash itself is stable and well tested, and can be used with confidence to do many bioinformatics tasks. It is easy to install (pip/conda), and is reasonably well documented.

Our data structures and algorithms are simple and well-understood and straightforward to (re)implement. While they aren't yet all published, we are happy to explain them and tell you where they will and won't work.

sourmash is fast, and low memory, and requires little disk space for even pretty large collections of signatures.

sourmash has an increasingly useful command-line interface that supports many common k-mer and search operations. In this sense, it can be used as a partial guide for a good "default" set of operations that k-mer-based tools could support. We have paid a fair amount of attention to user experience, too.

Underneath, sourmash has a flexible Python API whose internals are slowly being migrated to Rust. This means we can quickly prototype new functionality in Python while refactoring critical code paths in Rust, so sourmash performance is continually improving even as we tackle new use cases.

We have an open, robust approach to software development, with an increasingly diverse array of contributors. I'm not sure we're ready to take on a lot of new contributors quite yet, because our roadmapping processes are not very mature, but we're working on that.

We use semantic versioning for the sourmash package itself, and we communicate clearly about breaking changes. As a result, sourmash can be cleanly integrated into workflows with simple version pinning requirements.

We support public and private collections of signatures, and all of our primary search and analysis approaches work with multiple databases or signature collections without needing to re-index them or combine them from scratch.

We also support flexible "free-form" taxonomy, and in particular support both NCBI and GTDB taxonomies.

Where would I like to see petabase-scale search go?

I wouldn't advocate for sourmash itself (either the software or the underlying techniques) as the one true method for searching all (meta)genomic data. For one thing, sourmash has a lot of other use cases that matter to us!

But I think we have a few experiences to offer to any such effort -

  • we have functioning implementations that support a number of really useful use cases for metagenome search and analysis. It would be nice not to lose those use cases!
  • high-sensitivity prefiltering approaches are good and enable flexible triage afterwards. We mostly use sourmash as a lightweight way to find all the things that we might care about, before doing more in-depth analysis.
  • having both command-line and Python APIs has been incredibly useful, and I think it would be a mistake to bypass good APIs in favor of a Web API. Of course, this also increases developer effort considerably, but the return is much greater flexibility.
  • riffing more on that, I think it would be a mistake to write a custom Web-hosted indexing and search tool that only works with NCBI formats and taxonomies.
  • riffing even more on that, it's been great to be able to quickly add databases/collections to search, and supporting both completely private databases as well as rapid updating of public database collections is something that has been really useful in comparison to many other metagenome analysis tools.
  • simplicity of data structures and algorithms has helped us a lot with sourmash. Software support is fundamentally a game of maintenance and it has been great to be able to reimplement our core data structures and algorithms in multiple languages. In particular, I worry a lot about premature optimization when I look at other packages.
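The prefilter-then-verify pattern described in the second bullet above can be sketched as follows; `prefilter`, `triage`, and the threshold value are hypothetical names for this illustration, not MAGsearch's actual API:

```python
# Hypothetical two-stage triage: a cheap sketch-containment prefilter
# selects candidate metagenomes, and only those candidates proceed to
# expensive in-depth analysis (e.g. read mapping or assembly).

def prefilter(query_sketch, metagenome_sketches, threshold=0.1):
    # Yield accessions whose sketch contains enough of the query's hashes.
    for accession, sketch in metagenome_sketches.items():
        overlap = len(query_sketch & sketch)
        if overlap / len(query_sketch) >= threshold:
            yield accession

def triage(query_sketch, metagenome_sketches, deep_analysis, threshold=0.1):
    # Run the expensive step only on prefiltered candidates.
    return {
        acc: deep_analysis(acc)
        for acc in prefilter(query_sketch, metagenome_sketches, threshold)
    }
```

The design point is that the prefilter can afford to be permissive: false positives cost one extra deep analysis, while the sketch-based containment estimate keeps false negatives rare for matches above the detection threshold.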

Luiz has also done a lot of thinking about distributed computing and decentralization via Dat and IPFS that I think could be valuable, but I'm not expert enough to summarize it myself. Hopefully Luiz will write something up :). (You can already check out his PhD thesis, chapters 4 and 5, for some juicy details and discussion, though!)

What other tools should we be looking at for large scale search?

I think Serratus did an excellent job of showing some of the possibilities of massive-scale metagenome search!

There's lots of tools out there in various stages of development, but I am particularly interested in metagraph.

I'd love to hear about more tools and approaches - please drop them in the comments or on twitter!


Read the whole story
3 days ago
Davis, CA
Share this story

Moving sourmash towards more community engagement - a funding application


We applied for funding from CZI for sourmash a few weeks back, via the Essential Open Source Software for Science program. Here's the core of the application (lightly edited).

(We'll hear about funding by end of September, I believe.)

Feedback welcome, unless you're alerting me to the presence of typos :)

Proposal details

We seek funding for maintenance and user support for the sourmash software, while embarking on an ambitious plan to improve sustainability through improved governance, enhanced inclusivity, and robust community engagement.

Short description of software project:

Sourmash is mature software that enables lightweight content search, comparison and classification of microbial genomes and metagenomes. Sourmash works in low memory with compact databases, supports both NCBI and GTDB taxonomies, and can operate on private collections of genomes and metagenomes. The release of v4.1 brings massive-scale search of all Genbank microbial genomes and all public metagenomes to commodity hardware. These features are underpinned by novel data structures and algorithms, including an extension of MinHash that supports containment and the use of min-set-cov to do highly accurate metagenome analysis. Sourmash serves as a robust, reliable, and performant backbone for microbial sequence analysis.

We use development practices based on 30 years of scientific software engineering expertise: we develop in the open, do code review, have tests with 90%+ line coverage, and have a robust release process with semantic versioning. We provide thorough documentation, engage with users via our issue tracker, and use social media to broadcast new features and use cases. The utility of sourmash has been recognized by both users and funding agencies: we are increasingly well cited, the NSF is supporting the development of flexible taxonomies and distant evolutionary classification via protein k-mers, and the NIH is supporting iHMP reanalysis.

Proposal Summary

Sourmash is mature software that serves as a stable component of sequence analysis workflows, a fast and lightweight tool for massive-scale search of public and private sequence databases, and a platform for novel data structure and algorithm exploration. Sourmash is explicitly designed to meet the computational needs created by the massive expansion of sequencing capacity in microbiome biology.

We have arrived at an important crossroads with sourmash. We are just now releasing mature support for petabase-scale content search (v4.1.x and v4.2), and are currently writing up our novel data structures and algorithms for publication. We have ongoing projects using sourmash to analyze Human Microbiome Project datasets, including discovering strain-specific markers of Inflammatory Bowel Disease. Simultaneously, grant support for the core development of sourmash is ending, and Dr. Luiz Irber, the core developer behind most of the scaling work, is moving to another job where sourmash will become his part-time project. While sourmash research development will continue, we have no way to robustly support our current user base and grow the developer community with traditional funding, and do not have the governance infrastructure to productively engage with other support mechanisms.

We request support from CZI to support our newly released features with continued sourmash core development, while working toward sustainability by growing the project out of the lab and into the community. We propose to use the period to expand the sourmash community, define and grow a governance framework, connect to the Python and Rust bioinformatics ecosystem, and train both biologists and bioinformaticians to better engage with open source bioinformatics software. In particular, we see an opportunity to use sourmash to provide one example of how to grow a small project based in a single lab into a more sustainable community-based project. Importantly, this kind of maintenance and community growth does not fall within the scope of traditional funding opportunities.

At the end of this two year period, we will have continued to release and support high performance, high impact software. We will also have expanded our developer and user community, chosen a governance framework, identified a fiscal sponsorship plan, and published our strategies for project growth and sustainability.

Work Plan

Software development activities:

We propose to follow a “python-dev” model in which maintenance and feature releases proceed on their own timeline, while the roadmap process coordinates the planning and development of related feature sets (e.g. taxonomy extensions and database formats are connected). This separates maintenance updates from the “slow science” process of developing, testing, and evaluating new functionality against scientific use cases, while also ensuring that fully baked new functionality does regularly get released. Software development will proceed under our current “async” model, in which all decisions are discussed and documented openly in GitHub.

Fully 50% of the funded effort on this proposal goes to the “maintenance mode” activities, which are intended to further regularize the development process and support iterative, gradual performance improvement while preventing feature and performance regressions. This will include:

  • regular releases;
  • continued maintenance of and improvements to the software development and release process;
  • database updates and releases as new genomes and metagenomes are made public;
  • regular JOSS publications on major new versions (v4, v5, etc.);
  • structural improvements to sourmash core, including a plugin architecture for storage formats, new command-line subcommands, and visualizations;
  • sketch serialization documentation, and format upgrades to store more metadata and support higher-performance binary formats.

Community engagement activities:

The community engagement activities below seek to build, grow, and support an active and robust user and developer community that includes biologists, bioinformaticians, computer scientists, and software engineers.

As sourmash matured, we focused our efforts toward building sustainable software and developing advanced use cases within the lab first, with documentation for new users added via github issues, blog posts, and feature papers. However, this has resulted in somewhat uneven support resources: e.g. we lack intermediate-level tutorials helping users transition from our introductory tutorials to advanced use cases or python API usage. We will upgrade our documentation systematically, create a “recipes” site, and construct an FAQ section that is well integrated with the documentation by reorganizing and amending existing content.

We plan to provide a warm, welcoming community forum that encourages new user questions and contributions. This will require engaged moderators, a strong Code of Conduct process, and a large user base, which we have not had the bandwidth to support previously. A key outcome of this funding will be the clear definition of a single support forum for sourmash, as one of the first outputs of our governance process.

Contributors may come from both the user community and the broader bioinformatics/CS community. We routinely source use cases, ideas for new functionality, and requests for performance improvement from the current biology-focused user community, and will encourage deeper and broader contributions through our governance and contributor framework, discussed below.

Similarly, there are many implementation aspects of sourmash that are interesting to, and may provide fodder for, CS and software engineers who are interested in contributing to bioinformatics software. While this is supported within the lab, these challenges are not immediately obvious or accessible to others without some biological background and appropriate documentation. We will build tutorials and documentation that highlight the algorithmic and implementation aspects of sourmash (sketching approximations, scaling issues, indexing formats, performance benchmarking, and quality-of-result benchmarking) and provide guidance for CS researchers who wish to evaluate new algorithms. Our governance and contributor framework will welcome extensions and evaluations and require neither permission nor involvement from sourmash core.

We see great value in further broadening our contributor base, and will continue to improve our current support for first-time OSS contributors by expanding our new contributor issue labels beyond “good first issue”, “good next issue”, and “repeatable quest”. While we do not expect many of these contributors to become long-term sourmash contributors, some may; more importantly, a steady influx of new first-time contributors will ensure that our development documentation remains accurate and useful. In support of this effort, we have budgeted for two 10 hrs/wk undergraduates to continue to contribute. We will also offer first-time contributor collaboratives, run documentation and visualization improvement hackyfests, and contribute to hackathons at BOSC and PyCon.

Governance activities:

We will build a Steering Council that guides governance, defines contributor guidelines, authorship considerations, and oversees the roadmap process. As part of this, we will nucleate “sourmash.bio” and move development activities out of the dib-lab organization. The Steering Council will also define the scope of the project and outline contribution mechanisms, most likely via a fiscal sponsor (perhaps the Software Freedom Conservancy).

Milestones and Deliverables:

We will deliver regular releases of sourmash under semantic versioning, per http://ivory.idyll.org/blog/2021-sourmash-v4-released.html. We anticipate approximately quarterly releases of major.minor versions, with more frequent patch releases.

We will update our roadmaps quarterly for v4.2.x, v5, and beyond. All planned features for these versions are discussed in the issue tracker. Each minor release will include a link to updated roadmaps for the coming features. The issue tracker will continue to be updated and refined in conjunction with releases and roadmaps.

These releases will also see regular refinement and updates of both the Python layer and the Rust layer; a major goal of our project is to expand our Rust contributor pool via CS undergrads and also (potentially) engagement with rust-bio.

We will simultaneously engage in iterative refactoring of our documentation to include not just getting-started docs and tutorials, but also detailed guidelines on how to get started contributing, video guides to sourmash, a “recipe” site that outlines solutions to common use cases, developer-oriented documentation for new plugins and visualizations, and a CS-focused introduction to the problems that sourmash is tackling. Recipes will be in place by mid-2022 and major updates will be delivered on a semi-annual basis.

Each summer (2022 and 2023) we will participate in undergraduate research projects (e.g. the National Summer Undergraduate Research Program) and introduce biology and CS undergraduates to problems in microbial genomics and metagenomics, including but not limited to sourmash. We will also participate in summer training courses (STAMPS at MBL, and DIBSI at UC Davis) as was our usual pre-pandemic practice (2010-2019).

We will offer at least two webinars and four hackfests annually, with our focus varied between attracting new users, attracting new developers, refining our documentation, exploring new functionality and improving our UX, and highlighting new analysis opportunities.

In December of 2021, 2022, and 2023 we will provide a detailed update of our governance progress and future plans. By December 2021, we will have issued invitations to a Steering Council, and begun the process of holding quarterly meetings. By December 2022, we will have engaged with potential fiscal sponsors and identified a path forward.

By mid-2022, we will have designated and seeded a support forum for sourmash.

While this will not be supported by this proposal specifically, we will also have submitted two papers on sourmash by December 2021.

In terms of metrics:

  • We will have engaged with over 1000 new users via hackyfests, webinars, etc. as a direct result of CZI funding.
  • Our stretch goal is over 500 combined citations for the sourmash core papers by December 2023.
  • We hope to be the “stable, boring” option for petabase-scale content search, and expect to have seen substantial growth in user support and functionality requests for these use cases.
  • We also expect to see a dozen or more 3rd-party extension modules adding new format import/export and visualizations to sourmash.
  • We will have submitted at least two major releases (v4 and v5) to JOSS, one in 2021 and one by end of 2023.

Value to Biomedical Users:

As the biomedical field increasingly moves towards large-scale sequencing, both of single genomes (e.g. individuals) and metagenomes (e.g. gut microbiome), lightweight analysis tools are becoming an essential part of core biomedical treatment and research. Sourmash provides a lightweight and robust interface for these analyses. In particular, we note that four of our well-developed applications have considerable biomedical relevance for sequencing data analysis generally, and microbiome work specifically:

(1) finding the minimal list of relevant genomes for a microbiome, from all available (800k+) microbial and viral genomes;

(2) searching all microbiome data sets for a specific genome;

(3) detecting and removing contamination in metagenome, genome and transcriptome data sets;

(4) extraction of annotation independent features to support machine learning.

These applications are already in active use for large-scale biomedical data: the NIH has provided short-term funding to Dr. Brown in support of applying sourmash systematically to the Human Microbiome Project data sets, and we have an ongoing project using sourmash to discover strain-specific markers of Inflammatory Bowel Disease using a random forest approach.

Beyond the technical aspects of sourmash, we will work towards being a good example of a scientific open source project in biology/bioinformatics, by intentionally moving towards community governance, rewarding a wide variety of contributions, providing use-case focused tutorials, and guiding sourmash users towards how to support and evolve sourmash.

Diversity, Equity, and Inclusion Statement:

We believe that social barriers to contribution are a major cause of the low diversity in scientific OSS, and we are committed to systematically lowering these barriers while also lifting contributors over these barriers.

We also believe that lightweight and robust methods that support large-scale data discovery and reuse can expand bioinformatics into the “lightly resourced” space, e.g. Primarily Undergraduate Institutions; this is an equity issue because so many current methods require substantial resources simply to get started.

Training modules at the DIBSI and STAMPS workshops will introduce sourmash to a diverse range of research-focused participants. NSURP is focused on undergraduates from underrepresented backgrounds, and in 2020 we hosted two Latinx undergraduates. UC Davis is also an HSI and our undergraduate researchers will be recruited with attention to diversity.

We need a stronger CoC response framework, both for forum moderation and for project contributors; currently, the CoC process is based on the BDFL model, which is inadequate. This is important for DEI and antiracism, and improving our CoC process is one of our main goals in finding a fiscal sponsor who can provide a larger framework within which we can operate.

Last but not least, we believe that providing authorship for all contributors, including those who contribute use cases, recipes, and documentation, provides a way to formally recognize contributions that are traditionally undervalued in both open source projects and academia. Recognizing this kind of “invisible labor” is fundamentally an equity issue.


Bloody Mary in Sky with Diamonds



Night Shift


The post Night Shift appeared first on The Perry Bible Fellowship.


John Cleese has an NFT bridge to sell you for $69.3 million

John Cleese’s illustration of the Brooklyn Bridge, being sold as an NFT. | Image: John Cleese

“The world has gone terminally insane,” Monty Python’s John Cleese told Vanity Fair on the topic of getting into NFTs. So naturally, the comic actor known for silly walks is now selling a ludicrously expensive NFT of his own: a Vonnegut-esque drawing of the Brooklyn Bridge made on an iPad Pro. It’s listed on OpenSea and the auction ends in 10 days.

Bidding started at $100 and has now made its way up to nearly $36,000. Cleese won't let his masterpiece go that cheap, though: he said he wants $69.3 million for the piece, suggesting the illustration won't sell unless someone bids at least that much.

