Improved workflows-as-applications: tips and tricks for building applications on top of snakemake

(Thanks to Camille Scott, Phillip Brooks, Charles Reid, Luiz Irber, Tessa Pierce, and Taylor Reiter for all their efforts over the years! Thanks also to Silas Kieser for his work on ATLAS, which gave us inspiration and some working code :).)

A while back, I wrote about workflows as applications, in which I talked about how Camille Scott had written dammit in the pydoit workflow system, and released it as an application. In doing so, Camille made a fundamental observation: many bioinformatics tools are wrappers that run other bioinformatics tools - and that is literally what workflow tools are designed to do!

Since that post, we've doubled down on workflow systems, improved and adapted our good-enough in-lab practices for software and workflow development, and written a paper on workflow systems - (also on bioRxiv).

Projects that we write this way end up consisting of large collections of interrelated Python scripts (more on how we manage that later - see e.g. the spacegraphcats.search package for an example). This strategy also allows integration of multiple different languages under a single umbrella, including (potentially) R scripts and bash scripts and... whatever else you want :).

As part of this effort, we've developed much improved practices around better (more functional) user experiences with our software. In this blog post, I'm going to talk about some of these - read on for details!

This post extracts experience from several of our in-lab projects - spacegraphcats, charcoal, and elvers among them.

Some background: how do we build applications on top of snakemake?

We've done this quite a few times now, and there are three parts to the pattern:

first, we build a Snakefile that does the things we want to do, and stuff it into a Python package.

second, we create a Python entry point (see __main__ in spacegraphcats) that calls snakemake - in this case it does it by calling the Python API (but see below for better options).

third, in that entry point we load config files, salt in our own overrides, and otherwise customize the snakemake session.

and voila, now when you call that entry point, you run a custom-configured snakemake that runs whatever workflows are needed to create the specified targets! See for example the docs on running spacegraphcats.
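Concretely, the entry-point pattern can be sketched like so - the function names, the override key, and the Snakefile location are hypothetical stand-ins for what the real spacegraphcats and charcoal entry points do:

```python
import os
import subprocess

def build_command(snakefile, configfile, overrides, targets):
    # Start from the packaged Snakefile and the user's config file...
    cmd = ["snakemake", "-s", snakefile, "--configfile", configfile]
    # ...then "salt in" our own overrides as --config key=value pairs.
    if overrides:
        cmd += ["--config"] + [f"{k}={v}" for k, v in overrides.items()]
    return cmd + list(targets)

def main(configfile, targets):
    # Locate the Snakefile shipped inside the package, next to this module.
    snakefile = os.path.join(os.path.dirname(__file__), "Snakefile")
    cmd = build_command(snakefile, configfile, {"force": 1}, targets)
    return subprocess.run(cmd).returncode
```

The entry point is then registered in `setup.py`/`pyproject.toml` as a console script, so users get a single command that drives the whole workflow.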

Problems that we've run into, and their solutions.

The strategy above works great in general, but there are a few annoying problems that have popped up over time.

  • we want more flexible config than is provided by a single config file.
  • we want to distribute jobs from our application across clusters.
  • we don't want to have to manually implement all of snakemake's (many) command line options and functionality.
  • we want to support better testing!
  • we want to run our applications from within Snakemake workflows.

So, over time, we've come up with the following solutions. Read on!

Stacking config files

One thing we've been doing for a while is providing configuration options via a YAML file (see e.g. spacegraphcats config files). But once you've got more than a few config files, you end up with a whole host of options in common and only a few config parameters that you change for each run.

With our newer project, charcoal, I decided to try out stacking config files, so that there's an installation-wide set of defaults and config parameters, as well as a project-specific config.

This makes it possible to have sensible defaults that can be overridden easily on a per-project basis.

The way this works with snakemake is that you supply one or more JSON or YAML config files to snakemake. Snakemake then loads them all in order and supplies the parameters in the Snakefile namespace via the config variable.

The Python code to do this via the command-line wrapper is pretty straightforward - you build a list of all the config files and supply it to subprocess!
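In sketch form, assuming snakemake's behavior of loading config files in order (later files override keys from earlier ones):

```python
def config_command(snakefile, configfiles, targets=()):
    """Build a snakemake invocation with stacked config files.

    List the installation-wide defaults first and the
    project-specific config last, so project settings win.
    """
    cmd = ["snakemake", "-s", snakefile]
    if configfiles:
        # --configfile takes several files after a single flag.
        cmd += ["--configfile"] + list(configfiles)
    return cmd + list(targets)
```

Calling `config_command("Snakefile", ["defaults.conf", "project.conf"])` then yields a command where `project.conf` overrides any key also set in `defaults.conf`.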

Supporting snakemake job management on clusters

Snakemake conveniently supports cluster execution, where you can distribute jobs across HPC clusters.

With both spacegraphcats and elvers, we couldn't get this to work at first. This is because we were calling snakemake via its Python API, while the cluster execution engine wanted to call snakemake at the command line and couldn't figure out how to do that properly in our application setup.

The ATLAS folks had figured this out, though: ATLAS uses subprocess to run the snakemake executable, and when I was writing charcoal, I tried doing that instead. It works great, and is surprisingly much easier than using the Python API!
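The subprocess version is barely any code; here is a sketch (the `exe` parameter is there purely so the pattern is easy to illustrate and test with another executable):

```python
import subprocess

def run_snakemake(args, exe="snakemake"):
    # Run the snakemake CLI in a child process and hand back its
    # exit status. Because the real executable is invoked, cluster
    # execution modes that re-invoke snakemake on compute nodes
    # work without any special handling on our side.
    cmd = [exe] + list(args)
    return subprocess.run(cmd).returncode
```

The caller just compares the exit status to zero, exactly as the shell would.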

So, now our applications can take full advantage of snakemake's underlying cluster distribution functionality!

Supporting snakemake's (many) parameters

With spacegraphcats, the first application we built on snakemake, we implemented a kind of janky parameter passing thing where we just mapped our own parameters over to snakemake parameters explicitly.

However, snakemake has tons of command line arguments that do useful things, and it's really annoying to reimplement them all. So in charcoal, we switched from argparse to click for argument parsing, and simply pass all "extra" arguments on to snakemake.

This occasionally leads to weird logic, like the code needed to support --no-use-conda: we pass --use-conda to snakemake by default, and then have to override that to turn it off. But by and large it's worked out quite smoothly.
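charcoal does the pass-through with click's `ignore_unknown_options`; stripped of the click machinery, the conda-override logic boils down to something like this sketch:

```python
def snakemake_args(extra_args):
    """Forward unrecognized arguments to snakemake, except that our
    own application-level --no-use-conda flag cancels the
    --use-conda we pass by default, rather than being forwarded."""
    use_conda = "--no-use-conda" not in extra_args
    passthru = [a for a in extra_args if a != "--no-use-conda"]
    return (["--use-conda"] if use_conda else []) + passthru
```

Everything else - `--dry-run`, `--unlock`, cluster flags, and the rest of snakemake's option zoo - passes straight through untouched.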

A drop-in module for a command-line API

As we build more applications this way, we're starting to recognize commonalities in the use cases. Recently I wanted to upgrade the spacegraphcats CLI to take advantage of lessons learned, and so I copied the charcoal __main__.py over to spacegraphcats.click and started editing it. Somewhat to my surprise, it was really easy to adapt to spacegraphcats - like, 15 minutes easy!

So, we're pretty close to having a "standard" entry point module that we can copy between projects and quickly customize.

Testing, testing, testing!

We get a lot of value from writing automated functional and integration tests for our command-line apps; they help pin down functionality and make sure it's still working over time.

However, with spacegraphcats, I really struggled to write good tests. It's hard to test the whole workflow when you have piles of interacting Python scripts in a workflow - e.g. the workflow tests are terrible: clunky to write and hard to modify.

In contrast, once I had the new command-line API working, I had the tools to make really nice and simple workflow tests that relied on snakemake underneath - see test_snakemake.py. Now our tests look like this:

def test_dory_build_cdbg():
    global _tempdir

    dory_conf = utils.relative_file('spacegraphcats/conf/dory-test.yaml')
    target = 'dory/bcalm.dory.k21.unitigs.fa'
    status = run_snakemake(dory_conf, verbose=True, outdir=_tempdir,
                           extra_args=[target])
    assert status == 0
    assert os.path.exists(os.path.join(_tempdir, target))

which is about as simple as you can get - specify config file and a target, run snakemake, check that the file exists.

The one tricky bit in test_snakemake.py is that the tests should be run in a particular order, because they build on each other. (You can actually run them in any order you want, because snakemake will create the files as needed, but it makes the test steps take longer.)

I ended up using pytest-dependency to recapitulate which steps in the workflow depended on each other, and now I have a fairly nice granular breakdown of tests, and they seem to work well.

(I'm still stuck on how to ensure that the outputs of the tests have the correct content, but that's a problem for another day :).)

Using workflows inside of workflows

Last but not least, we tend to want to run our applications within workflows. This is true even when our applications are workflows :).

However, we ran into a little bit of a problem with paths. Because snakemake relies heavily on file system paths, the applications we built on top of snakemake had fairly hardcoded outputs. For example, spacegraphcats produces lots of directories like genome_name, genome_name_k31_r1, genome_name_k31_r1_search, etc. that have to be in the working directory. This turns into an ugly mess for any reasonably complicated workflow.

So, we took advantage of snakemake's workdir: parameter to provide a command-line feature in our applications that would stuff all of the outputs in a particular directory.

This, however, meant some input locations needed to be adjusted to absolute rather than relative paths. Snakemake handled this automatically for filenames specified in the Snakefile, but for paths loaded from config files, we had to do it manually. This turned out to be quite easy and works robustly!
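The manual fix-up for config paths can be as small as this sketch - which config keys actually hold paths is application-specific, and the key name below is made up:

```python
import os

def absolutize(config, path_keys):
    # Convert selected config entries to absolute paths *before*
    # snakemake changes into workdir, so inputs specified relative
    # to the user's current directory still resolve.
    fixed = dict(config)
    for key in path_keys:
        value = fixed.get(key)
        if value and not os.path.isabs(value):
            fixed[key] = os.path.abspath(value)
    return fixed
```

After this, the config can be handed to snakemake along with `workdir:` (or `--directory`) and the outputs land where the user asked.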

You can see an example of this usage here. The --outdir parameter tells spacegraphcats to just put everything under a particular location.

Concluding thoughts

I've been pleasantly surprised at how easy it has been to build applications on top of snakemake. We've accumulated some good experience with this, and have some fairly robust and re-usable code that solves many of our problems. I hope you find it useful!


MinHashing all the things: a quick analysis of MAG search results

Last time I described a way to search MAGs in metagenomes, and teased about interesting results. Let's dig into some of them!

I prepared a repo with the data and a notebook with the analysis I did in this post. You can also follow along in Binder, as well as do your own analysis!

Preparing some metadata

The supplemental materials for Tully et al include more details about each MAG, so let's download them. I prepared a small snakemake workflow to do that, and to download information about the SRA datasets both from Tara Oceans (the dataset used to generate the MAGs) and from Parks et al, which also generated MAGs from Tara Oceans. Feel free to include them in your analysis, but I was curious to find matches in other metagenomes.

Loading the data

The results from the MAG search are in a CSV file, with a column for the MAG name, another for the SRA dataset ID for the metagenome, and a third column for the containment of the MAG in the metagenome. I also fixed the names to make it easier to query, and finally removed the Tara and Parks metagenomes (because we already knew they contained these MAGs).

This left us with 23,644 SRA metagenomes with matches, covering 2,291 of the 2,631 MAGs. These are results for a fairly low containment (10%), so if we limit to MAGs with more than 50% containment we still have 1,407 MAGs and 2,938 metagenomes left.
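The filtering step is plain CSV wrangling; here is a sketch with made-up rows (the real column names and accessions may differ):

```python
import csv
import io

def matches_above(csv_text, threshold):
    """Keep MAG/metagenome matches at or above a containment threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row["containment"]) >= threshold]

# Synthetic example data, not real search results.
example = """mag,metagenome,containment
TOBG_NP-110,SRR0000001,0.99
TOBG_NP-110,SRR0000002,0.12
"""
strong = matches_above(example, 0.5)
```

Counting distinct MAGs and metagenomes in the filtered rows then gives the summary numbers quoted above.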

TOBG_NP-110, I choose you!

That's still a lot, so I decided to pick a candidate to check before doing any large scale analysis. I chose TOBG_NP-110 because there were many matches above 50% containment, and even some at 99%. Turns out it is also an Archaeal MAG that failed to be classified further than Phylum level (Euryarchaeota), with a 70.3% complete score in the original analysis. Oh, let me dissect the name a bit: TOBG is "Tara Ocean Binned Genome" and "NP" is North Pacific.

And so I went to check where the other metagenome matches came from. 5 of the 12 matches above 50% containment come from one study, SRP044185, with samples collected from a column of water at a station in Manzanillo, Mexico. Another 3 matches come from SRP003331, in the South Pacific Ocean (off northern Chile). Another match, ERR3256923, also comes from the South Pacific.

What else can I do?

I'm curious to follow the refining MAGs tutorial from the Meren Lab and see where this goes, and especially in using spacegraphcats to extract neighborhoods from the MAG and better evaluate what is missing or if there are other interesting bits that the MAG generation methods ended up discarding.

So, for now, that's it. But more importantly, I didn't want to sit on these results until there is a publication in press, especially when there are people who can do so much more with them, so I decided to make it all public. It is way more exciting to see this being used to learn more about these organisms than to be the only one with access to this info.

And yesterday I saw this tweet by @DrJonathanRosa, saying:

I don’t know who told students that the goal of research is to find some previously undiscovered research topic, claim individual ownership over it, & fiercely protect it from theft, but that almost sounds like, well, colonialism, capitalism, & policing


I want to run this with my data!

Next time. But we will have a discussion about scientific infrastructure and sustainability first =]


What you should know before optimizing code

Maybe you’ve heard the rules of program optimization.

  1. Don’t optimize
  2. Don’t optimize yet (experts only)

If everyone followed the rules, then nobody would optimize and there wouldn’t be any experts! Looks like some of you are guilty. Obviously, the idea is that we should avoid unnecessary optimization, not that we shouldn’t optimize at all. But when do we start thinking about performance? The original rule, which dates back to the book Principles of Program Design, gives a clue.

“Don’t do it yet — that is, not until you have a perfectly clear and unoptimized solution” Michael Jackson, 1975

The second half of the quote was dropped by future books, but maybe it shouldn’t have been - it’s good advice that’s still relevant 45 years later. Jackson’s foresight is amazing when you realize that computers in 1975 still used punch cards. Since punch cards don’t have a backspace, the term “perfectly clear solution” takes on a whole new meaning. These days, a punching error at work means criminal charges, not debugging, and the punch cards are in the computer history museum.

Our first priority is still to produce correct and maintainable code, but today we have powerful tools to measure performance and run tests. We can iterate faster and optimize earlier - but only with the right setup. Disclaimer: I work on computationally expensive problems in machine learning. I think this optimization checklist contains some good general guidelines, but I am biased by my environment.

Do you need to optimize your code?

Sometimes, you can avoid optimization altogether by changing the design, using better algorithms, or modifying the production runtime.

  • Is there a performance problem? If you have no valid reason to optimize, don’t do it.

  • Is the problem reproducible? You shouldn’t optimize until you have a benchmark that reproduces the problem. Pay special attention to randomized algorithms and machine learning code. Uncommon (but severe) performance problems can be hidden by randomness. The problem might also be something external to your code, such as the system configuration, hardware, or dependency version.

  • Are the requirements stable? Don’t optimize if the code might be replaced in a week.

  • Are you using the right algorithms? Make thoughtful choices for data structures and algorithms. Don’t make decisions based solely on theoretical complexity. Sometimes, “worse” complexity is much faster in practice, especially if you can throw a GPU at it or use a ridiculously optimized SAT solver. I would rather do bit operations than compute digits of π.

  • Did you design for performance? You should think about performance at the architecture design phase. In demanding areas such as graphics and machine learning, we often design for performance from Day 1. Consider the type of hardware that will likely run your code. It is a good idea to flatten class hierarchies, minimize the number of abstractions, and make a plan for memory allocations.

Is your code ready to optimize?

  • Is the code correct? Fast wrong answers are still wrong.

  • Is the code maintainable? The code should be as simple as possible, but no more. Optimization usually makes code harder to maintain. Start with as little complexity as you can.

  • Do you have unnecessary dependencies? Before optimizing, examine your dependency list. Removing dependencies is a good way to pay off tech debt, but it can also speed up the code.

  • Do you have tests? You’ll need correctness tests to verify that the program is still correct after your optimizations. Re-run the tests every time you change the code.

  • Do you have a benchmark? To know whether you are actually making the code faster, you need to evaluate the code on a standard task before / after doing the optimization. It is best to automate the process of running the benchmark so that you can easily check code changes.

  • Is the benchmark realistic? The benchmark should run the program under the conditions and inputs that the program will encounter in production. If possible, use the exact inputs that cause the program to have bad performance. If the program might encounter different types of inputs, make sure each type is represented in your benchmark.

  • Does the code follow best practices? Each programming language has best practices that lead to good performance. For example, we avoid temporary local copies and pass objects by reference in C++ to avoid copying large data structures. Without following best practices, you might encounter uniformly slow code when you profile, where the true performance problems are hidden by hundreds of simple mistakes. These improvements are free in the sense that they usually don’t hurt maintainability.
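A benchmark harness does not need to be elaborate; for CPU-bound code, a minimal sketch along these lines is often enough (the workloads here are invented for illustration):

```python
import time

def benchmark(fn, *args, repeat=5):
    # Report the best wall-clock time over several runs; the minimum
    # is typically the least noisy statistic for CPU-bound code.
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Example: compare two ways of summing squares on the same input.
naive = benchmark(lambda n: sum([i * i for i in range(n)]), 10_000)
lazy = benchmark(lambda n: sum(i * i for i in range(n)), 10_000)
```

Automating this (in CI, or as a make target) is what lets you check every code change against the baseline.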

Do you have the information you need?

  • Have you profiled? Use a profiler to identify bottlenecks - do not rely on intuition. Generate a flame graph or examine a random sample of stack traces. If you are optimizing a compiled language, you should profile a release build - debug builds can have much different performance characteristics. You should also profile the system on the production hardware, if possible. The closer the profiling setup to the deployment setup, the better.

  • Does the profiler measure the right thing? Your program might be wasting CPU cycles, but it might also be writing massive files, spending hours waiting for a DNS response, cache thrashing, or quivering in fear of Brendan Gregg. Since I work on algorithms and data structures, my code is usually (surprise!) CPU bound. But I have seen deep learning code that is bound by GPU memory and CPU-GPU communication, and I’ve written bioinformatics software that is bound by file IO. Network communication is a problem in web development and distributed computing, and electricity is a limiting factor for IoT. Use off-CPU analysis or just pause the program in a debugger to catch non-CPU problems.

  • Are you profiling the right workload? As you optimize, the program may become so fast that the profiler can no longer attach to the process when you run the benchmark. You will need to give the program more work, perhaps by calling the benchmark in a loop. This should be done carefully to make sure that the new workload is still representative of performance.

  • Does the profiler show hotspots? In well-written code, the profiler will show that the program spends most of its time in a few expensive locations, called “hotspots”. If the profiler does not show hotspots, you may need to rethink your design or programming practices to avoid uniformly slow code.
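In Python, for example, a quick profile-and-report loop looks like this; the lopsided workload is invented so that a clear hotspot shows up in the report:

```python
import cProfile
import io
import pstats

def hotspot():
    # Deliberately expensive string building - this should dominate.
    return sum(len(str(i) * 3) for i in range(50_000))

def workload():
    hotspot()
    return [i for i in range(100)]  # cheap by comparison

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the ten most expensive entries by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
```

If the top entries account for most of the runtime, you have hotspots worth attacking; if time is smeared evenly across hundreds of entries, revisit the best-practices step first.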

[Image Credit] Jeremy Bishop on Unsplash

[Image Credit] Taken at the Computer History Museum’s IBM 1401 restoration project. Released under the CC BY 2.0 license.

MinHashing all the things: searching for MAGs in the SRA

(or: Top-down and bottom-up approaches for working around sourmash limitations)

In the last month I updated wort, the system I developed for computing sourmash signatures for public genomic databases, and started calculating signatures for the metagenomes in the Sequence Read Archive. This is a more challenging subset than the microbial datasets I was processing previously, since there are around 534k datasets from metagenomic sources in the SRA, totalling 447 TB of data. Another problem is the size of the datasets, ranging from a couple of MB to 170 GB. It turns out that the workers I have in wort are very good for small-ish datasets, but I still need to figure out how to pull large datasets faster from the SRA, because the large ones take forever to process...

The good news is that I managed to calculate signatures for almost 402k of them 1, which already lets us work on some pretty exciting problems =]

Looking for MAGs in the SRA

Metagenome-assembled genomes are essential for studying organisms that are hard to isolate and culture in the lab, especially for environmental metagenomes. Tully et al published 2,631 draft MAGs from 234 samples collected during the Tara Oceans expedition, and I wanted to check if they can also be found in other metagenomes besides the Tara Oceans ones. The idea is to extract the reads from these other matches and evaluate how the MAG can be improved, or at least evaluate what is missing in them. I chose to use environmental samples under the assumption that they are easier to deposit in the SRA with public access, but there are many human gut microbiomes in the SRA, and this MAG search would work just fine with those too.

Moreover, I want to search for containment, and not similarity. The distinction is subtle, but similarity takes into account the sizes of both datasets (well, the size of the union of all elements in both datasets), while containment only considers the size of the query. This is relevant because the similarity of a MAG and a metagenome is going to be very small (and is symmetrical), but the containment of the MAG in the metagenome might be large (and is asymmetrical, since the containment of the metagenome in the MAG is likely very small because the metagenome is so much larger than the MAG).
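A toy example with plain sets makes the asymmetry concrete (sets stand in for sourmash signatures here):

```python
def jaccard(a, b):
    # Similarity: intersection over union; symmetric in a and b.
    return len(a & b) / len(a | b)

def containment(query, subject):
    # Containment: fraction of the query found in the subject;
    # asymmetric by construction.
    return len(query & subject) / len(query)

mag = set(range(100))            # small query
metagenome = set(range(10_000))  # much larger dataset, contains the MAG
```

Here `jaccard(mag, metagenome)` is only 0.01, while `containment(mag, metagenome)` is 1.0 - and `containment(metagenome, mag)` drops back down to 0.01.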

The computational challenge: indexing and searching

sourmash signatures are a small fraction of the original size of the datasets, but when you have hundreds of thousands of them the collection ends up being pretty large too. More precisely, 825 GB large. That is way bigger than any index I ever built for sourmash, and it would also have pretty distinct characteristics than what we usually do: we tend to index genomes and run search (to find similar genomes) or gather (to decompose metagenomes into their constituent genomes), but for this MAG search I want to find which metagenomes have my MAG query above a certain containment threshold. Sort of a sourmash search --containment, but over thousands of metagenome signatures. The main benefit of an SBT index in this context is to avoid checking all signatures because we can prune the search early, but currently SBT indices need to be totally loaded in memory during sourmash index. I will have to do this in the medium term, but I want a solution NOW! =]

sourmash 3.4.0 introduced --from-file in many commands, and since I can't build an index I decided to use it to load signatures for the metagenomes. But... sourmash search tries to load all signatures in memory, and while I might be able to find a cluster machine with hundreds of GBs of RAM available, that's not very practical.

So, what to do?

The top-down solution: a snakemake workflow

I don't want to modify sourmash now, so why not make a workflow and use snakemake to run one sourmash search --containment for each metagenome? That means 402k tasks, but at least I can use batches and SLURM job arrays to submit reasonably-sized jobs to our HPC queue. After running all batches I summarized results for each task, and it worked well for a proof of concept.

But... it was still pretty resource intensive: each task was running one query MAG against one metagenome, and so each task needed to do all the overhead of starting the Python interpreter and parsing the query signature, which is exactly the same for all tasks. Extending it to support multiple queries to the same metagenome would involve duplicating tasks, and 402k metagenomes times 2,631 MAGs is... a very large number of jobs.

I also wanted to avoid clogging the job queues, which is not very nice to the other researchers using the cluster. This limited how many batches I could run in parallel...

The bottom-up solution: Rust to the rescue!

Thinking a bit more about the problem, here is another solution: what if we load all the MAGs in memory (as they will be queried frequently and are not that large), and then for each metagenome signature load it, perform all MAG queries, and then unload the metagenome signature from memory? This way we can control memory consumption (it's going to be proportional to all the MAG sizes plus the size of the largest metagenome) and can also efficiently parallelize the code because each task/metagenome is independent and the MAG signatures can be shared freely (since they are read-only).
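The shape of that design, sketched in Python with sets standing in for signatures and a thread pool in place of rayon (the real implementation is the Rust code described below):

```python
from concurrent.futures import ThreadPoolExecutor

def search_all(queries, load_subject, subject_ids, threshold=0.5):
    """queries: {name: set}, held in memory for the whole run.

    Each subject (metagenome) is loaded on demand, scored against
    every query, and then dropped - keeping memory proportional to
    the queries plus one subject at a time."""
    def score_one(sid):
        subject = load_subject(sid)
        hits = []
        for name, q in queries.items():
            c = len(q & subject) / len(q)  # containment of query in subject
            if c >= threshold:
                hits.append((name, sid, c))
        return hits

    results = []
    with ThreadPoolExecutor() as pool:
        for hits in pool.map(score_one, subject_ids):
            results.extend(hits)
    return results
```

Because the queries are read-only, they can be shared freely across workers - the same property that makes the Rust version's concurrency "fearless".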

This could be done with the sourmash Python API plus multiprocessing or some other parallelization approach (maybe dask?), but turns out that everything we need comes from the Rust API. Why not enjoy a bit of the fearless concurrency that is one of the major Rust goals?

The whole code ended up being 176 lines long, including command line parsing using structopt and parallelizing the search using rayon and a multiple-producer, single-consumer channel to write results to an output (either the terminal or a file). This version took 11 hours to run, using less than 5 GB of RAM and 32 processors, to search 2k MAGs against 402k metagenomes. And, bonus! It can also be parallelized again if you have multiple machines, so it potentially takes a bit more than an hour to run if you can allocate 10 batch jobs, with each batch searching 1/10 of the metagenome signatures.

So, is bottom-up always the better choice?

I would like to answer "Yes!", but bioinformatics software tends to be organized as command line interfaces, not as libraries. Libraries also tend to have even less documentation than CLIs, and this particular case is not a fair comparison because... Well, I wrote most of the library, and the Rust API is not that well documented for general use.

But I'm pretty happy with how the sourmash CLI is viable both for the top-down approach (and whatever workflow software you want to use) as well as how the Rust core worked for the bottom-up approach. I think the most important is having the option to choose which way to go, especially because now I can use the bottom-up approach to make the sourmash CLI and Python API better. The top-down approach is also way more accessible in general, because you can pick your favorite workflow software and use all the tricks you're comfortable with.

But, what about the results?!?!?!

Next time. But I did find MAGs with over 90% containment in very different locations, which is pretty exciting!

I also need to find a better way of distributing all these signatures, because storing 4 TB of data in S3 is somewhat cheap, but transferring data is very expensive. All signatures are also available on IPFS, but I need more people to host and share them. Get in contact if you're interested in helping =]

And while I'm asking for help, any tips on pulling data faster from the SRA are greatly appreciated!



  1. Pulling about 100 TB in 3 days, which was pretty fun to see - because I ended up DDoSing myself when I couldn't download the generated sigs fast enough from the S3 bucket where they are temporarily stored =P

The Security Value of Inefficiency


For decades, we have prized efficiency in our economy. We strive for it. We reward it. In normal times, that's a good thing. Running just at the margins is efficient. A single just-in-time global supply chain is efficient. Consolidation is efficient. And that's all profitable. Inefficiency, on the other hand, is waste. Extra inventory is inefficient. Overcapacity is inefficient. Using many small suppliers is inefficient. Inefficiency is unprofitable.

But inefficiency is essential security, as the COVID-19 pandemic is teaching us. All of the overcapacity that has been squeezed out of our healthcare system; we now wish we had it. All of the redundancy in our food production that has been consolidated away; we want that, too. We need our old, local supply chains -- not the single global ones that are so fragile in this crisis. And we want our local restaurants and businesses to survive, not just the national chains.

We have lost much inefficiency to the market in the past few decades. Investors have become very good at noticing any fat in every system and swooping down to monetize those redundant assets. The winner-take-all mentality that has permeated so many industries squeezes any inefficiencies out of the system.

This drive for efficiency leads to brittle systems that function properly when everything is normal but break under stress. And when they break, everyone suffers. The less fortunate suffer and die. The more fortunate are merely hurt, and perhaps lose their freedoms or their future. But even the extremely fortunate suffer -- maybe not in the short term, but in the long term from the constriction of the rest of society.

Efficient systems have limited ability to deal with system-wide economic shocks. Those shocks are coming with increased frequency. They're caused by global pandemics, yes, but also by climate change, by financial crises, by political crises. If we want to be secure against these crises and more, we need to add inefficiency back into our systems.

I don't simply mean that we need to make our food production, or healthcare system, or supply chains sloppy and wasteful. We need a certain kind of inefficiency, and it depends on the system in question. Sometimes we need redundancy. Sometimes we need diversity. Sometimes we need overcapacity.

The market isn't going to supply any of these things, least of all in a strategic capacity that will result in resilience. What's necessary to make any of this work is regulation.

First, we need to enforce antitrust laws. Our meat supply chain is brittle because there are limited numbers of massive meatpacking plants -- now disease factories -- rather than lots of smaller slaughterhouses. Our retail supply chain is brittle because a few national companies and websites dominate. We need multiple companies offering alternatives to a single product or service. We need more competition, more niche players. We need more local companies, more domestic corporate players, and diversity in our international suppliers. Competition provides all of that, while monopolies suck that out of the system.

The second thing we need is specific regulations that require certain inefficiencies. This isn't anything new. Every safety system we have is, to some extent, an inefficiency. This is true for fire escapes on buildings, lifeboats on cruise ships, and multiple ways to deploy the landing gear on aircraft. Not having any of those things would make the underlying systems more efficient, but also less safe. It's also true for the internet itself, originally designed with extensive redundancy as a Cold War security measure.

With those two things in place, the market can work its magic to provide for these strategic inefficiencies as cheaply and as effectively as possible. As long as there are competitors who are vying with each other, and there aren't competitors who can reduce the inefficiencies and undercut the competition, these inefficiencies just become part of the price of whatever we're buying.

The government is the entity that steps in and enforces a level playing field instead of a race to the bottom. Smart regulation addresses the long-term need for security, and ensures it's not continuously sacrificed to short-term considerations.

We have largely been content to ignore the long term and let Wall Street run our economy as efficiently as it can. That's no longer sustainable. We need inefficiency -- the right kind in the right way -- to ensure our security. No, it's not free. But it's worth the cost.

This essay previously appeared in Quartz.

Streamlining Data-Intensive Biology With Workflow Systems

As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. The maturation of data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis.

Author Summary: We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training, and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.