
Scientific Rust #rust2019


The Rust community requested feedback last year on where the language should go in 2018, and now they are running the call again for 2019. Last year I was too new to Rust to organize a blog post, but after a year using it I feel more comfortable writing this!

(Check my previous post about replacing the C++ core in sourmash with Rust for more details on how I spent my year in Rust.)

What counts as "scientific Rust"?

Anything that involves doing science using computers counts as scientific programming. It ranges from embedded software running on satellites to climate models running on supercomputers, from shell scripts chaining tools in a pipeline to data analysis in notebooks.

That breadth also makes the discussion harder, because it's too general! But it is very important to keep in mind, because scientists are not your regular users: they are highly qualified in their fields of expertise, and they are pushing the boundaries of what we know (which may demand flexibility from their tools).

In this post I will focus on two areas: array computing (what most people consider 'scientific programming' to be) and "data structures".

Array computing

This area has been booming over the last couple of years thanks to industry interest in data science and deep learning (where people talk about tensors instead of arrays), and it has its roots in models running on supercomputers (a field where Fortran is still king!). Data tends to be quite regular (representable with matrices) and amenable to parallel processing.

A good example is the SciPy stack in Python, built on top of NumPy and SciPy. The adoption of the SciPy stack (both in academia and industry) is staggering, and many alternative implementations provide a NumPy-like API to capture its mindshare.

This is the compute-intensive side of science (be it on CPU or GPU/TPU), and also the kind of data that pushed CPU evolution; it is still very important in defining scientific computing funding policy (see countries competing for the largest supercomputers and measuring performance in floating-point operations per second).

Data structures for efficient data representation

For data that is not so regular the situation is a bit different. I'll use bioinformatics as an example: the data we get out of nucleotide sequencers is usually represented by long strings (of ACGT), and algorithms do a lot of string processing (be it building string-overlap graphs for assembly, or searching for substrings in large collections). This is only one example: many analyses work with other types of data, and most of them don't have a universal data representation like the one in the array computing case.

This is the memory-intensive side of science, and it's hard to measure its performance in floating-point operations because... most of the time you're not even using floating-point numbers. It also suffers from limited data locality, which is a problem since good locality is almost a prerequisite for compute-intensive performance.

High performance core, interactive API

There is something common to both cases: while the performance-intensive code is implemented in C/C++/Fortran, users usually interact with the API from other languages (especially Python or R), because it's faster to iterate and explore the data there, and many of the tools already available in those languages are very helpful for these tasks (think Jupyter/pandas or RStudio/tidyverse). These languages are used to define the computation, but it is a lower-level core library that drives it (NumPy and TensorFlow follow this idea, for example).

How to make Rust better for science?

The biggest barrier to learning Rust is the ownership model, and while we can agree it is an important feature, it is also quite daunting for newcomers, especially if they don't have prior programming experience or exposure to the classes of bugs it prevents. I don't see it being the first language we teach to scientists any time soon, because the majority of scientists are not systems programmers and have very different expectations for a programming language. That doesn't mean they can't benefit from Rust!

Rust is already great for building the performance-intensive parts, and thanks to Cargo it is also a better alternative for sharing that code around, since C/C++ cores tend to get 'stuck' inside Python or R packages. The 'easy' approach of vendoring C/C++ instead of packaging it properly makes it hard to keep track of changes and doesn't encourage reusable code.

And, of course, if this code is Rust instead of C/C++, it also means that Rust users can use it directly, without depending on the other languages. Seems like a good way to bootstrap a scientific community in Rust =]

What I would like to see in 2019?

An attribute proc-macro like #[wasm_bindgen] but for FFI

While FFI is an integral part of Rust's goals (interoperability with C/C++), I have serious envy of the structure and tooling developed for WebAssembly! (Even more now that it works on stable too.)

We already have #[no_mangle] and pub extern "C", but they are quite low-level. I would love to see something closer to what wasm-bindgen does, and define some traits (like IntoWasmAbi) to make it easier to pass more complex data types through the FFI.
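
To make the gap concrete, here is a minimal sketch of what the low-level route looks like today, using only #[no_mangle] and extern "C". The function name and the counting logic are made up for illustration; they are not from sourmash or any real crate.

    #[no_mangle]
    pub extern "C" fn count_a(ptr: *const u8, len: usize) -> u64 {
        // SAFETY: the caller (e.g. Python via ctypes/cffi) must pass a valid
        // pointer to `len` readable bytes; nothing here checks that for us.
        let seq = unsafe { std::slice::from_raw_parts(ptr, len) };
        // Trivial stand-in for real work: count 'A' bytes in the sequence.
        seq.iter().filter(|&&b| b == b'A').count() as u64
    }

Everything beyond plain pointers and integers (strings, slices, richer structs) needs hand-written glue on both sides of the boundary, which is exactly the kind of boilerplate a wasm-bindgen-style attribute macro could generate.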

I know it's not that simple, and there are design restrictions different from WebAssembly's to take into account... The point here is not having the perfect solution for all use cases, but something that serves as an entry point and helps to deal with the complexity while you're still figuring out all the quirks and traps of FFI. You can still fall back to the lower-level options for more control when the need arises.

More -sys and Rust-like crates for interoperability with the larger ecosystems

There are new projects bringing more interoperability to dataframes and tensors. While this ship has already sailed and they are implemented in C/C++, it would be great for Rust to be a first-class citizen there rather than reinventing the wheel. (Note: the Arrow project already has pretty good Rust support!)

In my own corner (bioinformatics), the Rust-Bio community is doing a great job of wrapping useful C/C++ libraries and exposing them to Rust (and also a shout-out to 10x Genomics for doing this work for other tools while also contributing to Rust-Bio!).

More (bioinformatics) tools using Rust!

We already have great examples like finch and yacrd, since Rust is great for distributing programs as a single binary. And with bioinformatics focusing so much on independent tools chained together in workflows, I think we can start convincing people to try it out =]

A place to find other scientists?

Another idea is to draw inspiration from rOpenSci and have a Rust equivalent, where people can get feedback about their projects and how to better integrate them with other crates. This is quite close to the working group idea, but I think it would serve more as a gateway to other groups, focused on developing entry-level docs and bringing more scientists into the community.

Final words

In the end, this post turned into my 'wishful TODO list' for 2019, but I would love to find more people sharing these goals (or willing to take any of this and just run with it; I do have a PhD to finish! =P)

Comments?


List: Classic Christmas Stories, Updated for Late-Stage Capitalism


A Christmas Carol

Ebenezer Scrooge asks the Ghost of Christmas Future to reveal the exact sites of Amazon HQ2 buildings so he can buy up the nearest luxury condos. The Cratchits crowdfund $60,000 for Tiny Tim’s medical expenses, but their GoFundMe campaign gets shut down when it’s discovered that they made up his illness to drive clicks to their fledgling YouTube channel.

How the Grinch Stole Christmas

The Grinch launches an Initial Coin Offering for GiftCoin, a cryptocurrency that moves gift-giving to the blockchain and enables anonymous, decentralized gift exchanges. Little Cindy Lou’s cries of “Santy Claus, why would we possibly need this financial innovation?” go ignored as every Who in Whoville opens a Bitcoin wallet to buy into the ICO.

Miracle on 34th Street

When Kris Kringle becomes entangled in a workplace dispute, Macy’s fires him immediately since Santa Claus is a gig economy contractor with no union or wrongful termination rights.

A Christmas Story

All nine-year-old Ralphie wants for Christmas is a $1,449 512GB red iPhone XS Max, despite his mother, his teacher, and the department store Santa all warning him, “You’ll strain your eyes out!” Ralphie’s parents eventually relent and buy him one, on the condition he wears prescription blue light blocking glasses when he’s playing Fortnite.

Rudolph the Red-Nosed Reindeer

After discovering that a rare hormone disorder is the cause of Rudolph’s lit-up nose, pharmaceutical executives exploit his genetic code to develop a men’s virility drug. They market their new wonder drug through sleigh-side ads and Buzzfeed-branded content — a pack of pills costs $699 and arrives monthly in Instagram-friendly boxes.

Home Alone

Kevin McAllister finds himself home alone for the holidays, though he’s able to simulate festive human gatherings by wirelessly interacting with his Echo, Nest, Philips Hue, and Juicero. When he gets into trouble for gleefully catfishing two lonely ex-cons, Kevin’s parents can’t get home to help him sooner because their Basic Economy tickets prohibit any flight changes.

Home Alone 2

Largely the same plot, but Donald Trump is involved for some reason.

Frosty the Snowman

Frosty melts because society treats climate change as an unpriced negative externality for which polluters can pretend to develop market-based solutions while rejecting regulatory intervention. Frosty’s coal eyes are put in the stockings of children whose parents lobbied against the Paris Climate Accord.

It’s a Wonderful Life

After failing to negotiate a line of asset-backed credit, an underqualified mortgage banker considers suicide to avoid facing his arrest for financial malfeasance. (No updates required.)


Optimize for Auditability


When we write code, we optimize for many different things. We optimize for writability: how easy is it to write the code in the first place? We optimize for maintainability: how easy is it to make ongoing changes? We optimize for readability: how easy is it to understand what the code does?

However, we rarely optimize for auditability: how easy is it to tell whether the code has a security vulnerability? By ignoring this aspect of software design we increase the burden on people reviewing code for vulnerabilities, which reduces the overall security of our software.

Some might ask, why optimize for auditability? After all, isn't it the same as readability? And if we know how to make bugs obvious, why not just fix them instead? Both of these have the same answer: it's very easy to write code whose intended purpose is clear to any engineer, but which has vulnerabilities that only a security expert will recognize. Auditability means designing languages, APIs, and patterns such that places in the code which are deserving of more stringent review are clearly delineated. This allows an auditor to focus their time as much as possible.

I have explored auditability as a goal in two different security contexts: cryptography and memory unsafety. In both contexts, I've found that code which was written with auditability in mind has allowed me to perform audits significantly more quickly and to have much higher confidence that the audit found everything it should have.

Cryptography

When we built the pyca/cryptography Python library, auditability was a core design criterion for our API. We were responding to very negative experiences we had with other libraries, where low-level APIs had defaults which were often dangerous and always hindered review. An example of this is a symmetric block cipher API with a default mode, or providing a default initialization vector when using CBC mode. While the danger of insecure defaults (such as ECB mode, or an all-zero IV) is clear, even places with acceptable defaults stymied reviews because they made it more difficult to answer questions like "Which cryptographic algorithms does this project use?"

As a result, we decided that we'd have two styles of APIs: low-level ones, where users were required to explicitly specify every parameter and algorithm, and high-level ones which should have no choices at all, and which clearly documented how they were built. The goal was that auditors could easily do things like:

  • Find uses of high-level recipes, know they were implemented securely, and skip them during review.
  • Search very quickly for known-insecure algorithms such as MD5, ECB, or PKCS1v15 encryption.
  • Easily assess whether an encrypt() function under audit was secure, because the cryptographic algorithms are front and center.
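
As an illustration only (hypothetical names, sketched in Rust rather than Python, with a placeholder instead of real cryptography, and not pyca/cryptography's actual API), the shape of the two styles looks something like this: the low-level style forces every algorithm choice to be spelled out and greppable, while the high-level recipe offers no choices at all.

    // Low-level style: the caller must name the algorithm at every call site,
    // so a grep for "Md5" finds every questionable use instantly.
    #[derive(Debug)]
    pub enum Digest { Sha256, Sha512, Md5 }

    pub fn digest_with(algorithm: Digest, data: &[u8]) -> String {
        // Placeholder body; a real library would dispatch to the actual hash.
        format!("{:?}({} bytes)", algorithm, data.len())
    }

    // High-level recipe: no parameters to get wrong and a documented,
    // fixed construction, so auditors can skip these call sites entirely.
    pub fn recipe_digest(data: &[u8]) -> String {
        digest_with(Digest::Sha256, data)
    }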

Our API design strategy works. In auditing numerous applications and libraries using pyca/cryptography, I've found that I've been able to very easily identify uses of poor algorithms, as well as limit the scope of code I needed to consider when trying to answer higher level questions about protocols and algorithm compositions.

Memory unsafety

Frequent readers will know I'm a big fan of Rust, and more broadly of moving away from memory-unsafe languages like C and C++ to memory-safe languages, be they Swift, Rust, or Go. One of the most common reactions I get when I state my belief that Rust could produce an order of magnitude fewer vulnerabilities is that Rust has unsafe, and since unsafe allows memory corruption vulnerabilities, it breaks the security of the entire system, so Rust is really no better than C or C++. Leaving aside the somewhat tortured logic there, ensuring that unsafe does not provide an unending stream of vulnerabilities is an important task.

I've recently had the opportunity to audit a few medium-sized Rust codebases, in the thousands-of-lines-of-code range. They made significant use of unsafe, primarily for interacting with C code and unsafe system APIs. In each codebase I found one memory corruption vulnerability: one was unexploitable, the other probably exploitable. In my experience this is fewer vulnerabilities than I would have identified in similar codebases written in C/C++, but the far more interesting element was how easy it was to perform the audit.

For C/C++ codebases like this, I'd start my audit by identifying all the entrypoints where untrusted data is introduced into the system, for example socket reads, public API functions, or RPC handlers. Then for each of these, depending on size, I'd try to fuzz it with something like libFuzzer and manually review the code to look for vulnerabilities. For these Rust audits, I took a dramatically different approach. I was only interested in memory corruption vulnerabilities, so I simply grepped for "unsafe" and reviewed each instance for vulnerabilities. In some cases this required reviewing callers, callees, or other code within the module, but many sites could be resolved just by examining the unsafe block itself. As a result, I was able to perform these audits to a high level of confidence in just a few hours.

Requiring code that uses dangerous, memory-corrupting features to be within an unsafe block is a powerful form of optimizing for auditability. With C/C++, code is guilty until proven innocent: any code could have memory unsafety vulnerabilities until you've given it a look. Rust provides a sharp contrast by making code innocent until proven guilty: unless it's within an unsafe block (or in the same module and responsible for maintaining invariants that an unsafe block depends on), Rust code can be trusted to be memory safe. This dramatically reduces the scope of code you need to audit.
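
For a concrete (if contrived) picture of what such an audit looks at, here is the kind of unsafe site the grep turns up; the names are invented for illustration. Because the safety argument sits next to the block, a reviewer can often resolve it without reading the rest of the file.

    fn first_byte(data: &[u8]) -> Option<u8> {
        if data.is_empty() {
            return None;
        }
        // SAFETY: the emptiness check above guarantees index 0 is in bounds,
        // so skipping the bounds check cannot read out of range.
        Some(unsafe { *data.get_unchecked(0) })
    }

Everything outside blocks like this can be treated as memory safe, which is what keeps the grep-driven approach tractable.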

Conclusion

In both of these domains, optimizing for auditability lived up to my hopes. Code which was optimized for auditability took less time to review, and when I completed the reviews I was more confident I'd found everything there was to find. This dramatic improvement in the ability to identify security issues means fewer bugs that go on to become critical incidents.

Optimizing for auditability pairs well with APIs which are designed to be easy to use securely. It should be a component of more teams' strategies for building software that’s secure.


Open source confronts its midlife crisis


Midlife is tough: the idealism of youth has faded, as has inevitably some of its fitness and vigor. At the same time, the responsibilities of adulthood have grown: the kids that were such a fresh adventure when they were infants and toddlers are now grappling with their own transition into adulthood — and you try to remind yourself that the kids that you have sacrificed so much for probably don’t actually hate your guts, regardless of that post they just liked on the ‘gram. Making things more challenging, while you are navigating the turbulence of teenagers, your own parents are likely entering life’s twilight, needing help in new ways from their adult children. By midlife, in addition to the singular joys of life, you have also likely experienced its terrible sorrows: death, heartbreak, betrayal. Taken together, the fading of youth, the growth in responsibility and the endurance of misfortune can lead to cynicism or (worse) drastic and poorly thought-out choices. Add in a little fear of mortality and some existential dread, and you have the stuff of which midlife crises are made…

I raise this not because of my own adventures at midlife, but because it is clear to me that open source — now several decades old and fully adult — is going through its own midlife crisis. This has long been in the making: for years, I (and others) have been critical of service providers' parasitic relationship with open source, as cloud service providers turn open source software into a service offering without giving back to the communities upon which they implicitly depend. At the same time, open source has been (rightfully) entirely unsympathetic to the proprietary software models that have been burned to the ground — but also seemingly oblivious as to the larger economic waves that have buoyed them.

So it seemed like only a matter of time before the companies built around open source software would have to confront their own crisis of confidence: open source business models are really tough, selling software-as-a-service is one of the most natural of them, the cloud service providers are really good at it — and their commercial appetites seem boundless. And, like a new cherry red two-seater sports car next to a minivan in a suburban driveway, some open source companies are dealing with this crisis exceptionally poorly: they are trying to restrict the way that their open source software can be used. These companies want it both ways: they want the advantages of open source — the community, the positivity, the energy, the adoption, the downloads — but they also want to enjoy the fruits of proprietary software companies in software lock-in and its concomitant monopolistic rents. If this were entirely transparent (that is, if some bits were merely being made explicitly proprietary), it would be fine: we could accept these companies as essentially proprietary software companies, albeit with an open source loss-leader. But instead, these companies are trying to license their way into this self-contradictory world: continuing to claim to be entirely open source, but perverting the license under which portions of that source are available. Most gallingly, they are doing this by hijacking open source nomenclature. Of these, the laughably named commons clause is the worst offender (it is plainly designed to be confused with the purely virtuous creative commons), but others (including CockroachDB’s Community License, MongoDB’s Server Side Public License, and Confluent’s Community License) are little better. And in particular, as it apparently needs to be said: no, “community” is not the opposite of “open source” — please stop sullying its good name by attaching it to licenses that are deliberately not open source! But even if they were more aptly named (e.g. “the restricted clause” or “the controlled use license” or — perhaps most honest of all — “the please-don’t-put-me-out-of-business-during-the-next-reInvent-keynote clause”), these licenses suffer from a serious problem: they are almost certainly asserting rights that the copyright holder doesn’t in fact have.

If I sell you a book that I wrote, I can restrict your right to read it aloud for an audience, or sell a translation, or write a sequel; these restrictions are rights afforded the copyright holder. I cannot, however, tell you that you can’t put the book on the same bookshelf as that of my rival, or that you can’t read the book while flying a particular airline I dislike, or that you aren’t allowed to read the book and also work for a company that competes with mine. (Lest you think that last example absurd, that’s almost verbatim the language in the new Confluent Community (sic) License.) I personally think that none of these licenses would withstand a court challenge, but I also don’t think it will come to that: because the vendors behind these licenses will surely fear that they wouldn’t survive litigation, they will deliberately avoid inviting such challenges. In some ways, this netherworld is even worse, as the license becomes a vessel for unverifiable fear of arbitrary liability.

Legal dubiousness aside, as with that midlife hot rod, the licenses aren’t going to address the underlying problem. To be clear, the underlying problem is not the licensing, it’s that these companies don’t know how to make money — they want open source to be its own business model, and seeing that the cloud service providers have an entirely viable business model, they want a piece of the action. But as a result of these restrictive riders, one of two things will happen with respect to a cloud services provider that wants to build a service offering around the software:

  1. The cloud services provider will build their service not based on the software, but rather on another open source implementation that doesn’t suffer from the complication of a lurking company with brazenly proprietary ambitions.

  2. The cloud services provider will build their service on the software, but will use only the truly open source bits, reimplementing (and keeping proprietary) any of the surrounding software that they need.

In the first case, the victory is strictly pyrrhic: yes, the cloud services provider has been prevented from monetizing the software — but the software will now have less of the adoption that is the lifeblood of a thriving community. In the second case, there is no real advantage over the current state of affairs: the core software is still being used without the open source company being explicitly paid for it. Worse, the software and its community have been harmed: where one could previously appeal to the social contract of open source (namely, that cloud service providers have a social responsibility to contribute back to the projects upon which they depend), now there is little to motivate such reciprocity. Why should the cloud services provider contribute anything back to a company that has declared war on it? (Or, worse, implicitly accused it of malfeasance.) Indeed, as long as fights are being picked with them, cloud service providers are likely to clutch their bug fixes in the open core as a differentiator, cackling to themselves over the gnarly race conditions that they have fixed of which the community is blissfully unaware. Is this in any way a desired end state?

So those are the two cases, and they are both essentially bad for the open source project. Now, one may notice that there is a choice missing, and for those open source companies that still harbor magical beliefs, let me put this to you as directly as possible: cloud services providers are emphatically not going to license your proprietary software. I mean, you knew that, right? The whole premise with your proprietary license is that you are finding that there is no way to compete with the operational dominance of the cloud services providers; did you really believe that those same dominant cloud services providers can’t simply reimplement your LDAP integration or whatever? The cloud services providers are currently reproprietarizing all of computing — they are making their own CPUs for crying out loud! — reimplementing the bits of your software that they need in the name of the service that their customers want (and will pay for!) won’t even move the needle in terms of their effort.

Worse than all of this (and the reason why this madness needs to stop): licenses that are vague with respect to permitted use are corporate toxin. Any company that has been through an acquisition can speak of the peril of the due diligence license audit: the acquiring entity is almost always deep pocketed and (not unrelatedly) risk averse; the last thing that any company wants is for a deal to go sideways because of concern over unbounded liability to some third-party knuckle-head. So companies that engage in license tomfoolery are doing worse than merely not solving their own problem: they are potentially poisoning the wellspring of their own community.

So what to do? Those of us who have been around for a while — who came up in the era of proprietary software and saw the merciless transition to open source software — know that there’s no way to cross back over the Rubicon. Open source software companies need to come to grips with that uncomfortable truth: their business model isn’t their community’s problem, and they should please stop trying to make it one. And while they’re at it, it would be great if they could please stop making outlandish threats about the demise of open source; they sound like shrieking proprietary software companies from the 1990s, warning that open source will be ridden with nefarious backdoors and unspecified legal liabilities. (Okay, yes, a confession: just as one’s first argument with their teenager is likely to give their own parents uncontrollable fits of smug snickering, those of us who came up in proprietary software may find companies decrying the economically devastating use of their open source software to be amusingly ironic — but our schadenfreude cups runneth over, so they can definitely stop now.)

So yes, these companies have a clear business problem: they need to find goods and services that people will exchange money for. There are many business models that are complementary with respect to open source, and some of the best open source software (and certainly the least complicated from a licensing drama perspective!) comes from companies that simply needed the software and open sourced it because they wanted to build a community around it. (There are many examples of this, but the outstanding Envoy and Jaeger both come to mind — the former from Lyft, the latter from Uber.) In this regard, open source is like a remote-friendly working policy: it's something that you do because it makes economic and social sense; even as it's very core to your business, it's not a business model in and of itself.

That said, it is possible to build business models around the open source software that is a company’s expertise and passion! Even though the VC that led the last round wants to puke into a trashcan whenever they hear it, business models like “support”, “services” and “training” are entirely viable! (That’s the good news; the bad news is that they may not deliver the up-and-to-the-right growth that these companies may have promised in their pitch deck — and they may come at too low a margin to pay for large teams, lavish perks, or outsized exits.) And of course, making software available as a service is also an entirely viable business model — but I’m pretty sure they’ve heard about that one in the keynote.

As part of their quest for a business model, these companies should read Adam Jacob’s excellent blog entry on sustainable free and open source communities. Adam sees what I see (and Stephen O’Grady sees and Roman Shaposhnik sees), and he has taken a really positive action by starting the Sustainable Free and Open Source Communities project. This project has a lot to be said for it: it explicitly focuses on building community; it emphasizes social contracts; it seeks longevity for the open source artifacts; it shows the way to viable business models; it rejects copyright assignment to a corporate entity. Adam’s efforts can serve to clear our collective head, and to focus on what’s really important: the health of the communities around open source. By focusing on longevity, we can plainly see restrictive licensing as the death warrant that it is, shackling the fate of a community to that of a company. (Viz. after the company behind AGPL-licensed RethinkDB capsized, it took the Linux Foundation buying the assets and relicensing them to rescue the community.) Best of all, it’s written by someone who has built a business that has open source software at its heart. Adam has endured the challenges of the open core model, and is refreshingly frank about its economic and psychic tradeoffs. And if he doesn’t make it explicit, Adam’s fundamental optimism serves to remind us, too, that any perceived “danger” to open source is overblown: open source is going to endure, as no company is going to be able to repeal the economics of software. That said, as we collectively internalize that open source is not a business model on its own, we will likely see fewer VC-funded open source companies (though I’m honestly not sure that that’s a bad thing).

I don’t think that it’s an accident that Adam, Stephen, Roman and I see more or less the same thing and are more or less the same age: not only have we collectively experienced many sides of this, but we are at once young enough to still recall our own idealism, yet old enough to know that coercion never endures in the limit. In short, this too shall pass — and in the end, open source will survive its midlife questioning just as people in midlife get through theirs: by returning to its core values and by finding rejuvenation in its communities. Indeed, we can all find solace in the fact that while life is finite, our values and our communities survive us — and that our engagement with them is our most important legacy.


New vs. Old versions of Trinity


tl;dr Yes.

XSEDE Computing Resources

Since I'm overdrawn on my new XSEDE allocation on the PSC Bridges LM (large memory) partition after receiving it on 11/26/2018, a report is required to apply for more.

In case you couldn't tell, I'm a really big fan of the PSC Bridges HPC! I've used 1,862 SU out of the 1,000 SU allocated to me on the LM partition, and 11,484 out of 13,000 SU on the RM (regular memory) partition.

In case it might benefit others, here is roughly the progress report that I will submit with my request.

Re-assembling 17 Fundulus killifish transcriptomes

Since receiving my allocation to the PSC Bridges LM partition on 11/26/2018, I have re-assembled 17 transcriptomes from 16 species of Fundulus killifish using the Trinity de novo transcriptome assembler version 2.8.4. I originally assembled transcriptomes from the 16 Fundulus species using Trinity version 2.2.0.

There is some evidence that assemblies can be improved by using the updated versions of Trinity. I wanted to re-assemble these killifish transcriptomes to see if the assemblies could be improved.

These transcriptomes will be used to study the history of adaptation to different salinities. Some Fundulus species can tolerate a range of salinities (euryhaline) by switching osmoregulatory mechanisms, while others require a narrower salinity range (stenohaline) in either fresh or marine waters. This large set of RNA-seq data from multiple species is going to help us better characterize the divergence of these molecular osmoregulatory mechanisms between euryhaline and stenohaline freshwater species. To that end, I am attempting to develop a pipeline for orthologous gene expression profiling across species.

This profiling analysis will enable us to compare expression patterns of salinity-responsive genes, e.g. aquaporin 3 (shown below) across clades of species.

The fragmented nature of the assembly output from Trinity makes this pipeline especially challenging to develop. We want to make sure that the reference transcriptomes are the best they can be, so that expression profiles are accurately reflected across species.

Are there differences between the v2.2.0 and v2.8.4 assemblies?

Yes. Are they better? It seems so. But, there are a few items to investigate further.

Below are comparisons of various evaluation metrics between the 17 old v2.2.0 (blue) and new v2.8.4 (green) Trinity assemblies. On the left are slope graphs comparing the metric between both assemblies and on the right are split violin plots showing the distributions of all the assemblies.

Shout out to Dr. Harriet Alexander for helping with the code to visualize these types of comparisons for our paper comparing re-assemblies for the MMETSP - coming out soon in GigaScience!

Code for making these plots is here.

There are fewer contigs.

However, the large numbers (>100,000 contigs) indicate that these assemblies are still very fragmented. Orthogroup/ortholog prediction is required for downstream analysis.

The BUSCO scores are higher.

What makes these assemblies especially appealing is their higher scores with the Benchmarking Universal Single-Copy Ortholog (BUSCO) assessment tool. One species, Lucania goodei had 100%! The distribution of scores is tighter for the new assemblies compared to the old.

Lower % ORF?

For some reason, mean % ORF decreases. This was not expected.

Number of contigs with ORF

This is roughly the same and a very low number.

Similar unique k-mers

The number of unique k-mers (k=25) does not really change. This means that the content is similar.

Conditional Reciprocal Best Blast

Using transrate in --reference mode to examine the comparative metrics with Conditional Reciprocal Best BLAST (CRBB) between assemblies consistently did not work, for some reason (whereas it did work for comparison against the NCBI version of the sister species Fundulus heteroclitus). This requires further investigation.

Version:

Transrate v1.0.3
by Richard Smith-Unna, Chris Boursnell, Rob Patro,
   Julian Hibberd, and Steve Kelly

Command:

transrate --assembly=/pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta --reference=/pylon5/bi5fpmp/ljcohen/kfish_assemblies_old/F_parvapinis.trinity_out.Trinity.fasta --threads=8 --output=/pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/

Output:

[ INFO] 2018-12-06 22:23:47 : Loading assembly: /pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta
[ INFO] 2018-12-06 22:24:45 : Analysing assembly: /pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta
[ INFO] 2018-12-06 22:24:45 : Results will be saved in /pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/F_parvapinis.trinity_out.Trinity
[ INFO] 2018-12-06 22:24:45 : Calculating contig metrics...
[ INFO] 2018-12-06 22:25:37 : Contig metrics:
[ INFO] 2018-12-06 22:25:37 : -----------------------------------
[ INFO] 2018-12-06 22:25:37 : n seqs                       298549
[ INFO] 2018-12-06 22:25:37 : smallest                        183
[ INFO] 2018-12-06 22:25:37 : largest                       27771
[ INFO] 2018-12-06 22:25:37 : n bases                   310786992
[ INFO] 2018-12-06 22:25:37 : mean len                    1040.98
[ INFO] 2018-12-06 22:25:37 : n under 200                      15
[ INFO] 2018-12-06 22:25:37 : n over 1k                     77121
[ INFO] 2018-12-06 22:25:37 : n over 10k                      802
[ INFO] 2018-12-06 22:25:37 : n with orf                    62453
[ INFO] 2018-12-06 22:25:37 : mean orf percent              43.26
[ INFO] 2018-12-06 22:25:37 : n90                             340
[ INFO] 2018-12-06 22:25:37 : n70                            1141
[ INFO] 2018-12-06 22:25:37 : n50                            2512
[ INFO] 2018-12-06 22:25:37 : n30                            4111
[ INFO] 2018-12-06 22:25:37 : n10                            6966
[ INFO] 2018-12-06 22:25:37 : gc                             0.46
[ INFO] 2018-12-06 22:25:37 : bases n                           0
[ INFO] 2018-12-06 22:25:37 : proportion n                    0.0
[ INFO] 2018-12-06 22:25:37 : Contig metrics done in 52 seconds
[ INFO] 2018-12-06 22:25:37 : No reads provided, skipping read diagnostics
[ INFO] 2018-12-06 22:25:37 : Calculating comparative metrics...
[ INFO] 2018-12-06 23:23:24 : Comparative metrics:
[ INFO] 2018-12-06 23:23:24 : -----------------------------------
[ INFO] 2018-12-06 23:23:24 : CRBB hits                         0
[ INFO] 2018-12-06 23:23:24 : n contigs with CRBB               0
[ INFO] 2018-12-06 23:23:24 : p contigs with CRBB             0.0
[ INFO] 2018-12-06 23:23:24 : rbh per reference               0.0
[ INFO] 2018-12-06 23:23:24 : n refs with CRBB                  0
[ INFO] 2018-12-06 23:23:24 : p refs with CRBB                0.0
[ INFO] 2018-12-06 23:23:24 : cov25                             0
[ INFO] 2018-12-06 23:23:24 : p cov25                         0.0
[ INFO] 2018-12-06 23:23:24 : cov50                             0
[ INFO] 2018-12-06 23:23:24 : p cov50                         0.0
[ INFO] 2018-12-06 23:23:24 : cov75                             0
[ INFO] 2018-12-06 23:23:24 : p cov75                         0.0
[ INFO] 2018-12-06 23:23:24 : cov85                             0
[ INFO] 2018-12-06 23:23:24 : p cov85                         0.0
[ INFO] 2018-12-06 23:23:24 : cov95                             0
[ INFO] 2018-12-06 23:23:24 : p cov95                         0.0
[ INFO] 2018-12-06 23:23:24 : reference coverage              0.0
[ INFO] 2018-12-06 23:23:24 : Comparative metrics done in 3467 seconds
[ INFO] 2018-12-06 23:23:24 : -----------------------------------
[ INFO] 2018-12-06 23:23:24 : Writing contig metrics for each contig to /pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/F_parvapinis.trinity_out.Trinity/contigs.csv
[ INFO] 2018-12-06 23:23:43 : Writing analysis results to assemblies.csv

Conclusion

Trinity v2.8.4 is better than v2.2.0. While v2.8.4 still produces very fragmented assemblies, the higher BUSCO content is exciting.

There are questions requiring further investigation

  • Why, if the unique k-mer content is similar, would the BUSCO scores improve between versions?
  • The ORF content (number of contigs with ORF and mean ORF %) is paradoxical. Why would the ORF content decrease in the newer assemblies?
  • Why wouldn't transrate --reference work to get CRBB metrics between these assemblies?

What has improved in Trinity v2.8.4?

According to the Trinity release notes, major improvements include using salmon expression quantification to help filter out assembly artifacts and overhauling the "supertranscript" module to deal with highly polymorphic situations. Since killifish are highly heterozygous, we are likely benefiting from these improvements.


Not Understanding



My school theme week continues!
