
New vs. Old versions of Trinity


tl;dr Yes.

XSEDE Computing Resources

I'm overdrawn on my new XSEDE allocation on the PSC Bridges LM (large memory) partition, which I received on 11/26/2018, so a progress report is required to apply for more.

In case you couldn't tell, I'm a really big fan of the PSC Bridges HPC system! I've used 1,862 SU of the 1,000 SU allocated to me on the LM partition and 11,484 of 13,000 SU on the RM (regular memory) partition.

In case it might benefit others, here is roughly the progress report that I will submit with my request.

Re-assembling 17 Fundulus killifish transcriptomes

Since receiving my allocation to the PSC Bridges LM partition on 11/26/2018, I have re-assembled 17 transcriptomes from 16 species of Fundulus killifish using the Trinity de novo transcriptome assembler version 2.8.4. I originally assembled transcriptomes from the 16 Fundulus species using Trinity version 2.2.0.

There is some evidence that assemblies can be improved by using the updated versions of Trinity. I wanted to re-assemble these killifish transcriptomes to see if the assemblies could be improved.

These transcriptomes will be used for the purpose of studying the history of adaptation to different salinities. Some Fundulus species can tolerate a range of salinities (euryhaline) by switching osmoregulatory mechanisms while others require a more narrow salinity range (stenohaline) in either fresh or marine waters. This large set of RNAseq data from multiple species is going to help us better characterize the divergence of these molecular osmoregulatory mechanisms between euryhaline and stenohaline freshwater species. To that end, I am attempting to develop a pipeline for orthologous gene expression profiling analysis across species.

This profiling analysis will enable us to compare expression patterns of salinity-responsive genes, e.g. aquaporin 3 (shown below) across clades of species.

The fragmented nature of Trinity's assembly output makes developing this pipeline especially challenging. We want to make sure that the reference transcriptomes are the best that they can be so that expression profiles are accurately reflected across species.

Are there differences between the v2.2.0 and v2.8.4 assemblies?

Yes. Are they better? It seems so. But, there are a few items to investigate further.

Below are comparisons of various evaluation metrics between the 17 old v2.2.0 (blue) and new v2.8.4 (green) Trinity assemblies. On the left are slope graphs comparing the metric between both assemblies and on the right are split violin plots showing the distributions of all the assemblies.

Shout out to Dr. Harriet Alexander for helping with the code to visualize these types of comparisons for our paper comparing re-assemblies for the MMETSP - coming out soon in GigaScience!

Code for making these plots is here.

There are fewer contigs.

However, the large numbers of contigs (>100,000) indicate that these assemblies are still very fragmented. Orthogroup/ortholog prediction is required for downstream analysis.

The BUSCO scores are higher.

What makes these assemblies especially appealing is their higher scores with the Benchmarking Universal Single-Copy Ortholog (BUSCO) assessment tool. One species, Lucania goodei, had 100%! The distribution of scores is tighter for the new assemblies compared to the old.

Lower % ORF?

For some reason, mean % ORF decreases. This was not expected.

Number of contigs with ORF

This is roughly the same between versions, and remains a very low number.

Similar unique k-mers

The number of unique k-mers (k=25) does not really change. This means that the content is similar.
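As a sketch of what this metric measures, counting distinct 25-mers reduces to collecting substrings into a set. (This is a simplification for illustration only; real k-mer counters typically canonicalize reverse complements and use far more memory-efficient structures.)

```python
def unique_kmers(seqs, k=25):
    """Count distinct k-mers across a set of contigs.

    A rough sketch of the content-similarity metric discussed above:
    if two assemblies share most of their k-mers, their underlying
    sequence content is similar even if contig structure differs.
    """
    kmers = set()
    for seq in seqs:
        # slide a window of width k over each contig
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return len(kmers)
```

Comparing this count between the v2.2.0 and v2.8.4 assemblies is what suggests the content is largely unchanged.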

Conditional Reciprocal Best Blast

Using transrate in --reference mode to examine comparative metrics with Conditional Reciprocal Best BLAST (CRBB) between assemblies consistently did not work, for some reason (whereas it did work for comparisons against the NCBI assembly of the sister species Fundulus heteroclitus). This requires further investigation.


Transrate v1.0.3
by Richard Smith-Unna, Chris Boursnell, Rob Patro,
   Julian Hibberd, and Steve Kelly


transrate --assembly=/pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta --reference=/pylon5/bi5fpmp/ljcohen/kfish_assemblies_old/F_parvapinis.trinity_out.Trinity.fasta --threads=8 --output=/pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/


[ INFO] 2018-12-06 22:23:47 : Loading assembly: /pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta
[ INFO] 2018-12-06 22:24:45 : Analysing assembly: /pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta
[ INFO] 2018-12-06 22:24:45 : Results will be saved in /pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/F_parvapinis.trinity_out.Trinity
[ INFO] 2018-12-06 22:24:45 : Calculating contig metrics...
[ INFO] 2018-12-06 22:25:37 : Contig metrics:
[ INFO] 2018-12-06 22:25:37 : -----------------------------------
[ INFO] 2018-12-06 22:25:37 : n seqs                       298549
[ INFO] 2018-12-06 22:25:37 : smallest                        183
[ INFO] 2018-12-06 22:25:37 : largest                       27771
[ INFO] 2018-12-06 22:25:37 : n bases                   310786992
[ INFO] 2018-12-06 22:25:37 : mean len                    1040.98
[ INFO] 2018-12-06 22:25:37 : n under 200                      15
[ INFO] 2018-12-06 22:25:37 : n over 1k                     77121
[ INFO] 2018-12-06 22:25:37 : n over 10k                      802
[ INFO] 2018-12-06 22:25:37 : n with orf                    62453
[ INFO] 2018-12-06 22:25:37 : mean orf percent              43.26
[ INFO] 2018-12-06 22:25:37 : n90                             340
[ INFO] 2018-12-06 22:25:37 : n70                            1141
[ INFO] 2018-12-06 22:25:37 : n50                            2512
[ INFO] 2018-12-06 22:25:37 : n30                            4111
[ INFO] 2018-12-06 22:25:37 : n10                            6966
[ INFO] 2018-12-06 22:25:37 : gc                             0.46
[ INFO] 2018-12-06 22:25:37 : bases n                           0
[ INFO] 2018-12-06 22:25:37 : proportion n                    0.0
[ INFO] 2018-12-06 22:25:37 : Contig metrics done in 52 seconds
[ INFO] 2018-12-06 22:25:37 : No reads provided, skipping read diagnostics
[ INFO] 2018-12-06 22:25:37 : Calculating comparative metrics...
[ INFO] 2018-12-06 23:23:24 : Comparative metrics:
[ INFO] 2018-12-06 23:23:24 : -----------------------------------
[ INFO] 2018-12-06 23:23:24 : CRBB hits                         0
[ INFO] 2018-12-06 23:23:24 : n contigs with CRBB               0
[ INFO] 2018-12-06 23:23:24 : p contigs with CRBB             0.0
[ INFO] 2018-12-06 23:23:24 : rbh per reference               0.0
[ INFO] 2018-12-06 23:23:24 : n refs with CRBB                  0
[ INFO] 2018-12-06 23:23:24 : p refs with CRBB                0.0
[ INFO] 2018-12-06 23:23:24 : cov25                             0
[ INFO] 2018-12-06 23:23:24 : p cov25                         0.0
[ INFO] 2018-12-06 23:23:24 : cov50                             0
[ INFO] 2018-12-06 23:23:24 : p cov50                         0.0
[ INFO] 2018-12-06 23:23:24 : cov75                             0
[ INFO] 2018-12-06 23:23:24 : p cov75                         0.0
[ INFO] 2018-12-06 23:23:24 : cov85                             0
[ INFO] 2018-12-06 23:23:24 : p cov85                         0.0
[ INFO] 2018-12-06 23:23:24 : cov95                             0
[ INFO] 2018-12-06 23:23:24 : p cov95                         0.0
[ INFO] 2018-12-06 23:23:24 : reference coverage              0.0
[ INFO] 2018-12-06 23:23:24 : Comparative metrics done in 3467 seconds
[ INFO] 2018-12-06 23:23:24 : -----------------------------------
[ INFO] 2018-12-06 23:23:24 : Writing contig metrics for each contig to /pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/F_parvapinis.trinity_out.Trinity/contigs.csv
[ INFO] 2018-12-06 23:23:43 : Writing analysis results to assemblies.csv


Trinity v2.8.4 is better than v2.2.0. While v2.8.4 still produces very fragmented assemblies, the higher BUSCO content is exciting.

There are questions requiring further investigation:

  • Why, if the unique k-mer content is similar, would the BUSCO scores improve between versions?
  • The ORF metrics (number of contigs with an ORF and mean ORF %) are paradoxical. Why would the ORF content decrease in the newer assemblies?
  • Why wouldn't transrate --reference work to get CRBB metrics between these assemblies?

What has improved in Trinity v2.8.4?

According to the release notes from Trinity, major improvements have included using salmon expression quantification to help with filtering out assembly artifacts and overhauling the "supertranscript module" to deal with high polymorphism situations. Since killifish are highly heterozygous, we are likely benefitting from these improvements.

Read the whole story
19 hours ago
Davis, CA
Share this story

Not Understanding

My school theme week continues!


Seqtk: code walkthrough


On Twitter, Zach Charlop-Powers asked me to give a code walkthrough for seqtk. This post does that, to a certain extent.

Seqtk is a fairly simple project. It uses two single-header libraries for hash table and FASTQ parsing, respectively. Its single .c file consists of mostly independent components, one for each seqtk command. I will start with the two single-header libraries.

Buffered stream reader

On Unix, the read() system call is the most basic function to read from a file. It is usually slow to read data byte by byte with it. That is why the C library provides the fread() series: fread() reads large chunks of data with read() into an internal buffer and returns smaller data blocks on request. It may dramatically reduce the number of expensive read() system calls and is mostly the preferred choice.

fread() is efficient. However, it only works with data streams coming from file descriptors, not from a zlib file handler, for example. More recent programming languages provide generic buffered readers. Take Go's bufio as an example: it demands a read()-like function from the user code, and provides a buffered single-byte reader and an efficient line reader in return. Such buffered functionality is harder to implement on your own.

The buffered reader in kseq.h predates Go, but the basic idea is similar. In this file, ks_getuntil() reads up to a delimiter such as a line break. It moves data with memcpy() and uses a single loop to test for the delimiter. Ten years ago when “kseq.h” was first implemented, zlib didn't support buffered I/O, and line reading with zlib was very slow. “kseq.h” is critical to the performance of FASTA/Q parsing.
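To make the idea concrete, here is a minimal Python sketch of such a buffered delimiter reader. The class and method names are hypothetical; kseq.h does this in C with memcpy() over whatever read()-like function the user supplies (e.g. zlib's gzread).

```python
import io

class BufferedDelimReader:
    """Toy kseq.h-style buffered reader: wraps any object exposing
    .read(n) and returns tokens up to a delimiter, refilling an
    internal buffer in large chunks instead of reading byte by byte."""

    def __init__(self, raw, bufsize=4096):
        self.raw = raw          # anything with read(n), e.g. a gzip handle
        self.buf = b""
        self.pos = 0
        self.bufsize = bufsize

    def getuntil(self, delim=b"\n"):
        """Return bytes up to (not including) delim, or None at EOF."""
        chunks = []
        while True:
            if self.pos >= len(self.buf):          # buffer exhausted: refill
                self.buf = self.raw.read(self.bufsize)
                self.pos = 0
                if not self.buf:                   # EOF
                    return b"".join(chunks) if chunks else None
            i = self.buf.find(delim, self.pos)     # one scan for the delimiter
            if i >= 0:
                chunks.append(self.buf[self.pos:i])
                self.pos = i + 1
                return b"".join(chunks)
            chunks.append(self.buf[self.pos:])     # delimiter not in buffer:
            self.pos = len(self.buf)               # keep the tail and refill
```

With a large bufsize the underlying read() is called rarely, which is the whole point of the buffering.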

FASTA/Q parser

The parser parses FASTA and FASTQ at the same time. It looks for ‘@’ or ‘>’ if one hasn’t been read, and then reads the name and comment. To read the sequence, the parser first reads the first character on a line. If the character is ‘+’ or indicates a FASTA/Q header, the parser stops; if not, it reads the rest of the line into the sequence buffer. If the parser stops at a FASTA/Q header, it returns the sequence as a FASTA record and notes that the header character has been read, so the parser need not look for it for the next record. If the parser stops at ‘+’, it skips the rest of the line and starts to read quality strings line by line until the quality string is no shorter than the sequence.

The parser returns an error if it reaches the end of file before reading enough quality, or if the quality string turns out to be longer than the sequence. Given a malformatted FASTA/Q file, the parser won’t lead to memory violations except when there is not enough memory.
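A rough Python sketch of that state machine, simplified to operate over an iterator of stripped lines rather than the byte-level buffered reader kseq.h actually uses (the function name and the line-based simplification are mine, not seqtk's):

```python
def parse_fastx(lines):
    """Yield (name, sequence, quality-or-None) from FASTA/FASTQ lines,
    mirroring the kseq.h logic: find a header, accumulate sequence
    lines, and on '+' read quality until it is >= the sequence length."""
    it = iter(lines)
    header = None
    for line in it:                      # look for the first '>' or '@'
        if line and line[0] in ">@":
            header = line
            break
    while header is not None:
        name = header[1:].split()[0]
        seq_parts = []
        qual = None
        next_header = None
        for line in it:
            if not line:
                continue
            if line[0] == "+":           # FASTQ: quality lines follow
                qual_parts, qlen = [], 0
                seq_len = sum(map(len, seq_parts))
                for qline in it:         # read until quality covers the seq
                    qual_parts.append(qline)
                    qlen += len(qline)
                    if qlen >= seq_len:
                        break
                qual = "".join(qual_parts)
                break
            if line[0] in ">@":          # next record's header: FASTA done;
                next_header = line       # remember it, no need to re-scan
                break
            seq_parts.append(line)
        yield name, "".join(seq_parts), qual
        if qual is not None:             # after quality, the next non-empty
            next_header = None           # line (if any) is the next header
            for line in it:
                if line:
                    next_header = line
                    break
        header = next_header
```

The "remember the header we already read" step is the same trick the real parser uses to avoid re-scanning for ‘>’ or ‘@’.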

A basic tip on fast file parsing: read by line or by chunk, not by byte. Even with a buffered reader, using fgetc() etc. to read every byte is slow. In fact, it is possible to make the FASTA/Q parser faster by reading chunks of fixed size, but the current parser is fast enough for typical FASTA/Q.

Hash table

File khash.h implements an open-addressing hash table with power-of-2 capacity and quadratic probing. It uses a 2-bit-per-bucket meta table to indicate whether a bucket is used or deleted. The query and insertion operations are fairly standard. There are no tricks. Rehashing in khash is different from other libraries, but that is not an important aspect.

Both “khash.h” and “kseq.h” heavily depend on C macros. They look ugly. Unfortunately, in C, that is the only way to achieve the performance of type-specific code.


The only performance-related trick I can think of in seqtk.c is the tables to map nucleotides to integers. It is commonly used elsewhere. Another convenience-related trick is isatty(). This function can test if there is an incoming stream from the standard input. Gzip probably uses this function, too.
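The lookup-table trick can be sketched like this (a Python rendering of the idea; seqtk's actual table is a static 256-entry C array, and the function names here are mine):

```python
def make_nt4_table():
    """Build the classic 256-entry nucleotide table: each byte maps to
    0-3 for A/C/G/T (either case) and 4 for anything else, so encoding
    a sequence costs one table lookup per character."""
    table = [4] * 256
    for i, base in enumerate("ACGT"):
        table[ord(base)] = i
        table[ord(base.lower())] = i
    return table

NT4 = make_nt4_table()

def encode(seq):
    """Map a nucleotide string to 2-bit-style integer codes."""
    return [NT4[ord(c)] for c in seq]
```

Branch-free encoding like this is why the table shows up in nearly every sequence-processing tool.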

Seqtk.c also implements a simple 3-column BED reader and comes with a Mersenne Twister pseudorandom number generator (PRNG). That PRNG is a mistake, though seqtk doesn’t need a good PRNG anyway.

The rest of seqtk consists of mostly independent functions, each implementing a seqtk command. I will briefly explain a couple of them. “trimfq” uses a modified Mott algorithm (please search for the text “Mott algorithm” in phred.doc). I think this is a much cleaner and more theoretically sound algorithm than most ad hoc methods in various read trimmers. The “sample” command takes advantage of reservoir sampling. The core implementation only takes two lines. You can in fact sample a text file with an awk one-liner:

cat file.txt | awk -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'
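The same reservoir-sampling idea, written out in Python for readability (a sketch of the technique, not seqtk's actual C implementation):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown
    length: fill the reservoir with the first k items, then replace a
    random slot with item i+1 with probability k/(i+1)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # uniform over [0, i]
            if j < k:
                sample[j] = item           # evict a random reservoir slot
    return sample
```

Each item ends up in the sample with probability exactly k/n, without knowing n in advance, which is why it suits streaming FASTA/Q input.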

Concluding remarks

This is the first time I present a code walkthrough in a blog post. Not sure if it is helpful or even qualified as a walkthrough…




Somewhere out there, in this universe or another, there’s a third Aliens movie in which we discover that the monsters from the first two films are precursors to an intelligent, civilized species. When they want to colonize a new world, they drop a few eggs on it and let the gnarly beasts eliminate potential threats, then turn into a third phase that has poetry and technology and what-not. What drives the film is Ripley’s reaction to this discovery: can she set aside the horrors she has been through and make peace (however grudging) with the aliens, or is the trauma too much to overcome?

Somewhere else, in this universe or another, there’s a fourth Mad Max movie that picks up right where Thunderdome left off. Max walks out of the desert into a peaceful little town on the coast with green lawns and well-fed people. When the war ground to a halt, a Russian nuclear submarine deliberately beached itself, and her captain made a deal with the townspeople: we give you electricity, you give us a home, because ours is now a smoking pile of radioactive rubble. All goes well until a ship comes over the horizon. She’s the last of the US fleet; her captain (I picture George C. Scott with an eyepatch) has one bomb left, and as far as he’s concerned, the war isn’t over until he says it’s over. He’s Ahab, and that beached submarine is his whale, and once again, what drives the film is the question of whether people can put their pasts behind them.

And somewhere, not far from here, Supernatural hasn’t been repeating itself for season after pointless, empty season. Instead, after the great arc of the first five seasons ended, the show went back and re-did episodes from the points of view of incidental characters. They sorted through the fan theories that explained away continuity glitches, picked the best (or at least the most entertaining), and firmed up the show’s canon while showing us all what it’s like to be the poor minimum-wage bastard at the gas station where the Winchesters keep showing up and causing mayhem.

But most of all, somewhere out there, the Russos gave Bruce Banner and Natalia Romanova three minutes—just three minutes—to talk quietly with each other on the flight to Wakanda. Johansson and Ruffalo are two of the best actors working today, and I’m sure they could have broken our hearts if only they’d been given a chance.


Keeping It WEIRD


There was a great article over on The Conversation the other day about how poor scientific research is at considering people with different viewpoints and backgrounds.

Specifically, the vast majority of what we know about human psychology and behavior comes from studies conducted with a narrow slice of humanity – college students, middle-class respondents living near universities and highly educated residents of wealthy, industrialized and democratic nations.

To illustrate the extent of this bias, consider that more than 90 percent of studies recently published in psychological science’s flagship journal come from countries representing less than 15 percent of the world’s population…. Given that these typical participants are often outliers, many scholars now describe them and the findings associated with them using the acronym WEIRD, for Western, educated, industrialized, rich and democratic.

First, I love the WEIRD acronym, and I’m a little surprised I haven’t heard it before, or, if I have, that it didn’t stick.

More seriously though, focusing only on the WEIRD can have a damaging impact as we use research to guide how we parent, how we teach and how we interact with others. The article gives an excellent example of how even a pretty widely accepted, and simple, pattern test can lead us astray.

Consider an apparently simple pattern recognition test commonly used to assess the cognitive abilities of children. A standard item consists of a sequence of two-dimensional shapes – squares, circles and triangles – with a missing space. A child is asked to complete the sequence by choosing the appropriate shape for the missing space.

When 2,711 Zambian schoolchildren completed this task in one recent study, only 12.5 percent correctly filled in more than half of shape sequences they were shown. But when the same task was given with familiar three-dimensional objects – things like toothpicks, stones, beans and beads – nearly three times as many children achieved this goal (34.9 percent). The task was aimed at recognizing patterns, not the ability to manipulate unfamiliar two-dimensional shapes. The use of a culturally foreign tool dramatically underestimated the abilities of these children.

Naturally, this made me think about the research that we have done thus far to push the web forward. Most of it is significantly less formal than the studies conducted by, for example, the psychology community the author was focusing on. So there’s already likely a few more gaps and oversights built in. Now throw in the inherent bias in the results, and it’s a little frightening, isn’t it?

Moving beyond the WEIRD is critical not just in scientific research, but in our own more web-centric research. We’ve known for a while that the worldwide web was becoming increasingly that: worldwide. As we try to reach people in different parts of the globe with very different daily realities, we have to be willing to rethink our assumptions. We have to be willing to revisit our research and findings with fresh eyes so that we can see what holds true, what doesn’t, and where.

Just how much are we overlooking?

We have anecdotal evidence that the way we view forms and shipping is overly simplistic. What other assumptions do we make in the usability of form controls that may be leaving folks out? What does a truly globally accessible form look like?

We know that data costs can be prohibitive in many parts of the globe, leading folks to have to get creative with things like local caching servers to afford to get online. We’ve started to focus less on page weight in WEIRD environments, but is that true of folks in other areas? Do the performance metrics we’re zeroing in on still represent the user experience in different situations?

There’s been some work done on better understanding what people expect from the web; I certainly don’t want to imply that there hasn’t. But the body of research is significantly smaller than the analysis based on the WEIRD. Much of what we do have is survey-based (versus more accurate forms of research) and speculation based on anecdotal evidence. I don’t think anyone could argue that we don’t still have a long way to go.

There always seems to be something new for the web to figure out, something that keeps us on our toes. Robyn Larsen has been talking a lot lately about how internationalization is our next big challenge, and I couldn’t agree more. One thing is certain: we have a lot to relearn.


Signaling openness


How do open projects signal their “openness” to the outside community? This is a really hard question, particularly because nowadays “open” has become a buzzword that doesn’t just signal a project’s position to the community, but is also used as a marketing term to increase support, users, or resources.

I was thinking about this the other day, so I decided to take to Twitter.

I was surprised at how much this question resonated with people. Here are a few highlights from the (very interesting) conversation that came out of that question.

Some discussion threads

Wishes vs. reality

Tal immediately brought up a really important point: many projects want to be inclusive and welcoming to others, but they don’t have time to do so.

I think this is an important distinction, and something that should be signaled clearly. On the one hand, if a person generally wants others to contribute to the project, then they’re at some degree of openness above a project that actively discourages this.

On the other hand, running open projects does take work, and a project that says “well I’d like to be open but can’t commit the time to do it” also isn’t that open in practice. No hard feelings there, but I think that the goal of defining a “degree of openness” isn’t to signal a value judgment on the people related to the project, but on the project itself. If you really want to grow an open community around a project, you need to dedicate time and resources to the community itself, not just the technical pieces of the tool.

Metrics of openness

That leaves open the question: “how do we measure the practical openness of a project, rather than just what it says?”. A few folks mentioned that the CHAOSS project does a lot of work in this general space:

CHAOSS defines standards for metrics to collect about communities. They don’t necessarily say what others should do with those metrics, so perhaps that’s on the open community to define for themselves.

Personally, I’d love to see more tooling that makes it possible to scrape activity statistics from open repositories. Tal and others suggested a few things:

  • time to initial response to new issues (maybe separated by new vs. old contributors)
  • inequality coefficient for contributor commits
  • number of unique organizations/email domains in contributors
  • use of positive/welcoming language
  • explicit roles defined, and pathways towards working more with the community

I’d love to see more thoughts along these lines. If we could define a collection of metrics around openness, it’d paint a much richer picture than simply “does this project have a permissive license.”
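As a sketch of how one of the suggested metrics could be computed, here is a small helper for time-to-first-response. The ISO 8601 timestamp format matches what, for example, the GitHub issues API returns, but the function names and the idea of summarizing with a median are my own illustration, not an existing tool:

```python
from datetime import datetime

ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"   # e.g. "2018-12-06T22:23:47Z"

def time_to_first_response(issue_created, first_comment_created):
    """Hours between an issue's creation and its first comment."""
    t0 = datetime.strptime(issue_created, ISO_FMT)
    t1 = datetime.strptime(first_comment_created, ISO_FMT)
    return (t1 - t0).total_seconds() / 3600.0

def median_response_hours(pairs):
    """Median over (issue_created, first_comment_created) pairs --
    a more robust summary than the mean for skewed response times."""
    hours = sorted(time_to_first_response(a, b) for a, b in pairs)
    n = len(hours)
    mid = n // 2
    return hours[mid] if n % 2 else (hours[mid - 1] + hours[mid]) / 2
```

Splitting the pairs by new versus returning contributors, as suggested above, would then be a matter of grouping before calling the summary.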

There was also a specific metric around governance that’s worth highlighting:

The paper linked above is a study that investigated “open governance” in a number of open-source mobile projects. It’s an interesting exploration of the ways that decision-making is made (and signaled) in several projects. Perhaps unsurprisingly, they conclude that “more open” projects are most-likely to be successful in the long term (with a few exceptions).

Finally, apparently there’s also a “badge” to signal the status of a repository (is it active, vaporware, abandoned, etc.):

I’d love to see more of these semi-automated signals to help guide the open source community in deciding what projects to adopt and contribute to. As more and more people do their work online and in the open, it also creates a challenge of sifting through the noise to make the most of your (limited) time and energy. Having better metrics like these will make these decisions easier.

Mozilla’s archetypes of open projects

One of the most fascinating links I found was Mozilla’s “archetypes of open projects” document:

Briefly, this is an internal document that Mozilla made public. It attempts to define the different kinds of open projects that exist. Importantly, it also explains the value propositions of each, how it can be used strategically within an organization, and how it supports (or doesn’t) an open community around it.

I added some thoughts about how Project Jupyter fits into these archetypes on the Jupyter governance research issue and I’d love to think more about how these archetypes fit into the pre-existing open communities that are out there. If anybody wants to brainstorm how these archetypes fit into the scientific open community, I’d love to chat :-)

On that note, I want to give a brief shout-out to Mozilla in general, which has either conducted or sponsored a bunch of interesting work in open projects. For example, they have a whole wiki dedicated to working openly:

and they also run lots of training and community programs such as the Mozilla Open Leaders program. Project Jupyter is in this year’s cohort and is keeping track of its progress here.

Importance of ethnography

A final note on the importance of ethnography:

For all of my talk about metrics above, I’ve come to appreciate that numbers are never sufficient to describe the complexities of a community or group. Over the last several years at the Berkeley Institute for Data Science, I’ve had the pleasure of working with several ethnographers who have shared their perspective on how to study communities. Semi-automatically calculated numbers can be a great way to see a relatively coarse view of a community, but if you really want to understand what’s going on, you need to dig in, conduct qualitative interviews, operate in the community, and create some stories that back up (or not) the quantitative data that you collect. We’d all be better off if there were more ethnographers in our respective communities <3.

OK, that’s enough for now - I hope these links are useful and I’ll try to update them over time if I hear of some new projects along these lines. If you have any suggestions, feel free to leave ‘em in the comments!
