
What open science is about


Today I got a pleasant surprise: Olga Botvinnik posted on Twitter about a poster she is presenting at the Beyond the Cell Atlas conference, and she name-dropped a bunch of people who helped her. The cool thing? They are all open source developers, and Olga interacted with them through GitHub to ask for features, report bugs and even submit pull requests.

That's what open science is about: collaboration, good practices, and in the end coming up with something that is larger than each individual piece. Now sourmash is better, bamnostic is better, reflow is better. I would like to see this becoming more and more common =]



Push versus Pull


Once in a while you’ll hear of someone doing a digital detox, which implies there’s something toxic about being digital. And there can be, but “digital” misdiagnoses the problem. The problem mostly isn’t digital technology per se but how we use it.

I think the important distinction isn’t digital vs. analog, but rather push vs. pull, or passive vs. active. When you’re online, companies are continually pushing things at you: ads, videos, songs, shopping recommendations, etc. You either passively accept whatever is pushed at you, and feel gross after a while, or you exert willpower to resist what is being pushed at you, and feel tired.

Information overload

I find it relaxing to walk into a library with millions of books. There’s an enormous amount of information in a library, but it’s not being streamed at you. You have to actively access it. An electronic catalog is far easier to use than an analog card catalog, and the introduction of digital technology does not induce stress. If anything, it reduces stress. (As long as the catalog is not down for maintenance.)

A single web page can induce a stronger sense of information overload than an entire library, even though the former contains a negligible amount of information compared to the latter.

Twitter vs RSS

Twitter can be stressful in a way that RSS is not. Both are digital, but RSS is more active and Twitter is more passive.

RSS gives you content that you have deliberately subscribed to. Your Twitter stream contains updates from people you have chosen to follow, but also unwanted content. This unwanted content comes in several forms: unwanted content from people you chose to follow, retweets, and worst of all tweets that people you follow have “liked.” You can turn off retweets from people you follow, but you can’t avoid likes [1]. Twitter also has ads, but I find ads less annoying than the other unwanted content.

When an item shows up in your RSS feed you make a choice whether to open it. But Twitter content arrives already opened, including photos. I’ll subscribe to someone’s RSS feed even if I’m interested in only one out of twenty of their posts because it is so easy to simply not read the posts you’re not interested in. But if you’re only interested in one out of twenty things people say on Twitter, then your stream is 95% unwanted content.

Instant messaging vs Email

Instant messaging and text messages are more stressful than email, at least in my opinion. This is another example of passive versus active. The more active option, while perhaps less convenient, is also less stressful.

IDEs vs editors

An IDE (integrated development environment) is a program like Visual Studio that helps you write software. There are scores of menus, buttons, and dialogs to guide you in developing your code. If you’re doing the kind of software development an IDE is designed for, it can be very useful. But I also find it stressful. I feel like options are calling out “Pick me! Pick me!”

Text editors stay out of your way, but they also don’t offer any help. The Visual Studio IDE and the Emacs editor are both enormous programs, but the former feels more passive and stressful to me. Emacs, for better and for worse, is more active. It has thousands of commands, but they’re not staring at you on buttons. You have to type them. This makes it much harder to discover new features, but it also makes the software more peaceful to use.

Here’s what the two programs look like when you open them. First Visual Studio:

Visual Studio 2015 screen shot

And now Emacs:

Emacs screen shot

Digital vs online

Using a computer is not the same thing as being online. As far as I know, nobody talked about the need for a digital detox before the web. People who say they’re worn out by digital technology are mostly worn out by social media. Computers have a few other uses besides being social media portals.

In the television series Battlestar Galactica, the protagonists had a rule that computers must not be networked. Computers were essential, but they must never be networked, in order to prevent attack from Cylon androids. Some people have a sort of personal Battlestar Galactica rule, working for long periods of time without an internet connection.

An alternative is to make disciplined use of an internet connection, for example, using it for email but not for social media. Unplugging the network cable takes less decision making and less discipline, but it’s harder to do. For example, it’s common for software to not have local documentation, so you may need to go online for help.


Much of the stress attributed to digital technology comes from passive use of the technology rather than the technology itself. There are benefits to walking away from computers periodically that this post hasn’t discussed, but most of the benefits of a digital detox come from a social media detox.

[1] Now you can block likes: here’s how.

1 public comment
19 days ago
Good analogies. I haven't ever felt the need to "digitally detox" but I regularly go through all my device notifications, and make sure to enable "night mode" (no notifications during sleeping hours) and I think that is helpful.
Denver, CO

New crate: nthash


A quick announcement: I wrote a Rust implementation of ntHash and published it on crates.io. It implements an Iterator to take advantage of the rolling properties of ntHash that make it so useful in bioinformatics (where we work a lot with sliding windows over sequences).

It's a pretty small crate, and it was probably a better project for learning Rust than doing a sourmash implementation because it doesn't involve gnarly FFI issues. I also added some docs, benchmarks using criterion, and even an oracle property-based test with quickcheck.
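To make the rolling idea concrete, here is a minimal sketch of a rolling-hash iterator over k-mers, together with a slow recompute-from-scratch function that can serve as an oracle. This is not the code from the nthash crate: the seed values and the names `seed`, `hash_kmer` and `RollingHashes` are made up for illustration, and the sketch hashes the sequence as given, ignoring reverse complements.

```rust
/// Hypothetical per-base seeds (placeholders, NOT the ntHash constants).
fn seed(base: u8) -> u64 {
    match base {
        b'A' => 0x9E37_79B9_7F4A_7C15,
        b'C' => 0xC2B2_AE3D_27D4_EB4F,
        b'G' => 0x1656_67B1_9E37_79F9,
        b'T' => 0x27D4_EB2F_1656_67C5,
        _ => 0,
    }
}

/// Slow oracle: hash one k-mer from scratch by XOR-ing rotated seeds.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let k = kmer.len();
    kmer.iter().enumerate().fold(0u64, |h, (j, &b)| {
        h ^ seed(b).rotate_left(((k - 1 - j) % 64) as u32)
    })
}

/// Iterator yielding one hash per k-mer window, updating in O(1) per step.
struct RollingHashes<'a> {
    seq: &'a [u8],
    k: usize,
    pos: usize, // start of the next window to hash
    current: u64,
}

impl<'a> RollingHashes<'a> {
    fn new(seq: &'a [u8], k: usize) -> Self {
        RollingHashes { seq, k, pos: 0, current: 0 }
    }
}

impl<'a> Iterator for RollingHashes<'a> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.k == 0 || self.pos + self.k > self.seq.len() {
            return None;
        }
        if self.pos == 0 {
            // First window: no previous hash to roll from.
            self.current = hash_kmer(&self.seq[..self.k]);
        } else {
            let outgoing = self.seq[self.pos - 1];
            let incoming = self.seq[self.pos + self.k - 1];
            // Rotate the old hash, cancel the base that left the window,
            // and mix in the base that entered it.
            self.current = self.current.rotate_left(1)
                ^ seed(outgoing).rotate_left((self.k % 64) as u32)
                ^ seed(incoming);
        }
        self.pos += 1;
        Some(self.current)
    }
}
```

An oracle test then just checks that `RollingHashes::new(seq, k)` yields the same values as mapping `hash_kmer` over `seq.windows(k)`, which is the kind of property a tool like quickcheck can exercise with random inputs.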

More info in the docs, and if you want an optimization versioning bug discussion be sure to check the ntHash bug? repo, which has a (slow) Python implementation and a pretty nice analysis notebook.



Oxidizing sourmash: WebAssembly


sourmash calculates MinHash signatures for genomic datasets, meaning we are reducing the data (via subsampling) to a small representative subset (a signature) capable of answering one question: how similar is this dataset to another one? The key here is that a dataset with 10-100 GB will be reduced to something in the megabytes range, and two approaches for doing that are:

  • The user installs our software on their computer. This is not so bad anymore (yay bioconda!), but it still requires knowledge about command line interfaces and how to install all this stuff. The user data never leaves their computer, and they can share the signatures later if they want to.
  • Provide a web service to calculate signatures. In this case no software needs to be installed, but it's up to someone (me?) to maintain a server running with an API and frontend to interact with the users. On top of requiring more maintenance, another drawback is that the user needs to send me the data, which is very inefficient network-wise and leads to questions about what I can do with their raw data (and I'm not into surveillance capitalism, TYVM).
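As a rough illustration of what a signature is, whichever of these approaches computes it, here is a minimal bottom-n MinHash sketch: hash every k-mer, keep only the n smallest hashes, and estimate similarity from the kept values. Everything here is an assumption made for the example: `fnv1a` is a stand-in hash (not the one sourmash uses), and `sketch` and `similarity` are hypothetical names, not the sourmash API.

```rust
use std::collections::BTreeSet;

/// Stand-in 64-bit hash (FNV-1a); chosen only because it is short.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Keep the `n` smallest k-mer hashes of `seq`: this is the "signature".
fn sketch(seq: &[u8], k: usize, n: usize) -> BTreeSet<u64> {
    if k == 0 {
        return BTreeSet::new();
    }
    let all: BTreeSet<u64> = seq.windows(k).map(fnv1a).collect();
    // A BTreeSet iterates in ascending order, so `take(n)` is the bottom n.
    all.into_iter().take(n).collect()
}

/// Estimate Jaccard similarity from two sketches of size at most `n`:
/// look at the `n` smallest hashes of the union and count how many
/// of them appear in both sketches.
fn similarity(a: &BTreeSet<u64>, b: &BTreeSet<u64>, n: usize) -> f64 {
    let bottom: Vec<u64> = a.union(b).copied().take(n).collect();
    if bottom.is_empty() {
        return 0.0;
    }
    let shared = bottom
        .iter()
        .filter(|h| a.contains(*h) && b.contains(*h))
        .count();
    shared as f64 / bottom.len() as f64
}
```

The point of the exercise is the size: however long `seq` is, the sketch never holds more than `n` integers, which is why a signature is so much cheaper to ship around than the raw data.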

But... what if there is a third way?

What if we could keep the frontend code from the web service (very user-friendly) but do all the calculations client-side (and avoid the network bottleneck)? The main hurdle here is that our software is implemented in Python (and C++), which are not supported in browsers. My first solution was to write the core features of sourmash in JavaScript, but that quickly started hitting annoying things like JavaScript not supporting 64-bit integers. There is also the issue of having another codebase to maintain and keep in sync with the original sourmash, which would be a significant burden for us. I gave a lab meeting talk about this approach, using a drag-and-drop UI as proof of concept. It did work but it was finicky (dealing with the 64-bit integer hashes is not fun). The good thing is that at least I had a working UI for further testing [1].
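The 64-bit problem can be shown without leaving Rust. A JavaScript number is an IEEE 754 double, so it can only hold integers exactly up to 2^53; the sketch below checks whether a hash survives that conversion. The helper names are hypothetical, and the split into two 32-bit halves is just one example of the juggling a JavaScript port needs, not a description of what the prototype did.

```rust
/// True if `hash` survives a round trip through an f64, the numeric
/// type behind a plain JavaScript number.
fn fits_in_js_number(hash: u64) -> bool {
    // Compare as u128: a value that rounds up to 2^64 would otherwise be
    // saturated back to u64::MAX by the cast and look unchanged.
    (hash as f64) as u128 == u128::from(hash)
}

/// Split a 64-bit hash into (high, low) 32-bit halves, each of which
/// fits in a JavaScript number without loss.
fn split_hash(hash: u64) -> (u32, u32) {
    ((hash >> 32) as u32, hash as u32)
}

/// Rebuild the 64-bit hash from its two halves.
fn join_hash(high: u32, low: u32) -> u64 {
    (u64::from(high) << 32) | u64::from(low)
}
```

Every comparison, sort and merge on the JavaScript side then has to work on pairs instead of single numbers, which gives a feel for why that approach was finicky.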

In "Oxidizing sourmash: Python and FFI" I described my road to learning Rust, but something that I omitted was that around the same time the WebAssembly support in Rust started to look better and better, and it was a huge influence in my decision to learn Rust. Reimplementing the sourmash C++ extension in Rust and using the same codebase in the browser sounded very attractive, and now that it was working I started looking into how to use the WebAssembly target in Rust.


From the official site,

WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable target for compilation of high-level languages like C/C++/Rust, enabling deployment on the web for client and server applications.

You can write WebAssembly by hand, but the goal is to have it as a lower-level target for other languages. For me the obvious benefit is being able to use something that is not JavaScript in the browser, even though the goal is not to replace JS completely but to complement it in a big pain point: performance. This also frees JavaScript from being the target language for other toolchains, allowing it to grow in other important areas (like language ergonomics).

Rust is not the only language targeting WebAssembly: Go 1.11 includes experimental support for WebAssembly, and there are even projects bringing the scientific Python stack to the web using WebAssembly.

But does it work?

With the Rust implementation in place and with all tests working on sourmash, I added the finishing touches using wasm-bindgen and built an NPM package using wasm-pack: sourmash is a Rust codebase compiled to WebAssembly and ready to use in JavaScript projects.

(Many thanks to Madicken Munk, who also presented during SciPy about how they used Rust and WebAssembly to do interactive visualization in Jupyter and helped with a good example on how to do this properly =] )

Since I already had the working UI from the previous PoC, I refactored the code to use the new WebAssembly module and voilà! It works! [2] [3] It is still pretty basic (it doesn't even have a proper button to download the generated signature, FFS), but it is a starting point.

Next steps

The proof of concept works, but it is pretty useless right now. I'm thinking about building it as a Web Component and making it really easy to add to any webpage [4].

Another interesting feature would be supporting more input formats (the GMOD project implemented a lot of those!), but more features are probably better after something simple but functional is released =P

Next time!

Where will we go next? Maybe explore some decentralized web technologies like IPFS and dat, hmm? =]



  1. even if horrible, I need to get some design classes =P

  2. the first version of this demo only worked in Chrome because they implemented the BigInt proposal, which is not in the official language yet. The funny thing is that BigInt would have made the JS implementation of sourmash viable, and I probably wouldn't have written the Rust implementation =P. Turns out that I didn't need the BigInt support if I didn't expose any 64-bit integers to JS, and that is what I'm doing now.

  3. Along the way I ended up writing a new FASTQ parser... because it wouldn't be bioinformatics if I didn't, right? =P

  4. or maybe a React component? I really would like to have something that works independent of framework, but not sure what is the best option in this case...

28 days ago
Two posts in one week! \o/
Davis, CA

Oxidizing sourmash: Python and FFI


I think the first time I heard about Rust was because Frank McSherry chose it to write a timely dataflow implementation. Since then it started showing up more and more in my news sources, leading to Armin Ronacher publishing a post in the Sentry blog last November about writing Python extensions in Rust.

Last December I decided to give it a run: I spent some time porting the C++ bits of sourmash to Rust. The main advantage here is that it's a problem I know well, so I know what the code is supposed to do and can focus on figuring out syntax and the mental model for the language. I started digging into the symbolic codebase and understanding what they did, and tried to mirror or improve it for my use cases.

(About the post title: The process of converting a codebase to Rust is referred to as "Oxidation" in the Rust community, following the codename Mozilla chose for the process of integrating Rust components into the Firefox codebase. [1] Many of these components were tested and derived in Servo, an experimental browser engine written in Rust, and are being integrated into Gecko, the current browser engine (mostly written in C++).)

Why Rust?

There are other programming languages more focused on scientific software that could be used instead, like Julia [2]. Many programming languages start from a specific niche (like R and statistics, or Maple and mathematics) and grow into larger languages over time. While Rust's goal is not to be a scientific language, its focus on being a general purpose language allows a phenomenon similar to what happened with Python, where people from many areas pushed the language in different directions (system scripting, web development, numerical programming...), allowing developers to combine all these things in their systems.

But by far my biggest interest in Rust is the many best practices it brings to the default experience: integrated package management (with Cargo), documentation (with rustdoc), testing and benchmarking. It's understandable that older languages like C/C++ need more effort to support some of these features (like modules and a unified build system), since they are defined by a standard and need to keep backward compatibility with codebases that already exist. Nonetheless, the lack of these features increases the effort needed to have good software engineering practices, since you need to choose a solution that might not be compatible with other similar but slightly different options, leading to fragmentation and increasing the impedance to use these features.

Another big reason is that Rust doesn't aim to completely replace what already exists, but to complement and extend it. There are two very good talks about how to do this: one by Ashley Williams, another by E. Dunham.

Converting from a C++ extension to Rust

The current implementation of the core data structures in sourmash is in a C++ extension wrapped with Cython. My main goals for converting the code are:

  • support additional languages and platforms. sourmash is available as a Python package and CLI, but we have R users in the lab that would benefit from having an R package, and ideally we wouldn't need to rewrite the software every time we want to support a new language.

  • reduce the number of wheel packages necessary (one for each OS/platform).

  • in the long run, use the Rust memory management concepts (lifetimes, borrowing) to increase parallelism in the code.
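As a small illustration of that last goal, here is a sketch of how borrowing lets several threads read shared data with the compiler ruling out data races. It is only an example of the language feature, not a plan for sourmash: `parallel_min_hash` is a hypothetical helper, and it relies on `std::thread::scope`, which is part of the standard library in recent Rust releases.

```rust
use std::thread;

/// Find the smallest hash across several chunks, one thread per chunk.
/// Each thread only borrows its chunk immutably, so no locking is needed
/// and the compiler rejects any attempt to mutate `chunks` meanwhile.
fn parallel_min_hash(chunks: &[Vec<u64>]) -> Option<u64> {
    thread::scope(|s| {
        // Spawn everything first so the threads actually run concurrently.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| s.spawn(move || chunk.iter().copied().min()))
            .collect();

        // Join the threads and combine the per-chunk minimums.
        handles
            .into_iter()
            .filter_map(|handle| handle.join().unwrap())
            .min()
    })
}
```

The scope guarantees that every thread finishes before `chunks` goes out of scope, which is what allows plain references to be shared instead of reference-counted copies.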

Many of these goals are attainable with our current C++ codebase, and "rewrite in a new language" is rarely the best way to solve a problem. But the reduced burden in maintenance due to better tooling, on top of features that would require careful planning to execute (increasing the parallelism without data races) while maintaining compatibility with the current codebase are promising enough to justify this experiment.

Cython provides a nice gradual path to migrate code from Python to C++, since it is a superset of the Python syntax. It also provides low overhead for many C++ features, especially the STL containers, which makes it easier to map C++ features to the Python equivalent. For research software this also leads to faster exploration of solutions before having to commit to lower level code, but without a good process it might also lead to code never crossing into the C++ layer and being stuck in the Cython layer. This doesn't make any difference for a Python user, but it becomes harder for users of other languages to benefit from this code (since your language would need some kind of support for calling Python code, which is not as readily available as calling C code).

Depending on the requirements, a downside is that Cython is tied to the CPython API, so generating the extension requires a development environment set up with the appropriate headers and compiler. This also makes the extension specific to a Python version: while this is not a problem for source distributions, generating wheels leads to one wheel for each OS and Python version supported.

The new implementation

This is the overall architecture of the Rust implementation: It is pretty close to what symbolic does, so let's walk through it.

The Rust code

If you take a look at my Rust code, you will see it is very... C++. A lot of the code is very similar to the original implementation, which is both a curse and a blessing: I'm pretty sure that there are more idiomatic and performant ways of doing things, but most of the time I could lean on my current mental model for C++ to translate code. The biggest exception was the merge function, where I was doing something in the C++ implementation that the borrow checker didn't like. Eventually I found it was because it couldn't keep track of the lifetime correctly and putting braces around it fixed the problem, which was both an epiphany and a WTF moment. Here is an example that triggers the problem, and the solution.
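The linked example is not reproduced here, so the following is only a generic illustration of the "braces" trick, with a hypothetical `merge` function: a mutable borrow is confined to an inner block so that it is clearly over before the value is used again. Compilers with non-lexical lifetimes accept many such cases without the extra braces, so the block below is about making the scope explicit rather than about the exact error from the post.

```rust
/// Merge `other` into `mins`, keeping the values sorted and unique,
/// and return the new number of values.
fn merge(mins: &mut Vec<u64>, other: &[u64]) -> usize {
    {
        // The mutable reborrow only lives inside this block.
        let target = &mut *mins;
        target.extend_from_slice(other);
        target.sort_unstable();
        target.dedup();
    } // `target` goes out of scope here, so the borrow ends

    // `mins` is free to be used again.
    mins.len()
}
```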

"Fighting the borrow checker" seems to be a common theme while learning Rust, but the compiler really tries to help you to understand what is happening and (most times) how to fix it. A lot of people grow to hate the borrow checker, but I see it more as a 'eat your vegetables' situation: you might not like it at first, but it's better in the long run. Even though I don't have a big codebase in Rust yet, it keeps you from doing things that will come back to bite you hard later.

Generating C headers for Rust code: cbindgen

With the Rust library working, the next step was to take the Rust code and generate C headers describing the functions and structs we expose with the #[no_mangle] attribute in Rust (these are defined in the ffi.rs module in sourmash-rust). This attribute tells the Rust compiler to keep the symbol names unmangled; combined with extern "C", which selects the C ABI, it means the functions can be called from other languages that implement FFI mechanisms. FFI (the foreign function interface) is quite low-level, and pretty much defines things that C can represent: integers, floats, pointers and structs. It doesn't support higher level concepts like objects or generics, so in a sense it looks like a feature funnel. This might sound bad, but it ends up being something that other languages can understand without needing too much extra functionality in their runtimes, which means that most languages have support for calling code through an FFI.
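For readers who have not seen this pattern, here is a minimal sketch of what exporting a Rust type through a C-compatible interface can look like. It is not the content of ffi.rs: the `Sketch` struct and the `sketch_*` function names are hypothetical, and the design (an opaque heap pointer plus free functions) is just one common way to do it.

```rust
/// Hypothetical bottom-n sketch; callers only ever see a pointer to it.
pub struct Sketch {
    max: usize,
    mins: Vec<u64>, // kept sorted, without duplicates
}

/// Allocate a sketch on the heap and hand an opaque pointer to the caller.
/// (The newest Rust edition spells the attribute `#[unsafe(no_mangle)]`.)
#[no_mangle]
pub extern "C" fn sketch_new(max: u32) -> *mut Sketch {
    Box::into_raw(Box::new(Sketch { max: max as usize, mins: Vec::new() }))
}

/// Insert a hash, keeping only the `max` smallest values.
/// Safety: `ptr` must be null or come from `sketch_new` and not be freed.
#[no_mangle]
pub unsafe extern "C" fn sketch_add_hash(ptr: *mut Sketch, hash: u64) {
    let sketch = match unsafe { ptr.as_mut() } {
        Some(sketch) => sketch,
        None => return, // tolerate a null pointer
    };
    if let Err(pos) = sketch.mins.binary_search(&hash) {
        sketch.mins.insert(pos, hash);
        sketch.mins.truncate(sketch.max);
    }
}

/// Number of hashes currently stored (0 for a null pointer).
#[no_mangle]
pub unsafe extern "C" fn sketch_len(ptr: *const Sketch) -> usize {
    unsafe { ptr.as_ref() }.map_or(0, |sketch| sketch.mins.len())
}

/// Give the allocation back to Rust so it can be dropped.
#[no_mangle]
pub unsafe extern "C" fn sketch_free(ptr: *mut Sketch) {
    if !ptr.is_null() {
        drop(unsafe { Box::from_raw(ptr) });
    }
}
```

Only integers and pointers cross the boundary, which is the "feature funnel" in practice: the `Vec` and the ownership rules stay on the Rust side, and the caller has to remember to call the free function.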

Writing the C header by hand is possible, but it is very error prone. A better solution is to use cbindgen, a program that takes Rust code and generates a C header file automatically. cbindgen is developed primarily to generate the C headers for webrender, the GPU-based renderer for Servo, so it's pretty likely that if it can handle a complex codebase it will work just fine for the majority of projects.

Interfacing with Python: CFFI and Milksnake

Once we have the C headers, we can use the FFI to call Rust code in Python. Python has an FFI module in the standard library, ctypes, but the PyPy developers also created CFFI, which has more features.

The C headers generated by cbindgen can be interpreted by CFFI to generate a low-level Python interface for the code. This is the equivalent of declaring the functions/methods and structs/classes in a pxd file (in the Cython world): while the code is now usable in Python, it is not well adapted to the features and idioms available in the language.

Milksnake is the package developed by Sentry that takes care of running cargo for the Rust compilation and generating the CFFI boilerplate, making it easy to load the low-level CFFI bindings in Python. With this low-level binding available we can now write something more Pythonic (the pyx file in Cython), and I ended up just renaming the _minhash.pyx file back to minhash.py and doing one-line fixes to replace the Cython-style code with the equivalent CFFI calls.

All of these changes should be transparent to the Python code, and to guarantee that I made sure that all the current tests that we have (both for the Python module and the command line interface) are still working after the changes. It also led to finding some quirks in the implementation, and even improvements in the current C++ code (because we were moving a lot of data from C++ to Python).

Where I see this going

It seems it worked as an experiment, and I presented a poster at GCCBOSC 2018 and SciPy 2018 that was met with excitement by many people. Knowing that it is possible, I want to reiterate some reasons why Rust is pretty exciting for bioinformatics and science in general.

Bioinformatics as libraries (and command line tools too!)

Bioinformatics is an umbrella term for many different methods, depending on what analysis you want to do with your data (or model). In this sense, it's distinct from other scientific areas where it is possible to rely on a common set of libraries (numpy in linear algebra, for example), since a library supporting many disjoint methods tends to grow too big and hard to maintain.

The environment also tends to be very diverse, with different languages being used to implement the software. Because it is hard to interoperate, the methods tend to be implemented in command line programs that are stitched together in pipelines, a workflow describing how to connect the input and output of many different tools to generate results. Because the basic unit is a command-line tool, pipelines tend to rely on standard operating system abstractions like files and pipes to make the tools communicate with each other. But since tools might have input requirements distinct from what the previous tool provides, many times it is necessary to do format conversion or other adaptations to make the pipeline work.

Using tools as black boxes, controllable through specific parameters at the command-line level, makes exploratory analysis and algorithm reuse harder: if something needs to be investigated the user needs to resort to perturbations of the parameters or the input data, without access to the more feature-rich and meaningful abstraction happening inside the tool.

Even if many languages are used for writing the software, most of the time there is some part written in C or C++ for performance reasons, and these tend to be the core data structures of the computational method. Because it is not easy to package your C/C++ code in a way that other people can readily use it, most of this code is reinvented over and over again, or is copied and pasted into codebases where it starts diverging over time. Rust helps solve this problem with the integrated package management, and due to the FFI it can also be reused inside programs written in other languages.

sourmash is not going to be Rust-only and abandon Python, and it would be crazy to do so when Python has so many great exploratory tools for scientific discovery. But now we can also use our method in other languages and environments, instead of having our code stuck in one language.

Don't rewrite it all!

I could have gone all the way and rewritten sourmash in Rust [3], but it would be incredibly disruptive for the current sourmash users and it would take way longer to pull off. Because Rust is so focused on supporting existing code, you can do a slow transition and reuse what you already have while moving into more and more Rust code. A great example is this one-day effort by Rob Patro to bring CQF (a C codebase) into Rust, using bindgen (a generator of C bindings for Rust). Check the Twitter thread for more =]

Good scientific citizens

There is another MinHash implementation already written in Rust, finch. Early in my experiment I got an email from them asking if I wanted to work together, but since I wanted to learn the language I kept doing my thing. (They were totally cool with this, by the way). But the fun thing is that Rust has a pair of traits called From and Into that you can implement for your type, and so I did that and now we can have interoperable implementations. This synergy allows finch to use sourmash methods, and vice versa.
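To show the mechanism (not the actual sourmash or finch code), here is a minimal sketch with two made-up sketch types that store the same information under different field names. Implementing `From` in one direction gives the matching `Into` for free, so callers can convert with a plain `.into()`.

```rust
/// Hypothetical stand-in for one library's sketch type.
pub struct SketchA {
    pub ksize: u32,
    pub mins: Vec<u64>, // sorted, unique
}

/// Hypothetical stand-in for the other library's sketch type.
pub struct SketchB {
    pub kmer_len: u8,
    pub hashes: Vec<u64>, // no particular order
}

impl From<SketchB> for SketchA {
    fn from(other: SketchB) -> Self {
        // Reuse the allocation and normalize to what SketchA expects.
        let mut mins = other.hashes;
        mins.sort_unstable();
        mins.dedup();
        SketchA { ksize: u32::from(other.kmer_len), mins }
    }
}

impl From<SketchA> for SketchB {
    fn from(other: SketchA) -> Self {
        // Assumption for this example: k-mer sizes always fit in a u8.
        SketchB { kmer_len: other.ksize as u8, hashes: other.mins }
    }
}
```

With these two impls in place, `let a: SketchA = b.into();` works in either direction, and any function written against one type can accept data produced by the other.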

Maybe this sounds like a small thing, but I think it is really exciting. We can stop having incompatible but very similar methods, and instead all benefit from each other's advances in a way that is supported by the language.

Next time!

Turns out Rust supports WebAssembly as a target, so... what if we run sourmash in the browser? That's what I'm covering in the next blog post, so stay tuned =]


  1. The creator of the language is known to keep making up different explanations for the name of the language, but in this case "oxidation" refers to the chemical process that creates rust, and rust is the closest thing to metal (metal being the hardware). There are many terrible puns in the Rust community.

  2. Even more now that it hit 1.0, it is a really nice language

  3. and risk being kicked out of the lab =P

32 days ago
yay, new blog post! Hoping to write more soon =]
Davis, CA

Scenes from the ant colony's growing magician problem

If Cthulhu can be summoned by humans who are so far beneath it, why can't humans be summoned by ants?

The answer is they should be.


Well if a bunch of ants formed a circle in my house I'd certainly notice, try to figure out where they'd all come from, and possibly wreak destruction there.


That's why knowing and correctly pronouncing the true name is so important to the ritual. Imagine how impossible it would be to not go take a look if the circle of ants started chanting your name.

And they're like, you can't leave because we drew a line made of tiny crystals - now you have to do us a favor.

And you're like, let's just see where this goes "yup, you got me... what's the favor?"

and usually the favor is like, "kill this one ant for us" or "give me a pile of sugar" and you're like... okay? and you do, because why not, it isn't hard for you and boy is this going to be a fucking story to tell, these fucking ants chanting your name and wanting a spoonful of sugar or whatever.

And SOMEtimes you get asked for things you can't really do, one of them, she's like, "I love this ant but she won't pay any attention to me, make me important to her" and you're like... um? how? So you just kill every ant in the colony except the two of them, ta-da! problem solved! and the first ant is like horrified whisper "what have I done"


2 public comments
37 days ago
milky way
39 days ago
"and you're like... um? how? So you just kill every ant in the colony except the two of them, ta-da! problem solved!"
Bend, Oregon