
The Security Value of Inefficiency


For decades, we have prized efficiency in our economy. We strive for it. We reward it. In normal times, that's a good thing. Running just at the margins is efficient. A single just-in-time global supply chain is efficient. Consolidation is efficient. And that's all profitable. Inefficiency, on the other hand, is waste. Extra inventory is inefficient. Overcapacity is inefficient. Using many small suppliers is inefficient. Inefficiency is unprofitable.

But inefficiency is essential security, as the COVID-19 pandemic is teaching us. All of the overcapacity that has been squeezed out of our healthcare system; we now wish we had it. All of the redundancy in our food production that has been consolidated away; we want that, too. We need our old, local supply chains -- not the single global ones that are so fragile in this crisis. And we want our local restaurants and businesses to survive, not just the national chains.

We have lost much inefficiency to the market in the past few decades. Investors have become very good at noticing any fat in every system and swooping down to monetize those redundant assets. The winner-take-all mentality that has permeated so many industries squeezes any inefficiencies out of the system.

This drive for efficiency leads to brittle systems that function properly when everything is normal but break under stress. And when they break, everyone suffers. The less fortunate suffer and die. The more fortunate are merely hurt, and perhaps lose their freedoms or their future. But even the extremely fortunate suffer -- maybe not in the short term, but in the long term from the constriction of the rest of society.

Efficient systems have limited ability to deal with system-wide economic shocks. Those shocks are coming with increased frequency. They're caused by global pandemics, yes, but also by climate change, by financial crises, by political crises. If we want to be secure against these crises and more, we need to add inefficiency back into our systems.

I don't simply mean that we need to make our food production, or healthcare system, or supply chains sloppy and wasteful. We need a certain kind of inefficiency, and it depends on the system in question. Sometimes we need redundancy. Sometimes we need diversity. Sometimes we need overcapacity.

The market isn't going to supply any of these things, least of all in a strategic capacity that will result in resilience. What's necessary to make any of this work is regulation.

First, we need to enforce antitrust laws. Our meat supply chain is brittle because there are limited numbers of massive meatpacking plants -- now disease factories -- rather than lots of smaller slaughterhouses. Our retail supply chain is brittle because a few national companies and websites dominate. We need multiple companies offering alternatives to a single product or service. We need more competition, more niche players. We need more local companies, more domestic corporate players, and diversity in our international suppliers. Competition provides all of that, while monopolies suck that out of the system.

The second thing we need is specific regulations that require certain inefficiencies. This isn't anything new. Every safety system we have is, to some extent, an inefficiency. This is true for fire escapes on buildings, lifeboats on cruise ships, and multiple ways to deploy the landing gear on aircraft. Not having any of those things would make the underlying systems more efficient, but also less safe. It's also true for the internet itself, originally designed with extensive redundancy as a Cold War security measure.

With those two things in place, the market can work its magic to provide for these strategic inefficiencies as cheaply and as effectively as possible. As long as there are competitors who are vying with each other, and there aren't competitors who can reduce the inefficiencies and undercut the competition, these inefficiencies just become part of the price of whatever we're buying.

The government is the entity that steps in and enforces a level playing field instead of a race to the bottom. Smart regulation addresses the long-term need for security, and ensures it's not continuously sacrificed to short-term considerations.

We have largely been content to ignore the long term and let Wall Street run our economy as efficiently as it can. That's no longer sustainable. We need inefficiency -- the right kind in the right way -- to ensure our security. No, it's not free. But it's worth the cost.

This essay previously appeared in Quartz.

Read the whole story
13 hours ago
Davis, CA
Share this story

Streamlining Data-Intensive Biology With Workflow Systems

1 Comment
As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. The maturation of data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps is reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis.

Author Summary

We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.
That's our lab! =]

SIMDe 0.5.0 Released


I’m pleased to announce the availability of the first release of SIMD Everywhere (SIMDe), version 0.5.0, representing more than three years of work by over a dozen developers.

SIMDe is a permissively-licensed (MIT) header-only library which provides fast, portable implementations of SIMD intrinsics for platforms which aren’t natively supported by the API in question.

For example, with SIMDe you can use SSE on ARM, POWER, WebAssembly, or almost any platform with a C compiler. That includes, of course, x86 CPUs which don’t support the ISA extension in question (e.g., calling AVX-512F functions on a CPU which doesn’t natively support them).

If the target natively supports the SIMD extension in question there is no performance penalty for using SIMDe. Otherwise, accelerated implementations, such as NEON on ARM, AltiVec on POWER, WASM SIMD on WebAssembly, etc., are used when available to provide good performance.

SIMDe has already been used to port several packages to additional architectures through either upstream support or distribution packages, particularly on Debian.

If you’d like to play with SIMDe online, you can do so on Compiler Explorer.

What is in 0.5.0

The 0.5.0 release is SIMDe’s first release. It includes complete implementations of:

  • MMX
  • SSE
  • SSE2
  • SSE3
  • SSSE3
  • SSE4.1
  • AVX
  • FMA
  • GFNI

We also have rapidly progressing implementations of many other extensions including NEON, AVX2, SVML, and several AVX-512 extensions (AVX-512F, AVX-512BW, AVX-512VL, etc.).

Additionally, we have an extensive test suite to verify our implementations.

What is coming next

Work on SIMDe is proceeding rapidly, but there are a lot of functions to implement… x86 alone has about 6,000 SIMD functions, and we’ve implemented about 2,000 of them. We will keep adding more functions and improving the implementations we already have.

Our NEON implementation is being worked on very actively right now by Sean Maher and Christopher Moore, and is expected to continue progressing rapidly.

We currently have two Google Summer of Code students working on the project as well; Hidayat Khan is working on finishing up AVX2, and Himanshi Mathur is focused on SVML.

If you’re interested in using SIMDe but need some specific functions to be implemented first, please file an issue and we may be able to prioritize those functions.

Getting Involved

If you’re interested in helping out please get in touch. We have a chat room on Gitter which is fairly active if you have questions, or of course you can just dive right in on the issue tracker.


Data Containers


Back in 2016, there was discussion and excitement for data containers. Two recent developments have told me that now is the time to address this once more:

  • The knowledge that containers don't necessarily need an operating system.
  • The ability to create a container from scratch supported by Singularity (pull request here).

I was also invited to speak at dataverse and thought it would be a great opportunity to get people thinking again about data containers. I had less than a week to throw something together, but with a little bit of thinking and testing, this last week and weekend I have a skeleton, duct-taped together, preliminary cool idea to get us started! You can continue reading below, or jump to read the container database site for detailed examples.

What are the needs of a data container?

Before I could build a data container, I wanted to decide what would be important for it to have, or generally be. I decided to take a really simple perspective. Although I could think about runtime optimization, for my starting use case I wanted to be asap - as simple as possible! If we think of a “normal” container as providing a base operating system to support libraries, small data files, and ultimately running scientific software, then we might define a data container as:

a container without an operating system optimized for understanding and interacting with data

That’s right, remove the operating system and all the other cruft, and just provide the data! With this basic idea, I started by creating a data-container repository to play around with some ideas. I knew that I wanted:

  • a container without an operating system
  • an ability to interact with the container on its own to query the data (metadata)
  • an ability to mount it via an orchestrator to interact with the data
  • flexibility to customize the data extraction, interaction, metadata.

If we need to interact with the data still, although we won’t have an operating system, we’ll need some kind of binary in there.

How do we develop something new?

I tend to develop things with baby steps, starting with the most simple example and slowly increasing in complexity. If you look at the data-container repository, you’ll likely be able to follow my thinking.

hello world

I started by building a hello world binary on the host (in Golang) and then adding it to a scratch container as an entrypoint. These basic commands I’ll show once - they are generally the same for the following tests that I did. This is how we compile a GoLang script on my host.

GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o hello

Here is how we add it to a container, and specify it to be the entrypoint.

FROM scratch
COPY hello /
CMD ["/hello"]

And then here is how we build and run the container. It prints a hello world message. Awesome!

$ docker build -f Dockerfile.hello -t hello .
$ docker run --rm hello
Hello from OS-less container (Go edition)


The next creation was a “sleep” type entrypoint that I could use with a “docker-compose.yml” file to bring up the container with another, and bind (and interact with) the data. Running the container on its own would leave your terminal hanging, but with a “docker-compose.yml” it would keep the container running as a service, and of course share any data volumes you need with other containers.

in memory database

Once that was working, I took it a step further and used an in-memory database to add some key value pair, and then print it to the terminal. After this worked it was smooth sailing, because I’d just need to create an entrypoint binary that would generate this database for some custom dataset, and then build it into a container. I wanted to create a library optimized for generating the entrypoint.
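To make the idea concrete, here is a toy version of such a key-value store, sketched in Python rather than Go so it is quick to read. It is only an illustration: the class name `TinyDB` and its methods are hypothetical, and treating a search as a substring match on the stored value is an assumption, not a description of the real entrypoint.

```python
# Illustrative sketch only: the real entrypoint is a compiled Go binary.
# TinyDB and its method names are hypothetical.


class TinyDB:
    """A toy in-memory key-value store keyed by file path."""

    def __init__(self):
        self.records = {}  # file path -> metadata dictionary

    def add(self, key, metadata):
        self.records[key] = metadata

    def order_by(self, metric):
        """Return (key, metadata) pairs sorted by one metadata value."""
        return sorted(self.records.items(), key=lambda item: item[1][metric])

    def search(self, metric, term):
        """Return keys whose value for the metric contains the term (assumed substring match)."""
        return [
            key
            for key in sorted(self.records)
            if str(term) in str(self.records[key][metric])
        ]


db = TinyDB()
db.add("/data/avocado.txt", {"size": 9})
db.add("/data/tomato.txt", {"size": 8})

# Print every record, smallest file first.
for key, metadata in db.order_by("size"):
    print(key, metadata)
```

Everything lives in a plain dictionary, so nothing has to be written to disk, and that is all a container without an operating system really needs.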

How do we generate it?

Great! At this point I had a simple approach that I wanted to take: to create an executable with an in-memory database to query a dataset. The binary, along with the data, would be added to an “empty” (scratch) container. I started a library cdb (and documentation here) that would be exclusively for this task. Since Python is the bread and butter language for much of scientific programming (and would be easier for a scientist to use) I decided to implement this metadata-extraction, entrypoint-generation tool using it. Let’s walk through the generation steps.


Adding data to a scratch base is fairly trivial - it’s the entrypoint that will provide the power to interact with the data. Given that we have a tool (cdb) that takes a database folder (a set of folders and files) and generates a script to compile:

# container database generate from the "data" folder and output the file "entrypoint.go"
$ cdb generate /data --out /entrypoint.go

We can create a multi-stage build that will handle all steps from metadata generation, to golang compile, to generation of the final data container. If you remember from above, we had originally compiled our testing Go binaries on our host. We don’t need to do that, or even to install the cdb tool, because it can be done with a multi-stage build. We will use the following stages:

  • Stage 1: We install cdb to generate a GoLang template for an in-memory database.
  • Stage 2: We compile the binary into an entrypoint.
  • Stage 3: We add the data and the binary entrypoint to a scratch container (no operating system).

In stage 1, both the function for extraction and the template for the binary can be customized. The default will generate an entrypoint that creates the in-memory database, creates indices on metadata values, and then allows the user to search, order by, or otherwise list contents. The default function produces metadata with a file size and hash.

import os
from cdb.utils.file import get_file_hash

def basic(filename):
    """Given a filename, return a dictionary with basic metadata about it."""
    st = os.stat(filename)
    return {"size": st.st_size, "sha256": get_file_hash(filename)}
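
A custom function could follow the same shape: take a filename, return a dictionary. The sketch below is hypothetical (`with_extension` and `sha256_of` are names made up for illustration, and `sha256_of` is a standard-library stand-in for cdb's `get_file_hash` so the example runs on its own). It assumes cdb accepts any function with this signature.

```python
import hashlib
import os


def sha256_of(filename, chunk_size=65536):
    """Hash a file in chunks (a stand-in for cdb's get_file_hash)."""
    digest = hashlib.sha256()
    with open(filename, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def with_extension(filename):
    """Hypothetical custom function: basic metadata plus the file extension."""
    st = os.stat(filename)
    return {
        "size": st.st_size,
        "sha256": sha256_of(filename),
        "extension": os.path.splitext(filename)[1],
    }
```

Each extra key in the returned dictionary would presumably become one more metric to order by or search on.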

You can imagine writing a custom function to use any kind of filesystem organization (e.g., BIDS via pybids) or other standard (e.g., schema.org) to handle the metadata part. I will hopefully be able to make some time to work on these examples. We’d basically just provide our custom function to the cdb executable, or even interact from within Python. Before I get lost in details, let’s refocus on our simple example, and take a look at this multi-stage build. Someone has likely done this before, it’s just really simple!

The Dockerfile

Let’s break the dockerfile down into its components. This first section will install the cdb software, add the data, and generate a GoLang script to compile, which will generate an in-memory database.

stage 1

FROM bitnami/minideb:stretch as generator
ENV PATH /opt/conda/bin:${PATH}
RUN /bin/bash -c "install_packages wget git ca-certificates && \
    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
    rm Miniconda3-latest-Linux-x86_64.sh"

# install cdb (update version if needed)
RUN pip install cdb==0.0.1

# Add the data to /data (you can change this)
COPY ./data /data
RUN cdb generate /data --out /entrypoint.go

stage 2

Next we want to build that file, entrypoint.go, and also carry the data forward:

FROM golang:1.13-alpine3.10 as builder
COPY --from=generator /entrypoint.go /entrypoint.go
COPY --from=generator /data /data

# Dependencies
RUN apk add git && \
    go get github.com/vsoch/containerdb && \
    GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o /entrypoint -i /entrypoint.go

stage 3

Finally, we want to add just the executable and data to a scratch container (meaning it doesn’t have an operating system)

FROM scratch
COPY --from=builder /data /data
COPY --from=builder /entrypoint /entrypoint

ENTRYPOINT ["/entrypoint"]

And that’s it! Take a look at the entire Dockerfile if you are interested, or a more verbose tutorial.


We have our Dockerfile that will handle all the work for us, let’s build the data container!

$ docker build -t data-container .

Single Container Interaction

We then can interact with it in the following ways (remember this can be customized if you use a different template). You can watch me interact in the asciinema recording, or continue reading.


If we just run the container, we get a listing of all metadata alongside the key.

$ docker run data-container
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
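
If you wanted to consume this listing from a script, each line can be split into a path and a JSON payload. This is a minimal sketch that assumes the default output format shown above and paths without spaces; `parse_listing` is a hypothetical helper, not part of cdb.

```python
import json


def parse_listing(text):
    """Parse lines of the form '<path> <json metadata>' into a dictionary."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first space only; the JSON payload itself contains spaces.
        path, _, payload = line.partition(" ")
        records[path] = json.loads(payload)
    return records


sample = """\
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
"""

records = parse_listing(sample)
print(records["/data/tomato.txt"]["size"])
```

In practice the text could come from capturing the standard output of the docker run command above instead of a pasted sample.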


We can also just list data files with -ls

$ docker run data-container -ls


Or we can list ordered by one of the metadata items:

$ docker run data-container -metric size
Order by size
/data/tomato.txt: {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
/data/avocado.txt: {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}


Or search for a specific metric based on value.

$ docker run data-container -metric size -search 8
/data/tomato.txt 8

$ docker run data-container -metric sha256 -search 8
/data/avocado.txt 327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4
/data/tomato.txt 3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816


Or we can get a particular file's metadata by its name:

$ docker run data-container -get /data/avocado.txt
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}

or a partial match:

$ docker run data-container -get /data/
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}


The start command is intended to keep the container running, if we are using it with an orchestrator.

$ docker run data-container -start


It’s more likely that we want to interact with files in the container via some analysis, or more generally, another container. Let’s put together a quick docker-compose.yml to do exactly that.

version: "3"
services:
  base:
    restart: always
    image: busybox
    entrypoint: ["tail", "-f", "/dev/null"]
    volumes:
      - data-volume:/data

  data:
    restart: always
    image: data-container
    command: ["-start"]
    volumes:
      - data-volume:/data

volumes:
  data-volume:


Notice that the command for the data-container to start is -start, which is important to keep it running. After building our data-container, we can then bring these containers up:

$ docker-compose up -d
Starting docker-simple_base_1   ... done
Recreating docker-simple_data_1 ... done

$ docker-compose ps
        Name                Command         State   Ports
docker-simple_base_1   tail -f /dev/null    Up           
docker-simple_data_1   /entrypoint -start   Up           

We can then shell inside and see our data!

$ docker exec -it docker-simple_base_1 sh
/ # ls /data/
avocado.txt  tomato.txt

The metadata is still available for query by interacting with the data-container entrypoint:

$ docker exec docker-simple_data_1 /entrypoint -ls

Depending on your use case, you could easily make this available inside the other container.

Why should I care?

I want you to get excited about data containers. I want you to think about how such a container could be optimized for some organization of data that you care about. I also want you to possibly get angry, and exclaim, “Dinosaur, you did this all wrong!” and then promptly go and start working on your own thing. The start that I’ve documented here and put together this weekend is incredibly simple - we build a small data container to query and show metadata for two files, and then bind that data to another orchestration setup. Can you imagine a data container optimized for exposing and running workflows? Or one that is intended for being stored in a data registry? Or how about something that smells more like a “results” container that you share, and have others run the container (possibly with their own private data bound to the host) and then write a result file into an organized namespace? I can imagine all of these things!

I believe the idea is powerful because we are able to keep data and interact with it without needing an operating system. Yes, you can have your data and eat it too!

Combined with other metadata or data organizational standards, this could be a really cool approach to develop data containers optimized to interact with a particular file structure or workflow. How will that work in particular? It’s really up to you! The cdb software can take custom functions for generation of metadata and templates for generating the GoLang script to compile, so the possibilities are very open.

What are the next steps?

I’d like to create examples for some real world datasets. Right now, the library extracts metadata on the level of the file, and I also want to allow for extraction on the level of the dataset. And I want you to be a part of this! Please contribute to the effort or view the container database site for detailed examples! I’ll be slowly adding examples as I create them.


Supporting Alexandra Elbakyan’s nomination for the 2020 John Maddox Prize


The John Maddox Prize has been awarded annually since 2012 to “researchers who have shown great courage and integrity in standing up for science and scientific reasoning against fierce opposition and hostility”. The prize is a joint initiative between the journal Nature and the Sense about Science charity.

Fergus Kane nominated Alexandra Elbakyan, creator of Sci-Hub, for the prize in 2018. Although she was selected for the final shortlist, she did not win. Dr. Kane has nominated Elbakyan a second time for 2020 and named me as a reference after reading our study on the coverage of Sci-Hub’s catalog.

Here’s a figure where we show the growth in PDF downloads from Sci-Hub over time, based on server log data that Elbakyan has made public:

Sci-Hub Downloads per Day (including Elsevier DOIs that were missing from an earlier dataset)

Sci-Hub’s growth reflects the urgent need of scholars to access the literature. As a proponent of a more open future for science, it is my honor to recommend Elbakyan for the Maddox Prize.

Reference Letter

In 2011, Alexandra Elbakyan unveiled her tool Sci-Hub for providing fulltext access to scholarly articles, especially articles that are paywalled by their publishers. The service was available at sci-hub.org until 2015, when publishing giant Elsevier was able to shut down Sci-Hub’s .org domain with an uncontested lawsuit in the U.S. court system.

In the following years, Sci-Hub adopted additional domains including .bz, .cc, .cn, .cool, .fun, .ga, .gq, .hk, .io, .is, .la, .mn, .mu, .name, .nu, .nz, .ooo, .shop, .tv, .wang, and .ws. All of these domains are now defunct. Yet Sci-Hub persevered, and, as I write, is available at https://sci-hub.st, .se, .tw, and .ee. I begin with this list of censored domains as a testament to the opposition Sci-Hub has encountered.

This opposition should come as no surprise. Sci-Hub challenges the subscription publishing business model that brings in over $10 billion US in revenue annually from toll access scholarly journals. Communicating science is of critical importance. Yet we inherited a publishing system whose design impedes access and reuse. Paywalling publicly funded research is unacceptable, especially since the proceeds don’t fund the research or reward the creators (authors, peer reviewers, and editors).

Elbakyan recognizes the injustice of the current system, writing on the Sci-Hub homepage:

We fight inequality in knowledge access across the world. The scientific knowledge should be available for every person regardless of their income, social status, [and] geographical location

Nearly all research stakeholders benefit from open access. Scholars want to disseminate their ideas. Funders want a societal return on their investment. Readers want access and the right to view research on any platform. Text miners want to accelerate the pace of discovery by using machines to extract insights from the literature.

Were authors, funders, and institutions to decide that all future publications must be open, toll access journals would have no choice but to switch. Sadly, the scholarly community has been too indecisive and laggard. Ultimately, culpability lies with authors who sign copyright transfer agreements, ceding society’s ability to access and reuse materials the public has funded. Were more scientists willing to take personal action to end the antiquated toll access system, there would be no need for Sci-Hub. But this has not been the case, and Sci-Hub, led by Elbakyan, has taken on the challenge of making sure all humans have access to the scientific record, while promoting an open future.

As libraries continue to cancel major journal subscriptions, we owe Elbakyan immense gratitude. A decade from now, we will likely witness the vast majority of articles published openly. While initiatives like Plan S and preprints are playing a role, they are secondary to Sci-Hub. Elbakyan sees her contribution in making open access inevitable, explaining:

The effect of long-term operation of Sci-Hub will be that publishers change their publishing models to support Open Access, because closed access will make no sense anymore.

To prove this point, I led a 2018 study that found Sci-Hub already contained 85% of articles in toll access journals.

Elbakyan has made personal sacrifices for her cause. According to U.S. courts, she owes $15 million to Elsevier and $4.8 million to the American Chemical Society for copyright infringement. The arithmetic highlights the absurdity of this punishment: the court fined Elbakyan $150,000 per article, and Elsevier’s suit specifically alleged infringement of 100 articles (including one U.K. government work that legally belongs in the public domain). Based outside of U.S. jurisdiction, Elbakyan did not file a legal defense against these suits, although she did write a letter to the judge.

Detractors accuse Elbakyan of being a malicious cyber criminal. The U.S. Justice Department even investigated her for ties to Russian intelligence. But there is no public evidence to support the conclusion that Sci-Hub has any other objective than universal access to research. And the operation appears to be funded primarily by donations and volunteer effort.

Elbakyan has remained steadfast in the face of these smear campaigns and legal challenges. She engages the public via her blog, Twitter, VK, and comments to the media. Of course, she must maintain a high level of privacy and restrict public appearances given the controversial nature of her work. But Sci-Hub speaks for itself. Calling Sci-Hub the “only solution available to access articles [for many researchers]”, Elbakyan comments:

What differentiates Sci-Hub from this talk, is that Sci-Hub not talking, but actually solving this problem, providing access to those researchers who need it.


Drippy Steambottom: Master of Human Nature


this is a diesel sweeties comic strip

Turns out, coffee has an agenda.
