A brief introduction to osfclient, a command line client for the Open Science Framework

Over the last few months, Tim Head has been pushing forward the osfclient project, an effort to build a simple and friendly command-line interface to the Open Science Framework's file storage. This project was funded by a gift to my lab through the Center for Open Science (COS) to the tune of about $20k, given by an anonymous donor.

The original project was actually to write an OSF integration for Galaxy, but that project was first delayed by my move to UC Davis and then suffered from Michael Crusoe's move to work on the Common Workflow Language. After talking with the COS folk, we decided to repurpose the money to something that addresses a need in my lab - using the Open Science Framework to share files.

Our (Tim's) integration effort resulted in osfclient, a combination Python API and command-line program. The project is still in its early stages, but a few people have found it useful - in addition to increasing usage within my lab, @zkamvar has used it to transfer "tens of thousands of files", and @danudwary found it "just worked" for grabbing some big files. And new detailed use cases are emerging regularly.

Most exciting of all, we've had contributions from a number of other people already, and I'm looking forward to this project growing to meet the needs of the open science community!

Taking a step back: why OSF, and why a command-line client?

I talked a bit about "why OSF?" in a previous blog post, but the short version is that it's a globally accessible place to store files for science, and it works well for that! It fits a niche that we haven't found any other solutions for - free storage for medium size genomics files - and we're actively exploring its use in about a dozen different projects.

Our underlying motivations for building a command-line client for OSF were several:

  • we often need to retrieve full folder/directory hierarchies of files for research and training purposes;

  • frequently, we want to retrieve those file hierarchies on remote (cloud or HPC) systems;

  • we're often grabbing files that are larger than GitHub supports;

  • sometimes these files are from private projects that we cannot (or don't want to) publicize;

Here, the Open Science Framework was already an 80% solution (supporting folder hierarchies, large file storage, and a robust permissions system), but it didn't have a command-line client - we were reduced to using curl or wget on individual files, or (in theory) writing our own REST queries.

Enter osfclient!

Using osfclient, a quickstart

(See "Troubleshooting osfclient installs" at the bottom if you run into any troubles running these commands!)

In a Python 3 environment, do:

pip install osfclient

and then execute:

osf -p fuqsk clone

This will go to the osfclient test project on http://osf.io, and download all the files that are part of that project -- if you execute:

find fuqsk

you should see:

fuqsk/figshare/this is a test text file
fuqsk/figshare/this is a test text file/hello.txt
fuqsk/googledrive/google test file.gdoc

which showcases a particularly nice feature of the OSF that I'll talk about below.

A basic overview of what osfclient did

If you go to the project URL, http://osf.io/fuqsk, you will see a file storage hierarchy that looks like so:

OSF folders screenshot

What osfclient is doing is grabbing all of the different storage files and downloading them to your local machine. Et voila!

What's with the 'figshare' and 'googledrive' stuff? Introducing add-ons/integrations.

In the above, you'll notice that there are these subdirectories named figshare and googledrive. What are those?

The Open Science Framework can act as an umbrella integration for a variety of external storage services - see the docs. They support Amazon S3, Dropbox, Google Drive, Figshare, and a bunch of others.

In the above project, I linked in my Google Drive and Figshare accounts to OSF, and connected specific remote folders/projects into the OSF project (this one from Google Drive, and this one from figshare). This allows me (and others with permissions on the project) to access and manage those files from within a single Web UI on the OSF.

osfclient understands some of these integrations (and it's pretty trivial to add a new one to the client, at least), and it does the most obvious thing possible with them when you do an osf clone: it grabs the files and downloads them! (It should also be able to push to those remote storages, but I haven't tested that today.)

Interestingly, this appears to be a good simple way to layer OSF's project hierarchy and permission system on top of more complex and/or less flexible and/or non-command-line-friendly systems. For example, Luiz Irber recently uploaded a very large file to google drive via rclone and it showed up in his OSF project just fine.

This reasonably flexible imposition of an overall namespace on a disparate collection of storages is pretty nice, and could be a real benefit for large, complex projects.

Other things you can do with osfclient

osfclient also has file listing and file upload functionality, along with some configurability in terms of providing a default project and permissions within specific directories. The osfclient User Guide has some brief instructions along these lines.

osfclient also contains a Python API for OSF, and you can see a bit more about that here, in Tim Head and Erin Braswell's webinar materials.

What's next?

There are a few inconveniences about the OSF that could usefully be worked around, and a lot of features to be added in osfclient. In no particular order, here are a few of the big ones that require significant refactoring or design decisions or even new REST API functionality on the OSF side --

  • we want to make osf behave a bit more like git - see the issue. This would make it easier to teach and use, we think. In particular we want to avoid having to specify the project name every time.
  • speaking of project names, I don't think the project UIDs on the OSF (fuqsk above) are particularly intuitive or type-able, and it would be great to have a command line way of discovering the project UID for your project of interest.
  • I'd also like to add project creation and maybe removal via the command line, as well as project registration - more on that later.
  • the file storage hierarchy above, with osfstorage/ and figshare/ as top level directories, isn't wonderful for command line folk - there are seemingly needless hierarchies in there. I'm not sure how to deal with this but there are a couple of possible solutions, including adding a per-project 'remapping' configuration that would move the files around.

Concluding thoughts

The OSF offers a simple, free, Web friendly, and convenient way to privately and publicly store collections of files under 5 GB in size on a Web site. osfclient provides a simple and reasonably functional way to download files from and upload files to the OSF via the command line. Give it a try!

Appendix: Troubleshooting osfclient installs

  1. If you can't run pip install on your system, you may need to either run the command as root, OR establish a virtual environment -- something like

python -m virtualenv -p python3.5 osftest
. osftest/bin/activate
pip install osfclient

will create a virtualenv, activate it, and install osfclient. (If you run into problems)

  2. If you get a requests.exceptions.SSLError, you may be on a Mac and using an old version of openssl. You can try pip install -U pyopenssl. If that doesn't work, please add a comment to this issue.

  3. Note that a conda install for osfclient exists, and you should be able to do conda install -c conda-forge osfclient.

Read the whole story
6 days ago
Davis, CA
Share this story

Save Your Work

Here's a useful habit I've picked up as a software engineer. Every time you do something difficult, create a reproducible artifact that can be used to do it more easily next time, and shared with others.

Some examples of this:

  • You spent all afternoon debugging a thorny issue. Write down the monitoring you checked and the steps you took to reach the conclusion you did. Put these details in the issue tracker, before moving on to actually fix it.
  • You figured out what commands to run to get the binary to work properly. Turn the commands into a short script and check it into source control.
  • You spent a day reading the code and figuring out how it works. Write yourself some notes and documentation as you go. At the end, take half an hour to clean it up and send it to your boss or teammates who might find it helpful. Maybe even put up a documentation website if that seems appropriate.

This makes it easier to pick up where you left off for next time (for you or someone else), and makes it easier to prove that the work you're doing is difficult and has value.

The Data Tinder Collects, Saves, and Uses

Under European law, service providers like Tinder are required to show users what information they have on them when requested. This author requested, and this is what she received:

Some 800 pages came back containing information such as my Facebook "likes," my photos from Instagram (even after I deleted the associated account), my education, the age-rank of men I was interested in, how many times I connected, when and where every online conversation with every single one of my matches happened...the list goes on.

"I am horrified but absolutely not surprised by this amount of data," said Olivier Keyes, a data scientist at the University of Washington. "Every app you use regularly on your phone owns the same [kinds of information]. Facebook has thousands of pages about you!"

As I flicked through page after page of my data I felt guilty. I was amazed by how much information I was voluntarily disclosing: from locations, interests and jobs, to pictures, music tastes and what I liked to eat. But I quickly realised I wasn't the only one. A July 2017 study revealed Tinder users are excessively willing to disclose information without realising it.

"You are lured into giving away all this information," says Luke Stark, a digital technology sociologist at Dartmouth University. "Apps such as Tinder are taking advantage of a simple emotional phenomenon; we can't feel data. This is why seeing everything printed strikes you. We are physical creatures. We need materiality."

Reading through the 1,700 Tinder messages I've sent since 2013, I took a trip into my hopes, fears, sexual preferences and deepest secrets. Tinder knows me so well. It knows the real, inglorious version of me who copy-pasted the same joke to match 567, 568, and 569; who exchanged compulsively with 16 different people simultaneously one New Year's Day, and then ghosted 16 of them.

"What you are describing is called secondary implicit disclosed information," explains Alessandro Acquisti, professor of information technology at Carnegie Mellon University. "Tinder knows much more about you when studying your behaviour on the app. It knows how often you connect and at which times; the percentage of white men, black men, Asian men you have matched; which kinds of people are interested in you; which words you use the most; how much time people spend on your picture before swiping you, and so on. Personal data is the fuel of the economy. Consumers' data is being traded and transacted for the purpose of advertising."

Tinder's privacy policy clearly states your data may be used to deliver "targeted advertising."

It's not Tinder. Surveillance is the business model of the Internet. Everyone does this.


Is your bioinformatics analysis package Big Data ready?

(Apologies for the Big Data buzzword, but it actually fits!)

I find it increasingly frustrating to use a broad swath of bioinformatics tools (and given how inherently frustrating the bioinformatics ecosystem is already, that's saying something!) My bugaboo this time is bioinformatics "prepared databases" - but not multiple installs of databases or their size. Rather, I'm frustrated by the cost of building them. Similarly, I'm frustrated at how many analysis pipelines require you to start from the very beginning if you add data to your experiment or sample design.

We are increasingly living in a world of what I call "infinite data", where, for many purposes, getting the data is probably the least of your problems - or, at least, the time and cost of generating sequence data is dwarfed by sample acquisition, data analysis, hypothesis generation/testing, and paper writing.

What this means for databases is that no sooner does someone publish a massive amount of data from a global ocean survey (Tara Oceans, Sunagawa et al. 2015) than does someone else analyze the data in a new and interesting way (Tully et al., 2017) and produce data sets that I, at least, really want to include in my own reference data sets for environmental metagenome data. (See Hug et al., 2016 and Stewart et al., 2017 for even more such data sets.)

Unfortunately, most bioinformatics tools don't let me easily update my databases or analyses with information from new references or new data. I'm going to pick on taxonomic approaches first because they're on my mind but this applies to any number of software packages commonly used in bioinformatics.

Let's take Kraken, an awesome piece of software that founded a whole class of approaches. The Kraken database and toolset is constructed in such a way that I cannot easily do two specific things:

  • add just one reference genome to it;
  • ask Kraken to search two different reference databases and report integrated results;

Similarly, none of the assemblers I commonly use lets me add just one more sequencing data set into the assembly - I have to recalculate the entire assembly from scratch, after adding the additional data set. Annotation approaches don't let me update the database with new sequences and then run an abbreviated analysis that improves annotations. And so on.

What's the problem?

There are two classes of problems underneath this - one is the pragmatic engineering issue. For example, with Kraken, it is theoretically straightforward, but tricky from an engineering perspective, to support the addition of a single genome into the database; "all" you need to do is update the least-common-ancestor assignments in the database for each k-mer in your new genome. Likewise, the Kraken approach should be able to support search across databases, where you give it two different databases and it combines them in the analysis - but it doesn't seem to. And, last but not least, Kraken could in theory support taking an existing classified set of reads and "updating" their classification from a new database, although in practice there's not much point in doing this because Kraken's already pretty fast.

The second kind of challenge is theoretical. We don't really have the right algorithmic approaches to assembly to support fully online assembly in all its glory, although you could imagine doing things like error trimming in an incremental way (I mean, just imagine...) and saving that for later re-use. But we generally don't save the full information needed to add a single sequence and then quickly output an updated assembly.

And that last statement highlights an important question - for which approaches do we have the theory and the engineering to quickly update an analysis with a new reference database, or a new read data set? I can guess but I don't know.

Note, I'm explicitly not interested in heuristic approaches where you e.g. map the reads to the assembly to see if there's new information - I'm talking about using algorithms and data structures and software where you get identical results from doing analysis(n + m) and analysis(n) + analysis(m).
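
To make the analysis(n + m) versus analysis(n) + analysis(m) property concrete, here is a minimal Python sketch using exact k-mer counting, which is about the simplest analysis that has it. The function name count_kmers and the toy reads are made up for illustration; this is not code from any of the tools discussed here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Two hypothetical batches of reads: the original data (n) and new data (m).
batch_n = ["ACGTACGTAC", "TTGACCGTAA"]
batch_m = ["ACGTACGGGA"]
k = 4

# analysis(n + m): recompute everything from scratch.
from_scratch = count_kmers(batch_n + batch_m, k)

# analysis(n) + analysis(m): reuse the old result and add only the new batch.
incremental = count_kmers(batch_n, k) + count_kmers(batch_m, k)

print(from_scratch == incremental)
```

The saved result for the old batch never has to be recomputed, and the merged counts are identical to a from-scratch run rather than a heuristic approximation. Assembly and LCA databases are much harder to make mergeable in this way, which is the point of the paragraph above.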

Even if we do have this theoretical ability, it's not clear that many tools support it, or at least they don't seem to advertise that support. This also seems like an area where additional theoretical development (new data structures! better algorithms!) would be useful.

I'm curious if I'm missing something; have I just not noticed that this is straightforward with many tools? It seems so useful...

Anyway, I'm going to make a conscious effort to support this in our tools (which we sort of already were planning, but now can be more explicit about).


p.s. This post partly inspired by Meren, who asked for this with sourmash databases - thanks!


A post about k-mers - this time for taxonomy!

I love k-mers. Unlike software engineering or metagenomics, this isn't a love-hate relationship - this is a pure "gosh aren't they wonderful" kind of love and admiration. They "just work" in so many situations that it helps validate the last 10 years of my research life (which, let's be honest, has been pretty k-mer focused!).

DNA k-mers underlie much of our assembly work, and we (along with many others!) have spent a lot of time thinking about how to store k-mer graphs efficiently, discard redundant data, and count them efficiently.

More recently, we've been enthused about using k-mer based similarity measures and computing and searching k-mer-based sketch search databases for all the things.

But I haven't spent too much time talking about using k-mers for taxonomy, although that has become an ahem area of interest recently, if you read into our papers a bit.

In this blog post I'm going to fix this by doing a little bit of a literature review and waxing enthusiastic about other people's work. Then in a future blog post I'll talk about how we're building off of this work in fun! and interesting? ways!

A brief introduction to k-mers

(Borrowed from the ANGUS 2017 sourmash tutorial!)

K-mers are a fairly simple concept that turn out to be tremendously powerful.

A "k-mer" is a word of DNA that is k long:

ATTG - a 4-mer
ATGGAC - a 6-mer

Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences -- e.g. given a sequence of length 16, you could extract 11 k-mers of length six from it like so:


becomes the following set of 6-mers:


k-mers are most useful when they're long, because then they're specific. That is, if you have a 31-mer taken from a human genome, it's pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 4^31 possible 31-mers, and 4^31 = 4,611,686,018,427,387,904. So, you know, a lot.)

The important concept here is that long k-mers are species specific. We'll go into a bit more detail later.
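
The sliding-window extraction described above can be sketched in a few lines of Python. The 16-base sequence here is made up for illustration (it is not the sequence from the tutorial), and extract_kmers is a hypothetical helper name.

```python
def extract_kmers(sequence, k):
    """Slide a k-length window across the sequence, one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# A made-up 16-base sequence, used only for illustration.
sequence = "ACGTTGCATCGGATAC"
kmers = extract_kmers(sequence, 6)

# Print each 6-mer indented by its offset, so the sliding window is visible.
for offset, kmer in enumerate(kmers):
    print(" " * offset + kmer)

print(len(kmers))  # 16 - 6 + 1 = 11 windows
print(4 ** 31)     # the number of possible 31-mers
```

A sequence of length L always yields L - k + 1 k-mers, which is where the 11 comes from for L = 16 and k = 6.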

K-mers and assembly graphs

We've already run into k-mers before, as it turns out - when we were doing genome assembly. One of the three major ways that genome assembly works is by taking reads, breaking them into k-mers, and then "walking" from one k-mer to the next to bridge between reads. To see how this works, let's take the 16-base sequence above, and add another overlapping sequence:


One way to assemble these together is to break them down into k-mers -- e.g. the sequences above become the following set of 6-mers:

          AGATAG -> off the end of the first sequence
           GATAGG <- beginning of the second sequence

and if you walk from one 6-mer to the next based on 5-mer overlap, you get the assembled sequence:


voila, an assembly!
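
Here is a toy Python sketch of that walk, assuming the k-mers form a single unbranched path (real assemblers have to cope with branches, errors, and repeats, and work very differently in practice). The two overlapping reads are made up for illustration, and assemble is a hypothetical helper, not any assembler's API.

```python
def extract_kmers(sequence, k):
    """Slide a k-length window across the sequence, one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def assemble(reads, k):
    """Walk k-mers by (k-1)-base overlap; assumes one unbranched path."""
    kmers = set()
    for read in reads:
        kmers.update(extract_kmers(read, k))

    # Index each k-mer by its first k-1 bases, and record every (k-1)-suffix.
    by_prefix = {kmer[:-1]: kmer for kmer in kmers}
    suffixes = {kmer[1:] for kmer in kmers}

    # The starting k-mer is the one that no other k-mer leads into.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting k-mer")

    contig = current = starts[0]
    seen = {current}
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]
        if current in seen:  # guard against cycles in the toy graph
            break
        seen.add(current)
        contig += current[-1]  # each step extends the contig by one base
    return contig


# Two made-up reads that overlap by 12 bases.
reads = ["ACGTTGCATCGGATAC", "TGCATCGGATACCTGAAG"]
print(assemble(reads, 6))
```

Because the two reads share 6-mers in their overlap, the walk passes from k-mers that only the first read contributes to k-mers that only the second read contributes, and the result spans both.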

Graphs of many k-mers together are called De Bruijn graphs, and assemblers like MEGAHIT and SOAPdenovo are De Bruijn graph assemblers - they use k-mers underneath.

Why k-mers, though? Why not just work with the full read sequences?

Computers love k-mers because there's no ambiguity in matching them. You either have an exact match, or you don't. And computers love that sort of thing!

Basically, it's really easy for a computer to tell if two reads share a k-mer, and it's pretty easy for a computer to store all the k-mers that it sees in a pile of reads or in a genome.

So! On to some papers!


About a year and a half ago, I had the pleasure of reviewing the MetaPalette paper by David Koslicki and Daniel Falush. This paper did some really cool things and I'm going to focus on two of the least cool things (the cooler things are deeper and worth exploring, but mathematical enough that I haven't fully understood them).

The first cool thing: the authors show that k-mer similarity between genomes at different k approximates various degrees of taxonomic similarity. Roughly, high similarity at k=21 defines genus-level similarity, k=31 defines species-level, and k=51 defines strain-level. (This is an ad-lib from what is much more precisely and carefully stated within the paper.) This relationship is something that we already sorta knew from Kraken and similar k-mer-based taxonomic efforts, but the MetaPalette paper was the first paper I remember reading where the specifics really stuck with me.

Figure 2 from Koslicki and Falush, 2016

This figure, taken from the MetaPalette paper, shows clearly how overall k-mer similarity between pairs of genomes matches the overall structure of genera.

The second cool thing: the authors show that you can look at what I think of as the "k-mer size response curve" to infer the presence of strain/species variants of known species in a metagenomic data set. Briefly, in a metagenomic data set you would expect to see stronger similarity to a known genome at k=21 than at k=31, and at k=31 than at k=51, if a variant of that known species is present. (You'd see more or less even similarity across the k-mer sizes if the exact genome was present.)

k-mer response curve

In this figure (generated by code from our metagenome assembly preprint), you can see that this specific Geobacter genome's inclusion in the metagenome stays the same as we increase the k-mer size, while Burkholderia's decreases. This suggests that we know precisely which Geobacter is present, while only a strain variant of that Burkholderia is present.
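
The shape of such a response curve can be reproduced with a toy simulation. This is a minimal sketch under loud assumptions: the "genome" is a random sequence, the "metagenome" is just a single sequence standing in for a real sample, and the strain variant is made by substituting one base every 100 positions. The helper names are hypothetical, and this is not the code that generated the figure.

```python
import random


def kmer_set(sequence, k):
    """All distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def containment(genome, metagenome, k):
    """Fraction of the genome's k-mers that also occur in the metagenome."""
    genome_kmers = kmer_set(genome, k)
    return len(genome_kmers & kmer_set(metagenome, k)) / len(genome_kmers)


def mutate(sequence, spacing):
    """Substitute one base every `spacing` positions to mimic a strain variant."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    bases = list(sequence)
    for position in range(spacing // 2, len(bases), spacing):
        bases[position] = swap[bases[position]]
    return "".join(bases)


rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
variant = mutate(genome, spacing=100)

for k in (21, 31, 51):
    exact = containment(genome, genome, k)     # the exact genome is present
    strain = containment(genome, variant, k)   # only a strain variant is present
    print(k, round(exact, 3), round(strain, 3))
```

Each substitution destroys every k-mer that overlaps it, so longer k-mers lose more: containment stays flat when the exact genome is present and drops as k grows when only a variant is.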

I'm not sure whether this idea has been so clearly explicated before MetaPalette - it may not be novel because it sort of falls out of De Bruijn/assembly graph concepts, for example - but again it was really nicely explained in the paper, and the paper did an excellent job of then using these two observations to build a strain-aware taxonomic classification approach for metagenomic data sets.

The two major drawbacks of MetaPalette were that the implementation (though competitive with other implementations) was quite heavyweight and couldn't run on e.g. my laptop; and the math was complicated enough that reimplementing it myself was not straightforward.

MinHash and mash

(This next paper is less immediately relevant to taxonomy, but we'll get there, promise!)

At around the same time as I reviewed the MetaPalette paper, I also reviewed the mash paper by Ondov et al., from Adam Phillippy's lab. (Actually, what really happened was two people from my lab reviewed it and I blessed their review & submitted it, and then when the revisions came through I read the paper more thoroughly & got into the details to the point where I reimplemented it for fun. My favorite kind of review!)

The mash paper showed how a simple technique for fast comparison of sets, called MinHash - published in 1998 by Broder - could be applied to sets of k-mers. The specific uses were things like comparing two genomes, or two metagenomes. The mash authors also linked the Jaccard set similarity measure to the better known (in genomics) measure, the Average Nucleotide Identity between two samples.

The core idea of MinHash is that you subsample any set of k-mers in the same way, by first establishing a common permutation of k-mer space (using a hash function) and then subselecting from that permutation in a systematic way, e.g. by taking the lowest N hashes. This lets you prepare a subsample from a collection of k-mers once, and then forever more use that same subsampled collection of k-mers to compare with other subsampled collections. (In sourmash, we call these subsampled sets of k-mers "signatures".)
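
A minimal Python sketch of the "lowest N hashes" idea follows. It is an illustration only, not how mash or sourmash are implemented: SHA-1 from the standard library stands in for the hash function those tools use, the sequences are random, and all function names are hypothetical.

```python
import hashlib
import random


def kmer_set(sequence, k):
    """All distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer; the hash fixes one shared ordering."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(kmers, num_hashes):
    """Keep only the num_hashes smallest hash values."""
    return sorted({hash_kmer(kmer) for kmer in kmers})[:num_hashes]


def estimate_jaccard(sketch_a, sketch_b, num_hashes):
    """Estimate Jaccard similarity from two sketches of the same size."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # The smallest hashes of the union are a sample of the union of k-mers.
    merged = sorted(set_a | set_b)[:num_hashes]
    if not merged:
        return 0.0
    shared = sum(1 for value in merged if value in set_a and value in set_b)
    return shared / len(merged)


rng = random.Random(1)
seq_a = "".join(rng.choice("ACGT") for _ in range(5000))
seq_b = "".join(rng.choice("ACGT") for _ in range(5000))
seq_c = seq_a[:2500] + seq_b[2500:]  # shares its first half with seq_a

sig_a = sketch(kmer_set(seq_a, 21), 500)
sig_b = sketch(kmer_set(seq_b, 21), 500)
sig_c = sketch(kmer_set(seq_c, 21), 500)

print(estimate_jaccard(sig_a, sig_a, 500))  # identical sets
print(estimate_jaccard(sig_a, sig_b, 500))  # unrelated sequences
print(estimate_jaccard(sig_a, sig_c, 500))  # partial overlap
```

Each sketch is computed once and can then be compared against any other sketch built with the same hash function, k, and sketch size, without going back to the original sequences.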

What really blew my mind was how fast and lightweight MinHash was, and how broadly applicable it was to various problems we were working on. The mash paper started what is now an 18 month odyssey that has led in some pretty interesting directions. (Read here and here for some of them.)

The one big drawback of mash/MinHash for me stems from its strength: MinHash has limited resolution when comparing two sets of very different sizes, e.g. a genome and a metagenome. The standard MinHash approach assumes that the sets to be compared are of similar sizes, and that you are calculating Jaccard similarity across the entire set; in return for these assumptions, you get a fixed-size sketch with guarantees on memory usage and execution time for both preparation and comparisons. However, these assumptions prevent something I want to do very much, which is to compare complex environmental metagenomes with potential constituent genomes.

A blast from the recent past: Kraken

The third paper I wanted to talk about is Kraken, a metagenome taxonomic classifier by Wood and Salzberg (2014) - see manual, and paper. This is actually a somewhat older paper, and the reason I dug back into the literature is that there has been somewhat of a renaissance in k-mer classification systems as the CS crowd has discovered that storing and searching large collections of k-mers is an interesting challenge. More specifically, I've recently reviewed several papers that implement variations on the Kraken approach (most recently MetaOthello, by Liu et al. (2017)). My lab has also been getting more and more office hour requests from people who want to use Kraken and Kaiju for metagenomic classification.

Briefly, Kraken uses k-mers to apply the least-common-ancestor approach to taxonomic classification. The Kraken database is built by linking every known k-mer to a taxonomic ID. In cases where a k-mer belongs to only a single known genome, the assignment is easy: that k-mer gets the taxonomic ID of its parent genome. In the fairly common case that a k-mer belongs to multiple genomes, you assign the taxonomic ID of the least-common-ancestor taxonomic level to that k-mer.

Then, once you have built the database, you classify sequences by looking at the taxonomic IDs of constituent k-mers. Conveniently this step is blazingly fast!
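
The database build and lookup can be sketched in Python with a toy taxonomy. Everything here is an assumption made for illustration: the taxon names, the tiny "genomes", k=4, and the function names. In particular, the classify step below is a plain majority vote over k-mer hits, which is a simplification; Kraken itself scores root-to-leaf paths in the taxonomy, and its real data structures look nothing like a Python dict.

```python
from collections import Counter

# Toy taxonomy as child -> parent links; all names are hypothetical.
PARENT = {
    "root": None,
    "genusA": "root",
    "sp1": "genusA",
    "sp2": "genusA",
    "sp3": "root",
}


def lineage(taxid):
    """Return the path from a taxon up to the root."""
    path = []
    while taxid is not None:
        path.append(taxid)
        taxid = PARENT[taxid]
    return path


def lca(taxid_a, taxid_b):
    """Least common ancestor of two taxa in the toy taxonomy."""
    ancestors_a = set(lineage(taxid_a))
    for taxid in lineage(taxid_b):
        if taxid in ancestors_a:
            return taxid
    raise ValueError("taxa do not share a root")


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_database(genomes, k):
    """Link every k-mer to a taxonomic ID, collapsing shared k-mers to the LCA."""
    database = {}
    for taxid, sequence in genomes.items():
        for kmer in kmers(sequence, k):
            if kmer in database:
                database[kmer] = lca(database[kmer], taxid)
            else:
                database[kmer] = taxid
    return database


def classify(read, database, k):
    """Simplified classification: majority vote over the read's k-mer hits."""
    hits = Counter(database[kmer] for kmer in kmers(read, k) if kmer in database)
    return hits.most_common(1)[0][0] if hits else None


genomes = {"sp1": "ACGTACCA", "sp2": "ACGTAGGT", "sp3": "TTTTGGGG"}
database = build_database(genomes, k=4)

print(database["ACGT"])                    # shared by sp1 and sp2
print(database["TACC"])                    # unique to sp1
print(classify("GTACCA", database, k=4))   # read made of sp1-only k-mers
```

Note that build_database folds in one genome at a time, updating the LCA only for the k-mers that genome contains. Classification is then just one dictionary lookup per k-mer, which is why that step is fast once the mapping exists.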

Figure from Wood and Salzberg, 2014

This figure (from Wood and Salzberg, 2014) shows the process that Kraken uses to do taxonomic identification. It's fast because (once you've computed the k-mer-to-LCA mapping) k-mer lookups are super fast, albeit rather memory intensive.

What you end up with (when using Kraken) is a taxonomic ID for each read, which can then be post-processed with something like Bracken (Lu et al., 2017) to give you an estimated species abundance.

The drawback to Kraken (and many similar approaches) is shared by MetaPalette: the databases are large, and building the databases is time/CPU-intensive. In our experience we're seeing what the manual says: you need 30+ GB RAM to run the software, and many more to build the databases.

(At the end of the day, it may be that many people actually want to use something like Centrifuge, another tool from the inimitable Salzberg group (see Kim et al., 2016). I'm only slowly developing a deeper understanding of the opportunities, limitations, and challenges of the various approaches, and I may blog about that later, but for now let's just say that I have some reasons to prefer the Kraken-style approach.)

More general thoughts

There are a number of challenges that are poorly addressed by current k-mer based classification schemes. One is scalability of classification: I really want to be able to run this stuff on my laptop! Another is scalability of the database build step: I'm OK with running that on bigger hardware than my laptop, but I want to be able to update, recombine, and customize the databases. Here, large RAM requirements are a big problem, and the tooling for database building is rather frustrating as well - more on that below.

I also very much want a library implementation of these things - specifically, a library in Python. Basically, you lose a lot when you communicate between programs through files. (See this blog post for the more general argument.) This would let us intermix k-mer classification with other neat techniques.

More generally, in the current era of "sequence all the things" and the coming era of "ohmigod we have sequenced so many things now what" we are going to be in need of a rich, flexible ecosystem of tools and libraries. This ecosystem will (IMO) need to be:

  • decentralized and locally installable & usable, because many labs will have large internal private data sets that they want to explore;
  • scalable in memory and speed, because the sheer volume of the data is so ...voluminous;
  • customizable and programmable (see above) so that we can try out cool new ideas more easily;
  • making use of databases that can be incrementally (and routinely) updated, so that we can quickly add new reference information without rebuilding the whole database;

and probably some other things I'm not thinking of. The ecosystem aspect here is increasingly important and something I've been increasingly focusing on: approaches that don't work together well are simply not that useful.

Another goal we are going to need to address is classification and characterization of unknowns in metagenomic studies. We are making decent progress in certain areas (metagenome-resolved genomics!!) but there are disturbing hints that we are largely acting like drunks looking for their keys under the streetlight. I believe that we remain in need of systematic, scalable, comprehensive approaches for characterizing environmental metagenome data sets.

This means that we will need to be thinking more and more about reference independent analyses. Of the three above papers, only mash is reference independent; MetaPalette and Kraken both rely on reference databases. Of course, those two tools address the flip side of the coin, which is to properly make use of the reference databases we do have for pre-screening and cross-validation.



Em carta aberta, entidades pedem resolução urgente da crise na CT&I e no Ensino Superior


A SBPC, juntamente a outras 8 entidades representativas das comunidades científica, tecnológica e acadêmica brasileiras e dos sistemas estaduais de ciência e inovação enviaram o documento ao presidente da República hoje. “Vivemos o risco de sofrer uma grande diáspora científica”, alertam na carta

“É muito grave a situação da ciência e tecnologia e das universidades públicas no País”, afirmam a SBPC e outras 8 entidades representativas das comunidades científica, tecnológica e acadêmica brasileiras e dos sistemas estaduais de ciência e inovação, em carta ao presidente da República, Michel Temer, enviada nesta terça-feira, 29 de agosto. O documento descreve a crítica situação da CT&I e da Educação Superior no Brasil e pede a resolução urgente dos problemas apontados.

A carta traz um alerta ao presidente e demais autoridades governamentais, bem como aos parlamentares e toda a população brasileira sobre os riscos que a enorme redução, de quase 50%, dos recursos para ciência, tecnologia, inovação e para a educação superior pública traz para o País. O documento ressalta, entre outros pontos, que universidades e institutos de pesquisa encontram-se em estado de penúria, com o sucateamento de laboratórios e unidades de pesquisa, a diminuição e mesmo a possibilidade de interrupção na concessão de bolsas, a proibição de novos concursos e a ausência de recursos essenciais para a pesquisa científica e tecnológica.

“We are at risk of suffering a great scientific diaspora, with the flight of highly qualified minds, trained with public funds, to more advanced countries that see S&T as an essential instrument for economic development and social well-being,” the organizations warn, adding that investment in ST&I is essential to ensure GDP growth in periods of economic recession.

Alongside the SBPC, the letter is signed by the Academia Brasileira de Ciências (ABC), the Associação Brasileira das Instituições de Pesquisa Tecnológica e Inovação (Abipti), the Associação Brasileira dos Reitores das Universidades Estaduais e Municipais (Abruem), the Associação Nacional dos Dirigentes de Instituições Federais de Ensino Superior (Andifes), the Conselho Nacional das Fundações Estaduais de Amparo à Pesquisa (Confap), the Conselho Nacional de Secretários Estaduais para Assuntos de Ciência e Tecnologia (Consecti), the Fórum Nacional de Gestores de Inovação e Transferência de Tecnologia (Fortec) and the Fórum Nacional de Secretários Municipais da Área de Ciência e Tecnologia.

Read the letter in full below:



His Excellency


Office of the President of the Republic

Brasília, DF


Subject: The state of funding for science, technology and innovation and for higher education.


Mr. President,

The situation of science and technology and of public universities in the country is very serious. The 2017 budget freeze (contingenciamento) on funds for the Ministério da Ciência, Tecnologia, Inovações e Comunicações (MCTIC), applied to budgets already much reduced compared with those of previous years, has produced a drastic cut in funding for ST&I. This reduction, which also affected the budget of the federal public universities, seriously threatens the very survival of Brazilian science, as well as the future of the country and its sovereignty. We, organizations representing Brazil's scientific, technological and academic communities and the state science and innovation systems, hereby alert Your Excellency through this open letter, as well as the other government authorities, members of Congress and the Brazilian population, to the serious risks that this enormous reduction in funding for ST&I and for public higher education poses for the country.

Investment in ST&I is essential to ensure GDP growth in periods of economic recession. This has been the countercyclical policy adopted by countries that stand out on the world economic stage, such as those of the G7 (the USA, Germany, the UK, Japan, France, Italy and Canada), given the return this investment yields in the form of economic development, improved quality of life, global leadership and wealth for those countries.

The return that investment in S&T has already delivered to Brazil is remarkable, even though that investment has been well below that of more developed countries. The invention, in laboratories at public universities and at EMBRAPA, of a process in which nitrogen fixation by plants is carried out by bacteria made it possible to eliminate nitrogen fertilizers in soybean cultivation and quadrupled its productivity, today saving the country about 15 billion reais per year. Collaboration between Petrobras and laboratories at Brazilian universities is responsible for deep-water oil exploration and for the success of the pre-salt fields (Pré-Sal), which today account for 47% of Brazilian oil production. Brazil would not have companies with strong international prominence, such as EMBRAER, EMBRACO and WEG, if we did not have public universities training high-quality professionals and collaborating with these innovative initiatives. The science developed at national S&T institutions is also essential for improving the quality of life of Brazilians. It has benefited public health, contributing to the response to emerging epidemics and to the increase in Brazilians' life expectancy, currently about four years per decade. The recent discovery of the link between the Zika virus and microcephaly was only possible thanks to the pioneering work of Brazilian researchers.

Essential to this record of successes were the progress of graduate education, with 16,000 PhDs awarded per year, and the significant increase in scientific productivity, with Brazil ranking 13th among the countries with the highest scientific output, ahead of nations such as the Netherlands, Russia, Switzerland, Mexico and Argentina. This scenario resulted from sustained investment in universities and research institutes, in particular from CNPq, Capes and Finep, as well as from the state research funding foundations. There is no shortage of new challenges: developing a biotechnology based on Brazilian biodiversity, with the production of new drugs; the search for alternative energy sources; adding value to the minerals found in the national territory; progress in space activities; improving basic education; and social innovations for inclusion and for reducing inequality. All of them have great potential for return in the country's economic and social development.

This virtuous and promising picture, a source of pride for Brazilians, is nevertheless threatened with extinction. The 44% freeze on MCTIC funds reduced that ministry's operating and capital budget (OCC) for the S&T sector to 2.5 billion reais, that is, about 25% of the 2010 OCC, adjusted for inflation. This reduction in ST&I funding extended to other areas of government and spread, in a cascade of funding cuts, to many state secretariats and state research funding foundations, and to state and municipal higher education institutions. It is therefore not surprising that we are in a critical situation, in which many universities and research institutes are in a state of penury, with laboratories and research units falling into disrepair, grants being reduced and possibly even interrupted, a ban on new public hiring competitions, and a lack of funds essential for scientific and technological research.

A clear example is the extremely worrying situation of CNPq, which is still fighting for funds to meet its commitments in 2017, including payment of almost 100,000 holders of undergraduate research (Iniciação Científica), graduate and research grants. Equally critical is the possibility, already outlined in the PLOA (the draft annual budget bill), that budget funds for 2018 will be kept at the extremely low level of those spent in 2017, which will once again put CNPq in a critical situation by the middle of next year. The Fundo Nacional de Desenvolvimento Científico e Tecnológico (FNDCT), which has played a fundamental role in supporting teaching and research institutions and innovative companies, has also been severely hit. In 2017, only a small share of the funds collected for the FNDCT was made available to support non-reimbursable ST&I activities. In the budget forecasts for 2018, these funds will be on the order of 750 million reais, far below the total to be collected, approximately 4.5 billion reais.

This lack of funds also jeopardizes the operation of the research institutes of the MCTIC and of other ministries, highly strategic institutions that are being strangled to the point of having their existence threatened, depriving the Brazilian state of instruments essential to any move toward recovery of the national economy. The reduced and partial funding of the Institutos Nacionais de Ciência e Tecnologia (INCTs), in contrast with the federal government's statement that they would be a priority in the area of ST&I, will also have a profoundly negative impact on Brazilian science and on its necessary internationalization.

Also very serious is the situation of the federal public universities, a system of 63 institutions, 320 campuses and more than one million students, responsible for 57% of the graduate programs in the country and for a significant share of national scientific and technological output, in addition to training highly qualified human resources in all fields of knowledge. With successive cuts to their budgets and the freeze on 2017 funds, on the order of 55% of the investment budget and 25% of the operating budget, the federal public universities are unable to complete construction already begun, meet commitments related to their upkeep, or carry out programs important to their academic and scientific development. The reduction in Capes funding outlined for 2018 is also cause for great concern, given the essential role this agency plays in graduate education and in basic education in the country.

The drop in funding for research institutions and programs, together with the threatened ban on new public hiring competitions, contributes to the impoverishment and deterioration of universities and research institutes by depleting their qualified staff and by the complete demotivation and insecurity it creates among young people who intend to pursue a research career. We are at risk of suffering a great scientific diaspora, with the flight of highly qualified minds, trained with public funds, to more advanced countries that see S&T as an essential instrument for economic development and social well-being.

We therefore stress the urgent need to reverse this scenario by releasing, still in 2017, the frozen funds allocated to the MCTIC and restoring its previously planned budget, which implies a contribution of 2.2 billion reais. It is equally indispensable to guarantee an adequate budget for science and technology in 2018 and to allocate appropriate funds to the federal public universities and to Capes. These are essential conditions for a national project concerned with sustainable development, one that leads to better living conditions for Brazilians and secures the nation's sovereignty.

In expectation of an urgent resolution of the problems raised here, we remain respectfully yours.


Academia Brasileira de Ciências (ABC), Luiz Davidovich

Associação Brasileira das Instituições de Pesquisa Tecnológica e Inovação (Abipti), Júlio Cesar Felix

Associação Brasileira dos Reitores das Universidades Estaduais e Municipais (Abruem), Aldo Nelson Bona

Associação Nacional dos Dirigentes de Instituições Federais de Ensino Superior (Andifes), Emmanuel Zagury Tourinho

Conselho Nacional das Fundações Estaduais de Amparo à Pesquisa (Confap), Maria Zaira Turchi

Conselho Nacional de Secretários Estaduais para Assuntos de Ciência e Tecnologia (Consecti), Francilene Procopio Garcia

Fórum Nacional de Gestores de Inovação e Transferência de Tecnologia (Fortec), Cristina Quintella

Fórum Nacional de Secretários Municipais da Área de Ciência e Tecnologia, André Gomyde Porto

Sociedade Brasileira para o Progresso da Ciência (SBPC), Ildeu de Castro Moreira.


Jornal da Ciência

The post Em carta aberta, entidades pedem resolução urgente da crise na CT&I e no Ensino Superior appeared first on Jornal da Ciência.
