Thursday, May 7, 2020

sourmash databases as zip files, in sourmash v3.3.0

The feature that I'm most excited about in sourmash 3.3.0 is the ability to directly use compressed SBT search databases.

Previously, if you wanted to search (say) 100,000 genomes from GenBank, you'd have to download a several GB .tar.gz file, and then uncompress it out to ~20 GB before searching it. The time and disk space requirements for this were major barriers for teaching and use.

In v3.3.0, Luiz Irber fixed this by, first, releasing the niffler Rust library with Pierre Marijon, to read and write compressed files; second, replacing our old khmer Bloom filter nodegraph with a Rust implementation (sourmash PR #799); and, third, adding direct zip file storage (sourmash #648).

So, as of the latest release, you can do the following:

# install sourmash v3.3.0
conda create -y -n sourmash-demo \
    -c conda-forge -c bioconda sourmash=3.3.0

# activate environment
conda activate sourmash-demo

# download the 25k GTDB release89 guide database (~1.4 GB)
curl -L https://osf.io/5mb9k/download > gtdb-release89-k31.sbt.zip

# grab
(continued...)

from Planet SciPy
read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...