Niklas Rindtorff

Posted on Oct 25, 2022

Running RNA-Seq analysis with distributed file storage and compute (Part 1/2)

Introduction

RNA-Sequencing data analysis is a common workflow in computational biology. In this article we are going to explore how decentralised storage and compute infrastructure can facilitate the processing of such biological data.

This article is the first in a series from LabDAO introducing distributed tools for scientific work. We start with a basic example process and will move towards more sophisticated use-cases and abstractions over time.

Also, if you happen to be heading to Lisbon for IPFS Camp, make sure to stop by the DeSci track that we are hosting together with the Bacalhau team!


What is distributed file storage?

A core hallmark of open science is the sharing of data, both raw and processed, that was created within a project. Ideally, such data are shared online for free and are easily accessible for everyone.

Current solutions for the subsidized storage of scientific data include general-purpose platforms like Figshare and Zenodo, as well as purpose-built repositories such as the NIH Gene Expression Omnibus.

While these tools are very useful, they suffer from three drawbacks:

  1. Location addressing: Any changes to the data after release are usually not reflected in the URL used to download it. The data is addressed by its location on a server, not by its content. This becomes a problem if authors change the shared data after its initial release while colleagues rely on it staying consistent within their analyses (a minimal sketch of content addressing follows this list).

  2. Centralization: If the service storing the data goes offline, the data is no longer available. While many services keep backups for expected emergency scenarios, they are not immune to other forms of intervention.

  3. Post hoc sharing: Research data is often shared online manually, and only after a project is completed. Unless a reviewer checks it for errors, there is no guarantee that the shared data is actually the data that was processed within the project. This is especially challenging when a team of researchers collaborates within a larger consortium.
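To make content addressing concrete, here is a minimal sketch. It assumes the Kubo ipfs CLI is installed locally (any IPFS client works; web3.storage, used later in this tutorial, computes the same kind of address), and the file counts.csv is a hypothetical example:

# a minimal sketch of content addressing, assuming the Kubo `ipfs` CLI is installed
# counts.csv is a hypothetical example file
echo "expression values, version 1" > counts.csv
ipfs add --only-hash -q counts.csv   # prints a CID derived purely from the file's bytes

echo "expression values, version 2" > counts.csv
ipfs add --only-hash -q counts.csv   # a different CID: new content, new address

A location-based URL would stay identical across both versions; a content address cannot.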

Addressing these shortcomings not only makes collaboration among scientists easier, it also leads to more reproducible and open science.

At LabDAO we are using IPFS, the InterPlanetary File System, to facilitate the sharing of data among members of the community. With IPFS, data is shared under a unique content address (similar to a file checksum) and can be hosted on multiple servers around the world. In addition, when paired with distributed compute, new results are shared automatically after every analysis step (e.g. on a daily basis), not just once a project is complete (e.g. once after 18 months).

In this article we will use a popular pinning service for IPFS, web3.storage, and its command-line client to upload and download data.

What is distributed compute?

Distributed compute is a nascent technology that uses containers, such as Docker containers, to process data that was shared on IPFS. Both the input and output of a computation are shared via IPFS, making it easy to share data along every step of the scientific process.

With the container, its data, and the relationship between the two being public, science can happen completely in the open by default. Instead of having to put in extra work to share data and code, these resources are available to everyone automatically. After completing a project, scientists could simply share an annotated computational graph as a supplementary file.

In this tutorial we are going to use Bacalhau to analyse biological data, more specifically RNA sequencing data.

Analyzing RNA-Seq data using Salmon, Bacalhau and web3.storage

RNA-Seq is a common method in biology to measure the state of the cells in a tissue sample. By approximating the frequency with which every gene is “read” within a cell, differences between cell types and cell states (e.g. diseased vs. healthy) can be quantified. The input for a simple RNA-Seq analysis is a bulk sequencing file (.fastq) and the output is an N x P count matrix with N samples and P genes (.csv).
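To make that concrete, a toy count matrix for two samples and three genes could look like this (sample and gene names and all values are made up for illustration):

sample,gene_a,gene_b,gene_c
healthy_1,1023,5,210
diseased_1,870,402,198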

Salmon is an RNA-Seq tool that, in simple terms, uses a process called pseudoquantification to approximate the frequency with which each gene is “read” in a biological sample. Using Salmon is a two-step process that consists of:

  1. the creation of a transcriptome index from a reference transcriptome dataset

  2. the generation of a sample count matrix using the sample raw data and transcriptome index

We will go through these steps below.

Installing an IPFS client

To move files to IPFS effectively, we use a simple pinning service, web3.storage.

We create an account with web3.storage by logging in with our GitHub account. We then generate an API key.

Next we download a command line interface for the API from this repo:

# we assume that you have already installed npm on your local machine
npm install -g @web3-storage/w3up-cli
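To check that the install worked, we can ask the CLI for its help text. We assume here that the installed binary is named w3, as the w3 put call later in this tutorial suggests:

# a quick sanity check that the CLI is on the PATH
w3 --help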

With the web3.storage CLI installed, we now have a tool that will help us move our scientific data to IPFS.

Installing Bacalhau

To also run transformations on the data we store on IPFS, we now need to install the Bacalhau client.

# we assume that you have already installed curl on your local machine
curl -sL https://get.bacalhau.org/install.sh | bash
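As a quick sanity check that the client is installed, the version subcommand should print the client version:

# print the version of the installed Bacalhau client
bacalhau version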

Pinning the reference transcriptome to IPFS

Now that we have installed clients for both distributed storage and distributed compute, we can start with the actual science. The first step when using Salmon is to download a reference transcriptome for the species we want to analyse.

In this case we download the Arabidopsis thaliana reference transcriptome to our local machine:

# we assume that you have already installed curl on your local machine
curl ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -o athal.fa.gz
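Before pinning, it is worth checking that the download came through as an intact gzip archive. This optional step uses standard gzip tooling:

# optional: test the archive for corruption without unpacking it
gunzip -t athal.fa.gz && echo "archive OK"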

We can now pin the reference transcriptome to IPFS:

rindtorff@niklas tutorial % w3 put athal.fa.gz

# Packed 1 file (18.7MB)
# bafybeicqieuiktum7ixkkc6rkzxzxxllogcpbulnlinwfdvl4rqhxbwzsa
⁂ Stored 1 file
⁂ https://dweb.link/ipfs/bafybeicqieuiktum7ixkkc6rkzxzxxllogcpbulnlinwfdvl4rqhxbwzsa

You can check for the presence of the file on IPFS using the link below:

https://gateway.pinata.cloud/ipfs/bafybeicqieuiktum7ixkkc6rkzxzxxllogcpbulnlinwfdvl4rqhxbwzsa/
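If you prefer the command line, a HEAD request against the gateway confirms that the CID resolves without downloading the whole 18.7 MB file (any public IPFS gateway should work):

# optional: check that the gateway resolves the CID without fetching the file body
curl -sI https://gateway.pinata.cloud/ipfs/bafybeicqieuiktum7ixkkc6rkzxzxxllogcpbulnlinwfdvl4rqhxbwzsa/ | head -n 1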

Running Salmon to generate a transcriptome index

Next up we are going to run Salmon to generate a transcriptome index based on the reference transcriptome we just pinned to IPFS.

Usually, if we wanted to run the Salmon Docker container on our local machine, we would run something like:

docker run \
  -v /Users/rindtorff/zettelkasten/labdao/tutorial:/inputs \
  -v /Users/rindtorff/Desktop/:/outputs \
  combinelab/salmon \
  salmon index -t /inputs/athal.fa.gz -i /outputs/athal_index

This lengthy command 1) mounts the reference transcriptome into an input volume, 2) creates a second, local output volume, 3) points to the Salmon Docker image that is maintained by the COMBINE-lab collective, 4) calls the index function and 5) defines the input and output paths.

The results of this operation would be written to my machine’s Desktop folder.

Instead, we are going to run the Salmon container with Bacalhau. The command looks like this:

bacalhau docker run \
  -i bafybeicqieuiktum7ixkkc6rkzxzxxllogcpbulnlinwfdvl4rqhxbwzsa \
  combinelab/salmon \
  -- salmon index -t /inputs/athal.fa.gz -i /outputs/athal_index

While the command is still lengthy, notice that instead of a local file path, we simply mount the file we previously pinned to IPFS.

For specialists: note that we also do not have to specify the mount path of the input volume, as Bacalhau mounts every CID passed with the -i flag at /inputs by default, and the file name is preserved inside the container's filesystem.

Submitting the job returns a job ID (f648a31a-... in our run). We can use it to list the job's status and download the output to our local machine for inspection.

rindtorff@niklas tutorial % bacalhau list --id-filter=f648a31a-bb12-4e79-9c36-0ab28cddcc9a

CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
10:34:20  f648a31a  Docker combinelab/sa...  Published            /ipfs/bafybeia7iumq2...

# we download the results into a directory called index
rindtorff@niklas tutorial % bacalhau get f648a31a --output-dir index
12:37:25.901 | INF bacalhau/get.go:67 > Fetching results of job 'f648a31a'...
12:37:29.62 | INF ipfs/downloader.go:115 > Found 1 result shards, downloading to temporary folder.
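We can now take a quick look at the downloaded files. The exact directory layout depends on the Bacalhau version, but the container's /outputs volume typically ends up inside the chosen output directory:

# list the downloaded results, e.g. the generated transcriptome index
ls -R index | head -n 20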

After creating the index, we can move on to creating a count matrix for an experiment. We will continue with this step in the next article.

About LabDAO

At LabDAO we are building tools for distributed scientific teams to work together. If you want to learn more about what we are up to, join us on our Discord server or follow us on Twitter. We are building on top of Bacalhau and IPFS to bring scientists the most transparent and efficient way to share data and run scientific compute.

Want to stay up to date about LabDAO? Make sure to subscribe below.
