The ZiBRA roadtrip has involved multiple teams working simultaneously on different steps of the overall sequencing pipeline, from sample selection at the LACENs to RT-PCR to library prep, etc… Each step has some metadata associated with it. We want the eventual sequence to maintain as much linkage to associated epidemiological (date and municipality, etc…) and experimental (Ct, barcode, etc…) metadata as possible. Without a strong push these linkages will end up distributed across a plethora of lab notebooks and Excel spreadsheets. In the last few days, I’ve tried to get all the important ZiBRA metadata into a single database.
I’ve been working from nextstrain-db, which was built by Charlton Callender, a very talented undergraduate at the University of Washington, to manage sequence data and associated metadata for nextstrain builds. The basic strategy has been to have an AWS instance running a RethinkDB server that maintains a table with sample-by-sample metadata and sequences. Having a single server is very helpful for keeping data flowing to a single location. However, the lack of internet on the road has complicated this plan. Because of this, I’ve set up a local RethinkDB instance that mirrors the AWS server. Changes can be made to this local database and then ‘pushed’ up to the remote database Git-style when we manage to get internet access. Changes to the remote can also be ‘pulled’ down to the local instance.
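The push/pull idea can be illustrated with a minimal sketch. This is not the nextstrain-db implementation and does not touch RethinkDB at all: the two tables are modelled as plain dictionaries keyed by strain, and the `timestamp` field and the "newer copy wins" rule are assumptions made purely for illustration.

```python
def sync(source, destination):
    """Copy documents from source to destination when the destination
    lacks them or holds an older copy. Returns the keys that changed.

    Assumes every document has a hypothetical 'timestamp' field holding
    an ISO-formatted string, so plain string comparison orders them.
    """
    changed = []
    for key, doc in source.items():
        existing = destination.get(key)
        if existing is None or doc["timestamp"] > existing["timestamp"]:
            destination[key] = dict(doc)  # copy so the two tables stay independent
            changed.append(key)
    return changed


def push(local, remote):
    """Send newer local documents up to the remote table."""
    return sync(local, remote)


def pull(local, remote):
    """Bring newer remote documents down to the local table."""
    return sync(remote, local)
```

Running `push` followed by `pull` leaves both tables holding the newest copy of every document, which is the behaviour you want after a stretch without internet.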
We’re running the app Chateau to provide a web UI for making changes to individual samples in the database. Larger modifications are done through Python scripts that can upload tsv files covering many documents/fields at once (to add new samples or to update metadata of existing samples) or download canonically formatted tsv or fasta files.
Turns out having everyone on board with a web UI and these extra moving parts is overly complex and unworkable. The current protocol is:
Keep one Excel notebook per collection location
Each notebook is a row-by-row sample list of the samples processed for that location. Each notebook should be titled in the form of
ZBRA_lacen_natal_samples.xls (for ZBRA samples) or
ZBRB_lacen_joao_pessoa_samples.xls (for ZBRB samples). These notebooks live in dropbox under
Follow a strict template when completing these Excel notebooks
It looks something like:
Further details of fields are below under “Excel schema”.
Sync these notebooks to the RethinkDB instance
These Excel notebooks can be exported as
.tsv and then synced to the RethinkDB database with:
python vdb/zibra_metadata_upload.py -db vdb -tb zibra --source zibra --virus zika --authors ZiBRA --ftype tsv --fname ZBRA_lacen_natal_samples.tsv
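With one notebook per location, the same command gets typed repeatedly with only the filename changing. The sketch below builds that command for a given tsv file, using only the flags shown above. The filename pattern is inferred from the two example notebook names and is an assumption, not a rule enforced by the upload script; `build_upload_command` is a hypothetical helper.

```python
import re

# Inferred from ZBRA_lacen_natal_samples and ZBRB_lacen_joao_pessoa_samples;
# this pattern is an assumption, not something the upload script checks.
NOTEBOOK_TSV_RE = re.compile(r"ZBR[A-Z]_[a-z]+(?:_[a-z]+)*_samples\.tsv")


def build_upload_command(tsv_name):
    """Return the upload command for one exported notebook as an argument list."""
    if not NOTEBOOK_TSV_RE.fullmatch(tsv_name):
        raise ValueError("unexpected notebook name: %r" % tsv_name)
    return [
        "python", "vdb/zibra_metadata_upload.py",
        "-db", "vdb",
        "-tb", "zibra",
        "--source", "zibra",
        "--virus", "zika",
        "--authors", "ZiBRA",
        "--ftype", "tsv",
        "--fname", tsv_name,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, presumably from the directory where the original command is normally run.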
The Excel schema is as follows:
strain: This study / strain ID in the form of
ZBRB1, etc… This is the primary key of the table and is required for every sample.
sample_number: This is the sample number from this collection location,
lab: Collection lab,
lab_id: This is the LACEN sample ID (or other institution sample ID). Should be a 12-digit number. This is very important for linking the sample to clinical metadata. Each LACEN uses its own IDs.
rt_positive: Whether RT-PCR is positive for Zika,
ct: Ct value of positive RT-PCR result.
extraction_date: Date of RNA extraction.
amplicon_concentration: Concentration of DNA after PCR amplification. Measured in ng/µl.
minion_barcodes: List of MinION library/barcodes associated with sample separated by commas, for example
country: All samples should be
state: Name of Brazilian state of patient origin. Should be snakecase without accents,
municipality: Name of municipality of patient origin. Should be snakecase without accents,
collection_date: Collection date of the sample. Should be formatted as
2015-07-27 (YYYY-MM-DD). If a sample lacks complete date information, enter as
2015-07-XX (day unknown) or
2015-XX-XX (month and day unknown).
onset_date: Date of symptom onset. Should be formatted as
host_species: Host species. Human samples are
patient_age: Patient age,
patient_sex: Patient sex,
pregnant: Whether the patient was pregnant,
pregnancy_week: Week of pregnancy, as a number
pregnancy_trimester: Trimester of pregnancy
microcephaly: Whether sample was linked to microcephaly,
sample_type: Type of sample,
symptoms: Free-form text of clinical symptoms.
notes: Free-form text for notes associated with this sample.
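Several of these rules are mechanical enough to check before syncing. The following is a minimal sketch of such a check, not part of nextstrain-db. It covers only the rules stated explicitly above: strain present, 12-digit lab_id, snake case state and municipality, and the collection_date format. The strain pattern is an assumption generalised from the ZBRB1 example, and `check_row` is a hypothetical helper that takes one spreadsheet row as a dictionary of strings.

```python
import re
from datetime import datetime

STRAIN_RE = re.compile(r"ZBR[A-Z]\d+")  # assumption, generalised from "ZBRB1"
LAB_ID_RE = re.compile(r"\d{12}")
DATE_RE = re.compile(r"\d{4}-(?:XX-XX|\d{2}-(?:XX|\d{2}))")
SNAKE_RE = re.compile(r"[a-z]+(?:_[a-z]+)*")  # lowercase ASCII, so no accents


def valid_collection_date(value):
    """Accept YYYY-MM-DD, YYYY-MM-XX or YYYY-XX-XX."""
    if not DATE_RE.fullmatch(value):
        return False
    _, month, day = value.split("-")
    if month == "XX":
        return True
    if day == "XX":
        return 1 <= int(month) <= 12
    try:
        datetime.strptime(value, "%Y-%m-%d")  # rejects dates like 2015-02-30
    except ValueError:
        return False
    return True


def check_row(row):
    """Return a list of problems found in one row; empty means no issues."""
    problems = []
    if not STRAIN_RE.fullmatch(row.get("strain", "")):
        problems.append("strain: required, expected a form like ZBRB1")
    lab_id = row.get("lab_id", "")
    if lab_id and not LAB_ID_RE.fullmatch(lab_id):
        problems.append("lab_id: expected a 12-digit number")
    for field in ("state", "municipality"):
        value = row.get(field, "")
        if value and not SNAKE_RE.fullmatch(value):
            problems.append(field + ": expected snake case without accents")
    date = row.get("collection_date", "")
    if date and not valid_collection_date(date):
        problems.append("collection_date: expected YYYY-MM-DD with XX for unknowns")
    return problems
```

Empty optional fields are skipped rather than flagged, so a half-completed row still passes as long as its strain is set. Since IDs from other institutions may not be 12 digits, the lab_id message is best read as a prompt to double-check rather than a hard error.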
Notes on the ZiBRA-specific nextstrain-db implementation can be found here. Other documentation:
Special thanks to Charlton Callender for building things and running support from Seattle.