Ranchero

Is your mycobacterial metadata a mess? Grab the M. bovis by the horns with Ranchero.

Ranchero is a Python solution to the dozens of different metadata formats used in genomic datasets. While it is specifically focused on NCBI's collection of Mycobacterium tuberculosis complex metadata, it still has utility for other organisms. For information on what Ranchero considers "a sample" and the like, see ./docs/data_structure.md.

Note

Ranchero should be considered pre-release software, and is currently undergoing a cleanup/refactor. More extensive documentation and examples will be provided once this cleanup is complete.

In addition to housing Ranchero itself, this repo also contains the scripts used to generate metadata TSVs for various pathogens UCSC is keeping an eye on, such as the metadata used to annotate the Taxonium SRA tree for Mycobacterium tuberculosis complex. You can find those scripts in ./compilations.

Features

Powered by polars
- Standardize entire genera in minutes thanks to polars' impressive speeds
- Use polars expressions to do things I didn't think of
Pre-configured to standardize dozens of common NCBI metadata fields
- Automatically merge columns of similar data types into a single column, filling in nulls/empty values as you go
- (MTBC only) Automatically handle lineage, strain, and scientific name
- (MTBC only) Convert old-school strain names (Beijing, LAM, etc) to the modern lineage system (L2.2.1, L4.3, etc)
Input a TSV of metadata to "inject" into an existing dataframe, optionally overriding metadata already present
Convert all of those "missing," "not collected," and "Not Applicable" strings into proper null values
Convert countries into three-letter country codes per ISO 3166
Convert dates into an ISO 8601-like YYYY-MM-DD format
Convert common host animal names to the standardized Genus species format when possible, as well as a common name and confidence score
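
To make the null-conversion and column-merging features above more concrete, here is a minimal plain-Python sketch of the idea. It is not Ranchero's API and does not use polars; the sentinel list is an assumption built from the examples in this README, and the column names `geo_loc_name` and `country` are only illustrative.

```python
# Illustrative only: plain-Python stand-ins, not Ranchero's API.
# The sentinel list is an assumption based on the examples in this README.
NULL_SENTINELS = {"missing", "not collected", "not applicable"}


def to_null(value):
    """Return None for empty or sentinel strings, otherwise the stripped value."""
    if value is None:
        return None
    text = str(value).strip()
    if not text or text.lower() in NULL_SENTINELS:
        return None
    return text


def merge_columns(row, candidates):
    """Return the first non-null value among several similar columns in a row."""
    for name in candidates:
        cleaned = to_null(row.get(name))
        if cleaned is not None:
            return cleaned
    return None


# Hypothetical rows; the column names are chosen for illustration.
rows = [
    {"geo_loc_name": "Not Applicable", "country": "Peru"},
    {"geo_loc_name": "India", "country": "missing"},
    {"geo_loc_name": "not collected", "country": ""},
]
merged = [merge_columns(r, ["geo_loc_name", "country"]) for r in rows]
print(merged)  # ['Peru', 'India', None]
```

The order of `candidates` decides which column wins when several hold a usable value, which is one simple way to think about "filling in nulls as you go".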

Installation

Because ranchero currently relies on a very specific version of polars, it is recommended to install it in a venv like this:

python3 -m venv ./buildvenv
source buildvenv/bin/activate
pip install ranchero

Supported inputs

Platform	Expected format	Ranchero function
BigQuery	newline-delimited JSONL^†	from_bigquery()
Entrez Direct (efetch)	XML^‡	from_efetch()
NCBI SRA web search	XML^‡	from_efetch()
Excel/LibreOffice	TSV (XLSX not supported)	from_tsv()
Google Sheets	TSV	from_tsv()
NCBI Run Selector	CSV	from_run_selector()
basically anything else	TSV	from_tsv()

^† BQ typically outputs JSONs in a format polars does not like; from_bigquery() will fix it on the fly.
^‡ efetch typically outputs invalid XML; from_efetch() will fix it on the fly. However, note that only -db sra -format native -mode xml and output from NCBI SRA web search are supported.
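
For readers unfamiliar with the newline-delimited JSON format expected from BigQuery, the sketch below shows how such a file can be read with only the standard library. It illustrates the format, not what from_bigquery() does internally; the function name `read_ndjson` and the sample field names are made up for this example.

```python
import io
import json


def read_ndjson(handle):
    """Parse newline-delimited JSON: one object per non-blank line."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err
    return records


# Hypothetical two-record export; the field names are invented for illustration.
sample = io.StringIO(
    '{"acc": "SRR0000001", "organism": "Mycobacterium bovis"}\n'
    "\n"
    '{"acc": "SRR0000002", "organism": null}\n'
)
records = read_ndjson(sample)
print(len(records), records[1]["organism"])  # 2 None
```

Reporting the offending line number makes it easier to spot where an export deviates from the one-object-per-line layout.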

Dependencies

If you are pip-installing as recommended above, these will be included automatically.

Python >= 3.10
pandas >= 2.0.0
pyarrow
polars for Python == 1.27.0
tqdm
xmltodict for working with Entrez Direct files
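
If you installed the dependencies by hand rather than through pip, a quick check like the one below can confirm whether the polars version in your environment matches the pin listed above. This is a small standard-library sketch; the helper names `installed_version` and `pin_status` are hypothetical and not part of Ranchero.

```python
from importlib import metadata

# Pin copied from the dependency list above.
EXPECTED_POLARS = "1.27.0"


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_status(package, expected):
    """Describe whether the installed version matches the expected pin."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    if found == expected:
        return f"{package} {found} matches the pin"
    return f"{package} {found} differs from the pinned {expected}"


print(pin_status("polars", EXPECTED_POLARS))
```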

