
export with -m (merge) option #652

Open
lynnjo opened this issue Jan 22, 2024 · 5 comments
@lynnjo commented Jan 22, 2024

Hello -

I am using tiledbvcf to create a dataset that I would later like to export as a merged VCF file. I can successfully load and export data from this dataset, but what I would like to do is export to a multi-sample VCF file. It looks like export with the -m option should handle this, though it gives me memory errors. I added the -b flag to increase the buffer size, but still no luck. The command I am running:

tiledbvcf export --uri tiledb_datasets/gvcf_dataset  -m -b 65536 -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf

The error I get:

Exception: SubarrayPartitioner: Trying to partition a unary range because of memory budget, this will cause the query to run very slow. Increase `sm.memory_budget` and `sm.memory_budget_var` through the configuration settings to avoid this issue. To override and run the query with the same budget, set `sm.skip_unary_partitioning_budget_check` to `true`.

Is there another trick to running the tiledbvcf export command to create a merged VCF? Thank you.

I am running tiledbvcf version:

(phgv2-conda) [lcj34@cbsubl01 phg_v2]$ tiledbvcf --version
TileDB-VCF version 0f72331-modified
TileDB version 2.16.3
htslib version 1.16

My machine runs Linux, with these specifics:

NAME="Rocky Linux"
VERSION="9.0 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.0"
@gspowley (Member)

Hi @lynnjo,

Please try adding the following --tiledb-config options to your export command, which will increase sm.memory_budget to 10 GiB and sm.memory_budget_var to 20 GiB, and skip the memory budget check.

tiledbvcf export \
  --uri tiledb_datasets/gvcf_dataset  \
  -m -b 65536 \
  -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf \
  --tiledb-config sm.memory_budget=10737418240,sm.memory_budget_var=21474836480,sm.skip_unary_partitioning_budget_check=true

The export may be slow, as the original error message warns, because we have not yet optimized the performance of exporting a merged VCF.
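
For reference, the two budget values are just 10 GiB and 20 GiB expressed in bytes, so they are easy to recompute if you want to try different sizes:

echo $((10 * 1024 * 1024 * 1024))   # 10737418240 -> sm.memory_budget
echo $((20 * 1024 * 1024 * 1024))   # 21474836480 -> sm.memory_budget_var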

@lynnjo (Author) commented Jan 22, 2024

Thanks @gspowley - I will try the above.

Do I still keep the "-b 65536" flag while adding the last line you show?

One more question: we note that GATK can export a multi-sample VCF using "gatk GenotypeGVCFs -V gendb://" and that is relatively fast. I understand tiledbvcf is related to GenomicsDB. Is the reason this works well from GATK that GATK does some of the work to merge the files?

@gspowley (Member)

Yes, keeping the -b 65536 option will improve export performance, assuming your system has enough memory. The memory budget parameters may need some tuning based on your dataset and system resources.
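
A quick way to see how much headroom you have before raising the budgets (a generic Linux check, nothing specific to tiledbvcf):

free -h   # the budgets you pass should fit comfortably within the available memory shown here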

@andreaswallberg

May I take the liberty to follow up on this topic?

How does multi-threading apply to the task of exporting data with (or without) the merge option?

We are testing this operation with various data and only ever see a single thread active at 100% for the whole operation.

@gspowley (Member) commented Dec 5, 2024

Right, exporting data is single-threaded.

For large datasets, TileDB provides distributed, parallel queries as described in TileDB Academy.
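
One possible workaround at the CLI level (a sketch only, under the assumption that your tiledbvcf build supports --region-partition in the I:N form; check tiledbvcf export --help): split the export into region slices, run the slices as separate processes, and stitch the outputs back together with bcftools concat. The partition count and file names below are placeholders.

N=4   # number of region partitions; placeholder value
for i in $(seq 0 $((N - 1))); do
  tiledbvcf export \
    --uri tiledb_datasets/gvcf_dataset \
    -m -b 65536 \
    --region-partition ${i}:${N} \
    -o part_${i}.vcf \
    --tiledb-config sm.memory_budget=10737418240,sm.memory_budget_var=21474836480,sm.skip_unary_partitioning_budget_check=true &
done
wait

# Each slice holds the same merged sample set over a disjoint range of regions,
# so the pieces can be concatenated back into one multi-sample VCF.
bcftools concat part_*.vcf -o mergedGvcf.vcf

Mind the memory when doing this: every concurrent process gets its own buffers and budgets, so the per-process values may need to be scaled down accordingly.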
