
export with -m (merge) option #652

Open
lynnjo opened this issue Jan 22, 2024 · 5 comments
@lynnjo commented Jan 22, 2024

Hello -

I am using tiledbvcf to create a dataset that I would later like to export as a merged VCF file. I can successfully load and export data from this dataset, but what I would like to do is export to a multi-sample VCF file. It looks like export with the -m option should handle this, though it gives me memory errors. I added the -b flag to increase the buffer size, but still no luck. The command I am running:

tiledbvcf export --uri tiledb_datasets/gvcf_dataset  -m -b 65536 -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf

The error I get:

Exception: SubarrayPartitioner: Trying to partition a unary range because of memory budget, this will cause the query to run very slow. Increase `sm.memory_budget` and `sm.memory_budget_var` through the configuration settings to avoid this issue. To override and run the query with the same budget, set `sm.skip_unary_partitioning_budget_check` to `true`.

Is there another trick to running the tiledbvcf export command to create a merged VCF? Thank you.

I am running tiledbvcf version:

(phgv2-conda) [lcj34@cbsubl01 phg_v2]$ tiledbvcf --version
TileDB-VCF version 0f72331-modified
TileDB version 2.16.3
htslib version 1.16

My machine runs Linux, with these specifics:

NAME="Rocky Linux"
VERSION="9.0 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.0"
@gspowley (Member)

Hi @lynnjo,

Please try adding the following --tiledb-config options to your export command, which will increase sm.memory_budget to 10 GiB and sm.memory_budget_var to 20 GiB, and skip the memory budget check.

tiledbvcf export \
  --uri tiledb_datasets/gvcf_dataset  \
  -m -b 65536 \
  -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf \
  --tiledb-config sm.memory_budget=10737418240,sm.memory_budget_var=21474836480,sm.skip_unary_partitioning_budget_check=true

The export may be slow, as the original error message warns, because we have not yet optimized the performance of exporting a merged VCF.
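
For reference, the two budget values are just 10 GiB and 20 GiB expressed in bytes, so they are easy to recompute if you want to try different sizes:

echo $((10 * 1024 * 1024 * 1024))   # 10737418240 -> sm.memory_budget
echo $((20 * 1024 * 1024 * 1024))   # 21474836480 -> sm.memory_budget_var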

@lynnjo (Author) commented Jan 22, 2024

Thanks @gspowley - I will try the above.

Do I still keep the "-b 65536" flag while adding the last line you show?

One more question: we note that GATK can export a multi-sample VCF using "gatk GenotypeGVCFs -V gendb://" and that is relatively fast. I understand tiledbvcf is related to GenomicsDB. Is the reason this works well from GATK that GATK does some of the work to merge the files?

@gspowley (Member)

Yes, keeping the -b 65536 option will improve export performance, assuming your system has enough memory. The memory budget parameters may need some tuning based on your dataset and system resources.
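
A quick way to see how much headroom you have before raising the budgets (a generic Linux check, nothing specific to tiledbvcf):

free -h   # the budgets you pass should fit comfortably within the available memory shown here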

@andreaswallberg

May I take the liberty to follow up on this topic?

How does multi-threading apply to the task of exporting data with (or without) the merge option?

We are testing this operation with various data and only ever see a single thread active at 100% for the whole operation.

@gspowley (Member) commented Dec 5, 2024

Right, exporting data is single-threaded.

For large datasets, TileDB provides distributed, parallel queries as described in TileDB Academy.
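
One possible workaround at the CLI level (a sketch only, under the assumption that your tiledbvcf build supports --region-partition in the I:N form; check tiledbvcf export --help): split the export into region slices, run the slices as separate processes, and stitch the outputs back together with bcftools concat. The partition count and file names below are placeholders.

N=4   # number of region partitions; placeholder value
for i in $(seq 0 $((N - 1))); do
  tiledbvcf export \
    --uri tiledb_datasets/gvcf_dataset \
    -m -b 65536 \
    --region-partition ${i}:${N} \
    -o part_${i}.vcf \
    --tiledb-config sm.memory_budget=10737418240,sm.memory_budget_var=21474836480,sm.skip_unary_partitioning_budget_check=true &
done
wait

# Each slice holds the same merged sample set over a disjoint range of regions,
# so the pieces can be concatenated back into one multi-sample VCF.
bcftools concat part_*.vcf -o mergedGvcf.vcf

Mind the memory when doing this: every concurrent process gets its own buffers and budgets, so the per-process values may need to be scaled down accordingly.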
