Contigs missing #22

Open
mikecormier opened this issue Aug 13, 2019 · 8 comments

Comments

@mikecormier

The GRCh38_UCSC2ensembl.txt file is missing contig mappings on the hg38 UCSC side. When I use this file to remap UCSC contigs to Ensembl, the remapping fails because of the missing contigs.

For example, chr10_KN196480v1_fix, chr10_KQ090021v1_fix, chr11_KN196481v1_fix, etc. are all present in the file being remapped, but these contigs are not in GRCh38_UCSC2ensembl.txt.
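
One way to list the affected contigs is to compare the contig names in the file being remapped against the first column of the mapping file. This is only a minimal sketch: it assumes the mapping file has two tab-separated columns (UCSC name, then Ensembl name), and the function names and sample data are hypothetical.

```python
import io


def load_mapping(handle):
    """Read a two-column, tab-separated mapping (source name -> target name)."""
    mapping = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A row without a second column has no target name; store it as "".
        mapping[fields[0]] = fields[1] if len(fields) > 1 else ""
    return mapping


def missing_contigs(contigs, mapping):
    """Return contigs (in input order, de-duplicated) with no usable mapping."""
    seen = set()
    missing = []
    for name in contigs:
        if name in seen:
            continue
        seen.add(name)
        if not mapping.get(name):
            missing.append(name)
    return missing


# Hypothetical sample data standing in for the real mapping file.
mapping = load_mapping(io.StringIO("chr1\t1\nchr10\t10\nchrM\tMT\n"))
contigs_in_file = ["chr1", "chr10", "chr10_KN196480v1_fix", "chr10_KQ090021v1_fix"]
print(missing_contigs(contigs_in_file, mapping))
```

Running this over the contig names taken from the file being remapped would give a list of names to look up when preparing an update.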

I am unaware of other files that may be missing updated contigs, but there may be a few.

Could you update the GRCh38_UCSC2ensembl.txt file, and potentially other files that are missing updated contigs?

@dpryan79
Owner

UCSC is really annoying since it doesn't have coherent releases (e.g., with a release number). If you know what should be matched together then please submit a PR.

@mikecormier
Author

Hey @dpryan79, what is your approach to identifying which UCSC contig matches which Ensembl contig?

@dpryan79
Owner

dpryan79 commented Jan 2, 2020

NCBI hosts a file that lists each sequence under a variety of chromosome naming systems, so I use that and compare the chromosome lengths to ensure they match.
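
A rough sketch of that kind of check, assuming the file is laid out like an NCBI assembly report (tab-separated rows, with the column names in the last `#` comment line). The sample rows and accessions are made up, the function names are hypothetical, and the column that matches the Ensembl name is left as a parameter because it may differ by sequence type.

```python
import io

# Made-up sample in the assembly-report layout; not real GRCh38 content.
SAMPLE_REPORT = """\
# Assembly name:  example
# Sequence-Name\tSequence-Role\tAssigned-Molecule\tAssigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\tRefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name
1\tassembled-molecule\t1\tChromosome\tCM000001.1\t=\tNC_000001.1\tPrimary Assembly\t1000\tchr1
HG_EXAMPLE_PATCH\tfix-patch\t10\tChromosome\tKX000001.1\t=\tNW_000001.1\tPATCHES\t250\tchr10_KX000001v1_fix
"""


def parse_assembly_report(handle):
    """Yield one dict per sequence row, keyed by the report's column names."""
    columns = None
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # The last comment line before the data rows holds the column names.
            columns = line.lstrip("# ").split("\t")
            continue
        if not line or columns is None:
            continue
        yield dict(zip(columns, line.split("\t")))


def build_mapping(rows, lengths, source_col="UCSC-style-name", target_col="Sequence-Name"):
    """Map source names to target names, keeping only rows whose length matches.

    `lengths` maps a source contig name to its expected length, for example
    as read from a chrom.sizes-style file.
    """
    mapping, mismatched = {}, []
    for row in rows:
        source = row.get(source_col, "na")
        if source in ("", "na"):
            continue  # "na" marks a name that is not available
        if lengths.get(source) == int(row["Sequence-Length"]):
            mapping[source] = row[target_col]
        else:
            mismatched.append(source)
    return mapping, mismatched


rows = list(parse_assembly_report(io.StringIO(SAMPLE_REPORT)))
# Hypothetical lengths; the patch length is deliberately wrong to show a mismatch.
lengths = {"chr1": 1000, "chr10_KX000001v1_fix": 999}
mapping, mismatched = build_mapping(rows, lengths)
print(mapping)
print(mismatched)
```

Contigs that end up in `mismatched` are the ones to inspect by hand before adding them to a mapping file.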

@dpryan79
Owner

dpryan79 commented Jan 2, 2020

Note that there are patch contigs added over time, so these have to be updated every year or two.

@mikecormier
Author

The latest UCSC patch is patch 12. Does the NCBI file contain the contigs from these patches?

@dpryan79
Owner

Quite likely, yes.

@nh13

nh13 commented May 19, 2020

You may want to check out the CollectAlternateContigNames tool, which parses the NCBI assembly reports: https://github.com/fulcrumgenomics/fgbio/blob/master/src/main/scala/com/fulcrumgenomics/fasta/CollectAlternateContigNames.scala#L108-L137. It stores the mappings in a valid SAM format (a SAM header with sequence aliases), which supports multiple aliases per sequence, for example GenBank, RefSeq, Ensembl, UCSC-style, and "assigned-molecule" names. It gives a little more control over which molecules to remap and which names and aliases to use. There are also a few UpdateContigName tools in the latest master that can update various file formats. I hope that helps.
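
To illustrate how such a header can be consumed, here is a small sketch that reads aliases back out of `@SQ` lines. It assumes the aliases sit in the `AN` tag as a comma-separated list, which is how the SAM specification defines alternative sequence names; the header lines, accessions, and function name are made up for the example and do not come from the tool's actual output.

```python
def aliases_from_sam_header(lines):
    """Map every alias in an @SQ line's AN tag to that line's SN name."""
    alias_to_name = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] != "@SQ":
            continue
        # Header fields look like TAG:VALUE; split on the first colon only.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        name = tags.get("SN")
        if name is None:
            continue
        for alias in filter(None, tags.get("AN", "").split(",")):
            alias_to_name[alias] = name
    return alias_to_name


# Made-up header lines for illustration only.
header = [
    "@HD\tVN:1.6",
    "@SQ\tSN:1\tLN:1000\tAN:chr1,CM000001.1,NC_000001.1",
    "@SQ\tSN:MT\tLN:500\tAN:chrM",
]
aliases = aliases_from_sam_header(header)
print(aliases["chr1"], aliases["chrM"])
```

Because each sequence can carry several aliases, one header can serve remapping from any of the naming systems to the primary name.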

As an aside, @dpryan79 any interest in a second repo with .dict files, or side-by-side .dict files? I'd be happy to start adding them if you have the source NCBI assembly reports.

@dpryan79
Owner

@nh13 Either side-by-side or a subdirectory would work IMO.

3 participants