Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mismatching numbers of read names were observed #76

Open
kjvilardi opened this issue Mar 16, 2021 · 4 comments
Open

Mismatching numbers of read names were observed #76

kjvilardi opened this issue Mar 16, 2021 · 4 comments

Comments

@kjvilardi
Copy link

Hello!
I am trying to use singlem for multiple samples using the following:

singlem pipe --forward /scratch/vilardi.k/katiefiles/clean_reads2/all_fq/${i}.clean.R1.fq --reverse /scratch/vilardi.k/katiefiles/clean_reads2/all_fq/${i}.clean.R2.fq --otu_table ${i}_otu_table.csv --threads 24

Some samples turn out fine and I get an OTU table as output. However, for some samples I get the following error:

03/13/2021 03:29:00 PM INFO: Using as input 1 different sequence files e.g. /scratch/vilardi.k/katiefiles/clean_reads2/all_fq/Filtered_W2_4_TCCATTGC-AGGTAGGA_L00M.clean.R1.fq
03/13/2021 03:29:00 PM INFO: Searching with 14 SingleM package(s)
03/13/2021 03:29:00 PM INFO: Searching for reads matching 28 different protein HMM(s)
Traceback (most recent call last):
File "/home/vilardi.k/.conda/envs/singlem/bin/singlem", line 513, in
known_sequence_taxonomy = args.known_sequence_taxonomy)
File "/home/vilardi.k/.conda/envs/singlem/lib/python3.6/site-packages/singlem/pipe.py", line 45, in run
otu_table_object = self.run_to_otu_table(**kwargs)
File "/home/vilardi.k/.conda/envs/singlem/lib/python3.6/site-packages/singlem/pipe.py", line 176, in run_to_otu_table
search_result = self._search(hmms, forward_read_files, reverse_read_files)
File "/home/vilardi.k/.conda/envs/singlem/lib/python3.6/site-packages/singlem/pipe.py", line 860, in _search
run(hmms, graftm_protein_search_directory, True)
File "/home/vilardi.k/.conda/envs/singlem/lib/python3.6/site-packages/singlem/pipe.py", line 848, in run
extern.run(cmd)
File "/home/vilardi.k/.conda/envs/singlem/lib/python3.6/site-packages/extern/init.py", line 41, in run
raise ExternCalledProcessError(process, command)
extern.ExternCalledProcessError: Command graftM graft

returned non-zero exit status 1.

mfqe] Iterating input FASTQ file\n[2021-03-13T21:11:51Z INFO mfqe] Extracted 21918 reads from 17218560 total\nthread 'main' panicked at 'Mismatching numbers of read names were observed. Expected:\n[21915]\nbut found\n[21918]', src/main.rs:333:9\nnote: run with RUST_BACKTRACE=1 environment variable to display a backtrace.\n"STDOUT was: b''\n'STDOUT was: b''

(graftM graft comammd is super long so I am just posting the main parts of the error output)

I am not sure how to prevent 'Mismatching numbers of read names'.

@wwood
Copy link
Owner

wwood commented Mar 16, 2021

Hi,

This is usually caused by not having separate forward and reverse files, but I'm guessing based on your command line that isn't true. However, it seems you are using 'cleaned' reads. I suspect that in your cleaned reads file there are (rare) pairs sequences with the same name excluding everything after the first space. If that was true, then this command (untested) would show some:

grep ^@ failing_i.clean.R1.fq |awk '{print $1}' |sort |uniq -c |sort -rn |head

TLDR: Just use raw reads if possible.

Thanks for your interest in singlem.

@dmitrisvetlov
Copy link

dmitrisvetlov commented Jan 27, 2023

Hello Ben,

I am encountering a similar error using a single file of interleaved reads. I have reproduced the full error log below. Looking at the documentation returned by graftM graft --help, I find that for a single file of reads (whether interleaved or singles files are supported is not specified), the call to graftM graft should include a --graftm_package flag.

Looking at the relevant line in pipe.py, I don't see the --graftm_package flag being included in that command-writing code, and you will see in the error log that the flag does not appear in the call to graftm that was actually performed.

Is that a bug that can be fixed? Or can singleM/graftM not operate on single files of interleaved reads? (Also, while these are raw reads, but from JGI, and my understanding is that JGI performs some QC before releasing the data. Is that also a problem?)

Thanks,
Dmitri Svetlov

01/26/2023 05:14:49 PM INFO: Using as input 1 different sequence files e.g. sample_singleM.fastq
01/26/2023 05:14:49 PM INFO: Searching with 14 SingleM package(s)
01/26/2023 05:14:49 PM INFO: Searching for reads matching 28 different protein HMM(s)
Traceback (most recent call last):
  File "/opt/Miniconda2/miniconda2/envs/singlem/bin/singlem", line 513, in <module>
    known_sequence_taxonomy = args.known_sequence_taxonomy)
  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/pipe.py", line 45, in run
    otu_table_object = self.run_to_otu_table(**kwargs)
  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/pipe.py", line 176, in run_to_otu_table
    search_result = self._search(hmms, forward_read_files, reverse_read_files)
  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/pipe.py", line 860, in _search
    run(hmms, graftm_protein_search_directory, True)
  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/pipe.py", line 848, in run
    extern.run(cmd)
  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/extern/__init__.py", line 41, in run
    raise ExternCalledProcessError(process, command)
extern.ExternCalledProcessError: Command graftM graft --verbosity 2 --input_sequence_type nucleotide  --min_orf_length 96   --filter_minimum 28 --threads 20 --forward sample_singleM.fastq --search_only --search_hmm_files /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.1.ribosomal_protein_L2_rplB.gpkg.spkg/S1.1.ribosomal_protein_L2_rplB/graftmUShLiV_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.1.ribosomal_protein_L2_rplB.gpkg.spkg/S1.1.ribosomal_protein_L2_rplB/graftmY0W8yc_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.10.ribosomal_protein_S7.gpkg.spkg/S1.10.ribosomal_protein_S7/graftmunofJO_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.10.ribosomal_protein_S7.gpkg.spkg/S1.10.ribosomal_protein_S7/graftmvR14Hx_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.11.ribosomal_protein_S10_rpsJ.gpkg.spkg/S1.11.ribosomal_protein_S10_rpsJ/graftme9CA2v_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.11.ribosomal_protein_S10_rpsJ.gpkg.spkg/S1.11.ribosomal_protein_S10_rpsJ/graftmmSefD1_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.12.ribosomal_protein_S12_S23.gpkg.spkg/S1.12.ribosomal_protein_S12_S23/graftmD68DOf_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.12.ribosomal_protein_S12_S23.gpkg.spkg/S1.12.ribosomal_protein_S12_S23/graftm4bucq1_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.13.ribosomal_protein_S15P_S13e.gpkg.spkg/S1.13.ribosomal_protein_S15P_S13e/graftmSIv4ri_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.13.ribosomal_protein_S15P_S13e.gpkg.spkg/S1.13.ribosomal_protein_S15P_S13e/graftmh8KCUY_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.14.ribosomal_protein_S19_rpsS.gpkg.spkg/S1.14.ribosomal_protein_S19_rpsS/graftm5WU6Uq_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.14.ribosomal_protein_S19_rpsS.gpkg.spkg/S1.14.ribosomal_protein_S19_rpsS/graftmSHqOh3_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.2.ribosomal_protein_L3_rplC.gpkg.spkg/S1.2.ribosomal_protein_L3_rplC/graftmeZ00Qg_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.2.ribosomal_protein_L3_rplC.gpkg.spkg/S1.2.ribosomal_protein_L3_rplC/graftmID8t_I_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.3.ribosomal_protein_L5_rplE.gpkg.spkg/S1.3.ribosomal_protein_L5_rplE/graftmFIm7K0_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.3.ribosomal_protein_L5_rplE.gpkg.spkg/S1.3.ribosomal_protein_L5_rplE/graftmMH5eE7_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.4.ribosomal_protein_L6_rplF.gpkg.spkg/S1.4.ribosomal_protein_L6_rplF/graftmDE7_8b_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.4.ribosomal_protein_L6_rplF.gpkg.spkg/S1.4.ribosomal_protein_L6_rplF/graftmr_l5Gy_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.5.ribosomal_protein_L11_rplK.gpkg.spkg/S1.5.ribosomal_protein_L11_rplK/graftmCpbnSc_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.5.ribosomal_protein_L11_rplK.gpkg.spkg/S1.5.ribosomal_protein_L11_rplK/graftm5YNr6y_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.6.ribosomal_protein_L14b_L23e_rplN.gpkg.spkg/S1.6.ribosomal_protein_L14b_L23e_rplN/graftmAiWqW8_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.6.ribosomal_protein_L14b_L23e_rplN.gpkg.spkg/S1.6.ribosomal_protein_L14b_L23e_rplN/graftmIkMbHN_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.7.ribosomal_protein_L16_L10E_rplP.gpkg.spkg/S1.7.ribosomal_protein_L16_L10E_rplP/graftmAbXN7F_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.7.ribosomal_protein_L16_L10E_rplP.gpkg.spkg/S1.7.ribosomal_protein_L16_L10E_rplP/graftmWRo94z_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.8.ribosomal_protein_S2_rpsB.gpkg.spkg/S1.8.ribosomal_protein_S2_rpsB/graftmhZA_92_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.8.ribosomal_protein_S2_rpsB.gpkg.spkg/S1.8.ribosomal_protein_S2_rpsB/graftmrgaTTv_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.9.ribosomal_protein_S5.gpkg.spkg/S1.9.ribosomal_protein_S5/graftm4ehmqY_search.hmm /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.9.ribosomal_protein_S5.gpkg.spkg/S1.9.ribosomal_protein_S5/graftmBf7aFs_search.hmm --output_directory /dev/shm/tmpmr2rd0m5/graftm_protein_search --aln_hmm_file /opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/singlem/data/S1.1.ribosomal_protein_L2_rplB.gpkg.spkg/S1.1.ribosomal_protein_L2_rplB/graftmUShLiV_search.hmm  returned non-zero exit status 1.
STDERR was: b'Traceback (most recent call last):\n  File "/opt/Miniconda2/miniconda2/envs/singlem/bin/graftM", line 415, in <module>\n    Run(args).main()\n  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/graftm/run.py", line 613, in main\n    self.graft()\n  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/graftm/run.py", line 388, in graft\n    diamond_db\n  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/graftm/timeit.py", line 10, in timed\n    result = method(*args, **kw)\n  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/graftm/sequence_searcher.py", line 851, in aa_db_search\n    hit_reads_orfs_fasta)\n  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/graftm/sequence_searcher.py", line 943, in search_and_extract_orfs_matching_protein_database\n    hits\n  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/graftm/sequence_searcher.py", line 534, in _extract_from_raw_reads\n    extern.run(extract_cmd, stdin=\'\\n\'.join(input_reads))\n  File "/opt/Miniconda2/miniconda2/envs/singlem/lib/python3.6/site-packages/extern/__init__.py", line 41, in run\n    raise ExternCalledProcessError(process, command)\nextern.ExternCalledProcessError: Command mfqe --output-uncompressed --fasta-read-name-lists /dev/stdin --input-fasta <(awk \'{print ">" substr($0,2);getline;print;getline;getline}\' \'sample_singleM.fastq\') --output-fasta-files \'/tmp/_raw_extracted_reads.fa0c02ucub\' returned non-zero exit status 101.\nSTDERR was: b"[2023-01-27T00:45:04Z INFO  mfqe] Read in 94026 read names from /dev/stdin\\n[2023-01-27T00:45:04Z INFO  mfqe] Iterating input FASTQ file\\n[2023-01-27T00:46:46Z INFO  mfqe] Extracted 188052 reads from 95428802 total\\nthread \'main\' panicked at \'Mismatching numbers of read names were observed. Expected:\\n[94026]\\nbut found\\n[188052]\', src/main.rs:333:9\\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\\n"STDOUT was: b\'\'\n'STDOUT was: b''

@wwood
Copy link
Owner

wwood commented Jan 27, 2023

Hi @dmitrisvetlov,

I believe your error is different, because mfqe is reporting exactly double the number of reads found than it was expecting.

I think this is a straight up bug you found, introduced into graftm when mfqe replaced fxtract. However, I would recommend updating the v1.0 beta (see https://wwood.github.io/singlem/Installation ) as the new version is much better. Will release bioconda when I get out of beta, which should be within a few weeks.

However, in the new version you cannot yet provide interleaved reads. It might be possible to deinterleave them on the fly and stream the forward and reverse e.g. via --forward <(zcat interleaved.fq.gz |awk ....) --reverse .... but I've not tested that. If you try let me know how you go.

I think the easiest thing here is just to deinterleave the reads before providing them to singlem, unless there is some strong reason not to?

Thanks, ben

@dmitrisvetlov
Copy link

Hi Ben,

Thanks for letting me know. I am using the workaround of deinterleaving the reads via reformat.sh within bbtools, and then providing the forward and reverse files directly to singlem pipe. I am leery of deinterleaving on the fly, since I am not sure what sort of memory overhead that might have, although I suspect it should be doable in theory.

Dmitri

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants