Error when running SemiBin2 (normalization?) #159

Open
eperezv opened this issue Mar 8, 2024 · 6 comments


eperezv commented Mar 8, 2024

Hello,

I'm running SemiBin2 on my dataset with the multi_easy_bin option. Everything seemed to work properly until it failed with an error related to normalization. Any idea what causes the issue and/or how to address it?

Thank you

(SemiBin) eduardo@eduardo-PC:/data$ SemiBin2 multi_easy_bin -i contigs.flt.fna -b mapped/*.sort.bam -o semibin2_output --separator _ -p 18
[2024-03-08 10:19:33,306] INFO: Binning for short_read
[2024-03-08 10:19:33,306] INFO: SemiBin will run in self supervised mode
[2024-03-08 10:19:34,370] INFO: Running with GPU.
[2024-03-08 10:19:34,370] INFO: Performing multi-sample binning
[2024-03-08 10:19:34,371] INFO: Generating training data...
[2024-03-08 10:20:17,377] INFO: Calculating coverage for every sample.
[2024-03-08 11:31:04,311] INFO: Processed: mapped/C101.sort.bam
[2024-03-08 11:37:05,271] INFO: Processed: mapped/C102.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C103.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C111.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C112.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C113.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C11.sort.bam
[2024-03-08 11:41:28,363] INFO: Processed: mapped/C12.sort.bam
[2024-03-08 11:50:46,311] INFO: Processed: mapped/C13.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C161.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C162.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C163.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C171.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C172.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C173.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C181.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C182.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C183.sort.bam
[2024-03-08 11:53:33,711] INFO: Processed: mapped/C191.sort.bam
[2024-03-08 12:03:55,805] INFO: Processed: mapped/C192.sort.bam
[2024-03-08 12:08:27,854] INFO: Processed: mapped/C193.sort.bam
[2024-03-08 12:33:25,614] INFO: Processed: mapped/C1.sort.bam
[2024-03-08 12:38:03,411] INFO: Processed: mapped/C21.sort.bam
[2024-03-08 12:38:03,411] INFO: Processed: mapped/C22.sort.bam
[2024-03-08 12:38:03,411] INFO: Processed: mapped/C23.sort.bam
[2024-03-08 12:38:03,411] INFO: Processed: mapped/C2.sort.bam
[2024-03-08 12:38:03,412] INFO: Processed: mapped/C31.sort.bam
[2024-03-08 12:38:03,412] INFO: Processed: mapped/C32.sort.bam
[2024-03-08 12:38:03,412] INFO: Processed: mapped/C33.sort.bam
[2024-03-08 12:38:03,412] INFO: Processed: mapped/C3.sort.bam
[2024-03-08 12:42:33,510] INFO: Processed: mapped/C81.sort.bam
[2024-03-08 12:42:33,510] INFO: Processed: mapped/C82.sort.bam
[2024-03-08 12:42:33,510] INFO: Processed: mapped/C83.sort.bam
[2024-03-08 12:42:33,510] INFO: Processed: mapped/C91.sort.bam
[2024-03-08 12:44:14,776] INFO: Processed: mapped/C92.sort.bam
[2024-03-08 12:48:07,180] INFO: Processed: mapped/C93.sort.bam
[2024-03-08 12:48:07,180] INFO: Processed: mapped/CE1.sort.bam
[2024-03-08 12:48:07,180] INFO: Processed: mapped/CE2.sort.bam
[2024-03-08 12:48:07,180] INFO: Processed: mapped/CE3.sort.bam
[2024-03-08 13:12:59,818] INFO: Training model and clustering for S1CNODE.
[2024-03-08 13:12:59,820] INFO: Start training from a single sample.
[2024-03-08 13:13:00,438] INFO: Training model...
  0%|                                                                                                           | 0/15 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/eduardo/miniconda3/envs/SemiBin/bin/SemiBin2", line 33, in <module>
    sys.exit(load_entry_point('SemiBin==2.1.0', 'console_scripts', 'SemiBin2')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin-2.1.0-py3.12.egg/SemiBin/main.py", line 1563, in main2
    multi_easy_binning(
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin-2.1.0-py3.12.egg/SemiBin/main.py", line 1326, in multi_easy_binning
    training(logger, None, args.num_process,
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin-2.1.0-py3.12.egg/SemiBin/main.py", line 1103, in training
    model = train_self(logger,
            ^^^^^^^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin-2.1.0-py3.12.egg/SemiBin/self_supervised_model.py", line 77, in train_self
    train_data_depth = normalize(train_data_depth, axis=1, norm='l1')
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/scikit_learn-1.4.1.post1-py3.12-linux-x86_64.egg/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/scikit_learn-1.4.1.post1-py3.12-linux-x86_64.egg/sklearn/preprocessing/_data.py", line 1925, in normalize
    X = check_array(
        ^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/scikit_learn-1.4.1.post1-py3.12-linux-x86_64.egg/sklearn/utils/validation.py", line 1072, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 39)) while a minimum of 1 is required by the normalize function.
Collaborator

psj1997 commented Mar 11, 2024

It seems this is still the error that occurs when combining the k-mer features and abundance features. Can you have a look at the files generated by SemiBin for every sample (data.csv/data_split.csv/cov.csv)? How many columns are in these files?

Thanks!
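
For anyone following along, the column counts asked about here can be collected with a short script. This is a minimal sketch using only the Python standard library; the function names `summarize_csv` and `summarize_directory` are hypothetical helpers and not part of SemiBin, and the sketch assumes each file is a comma-separated table with a single header row.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (n_columns, n_data_rows) for a CSV file with one header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty file -> zero columns
        n_rows = sum(1 for row in reader if row)  # ignore blank lines
    return len(header), n_rows


def summarize_directory(sample_dir):
    """Map every *.csv file name in sample_dir to its (columns, rows) pair."""
    return {
        path.name: summarize_csv(path)
        for path in sorted(Path(sample_dir).glob("*.csv"))
    }


def format_summary(summary):
    """Render the summary as text, flagging files that have no data rows."""
    lines = []
    for name, (n_cols, n_rows) in summary.items():
        flag = "  <-- header only" if n_rows == 0 else ""
        lines.append(f"{name}: {n_cols} columns, {n_rows} data rows{flag}")
    return "\n".join(lines)
```

Calling `print(format_summary(summarize_directory("path/to/sample_dir")))` on one of the per-sample folders (the path is a placeholder) lists each file with its column and row counts. A file with zero data rows would be worth checking first, since the traceback reports an array with 0 samples.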

Author

eperezv commented Mar 11, 2024

I see a folder containing the fasta files and files like C1.sort.bam_21_data.cov.csv and C1.sort.bam_21_data_split_cov.csv. There are also folders for each sample that may contain what you are asking for.
data.csv contains 176 columns (i.e., one with no header, 135 columns named 1, 2, 3... and then another 39 columns named like mapped/C1.sort.bam_cov).
data_split.csv is the same as before, but it contains just the headers.
data_cov.csv contains 40 columns (one with numbers + 39 that are my samples, same as before, mapped/C1...).
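
The layout described above (one unnamed index column, a block of numerically named columns, then one `_cov` column per BAM file) can be checked programmatically. The sketch below is only an illustration under those assumptions; `classify_columns` is a hypothetical helper, not a SemiBin function. The 39 coverage columns reported here match the 39 BAM files in the log and the second dimension of `shape=(0, 39)` in the traceback, so the count of coverage columns looks consistent and the missing rows are the part that stands out.

```python
def classify_columns(header):
    """Count index, k-mer and coverage columns in a CSV header.

    Assumptions taken from the description in this thread:
    - the first column is the (unnamed) contig index,
    - k-mer feature columns have purely numeric names,
    - coverage columns end with "_cov".
    Anything else is counted as "other".
    """
    counts = {"index": 0, "kmer": 0, "coverage": 0, "other": 0}
    for position, name in enumerate(header):
        if position == 0:
            counts["index"] += 1
        elif name.isdigit():
            counts["kmer"] += 1
        elif name.endswith("_cov"):
            counts["coverage"] += 1
        else:
            counts["other"] += 1
    return counts


# Small made-up header for illustration (not the real file contents).
example_header = ["", "1", "2", "3",
                  "mapped/C1.sort.bam_cov", "mapped/C2.sort.bam_cov"]
example_counts = classify_columns(example_header)
```

For the made-up header, `example_counts` holds one index column, three k-mer columns and two coverage columns. On a real file, the header can be read with `csv.reader` and passed in the same way.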

Collaborator

psj1997 commented Mar 11, 2024

Can you show the first five rows of data.csv, data_split.csv, data_csv.csv and cov_split.csv?

Author

eperezv commented Mar 11, 2024

I don't have exactly the files you indicate, but these are the ones I have (per sample):

data.csv
image

data_split.csv
image

data_cov.csv
image

data_split_cov.csv
image

Collaborator

psj1997 commented Mar 11, 2024

Can you check the first column of data_split_cov.csv? Are the entries like '1581622_1, 1581622_2'? Thanks!
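
A quick way to run this check without opening the file by hand is sketched below. It assumes the first column holds the contig names and that split halves are marked by a trailing `_1` or `_2`, as in the example above; `first_column` and `count_split_suffixes` are hypothetical helper names. Note that this is only a heuristic: contig names that already end in `_1` or `_2` for other reasons would also match.

```python
import csv
import re

# Matches names ending in "_1" or "_2", e.g. "1581622_1".
SPLIT_SUFFIX = re.compile(r"_[12]$")


def first_column(path):
    """Return the first field of every data row (header row skipped)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if present
        return [row[0] for row in reader if row]


def count_split_suffixes(names):
    """Return (names with a _1/_2 suffix, names without one)."""
    with_suffix = sum(1 for name in names if SPLIT_SUFFIX.search(name))
    return with_suffix, len(names) - with_suffix
```

For example, `count_split_suffixes(first_column("data_split_cov.csv"))` (with the path adjusted to the sample folder) returns how many entries carry the suffix and how many do not.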

Author

eperezv commented Mar 11, 2024

There is no _1, _2... Only what's shown.
