Busco (5.8.0) extra test to indicate bug #7209

lsterck · 2025-08-08T17:19:30Z

FOR CONTRIBUTOR:

I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
License permits unrestricted use (educational + commercial)
This PR adds a new tool or tool collection
This PR updates an existing tool or tool collection
This PR does something else (explain below)

this is linked to #6664
the stderr of the planemo test run indicates what the problem is:

/tmp/tmpgge8wz04/job_working_directory/000/4/galaxy_4.sh: line 104: [: too many arguments\n/tmp/tmpgge8wz04/job_working_directory/000/4/galaxy_4.sh: line 105: [: too many arguments

Not sure how to fix this though.
Perhaps it should become a discover_dataset kinda output capture to be able to get data from multiple output folders?

…ersionnumber update

bgruening · 2025-08-08T17:48:52Z

@lsterck try running planemo with planemo test --no_cleanup and try to look at "/tmp/tmpgge8wz04/job_working_directory/000/4/galaxy_4.sh" or paste it here, maybe we understand what is going on.

Thanks for your contributions @lsterck!

lsterck · 2025-08-08T17:51:39Z

Hi @bgruening ,

that should be what I wrote in #6664 .
I will run the planemo with the requested options nonetheless and come back here.

bgruening · 2025-08-08T18:06:04Z

Oh, true, no need to run it again, the issues says it all.

bernt-matthias · 2025-08-10T20:21:44Z

What are the names of the folders matching run*?

lsterck · 2025-08-11T08:45:52Z

in the case I reported it's the following:

run_bacteria_odb10 (linked dir, pointing to busco_galaxy/auto_lineage/run_bacteria_odb10 )
run_bacilli_odb10 (regular dir)

where in the case there is only 1 (top level) dir it's only the linked dir that is present

bernt-matthias · 2025-08-11T10:26:18Z

Can you ask upstream which of the two should be used? Also the concatenations happening at the end of the tool's command block seem questionable in this context.

lsterck · 2025-08-11T11:26:09Z

what do you mean? if you can tell prior to running it (or seeing the output) which of the dirs you will need?
If so, then no you can't, that's just the point of this auto-lineage parameter ...

Indeed, have been looking at the cat part too ...
I think it has been designed for the one dir output only use-case.

I'm currently looking into if I can 'solve' it in the command block of the wrapper.

bernt-matthias · 2025-08-11T11:38:01Z

what do you mean?

The question would be why there may be sometimes multiple output dirs (resp symlinks). Shouldnt there be only one also in the auto-lineage case?

lsterck · 2025-08-11T12:06:45Z

ah, in that sense ... yeah, good point but don't know tbh
I think it is due to how busco works, it will first do a 'general' search (genre bacteria/eukaryote/... ) and then will based on that result see if there are not more specific clades to report .
I have the impression it runs the whole busco again, internally, if it finds a more specific clade

Are you hinting to the fact we should (could?) bring this to the busco people first before implementing a fix in galaxy?

lsterck · 2025-08-11T12:08:10Z

been quickly browsing through their doc and it this seems to be intended behaviour:
https://busco.ezlab.org/busco_userguide.html#auto-select-lineage

It will be come down to handing this in the wrapper it seems (might be doable in the command block?)

lsterck · 2025-08-11T12:40:56Z

what if we add something as follows to the command block (after run but before the concatenation step) :

if test $(ls -d run_* | wc -l) -gt 1; then
  find . -type l -delete
fi

that will remove the link to dir and only in the case when there are multiple dirs present.
it will keep the symlink dir when there is only one present (== the case where with auto-lineage there is no more specific clade found) and reduce the number of run_* folders to 1 so the initial reported if statement will not fail anymore.
Will be the easiest workable patch I think

this of course makes the assumption that only the result of the most specific clade is reported, which in most (all?) cases is a valid assumption as that is exactly what one wants to achieve with the auto-lineage parameter.
Not sure how others feel about this so do correct me if I'm wrong.

lsterck · 2025-08-11T14:14:20Z

adding this:

#if $outputs:
  && if test `ls -d busco_galaxy/run_* 2>/dev/null | wc -l` -gt 1; then  find busco_galaxy/ -type l -delete; fi
#end if

to the command block in the wrapper seems to pass both tests (including the failed one on multiple dirs).

bernt-matthias · 2025-08-11T17:12:43Z

been quickly browsing through their doc and it this seems to be intended behaviour: https://busco.ezlab.org/busco_userguide.html#auto-select-lineage

It will be come down to handing this in the wrapper it seems (might be doable in the command block?)

Not sure about the docs:

The run_ folders contain the results for the respective parent datasets. The final selected dataset will be at the top level of the results folder.

So far so good :)

In the event that placement cannot be performed due to insufficient marker genes, BUSCO will report the results from the optimal parent dataset. This results folder will be soft-linked from the top level of the results folder.

Then why do we have folder + symlink? In which case are we?

The short_summary* files for the parent dataset ("generic") and the final selected dataset ("specific") are located at the top level of the results folder.

This we also need to keep in mind.

lsterck · 2025-08-11T20:49:46Z

In the event that placement cannot be performed due to insufficient marker genes, BUSCO will report the results from the optimal parent dataset. This results folder will be soft-linked from the top level of the results folder.

Then why do we have folder + symlink? In which case are we?

ah, in the case where the placement could be done and a more specific busco was found.
Why then there is still a symlinked folder is a good question indeed (from that sentence I would conclude that there should not be a symlinked folder in "our" case) .
I assume that on completion of the 'first run', to determine the big clade (archaea, bacteria, ... ) it will create the symlink folder to the best parent dataset, regardless of being a more specific one or not (== it will only look for the more specific one after the symlink folder has been created)
The example they describe in those docs is exactly the case where we are in though (it also has a symlinked folder and a regular folder with the more specific run output)

The short_summary* files for the parent dataset ("generic") and the final selected dataset ("specific") are located at the top level of the results folder.

This we also need to keep in mind.

Those are indeed present in the top folder but are, as far as I can see, not used in the wrapper for collecting the output. The wrapper gets the final output from the subdirs directly and not from those top folder files.

lsterck · 2025-08-11T20:56:00Z

I assume it's these parts of the wrapper (in the output section) that causes the issue:

        <data name='busco_sum' format='txt' label="${tool.name} on ${on_string}: short summary" from_work_dir="busco_galaxy/run_*/short_summary.txt">
            <filter>outputs and 'short_summary' in outputs</filter>
        </data>

those from_work_dir likely expects a single file and the wildcard expansion will work when there is only one dir and fail when multiple dirs are present.
(here you also see that it collects the output from the subdirs directly and not from the top folder files that are mentioned in the busco docs)

scorreard · 2025-08-12T11:15:27Z

Thanks for sharing the issue! Agreed with you last comment, it is likely the issue.

I think it would be great to keep all the output as some lineages are not very good (e.g. mollusca) and so sometimes we refer to Eukaryota even if we have a more specific one available.

If we were to rename the files after the analysis, for example :

For each run (e.g. run_archaea_odb12)
Rename the output files to include the run name (therefore no files will have the same names anymore) (e.g. run_archaea_odb12_short_summary.txt)
Output all the files finishing by *_short_summary.txt (There will be one or several outputs, but none will have the same name)

Do you think that would be a viable solution? Is the file you are using and that causes the crash publicly available?

bernt-matthias · 2025-08-12T11:42:27Z

I think it would be great to keep all the output as some lineages are not very good (e.g. mollusca) and so sometimes we refer to Eukaryota even if we have a more specific one available.

We can keep all outputs. Then my suggestion would be to replace all <data> outputs by data+discover_outputs (see) or a collection.

I'm just afraid that this can not be used easily in automatic workflows anymore. Before we had a single static output (which can be connected to downstream tools) and after this we have dynamic outputs (even if it will be in most cases only one) that are only known after the tool finished .. this will be tricky or impossible to use in workflows.

Maybe we can keep the static output and add dynamic ones for the cases where needed. But there needs to be some logic to determine which folder/file is the primary one.

If we were to rename the files after the analysis, for example :

For each run (e.g. run_archaea_odb12)
Rename the output files to include the run name (therefore no files will have the same names anymore) (e.g. run_archaea_odb12_short_summary.txt)
Output all the files finishing by *_short_summary.txt (There will be one or several outputs, but none will have the same name)

Do you think that would be a viable solution?

Guess so.

Is the file you are using and that causes the crash publicly available?

Yes. The test description is here and the test fasta here. I could also lookup the command line in the CI if needed.

scorreard · 2025-08-12T12:03:00Z

I think it would be great to keep all the output as some lineages are not very good (e.g. mollusca) and so sometimes we refer to Eukaryota even if we have a more specific one available.

We can keep all outputs. Then my suggestion would be to replace all <data> outputs by data+discover_outputs (see) or a collection.

Not sure about what you mean by discover_outputs but I'll dive in the doc

I'm just afraid that this can not be used easily in automatic workflows anymore. Before we had a single static output (which can be connected to downstream tools) and after this we have dynamic outputs (even if it will be in most cases only one) that are only known after the tool finished .. this will be tricky or impossible to use in workflows.

BUSCO outputs are usually one of the final output of a workflow, BUSCO outputs are rarely used as input files in a workflow, I think only multiQC (and maybe Blobtools) could use BUSCO output. But yes, we should still be careful about not breaking workflows.
With your logic below, we would get one static output that could be used downstream?

Maybe we can keep the static output and add dynamic ones for the cases where needed. But there needs to be some logic to determine which folder/file is the primary one.

Yes! What about something like

Grep "specific_lineage" from short_summary.specific.methanococcales_odb12.<output_folder>.json (e.g. methanococcales_odb12 here)
Rename all the specific lineages files with ${specific_lineage}_filename

For all the other runs (e.g. run_archaea_odb12)
Rename the output files to include the run name (therefore no files will have the same names anymore) (e.g. run_archaea_odb12_short_summary.txt)

Output ${specific_lineage}_short_summary.txt as the static one
Output all the other files finishing by *_short_summary.txt (These would be the dynamic ones, there may be none, one or several outputs, but none will have the same name)

What do you think?

Is the file you are using and that causes the crash publicly available?

Yes. The test description is here and the test fasta here. I could also lookup the command line in the CI if needed.

I can try to code this and ping you for review?

bernt-matthias · 2025-08-12T13:05:15Z

What do you think?

Sounds feasible

I can try to code this and ping you for review?

Excellent :)

lsterck · 2025-08-12T17:33:11Z

Thanks for sharing the issue! Agreed with you last comment, it is likely the issue.

I think it would be great to keep all the output as some lineages are not very good (e.g. mollusca) and so sometimes we refer to Eukaryota even if we have a more specific one available.

Valid point indeed! and I already felt a bit "annoyed' to have to remove info/results

If we were to rename the files after the analysis, for example :

For each run (e.g. run_archaea_odb12)
Rename the output files to include the run name (therefore no files will have the same names anymore) (e.g. run_archaea_odb12_short_summary.txt)
Output all the files finishing by *_short_summary.txt (There will be one or several outputs, but none will have the same name)

I'm a bit lost to be honest.
Such files are already there in the top folder. But that's only the short_summary output ones, not any of the other output files (full_table and missing ones for instance).
The files in the topfolder are not used for collecting the output though.
The thing is that there might be several output files and that the output tag can't handle I think.

Anyway, eager to see what you can throw together code-wise :-)
I'll also put some more thought in it

scorreard · 2025-08-20T11:05:14Z

Hi!
The new test LGTM, but it fails the tests (expected bahvior).
Should we include you code for this new test into the PR that solves the issue of the empty table (#7220)?

lsterck · 2025-08-20T11:30:03Z

Sounds like a good idea to me indeed.
It's also not that special the test, it's mainly the input file that is key (== to have one that allows the subspecies detection)

I assume the more admin-like people (@bgruening , @bernt-matthias ) have the best view on this, but for me it's a go to include it in your PR.
(I'll try to do a review of that asap btw :) )

scorreard · 2025-09-04T15:16:28Z

The changes were merged into #7220.
I suggest we close this PR if the other one is merged.

lsterck · 2025-10-06T14:56:41Z

all resolved in #7220

lsterck added 4 commits July 29, 2025 19:28

added test for issue galaxyproject#6664

05e1981

updated .shed file galaxyproject#2593 , added exit code and wrapper v…

0cdaab7

…ersionnumber update

moved contig_break option to geno mode only

41b1024

improved test

817b35c

lsterck marked this pull request as ready for review August 8, 2025 17:19

This was referenced Aug 14, 2025

BUSCO wrapper update to correct issue of empty table #7220

Merged

Bug in xml wrapper for busco (causing emtpy output files, without any warnings/errors) #6664

Closed

scorreard added a commit to scorreard/tools-iuc that referenced this pull request Sep 4, 2025

Update busco.xml to add test from galaxyproject#7209

6bea380

lsterck closed this Oct 6, 2025

lsterck deleted the busco_bug_fix branch October 6, 2025 14:56

Busco (5.8.0) extra test to indicate bug #7209

Busco (5.8.0) extra test to indicate bug #7209

Uh oh!

Conversation

lsterck commented Aug 8, 2025

Uh oh!

bgruening commented Aug 8, 2025

Uh oh!

lsterck commented Aug 8, 2025

Uh oh!

bgruening commented Aug 8, 2025

Uh oh!

bernt-matthias commented Aug 10, 2025

Uh oh!

lsterck commented Aug 11, 2025

Uh oh!

bernt-matthias commented Aug 11, 2025

Uh oh!

lsterck commented Aug 11, 2025

Uh oh!

bernt-matthias commented Aug 11, 2025

Uh oh!

lsterck commented Aug 11, 2025

Uh oh!

lsterck commented Aug 11, 2025

Uh oh!

lsterck commented Aug 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lsterck commented Aug 11, 2025

Uh oh!

bernt-matthias commented Aug 11, 2025

Uh oh!

lsterck commented Aug 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lsterck commented Aug 11, 2025

Uh oh!

scorreard commented Aug 12, 2025

Uh oh!

bernt-matthias commented Aug 12, 2025

Uh oh!

scorreard commented Aug 12, 2025

Uh oh!

bernt-matthias commented Aug 12, 2025

Uh oh!

lsterck commented Aug 12, 2025

Uh oh!

scorreard commented Aug 20, 2025

Uh oh!

lsterck commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scorreard commented Sep 4, 2025

Uh oh!

lsterck commented Oct 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lsterck commented Aug 11, 2025 •

edited

Loading

lsterck commented Aug 11, 2025 •

edited

Loading

lsterck commented Aug 20, 2025 •

edited

Loading