Skip to content

Conversation

@lsterck
Copy link
Contributor

@lsterck lsterck commented Aug 8, 2025

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

this is linked to #6664
the stderr of the planemo test run indicates what the problem is:

/tmp/tmpgge8wz04/job_working_directory/000/4/galaxy_4.sh: line 104: [: too many arguments\n/tmp/tmpgge8wz04/job_working_directory/000/4/galaxy_4.sh: line 105: [: too many arguments

Not sure how to fix this though.
Perhaps it should become a discover_dataset kinda output capture to be able to get data from multiple output folders?

@lsterck lsterck marked this pull request as ready for review August 8, 2025 17:19
@bgruening
Copy link
Member

@lsterck try running planemo with planemo test --no_cleanup and try to look at "/tmp/tmpgge8wz04/job_working_directory/000/4/galaxy_4.sh" or paste it here, maybe we understand what is going on.

Thanks for your contributions @lsterck!

@lsterck
Copy link
Contributor Author

lsterck commented Aug 8, 2025

Hi @bgruening ,

that should be what I wrote in #6664 .
I will run the planemo with the requested options nonetheless and come back here.

@bgruening
Copy link
Member

Oh, true, no need to run it again, the issues says it all.

@bernt-matthias
Copy link
Contributor

What are the names of the folders matching run*?

@lsterck
Copy link
Contributor Author

lsterck commented Aug 11, 2025

in the case I reported it's the following:

  • run_bacteria_odb10 (linked dir, pointing to busco_galaxy/auto_lineage/run_bacteria_odb10 )
  • run_bacilli_odb10 (regular dir)

where in the case there is only 1 (top level) dir it's only the linked dir that is present

@bernt-matthias
Copy link
Contributor

Can you ask upstream which of the two should be used? Also the concatenations happening at the end of the tool's command block seem questionable in this context.

@lsterck
Copy link
Contributor Author

lsterck commented Aug 11, 2025

what do you mean? if you can tell prior to running it (or seeing the output) which of the dirs you will need?
If so, then no you can't, that's just the point of this auto-lineage parameter ...

Indeed, have been looking at the cat part too ...
I think it has been designed for the one dir output only use-case.

I'm currently looking into if I can 'solve' it in the command block of the wrapper.

@bernt-matthias
Copy link
Contributor

what do you mean?

The question would be why there may be sometimes multiple output dirs (resp symlinks). Shouldnt there be only one also in the auto-lineage case?

@lsterck
Copy link
Contributor Author

lsterck commented Aug 11, 2025

ah, in that sense ... yeah, good point but don't know tbh
I think it is due to how busco works, it will first do a 'general' search (genre bacteria/eukaryote/... ) and then will based on that result see if there are not more specific clades to report .
I have the impression it runs the whole busco again, internally, if it finds a more specific clade

Are you hinting to the fact we should (could?) bring this to the busco people first before implementing a fix in galaxy?

@lsterck
Copy link
Contributor Author

lsterck commented Aug 11, 2025

been quickly browsing through their doc and it this seems to be intended behaviour:
https://busco.ezlab.org/busco_userguide.html#auto-select-lineage

It will be come down to handing this in the wrapper it seems (might be doable in the command block?)

@lsterck
Copy link
Contributor Author

lsterck commented Aug 11, 2025

what if we add something as follows to the command block (after run but before the concatenation step) :

if test $(ls -d run_* | wc -l) -gt 1; then
  find . -type l -delete
fi

that will remove the link to dir and only in the case when there are multiple dirs present.
it will keep the symlink dir when there is only one present (== the case where with auto-lineage there is no more specific clade found) and reduce the number of run_* folders to 1 so the initial reported if statement will not fail anymore.
Will be the easiest workable patch I think

this of course makes the assumption that only the result of the most specific clade is reported, which in most (all?) cases is a valid assumption as that is exactly what one wants to achieve with the auto-lineage parameter.
Not sure how others feel about this so do correct me if I'm wrong.

@lsterck
Copy link
Contributor Author

lsterck commented Aug 11, 2025

adding this:

#if $outputs:
  && if test `ls -d busco_galaxy/run_* 2>/dev/null | wc -l` -gt 1; then  find busco_galaxy/ -type l -delete; fi
#end if

to the command block in the wrapper seems to pass both tests (including the failed one on multiple dirs).

@bernt-matthias
Copy link
Contributor

been quickly browsing through their doc and it this seems to be intended behaviour: https://busco.ezlab.org/busco_userguide.html#auto-select-lineage

It will be come down to handing this in the wrapper it seems (might be doable in the command block?)

Not sure about the docs:

The run_ folders contain the results for the respective parent datasets. The final selected dataset will be at the top level of the results folder.

So far so good :)

In the event that placement cannot be performed due to insufficient marker genes, BUSCO will report the results from the optimal parent dataset. This results folder will be soft-linked from the top level of the results folder.

Then why do we have folder + symlink? In which case are we?

The short_summary* files for the parent dataset ("generic") and the final selected dataset ("specific") are located at the top level of the results folder.

This we also need to keep in mind.

@lsterck
Copy link
Contributor Author

lsterck commented Aug 11, 2025

In the event that placement cannot be performed due to insufficient marker genes, BUSCO will report the results from the optimal parent dataset. This results folder will be soft-linked from the top level of the results folder.

Then why do we have folder + symlink? In which case are we?

ah, in the case where the placement could be done and a more specific busco was found.
Why then there is still a symlinked folder is a good question indeed (from that sentence I would conclude that there should not be a symlinked folder in "our" case) .
I assume that on completion of the 'first run', to determine the big clade (archaea, bacteria, ... ) it will create the symlink folder to the best parent dataset, regardless of being a more specific one or not (== it will only look for the more specific one after the symlink folder has been created)
The example they describe in those docs is exactly the case where we are in though (it also has a symlinked folder and a regular folder with the more specific run output)

The short_summary* files for the parent dataset ("generic") and the final selected dataset ("specific") are located at the top level of the results folder.

This we also need to keep in mind.

Those are indeed present in the top folder but are, as far as I can see, not used in the wrapper for collecting the output. The wrapper gets the final output from the subdirs directly and not from those top folder files.

@lsterck
Copy link
Contributor Author

lsterck commented Aug 11, 2025

I assume it's these parts of the wrapper (in the output section) that causes the issue:

        <data name='busco_sum' format='txt' label="${tool.name} on ${on_string}: short summary" from_work_dir="busco_galaxy/run_*/short_summary.txt">
            <filter>outputs and 'short_summary' in outputs</filter>
        </data>

those from_work_dir likely expects a single file and the wildcard expansion will work when there is only one dir and fail when multiple dirs are present.
(here you also see that it collects the output from the subdirs directly and not from the top folder files that are mentioned in the busco docs)

@scorreard
Copy link
Contributor

Thanks for sharing the issue! Agreed with you last comment, it is likely the issue.

I think it would be great to keep all the output as some lineages are not very good (e.g. mollusca) and so sometimes we refer to Eukaryota even if we have a more specific one available.

If we were to rename the files after the analysis, for example :

For each run (e.g. run_archaea_odb12)
Rename the output files to include the run name (therefore no files will have the same names anymore) (e.g. run_archaea_odb12_short_summary.txt)
Output all the files finishing by *_short_summary.txt (There will be one or several outputs, but none will have the same name)

Do you think that would be a viable solution? Is the file you are using and that causes the crash publicly available?

@bernt-matthias
Copy link
Contributor

I think it would be great to keep all the output as some lineages are not very good (e.g. mollusca) and so sometimes we refer to Eukaryota even if we have a more specific one available.

We can keep all outputs. Then my suggestion would be to replace all <data> outputs by data+discover_outputs (see) or a collection.

I'm just afraid that this can not be used easily in automatic workflows anymore. Before we had a single static output (which can be connected to downstream tools) and after this we have dynamic outputs (even if it will be in most cases only one) that are only known after the tool finished .. this will be tricky or impossible to use in workflows.

Maybe we can keep the static output and add dynamic ones for the cases where needed. But there needs to be some logic to determine which folder/file is the primary one.

If we were to rename the files after the analysis, for example :

For each run (e.g. run_archaea_odb12)
Rename the output files to include the run name (therefore no files will have the same names anymore) (e.g. run_archaea_odb12_short_summary.txt)
Output all the files finishing by *_short_summary.txt (There will be one or several outputs, but none will have the same name)

Do you think that would be a viable solution?

Guess so.

Is the file you are using and that causes the crash publicly available?

Yes. The test description is here and the test fasta here. I could also lookup the command line in the CI if needed.

@scorreard
Copy link
Contributor

I think it would be great to keep all the output as some lineages are not very good (e.g. mollusca) and so sometimes we refer to Eukaryota even if we have a more specific one available.

We can keep all outputs. Then my suggestion would be to replace all <data> outputs by data+discover_outputs (see) or a collection.

Not sure about what you mean by discover_outputs but I'll dive in the doc

I'm just afraid that this can not be used easily in automatic workflows anymore. Before we had a single static output (which can be connected to downstream tools) and after this we have dynamic outputs (even if it will be in most cases only one) that are only known after the tool finished .. this will be tricky or impossible to use in workflows.

BUSCO outputs are usually one of the final output of a workflow, BUSCO outputs are rarely used as input files in a workflow, I think only multiQC (and maybe Blobtools) could use BUSCO output. But yes, we should still be careful about not breaking workflows.
With your logic below, we would get one static output that could be used downstream?

Maybe we can keep the static output and add dynamic ones for the cases where needed. But there needs to be some logic to determine which folder/file is the primary one.

Yes! What about something like

Grep "specific_lineage" from short_summary.specific.methanococcales_odb12.<output_folder>.json (e.g. methanococcales_odb12 here)
Rename all the specific lineages files with ${specific_lineage}_filename

For all the other runs (e.g. run_archaea_odb12)
Rename the output files to include the run name (therefore no files will have the same names anymore) (e.g. run_archaea_odb12_short_summary.txt)

Output ${specific_lineage}_short_summary.txt as the static one
Output all the other files finishing by *_short_summary.txt (These would be the dynamic ones, there may be none, one or several outputs, but none will have the same name)

What do you think?

Is the file you are using and that causes the crash publicly available?

Yes. The test description is here and the test fasta here. I could also lookup the command line in the CI if needed.

I can try to code this and ping you for review?

@bernt-matthias
Copy link
Contributor

What do you think?

Sounds feasible

I can try to code this and ping you for review?

Excellent :)

@lsterck
Copy link
Contributor Author

lsterck commented Aug 12, 2025

Thanks for sharing the issue! Agreed with you last comment, it is likely the issue.

I think it would be great to keep all the output as some lineages are not very good (e.g. mollusca) and so sometimes we refer to Eukaryota even if we have a more specific one available.

Valid point indeed! and I already felt a bit "annoyed' to have to remove info/results

If we were to rename the files after the analysis, for example :

For each run (e.g. run_archaea_odb12)
Rename the output files to include the run name (therefore no files will have the same names anymore) (e.g. run_archaea_odb12_short_summary.txt)
Output all the files finishing by *_short_summary.txt (There will be one or several outputs, but none will have the same name)

I'm a bit lost to be honest.
Such files are already there in the top folder. But that's only the short_summary output ones, not any of the other output files (full_table and missing ones for instance).
The files in the topfolder are not used for collecting the output though.
The thing is that there might be several output files and that the output tag can't handle I think.

Anyway, eager to see what you can throw together code-wise :-)
I'll also put some more thought in it

@scorreard
Copy link
Contributor

Hi!
The new test LGTM, but it fails the tests (expected bahvior).
Should we include you code for this new test into the PR that solves the issue of the empty table (#7220)?

@lsterck
Copy link
Contributor Author

lsterck commented Aug 20, 2025

Sounds like a good idea to me indeed.
It's also not that special the test, it's mainly the input file that is key (== to have one that allows the subspecies detection)

I assume the more admin-like people (@bgruening , @bernt-matthias ) have the best view on this, but for me it's a go to include it in your PR.
(I'll try to do a review of that asap btw :) )

scorreard added a commit to scorreard/tools-iuc that referenced this pull request Sep 4, 2025
@scorreard
Copy link
Contributor

The changes were merged into #7220.
I suggest we close this PR if the other one is merged.

@lsterck
Copy link
Contributor Author

lsterck commented Oct 6, 2025

all resolved in #7220

@lsterck lsterck closed this Oct 6, 2025
@lsterck lsterck deleted the busco_bug_fix branch October 6, 2025 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants