Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(website): Column remapping when submitting metadata files #3478

Open
wants to merge 80 commits into
base: main
Choose a base branch
from

Conversation

fhennig
Copy link
Contributor

@fhennig fhennig commented Dec 23, 2024

resolves #3432

preview URL: https://column-remapping.loculus.org/

Summary

  • When a metadata file is uploaded, it is now possible to add a column mapping
  • The mapping is only applied when the file is uploaded
  • If a new file is uploaded, the mapping is preserved as best as possible.
  • If no mapping is configured, the file isn't read (compressed files are not loaded either)
  • The mapping cannot map the same input field twice
  • The mapping needs to contain all required input fields

Screenshot

image

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@fhennig fhennig changed the base branch from main to excel-upload December 23, 2024 12:29
@fhennig fhennig self-assigned this Dec 24, 2024
@fhennig fhennig added the preview Triggers a deployment to argocd label Dec 24, 2024
@fhennig fhennig force-pushed the column-remapping branch 2 times, most recently from 771d376 to f6b607a Compare January 7, 2025 12:09
@fhennig fhennig force-pushed the column-remapping branch 2 times, most recently from 3bbd86a to bdb1122 Compare January 8, 2025 13:27
@fhennig

This comment was marked as outdated.

@theosanderson
Copy link
Member

I think the logic needs to be reversed, with a row for each "Column in your file", for which you can choose the inputField to map to. This is because often the uploaded file may only have like 4 columns from the selection.

@fhennig
Copy link
Contributor Author

fhennig commented Jan 8, 2025

ok!

@chaoran-chen
Copy link
Member

What happens if you have a column that is not mentioned in the inputFields? The API/backend would accept it and leave it to the pipeline to decide what to do with it. With this PR, would the website now reject files that would still be possible to submit directly through the API?

@fhennig
Copy link
Contributor Author

fhennig commented Jan 8, 2025

Files are generally not rejected.

The column remapping however would force you to either map a column to a column that is specified in the input fields, or to just not include the column.

You can also just ignore column remapping altogether, and still submit files with whatever structure you like to the backend.

@chaoran-chen
Copy link
Member

Ah, so it's possible to skip the remapping? (Sorry, I haven't tried out the feature yet!)

@fhennig
Copy link
Contributor Author

fhennig commented Jan 8, 2025

yes. Remapping is only done if you click the button to add a column mapping. I also took care to not decompress TSV files if no column mapping is to be applied! (I still need to test if this actually works the way I implemented it)

@fhennig

This comment was marked as outdated.

@rneher
Copy link

rneher commented Jan 10, 2025

I just tried this on the preview.

Overall, the flow seems sensible to me. But there were a few things I stumbled on.

  • we effectively do not have a required metadata format anymore. So the text on the left needs adjusting
  • it would be good if the button "Column mapping" had help tool-tip or similar that explains what this does (e.g. "If you are not using our metadata template, this allows you to map columns in your file to the fields expected by the database")
  • It was quite hard to find the entries in the long list. the list seems vaguely alphabetical, but important fields like "collection date" were all the way down in a seemingly random place

image

@rneher
Copy link

rneher commented Jan 10, 2025

Ideally, I think we should have required columns at the top of the list. And if easy to do, we could highly columns that are already selected with a different background color or similar?

@rneher
Copy link

rneher commented Jan 10, 2025

beyond that, I think it would be good to merge this in asap after some testing.

@fhennig
Copy link
Contributor Author

fhennig commented Jan 13, 2025

I checked about "And if easy to do, we could highly columns that are already selected with a different background color or similar?" -> Unfortunately the native select element options cannot be styled. I experimented with putting UTF-8 symbols with the text but wasn't happy. I think it's quite helpful though to see which ones are already mapped. A proper solution would, however, involve getting rid of the native select element. Probably not for this PR.

EDIT: I ended up actually doing this; replacing the native select element. It is a bit usability improvement I think!

Base automatically changed from excel-upload to main January 13, 2025 16:29
@fhennig fhennig force-pushed the column-remapping branch 2 times, most recently from db4e992 to 61fd1d1 Compare January 14, 2025 11:31
@fhennig
Copy link
Contributor Author

fhennig commented Jan 14, 2025

Ok, I decided to use ListBox by headlessUI since we already use that for the Autocomplete Field as well, and now I have this:

image

@fhennig fhennig requested a review from anna-parker January 15, 2025 09:43
@fhennig fhennig marked this pull request as ready for review January 15, 2025 14:20
@fhennig
Copy link
Contributor Author

fhennig commented Jan 16, 2025

@anna-parker preview working again!

@anna-parker

This comment was marked as resolved.

@fhennig

This comment was marked as resolved.

/* Apply this mapping to a TSV file, returning a new file with remapped columns. */
public async applyTo(tsvFile: ProcessedFile): Promise<File> {
const text = await tsvFile.text();
const inputRows = text.trim().split('\n');
Copy link
Member

@chaoran-chen chaoran-chen Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're not using a full TSV parser? This would cause problems if there is a multiline cell which, I think, is a reasonable use case for fields like version comment. The following file doesn't work:

submissionId	versionComment	geoLocCountry	sampleCollectionDate
test1	"Paragraph 1...

...and paragraph 2"	USA	2011-07-18
test2	"Paragraph 1...

...and paragraph 2"	USA	2011-07-18

If we add column remapping, it will throw the error:

image

Without the remapping, it works fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I didn't know that multi line cells were possible.

I'm now using a library and also adapted the test: d87c684

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: used a different lib.

Copy link
Contributor

@anna-parker anna-parker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Felix - this is looking really great I found one bug if you could fix it - I was also wondering if you could add a couple more desired fields to the values.yaml to check this works correctly - this is also a place where there could be duplications - a field can be desired and also sample-specific - so I wanted to test duplications are handled correctly there :-)

Update: you can see a list of desired fields here: https://pathoplexus.org/docs/concepts/metadataformat, I was also adding them here: https://github.com/pathoplexus/pathoplexus/pull/335/files#diff-21a8c5ff083727625bb435b82d73142d7b8bd0a857948a5743aaab1f96b29fd7

sourceColumns.forEach((sourceColumn) => {
const bestMatch = this.getBestMatchingTargetColumn(sourceColumn, availableFields);
mapping.set(sourceColumn, bestMatch);
availableFields = availableFields.filter((field) => field.name !== bestMatch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! I found another issue here, so first you need to do an exact match and then only fuzzy matching, for example if I have my list of columns [geoLocAdmin3, geoLocAdmin1] then geoLocAdmin3 will get mapped to geoLocAdmin1 and geoLocAdmin1 will get mapped to geoLocAdmin2, see example below where I did this on the preview:

image

It might also make sense to mark which fields have not been mapped exactly (e.g. score under 1) with a different color so that the user will quickly see that this mapping happened - otherwise they might miss sth being mapped incorrectly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! I had a murky feeling about the mapping, good think you spotted it. I've fixed it now. Also the non-exact matches are not in italic, does that work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added all the 'desired' flags!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Triggers a deployment to argocd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(website): support column remapping for uploaded metadata files
5 participants