Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JSON-LD bulk import module #10798 #10885

Merged
merged 47 commits into from
Jun 13, 2024
Merged

Conversation

jacobtylerwalls
Copy link
Member

@jacobtylerwalls jacobtylerwalls commented May 7, 2024

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Description of Change

Add a bulk import module that calls the CLI loader for JSON-LD.

Add some hooks to BaseImportModule to make it more extensible for things like:

  • shelling out to a CLI command
  • wrapping the tile save in a transaction to support overwriting (if resources must be deleted before excess tile checks, that shouldn't become permanent if the tile save later fails)

Issues Solved

Closes #10798

Checklist

  • Unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Ticket Background

  • Sponsored by: Yale LUX
  • Found by: @
  • Tested by: @
  • Designed by: @

Further comments

Follow-up tickets I found as part of this work:

Testing instructions

I tested with the sample models in the unit tests. I opened a django shell, ran most of arches.tests.importer.jsonld_import_tests.JsonLDImportTests.setUpClass(), hacking bits in/out as needed, and then manually created some resources of the various models, viewed the JSON-LD at the /resources route, and zipped that up according to the instructions in #10798.

Copy link
Member

@apeters apeters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing any error details when the load fails even though I can see more info in the db.
In the UI:
Screenshot 2024-05-13 at 6 50 30 PM

From the load_errors table:
Screenshot 2024-05-13 at 6 52 38 PM

It would be nice if the error message showed the actual name of the .json file that triggered the error instead of just the "block". Additionally I notice that the errors shown in the UI specify Validation Errors, but the errors I'm seeing are really load errors. Maybe we need both, but for load errors it may be enough to show the json file that errored and it's corresponding error message.

Also, it would be nice if users could opt to pass the --ignore-errors flag so that all resources that pass are loaded to the db and the ones that don't pass log errors in the ui.

@jacobtylerwalls
Copy link
Member Author

jacobtylerwalls commented May 14, 2024

Thanks, good notes. The current error UX looks like this...

error-ui

... and clicking "Full error report" gives:

full-error-report

Additionally I notice that the errors shown in the UI specify Validation Errors, but the errors I'm seeing are really load errors. Maybe we need both, but for load errors it may be enough to show the json file that errored and it's corresponding error message.

I understood load errors and validation errors to be synonyms in this context. What was the distinction you had in mind?

Also, it would be nice if users could opt to pass the --ignore-errors flag so that all resources that pass are loaded to the db and the ones that don't pass log errors in the ui.

This is an interesting idea, but I think the whole interface of the bulk data manager assumes that you're not going to perform a load unless everything passes. In the scenario you describe, you'd have a partial success / partial failure state. I don't think the bulk data manager will support that.

@jacobtylerwalls
Copy link
Member Author

PS -- I could copy over the message you see in that JSON dump with detailed information into the error field so that it appears in the UI without clicking for more details. That's an easy change, but could clutter the table UI (and other node datatype validation errors that didn't happen during "parsing a block" could no longer be distinguished).

@jacobtylerwalls jacobtylerwalls requested a review from apeters May 14, 2024 16:55
Copy link
Member

@apeters apeters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the Validation Errors table I would change Column Name to File Name. Also the Count column appears to be incorrect (0 index issue?). Are the Node Alias and Details columns used?

Screenshot 2024-06-03 at 2 22 54 PM

Comment on lines 216 to 218
def stage_files(self, files, summary, cursor):
for file in files:
self.stage_excel_file(file, summary, cursor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like such a small (unless I'm missing something) but nice refactor that I think we should just do this refactor now.

@jacobtylerwalls
Copy link
Member Author

Solved the magic string and the column names by just overwriting the template block 👍

Looking at the Validation Errors table I would change Column Name to File Name. Also the Count column appears to be incorrect (0 index issue?). Are the Node Alias and Details columns used?

  • Node alias should be populated if a tile failed validation as part of staging the tiles. (Most of the failures you and I have been seeing during testing are the "early failures" in the CLI. As we discussed, I'm using a dummy node for those, so showing that dummy node alias here would not be helpful.) I could add an N/A if you think it's best.
  • Details: fixed -- query wasn't quite right
  • Count: same

@jacobtylerwalls jacobtylerwalls requested a review from apeters June 5, 2024 17:48
@jacobtylerwalls
Copy link
Member Author

The validation error table looks like this now (with the first "Details" link expanded):

validation-error-table

serialized_rollback is the doc'd way to ensure
the data from the initial migration is present
in a TransactionTestCase. The little gotchas:
- setUpClass() runs before the serialized data is restored, so it shouldn't do db stuff
- signals shouldn't create other related objects if raw=True
Copy link
Member

@apeters apeters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @jacobtylerwalls . Thanks for all those last minute UI tweaks.

@apeters apeters merged commit 1fb1607 into dev/7.6.x Jun 13, 2024
7 checks passed
@apeters apeters deleted the jtw/json-ld-zip-import branch June 13, 2024 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Yale - Import data from external source to arches via UI
2 participants