Add JSON-LD bulk import module #10798 #10885

jacobtylerwalls · 2024-05-07T20:17:32Z

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Description of Change

Add a bulk import module that calls the CLI loader for JSON-LD.

Add some hooks to BaseImportModule to make it more extensible for things like:

shelling out to a CLI command
wrapping the tile save in a transaction to support overwriting (if resources must be deleted before excess tile checks, that shouldn't become permanent if the tile save later fails)
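
To illustrate why the overwrite case wants a transaction, here is a minimal sketch using the standard library's sqlite3 module rather than Arches or Django. The `tiles` table, the `overwrite_resource` helper, and the sample rows are all hypothetical; the only point is that a delete followed by a failed save is rolled back together.

```python
import sqlite3


def overwrite_resource(conn, resource_id, new_tiles):
    """Delete a resource's tiles and save replacements as one atomic unit."""
    try:
        # "with conn" commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute("DELETE FROM tiles WHERE resource_id = ?", (resource_id,))
            conn.executemany(
                "INSERT INTO tiles (resource_id, value) VALUES (?, ?)",
                [(resource_id, value) for value in new_tiles],
            )
    except sqlite3.IntegrityError:
        # The failed save also undid the delete, so the old tiles survive.
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tiles (resource_id TEXT NOT NULL, value TEXT NOT NULL)")
with conn:
    conn.execute("INSERT INTO tiles VALUES ('r1', 'original')")

# The second replacement tile violates NOT NULL, so the whole overwrite fails.
ok = overwrite_resource(conn, "r1", ["replacement", None])
remaining = [row[0] for row in conn.execute("SELECT value FROM tiles")]
```

After the failed overwrite, `remaining` still holds only the original tile, which is the behavior the hook is meant to allow.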

Issues Solved

Closes #10798

Checklist

Unit tests pass locally with my changes
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Ticket Background

Sponsored by: Yale LUX
Found by: @
Tested by: @
Designed by: @

Further comments

Follow-up tickets I found as part of this work:

Testing instructions

I tested with the sample models in the unit tests. I opened a django shell, ran most of arches.tests.importer.jsonld_import_tests.JsonLDImportTests.setUpClass(), hacking bits in/out as needed, and then manually created some resources of the various models, viewed the JSON-LD at the /resources route, and zipped that up according to the instructions in #10798.

apeters

I'm not seeing any error details when the load fails even though I can see more info in the db.
In the UI:

From the load_errors table:

It would be nice if the error message showed the actual name of the .json file that triggered the error instead of just the "block". Additionally, I notice that the errors shown in the UI specify Validation Errors, but the errors I'm seeing are really load errors. Maybe we need both, but for load errors it may be enough to show the json file that errored and its corresponding error message.

Also, it would be nice if users could opt to pass the --ignore-errors flag so that all resources that pass are loaded to the db and the ones that don't pass log errors in the ui.

jacobtylerwalls · 2024-05-14T16:41:56Z

Thanks, good notes. The current error UX looks like this...

... and clicking "Full error report" gives:

> Additionally I notice that the errors shown in the UI specify Validation Errors, but the errors I'm seeing are really load errors. Maybe we need both, but for load errors it may be enough to show the json file that errored and its corresponding error message.

I understood load errors and validation errors to be synonyms in this context. What was the distinction you had in mind?

> Also, it would be nice if users could opt to pass the --ignore-errors flag so that all resources that pass are loaded to the db and the ones that don't pass log errors in the ui.

This is an interesting idea, but I think the whole interface of the bulk data manager assumes that you're not going to perform a load unless everything passes. In the scenario you describe, you'd have a partial success / partial failure state. I don't think the bulk data manager will support that.
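
To make the two behaviors under discussion concrete, here is a rough sketch in plain Python. It is not how load_jsonld or the bulk data manager is implemented; `stage_documents` and its `ignore_errors` parameter are hypothetical names used only to contrast all-or-nothing loading with a partial load that records per-file errors.

```python
import json


def stage_documents(documents, ignore_errors=False):
    """Parse JSON documents keyed by file name; return (loaded, errors)."""
    loaded, errors = {}, {}
    for name, raw in documents.items():
        try:
            loaded[name] = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Key the error by file name so it can be reported per file.
            errors[name] = str(exc)
    if errors and not ignore_errors:
        # All-or-nothing: a single bad file means nothing is loaded.
        loaded = {}
    return loaded, errors


docs = {"a.json": '{"id": 1}', "b.json": "{not json"}
strict_loaded, strict_errors = stage_documents(docs)
lenient_loaded, lenient_errors = stage_documents(docs, ignore_errors=True)
```

The lenient call leaves the caller holding both loaded data and errors at once, which is the partial success / partial failure state described above.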

jacobtylerwalls · 2024-05-14T16:44:05Z

PS -- I could copy over the message you see in that JSON dump with detailed information into the error field so that it appears in the UI without clicking for more details. That's an easy change, but could clutter the table UI (and other node datatype validation errors that didn't happen during "parsing a block" could no longer be distinguished).

apeters

Looking at the Validation Errors table I would change Column Name to File Name. Also the Count column appears to be incorrect (0 index issue?). Are the Node Alias and Details columns used?

arches/app/templates/views/components/etl_modules/base-import.htm

apeters · 2024-06-03T19:10:49Z

arches/app/etl_modules/base_import_module.py

+    def stage_files(self, files, summary, cursor):
+        for file in files:
+            self.stage_excel_file(file, summary, cursor)


This seems like such a small (unless I'm missing something) but nice refactor that I think we should just do this refactor now.
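
For reference, a sketch of what the refactor could look like with `stage_files()` as an abstract hook. Only `stage_files` and `stage_excel_file` come from the diff above; the class names and the batch-oriented subclass are hypothetical stand-ins, not the actual Arches classes.

```python
from abc import ABC, abstractmethod


class ImportModuleSketch(ABC):
    """Hypothetical stand-in for a base import module."""

    @abstractmethod
    def stage_files(self, files, summary, cursor):
        """Stage every uploaded file; each subclass decides how."""


class ExcelImporterSketch(ImportModuleSketch):
    def stage_files(self, files, summary, cursor):
        # Same per-file loop as in the diff above.
        for file in files:
            self.stage_excel_file(file, summary, cursor)

    def stage_excel_file(self, file, summary, cursor):
        summary.setdefault("staged", []).append(file)


class BatchImporterSketch(ImportModuleSketch):
    def stage_files(self, files, summary, cursor):
        # A module that hands the whole batch to one call (for example a
        # CLI loader) overrides the hook instead of looping per file.
        summary["staged_batch"] = list(files)


excel_summary = {}
ExcelImporterSketch().stage_files(["a.xlsx", "b.xlsx"], excel_summary, None)
batch_summary = {}
BatchImporterSketch().stage_files(["one.json", "two.json"], batch_summary, None)
```

Making the hook abstract means a subclass that forgets to implement it fails at instantiation rather than silently inheriting Excel-specific behavior.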

jacobtylerwalls · 2024-06-05T17:48:06Z

Solved the magic string and the column names by just overriding the template block 👍

> Looking at the Validation Errors table I would change Column Name to File Name. Also the Count column appears to be incorrect (0 index issue?). Are the Node Alias and Details columns used?

Node Alias: should be populated if a tile failed validation as part of staging the tiles. (Most of the failures you and I have been seeing during testing are the "early failures" in the CLI. As we discussed, I'm using a dummy node for those, so showing that dummy node alias here would not be helpful.) I could add an N/A if you think that's best.
Details: fixed -- the query wasn't quite right
Count: same (fixed by the same query change)

jacobtylerwalls · 2024-06-05T17:49:16Z

The validation error table looks like this now (with the first "Details" link expanded):

serialized_rollback is the doc'd way to ensure the data from the initial migration is present in a TransactionTestCase. The little gotchas:

setUpClass() runs before the serialized data is restored, so it shouldn't do db stuff
signals shouldn't create other related objects if raw=True
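
The raw=True gotcha can be sketched without Django installed. The function below is a plain-Python stand-in with the keyword arguments a post_save receiver gets; `create_related_record` and its return value are hypothetical. In a real project it would be connected to a model's post_save signal, and the test class would set `serialized_rollback = True`.

```python
def create_related_record(sender, instance, created, raw=False, **kwargs):
    """Stand-in for a post_save receiver that creates a related object."""
    if raw:
        # Serialized or fixture data is being restored as-is; the related
        # rows arrive with that data, so creating them here would duplicate.
        return None
    if created:
        # Stand-in for creating the related object in the database.
        return {"owner": instance}
    return None


# Normal save of a new instance: the related record is created.
normal = create_related_record(None, "resource-1", created=True)
# Restore of serialized data (raw=True): the receiver does nothing.
restored = create_related_record(None, "resource-1", created=True, raw=True)
```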

apeters

This looks great, @jacobtylerwalls. Thanks for all those last-minute UI tweaks.

jacobtylerwalls added 21 commits May 6, 2024 18:19

Add migration for JSON-LD import module #10798

63ae837

Initial commit of JSON-LD import components

b8ecdea

Initial commit of JSON-LD import backend (etl module) #10798

7f0edaf

Populate staging table re #10798

f99e6ed

Decompose populate_staging_table() re #10798

9e1d8d0

Wrap delete_from_default_storage() in finally re #10798

7aa8d2d

Remove templates re #10798

98c4dd0

Handle errors from load_jsonld command re #10798

e33a342

Allow JSON uploads re #10798

35b4f1b

Surface more node information in errors re #10798

1e17e31

Capture early failures re #10798

f95c51d

Improve error handling

f83e77a

Adjust DS_Store exception logic

f2fbc58

Various fixes

dff6ff0

Fix excel uploads with file type checking enabled

7438dfe

Fix LoadStaging tile value

0bf698b

Work around missing nodes in __get_nodegroup_tree

9f7fe1b

Allow overwriting resources re #10798

1d7a968

Implement run_load_task_async re #10798

792d7f7

Decompose save_to_tiles() to allow wrapping inside a transaction.

582678d

The tiles triggers can't be disabled during a transaction.

Add unit test re #10798

ab98af3

jacobtylerwalls requested a review from apeters May 7, 2024 20:17

jacobtylerwalls assigned apeters May 7, 2024

jacobtylerwalls linked an issue May 7, 2024 that may be closed by this pull request

Yale - Import data from external source to arches via UI #10798

Closed

jacobtylerwalls added 6 commits May 7, 2024 16:27

Add minor incompatibility notice re #10798

e8ba32c

Workaround test isolation issue re: runtime trigger disables

09330d3

nit: typo

29dab69

Remove cheesy test image (for now...)

bba8893

nits re #10798

95b0bc5

Preserve backward compat in BaseImportModule.__init__()

ed4dbd8

Merge branch 'dev/7.6.x' into jtw/json-ld-zip-import

01251eb

apeters requested changes May 14, 2024

jacobtylerwalls added 4 commits May 14, 2024 09:47

Surface FileValidationError to user

5c96ae6

Fix fallback node

eada4a7

Surface load errors to UI

d5a3aac

Use file name as source in load errors

2cf5529

jacobtylerwalls requested a review from apeters May 14, 2024 16:55

apeters and others added 4 commits May 16, 2024 12:34

update error messages with better node info, re #10798

0d1da51

Merge branch 'dev/7.6.x' into jtw/json-ld-zip-import

ffb1fa3

Merge branch 'dev/7.6.x' into jtw/json-ld-zip-import

f20eb5b

Fix migration conflict re #10798

993b22f

apeters requested changes Jun 3, 2024

jacobtylerwalls added 5 commits June 5, 2024 10:53

Make stage_files() abstract

1cffb67

Avoid directly print()'ing in load_jsonld command

0ccfeae

Allows logging to the correct place in a request-response cycle when called in a view.

Override etl_error_report block

b726477

Fix error message query for early json-ld failures

d562da3

Merge branch 'dev/7.6.x' into jtw/json-ld-zip-import

46ae5a4

jacobtylerwalls requested a review from apeters June 5, 2024 17:48

jacobtylerwalls added 4 commits June 7, 2024 11:19

Merge branch 'dev/7.6.x' into jtw/json-ld-zip-import

e6eae30

Blacken prior work

9740c84

Update git-blame-ignore-revs

ae4a8de

apeters approved these changes Jun 13, 2024

apeters merged commit 1fb1607 into dev/7.6.x Jun 13, 2024
7 checks passed

apeters deleted the jtw/json-ld-zip-import branch June 13, 2024 20:28

jacobtylerwalls mentioned this pull request Jun 17, 2024

Add lenient file-type checking mode #10862 #10863

Merged

6 tasks
