fixtures: profile upload performance #88

tiborsimko · 2024-11-22T13:24:48Z

Current behaviour

Seen on the QA instance on November 15th.

Updating ATLAS records from cernopendata/opendata.cern.ch#3688 using cernopendata-portal image 0.1.11 works very fast, both locally and on PROD:

$ time docker exec -i -t opendatacernch-web-1 cernopendata fixtures records --mode insert-or-replace -f /content/data/records/atlas-CERN-EP-2024-159.json
...
docker exec -i -t opendatacernch-web-1 cernopendata fixtures records --mode    0.01s user 0.01s system 0% cpu 2.563 total

However, on QA the same upload process got stuck:

$ kubectl exec -i -t web-68-8vg94 /bin/bash
bash-5.1$ time cernopendata fixtures records --mode replace -f /tmp/data/records/atlas-CERN-EP-2024-159.json
/opt/invenio/var/instance/python/lib/python3.9/site-packages/invenio_config/default.py:77: UserWarning: Set configuration variable SECRET_KEY with random string
  warnings.warn(

/opt/invenio/var/instance/python/lib/python3.9/site-packages/invenio_rest/ext.py:30: FutureWarning: CSRF validation will be enabled by default in the version 1.3.x
  self.init_app(app)

/opt/invenio/var/instance/python/lib/python3.9/site-packages/flask_caching/__init__.py:145: DeprecationWarning: Using the initialization functions in flask_caching.backend is deprecated.  Use the a full path to backend classes directly.
  self._set_cache(app, config)

Loading records from /tmp/data/records/atlas-CERN-EP-2024-159.json (1/1)...

There was no reply for many minutes; the process seems to "run away".

I have interrupted it after about 6 minutes:

^C
Aborted!
/usr/lib64/python3.9/site-packages/XRootD/client/finalize.py:46: DeprecationWarning: Importing 'itsdangerous.json' is deprecated and will be removed in ItsDangerous 2.1. Use Python's 'json' module instead.
  if isinstance(obj, File) and obj.is_open():


real    6m20.950s
user    0m11.243s
sys    0m0.737s

Expected behaviour

The records should be updated fast, within 2-3 seconds, as with 0.1.11.

Notes

This is especially interesting because the change in the record JSON was only minimal:

$ git diff -p upstream/pr/3688~1..upstream/pr/3688 -- data/records/atlas-CERN-EP-2024-159.json | cat
diff --git a/data/records/atlas-CERN-EP-2024-159.json b/data/records/atlas-CERN-EP-2024-159.json
index 34aa42f83..63c349c3f 100644
--- a/data/records/atlas-CERN-EP-2024-159.json
+++ b/data/records/atlas-CERN-EP-2024-159.json
@@ -245,7 +245,8 @@
     "type": {
       "primary": "Dataset",
       "secondary": [
-        "Derived"
+        "Derived",
+        "Simulated"
       ]
     },
     "usage": {

That is, there was no change in attached files when performing this update, and the record itself has only about 34 files attached, all directly and not via index files... So waiting for 6 minutes seems excessive.

It would be good to profile the fixture loading command to see where this extra time was spent. (Perhaps some missing DB indexes and an inefficient DB query causing slow downs?)
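One possible way to do such profiling from Python is sketched below with the standard `cProfile` and `pstats` modules. This is only a minimal sketch: `load_records` is a hypothetical stand-in for the real fixture loading function, whose actual name and signature are not given in this thread, and `profile_call` is an illustrative helper rather than part of the portal code.

```python
import cProfile
import io
import pstats


def load_records(n):
    """Hypothetical stand-in for the fixture loading step (not the real code)."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=10):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time, which is
    usually the quickest way to see which call chain dominates the run.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


result, report = profile_call(load_records, 1000)
print(report)
```

If most of the cumulative time ended up in database driver calls rather than in Python code, that would point towards the slow-query hypothesis above.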


psaiz · 2024-11-27T11:08:56Z

@tiborsimko: if the change is only in the metadata, running the update with the --skip-files option would definitely reduce the time needed. Note that this record has 17 file indices. In versions 0.1.11 and older, the file indices are stored as normal files. Starting with 0.2, the file indices are read and the files listed inside each file index are processed. This particular record has more than 2000 files that have to be deleted and reinserted (unless the --skip-files option is specified).

psaiz · 2024-11-29T16:49:25Z

FYI, I've just checked this with the latest version (0.2.5). I get:

[invenio@edb0311dad44 code]$ time cernopendata fixtures records --file /content/data/records/atlas-CERN-EP-2024-159.json 
...
File index created
record 80030 inserted

real	1m3.396s
user	0m57.114s
sys	0m0.810s

The update takes a bit longer, since it has to delete:

...
record 80030 updated

real	1m27.749s
user	1m21.797s
sys	0m0.885s

Could you please check if you get similar timings (instead of the six minutes mentioned above)?

psaiz · 2024-12-02T17:26:49Z

The previous comment was not a fair comparison because it was run on a newly created DB (with fewer entries). Doing it on the QA instance, with its 2.5M entries, was again slow. There was indeed a missing index on the file_object table. Creating it on dev improved the timing quite a lot. We will add the same index on QA.
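To illustrate why a missing index matters on a large table, here is a small self-contained sketch. Everything in it apart from the table name `file_object` is an assumption made for the example: it uses an in-memory SQLite database instead of the real QA database, and the columns `bucket_key` and `size` and the index name `idx_file_object_bucket_key` are hypothetical, not the actual schema.

```python
import sqlite3

# In-memory SQLite stand-in; the real instance and its schema differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_object ("
    "id INTEGER PRIMARY KEY, bucket_key TEXT, size INTEGER)"
)
conn.executemany(
    "INSERT INTO file_object (bucket_key, size) VALUES (?, ?)",
    ((f"bucket-{i % 1000}", i) for i in range(50000)),
)


def query_plan(connection, sql, params):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[3] for row in rows)


sql = "SELECT id FROM file_object WHERE bucket_key = ?"

# Without an index the planner has to scan the whole table.
plan_before = query_plan(conn, sql, ("bucket-42",))

# Hypothetical index on the lookup column.
conn.execute(
    "CREATE INDEX idx_file_object_bucket_key ON file_object (bucket_key)"
)

# With the index the planner can search instead of scanning.
plan_after = query_plan(conn, sql, ("bucket-42",))

print("before:", plan_before)
print("after: ", plan_after)
```

With a per-file lookup repeated for each of the 2000+ files of the record, a full table scan over millions of rows per lookup adds up quickly, which would be consistent with the slowdown seen only on the larger instance.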

psaiz · 2024-12-12T14:14:34Z

Closing the issue. Feel free to reopen it if there are still any issues.

psaiz self-assigned this Nov 26, 2024

psaiz closed this as completed Dec 12, 2024
