fixtures: profile upload performance #88

Closed
tiborsimko opened this issue Nov 22, 2024 · 4 comments

@tiborsimko
Member

Current behaviour

Seen on the QA instance on November 15th.

Updating ATLAS records from cernopendata/opendata.cern.ch#3688 using cernopendata-portal image 0.1.11 is very fast, both locally and on PROD:

$ time docker exec -i -t opendatacernch-web-1 cernopendata fixtures records --mode insert-or-replace -f /content/data/records/atlas-CERN-EP-2024-159.json
...
docker exec -i -t opendatacernch-web-1 cernopendata fixtures records --mode    0.01s user 0.01s system 0% cpu 2.563 total

However, on QA the same upload process got stuck:

$ kubectl exec -i -t web-68-8vg94 /bin/bash
bash-5.1$ time cernopendata fixtures records --mode replace -f /tmp/data/records/atlas-CERN-EP-2024-159.json
/opt/invenio/var/instance/python/lib/python3.9/site-packages/invenio_config/default.py:77: UserWarning: Set configuration variable SECRET_KEY with random string
  warnings.warn(

/opt/invenio/var/instance/python/lib/python3.9/site-packages/invenio_rest/ext.py:30: FutureWarning: CSRF validation will be enabled by default in the version 1.3.x
  self.init_app(app)

/opt/invenio/var/instance/python/lib/python3.9/site-packages/flask_caching/__init__.py:145: DeprecationWarning: Using the initialization functions in flask_caching.backend is deprecated.  Use the a full path to backend classes directly.
  self._set_cache(app, config)

Loading records from /tmp/data/records/atlas-CERN-EP-2024-159.json (1/1)...

There was no reply for many minutes; the process seems to "run away".

I have interrupted it after about 6 minutes:

^C
Aborted!
/usr/lib64/python3.9/site-packages/XRootD/client/finalize.py:46: DeprecationWarning: Importing 'itsdangerous.json' is deprecated and will be removed in ItsDangerous 2.1. Use Python's 'json' module instead.
  if isinstance(obj, File) and obj.is_open():


real    6m20.950s
user    0m11.243s
sys    0m0.737s

Expected behaviour

The records should be updated quickly, within 2-3 seconds, as with 0.1.11.

Notes

This is especially interesting because the change in the record JSON was only minimal:

$ git diff -p upstream/pr/3688~1..upstream/pr/3688 -- data/records/atlas-CERN-EP-2024-159.json | cat
diff --git a/data/records/atlas-CERN-EP-2024-159.json b/data/records/atlas-CERN-EP-2024-159.json
index 34aa42f83..63c349c3f 100644
--- a/data/records/atlas-CERN-EP-2024-159.json
+++ b/data/records/atlas-CERN-EP-2024-159.json
@@ -245,7 +245,8 @@
     "type": {
       "primary": "Dataset",
       "secondary": [
-        "Derived"
+        "Derived",
+        "Simulated"
       ]
     },
     "usage": {

That is, the update did not change the attached files, and the record itself has only about 34 files attached, all directly and not via index files. So waiting for 6 minutes seems excessive.

It would be good to profile the fixture loading command to see where this extra time was spent. (Perhaps some missing DB indexes and an inefficient DB query are causing slowdowns?)
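
One possible way to do such profiling is with Python's standard `cProfile` and `pstats` modules. The following is only a minimal sketch: `load_fixture` is a hypothetical stand-in for whatever function the `cernopendata fixtures records` command actually calls, and the sleep merely simulates work.

```python
import cProfile
import io
import pstats
import time


def load_fixture(path):
    """Hypothetical stand-in for the real fixture loader (an assumption)."""
    time.sleep(0.01)  # simulate some work, e.g. a slow DB query
    return path


def profile_call(func, *args, top=10):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(load_fixture, "atlas-CERN-EP-2024-159.json")
print(report)
```

Sorting by cumulative time tends to surface the call chain that dominates the run, which is usually what is wanted when a command appears to hang.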

@psaiz psaiz self-assigned this Nov 26, 2024
@psaiz
Contributor

psaiz commented Nov 27, 2024

@tiborsimko: if the change is only in the metadata, running the update with the --skip-files option would definitely reduce the time needed. Note that this record has 17 file indices. In versions 0.1.11 and older, the file indices are stored as normal files. Starting with 0.2, the file indices are read and the files listed inside each file index are processed. This particular record has more than 2000 files that have to be deleted and reinserted (unless the --skip-files option is specified).
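
To illustrate why expanding file indices changes the amount of work so much, here is a rough sketch. The record layout (`files`, `type`, `entries`) and the figure of 120 entries per index are assumptions made up for the illustration, not the actual cernopendata-portal data model; only the count of 17 file indices comes from this record.

```python
def count_file_operations(record, expand_indices):
    """Count the file entries a loader would have to process.

    expand_indices=False mimics treating each file index as one normal file;
    expand_indices=True mimics processing every file listed inside each index.
    """
    indices = [f for f in record["files"] if f.get("type") == "index"]
    direct = [f for f in record["files"] if f.get("type") != "index"]
    if not expand_indices:
        return len(direct) + len(indices)
    return len(direct) + sum(len(ix["entries"]) for ix in indices)


# Illustrative record: 17 file indices, each listing 120 files (assumed number).
record = {
    "files": [
        {"type": "index", "entries": [f"file-{i}-{j}" for j in range(120)]}
        for i in range(17)
    ]
}

as_normal_files = count_file_operations(record, expand_indices=False)
expanded = count_file_operations(record, expand_indices=True)
print(as_normal_files, expanded)
```

With these made-up numbers the same record goes from 17 entries to 2040, which is the order of magnitude mentioned above.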

@psaiz
Contributor

psaiz commented Nov 29, 2024

FYI, I've just checked this with the latest version (0.2.5). I get:

[invenio@edb0311dad44 code]$ time cernopendata fixtures records --file /content/data/records/atlas-CERN-EP-2024-159.json 
...
File index created
record 80030 inserted

real	1m3.396s
user	0m57.114s
sys	0m0.810s

The update takes a bit longer, since it first has to delete the existing files:

...
record 80030 updated

real	1m27.749s
user	1m21.797s
sys	0m0.885s

Could you please check whether you get similar timings (instead of the six minutes mentioned above)?

@psaiz
Contributor

psaiz commented Dec 2, 2024

The previous comment was not a fair comparison because it used a newly created DB (with fewer entries). Doing it on the QA instance, with its 2.5M entries, was slow again. There was indeed a missing index on the file_object table. Creating it on dev improved the timing quite a lot. We will add the same index on QA.
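
For readers who want to see the effect of such an index in isolation, here is a small self-contained sketch using an in-memory SQLite database. The schema, the `file_id` column, and the index name are assumptions chosen for the demonstration; the real table lives in the portal's own database and may look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the table name is taken from the discussion above.
conn.execute(
    "CREATE TABLE file_object (id INTEGER PRIMARY KEY, file_id TEXT, key TEXT)"
)
conn.executemany(
    "INSERT INTO file_object (file_id, key) VALUES (?, ?)",
    ((f"file-{i}", f"key-{i}") for i in range(10_000)),
)

QUERY = "SELECT * FROM file_object WHERE file_id = ?"


def query_plan(connection):
    """Return SQLite's query plan for QUERY as a single string."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN " + QUERY, ("file-42",)
    ).fetchall()
    # The human-readable plan detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # full table scan expected
conn.execute("CREATE INDEX ix_file_object_file_id ON file_object (file_id)")
plan_after = query_plan(conn)  # lookup through the new index expected

print("before:", plan_before)
print("after: ", plan_after)
```

Without the index the lookup has to scan every row, so its cost grows with the table size; that would be consistent with the same update being fast on a fresh DB and slow on an instance with millions of entries.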

@psaiz
Contributor

psaiz commented Dec 12, 2024

Closing the issue. Feel free to reopen it if there are still any issues.

@psaiz psaiz closed this as completed Dec 12, 2024