fix(searcher): adapt to the new metadata schema with file indices #147

Merged
merged 2 commits into from
Feb 16, 2025
1 change: 1 addition & 0 deletions AUTHORS.md
@@ -8,5 +8,6 @@ The list of contributors in alphabetical order:
 - [Jan Okraska](https://orcid.org/0000-0002-1416-3244)
 - [Jiri Kuncar](https://github.com/jirikuncar)
 - [Joud Masoud](https://github.com/joudmas)
+- [Pablo Saiz](https://github.com/psaiz)
 - [Parth Shandilya](https://github.com/ParthS007)
 - [Tibor Simko](https://orcid.org/0000-0001-7202-5803)
73 changes: 31 additions & 42 deletions cernopendata_client/searcher.py
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 # This file is part of cernopendata-client.
 #
-# Copyright (C) 2020 CERN.
+# Copyright (C) 2020, 2025 CERN.
 #
 # cernopendata-client is free software; you can redistribute it and/or modify
 # it under the terms of the GPLv3 license; see LICENSE file for more details.
@@ -196,51 +196,40 @@ def get_files_list(
     if server != SERVER_HTTP_URI and searcher_protocol != "xrootd":
         searcher_protocol = server.split(":")[0]
     files_list = []
-    for file_ in record_json["metadata"]["files"]:
-        files_list.append((file_["uri"], file_["size"], file_["checksum"]))
-    if expand:
-        # let's unwind file indexes
-        files_list_expanded = []
-        for file_ in files_list:
-            if file_[0].endswith("_file_index.json"):
-                try:
-                    url_file = "{}/record/{}/files/{}".format(
-                        server, str(record_json["id"]), file_[0].split("/")[-1]
-                    )
-                    json_files = requests.get(url_file).json()
-                except Exception:
-                    display_message(
-                        msg_type="error",
-                        msg="Error occured while fetching file info. Please try again.",
-                    )
-                    sys.exit(1)
-                for file_ in json_files:
-                    files_list_expanded.append(
-                        (
-                            file_["uri"],
-                            file_["size"],
-                            file_["checksum"],
-                        )
-                    )
-            elif file_[0].endswith("_file_index.txt"):
-                pass
-            else:
-                files_list_expanded.append(file_)
-        files_list = files_list_expanded
+
+    new_server = SERVER_ROOT_URI
     if searcher_protocol == "http":
-        files_list = [
-            (file_[0].replace(SERVER_ROOT_URI, server), file_[1], file_[2])
-            for file_ in files_list
-        ]
+        new_server = server
     elif searcher_protocol == "https":
-        files_list = [
+        new_server = SERVER_HTTPS_URI
+
+    for file_ in record_json["metadata"].get("files", []):
+        files_list.append(
             (
-                file_[0].replace(SERVER_ROOT_URI, SERVER_HTTPS_URI),
-                file_[1],
-                file_[2],
+                file_["uri"].replace(SERVER_ROOT_URI, new_server),
+                file_["size"],
+                file_["checksum"],
+            )
+        )
+    for file_ in record_json["metadata"].get("_file_indices", []):
+        if expand:
+            # let's unwind file indexes
Member

Note that the changes do not pass unit tests, e.g. see the CI report for Python 3.12:

================== 22 failed, 50 passed, 8 skipped in 45.12s ===================

Member

After the CERN Open Data portal service update, I'm still getting test failures locally:

$ tox -e py312
...
FAILED tests/test_cli_download_files.py::test_download_files_http_requests - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_https_requests - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_download_engine - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_with_verify - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_filter_name - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_filter_name_multiple_values - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_filter_regexp_single_file - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_filter_regexp_multiple_files - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_filter_range - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_filter_range_multiple_values - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_filter_single_range_single_regexp - assert 1 == 0
FAILED tests/test_cli_download_files.py::test_download_files_filter_multiple_range_single_regexp - assert 1 == 0
FAILED tests/test_cli_get_file_locations.py::test_get_file_locations_from_recid_without_files - AssertionError: assert 1 == 0
FAILED tests/test_cli_verify_files.py::test_verify_files - assert 1 == 0
FAILED tests/test_cli_verify_files.py::test_verify_files_https_server - assert 1 == 0
FAILED tests/test_metadater.py::test_get_metadata_from_filter_metadata_two - assert 1 == 0
FAILED tests/test_verifier.py::test_get_file_info_local_good_input - assert 1 == 0
FAILED tests/test_verifier.py::test_get_file_info_local_good_input_wrong_count - assert 1 == 0
FAILED tests/test_verifier.py::test_get_file_info_local_good_input_wrong_checksum - assert 1 == 0
FAILED tests/test_verifier.py::test_get_file_info_local_good_input_wrong_size - assert 1 == 0

For example, these commands work (with and without index expansion):

$ cernopendata-client download-files --recid 1 --no-expand
==> Downloading file 1 of 6
  -> File: ./1/CMS_Run2010B_BTau_AOD_Apr21ReReco-v1_0000_file_index.json
  -> Progress: 322/322 KiB (100%)
^C

$ cernopendata-client download-files --recid 1
==> Downloading file 1 of 2916
  -> File 00E16FBB-9071-E011-83D3-003048673F12.root is incomplete. Resuming download.
  -> File: ./1/00E16FBB-9071-E011-83D3-003048673F12.root
^C-> Progress: 124229/596996 KiB (20%)
Aborted!

Whilst this (simplest) use case of directly attached files does not work:

$ cernopendata-client download-files --recid 5500
==> Downloading file 1 of 11
==> ERROR: Download error occured. Please try again.
Traceback (most recent call last):
  File "/home/tibor/.virtualenvs/cernopendata-client/bin/cernopendata-client", line 8, in <module>
    sys.exit(cernopendata_client())
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/cernopendata_client/cli.py", line 377, in download_files
    download_single_file(
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/cernopendata_client/downloader.py", line 340, in download_single_file
    downloader.file_downloader()
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/cernopendata_client/downloader.py", line 80, in file_downloader
    response = requests.get(self.file_location, headers=headers, stream=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/requests/sessions.py", line 697, in send
    adapter = self.get_adapter(url=request.url)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tibor/.virtualenvs/cernopendata-client/lib/python3.12/site-packages/requests/sessions.py", line 792, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 'root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/BuildFile.xml'
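
The traceback shows `requests` rejecting a `root://` URI: it only mounts adapters for `http://` and `https://`, so an XRootD location has to be rewritten before an HTTP download. A minimal sketch of that rewrite, assuming the values of the two constants (the diff shows only their names) and using a hypothetical helper `to_https_uri` that is not part of the client:

```python
from urllib.parse import urlsplit

# Assumed values; only the constant names appear in the diff.
SERVER_ROOT_URI = "root://eospublic.cern.ch/"
SERVER_HTTPS_URI = "https://opendata.cern.ch"


def to_https_uri(uri):
    """Hypothetical helper: map an XRootD URI to its HTTPS equivalent.

    URIs with any other scheme are returned unchanged.
    """
    if urlsplit(uri).scheme == "root":
        return uri.replace(SERVER_ROOT_URI, SERVER_HTTPS_URI, 1)
    return uri


failing_uri = (
    "root://eospublic.cern.ch//eos/opendata/cms/software/"
    "HiggsExample20112012/BuildFile.xml"
)
print(to_https_uri(failing_uri))
```

The failure above is consistent with directly attached files keeping their `root://` prefix while the HTTP downloader was selected.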

Member

As discussed IRL, I took over and fixed the download problem and squashed the fix with your branch. I have also added you to the AUTHORS file and fixed an independent metadata filtering test issue following the deprecation of CCID.

+            for inner_file in file_["files"]:
+                files_list.append(
+                    (
+                        inner_file["uri"].replace(SERVER_ROOT_URI, new_server),
+                        inner_file["size"],
+                        inner_file["checksum"],
+                    )
+                )
+        else:
+            files_list.append(
+                (
+                    f"{SERVER_HTTPS_URI}/record/{record_json['metadata']['recid']}/file_index/{file_['key']}",
+                    file_["size"],
+                    "",
+                )
             )
-            for file_ in files_list
-        ]
     return files_list
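
Read without the removed lines, the new logic picks the URI prefix once and then walks both `files` and `_file_indices`. The following is a rough, self-contained sketch of that flow, not the merged code: `list_files`, its parameters, the constant values, and the sample record are all assumptions made for illustration.

```python
# Assumed values; only the constant names appear in the diff.
SERVER_ROOT_URI = "root://eospublic.cern.ch/"
SERVER_HTTPS_URI = "https://opendata.cern.ch"


def list_files(metadata, protocol="https", server="http://opendata.cern.ch", expand=True):
    """Hypothetical stand-in for the tail of get_files_list after this change."""
    # Choose the target prefix once; for xrootd the replacement is a no-op.
    new_server = SERVER_ROOT_URI
    if protocol == "http":
        new_server = server
    elif protocol == "https":
        new_server = SERVER_HTTPS_URI

    def entry(file_):
        # (uri, size, checksum) tuple with the prefix rewritten.
        return (
            file_["uri"].replace(SERVER_ROOT_URI, new_server),
            file_["size"],
            file_["checksum"],
        )

    # Directly attached files; records without any simply yield nothing.
    files_list = [entry(file_) for file_ in metadata.get("files", [])]
    for index in metadata.get("_file_indices", []):
        if expand:
            # The index entries are embedded in the record, so no extra request.
            files_list.extend(entry(inner) for inner in index["files"])
        else:
            # Point at the index file itself; no checksum is available for it.
            files_list.append(
                (
                    f"{SERVER_HTTPS_URI}/record/{metadata['recid']}/file_index/{index['key']}",
                    index["size"],
                    "",
                )
            )
    return files_list


# Made-up record for illustration only.
record = {
    "recid": 1,
    "files": [
        {"uri": "root://eospublic.cern.ch//eos/opendata/a.txt", "size": 10, "checksum": "adler32:00000001"}
    ],
    "_file_indices": [
        {
            "key": "sample_file_index.json",
            "size": 322,
            "files": [
                {"uri": "root://eospublic.cern.ch//eos/opendata/b.root", "size": 20, "checksum": "adler32:00000002"}
            ],
        }
    ],
}

for row in list_files(record, protocol="https", expand=True):
    print(row)
```

Compared with the removed code, this shape needs no `requests.get` call to unwind an index, and the `.get(..., [])` lookups tolerate records that lack either key.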


4 changes: 2 additions & 2 deletions tests/test_metadater.py
@@ -2,7 +2,7 @@
 #
 # This file is part of cernopendata-client.
 #
-# Copyright (C) 2020 CERN.
+# Copyright (C) 2020, 2025 CERN.
 #
 # cernopendata-client is free software; you can redistribute it and/or modify
 # it under the terms of the GPLv3 license; see LICENSE file for more details.
@@ -63,7 +63,7 @@ def test_get_metadata_from_filter_metadata_two():
             "--filter",
             "affiliation=CERN",
             "--filter",
-            "ccid=CCID-722528",
+            "inspireid=INSPIRE-00330082",
         ],
     )
     assert test_result.exit_code == 0