@jonded94
Contributor

@jonded94 jonded94 commented Sep 19, 2025

Rationale for this change

In Python, pyarrow.Schema was previously not hashable when it had metadata set.

>>> import pyarrow
>>> schema = pyarrow.schema([], metadata={b"1": b"1"})
>>> hash(schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/types.pxi", line 2921, in pyarrow.lib.Schema.__hash__
TypeError: unhashable type: 'dict'

This is because the metadata (which is a dict) was hashed as-is, and a dict is not hashable.
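
The failure mode can be reproduced without pyarrow. The sketch below uses a hypothetical stand-in class, `FakeSchema`, whose `__hash__` hashes a tuple that contains the metadata dict. It is an assumption for illustration, not pyarrow's actual implementation.

```python
class FakeSchema:
    """Hypothetical stand-in for illustration; not pyarrow's actual code."""

    def __init__(self, names, metadata=None):
        self.names = tuple(names)
        self.metadata = metadata  # dict or None

    def __hash__(self):
        # Hashing a tuple hashes each element, so a dict inside it raises.
        return hash((self.names, self.metadata))


def try_hash(obj):
    """Return the hash, or the TypeError message if obj is unhashable."""
    try:
        return hash(obj)
    except TypeError as exc:
        return str(exc)


# Without metadata the hash works, because None is hashable.
print(try_hash(FakeSchema(["a"])))
# With metadata, the dict inside the tuple makes hashing fail.
print(try_hash(FakeSchema(["a"], {b"1": b"1"})))
```

The second call returns the text of a TypeError that reports an unhashable 'dict', mirroring the traceback above.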

What changes are included in this PR?

Slightly change how hashes are computed for Schema by converting the metadata dict to a frozenset of its (key, value) tuples.
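
A minimal sketch of the idea, using a hypothetical helper `metadata_hash` rather than the actual Cython code in this PR. A frozenset of the dict's items is hashable and ignores insertion order, so equal metadata dicts produce equal hashes.

```python
def metadata_hash(metadata):
    """Hash a metadata mapping (or None) in an order-independent way.

    Hypothetical helper for illustration; assumes keys and values are
    hashable (e.g. bytes).
    """
    if metadata is None:
        return hash(None)
    # dict.items() yields (key, value) tuples; a frozenset of them is
    # hashable as a whole and does not depend on insertion order.
    return hash(frozenset(metadata.items()))


a = {b"k1": b"v1", b"k2": b"v2"}
b = {b"k2": b"v2", b"k1": b"v1"}  # same content, different insertion order

assert a == b
assert metadata_hash(a) == metadata_hash(b)
```

Order independence matters here because two dicts with the same items compare equal regardless of insertion order, and objects that compare equal must have equal hashes.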

For reference, this is faster than computing the hash of a sorted tuple of (key, value) tuples (https://stackoverflow.com/a/6014481/10070873).
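
To check the timing claim locally, here is a rough micro-benchmark sketch. The function names and the dict size are illustrative assumptions, and the results depend on the Python version and the size of the metadata, so no numbers are shown.

```python
import timeit

# Illustrative metadata with bytes keys and values.
metadata = {str(i).encode(): str(i).encode() for i in range(1000)}


def hash_frozenset(d):
    # No sorting needed; items only have to be hashable.
    return hash(frozenset(d.items()))


def hash_sorted_tuple(d):
    # Sorting costs O(n log n) and requires the items to be orderable.
    return hash(tuple(sorted(d.items())))


# Both approaches are independent of insertion order.
reversed_metadata = dict(reversed(list(metadata.items())))
assert hash_frozenset(metadata) == hash_frozenset(reversed_metadata)
assert hash_sorted_tuple(metadata) == hash_sorted_tuple(reversed_metadata)

for fn in (hash_frozenset, hash_sorted_tuple):
    seconds = timeit.timeit(lambda: fn(metadata), number=1000)
    print(f"{fn.__name__}: {seconds:.4f} s for 1000 calls")
```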

Are these changes tested?

Yes.

Are there any user-facing changes?

No, other than Schema now being hashable when metadata is set.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@jonded94 jonded94 changed the title from "[Python] Fix Schema hashable when it has metadata" to "GH-47602 [Python] Fix Schema hashable when it has metadata" Sep 19, 2025
@jonded94 jonded94 changed the title from "GH-47602 [Python] Fix Schema hashable when it has metadata" to "GH-47602: [Python] Fix Schema hashable when it has metadata" Sep 19, 2025
@github-actions

⚠️ GitHub issue #47602 has been automatically assigned in GitHub to PR creator.

@jonded94
Contributor Author

@AlenkaF, @raulcd or @rok, sorry for pinging, but could you give this a review? 🥺

Tests seem fine, and this would unblock some functionality we need on our end.

@jonded94 jonded94 changed the title from "GH-47602: [Python] Fix Schema hashable when it has metadata" to "GH-47602: [Python] Make Schema hashable even when it has metadata" Sep 23, 2025
Member

@AlenkaF AlenkaF left a comment

Thank you for fixing the bug and adding the test for it!
The change looks good to me. I only have one minor nit; otherwise the PR is ready to go.

Pinging @raulcd for one extra review.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Sep 23, 2025
@AlenkaF
Member

AlenkaF commented Sep 23, 2025

@github-actions crossbow submit -g python

Co-authored-by: Alenka Frim <[email protected]>
@github-actions

Revision: ad52c12

Submitted crossbow builds: ursacomputing/crossbow @ actions-6cf16e3659

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2 GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.13-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@jonded94
Contributor Author

The crossbow jobs for ubuntu-22.04 core-dumped for some reason (https://github.com/ursacomputing/crossbow/actions/runs/17939291515/job/51011788800, https://github.com/ursacomputing/crossbow/actions/runs/17939291336/job/51011788225), but I don't see how that could be connected to this change. 🤔

@AlenkaF
Member

AlenkaF commented Sep 23, 2025

These three failures are not connected to this change.
(We are dealing with quite a few job issues that we need to fix.)

@jonded94
Contributor Author

@raulcd @rok review please? 🥺

Apparently people over at Ray are already cherry-picking this. Could we merge it? 😬

Member

@raulcd raulcd left a comment

LGTM
Sorry for taking a while to review, I've been busy with other issues.
Thanks for the ping

@raulcd raulcd merged commit ef718a7 into apache:main Oct 3, 2025
14 checks passed
@raulcd raulcd removed the awaiting committer review label Oct 3, 2025
@github-actions github-actions bot added the awaiting merge label Oct 3, 2025
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ef718a7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Oct 15, 2025
…ta (apache#47601)

* GitHub Issue: apache#47602

Lead-authored-by: Jonas Dedden <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>