Conversation

@hoxxep
Contributor

@hoxxep hoxxep commented Sep 26, 2025

@Nicoshev totally up to you whether you want to include these or not. I'm also happy to adjust docs or rename things per your preference. I wrote them for the godbolt example, so figured it's worth at least having a PR to refer back to if someone needs it in future.

I assume we'll leave it up to the end user to write their own integer-tuple hash functions, as they should simply be using rapidhash_to_le_* and building a byte array to then be hashed with a constant length.
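For anyone landing here later, a minimal sketch of that "build a byte array, then hash a constant length" pattern. `store_le64` and `hash_u64_pair` are hypothetical names for illustration, not functions from this PR, and the FNV-1a routine is only a stand-in so the snippet compiles on its own; a real caller would hand the buffer to rapidhash instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of rapidhash): write v as 8 little-endian
   bytes, regardless of the host byte order. */
static void store_le64(uint8_t *out, uint64_t v) {
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Stand-in byte hash (64-bit FNV-1a), used only to keep the sketch
   self-contained. */
static uint64_t fnv1a64(const uint8_t *data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Hash a (uint64_t, uint64_t) tuple through a constant-length buffer. */
static uint64_t hash_u64_pair(uint64_t a, uint64_t b) {
    uint8_t buf[16];
    store_le64(buf, a);
    store_le64(buf + 8, b);
    return fnv1a64(buf, sizeof buf);
}
```

Because the buffer layout is fixed by `store_le64`, the same tuple should hash to the same value on big- and little-endian hosts.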

@hoxxep
Contributor Author

hoxxep commented Sep 26, 2025

Actually, this needs some more preprocessor logic per rapid_read64, and then we can simplify the rapid_read definitions.

@hoxxep force-pushed the integer-hash-functions branch from 5cd114c to 4150b43 on September 26, 2025 14:00

@EvanBalster

EvanBalster commented Sep 26, 2025

While I don't think this is an inelegant solution, I am a bit leery of relying on the optimizer to collapse two byte swaps into a no-op. I'm putting together an alternative take on this that uses a new _internal function accepting a pair of uint64_t values instead of a buffer.

EDIT: I made a pull request with my alternative approach, linked just below. Although I feel a bit silly "solving a solved problem", dissecting how rapidhash works on these small inputs was a fun exercise.
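To make the double-byteswap concern concrete, here is a rough sketch of what a big-endian host effectively computes when an integer is converted to little-endian and then read back as little-endian. `bswap64` and `round_trip` are illustrative names, not code from either PR. The round trip is the identity by construction; whether a given compiler actually folds it away depends on the compiler and flags, so the generated assembly is worth checking (for example on godbolt).

```c
#include <stdint.h>

/* Portable 64-bit byte swap written with shifts and masks. */
static uint64_t bswap64(uint64_t v) {
    /* swap adjacent bytes */
    v = ((v & 0x00ff00ff00ff00ffULL) << 8)  | ((v >> 8)  & 0x00ff00ff00ff00ffULL);
    /* swap adjacent 16-bit groups */
    v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
    /* swap the two 32-bit halves */
    return (v << 32) | (v >> 32);
}

/* Swap to little-endian, then swap back when reading: logically a no-op. */
static uint64_t round_trip(uint64_t v) {
    return bswap64(bswap64(v));
}
```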

@Nicoshev
Owner

@hoxxep This PR is not bad. Concerns are:

  1. As you said, most modern compilers optimize the little-endian case
  2. It is too much code for just reading and writing variables
  3. Most hash maps use the identity function when the key is an integer. The important cases are strings and byte streams

@hoxxep
Contributor Author

hoxxep commented Sep 29, 2025

In response to 1,2,3:

  1. Compilers should generate optimal code on both big- and little-endian platforms; on big-endian platforms the double byte-swap should be easy for the compiler to prove redundant and optimise away.
  2. I'll shorten the docstrings if that helps? I think it makes rapid_read easier to understand, as the big/little-endian code is clearly encapsulated, and it makes the portable to-little-endian logic reusable by the user (useful when they are hashing integers or more complex types).
  3. True, but if the low bits have low entropy there's value in hashing, such as a nanoseconds field that is rounded to microsecond precision (macOS timestamps etc.). Bloom filters and HyperLogLog are other examples where hash quality matters.
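
A small sketch of point 3, under some assumptions: a power-of-two table of 1024 buckets indexed by the low bits of the hash, and keys that are nanosecond timestamps spaced exactly 1000 ns apart. `mix64` is the SplitMix64 finalizer, used here only as a stand-in mixer rather than rapidhash, and `buckets_used` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKETS 1024u  /* assumed power-of-two table size */

/* Stand-in mixer: the SplitMix64 finalizer, not rapidhash. */
static uint64_t mix64(uint64_t z) {
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Count the distinct buckets hit by n timestamps spaced 1000 ns apart.
   use_mixer == 0 models an identity hash; nonzero applies mix64 first. */
static unsigned buckets_used(size_t n, int use_mixer) {
    uint8_t seen[BUCKETS] = {0};
    unsigned used = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t ts = (uint64_t)i * 1000u;           /* ns, microsecond-rounded */
        uint64_t h = use_mixer ? mix64(ts) : ts;
        size_t idx = (size_t)(h & (BUCKETS - 1));    /* low bits pick the bucket */
        if (!seen[idx]) {
            seen[idx] = 1;
            used++;
        }
    }
    return used;
}
```

Since 1000 = 8 × 125, every such key has its low three bits clear, so the identity hash can reach at most 128 of the 1024 buckets however many keys are inserted; mixing first spreads the same keys over far more of the table.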

I have no skin in the game here though, totally understand if it's not deemed necessary.

@hoxxep force-pushed the integer-hash-functions branch from 4150b43 to c2a86d0 on September 29, 2025 14:17
@hoxxep force-pushed the integer-hash-functions branch from c2a86d0 to 9d742d7 on September 29, 2025 14:19

@EvanBalster

> 3. Most hash maps use the identity function when the key is an integer. The important case are strings and byte streams

This is unsuitable for many use cases, with and without hashmaps.

A concrete example: in procedural generation we often use multi-dimensional coordinates packed into a single integer, which may differ by just one bit between neighboring cells. This makes the avalanche effect highly desirable, because hashes of contiguous coordinates are used as a pseudorandom noise function. These coordinates often differ from their neighbors by power-of-two increments, which produces many collisions in an identity hashmap.
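
A rough illustration of the coordinate case, with everything here assumed for the sake of the sketch: a 2D cell packed as y in the high 32 bits and x in the low 32 bits, and the MurmurHash3 64-bit finalizer (`fmix64`) standing in for rapidhash. `pack_cell`, `identity_bucket`, and `cell_noise` are hypothetical names.

```c
#include <stdint.h>

/* Assumed packing: y in the high 32 bits, x in the low 32 bits. */
static uint64_t pack_cell(uint32_t x, uint32_t y) {
    return ((uint64_t)y << 32) | x;
}

/* Stand-in mixer: MurmurHash3 64-bit finalizer (fmix64), not rapidhash. */
static uint64_t fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

/* Bucket a power-of-two table would pick with an identity hash.
   mask is (table size - 1). */
static uint64_t identity_bucket(uint32_t x, uint32_t y, uint64_t mask) {
    return pack_cell(x, y) & mask;
}

/* Per-cell noise in [0, 1): top 53 bits of the mixed key scaled by 2^-53. */
static double cell_noise(uint32_t x, uint32_t y) {
    return (double)(fmix64(pack_cell(x, y)) >> 11) * (1.0 / 9007199254740992.0);
}
```

With this packing, stepping in y changes the key by 2^32, so every cell in a column lands in the same bucket of any identity-hashed table with fewer than 2^32 slots. Running the key through a mixer first gives each cell a distinct, well-scrambled value that can double as noise.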


3 participants