@EvanBalster commented Sep 26, 2025

Adds integer hashes whose results are invariant with respect to system endianness. On a little-endian system, these behave identically to the existing rapidhash functions applied to the key's bytes; on a big-endian system, they omit the byteswap operation.

New function(s): description
rapidhashInt_internal: hash logic for 32-, 64- and 128-bit integers.
rapidhashShort_internal: hash logic for 8- and 16-bit integers.
rapidhashInt8(key), rapidhashInt8_withSeed(key, seed): hash one byte.
rapidhashInt16(key), rapidhashInt16_withSeed(key, seed): endian-invariant integer hash.
rapidhashInt32(key), rapidhashInt32_withSeed(key, seed): endian-invariant integer hash.
rapidhashInt64(key), rapidhashInt64_withSeed(key, seed): endian-invariant integer hash.
rapidhashInt128(lsb, msb), rapidhashInt128_withSeed(lsb, msb, seed): endian-invariant integer hash, fashioned after rapid_mum.

This is an alternative approach to @hoxxep's lovely PR #37 which was written in response to my issue #36. Their implementation and mine can be expected to produce the same machine code when compiled with full optimizations.

My approach was to write simplified special-case internal functions rather than relying on the optimizer's ability to make clever simplifications (e.g. cancelling out two byteswaps with a memcpy in between). This results in simpler, easier-to-follow source code and potentially smaller code when compiled without optimization — at the cost of repeating the logic for small hashing jobs.
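To make the byteswap point concrete, here is a minimal sketch of the two ways a 64-bit key can reach the mixing step. The helper names (`word_from_bytes`, `word_from_int`, `bswap64`, `host_is_little_endian`, `check_paths`) are hypothetical and are not taken from rapidhash or from either PR; the sketch only assumes that the byte-buffer path interprets its input as little-endian.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Portable byte reversal; stands in for a compiler byteswap intrinsic.
static uint64_t bswap64(uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | (v & 0xFF);
        v >>= 8;
    }
    return r;
}

// True when the host stores the least significant byte first.
static bool host_is_little_endian() {
    const uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Byte-buffer path (hypothetical helper): copy 8 bytes into a word, then
// byteswap on big-endian hosts so the bytes are read as little-endian.
static uint64_t word_from_bytes(const void* p) {
    uint64_t w;
    std::memcpy(&w, p, sizeof w);
    return host_is_little_endian() ? w : bswap64(w);
}

// Integer path (hypothetical helper): use the key's value directly,
// with no copy and no swap.
static uint64_t word_from_int(uint64_t key) { return key; }

// Checks how the two paths relate on the host running this code.
static void check_paths(uint64_t key) {
    const uint64_t from_bytes = word_from_bytes(&key);
    const uint64_t from_int = word_from_int(key);
    if (host_is_little_endian()) {
        assert(from_bytes == from_int);           // both paths agree
    } else {
        assert(from_bytes == bswap64(from_int));  // byte path sees the key reversed
    }
}
```

On a little-endian host the two paths feed the same word to the mixer, which is why the integer functions can match the byte-buffer hash there. On a big-endian host only the integer path keeps producing that same word.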

@EvanBalster changed the title from "Integer hashing functions that use simplified code" to "Integer hashing functions that use simplified subroutines" on Sep 26, 2025
@EvanBalster (Author)

Here is the quick-and-dirty C++ code I used to check that my hash functions produce identical results to rapidhashNano on little-endian systems. Tested on AMD64. These tests will fail on big-endian architectures, but that's very much the point.

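#include <cstdint>
#include <iostream>
#include "rapidhash.h"  // assumed header name; must provide the rapidhashInt* functions from this PR

int main()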
{
    uint64_t seeds[4] = {0, 0xDEADBEEF4C47, 0x123456789ABCDEF, 0xC15C7D2FA43F24F1};

    for (int s = 0; s < 4; ++s)
    {
        const uint64_t seed = seeds[s];
        uint64_t fails;

        std::cout << "Using seed " << seed << std::endl;


        // 8-bit test
        std::cout << "  Compare Int8" << std::endl;
        fails = 0;
        for (unsigned i = 0; i < 256; ++i)
        {
            uint8_t key = i;
            if (rapidhashNano_withSeed(&key,1,seed) != rapidhashInt8_withSeed(key,seed)) ++fails;
        }
        if (fails) std::cout << "\tMismatches: " << fails << std::endl;
        else       std::cout << "\tSUCCESS" << std::endl;

        

        // 16-bit test
        std::cout << "  Compare Int16" << std::endl;
        fails = 0;
        for (unsigned i = 0; i < 65536; ++i)
        {
            uint16_t key = i;
            if (rapidhashNano_withSeed(&key,2,seed) != rapidhashInt16_withSeed(key,seed)) ++fails;
        }
        if (fails) std::cout << "\tMismatches: " << fails << std::endl;
        else       std::cout << "\tSUCCESS" << std::endl;

        

        // 32-bit test (keys below 2^24 only)
        std::cout << "  Compare Int32" << std::endl;
        fails = 0;
        for (uint64_t i = 0; i < 0x1000000ull; ++i)
        {
            uint32_t key = i;
            if (rapidhashNano_withSeed(&key,4,seed) != rapidhashInt32_withSeed(key,seed)) ++fails;
        }
        if (fails) std::cout << "\tMismatches: " << fails << std::endl;
        else       std::cout << "\tSUCCESS" << std::endl;


        // 64-bit test (keys below 2^24 only)
        std::cout << "  Compare Int64" << std::endl;
        fails = 0;
        for (uint64_t i = 0; i < 0x1000000ull; ++i)
        {
            uint64_t key = i;
            if (rapidhashNano_withSeed(&key,8,seed) != rapidhashInt64_withSeed(key,seed)) ++fails;
        }
        if (fails) std::cout << "\tMismatches: " << fails << std::endl;
        else       std::cout << "\tSUCCESS" << std::endl;


        // 128-bit test (low word below 2^20, high word below 2^8)
        std::cout << "  Compare Int128" << std::endl;
        fails = 0;
        for (uint64_t b = 0; b < 0x100ull; ++b)
        {
            for (uint64_t i = 0; i < 0x100000ull; ++i)
            {
                uint64_t key[2] = {i,b};  // low word first, matching (lsb, msb)
                if (rapidhashNano_withSeed(key,16,seed) != rapidhashInt128_withSeed(i,b,seed)) ++fails;
            }
        }
        if (fails) std::cout << "\tMismatches: " << fails << std::endl;
        else       std::cout << "\tSUCCESS" << std::endl;
    }
}

@Nicoshev (Owner)

@EvanBalster

The problem with this PR is that hashing logic gets duplicated across two different functions.

It is more difficult to maintain, given that changing one place requires the same change in the other one.

@EvanBalster (Author) commented Sep 30, 2025

I was following a perceived precedent — the header already contains three copies of the hashing logic in question, and it's easy enough to validate that the resulting behavior is identical. De-duplicating the existing code would restructure the whole header so I assume that's a non-starter.

My motive here was to extract some code that is friendly to dumb compilers and casual human readers. As a compromise, it would be simple to combine rapidhashInt_internal and rapidhashShort_internal into one function branching on len < 4.
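For illustration only, the proposed compromise could be structured roughly as below: one internal routine that branches on len < 4, with thin per-width wrappers. Every name here (`toy_mix`, `toy_int_internal`, `toy_int8` and so on) is hypothetical, and the word-selection and mixing rules are placeholders rather than rapidhash's actual logic; only the branching shape is the point.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder final mix, NOT rapidhash's: present only so the sketch
// returns something hash-like.
static uint64_t toy_mix(uint64_t a, uint64_t b) {
    const uint64_t rot = (b << 1) | (b >> 63);
    return (a ^ rot) * 0x9E3779B97F4A7C15ull;
}

// Single internal routine for every key width, branching on len < 4.
// lo/hi are the low and high 64 bits of the key; len is the width in
// bytes and must be 1, 2, 4, 8 or 16.
static uint64_t toy_int_internal(uint64_t lo, uint64_t hi, unsigned len, uint64_t seed) {
    uint64_t a, b;
    if (len < 4) {
        // 8- and 16-bit keys: placeholder rule using the lowest and highest byte.
        a = lo & 0xFF;
        b = (lo >> (8 * (len - 1))) & 0xFF;
    } else {
        // 32-, 64- and 128-bit keys: placeholder rule using whole words.
        a = lo;
        b = (len > 8) ? hi : lo;
    }
    return toy_mix(a ^ seed, b ^ len);
}

// Thin wrappers: each public entry point only fixes the width.
static uint64_t toy_int8(uint8_t key, uint64_t seed)   { return toy_int_internal(key, 0, 1, seed); }
static uint64_t toy_int16(uint16_t key, uint64_t seed) { return toy_int_internal(key, 0, 2, seed); }
static uint64_t toy_int32(uint32_t key, uint64_t seed) { return toy_int_internal(key, 0, 4, seed); }
static uint64_t toy_int64(uint64_t key, uint64_t seed) { return toy_int_internal(key, 0, 8, seed); }
static uint64_t toy_int128(uint64_t lsb, uint64_t msb, uint64_t seed) {
    return toy_int_internal(lsb, msb, 16, seed);
}

// Sanity check: swapping the two bytes of a 16-bit key changes the result.
static void toy_self_check() {
    assert(toy_int16(0x0102, 0) != toy_int16(0x0201, 0));
}
```

Because each wrapper passes a constant len, a compiler that inlines the internal routine can drop the unused branch, while an unoptimized build pays for one comparison per call.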
