-
Notifications
You must be signed in to change notification settings - Fork 41
Integer hashing functions that use simplified subroutines #38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Here is the quick-and-dirty C++ code I used to check that my hash functions produce identical results to int main()
{
uint64_t seeds[4] = {0, 0xDEADBEEF4C47, 0x123456789ABCDEF, 0xC15C7D2FA43F24F1};
for (int s = 0; s < 4; ++s)
{
const uint64_t seed = seeds[s];
uint64_t fails;
std::cout << "Using seed " << seed << std::endl;
// 8-bit test
std::cout << " Compare Int8" << std::endl;
fails = 0;
for (unsigned i = 0; i < 256; ++i)
{
uint8_t key = i;
if (rapidhashNano_withSeed(&key,1,seed) != rapidhashInt8_withSeed(key,seed)) ++fails;
}
if (fails) std::cout << "\tMismatches: " << fails << std::endl;
else std::cout << "\tSUCCESS" << std::endl;
// 16-bit test
std::cout << " Compare Int16" << std::endl;
fails = 0;
for (unsigned i = 0; i < 65536; ++i)
{
uint16_t key = i;
if (rapidhashNano_withSeed(&key,2,seed) != rapidhashInt16_withSeed(key,seed)) ++fails;
}
if (fails) std::cout << "\tMismatches: " << fails << std::endl;
else std::cout << "\tSUCCESS" << std::endl;
// 32-bit test
std::cout << " Compare Int32" << std::endl;
fails = 0;
for (uint64_t i = 0; i < 0x1000000ull; ++i)
{
uint32_t key = i;
if (rapidhashNano_withSeed(&key,4,seed) != rapidhashInt32_withSeed(key,seed)) ++fails;
}
if (fails) std::cout << "\tMismatches: " << fails << std::endl;
else std::cout << "\tSUCCESS" << std::endl;
// 64-bit test
std::cout << " Compare Int64" << std::endl;
fails = 0;
for (uint64_t i = 0; i < 0x1000000ull; ++i)
{
uint64_t key = i;
if (rapidhashNano_withSeed(&key,8,seed) != rapidhashInt64_withSeed(key,seed)) ++fails;
}
if (fails) std::cout << "\tMismatches: " << fails << std::endl;
else std::cout << "\tSUCCESS" << std::endl;
// 128-bit test
std::cout << " Compare Int128" << std::endl;
fails = 0;
for (uint64_t b = 0; b < 0x100ull; ++b)
for (uint64_t i = 0; i < 0x100000ull; ++i)
{
uint64_t key[2] = {i,b};
if (rapidhashNano_withSeed(key,16,seed) != rapidhashInt128_withSeed(i,b,seed)) ++fails;
}
if (fails) std::cout << "\tMismatches: " << fails << std::endl;
else std::cout << "\tSUCCESS" << std::endl;
}
} |
|
The problem with this PR is that hashing logic gets duplicated across two different functions. It is more difficult to maintain, given that changing one place requires the same change in the other one. |
|
I was following a perceived precedent — the header already contains three copies of the hashing logic in question, and it's easy enough to validate that the resulting behavior is identical. De-duplicating the existing code would restructure the whole header so I assume that's a non-starter. My motive here was to extract some code that is friendly to dumb compilers and casual human readers. As a compromise, it would be simple to combine rapidhashInt_internal and rapidhashShort_internal into one function branching on |
Adds integer hashes whose results are invariant with respect to the system endianness. On a little-endian system, these behave identical to the existing rapidhash functions; on a big-endian system, they omit the byteswap operation.
rapidhashInt_internalrapidhashShort_internalrapidhashInt8(key),rapidhashInt8_withSeed(key, seed)rapidhashInt16(key),rapidhashInt16_withSeed(key, seed)rapidhashInt32(key),rapidhashInt32_withSeed(key, seed)rapidhashInt64(key),rapidhashInt64_withSeed(key, seed)rapidhashInt128(lsb, msb),rapidhashInt128_withSeed(lsb, msb, seed)rapid_mum.This is an alternative approach to @hoxxep's lovely PR #37 which was written in response to my issue #36. Their implementation and mine can be expected to produce the same machine code when compiled with full optimizations.
My approach was to write simplified special-case
internalfunctions rather than relying on the optimizer's ability to make clever simplifications (e.g. cancelling out two byteswaps with amemcpyin between). This results in simpler, easier-to-follow source code and potentially smaller code when compiled without optimization — at the cost of repeating the logic for small hashing jobs.