Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize (vectorize?) integer deserialization #25

Open
k0ekk0ek opened this issue Feb 20, 2023 · 15 comments
Open

Optimize (vectorize?) integer deserialization #25

k0ekk0ek opened this issue Feb 20, 2023 · 15 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Milestone

Comments

@k0ekk0ek
Copy link
Contributor

Currently, converting numbers to binary is done the idiomatic way. Since integers (int8, int16 and int32) occur quite a lot in zone files, e.g. TTL, optimization may result in a nice peformance boost.

To tackle this, it's good to know that for some integer values, use of a symbolic value is allowed to. e.g. algorithm field in DS records. Also, TTL values are 32-bit integer fields, but the MSB must not be set.

@k0ekk0ek k0ekk0ek added enhancement New feature or request good first issue Good for newcomers labels Feb 20, 2023
@k0ekk0ek k0ekk0ek added this to the Release 0.2.0 milestone Feb 20, 2023
@k0ekk0ek k0ekk0ek added the help wanted Extra attention is needed label Feb 20, 2023
@aqrit
Copy link

aqrit commented Jun 17, 2023

My thoughts on parsing integers start here: https://stackoverflow.com/a/74453028
I don't know if it actually makes anything faster.

@k0ekk0ek
Copy link
Contributor Author

Thanks @aqrit! I'll be sure to give this a try down the road. Most RRs don't contain integer data (people tend to specify the TTL once), but while wrinting #68, I did see some types where this may prove very useful.

@k0ekk0ek
Copy link
Contributor Author

@aqrit, looked at your posts (and mentioned posts/articles) quite a bit. I've changed the function to detect invalid characters too. This is a first stab at the problem, I've not benchmarked it. For simdzone, we know what type of integer (8 bit, 16 bit or 32 bit) we're going to parse, so chances are we can do much better.

static int parse_integer(const char *buffer, size_t length, uint64_t *number)
{
  uint64_t digits, bad_chars;
  static const uint8_t shift[] =
    { 0,  56, 48, 40, 32, 24, 16, 8,  0,  0,  0,  0,  0,  0,  0,  0 };

  length &= 0xfull;

  memcpy(&digits, buffer, sizeof(digits)); // assumes little-endian
  // flip 0x30 (ASCII 0 (0x30) - 9 (0x39)) bits, detect non-digits
  digits ^= 0x3030303030303030ull;
  // shift off non-digits
  digits <<= shift[length];

  bad_chars = (digits | (digits + 0x0606060606060606ull)) & 0xf0f0f0f0f0f0f0f0ull;

  digits = (digits & 0x0f0f0f0f0f0f0f0fllu) * 2561 >> 8;
  digits = (digits & 0x00ff00ff00ff00ffllu) * 6553601 >> 16;
  digits = (digits & 0x0000ffff0000ffffllu) * 42949672960001 >> 32;

  *number = digits;

  return (bad_chars == 0);
}

int main(int argc, char *argv[])
{
  char buffer[16] = { 0 }; // padding guaranteed in simdzone
  size_t length;
  if (argc == 2)
    length = strlen(argv[1]);
  else
    length = 0;

  memcpy(buffer, argv[1], length&0xfull);

  uint64_t number;

  if (parse_integer(buffer, length, &number)) {
    fprintf(stdout, "number: %llu\n", number);
    return 0;
  } else {
    fprintf(stderr, "Invalid number\n");
    return -1;
  }
}

@k0ekk0ek
Copy link
Contributor Author

I found that with validation required, it's incredibly hard to beat the naive implementation. I focused on int8 (maximum of 255) first. What I found consistently faster is:

static int parse_int8_swar(const char *str, size_t len, uint8_t *num)
{
  union { uint8_t as_str[4]; uint32_t as_int; } digits;

  memcpy(&digits.as_int, str, sizeof(digits));
  // flip 0x30, detect non-digits
  digits.as_int ^= 0x30303030lu;
  // shift off trash bytes
  digits.as_int <<= (4 - (len & 0x3)) * 8;

  const uint32_t x = digits.as_str[1] * 10 + digits.as_str[2];
  const uint32_t y = x * 10 + digits.as_str[3];
  *num = (uint8_t)y;
  return (digits.as_int & 0xf0f0f0f0) == 0 && y < 256 && len && len < 4;
}

Where naive is:

__attribute__((noinline))
static int parse_int8_naive(const char *str, size_t len, uint8_t *num)
{
  uint32_t n = 0;

  for (size_t i=0, r=len&0x3; i < r; i++) {
    uint8_t d = str[i] - '0';
    if (d > 9)
      return 0;
    n = n * 10 + d;
  }

  *num = (uint8_t)n;
  return n < 256 && len && len < 4;
}

Don't know if it's worth the trouble. @aqrit, @lemire, any pointers?

@lemire
Copy link
Collaborator

lemire commented Nov 28, 2023

@lemire
Copy link
Collaborator

lemire commented Nov 28, 2023

(My tests suggest that you can definitively do much better than naive with SWAR.)

@k0ekk0ek
Copy link
Contributor Author

Great stuff @lemire. Thanks!

@lemire
Copy link
Collaborator

lemire commented Nov 29, 2023

@k0ekk0ek I am tweaking it a bit since the initial version did not fully validate the input.

@k0ekk0ek
Copy link
Contributor Author

I forgot to add 0x06060606 after the initial XOR back to check for character greater than '9'. That may not cover all scenarios though, i.e. when the initial value is 11001111, then it'd overflow.

@lemire
Copy link
Collaborator

lemire commented Nov 29, 2023

I recommend the following...

int parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x30303030lu;
  digits.as_int <<= ((4 - len) * 8);
  uint32_t all_digits = ((digits.as_int & 0xf0f0f0f0) |
                         ((0x76767676 + digits.as_int) & 0x80808080)) == 0;
  *num = (uint8_t)((UINT64_C(0x640a0100) * digits.as_int) >> 32);
  return all_digits & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

I think it covers everything.

@lemire
Copy link
Collaborator

lemire commented Nov 29, 2023

Slightly more efficient...

int parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union {
    uint8_t as_str[4];
    uint32_t as_int;
  } digits;

  memcpy(&digits.as_int, str, sizeof(digits));
  // flip 0x30, detect non-digits
  digits.as_int ^= 0x30303030lu;
  // shift off trash bytes, technically undefined behaviour when ((4 - len) * 8) is not
  // in [0,32) prior to C17 / C++14.
  digits.as_int <<= ((4 - len) * 8);
  // check all digits
  uint32_t all_digits = ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 24);
  return all_digits & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

This is now squarely 2x faster than naive on random inputs, and still measurably faster on predictable inputs.

It is likely practical.

@k0ekk0ek
Copy link
Contributor Author

k0ekk0ek commented Nov 30, 2023

Thanks @lemire! It's amazing how much time one can spend on this stuff.

I have been trying to workaround the shift (note that parse_uint8_fastswar_bob suffers from the same problem in case len is greater). e.g. ((len & 3) ^ 3) + 1) * 8, (0x0d << (len & 3)) & 0x18) and a shift table. However, they are all slower than what's currently there (dependency chain?). Maybe there's an alternative to get the bytes in the "right" location. For simdzone, a len of 0 (zero) is not really a problem as in that case there won't be a token.

There's one case we haven't covered though. Leading zeroes. i.e. 0 is allowed, but 01 is not. Leading zeroes actually are allowed. At least, BIND and Knot (and the current parser in NSD), do.

I'll give this some more thought at a later stage.

@k0ekk0ek
Copy link
Contributor Author

Or, maybe it's just better solved in SIMD? Trying to get things to go "fast" for the fallback parser may not be the right choice(?) From a practical perspective, users won't notice as no-one is using CPU's that don't support SIMD for real work. Anyway, I'll have to polish the current code for the first release, I'll work on this more afterwards. (I have some vacation coming up 🙂)

@lemire
Copy link
Collaborator

lemire commented Nov 30, 2023

There's one case we haven't covered though. Leading zeroes. i.e. 0 is allowed, but 01 is not.

Note that this is true of the naïve version as well.

@k0ekk0ek
Copy link
Contributor Author

Random blog post I came across. May be worth investigating at some point.

@k0ekk0ek k0ekk0ek mentioned this issue Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants