Skip to content

Conversation

1st1
Copy link
Member

@1st1 1st1 commented Sep 18, 2025

The C implementation considerably boosts the performance of the key UUID operations:

------------------------------------
Operation                    Speedup
------------------------------------
uuid4() generation            15.01x
uuid7() generation            29.64x
UUID from string               6.76x
UUID from bytes                5.16x
str(uuid) conversion           6.66x
------------------------------------

Summary of changes:

  • The UUID type is reimplemented in C in its entirety.

  • The pure-Python is kept around and is used of the C implementation isn't available for some reason.

  • Both implementations are tested extensively; additional tests are added to ensure that the C implementation of the type follows the pure Python implementation fully.

  • The Python implementation stores UUID values as int objects. The C implementation stores them as uint8_t[16] array.

  • The C implementation supports unpickling of UUIDs created with Python 2 using protocols starting with 0.

  • The C implementation has faster hash() implementation but also caches the computed hash value to speedup cases when UUIDs are used as set/dict keys.

  • The C implementation has a freelist to make new UUID object instantiation as fast as possible.

  • uuid4() and uuid7() are now implmented in C. The most performance boost (10x) comes from overfetching entropy to decrease the number of _PyOS_URandom() calls. On its own it's a safe optimization with the edge case that Unix fork needs to be explicitly handled. We do that by comparing the current PID to the PID of when the random buffer was populated.

  • Portions of code are coming from my implementation of faster UUID in gel-python [1]. I did use AI during the development, but basically had to rewrite the code it generated to be more idiomatic and efficient.

  • The benchmark can be found here [2].

  • This PR makes Python UUID operations as fast as they are in NodeJS and Bun runtimes.

[1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx

[2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818

@python-cla-bot
Copy link

python-cla-bot bot commented Sep 18, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@1st1 1st1 changed the title The C implementation considerably boosts the performance of the key UUID Reimplement base UUID type, uuid4(), and uuid7() in C Sep 18, 2025
1st1 added a commit to 1st1/cpython that referenced this pull request Sep 18, 2025
… in C

The C implementation considerably boosts the performance of the key UUID
operations:

------------------------------------
Operation                    Speedup
------------------------------------
uuid4() generation            15.01x
uuid7() generation            29.64x
UUID from string               6.76x
UUID from bytes                5.16x
str(uuid) conversion           6.66x
------------------------------------

Summary of changes:

* The UUID type is reimplemented in C in its entirety.

* The pure-Python is kept around and is used of the C implementation
  isn't available for some reason.

* Both implementations are tested extensively; additional tests are
  added to ensure that the C implementation of the type follows the pure
  Python implementation fully.

* The Python implementation stores UUID values as int objects. The C
  implementation stores them as `uint8_t[16]` array.

* The C implementation supports unpickling of UUIDs created with Python
  2 using protocols starting with 0. That necessitated a small fix to
  the `copyreg` module (the change is only affecting legacy pickle
  pathway).

* The C implementation has faster hash() implementation but also caches
  the computed hash value to speedup cases when UUIDs are used as
  set/dict keys.

* The C implementation has a freelist to make new UUID object
  instantiation as fast as possible.

* uuid4() and uuid7() are now implmented in C. The most performance
  boost (10x) comes from overfetching entropy to decrease the number of
  _PyOS_URandom() calls. On its own it's a safe optimization with the
  edge case that Unix fork needs to be explicitly handled. We do that by
  comparing the current PID to the PID of when the random buffer was
  populated.

* Portions of code are coming from my implementation of faster UUID
  in gel-python [1]. I did use AI during the development, but basically
  had to rewrite the code it generated to be more idiomatic and
  efficient.

* The benchmark can be found here [2].

* This PR makes Python UUID operations as fast as they are in NodeJS and
  Bun runtimes.

[1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx

[2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818
@1st1 1st1 force-pushed the uuid_push branch 2 times, most recently from 2d0a381 to c4363f8 Compare September 18, 2025 13:22
@1st1 1st1 changed the title Reimplement base UUID type, uuid4(), and uuid7() in C #139122: Reimplement base UUID type, uuid4(), and uuid7() in C Sep 18, 2025
@1st1 1st1 changed the title #139122: Reimplement base UUID type, uuid4(), and uuid7() in C #gh-139122: Reimplement base UUID type, uuid4(), and uuid7() in C Sep 18, 2025
@1st1 1st1 force-pushed the uuid_push branch 2 times, most recently from fa12192 to 4a5c2a2 Compare September 18, 2025 13:24
@1st1
Copy link
Member Author

1st1 commented Sep 18, 2025

Note to reviewers:

  • Please compare uuid7 C and Python implementations side by side
  • Please review all of the bitshift/magic code closely (Python implementation uses int, i'm using char[16], so low-level operations are very different but have to be equivalent)


*counter = (((uint64_t)(high & 0x1FF) << 32) | (low >> 32)) & 0x1FFFFFFFFFF;
*tail = (uint32_t)low;
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested thoroughly (mostly looking at GodBolt output), but this ought to generate more efficient code (and there might be more ways to generate the right values directly, I just didn't go that far):

    struct {
        uint16_t high;
        uint32_t low1;
        uint32_t low2;
    } rand_bytes;
    if (gen_random(state, (uint8_t*)&rand_bytes, sizeof(rand_bytes)) < 0) {
        return -1;
    }

    *counter = (((uint64_t)(rand_bytes.high & 0x1FF) << 32) | (rand_bytes.low1)) & 0x1FFFFFFFFFF;
    *tail = rand_bytes.low2;

uint8_t bytes[16];
uint64_t timestamp_ms, counter;
uint32_t tail;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ought to be able to do a similar struct/union trick here for faster conversion.

goto fail;
}

uuid_mod = PyImport_ImportModule("uuid");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we possibly simplify this whole concept (caching the values here) by pushing the queries down to a Python subclass of _uuid.UUID? So that the core create/parse/etc. functionality is entirely native, and the more esoteric queries are written in Python as class uuid.UUID(_uuid.UUID)? (Answer can be "no" if it's a bad idea for reasons I haven't considered)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Steve, I really tried hard to simplify it.

  • [1] is my attempt, where the C implementation defines the fundamental ops and the rest is kept in Python.
  • despite me pushing it really hard and even implementing freelist for Python-land classes (probably was first such perversive attempt in the stdlib), the performance of UUID instantiation takes a big hit. Benchmarks became 10-20% slower just because the UUID type is inherited from a Python base.
  • Overall savings were about 300 lines of C. Which is considerable, but IMO making UUID as fast as we can is more important.

So let's keep this PR approach as is.

That said, good news! I figured out how to fully support pickle without my prior shenanigans with copyreg, so now the C implementation is a full drop-in replacement. 🚀

[1] https://github.com/1st1/cpython/blob/8a5877165e993afb2633cd48da5222326d3f6e0e/Modules/_uuidmodule.c#L4

@StanFromIreland StanFromIreland changed the title #gh-139122: Reimplement base UUID type, uuid4(), and uuid7() in C gh-139122: Reimplement base UUID type, uuid4(), and uuid7() in C Sep 18, 2025
@picnixz
Copy link
Member

picnixz commented Sep 18, 2025

I would be happy to have a C implementation of UUID but for reviewing purposes, may I suggest that we first focus on implementing the wrapper in C in one PR and have different PRs for each UUID? it would be much easier to focus on the algorithm to review.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the first round comments I have. I realy want multiple PRs because the module becomes really huge and there are parts that I don't think we need to reimplement in C.

@@ -219,13 +242,21 @@ def __init__(self, hex=None, bytes=None, bytes_le=None, fields=None,
raise ValueError('badly formed hexadecimal UUID string')
int = int_(hex, 16)
elif bytes_le is not None:
if not isinstance(bytes_le, bytes_):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make a behavioural change so let's not change this here (previously there would be an assertion error). Let's do it in a separate PR (and remove the assert at the same time)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't view this as a behavioral change. It was an error condition before, it's now handled properly. I wouldn't touch this code if I didn't have to re-implement it in C; re-implementing "assert" statement behavior is just counterproductive. I'd do that, if it was really a behavioral change, but I strongly believe it is not.

}

int overflow;
uint64_t value = PyLong_AsLongLongAndOverflow(field, &overflow);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a converter to uint64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see PyLong_AsUnsignedLongLongAndOverflow in the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably named Uint64 something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked, I can't find it. If you have a concrete doc / code link - please share, but please don't make me fish for something you yourself not sure exists :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize, I'm in the wrong here. Just saw the functions, i've no idea why i couldn't find them before. Sorry. I'll fix the code.

@1st1
Copy link
Member Author

1st1 commented Sep 19, 2025

I realy want multiple PRs because the module becomes really huge and there are parts that I don't think we need to reimplement in C.

Sorry, I'm not going to do multiple PRs, I've no time for that. But I will address your comments here for sure!

…7() in C

The C implementation considerably boosts the performance of the key UUID
operations:

------------------------------------
Operation                    Speedup
------------------------------------
uuid4() generation            15.01x
uuid7() generation            29.64x
UUID from string               6.76x
UUID from bytes                5.16x
str(uuid) conversion           6.66x
------------------------------------

Summary of changes:

* The UUID type is reimplemented in C in its entirety.

* The pure-Python is kept around and is used of the C implementation
  isn't available for some reason.

* Both implementations are tested extensively; additional tests are
  added to ensure that the C implementation of the type follows the pure
  Python implementation fully.

* The Python implementation stores UUID values as int objects. The C
  implementation stores them as `uint8_t[16]` array.

* The C implementation has faster hash() implementation but also caches
  the computed hash value to speedup cases when UUIDs are used as
  set/dict keys.

* The C implementation has a freelist to make new UUID object
  instantiation as fast as possible.

* uuid4() and uuid7() are now implmented in C. The most performance
  boost (10x) comes from overfetching entropy to decrease the number of
  _PyOS_URandom() calls. On its own it's a safe optimization with the
  edge case that Unix fork needs to be explicitly handled. We do that by
  comparing the current PID to the PID of when the random buffer was
  populated.

* Portions of code are coming from my implementation of faster UUID
  in gel-python [1]. I did use AI during the development, but basically
  had to rewrite the code it generated to be more idiomatic and
  efficient.

* The benchmark can be found here [2].

* This PR makes Python UUID operations as fast as they are in NodeJS and
  Bun runtimes.

[1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx

[2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818
@picnixz
Copy link
Member

picnixz commented Sep 19, 2025

I would REALLY appreciate smaller PRs. Usually this is our (current) workflow, namely incrementing changes. I think other core devs would agree with me here (@vstinner, @serhiy-storchaka and @encukou).

Note that whatever we do, this is a feature that will only be included in 3.15 so I can also make the small PRs once I am back at home. Changing 3.14 is no more possible unless the RM gives their approval here (cc @hugovk, who could also whether he prefers one or multiple PR) as performance improvements are usually considered as new features (especially if we are adding a C implementation).

@1st1
Copy link
Member Author

1st1 commented Sep 19, 2025

I would REALLY appreciate smaller PRs. Usually this is our (current) workflow, namely incrementing changes. I think other core devs would agree with me here (@vstinner, @serhiy-storchaka and @encukou).

The current PR layout:

  1. 90% is reimplementing the UUID type
  2. 10% is uuid4 and uuid7 implementations which are isolated separate functions from (1); uuid4 is very tiny.

I'm usually also for making smaller PRs, but objectively there's no point for that right here; I don't see how reviewing this would be fundamentally any different if instead of one PR with 1700 lines in uuidmodule.c, you'd have 3 PRs, one with 1600 lines and another with like 100 lines, and yet another with 30. Reviewing time would be the same, no? What's the point?

If you absolutely insist I can do this, but tbqh I really don't understand why you're pushing for this so hard. It would create a lot of work for me and not necessarily make reviewers more happy.

Note that whatever we do, this is a feature that will only be included in 3.15 so I can also make the small PRs once I am back at home. Changing 3.14 is no more possible unless the RM gives their approval here (cc @hugovk, who could also whether he prefers one or multiple PR) as performance improvements are usually considered as new features (especially if we are adding a C implementation).

I don't think this is 3.14 material and I wasn't pushing for that. 3.15 is fine (unless other people but me want this to be in 3.14, in which case I won't say no).

That said I'd like to see my work through and would love to still merge this myself.

uint64_t random_idx;
uint64_t random_last_pid;

// A freelist for uuid objects -- 15-20% performance boost.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a generic freelist implementation for python objects (see https://github.com/python/cpython/blob/7257b24140ac1b39fb8cfd4610134ec79575a396/Include/internal/pycore_freelist_state.h). Unless there are strong reasons not to, the uuid object should use the generic implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly I can't use that, as uuidobject.c is compiled as a shared lib.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a pity. I would suggest to leave out the freelist implementation in this PR. It makes the PR simpler (and therefore easier to review and accept). Even without the 10-20% performance gain from the freelist this PR is worthwhile. We can then reconsider the freelist in followup PRs.

@1st1
Copy link
Member Author

1st1 commented Sep 21, 2025

@picnixz I'm sorry for my snappiness (I blame 4 days spent deep in C coding! :)) I'm adding some basic machinery to test that C implementation of uuid() functions behave identically. Since this is even more extra code it's now very reasonable to split this into two prs: one for the base UUID type, another for uuid4() and uuid7(). I'll do that later.

@picnixz
Copy link
Member

picnixz commented Sep 21, 2025

No problem! I also agree that sometimes splitting PRs is not always the best approach and thank you for your understanding!

I am on my way back to Switzerland, so I can review this PR closely starting tomorrow.

@serhiy-storchaka serhiy-storchaka self-requested a review September 21, 2025 16:52
uuidobject *self = NULL;
uuid_state *state = get_uuid_state_by_cls(type);

Py_BEGIN_CRITICAL_SECTION(type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a subclass of UUID is created? Then both an exact UUID and the subclass can than access the freelist concurrently.

The following minimal example with a subclass segfaults for me

from uuid import UUID

class U(UUID):
    pass

a = UUID(int=10)
print(a)

b = U(int=10)
print(b)

print('-')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, I'll fix tomorrow

// There's a precedent with NodeJS doing exact same thing for
// improving performance of their UUID implementation.

// IMPORTANT: callers should have a critical section or a lock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the comment and rename the method to gen_random_lock_held (or a variation to indicate which lock is held).

In addition you could add inside the method an assert statement to verify the lock is indeed held. For example for critical sections there is _Py_CRITICAL_SECTION_ASSERT_OBJECT_LOCKED

A side effect of some compatility fixes is the new code, with which
the new C uuid7() is now 35x faster that pure Python (used to be 30x).
@hugovk
Copy link
Member

hugovk commented Sep 21, 2025

Note that whatever we do, this is a feature that will only be included in 3.15 so I can also make the small PRs once I am back at home. Changing 3.14 is no more possible unless the RM gives their approval here (cc @hugovk, who could also whether he prefers one or multiple PR) as performance improvements are usually considered as new features (especially if we are adding a C implementation).

I don't think this is 3.14 material and I wasn't pushing for that. 3.15 is fine (unless other people but me want this to be in 3.14, in which case I won't say no).

We're all in agreement this is for 3.15 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants