[ABI] ABI Update For adding Version to DLPack #113
Conversation
Notice thread: #110
Ready for review; this PR will remain open for at least one week.
Thanks, just some small comments/questions (most of which you can also ignore).
I want to think a bit more on it, but the main thing I am still pondering is the (Python) exchange protocol, which is an addition to the changes here.
include/dlpack/dlpack.h
Outdated
* This is a helper flag to help dependent packages to check possible extensions.
*
* The dependent package should be static_assert(kExtendedFlag > kDLDeviceTypeEnd);
* to ensure that version update do not override existing flags.
I don't follow. Do dependent packages sometimes store more things in the device?
Sometimes a dependent package may want to store extended device types that are not yet part of DLPack (they can be included later). So this is just providing convenience for such packages.
Hmmmm, but that seems tricky in a dynamically linked world? I cannot do static-asserts otherwise and would have to trust a version check.
But in that case, getting an older version, I would have to check that the device actually makes sense in the old version, even though I compile against the new version!
Can we say that "extended" versions MUST NOT be shared in a dynamically linked world like Python? Or alternatively, explicitly block off an "extended" area?
> sometimes dependent package can try to store extended device that are not yet part of DLPack
Would it not be better to set kDLDeviceTypeEnd to a really large integer to ensure that there are guaranteed to be no clashes? Like a sort of reserved range for vendor-specific extensions that haven't been standardized yet.
I would suggest sacrificing the top bit for this purpose and making that explicit.
I realize that there are different opinions on this one and it is not necessarily closely tied to this proposal update. As a result, I will remove it for now.
include/dlpack/dlpack.h
Outdated
 /*! \brief The current ABI version of dlpack */
-#define DLPACK_ABI_VERSION 1
+#define DLPACK_ABI_VERSION 2
Technically, this doesn't even break ABI, does it (since only new symbols got added)?
I am still thinking a bit about how we should go about this. I.e. we could make the main DLManagedTensorVersioned struct always be the newest one and, when we update, add a DLManagedTensorV1 (or add it now).
The main point being: When a new version is introduced, we need to keep the old one around so NumPy, etc. can maintain compatibility with older libraries more easily.
That makes sense for future updates. On the other hand, keeping many versions in a single file can be prohibitive, so ideally we want to rotate among only two versions. We can do so after another ABI update.
This is indeed not an ABI-breaking change; on the other hand, it does indicate a change of ABI in the Data API exchange (of what is contained in the PyCapsule), hence the update.
That is fine... You could also start backcompat having headers like dlpack_v1.h. They won't matter too much if they pile up a bit, and we can decide how long we want to keep old stuff around.
include/dlpack/dlpack.h
Outdated
typedef enum {
#endif
  /*! \brief bit mask to indicate that the tensor is read only. */
  kDLFlagBitMaskReadOnly = 1UL,
I like the habit of writing 1UL << 0 for flags, but it doesn't matter for one flag :).
What would be interesting is to note whether we want any other flags already at this point!
Updated
 *
 * \sa DLPackFlagBitMask
 */
uint64_t flags;
It would be good to note somewhere that we guarantee the fields remain ABI stable until here (I guess including the flags).
That is because consumers must be able to call the destructor even if they don't understand the version.
But after this, everything may be version dependent (it may make sense to specify whether growing the struct at the end is considered valid without bumping ABI in theory).
I agree
Yes, this is the key part, and it would be useful to summarize the high-level idea in this header since it's the most important file of the entire repository.
If we implement DLPack version X and get a tensor of version Y (where Y > X), what are we allowed to do? What are we not allowed to do?
There are also two version numbers: the DLPack version, and the ABI version. Is the first one completely irrelevant in deciding what is valid/disallowed, or does it also play a potential role? If yes, how is it different from the ABI version? If no, why does it need to be part of the ABI via a field stored in data structures?
It could be a major.minor, where minor allows e.g. additional (compatible!) flags or additional devices/dtypes (not strictly compatible, but clearly can't fail). Not sure what the current scheme is?
Added some clarification of the current scheme
 * if there is no way for the caller to provide a reasonable destructor.
 * The destructors deletes the argument self as well.
/*!
 * \brief Destructor - this should be called
Nitpicking: the formatting is a little messed up (double spaces, rows of different length).
Edit: That also applies to a few of the other newly added comments, but I will just mention it once here to avoid spamming the PR.
Also: I think the idea of the doxygen \brief tag is that the description is actually brief (1 sentence) followed by a newline to separate it from a more thorough comment. The \brief part is used as a preview.
Great point, fixed for this particular case. We can do further cleanups.
I feel like the interface could still be more explicit about what I would argue is the most important point about a versioned ABI: what happens when a program is given a DLPack tensor by some other tool, and the versions do not match? With the breakneck speed of advances happening with ML and array programming, it is not inconceivable that the structure will need to be extended.

If such extensions are added, they would likely be irrelevant for most users except for a small sub-community that requires them. And that's basically the point I want to raise: it would be a shame if simply adding an extra field required another ABI break. The current PR's message (and then the actual things of interest follow) I find quite conservative. Reading between the lines, to me this statement suggests that any disagreement in the version forces the consumer to reject the tensor. In practice, it would be perfectly fine to add fields to the data structure without breaking interoperability with a lower-versioned consumer, especially when those extra fields encode nuanced attributes that are perhaps only relevant to very specific use cases.

ABI breaks are extremely disruptive to the community, especially in the Python ecosystem where everyone ships separately pre-compiled packages via PyPI. Even for this proposal, I suspect that it will take a long time for the community to adopt versioned DLTensors. Therefore, this is also a unique chance to lower the cost of future changes, and I feel like the current PR misses that opportunity.
To turn the above complaint into something more constructive, let me propose a three-part (major, minor, patch) versioning scheme.
Thanks @wjakob for the proposal. I agree that we want to be careful with the changes, and it is great to have those ideas discussed before we commit to them. Would love to see what others think as well.
This is an interesting discussion, but I would argue it is not of concern to the Python Data Consortium. Basically, all we need is a safe way to continue evolving the existing DLPack protocol for describing a span of tensor data residing on a single device (important to your second point below). IIRC there's no discussion on evolving the protocol itself, as so far everyone is mostly happy with the convenience brought by DLPack. There are plenty of prior discussion scattered in the open issues of this repo that are finally converging to this PR.
The fact that such a possibility already exists but has never generated any change request to DLPack speaks for itself. In the case of multiple CUDA contexts co-residing on the same device, generally it is hard to exchange properly without some common understanding (ex: both sides must use the same CUDA context, which in practice almost always refers to the CUDA runtime context). It is even harder to take into account the needed API/ABI change to accommodate for similar situations in other vendors' programming models. If NV and Intel both consider this is appropriate (I do not speak for either on this), I'd argue let's not overcomplicate the story 🙂 Flexibility is good, but less so if we cannot make any progress.
As mentioned above, DLPack has been designed as a single-device exchange protocol. This use case is exactly why array libraries like Dask and cuNumeric cannot support DLPack, CUDA Array Interface, or any similar protocol if they ever exist. Distributed protocols have been discussed extensively (ex: here), but the fact that it went nowhere is a big issue (no distributed library provider can find common ground) that should not block this PR.
An actual data type here is represented by
My gut feeling is that squashing "minor" and "patch" is fine (but I don't care either way). There are two main groups of changes:
From the Python side I am not scared about either kind of change at all. But I ask for a version request (by the consumer). We don't have to discuss the details about what might happen; a true ABI break should be assumed to happen eventually. It doesn't matter much which scenario causes it :).
Should this be pushed forward soon? It seems to me that the versioning may be the only small discussion left here. I am just hoping it might unblock further steps. I am trying to think about the exchange protocol more, but unless we change the API more, there is probably not much to do. For Python, I would update the capsule name and pass a maximum supported version (so that the exporter can produce older versions if they wish to). It isn't that I can't think of API breaks to consider, but it seems OK to push this in any case.
Yes, sorry, was a bit slow near the end of the year. Summarizing the suggestion on versioning here. How about we do the following:

```c
struct DLPackVersion {
  // A number that indicates an ABI break
  int32_t major;
  // A number that indicates an API change with backward compatible API.
  int32_t minor;
};
```

This would mark a change in the versioning scheme, which is probably fine given DLPack is pre-1.0. We can start with
So with the proposed
Thanks, this seems like a good next step to me.
I am sad to see minor and patch being squashed, but it seems I am in the minority. I have added a small request to include more information from the discussion here into the header file.
By the way, one off-topic point: the Python tensor.__dlpack__(version=..) interface will need some careful thinking to make sure it doesn't end up being a pain. For example, a version number in string form, as shown by @seberg above, would require everyone to build custom version-number parsers.
What's worse is that there is no good way to test whether a Python function actually accepts a version argument other than calling it and seeing if it raises an exception (which is super slow, especially when combined with C++ bindings that need to catch and re-throw exceptions across languages). Existing libraries that don't support versioned tensors (and won't, for a while) will fall into this category.
So... I renew my request for a different API function (e.g. __vdlpack__(version: Optional[(int, int)]) for versioned DLPack). Anyway, probably a discussion for another PR, but I don't think this part will be as easy as it may seem.
 * data layout of the ABI - DLManagedTensorVersioned.
 *
 * A change in minor version indicates that we have added new
 * code, such as a new device type, but the ABI is kept the same.
I would add something of the following sort just to spell it out clearly somewhere:
If an obtained DLPack tensor has a major version that disagrees with the version number specified in this header file (i.e. major != DLPACK_MAJOR_VERSION), the consumer must call the deleter. It is not safe to access any other fields as the memory layout will have changed. In the case of a minor version mismatch, the tensor can be safely used as long as the consumer knows how to interpret all fields. Minor version updates indicate the addition of enumeration values.
Thank you for the great suggestion, just added the comment per the suggestion.
Great suggestion +1
LGTM except for one minor nit.
This PR adds DLPackManagedTensorVersioned to DLPack:
- Add a new data structure DLManagedTensorVersioned, which contains a new version field.
- Add a DLPackVersion struct to indicate both API and ABI versions.
- Add a flags field for boolean options, with support for a read-only bit mask.
- Add kDLDeviceTypeEnd to help dependent packages check the range of possible extension values.

The change is still ABI compatible, as the new data structure ABI comes with DLManagedTensorVersioned. The data-api, however, would involve an ABI update and an update of all DLPack importers/exporters to make use of DLManagedTensorVersioned.
Remove the device end notation for now. Change the bitmask definition to be a macro to avoid the enum int64 difference between C and C++.
@wjakob please let me know if you have more comments :)
Going to merge if there is no more input in the next three days.
Thanks everyone for participating in the discussion and reviews; this change is now merged.