Debug and unskip flux_tests.py::FluxTest::testCompareDevIreeBf16AgainstHuggingFaceF32 #1050
Comments
CI logs are only available for a limited time, so you might want to copy and paste the relevant error message here.
I was able to reproduce this on another test with a toy-sized model variant, which will be easier to debug because it is faster to run. The crash is sporadic, but it happens more often than not on my setup. The test is in PR #1075. I reproduced it on top of IREE commit iree-org/iree@00e8873. To run the test
The segfault happens at the end of the test, during buffer destruction, when the Python objects get cleaned up. The native stack trace is
On another occasion the trace is
My intuition is that we probably have a double free of the common context that holds the status and the mutex.
@AWoloszyn, @drprajap, would you like me to open a parallel issue in IREE about this, or can you debug from this one?
At first glance, it looks like you are deleting the buffer view AFTER you destroy the device?
After some discussion with @AWoloszyn we established that
We don't have proper tests for a toy-sized variant of the model, which is desirable for CI tests on every commit. Some of the tests fail during IREE buffer destruction, which is a known issue. See #1050.
There are sporadic occasions where a buffer is destroyed after its corresponding IREE device is destroyed. See nod-ai#1050. This introduces a construct that uses function scope to ensure devices outlive local objects.
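For readers following along, here is a minimal sketch of what a function-scope construct like the one mentioned above could look like. It is not the code from the referenced commits: `FakeDevice`, `FakeBuffer`, and `run_with_devices` are hypothetical names, and no real IREE API is used. The idea it illustrates is that every device-backed object is created as a local inside a callable, so CPython releases those locals when the callable returns, and only then are the devices torn down.

```python
import gc
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Records the order of teardown events so the ordering can be inspected.
events: List[str] = []


class FakeDevice:
    """Hypothetical stand-in for a device handle (not an IREE class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def destroy(self) -> None:
        self.alive = False
        events.append(f"device {self.name} destroyed")


class FakeBuffer:
    """Hypothetical stand-in for a buffer that must die before its device."""

    def __init__(self, device: FakeDevice) -> None:
        self.device = device

    def __del__(self) -> None:
        state = "alive" if self.device.alive else "already destroyed"
        events.append(f"buffer freed while device {state}")


def run_with_devices(
    make_devices: Callable[[], List[FakeDevice]],
    fn: Callable[[List[FakeDevice]], T],
) -> T:
    """Run fn with fresh devices and destroy them only after fn's locals are gone.

    Assumption: fn must not return or stash anything that references device
    memory, otherwise that object would outlive the devices anyway.
    Exception handling is left out on purpose: a traceback keeps fn's frame
    (and its locals) alive, so a real version would need extra care there.
    """
    devices = make_devices()
    result = fn(devices)  # fn's locals are released when its frame exits
    gc.collect()  # also collect buffers that were kept alive by reference cycles
    for device in devices:
        device.destroy()
    return result


def body(devices: List[FakeDevice]) -> int:
    buf = FakeBuffer(devices[0])  # local object tied to the device
    return 42  # plain Python value, safe to hand back to the caller


if __name__ == "__main__":
    print(run_with_devices(lambda: [FakeDevice("d0")], body))
    print(events)
```

The design choice here is to rely on function scope instead of asking every test to delete its buffers by hand: whatever the body allocates is unreachable by the time the devices are destroyed, as long as nothing device-backed escapes through the return value.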
Currently they segfault
So I did
See logs here: https://gist.github.com/renxida/9aa2fa758c4238ef727a9bde0b0f8a99
This comes from #1003