
Debug and unskip flux_tests.py::FluxTest::testCompareDevIreeBf16AgainstHuggingFaceF32 #1050

Open
renxida opened this issue Mar 7, 2025 · 5 comments
renxida commented Mar 7, 2025

Currently this test segfaults, so I skipped it with:

    @pytest.mark.skip(
        reason="Segmentation fault during output comparison. See https://github.com/nod-ai/shark-ai/actions/runs/13704870816/job/38327614337?pr=1003"
    )

See logs here: https://gist.github.com/renxida/9aa2fa758c4238ef727a9bde0b0f8a99

This comes from #1003

marbre commented Mar 7, 2025

CI logs are only available for a limited time, so you might want to copy and paste the relevant error message here.

sogartar commented Mar 12, 2025

I was able to reproduce this on another test with a toy-sized model variant, which will be easier to debug since it is faster to run. The crash is sporadic, but it happens more often than not on my setup. The test is in PR #1075.

I was able to reproduce on top of IREE commit iree-org/iree@00e8873.

To run the test:

export SHARKTANK_TEST_SKIP=0

python \
    -m pytest \
    --verbose \
    --capture=no \
    --log-cli-level=info \
    "-k=testCompareToyIreeBf16AgainstEagerF64" \
    sharktank/tests/models/flux/flux_test.py

The seg fault happens at the end of the test during buffer destruction when the Python objects get cleaned up.

The native stack trace is

* thread #1, name = 'pt_main_thread', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x5c)
  * frame #0: 0x00007fff3a5f4238 _runtime.cpython-311-x86_64-linux-gnu.so`iree_slim_mutex_lock [inlined] iree_slim_mutex_try_lock_compare_exchange(mutex=0x000000000000005c, expected=<unavailable>, desired=-2147483647) at synchronization.c:469:10
    frame #1: 0x00007fff3a5f4236 _runtime.cpython-311-x86_64-linux-gnu.so`iree_slim_mutex_lock(mutex=0x000000000000005c) at synchronization.c:479:7
    frame #2: 0x00007fff3a5af739 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_cleanup_thread_add_cleanup(thread=0x000000000000003c, event=<unavailable>, callback=<unavailable>, user_data=<unavailable>) at cleanup_thread.c:153:3
    frame #3: 0x00007fff3a5aa231 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_device_add_asynchronous_cleanup(base_device=<unavailable>, callback=<unavailable>, user_data=<unavailable>) at hip_device.c:1162:12 [artificial]
    frame #4: 0x00007fff3a5aee59 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_buffer_release_callback(user_data=0x000000000b6758b0, buffer=0x000000000b7080f0) at hip_allocator.c:629:14
    frame #5: 0x00007fff3a5af26d _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_buffer_destroy(base_buffer=0x000000000b7080f0) at hip_buffer.c:110:5
    frame #6: 0x00007fff3a56d698 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_buffer_view_release [inlined] iree_hal_buffer_view_destroy(buffer_view=0x000000000bf885d0) at buffer_view.c:99:3
    frame #7: 0x00007fff3a56d687 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_buffer_view_release(buffer_view=0x000000000bf885d0) at buffer_view.c:91:5
    frame #8: 0x00007fff3a53483d _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalBufferView, iree_hal_buffer_view_t>::~ApiRefCounted() [inlined] iree::python::ApiPtrAdapter<iree_hal_buffer_view_t>::Release(bv=<unavailable>) at hal.h:80:5
    frame #9: 0x00007fff3a534838 _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalBufferView, iree_hal_buffer_view_t>::~ApiRefCounted() [inlined] iree::python::ApiRefCounted<iree::python::HalBufferView, iree_hal_buffer_view_t>::Release(this=<unavailable>) at binding.h:109:7
    frame #10: 0x00007fff3a534830 _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalBufferView, iree_hal_buffer_view_t>::~ApiRefCounted(this=<unavailable>) at binding.h:61:22
    frame #11: 0x00007fff3a55dcb3 _runtime.cpython-311-x86_64-linux-gnu.so`nanobind::detail::inst_dealloc(self=0x00007ffe33e9e270) at nb_type.cpp:241:13

On another occasion the trace is

* thread #1, name = 'pt_main_thread', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
  * frame #0: 0x00007fff3a566e20 _runtime.cpython-311-x86_64-linux-gnu.so`iree_status_ignore [inlined] iree_status_free(status=0x735f70756f72676b) at status.c:517:45
    frame #1: 0x00007fff3a566e1a _runtime.cpython-311-x86_64-linux-gnu.so`iree_status_ignore(status=0x735f70756f72676b) at status.c:532:3
    frame #2: 0x00007fff3a5aee61 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_buffer_release_callback(user_data=0x000000000c03f600, buffer=0x000000000bfe2660) at hip_allocator.c:633:3
    frame #3: 0x00007fff3a5af26d _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_buffer_destroy(base_buffer=0x000000000bfe2660) at hip_buffer.c:110:5
    frame #4: 0x00007fff3a56d698 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_buffer_view_release [inlined] iree_hal_buffer_view_destroy(buffer_view=0x000000000bfb9010) at buffer_view.c:99:3
    frame #5: 0x00007fff3a56d687 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_buffer_view_release(buffer_view=0x000000000bfb9010) at buffer_view.c:91:5
    frame #6: 0x00007fff3a53483d _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalBufferView, iree_hal_buffer_view_t>::~ApiRefCounted() [inlined] iree::python::ApiPtrAdapter<iree_hal_buffer_view_t>::Release(bv=<unavailable>) at hal.h:80:5
    frame #7: 0x00007fff3a534838 _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalBufferView, iree_hal_buffer_view_t>::~ApiRefCounted() [inlined] iree::python::ApiRefCounted<iree::python::HalBufferView, iree_hal_buffer_view_t>::Release(this=<unavailable>) at binding.h:109:7
    frame #8: 0x00007fff3a534830 _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalBufferView, iree_hal_buffer_view_t>::~ApiRefCounted(this=<unavailable>) at binding.h:61:22
    frame #9: 0x00007fff3a55dcb3 _runtime.cpython-311-x86_64-linux-gnu.so`nanobind::detail::inst_dealloc(self=0x00007ffe2fe9e150) at nb_type.cpp:241:13

My intuition is that we probably have a double free of the common context that is holding the status and the mutex.

@sogartar
@AWoloszyn, @drprajap, would you like me to open a parallel issue in IREE about this, or could you debug from this one?

@AWoloszyn
At first glance it looks like you are deleting the buffer view AFTER you destroy the device?

sogartar commented Mar 12, 2025

After some discussion with @AWoloszyn we established that iree.runtime.HalBuffer and iree.runtime.HalBufferView don't hold a reference to the device, so they need to be destroyed before it. We can't rely on standard Python object lifetime management.
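To illustrate the ordering constraint described above, here is a minimal, self-contained sketch using stand-in classes (FakeDevice and FakeBufferView are hypothetical; the real objects are iree.runtime HAL types). The point is that buffer views must be released explicitly before the device, rather than relying on interpreter-determined destruction order:

```python
# Sketch of the required teardown order with stand-in objects.
# In the real bindings, destroying a buffer view after its device
# is what triggers the native segfault.
import gc

class FakeDevice:
    def __init__(self, log):
        self.alive = True
        self.log = log

    def destroy(self):
        self.alive = False
        self.log.append("device")

class FakeBufferView:
    def __init__(self, device, log):
        self.device = device
        self.log = log

    def destroy(self):
        # Touching an already-destroyed device here models the crash.
        assert self.device.alive, "buffer view destroyed after its device"
        self.log.append("buffer_view")

def run_safely():
    log = []
    device = FakeDevice(log)
    view = FakeBufferView(device, log)
    # ... use the buffer view ...
    # Explicitly drop buffers before the device instead of relying on
    # Python garbage collection order.
    view.destroy()
    del view
    gc.collect()
    device.destroy()
    return log

print(run_safely())  # ['buffer_view', 'device']
```

This mirrors the fix direction discussed below: make destruction order explicit rather than implicit.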

@sogartar sogartar self-assigned this Mar 12, 2025
sogartar added a commit that referenced this issue Mar 12, 2025
We don't have proper tests for a toy-sized variant of the model, which is desirable for CI tests on every commit.

Some of the tests fail during IREE buffer destruction, which is a known issue. See #1050.
sogartar added a commit to sogartar/sharktank that referenced this issue Mar 13, 2025
There are sporadic occasions where a buffer is destroyed after its
corresponding IREE device is destroyed.
See nod-ai#1050

This introduces a construct that uses function scope to ensure devices outlive local objects.
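The function-scope idea from the commit message can be sketched as follows. All names here are hypothetical (the actual construct lives in the referenced sharktank commits); the sketch only shows the shape of the pattern, where locals created inside the wrapped function become unreachable when the call returns, and the device is released last:

```python
# Hypothetical sketch of a function-scope device guard.
import functools
import gc

def with_device_scope(func):
    """Run func with a device whose lifetime brackets all of func's locals."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        device = object()  # stand-in for an IREE HAL device
        result = func(device, *args, **kwargs)
        # func's locals (buffers, buffer views) are unreachable once the
        # call returns; collect any cycles before dropping the device.
        gc.collect()
        del device  # the device is released after everything it backs
        return result
    return wrapper

@with_device_scope
def compute(device):
    buffer_view = [1, 2, 3]  # stand-in for a HalBufferView
    return sum(buffer_view)

print(compute())  # 6
```

The design choice is that scoping, not the garbage collector, determines teardown order, which sidesteps the nondeterministic destruction that triggers the crash.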
sogartar added a commit that referenced this issue Mar 14, 2025
There are sporadic occasions where a buffer is destroyed after its
corresponding IREE device is destroyed.
See #1050

This introduces a construct that uses function scope to ensure devices outlive local objects.
4 participants