[shortfin llm] Sharded integration tests with improved device settings #1086

Draft: wants to merge 49 commits into base: main
Showing changes from 1 commit
Commits (49)
481765a
initial commit of sharded tests
renxida Feb 24, 2025
517b041
add ci task
renxida Feb 24, 2025
568d42b
fix device flags
renxida Feb 24, 2025
de67c4e
correct flags and correct machine
renxida Feb 24, 2025
0a98a0a
run on mi300x-4 for now
renxida Feb 24, 2025
a2627b9
use ossci 4gpu
renxida Feb 25, 2025
5ee94d0
tp4 but on a single gpu
renxida Feb 25, 2025
d51e2b6
back to 4gpu
renxida Feb 25, 2025
04bf5ad
update to correct label
renxida Feb 26, 2025
2921b9b
Merge branch 'main' into sharded-integration-tests
renxida Feb 26, 2025
95e0a45
update device ids
renxida Feb 26, 2025
eb64c62
merge commit
renxida Feb 26, 2025
1d8819d
Merge branch 'main' into sharded-integration-tests
renxida Mar 3, 2025
ed62463
Merge branch 'main' of https://github.com/nod-ai/shark-ai into sharde…
renxida Mar 3, 2025
919c0f6
Merge branch 'main' into sharded-integration-tests
renxida Mar 10, 2025
57bc750
switch to tp2 to see if that fixes things
renxida Mar 10, 2025
2bb76bf
Merge branch 'main' into sharded-integration-tests
renxida Mar 10, 2025
15d4942
get tpX device settings function
renxida Mar 12, 2025
708e966
Merge branch 'main' into sharded-integration-tests
renxida Mar 13, 2025
f79f46a
update sharded tests to include a cpu case
renxida Mar 14, 2025
1339a60
remove bad flags
renxida Mar 14, 2025
e708d99
less logging upon success
renxida Mar 14, 2025
c39ac15
merge device_settings
renxida Mar 20, 2025
48b0f59
remove copy and paste artifacts
renxida Mar 20, 2025
db1b4ea
fix attention mask and iree-hal-target-device=rocm problems
renxida Mar 20, 2025
46067b1
Merge branch 'main' into cpu-sharded-integration-tests
renxida Mar 21, 2025
6c4c337
clean up device flag generation
renxida Mar 21, 2025
4fc936a
update device settings to match docs
renxida Mar 21, 2025
78dc5f4
memory access fault
renxida Mar 21, 2025
f016055
fix xfail
renxida Mar 24, 2025
7df3034
Merge branch 'main' into cpu-sharded-integration-tests
renxida Mar 24, 2025
6be07d4
use 8b meta llama instead of toy model for now
renxida Mar 25, 2025
719cb59
relax accuracy test
renxida Mar 25, 2025
965debd
clean up model_management
renxida Mar 25, 2025
5c244d8
remove xfail
renxida Mar 25, 2025
bb24469
simplify
renxida Mar 25, 2025
e0054e5
do not use bs=1 in tests - unstable when combined with sharding
renxida Mar 25, 2025
10bdc71
record logs from server
renxida Mar 25, 2025
7d66649
Merge branch 'main' into cpu-sharded-integration-tests
renxida Mar 25, 2025
cb47cfd
Merge branch 'cpu-sharded-integration-tests' of github.com:renxida/SH…
renxida Mar 25, 2025
ee55b02
change bs on the correct model
renxida Mar 25, 2025
1d93152
add tinystories to sharded_model_test
renxida Mar 25, 2025
85979e3
change server fixture interface
renxida Mar 25, 2025
e14db44
fix interface change
renxida Mar 25, 2025
f2478d4
Revert "add tinystories to sharded_model_test"
renxida Mar 25, 2025
6deb056
remove vestigial test file
renxida Mar 25, 2025
277b7f0
remove unused TENSOR_PARALLELISM_SIZE global var
renxida Mar 25, 2025
ebea310
don't run the broken tp4 version
renxida Mar 25, 2025
ab39043
Merge branch 'main' of https://github.com/nod-ai/shark-ai into cpu-sh…
renxida Mar 25, 2025
fix xfail
renxida committed Mar 24, 2025
commit f0160551c4aae5705f05ee6c087c0b43f1ce181b
@@ -88,7 +88,7 @@ class TestShardedModelServer:
     """Test suite for sharded model server functionality on both CPU and GPU."""

     @pytest.mark.xfail(
-        "Memory access fault by GPU node-3 (Agent handle: 0x555c24e83f80) on address 0x7fc28a1e4000. Reason: Unknown."
+        reason="Memory access fault by GPU node-3 (Agent handle: 0x555c24e83f80) on address 0x7fc28a1e4000. Reason: Unknown."
     )
     def test_concurrent_generation_sharded(
         self, server: tuple[Any, int], test_device, device_type
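
Note on why this fix matters: the first positional argument of `pytest.mark.xfail` is `condition`, and, as far as I understand pytest's behavior, a string passed there is compiled and evaluated as a Python expression rather than used as a message. The sketch below imitates only that compile step with the standard library, to show that the fault message is not a valid expression. `is_valid_condition` is a hypothetical helper written for illustration and is not part of pytest.

```python
# Sketch only: imitates compiling a string condition as a Python expression.
# `is_valid_condition` is a hypothetical helper, not a pytest API.

REASON = (
    "Memory access fault by GPU node-3 (Agent handle: 0x555c24e83f80) "
    "on address 0x7fc28a1e4000. Reason: Unknown."
)


def is_valid_condition(text: str) -> bool:
    """Return True if `text` compiles as a single Python expression."""
    try:
        compile(text, "<xfail condition>", "eval")
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    # The free-text fault message is not an expression.
    print(is_valid_condition(REASON))  # False
    # A real condition string is an expression.
    print(is_valid_condition("sys.platform == 'win32'"))  # True
```

Passing the message as `reason=` keeps it out of the condition slot, so the mark applies unconditionally and the message only shows up in the test report.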