
Service crash during 'field goal' tests with 30+ schedulers in 1-million-node tests #84

Closed
q131172019 opened this issue Jul 13, 2022 · 7 comments

@q131172019 (Collaborator) commented Jul 13, 2022

In https://github.com/yb01/arktos/wiki/730-test, for the 'Field goal' case (1 million nodes / 5 regions / 40 schedulers / 25K nodes per scheduler), the service crashes frequently once the number of schedulers reaches 30 or more.

Here are the test statistics from the afternoon of 2022-07-12 through the morning of 2022-07-13:

  • Only one test with 40 schedulers x 25K nodes succeeded; the other runs crashed the service.
  • Only two tests with 34 schedulers x 25K nodes succeeded; the other runs crashed the service.
  • Tests with 30 or more schedulers x 25K nodes crashed due to as-yet-unknown bugs (scheduler? distributor event-queue watch?). Redis parameter tuning and Ubuntu OS tuning may also need to be considered.
  • Most tests with 20–29 schedulers x 25K nodes succeeded.

The following error was captured in the service log:

I0713 21:11:14.543144    9523 aggregator.go:114] Event Processor Processed nodes : results : true
I0713 21:11:15.092644    9523 aggregator.go:98] Total (16479) region node events are pulled successfully in (10) RPs

fatal error: concurrent map writes

goroutine 1033088 [running]:
runtime.throw({0x737971, 0xc05f12d5d0})
        /usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc05f12d570 sp=0xc05f12d540 pc=0x434b31
runtime.mapassign_faststr(0x6cdba0, 0xc061187830, {0x734153, 0xb})
        /usr/local/go/src/runtime/map_faststr.go:294 +0x38b fp=0xc05f12d5d8 sp=0xc05f12d570 pc=0x4142ab
global-resource-service/resource-management/pkg/common-lib/types/event.(*NodeEvent).SetCheckpoint(0xc05ff6a800, {0x734153, 0xb})
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/common-lib/types/event/event.go:43 +0x2f8 fp=0xc05f12d688 sp=0xc05f12d5d8 pc=0x64f478
global-resource-service/resource-management/pkg/service-api/endpoints.(*Installer).serverWatch(0xc000098230, {0x7a16a0, 0xc076036d20}, 0xc06470a200, {0xc06446e48f, 0x2b})
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/service-api/endpoints/installer.go:230 +0xa50 fp=0xc05f12d848 sp=0xc05f12d688 pc=0x65b290
global-resource-service/resource-management/pkg/service-api/endpoints.(*Installer).ResourceHandler(0xc000098230, {0x7a16a0, 0xc076036d20}, 0xc06470a200)
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/service-api/endpoints/installer.go:149 +0x678 fp=0xc05f12d940 sp=0xc05f12d848 pc=0x65a7b8
global-resource-service/resource-management/pkg/service-api/endpoints.(*Installer).ResourceHandler-fm({0x7a16a0, 0xc076036d20}, 0xc06444bda0)
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/service-api/endpoints/installer.go:110 +0x3c fp=0xc05f12d970 sp=0xc05f12d940 pc=0x69341c
net/http.HandlerFunc.ServeHTTP(0xc06470a100, {0x7a16a0, 0xc076036d20}, 0x99c320)
        /usr/local/go/src/net/http/server.go:2047 +0x2f fp=0xc05f12d998 sp=0xc05f12d970 pc=0x6139cf
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0066fc000, {0x7a16a0, 0xc076036d20}, 0xc06470a000)
        /home/ubuntu/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0x1cf fp=0xc05f12dac0 sp=0xc05f12d998 pc=0x68f8cf
net/http.serverHandler.ServeHTTP({0x7a0918}, {0x7a16a0, 0xc076036d20}, 0xc06470a000)
        /usr/local/go/src/net/http/server.go:2879 +0x43b fp=0xc05f12db80 sp=0xc05f12dac0 pc=0x615e9b
net/http.(*conn).serve(0xc04f5f6320, {0x7a2980, 0xc000478510})
        /usr/local/go/src/net/http/server.go:1930 +0xb08 fp=0xc05f12dfb8 sp=0xc05f12db80 pc=0x612d28
net/http.(*Server).Serve·dwrap·87()
        /usr/local/go/src/net/http/server.go:3034 +0x2e fp=0xc05f12dfe0 sp=0xc05f12dfb8 pc=0x6167ee
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc05f12dfe8 sp=0xc05f12dfe0 pc=0x463f41
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:3034 +0x4e8

goroutine 1 [semacquire, 12 minutes]:
sync.runtime_Semacquire(0x0)
        /usr/local/go/src/runtime/sema.go:56 +0x25
sync.(*WaitGroup).Wait(0xc000000003)
        /usr/local/go/src/sync/waitgroup.go:130 +0x71
global-resource-service/resource-management/cmds/service-api/app.Run(0xc0000b2240)
        /home/ubuntu/go/src/global-resource-service/resource-management/cmds/service-api/app/server.go:80 +0x696
main.main()
        /home/ubuntu/go/src/global-resource-service/resource-management/cmds/service-api/service-api.go:46 +0x3af
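
For context, here is a minimal, self-contained sketch (not project code; the map key and loop bounds are illustrative) of how unsynchronized writes to a shared map from multiple goroutines produce the same "fatal error: concurrent map writes" crash that the Go runtime raised above:

```go
package main

import "sync"

func main() {
	// Shared map written by several goroutines with no lock, mirroring the
	// failure mode in the trace above. The Go runtime detects the unsynchronized
	// writes and aborts the process with "fatal error: concurrent map writes".
	checkpoints := map[string]int{}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				checkpoints["Dis_Sent"] = n // unsynchronized write
			}
		}(i)
	}
	wg.Wait()
}
```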

@q131172019 (Collaborator, Author) commented:
After cherry-picking the following two commits from PR #85 and testing again with 36 schedulers, the error was reproduced.

commit ebf3b7190710301d71da024147b12924b585f87a (HEAD -> Carl_Test_By_PR81)
Author: Yunwen Bai [email protected]
Date: Wed Jul 13 16:36:18 2022 -0700

fix typo

commit 38f121177aac9363cd7a35d6c722cfb5251df0b8
Author: Yunwen Bai [email protected]
Date: Wed Jul 13 14:10:07 2022 -0700

fix issue #79 and #80
fatal error: E0714 04:05:00.080562   10898 event_metrics.go:69] Event (MODIFIED, Id d9462c04-5347-4d70-bad4-d9910ddc74d4, RV 20388) does not have DIS_SENT stamped
concurrent map writes

goroutine 1112443 [running]:
runtime.throw({0x7395e7, 0x72595e4200010001})
        /usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc01f530db0 sp=0xc01f530d80 pc=0x434b31
runtime.mapassign_faststr(0x6cfaa0, 0xc01f530f58, {0x734c22, 0x8})
        /usr/local/go/src/runtime/map_faststr.go:211 +0x39c fp=0xc01f530e18 sp=0xc01f530db0 pc=0x4142bc
global-resource-service/resource-management/pkg/common-lib/types/event.(*NodeEvent).SetCheckpoint(0xc036437800, {0x734c22, 0x8})
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/common-lib/types/event/event.go:48 +0x2ae fp=0xc01f530ec8 sp=0xc01f530e18 pc=0x65024e
global-resource-service/resource-management/pkg/distributor/cache.(*NodeEventQueue).Watch.func1(0xc0c3cfa5a0, {0x0, 0x0, 0xc0000a6168}, 0xc090cf9ec0, 0xc090cf9f20)
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/distributor/cache/eventqueue.go:179 +0x287 fp=0xc01f530fa0 sp=0xc01f530ec8 pc=0x6552a7
global-resource-service/resource-management/pkg/distributor/cache.(*NodeEventQueue).Watch·dwrap·6()
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/distributor/cache/eventqueue.go:184 +0x43 fp=0xc01f530fe0 sp=0xc01f530fa0 pc=0x654fe3
runtime.goexit()


@q131172019 (Collaborator, Author) commented:
PR #85 has been merged into the main branch; testing again with 36 schedulers.

@q131172019 (Collaborator, Author) commented:
The same issue is still there.

E0714 05:32:19.970900   11747 event_metrics.go:69] Event (MODIFIED, Id 67d6777d-875d-4aa5-bc38-fcf5ef8509fe, RV 20022) does not have DIS_SENT stamped
E0714 05:32:19.980324   11747 event_metrics.go:69] Event (MODIFIED, Id 787cee5f-3854-402d-9421-bdc97c83cbae, RV 20021) does not have DIS_SENT stamped
fatal error: E0714 05:32:19.984597   11747 event_metrics.go:69] Event (MODIFIED, Id 7f741e0a-d5e4-46ee-9ab6-6bee255dcc0b, RV 20021) does not have DIS_SENT stamped
concurrent map writes

goroutine 1023450 [running]:
runtime.throw({0x7395e7, 0x29c1280100010001})
        /usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc058bc45b0 sp=0xc058bc4580 pc=0x434b31
runtime.mapassign_faststr(0x6cfaa0, 0xc058bc4758, {0x734c22, 0x8})
        /usr/local/go/src/runtime/map_faststr.go:211 +0x39c fp=0xc058bc4618 sp=0xc058bc45b0 pc=0x4142bc
global-resource-service/resource-management/pkg/common-lib/types/event.(*NodeEvent).SetCheckpoint(0xc018700620, {0x734c22, 0x8})
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/common-lib/types/event/event.go:48 +0x2ae fp=0xc058bc46c8 sp=0xc058bc4618 pc=0x65024e
global-resource-service/resource-management/pkg/distributor/cache.(*NodeEventQueue).Watch.func1(0xc0004f8000, {0x0, 0x0, 0xc0000a6168}, 0xc0004fb440, 0xc0004fb4a0)
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/distributor/cache/eventqueue.go:179 +0x287 fp=0xc058bc47a0 sp=0xc058bc46c8 pc=0x6552a7
global-resource-service/resource-management/pkg/distributor/cache.(*NodeEventQueue).Watch·dwrap·6()
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/distributor/cache/eventqueue.go:184 +0x43 fp=0xc058bc47e0 sp=0xc058bc47a0 pc=0x654fe3
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc058bc47e8 sp=0xc058bc47e0 pc=0x463f41
created by global-resource-service/resource-management/pkg/distributor/cache.(*NodeEventQueue).Watch
        /home/ubuntu/go/src/global-resource-service/resource-management/pkg/distributor/cache/eventqueue.go:156 +0x18f

goroutine 1 [semacquire, 27 minutes]:
sync.runtime_Semacquire(0x0)
        /usr/local/go/src/runtime/sema.go:56 +0x25
sync.(*WaitGroup).Wait(0x0)
        /usr/local/go/src/sync/waitgroup.go:130 +0x71
global-resource-service/resource-management/cmds/service-api/app.Run(0xc0000b2240)
        /home/ubuntu/go/src/global-resource-service/resource-management/cmds/service-api/app/server.go:93 +0x7bc
main.main()
        /home/ubuntu/go/src/global-resource-service/resource-management/cmds/service-api/service-api.go:47 +0x3ef

@yb01 (Collaborator) commented Jul 14, 2022

The checkpoints map in the event type (in event.go) is missing an R/W lock. This map is used by the aggregator, distributor, and service routines, so concurrent access to it needs to be synchronized; a sketch of the kind of locking needed is shown below.
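
A minimal sketch of that guard, assuming the checkpoints map is keyed by checkpoint name and stores timestamps (the field and method bodies here are illustrative, not necessarily the exact fix that was merged):

```go
package event

import (
	"sync"
	"time"
)

// NodeEvent is a trimmed-down stand-in for the real event type; only the
// checkpoint bookkeeping is shown here.
type NodeEvent struct {
	checkpointsLock sync.RWMutex
	checkpoints     map[string]time.Time
}

// SetCheckpoint stamps the time a processing stage was reached. Taking the
// write lock prevents the "concurrent map writes" crash when aggregator,
// distributor, and service-api goroutines stamp the same event.
func (e *NodeEvent) SetCheckpoint(name string) {
	e.checkpointsLock.Lock()
	defer e.checkpointsLock.Unlock()
	if e.checkpoints == nil {
		e.checkpoints = make(map[string]time.Time)
	}
	e.checkpoints[name] = time.Now()
}

// GetCheckpoint reads a stamped time under the read lock.
func (e *NodeEvent) GetCheckpoint(name string) (time.Time, bool) {
	e.checkpointsLock.RLock()
	defer e.checkpointsLock.RUnlock()
	t, ok := e.checkpoints[name]
	return t, ok
}
```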

@q131172019 (Collaborator, Author) commented:
After cherry-picking PR #86, the tests with 36 schedulers under 1 million nodes completed successfully without a service crash. Here is a summary for the 12th, 24th, and 36th schedulers.

12th scheduler:

I0714 17:27:39.637251   53699 singleClientTest.go:184] End of results
I0714 17:27:39.637265   53699 stats.go:26] RegisterClientDuration: 81.859717ms
I0714 17:27:39.637276   53699 stats.go:39] ListDuration: 1.605102489s. Number of nodes listed: 25076
I0714 17:27:39.637286   53699 stats.go:56] Watch session last: 30m0.00126982s
I0714 17:27:39.637292   53699 stats.go:57] Number of nodes Added: 0
I0714 17:27:39.637299   53699 stats.go:58] Number of nodes Updated: 16592
I0714 17:27:39.637305   53699 stats.go:59] Number of nodes Deleted: 0
I0714 17:27:39.637312   53699 stats.go:60] Number of nodes watch prolonged than 1s: 3384

24th scheduler:

I0714 17:28:42.376546   45135 singleClientTest.go:184] End of results
I0714 17:28:42.376566   45135 stats.go:26] RegisterClientDuration: 10.673115ms
I0714 17:28:42.376578   45135 stats.go:39] ListDuration: 1.825130466s. Number of nodes listed: 25016
I0714 17:28:42.376588   45135 stats.go:56] Watch session last: 30m0.000946683s
I0714 17:28:42.376595   45135 stats.go:57] Number of nodes Added: 0
I0714 17:28:42.376603   45135 stats.go:58] Number of nodes Updated: 15015
I0714 17:28:42.376609   45135 stats.go:59] Number of nodes Deleted: 0
I0714 17:28:42.376616   45135 stats.go:60] Number of nodes watch prolonged than 1s: 4403

36th scheduler:

I0714 17:29:58.400894   15447 singleClientTest.go:184] End of results
I0714 17:29:58.400923   15447 stats.go:26] RegisterClientDuration: 78.371798ms
I0714 17:29:58.400937   15447 stats.go:39] ListDuration: 1.981126414s. Number of nodes listed: 25062
I0714 17:29:58.400950   15447 stats.go:56] Watch session last: 30m0.001565072s
I0714 17:29:58.400957   15447 stats.go:57] Number of nodes Added: 0
I0714 17:29:58.400968   15447 stats.go:58] Number of nodes Updated: 14574
I0714 17:29:58.400976   15447 stats.go:59] Number of nodes Deleted: 0
I0714 17:29:58.401001   15447 stats.go:60] Number of nodes watch prolonged than 1s: 3366

@q131172019 (Collaborator, Author) commented Jul 14, 2022

After cherry-picking PR #86, the tests under 1 million nodes again completed successfully without a service crash. Here is a summary for the 14th, 28th, and 40th schedulers. In theory, the 40th scheduler should not have been allocated its 25,000 requested machines due to "no enough hosts", which is weird. Further tests with 3 x 14 schedulers will be run to see what happens.

14th scheduler:

I0714 18:41:14.485234   55222 singleClientTest.go:184] End of results
I0714 18:41:14.485257   55222 stats.go:26] RegisterClientDuration: 61.756175ms
I0714 18:41:14.485268   55222 stats.go:39] ListDuration: 2.07549303s. Number of nodes listed: 25039
I0714 18:41:14.485277   55222 stats.go:56] Watch session last: 30m0.001373103s
I0714 18:41:14.485284   55222 stats.go:57] Number of nodes Added: 0
I0714 18:41:14.485291   55222 stats.go:58] Number of nodes Updated: 14688
I0714 18:41:14.485297   55222 stats.go:59] Number of nodes Deleted: 0
I0714 18:41:14.485305   55222 stats.go:60] Number of nodes watch prolonged than 1s: 4028

28th scheduler:

I0714 18:41:33.979621   46593 singleClientTest.go:184] End of results
I0714 18:41:33.979638   46593 stats.go:26] RegisterClientDuration: 12.99817ms
I0714 18:41:33.979650   46593 stats.go:39] ListDuration: 1.234941703s. Number of nodes listed: 25080
I0714 18:41:33.979660   46593 stats.go:56] Watch session last: 30m0.001516515s
I0714 18:41:33.979668   46593 stats.go:57] Number of nodes Added: 0
I0714 18:41:33.979675   46593 stats.go:58] Number of nodes Updated: 14919
I0714 18:41:33.979681   46593 stats.go:59] Number of nodes Deleted: 0
I0714 18:41:33.979688   46593 stats.go:60] Number of nodes watch prolonged than 1s: 4241

40th scheduler:

I0714 18:42:35.897095   16851 singleClientTest.go:184] End of results
I0714 18:42:35.897121   16851 stats.go:26] RegisterClientDuration: 69.512541ms
I0714 18:42:35.897134   16851 stats.go:39] ListDuration: 2.390363876s. Number of nodes listed: 25007
I0714 18:42:35.897144   16851 stats.go:56] Watch session last: 30m0.001212571s
I0714 18:42:35.897152   16851 stats.go:57] Number of nodes Added: 0
I0714 18:42:35.897158   16851 stats.go:58] Number of nodes Updated: 14908
I0714 18:42:35.897167   16851 stats.go:59] Number of nodes Deleted: 0
I0714 18:42:35.897174   16851 stats.go:60] Number of nodes watch prolonged than 1s: 4151

@Sindica (Collaborator) commented Jul 21, 2022

Resolving per fix in PR #93

@Sindica closed this as completed Jul 21, 2022