Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade local cluster freezes topic creation #4162

Closed
fraidev opened this issue Sep 2, 2024 · 8 comments
Closed

Upgrade local cluster freezes topic creation #4162

fraidev opened this issue Sep 2, 2024 · 8 comments
Assignees

Comments

@fraidev
Copy link
Contributor

fraidev commented Sep 2, 2024

The CI local_upgrade_test that runs make FLUVIO_MODE=local FLUVIO_BIN=~/bin/fluvio upgrade-test created at #4138 was failing a lot.

The problem is when the local cluster is upgraded and resumed, and then it tried to create a new topic.

I tested it locally on a Linux machine with a weak cpu (J4125) and the problem happens most of the time.
I also tested it on my macbook m1 pro max only happened once after many tries.

I tried to add more sleeps in the tests does not seem very effective.

The only errors log that I got it was that the watch connection failed, but this not seems the problem.

I suspect that is related to our shutdown command is not really graceful for local clusters because when I try shutdown and resume again, everything works fine. Also, seems that we don't have logs to catch this.

SC logs:

CLI Option: ScOpt {
    run_mode: ScOptRunMode {
        local: Some(
            "/home/runner/.fluvio/data/metadata",
        ),
        k8: false,
        read_only: None,
    },
    bind_public: Some(
        "127.0.0.1:9003",
    ),
    bind_private: None,
    namespace: None,
    tls: TlsConfig {
        tls: false,
        server_cert: None,
        server_key: None,
        enable_client_cert: false,
        ca_cert: None,
        bind_non_tls_public: None,
        secret_name: None,
    },
    x509_auth_scopes: None,
    auth_policy: None,
    white_list: [],
}
2024-08-30T02:27:13.747030Z  INFO fluvio_sc::start: Platform version="0.11.12-dev-1"
2024-08-30T02:27:13.747048Z  INFO fluvio_sc::start: Git commit="4f9935422040397f0286c5e8ce402c2a13f5e87e"
2024-08-30T02:27:13.747065Z  INFO fluvio_sc::start: System name=Some("Ubuntu")
2024-08-30T02:27:13.747075Z  INFO fluvio_sc::start: System kernel=Some("6.5.0-1025-azure")
2024-08-30T02:27:13.747094Z  INFO fluvio_sc::start: System os_version=Some("Linux 22.04 Ubuntu")
2024-08-30T02:27:13.747211Z  INFO fluvio_sc::start: System core_count=Some(2)
2024-08-30T02:27:13.747221Z  INFO fluvio_sc::start: System total_memory=16757342208
2024-08-30T02:27:13.747226Z  INFO fluvio_sc::start: System available_memory=15691501568
2024-08-30T02:27:13.747241Z  INFO fluvio_sc::start: Uptime in secs uptime=159
Starting SC, platform: 0.11.12-dev-1
2024-08-30T02:27:13.748080Z  INFO fluvio_sc::start: Running in local mode metadata="/home/runner/.fluvio/data/metadata"
2024-08-30T02:27:13.748869Z  INFO fluvio_sc::start: starting local main loop
2024-08-30T02:27:13.749326Z  INFO fluvio_sc::controllers::spus::controller: starting spu controller
2024-08-30T02:27:13.749403Z  INFO sc_private_server{address="0.0.0.0:9004"}: fluvio_sc::services::private_api: starting internal services
2024-08-30T02:27:13.749419Z  INFO fluvio_sc::init::pub_server: using root authorization
2024-08-30T02:27:13.749459Z  INFO PartitionController: fluvio_sc::controllers::partitions::controller: started
2024-08-30T02:27:13.749541Z  INFO fluvio_sc::controllers::mirroring::controller: starting mirroring controller
Streaming Controller started successfully
2024-08-30T02:27:13.749585Z  INFO accept_incoming{self=FluvioApiServer("0.0.0.0:9004")}: fluvio_service::server: Opened TcpListener, waiting for connections
2024-08-30T02:27:13.749595Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Opened TcpListener, waiting for connections
2024-08-30T02:27:13.749592Z  INFO SpuControllerLoop: fluvio_sc::controllers::spus::controller: started
2024-08-30T02:27:15.709211Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:27:15.751561Z  INFO accept_incoming{self=FluvioApiServer("0.0.0.0:9004")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:27:15.751714Z  INFO handle_request{host="0.0.0.0:9004"}:respond{socket=Socket(9) _connection=peer("127.0.0.1:48510")}: fluvio_sc::services::private_api::private_server: SPU connected spu_id=5001
2024-08-30T02:27:15.751753Z  INFO handle_request{host="0.0.0.0:9004"}:respond{socket=Socket(9) _connection=peer("127.0.0.1:48510")}:update{spu=5001 new_value=true}: fluvio_sc::stores::spu::health_check: Initial SPU health report spu=5001 new_value=true
2024-08-30T02:27:17.213229Z  INFO handle_request{host="127.0.0.1:9003"}: fluvio_service::server: Response sent successfully, closing connection host=127.0.0.1:9003 peer_addr=127.0.0.1:46966
2024-08-30T02:27:22.246788Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:27:22.247550Z ERROR WatchControllerLoop{spec="Partition" sink=6}:sync_and_send_changes: fluvio_sc::services::public_api::watch: error watch sending Partition, correlation_id: 2, err: Socket io Broken pipe (os error 32)
2024-08-30T02:27:22.247561Z ERROR WatchControllerLoop{spec="Topic" sink=6}:sync_and_send_changes: fluvio_sc::services::public_api::watch: error watch sending Topic, correlation_id: 3, err: Socket io Broken pipe (os error 32)
2024-08-30T02:27:22.247557Z  INFO handle_request{host="127.0.0.1:9003"}: fluvio_service::server: Response sent successfully, closing connection host=127.0.0.1:9003 peer_addr=127.0.0.1:56032
2024-08-30T02:27:22.306729Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:27:22.307223Z  INFO handle_request{host="127.0.0.1:9003"}: fluvio_service::server: Response sent successfully, closing connection host=127.0.0.1:9003 peer_addr=127.0.0.1:56042
2024-08-30T02:27:22.307264Z ERROR WatchControllerLoop{spec="Partition" sink=6}:sync_and_send_changes: fluvio_sc::services::public_api::watch: error watch sending Partition, correlation_id: 2, err: Socket io Broken pipe (os error 32)
2024-08-30T02:27:22.307271Z ERROR WatchControllerLoop{spec="Topic" sink=6}:sync_and_send_changes: fluvio_sc::services::public_api::watch: error watch sending Topic, correlation_id: 3, err: Socket io Broken pipe (os error 32)
2024-08-30T02:27:22.366749Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:27:22.367232Z  INFO handle_request{host="127.0.0.1:9003"}: fluvio_service::server: Response sent successfully, closing connection host=127.0.0.1:9003 peer_addr=127.0.0.1:56044
2024-08-30T02:27:22.367268Z ERROR WatchControllerLoop{spec="Partition" sink=6}:sync_and_send_changes: fluvio_sc::services::public_api::watch: error watch sending Partition, correlation_id: 2, err: Socket io Broken pipe (os error 32)
2024-08-30T02:27:22.367300Z ERROR WatchControllerLoop{spec="Topic" sink=6}:sync_and_send_changes: fluvio_sc::services::public_api::watch: error watch sending Topic, correlation_id: 3, err: Socket io Broken pipe (os error 32)
2024-08-30T02:27:22.397243Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:27:22.397744Z  INFO handle_request{host="127.0.0.1:9003"}:respond{socket=Socket(6) _connection=peer("127.0.0.1:56054")}:handle_create_request:handle_create_topics_request: fluvio_sc::services::public_api::topic::create: creating topic topic=prerelease
2024-08-30T02:27:28.749591Z  INFO SystemTopicController: fluvio_sc::controllers::topics::controller: topic created CONSUMER_STORAGE_TOPIC="consumer-offset"
2024-08-30T02:28:22.401727Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:28:22.444484Z  INFO handle_request{host="127.0.0.1:9003"}: fluvio_service::server: Response sent successfully, closing connection host=127.0.0.1:9003 peer_addr=127.0.0.1:48902
2024-08-30T02:28:22.447865Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:28:22.492568Z  INFO handle_request{host="127.0.0.1:9003"}: fluvio_service::server: Response sent successfully, closing connection host=127.0.0.1:9003 peer_addr=127.0.0.1:48914
2024-08-30T02:28:22.495843Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
2024-08-30T02:28:22.540495Z  INFO handle_request{host="127.0.0.1:9003"}: fluvio_service::server: Response sent successfully, closing connection host=127.0.0.1:9003 peer_addr=127.0.0.1:48922
2024-08-30T02:28:22.587417Z  INFO accept_incoming{self=FluvioApiServer("127.0.0.1:9003")}: fluvio_service::server: Received connection, spawning request handler
@sehz
Copy link
Contributor

sehz commented Sep 2, 2024

Did all process properly shutdown and everything has been properly flushed? where is test failing?

@fraidev
Copy link
Contributor Author

fraidev commented Sep 2, 2024

Did all process properly shutdown and everything has been properly flushed? where is test failing?

Yes, on my tested all processes could be closed and started again

@sehz
Copy link
Contributor

sehz commented Sep 2, 2024

I am not seeing assertion in the test.

@fraidev
Copy link
Contributor Author

fraidev commented Sep 2, 2024

Also, I tried many changes in the fluvio code to try to fix it or at least reduce it failling.

One thing that I tried and seems that improved it, it was removing the add_records and add_bytes metrics. After doing it, I noticed the test was working most of the time on my Linux machine.
Also, I tried to enable it on my mac (this metrics is not enabled on arm machines) and I got a rare case of this test failing on my Mac.

Not sure if this is really related, I suspect that these metrics slowdown performance on not arm machines and make it easier to get this bug.

Btw, I think we could be using more Relaxed instead of SeqCst Ordering for simple fetch_add cases:
rust-lang/nomicon#166
https://doc.rust-lang.org/nomicon/atomics.html#relaxed
crossbeam-rs/crossbeam#317

I noticed this test failing less times when I tried to change it to Relaxed too.

@fraidev
Copy link
Contributor Author

fraidev commented Sep 2, 2024

flushed

what means everything has been properly flushed? How to assert it?

@sehz
Copy link
Contributor

sehz commented Sep 9, 2024

Can you create PR again with upgrade test enabled?

@fraidev
Copy link
Contributor Author

fraidev commented Sep 9, 2024

Can you create PR again with upgrade test enabled?

Yeah, sure! #4173

@fraidev fraidev added the bug Something isn't working label Sep 12, 2024
@fraidev fraidev self-assigned this Sep 12, 2024
@LeoBorai LeoBorai removed the bug Something isn't working label Oct 4, 2024
@LeoBorai LeoBorai changed the title [Bug] Upgrade local cluster freezes topic creation Upgrade local cluster freezes topic creation Oct 4, 2024
@fraidev
Copy link
Contributor Author

fraidev commented Oct 18, 2024

Fixed by #4175

@fraidev fraidev closed this as completed Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants