Timeout registering at RouDi. Is RouDi running? #2401

Open
NeilZhy opened this issue Jan 4, 2025 · 17 comments
Labels
needs info A bug report is waiting for more information

Comments

@NeilZhy

NeilZhy commented Jan 4, 2025

Required information

Operating system:
E.g. Ubuntu 18.04 LTS

Compiler version:
E.g. GCC 7.4.0

Eclipse iceoryx version:
v2.0.3

Observed result or behaviour:
When I call iox::runtime::PoshRuntime::initRuntime("test");, I get the error Timeout registering at RouDi. Is RouDi running?,
even though iox-roudi is already running.

@NeilZhy
Author

NeilZhy commented Jan 5, 2025

Let me add some more details:

  1. I run iox-roudi.
  2. I run my test-sample.
  3. test-sample logs: 2023-01-01 00:04:30.484 [Warning]: Received a REG_ACK with an outdated timestamp!
  4. About 5 s later, iox-roudi logs: 2023-01-01 00:04:35.523 [Warning]: Application test-sample not responding (last response 5039 milliseconds ago) --> removing it
  5. Then test-sample logs: 2023-01-01 00:04:39.789 [ Fatal ]: Timeout registering at RouDi. Is RouDi running? followed by 2023-01-01 00:04:39.860 [ Error ]: ICEORYX error! IPC_INTERFACE__REG_ROUDI_NOT_AVAILABLE

@elBoberido Can you help me look into this problem? Thank you!

@elBoberido
Member

@NeilZhy this should be fixed on main. Can you try the v2.95.3 tag?

elBoberido added the "needs info" label Jan 6, 2025

@NeilZhy
Author

NeilZhy commented Jan 7, 2025

@elBoberido Thank you very much, I will try the v2.95.3 tag.
Could you explain why this problem occurs?
Is there an existing issue that discusses it, or a related commit or test?

@NeilZhy
Author

NeilZhy commented Jan 7, 2025

I compiled v2.95.3 and got this error:

FAILED: iceoryx_examples/icehello/iox-cpp-subscriber-helloworld 
: && aarch64-sel4-g++ -no-pie -O3 -DNDEBUG  iceoryx_examples/icehello/CMakeFiles/iox-cpp-subscriber-helloworld.dir/iox_subscriber_helloworld.cpp.o -o iceoryx_examples/icehello/iox-cpp-subscriber-helloworld -L/xxx/xxx/lib   -L/xxx/data/libattr/xxx/lib   -L/xxx/lib -Wl,-rpath,/xxx/lib:/xxx/lib:/xxx/lib:/xxx/platform:  posh/libiceoryx_posh.so.2.95.3  hoofs/libiceoryx_hoofs.so.2.95.3  platform/libiceoryx_platform.so.2.95.3  -lacl  -latomic  -lrt  -lpthread && :
/usr/lib/gcc-cross/aarch64-linux-gnu/11/../../../../aarch64-linux-gnu/bin/ld: hoofs/libiceoryx_hoofs.so.2.95.3: undefined reference to `pthread_mutexattr_setprioceiling'

There is a Stack Overflow question about this error: https://stackoverflow.com/questions/23250863/difference-between-pthread-and-lpthread-while-compiling

Bottom line: you should use the -pthread option.
Passing -lpthread does get the whole POSIX threading library.

@elBoberido
Member

@NeilZhy can you create a PR to fix the issue?

The problem usually occurs when an application tries to register at RouDi and RouDi takes so long to respond that the application sends a second request. In the meantime, RouDi responds to the old request, which triggers the Received a REG_ACK with an outdated timestamp! warning. In the end, there was a race that was not handled correctly, and then the described issue can happen.
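
To make the sequence above concrete, here is a minimal sketch of how the warning can arise. It is a model, not the iceoryx implementation: RegAck, waitForRegAck, replaySlowRouDi and the timestamp values are hypothetical names and numbers chosen for illustration, and the IPC channel is replaced by a plain in-memory queue.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for a REG_ACK message; only the echoed timestamp matters here.
struct RegAck
{
    int64_t echoedTimestamp;
};

enum class RegAckResult
{
    SUCCESS,
    TIMEOUT
};

// Drains the pending acknowledgements and accepts only the one that echoes the
// timestamp of the most recent request. Every other reply is logged as outdated.
RegAckResult waitForRegAck(std::deque<RegAck>& channel,
                           int64_t transmissionTimestamp,
                           std::vector<std::string>& warnings)
{
    while (!channel.empty())
    {
        const RegAck ack = channel.front();
        channel.pop_front();
        if (ack.echoedTimestamp == transmissionTimestamp)
        {
            return RegAckResult::SUCCESS;
        }
        warnings.push_back("Received a REG_ACK with an outdated timestamp!");
    }
    // Nothing matching arrived; in this model that counts as a timeout.
    return RegAckResult::TIMEOUT;
}

// Replays the scenario: RouDi is slow, the app re-sends, the first reply arrives late.
std::vector<std::string> replaySlowRouDi()
{
    std::deque<RegAck> channel;
    std::vector<std::string> warnings;

    const int64_t firstRequest = 1000;  // app sends a registration request, RouDi is slow
    const int64_t secondRequest = 2000; // app gives up waiting and sends a second request

    channel.push_back({firstRequest}); // RouDi finally answers the first request
    const RegAckResult early = waitForRegAck(channel, secondRequest, warnings);
    assert(early == RegAckResult::TIMEOUT); // the old reply was discarded with a warning
    static_cast<void>(early);

    channel.push_back({secondRequest}); // RouDi answers the second request
    const RegAckResult late = waitForRegAck(channel, secondRequest, warnings);
    assert(late == RegAckResult::SUCCESS);
    static_cast<void>(late);

    return warnings;
}
```

Calling replaySlowRouDi() yields exactly one warning in this model. Whether the real registration then succeeds depends on the second reply arriving before the application's overall timeout and before RouDi drops the application as unresponsive, which is where the race described above comes in.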

@NeilZhy
Author

NeilZhy commented Jan 7, 2025

@elBoberido Sorry, I don't quite understand which problem I need to fix. I haven't solved this compilation problem yet.

@elBoberido
Member

@NeilZhy I meant the issue with -lpthread. I'm also preparing a PR right now, so I can add the fix. Since you came up with the solution, I thought you might want to commit it yourself.

@NeilZhy
Author

NeilZhy commented Jan 7, 2025

@elBoberido Thanks. My local problem may not be caused by -lpthread. It may be that my libpthread.so version is too old, or something else. I have not found the specific cause yet.

@NeilZhy
Author

NeilZhy commented Jan 7, 2025

@elBoberido What is pthread_mutexattr_setprioceiling used for in iceoryx? As a temporary workaround, can I remove the call to pthread_mutexattr_setprioceiling and return 0 directly? I don't know how much impact this would have.

@NeilZhy
Author

NeilZhy commented Jan 7, 2025

> @NeilZhy this should be fixed on main. Can you try the v2.95.3 tag?

@elBoberido Could you please provide me with a patch or a commit that fixes it? I will apply the patch to our project for testing. It will be very difficult for me to upgrade the system to v2.95.3 in a short time, and this is urgent for me. Thank you very much.

@elBoberido
Member

@NeilZhy I think pthread_mutexattr_setprioceiling is not used in production code yet, so it should not have any impact.

Sorry, I'm currently quite busy with other stuff and can't look into when or whether this bug was fixed in v2.95.3. It's just a guess, since we have not had such issues for a long time. But then it should in theory also be fixed in v2.0.

First we need to know whether it is fixed on main, which is basically v2.95.3.

@NeilZhy
Author

NeilZhy commented Jan 8, 2025

Regarding the v2.95.3 link error I posted above: it is not a -lpthread problem. The system library of our cross-compilation toolchain declares pthread_mutexattr_setprioceiling but does not implement it.
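
For a toolchain in this state, the "return 0" workaround asked about earlier in the thread could be sketched as a small wrapper. This is only an illustration under stated assumptions: platform_mutexattr_setprioceiling and the HYPOTHETICAL_HAS_PTHREAD_PRIOCEILING switch are made-up names, not part of iceoryx, and the stub is only reasonable as long as nothing relies on the priority ceiling actually being set.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical build switch: define it only when the target libc really
// implements pthread_mutexattr_setprioceiling.
// #define HYPOTHETICAL_HAS_PTHREAD_PRIOCEILING

// Hypothetical wrapper name, not an iceoryx API.
inline int platform_mutexattr_setprioceiling(pthread_mutexattr_t* attr, int prioceiling)
{
    assert(attr != nullptr);
#if defined(HYPOTHETICAL_HAS_PTHREAD_PRIOCEILING)
    // Normal path: forward to the POSIX function.
    return pthread_mutexattr_setprioceiling(attr, prioceiling);
#else
    // Fallback for toolchains that declare the symbol but do not implement it:
    // report success without touching the attribute object. The priority
    // ceiling is silently NOT applied in this case.
    static_cast<void>(attr);
    static_cast<void>(prioceiling);
    return 0;
#endif
}
```

Returning 0 keeps callers working but hides that the ceiling was never set. Returning an error code such as ENOTSUP would be the more honest choice if any caller depends on the setting.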

@NeilZhy
Author

NeilZhy commented Jan 14, 2025

> @NeilZhy can you create a PR to fix the issue?
>
> The problem usually occurs when an application tries to register at RouDi and RouDi takes a long time to respond, so that the application sends a second request. In the meantime, RouDi responded for the old request. That's the Received a REG_ACK with an outdated timestamp! warning. In the end, there was a race which was not handled correctly and then the described issue can happen.

@elBoberido
Considering this situation, could the check be less strict? Could we stop using the following logic?

                if (transmissionTimestamp == receivedTimestamp)
                {
                    return RegAckResult::SUCCESS;
                }
                else
                {
                    LogWarn() << "Received a REG_ACK with an outdated timestamp!";
                }

Could we use the following logic instead? It checks whether the timestamp in the received REG_ACK is within 5 seconds of the timestamp of the REG message that was sent.

                const int64_t fiveSecondsInMicroseconds = 5 * 1000000;
                int64_t difference = std::abs(transmissionTimestamp - receivedTimestamp);
                if (difference <= fiveSecondsInMicroseconds)
                {
                    return RegAckResult::SUCCESS;
                }
                else
                {
                    LogWarn() << "Received a REG_ACK with an outdated timestamp!";
                }

@NeilZhy
Author

NeilZhy commented Jan 14, 2025

@elBoberido To balance security and tolerance, RouDi could return the application's runtime name together with REG_ACK. After the runtime receives the reply, it could then make the following check:

                 const int64_t fiveSecondsInMicroseconds = 5 * 1000000;
                 int64_t difference = std::abs(transmissionTimestamp - receivedTimestamp);
                 RuntimeName_t runtime_name;
                 cxx::convert::fromString(receiveBuffer.getElementAtIndex(5U).c_str(), runtime_name);
                 if (runtime_name == m_runtimeName && difference <= fiveSecondsInMicroseconds)
                 {
                     return RegAckResult::SUCCESS;
                 }
                 else
                 {
                     LogWarn() << "Received a REG_ACK with an outdated timestamp!";
                 }

RouDi would need to add the runtime name when replying with REG_ACK:

    sendBuffer << runtime::IpcMessageTypeToString(runtime::IpcMessageType::REG_ACK)
               << m_roudiMemoryInterface.mgmtMemoryProvider()->size() << offset << transmissionTimestamp
               << m_mgmtSegmentId << name;

@elBoberido Please help analyze whether this method is feasible. Thanks!

@NeilZhy
Author

NeilZhy commented Jan 16, 2025

@elBoberido Can you give me some advice, thanks

@elBoberido
Member

@NeilZhy the core team is currently quite busy with their day job working on iceoryx, so there is not much time for community support at the moment. We hope to be able to dedicate more time for community support again in the future.

To your question: the timestamp the application receives is the same one that it sent. RouDi does not add its own timestamp but just responds with the timestamp of the request. This ensures that the application stays in sync with its messages to RouDi. If this is not done, the following issue can happen:

  • app sends first registration request
  • RouDi is busy
  • app sends second registration request
  • RouDi responds to the first registration request
  • app assumes it's the response to the second request
  • app requests a publisher
  • RouDi responds to the second registration request
  • app reads the response and assumes it's for the publisher, but it was the out-of-sync response for the registration

As you can see, the timestamp acts as a unique ID in order to prevent the issue mentioned above.
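
The sequence in the list above can be replayed in a small model to compare the exact timestamp match with the proposed 5-second tolerance. This is a sketch under assumptions, not iceoryx code: Message, MessageType, exactMatch, withinFiveSeconds and publisherReplyIsInSync are hypothetical names, the timestamps are arbitrary microsecond values, and RouDi's replies are pre-queued in the order it would emit them.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <deque>

// Hypothetical message model; only the type and the echoed timestamp matter here.
enum class MessageType
{
    REG_ACK,
    CREATE_PUBLISHER_ACK
};

struct Message
{
    MessageType type;
    int64_t timestamp;
};

using Matcher = bool (*)(int64_t sent, int64_t received);

// The check used today: the echoed timestamp must be identical.
bool exactMatch(int64_t sent, int64_t received)
{
    return sent == received;
}

// The relaxed check proposed in this thread.
bool withinFiveSeconds(int64_t sent, int64_t received)
{
    constexpr int64_t fiveSecondsInMicroseconds = 5 * 1000000;
    return std::abs(sent - received) <= fiveSecondsInMicroseconds;
}

// Replays the bullet list and reports whether the message the app reads for its
// publisher request really is a publisher reply.
bool publisherReplyIsInSync(Matcher matches)
{
    assert(matches != nullptr);

    const int64_t firstRequest = 1000000;  // app sends first registration request
    const int64_t secondRequest = 2000000; // RouDi is busy, app sends a second one

    // Replies in the order RouDi emits them.
    std::deque<Message> fromRouDi;
    fromRouDi.push_back({MessageType::REG_ACK, firstRequest});
    fromRouDi.push_back({MessageType::REG_ACK, secondRequest});
    fromRouDi.push_back({MessageType::CREATE_PUBLISHER_ACK, 0});

    // Registration: read replies until the chosen check accepts one.
    bool registered = false;
    while (!registered && !fromRouDi.empty())
    {
        const Message reply = fromRouDi.front();
        fromRouDi.pop_front();
        registered = (reply.type == MessageType::REG_ACK) && matches(secondRequest, reply.timestamp);
    }
    if (!registered || fromRouDi.empty())
    {
        return false;
    }

    // Publisher request: the app takes the next message as the reply.
    return fromRouDi.front().type == MessageType::CREATE_PUBLISHER_ACK;
}
```

In this model publisherReplyIsInSync(exactMatch) is true, while publisherReplyIsInSync(withinFiveSeconds) is false: the tolerant check accepts the reply to the first request, so the reply to the second request is still queued when the app expects its publisher reply.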

Please check if the bug is reproducible with v2.95.3 before you continue with v2.0.

@NeilZhy
Author

NeilZhy commented Jan 20, 2025

@elBoberido Thank you very much, I understand what you mean. I will try the latest version sometime.
