-
-
Notifications
You must be signed in to change notification settings - Fork 514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory corruption when saving network settings #2313
Comments
In a quick and dirty test I made NetworkSettings execute enableAdminMode() and applyConfig() from inside its loop() (to have these functions execute synchronously to the main loop()), but it seems the issue is not fully resolved: Log
It seems I can't reproduce the issue when I wait for the device to print that it got it's IP address before hitting the "Save" button again, at least with the change I described in this comment. Edit: I found this: #2298 (comment) which could be related. |
@schlimmchen yes I have read that issue #2298 too. Basically your fix sounds good, but there seems to be an RF task running in parallel when I look at the Example Logs you provided. Here the IRQ subroutine may be triggered which could “interrupt” [sic] your inline code path. |
It would be helpfull to show use the log output in vscode as it automatically integrates the exception parser and shows a proper stack trace with readable symbols. |
When testing the migration to arduino core 3 I also realized that it immediatly crashes in |
@schlimmchen is this solved with your other fix from today that prevents allocating a |
IMHO this is a different issue as it only occours when pressing save. |
see #2360 |
Nope... Got this on the first try using v24.10.15 including the write guard. The backtrace seems to suggest that the Async TCP Server can't handle the underlying network connection breaking, or to be more specific: Once the connection breaks, handling the disconnect triggers some kind of issue.
Hm, similar context. Maybe a double free?
One more for good measure. Again, similar context. Maybe @mathieucarbou is willing to take a look? |
Yeah maybe same context, the plot thickens as they say:
While the webrequest seems to cope with AsyncWebServer::_handleDisconnect() during AsyncWebServerRequest::_onDisconnect() Can we provoke this situation somehow? I assume maybe something like a WLAN disconnect / Application Level Firewall or a tcpdump may tell us a bit more what happens on the network and where in the code / process this breaks. |
I'm sorry I won't be of a great help on this at the moment (too few info)... Having a reproductible use case outside of the app would help. Although we see AsyncTCP and ESPAsycnWS in the stack following user interaction it does not mean that the issue is there, but the normal processing of ESPAsyncWS could be impacted by other things running. There are a lof of similar ones reported in https://github.com/espressif/esp-idf. Questions, in the process to try isolate the cause:
|
These stack traces are nearly the same:
AsyncWebServerRequest::~AsyncWebServerRequest() {
_headers.clear();
_pathParams.clear();
if (_response != NULL) {
delete _response;
} For that to happen, the created response pointer (in this case So the This is important to note that This is important to consider because some code like this is brittle and relied probably on the fact that the response was sent and received by the browser, which is not the case: In WebApi.sendJsonResponse(request, response, __FUNCTION__, __LINE__);
Utils::removeAllFiles();
RestartHelper.triggerRestart(); The 2 list lines will be executed, and the request will be sent once the middleware chain and request handler have finished. Another one:
Here, the last lines will execute BEFORE the request is sent. All the things executed AFTER a response is attached to a request have to be carefully written in order to not have some impacts on any pointer the response could still reference. In the case of a json response, in some use cases, ArduinoJson won't do a copy but just point to the existing pointers (but this is not the issue we saw). In the issue we saw above, the lwip layer fails (wifi disconnect or something else) which triggers the response deletion.
void AsyncWebServerRequest::_onPoll() {
// os_printf("p\n");
if (_response != NULL && _client != NULL && _client->canSend()) {
if (!_response->_finished()) {
_response->_ack(this, 0, 0);
} else {
AsyncWebServerResponse* r = _response;
_response = NULL;
delete r;
_client->close();
}
}
}
void AsyncWebServerRequest::_onAck(size_t len, uint32_t time) {
// os_printf("a:%u:%u\n", len, time);
if (_response != NULL) {
if (!_response->_finished()) {
_response->_ack(this, len, time);
} else if (_response->_finished()) {
AsyncWebServerResponse* r = _response;
_response = NULL;
delete r;
_client->close();
}
}
}
``
I suspect this might be the issue... |
In AsyncWebServerRequest::~AsyncWebServerRequest() {
_headers.clear();
_pathParams.clear();
if (_response != NULL) {
delete _response;
_response = NULL;
} or (better): AsyncWebServerResponse* r = _response
_response = NULL;
delete r; To be sure the response is not used by any of the callbacks above. |
Thanks for looking into this. I see you spent quite some of your time, thanks! I could not yet fully understand your longer comment. I'll re-read it again later. What I do understand is that one has to be careful when sending a response (actually queuing sending a response) but executing code in the same context, which runs before the response was actually sent. I know about the example you gave, where the ESP is restarted. I did not dare to question it, but indeed it seems that sending the response and restarting the ESP (be it with a delay or not) is something of a race. What we would actually like to do is wait for the response to be sent, then trigger the reboot. The reason I asked whether you would want to have a look is that I suspect that you are interested in making ESPAsyncWebServer rebust against the issue we are looking at, even if we are using it in a questionable manner. Assuming that something can be done in the lib... The changes you proposed unfortunatly don't prevent the issue from being triggered.
For the record: I edited the file
I assume that's okay for a quick check? |
Yes! That will do. In Except if you know it already ? |
I have released Let mw know when you'll have more logs to pinpoint what causes the issue :-) |
@mathieucarbou thanks for your pointers to look closely at, this is my little bed-time crime story 🔍 🧐 🎩 for the evening I am still trying to understand what exactly happens, AsyncWebServerResponse* r = _response
_response = NULL;
delete r; You do that exactly the same way in the following three locations:
But before you send it you do something else:
void AsyncWebServerRequest::send(AsyncWebServerResponse* response) {
if (_sent)
return;
if (_response)
delete _response;
_response = response;
if (_response == NULL) {
_client->close(true);
_onDisconnect();
_sent = true;
return;
}
if (!_response->_sourceValid())
send(500);
} I also noticed that it always complains about being unable to free the heap memory for a std::list<AsyncWebHeader> _headers; Though I found only explicit code for clearing the memory of such a
Could it be that we are left with a dangling pointer to this / actually no valid AsyncWebHeader list through the above NULL and delete operations ? |
we are in the destructor, so the idea is to set the ref to the response to null ASAP in case we have some code elsewhere that could still see this pointer (this is the case for the 2 other callbacks). Then, once this is done, we can trigger the object deletion (which can take some time), but at least during this time the response in the request will be null.
Yes, this is to have the pointer set to null asap, then free after. void AsyncWebServerRequest::send(AsyncWebServerResponse* response) {
if (_sent)
return;
if (_response)
delete _response;
_response = response;
if (_response == NULL) {
_client->close(true);
_onDisconnect();
_sent = true;
return;
}
if (!_response->_sourceValid())
send(500);
} This code is just a feature to swap the response by another one. You are not using that. A middleware could decide to change a response that was set by a handler. So if a response was set, we delete it, then set the new one. This operation happens during the middleware chain processing (just after the handler and before the requests is sent on the network). So it is not the cause of the issue here.
That is exactly what I also find strange...
Not in the case of this list: AsyncWebHeader does not need to any destructor because this is a holder object of 2 strings and the list is storing objects by copying the value in a new instance held in the node (which is freed at node destruction). But the way a linked list work, is by pointing to the next structure, so when the object is freed from memory, each node are freed and this is a longer operation, compared to just remove an array from memory if we got a vector instead. The issue with a vector is that it requires reallocation. @schlimmchen should try to add some logs to know what is being executed (which request when it fails and also log before the delete calls on a response), also, point to the new version :-) We will have more info. It is possible that when the lwip layer sends an error (following network issue, com broken, etc) then there is a concurrency issue happening. |
@schlimmchen could you maybe test this patch? diff --git a/src/NetworkSettings.cpp b/src/NetworkSettings.cpp
index f545ac37..3067281f 100644
--- a/src/NetworkSettings.cpp
+++ b/src/NetworkSettings.cpp
@@ -28,8 +28,6 @@ void NetworkSettingsClass::init(Scheduler& scheduler)
WiFi.setScanMethod(WIFI_ALL_CHANNEL_SCAN);
WiFi.setSortMethod(WIFI_CONNECT_AP_BY_SIGNAL);
- WiFi.disconnect(true, true);
-
WiFi.onEvent(std::bind(&NetworkSettingsClass::NetworkEvent, this, _1, _2));
if (PinMapping.isValidW5500Config()) {
@@ -101,7 +99,6 @@ void NetworkSettingsClass::NetworkEvent(const WiFiEvent_t event, WiFiEventInfo_t
MessageOutput.printf("WiFi disconnected: %" PRIu8 "\r\n", info.wifi_sta_disconnected.reason);
if (_networkMode == network_mode::WiFi) {
MessageOutput.println("Try reconnecting");
- WiFi.disconnect(true, false);
WiFi.begin();
raiseEvent(network_event::NETWORK_DISCONNECTED);
} It reverts #2117 ....... at least for me, the reconnect works more reliable without these two lines and I also don't get an exception any more when changing network settings. |
Interesting. One click of the button and the ESP32 restarted. So I guess this is still reproducible. Second try needed a couple more clicks, but still very easy. After applying the patch, I got these:
Looks to me like the issue persists despite the patch... I only clicked the Save button repeatedly in the network settings UI. Hardware: OpenDTU Fusion with PoE Add-On and wired Ethernet in use. Software: OpenDTU Development build. Should I retry without wired Ethernet and with vanilla OpenDTU? |
Not again 😱 ! @schlimmchen : would you be able to add the complete stack traces again and point me to the current project and code again ? Did you verify in the pio build that you didn't have some duplicate asynctcp libs (following the organization change to ESP32Async) ? |
I can do more tests and provide full backtraces, using the upstream project's code. Not today, but eventually.
Errr... I am still on the "old" libs. Thomas did not publish the updated platformio.ini yet. And I don't know what needs to be done, to be honest. Can you help me out? I only see this:
Update this to the lib at the org and that is it? Do I need to clean something? I would expect that platformio does the rest... |
Now it is:
It is just that with the transitive resolving, I wanted to know if you happen to have some duplicates of AsyncTCP lib in the libdeps folder ;-) |
To be honest, I am not even sure if this issue is related to the webserver or asynctcp.... I applied the patch mentioned above and did not get any stack traces any more. could also be a random behavior on my side. Edit: Ok you are right, when pressing the save button a lot of times, the crash happens again
|
When you say that: are you actually triggering concurrent ajax requests ? Or the JS layer makes sure they are sent in sequence and wait for the first one to return before issuing the second one ? I am asking that because if you have a long-running operation somewhere, could the second request start to execute and when building the json payload, have an effect on the first one ? It's like you are missing some stack trace - there is no lines around the app code. |
Now I waited ~10sec before pressing the button again (after this time the function has definetly completed its job). Got a little bit different stack trace
|
Now with time stamps.... Every time you see
|
I don't understand... Why there are so many network interface resets with the same IP ? |
Every time I press the save button the wifi parameters will be applied again (ssid, password), the wifi mode will be updated (when saving the config, the internal access point will be activated) and the ip configuration will be applied (dhcp or static) Tomorrow I will check what happens if I execute this network specific stuff in the main loop. |
Before resetting the network layer, are you making sure to stop all network services, server included, and start them back again ? Or, easier, just save and restart the ESP ? |
I am able to reproduce also with the code from main branch... |
I found the issue and how to fix it. I think it is a combination of several problems:
Possible refactorings:
Something like that (patch for main), but to be improved because I have statically exposed things: diff --git a/include/NetworkSettings.h b/include/NetworkSettings.h
index 90d3962b..43effc31 100644
--- a/include/NetworkSettings.h
+++ b/include/NetworkSettings.h
@@ -56,6 +56,8 @@ public:
bool onEvent(DtuNetworkEventCb cbEvent, const network_event event = network_event::NETWORK_EVENT_MAX);
void raiseEvent(const network_event event);
+
+ static uint32_t reconfigure;
private:
void loop();
diff --git a/include/WebApi.h b/include/WebApi.h
index 6e85bafd..6deb15f1 100644
--- a/include/WebApi.h
+++ b/include/WebApi.h
@@ -33,6 +33,9 @@ public:
void init(Scheduler& scheduler);
void reload();
+ void beginServer() { _server.begin(); }
+ void endServer() { _server.end(); }
+
static bool checkCredentials(AsyncWebServerRequest* request);
static bool checkCredentialsReadonly(AsyncWebServerRequest* request);
diff --git a/src/MqttSettings.cpp b/src/MqttSettings.cpp
index 39e76e22..83106c52 100644
--- a/src/MqttSettings.cpp
+++ b/src/MqttSettings.cpp
@@ -6,6 +6,8 @@
#include "Configuration.h"
#include "MessageOutput.h"
+#include "WebApi.h"
+
MqttSettingsClass::MqttSettingsClass()
{
}
@@ -16,6 +18,7 @@ void MqttSettingsClass::NetworkEvent(network_event event)
case network_event::NETWORK_GOT_IP:
MessageOutput.println("Network connected");
performConnect();
+ WebApi.beginServer();
break;
case network_event::NETWORK_DISCONNECTED:
MessageOutput.println("Network lost connection");
diff --git a/src/NetworkSettings.cpp b/src/NetworkSettings.cpp
index f545ac37..a0fe5e81 100644
--- a/src/NetworkSettings.cpp
+++ b/src/NetworkSettings.cpp
@@ -12,6 +12,10 @@
#include <ESPmDNS.h>
#include <ETH.h>
+#include "WebApi.h"
+
+uint32_t NetworkSettingsClass::reconfigure = 0;
+
NetworkSettingsClass::NetworkSettingsClass()
: _loopTask(TASK_IMMEDIATE, TASK_FOREVER, std::bind(&NetworkSettingsClass::loop, this))
, _apIp(192, 168, 4, 1)
@@ -205,6 +209,12 @@ String NetworkSettingsClass::getApName() const
void NetworkSettingsClass::loop()
{
+ if(reconfigure && reconfigure < millis()) {
+ reconfigure = 0;
+ enableAdminMode();
+ applyConfig();
+ }
+
if (_ethConnected) {
if (_networkMode != network_mode::Ethernet) {
// Do stuff when switching to Ethernet mode
@@ -278,6 +288,8 @@ void NetworkSettingsClass::loop()
void NetworkSettingsClass::applyConfig()
{
+ WebApi.endServer();
+
setHostname();
if (!strcmp(Configuration.get().WiFi.Ssid, "")) {
return;
diff --git a/src/WebApi_network.cpp b/src/WebApi_network.cpp
index 51db32e4..4c0b4881 100644
--- a/src/WebApi_network.cpp
+++ b/src/WebApi_network.cpp
@@ -204,6 +204,6 @@ void WebApiNetworkClass::onNetworkAdminPost(AsyncWebServerRequest* request)
WebApi.sendJsonResponse(request, response, __FUNCTION__, __LINE__);
- NetworkSettings.enableAdminMode();
- NetworkSettings.applyConfig();
+ // delay reconfigure in 500ms
+ NetworkSettings.reconfigure = millis() + 500;
} Note: a UI spinner would be required though. But frankly, reconfiguring important things like the network should IMO trigger an ESP reboot for safety... Even with such refactoring, we can see that the code is doing weird things because it goes again through the captive portal: on mac, when clicking on save, it goes in AP_STA mode and the captive portal reopens, while at the same time the DHCP triggers. If already connected to WiFi, we should stay in WiFi mode I think, and AP mode should only spawned after a timeout. I know how these things are complicated: I've written a network manager handling AP / CP / WiFi / ETH here: https://github.com/mathieucarbou/MycilaESPConnect and it is hard to have something configurable while the app keeps running. |
What happened?
I noticed a bunch of ESP reboots after exceptions and started digging. One of them I could isolate and pin down to also occur in this project: When saving the network settings, there is some kind of memory corruption.
To Reproduce Bug
Save the network settings by clicking the save button in the web UI. Repeat until you observe a random crash.
Expected Behavior
Graceful application of network settings.
Install Method
Self-Compiled
What git-hash/version of OpenDTU?
3559007
What firmware variant (PIO Environment) are you using?
generic_esp32s3_usb
Relevant log/trace output
No response
Anything else?
Example 1
Example 2
Example 3
Please confirm the following
The text was updated successfully, but these errors were encountered: