Description
Over the last week a few of us have been trying to track down what causes these seemingly random wifi-related failures on boot. When they occur, the MiSTer gets an IP, even gets the time when there is no RTC board, and SAMBA/SSH connections work, but anything that tries to reach the internet after that fails. Even nslookup google.com 8.8.8.8 fails with a timeout when contacting the DNS server.
Here is what we have found:
- The issue seems to stem from the commits around April 2022. This commit may be the underlying culprit, but it is unclear how.
- The issue will occur if the wlan0 interface exists when dhcpcd and ifup start up. Normally the interface does not exist until further in the boot process, but for mine it is there early (within a few seconds), so dhcpcd pulls an initial IP when its process is started.
- On my MiSTer, a second udev event fires, causing a second lease to be requested, and it obtains a 169.254.x.x (link-local) address, while the previous lease still holds a correct local IP that is reachable. This does not fire automatically for most other people, but triggering it manually does reproduce the issue:
udevadm trigger /sys/class/net/wlan0 --action add
- Wifi seems to initialize very early for me, before the filesystem is remounted as r/w. This is not the case for most others. I am using a TP-Link Archer T3U Plus adapter (Amazon link). It is uncertain whether this plays a part in the issue at all. The random loss of internet occurs with the adapter from Porkchop as well, but it is uncertain whether it is exactly the same issue.
- Resetting the router often fixes the issue, but only briefly. Otherwise, the issue goes away after cold rebooting the MiSTer 6-10 times, then returns a number of days later, requiring another long round of reboots to fix.
- dhcpcd fails to write the lease file to /var/lib/dhcpcd/interface-ssid.lease because the dhcpcd directory does not exist and it reports the filesystem as read-only, which is accurate because it fires before the remount to r/w occurs. Considering I do not have an RTC addon (and the system time defaults to 1/1/1979), this means that the lease is about 40 years old once the time is pulled from NTP servers, so it is most likely treated as expired. It is uncertain whether this causes it to pull a second lease.
- dhcpcd will also fail to write the duid to /var/lib/dhcpcd/duid, since it is a read-only filesystem at the time it runs.
- dhcpcd is outdated on the MiSTer. It is uncertain whether updating it would fix the issue. Possibly BusyBox needs work.
- The issue occurs with several of my own routers/modems, so it shouldn't be related to the router. The udevadm reproduction works even on reputable routers.
- The DUID problem appears to cause some routers (mine) to give a new/different lease on every boot. Additionally, it appears the read-only filesystem lease problem causes the system to trigger the IPV4LL behavior, because if I move the lease directory to /media/fat/something with a symlink I can run the udevadm command multiple times and it just uses the existing lease.
- This second lease seems to be obtained by udhcpc and not dhcpcd. It is probably best to use only one and not both. This may be causing issues.
- Depending on timing, the DUID can be static, but with my MiSTer it pulls it too early, and fails to keep it.
- The wifi initiating early is most likely caused by the wlan0 interface loading within the first few seconds of booting. On working MiSTers, it seems to not load the wlan0 interface for a while longer, which allows the system to work normally.
- Running
rm /sbin/udhcpc
causes ifup to use dhcpcd, and then the rc script starts another copy. It seems the startup script prefers to load udhcpc before dhcpcd. It would probably be best to use only dhcpcd. If udhcpc is wanted instead, eth0 does not come up since it is not in the interfaces config file, so it would need to be added there.
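The findings above suggest three things worth checking on an affected unit: whether the address on wlan0 is an IPv4LL (169.254.x.x) one, whether the root filesystem is still read-only, and whether both DHCP clients are running at once. The following is only a rough sketch of such checks, not a tested diagnostic for the MiSTer image. The function names are made up for illustration, and it assumes pidof and awk are available (as they normally are with BusyBox).

```shell
#!/bin/sh
# Hypothetical helper functions; names are illustrative only.

# is_link_local ADDR: succeed if ADDR is in 169.254.0.0/16 (IPv4LL).
is_link_local() {
    case "$1" in
        169.254.*) return 0 ;;
        *) return 1 ;;
    esac
}

# root_is_readonly [MOUNTS_FILE]: succeed if the last entry mounted on /
# lists "ro" among its options. Reads /proc/mounts by default.
root_is_readonly() {
    mounts_file="${1:-/proc/mounts}"
    awk '$2 == "/" { opts = "," $4 "," }
         END { exit (index(opts, ",ro,") ? 0 : 1) }' "$mounts_file"
}

# running_dhcp_clients: print the name of each DHCP client that has a
# running process. Two lines of output would mean both are active.
running_dhcp_clients() {
    for name in dhcpcd udhcpc; do
        if pidof "$name" >/dev/null 2>&1; then
            echo "$name"
        fi
    done
}
```

Sourcing this from an SSH session and calling root_is_readonly right after boot, or running_dhcp_clients after the udevadm trigger, should help confirm whether a given MiSTer matches the behavior described here.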
Possible methods to fix (doing multiple is not a bad idea):
- Seemingly the best way to fix this would be to use the wpa_supplicant hook script (/usr/share/dhcpcd/hooks/10-wpa_supplicant), but it is broken with the MiSTer's wpa_supplicant implementation, so wpa_supplicant would need to be fixed first. This would allow using dhcpcd.conf to address anything network-related, with ifupdown handling only loopback.
- Find what's causing the interface to exist on boot. This alone should fix the issue, but it is a bit of a dirty fix.
- Fix write permissions on boot, so dhcpcd can write the lease/duid files properly. Another dirty fix, but it should address the issue.
- Fix it so that only dhcpcd or only udhcpc runs, with no possibility of both. I recommend doing this regardless of the other methods.
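As a stopgap for the write-permissions item, the symlink workaround mentioned in the findings (moving the lease directory to writable storage) could be scripted roughly as below. This is a sketch under assumptions, not a proposed patch: relocate_dir is a hypothetical helper, the destination is passed in by the caller, and it would have to be run once while the root filesystem is writable so the symlink itself can be created.

```shell
#!/bin/sh
# relocate_dir SRC DEST (hypothetical helper): move the contents of SRC to
# DEST and leave a symlink at SRC pointing to DEST. Safe to run repeatedly.
relocate_dir() {
    src="$1"
    dest="$2"

    # Nothing to do if SRC is already a symlink.
    [ -L "$src" ] && return 0

    mkdir -p "$dest" || return 1

    if [ -d "$src" ]; then
        # cp -R rather than cp -a: the destination may be a FAT-style
        # filesystem that cannot preserve ownership.
        cp -R "$src/." "$dest/" || return 1
        rm -rf "$src" || return 1
    else
        # SRC does not exist yet; make sure its parent does.
        mkdir -p "$(dirname "$src")" || return 1
    fi

    ln -s "$dest" "$src"
}
```

It would be called with /var/lib/dhcpcd as the first argument and a directory under /media/fat as the second. Whether storing the lease and duid there is acceptable long term is an open question; it only sidesteps the early read-only write failure.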
Any thoughts as to what may be directly causing this are welcome. I'd like to get to the bottom of this, as various users besides me have reported this happening at random with their MiSTer. We are using the latest Mr. Fusion images as far as I know.
I have attached my syslog for review.