otg_fs / usbfs potential lock up #59
just to check, are you sure the waker is not being woken, and not that it is being woken but returns pending again even though the event you were waiting for should have triggered?

i had a quick look at the usbfs code for the first time just now and one thing that sticks out to me is that you’re checking interrupt flags etc in the poll_fns. the reason i did this in the interrupt handler in my stab at usbhs is that the hardware is very prone to letting you miss interrupts if you aren’t careful: when you receive a transfer complete interrupt, you will “lose” the information stored in the registers if another transfer occurs, including rx_len, which happens fairly often. so the poll_fn never sees the interrupt it’s looking for (the hardware sets the endpoint to nak after the ack is sent… i thought? it doesn’t say this in the RM now that i’m reading it. well, if it did, you won’t even get the next transfer’s interrupt which would have a screwed up datatoggle)

in fact, i suppose it is possible that the interrupt handler is “preempted” by the hardware: say you read the int_st, then another transfer completes, then you read rx_len: that rx_len would be for the wrong transfer. race conditions should be checked for by clearing the interrupt flag early, then making sure it is still cleared at the end, else send an error for the endpoint (we can’t be sure what we read was correct as we might have raced) and loop back around to handle the new transfer (or we could just let the interrupt handler fire again… but i don’t really trust the interrupt latency on risc v for that)

two other things: ch32-hal/src/otg_fs/endpoint.rs Line 168 in e2bee21
and also: ch32-hal/src/otg_fs/endpoint.rs Line 176 in e2bee21
checks like this one are not needed since it should totally be possible to have an IN and OUT endpoint on the same number (except for control endpoints) |
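The "clear the flag early, check it again at the end" idea from the comment above can be sketched roughly as follows. This is only an illustration on a mock: the `EndpointRegs` trait, `MockRegs`, `Event`, and `handle_transfer` are all hypothetical names, not the ch32-hal register API, and the mock simply pretends a second transfer completes right after `rx_len` is read.

```rust
use std::cell::Cell;

/// Hypothetical register interface; the real ch32 registers are named differently.
trait EndpointRegs {
    fn clear_transfer_flag(&self);
    fn transfer_flag_set(&self) -> bool;
    fn rx_len(&self) -> u16;
}

#[derive(Debug, PartialEq)]
enum Event {
    /// A transfer whose length could be read without interference.
    Received(u16),
    /// Another transfer completed while the registers were being read.
    Raced,
}

/// Handle a transfer-complete interrupt, returning every event observed.
fn handle_transfer<R: EndpointRegs>(regs: &R) -> Vec<Event> {
    let mut events = Vec::new();
    loop {
        // Clear the flag first, so a new completion will set it again.
        regs.clear_transfer_flag();
        let len = regs.rx_len();
        if !regs.transfer_flag_set() {
            // Flag is still clear: nothing completed during the read.
            events.push(Event::Received(len));
            return events;
        }
        // Flag came back: `len` may belong to either transfer, so report
        // an error and loop around to handle the new transfer.
        events.push(Event::Raced);
    }
}

/// Mock hardware (an assumption for illustration only).
struct MockRegs {
    flag: Cell<bool>,
    len: Cell<u16>,
    /// Length of a transfer that "completes" during the next `rx_len` read.
    pending: Cell<Option<u16>>,
}

impl EndpointRegs for MockRegs {
    fn clear_transfer_flag(&self) {
        self.flag.set(false);
    }
    fn transfer_flag_set(&self) -> bool {
        self.flag.get()
    }
    fn rx_len(&self) -> u16 {
        let len = self.len.get();
        // Simulate the hardware finishing another transfer right after the read.
        if let Some(next) = self.pending.take() {
            self.len.set(next);
            self.flag.set(true);
        }
        len
    }
}
```

With `pending` set to `None` the handler reports a single `Received`; with a pending transfer it reports `Raced` for the first one and then `Received` for the second. Whether the real hardware can actually race like this while `int_busy` is set is exactly what the thread is unsure about.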
First of all, thank you so much for reviewing our code! Really could use an extra pair of eyes.
Well, I believe so but will double check again later today. We added some prints, the prints acquire a critical section so they should be "globally ordered". (debugging IN endpoint Code with more prints:
Above is where the "bug" shows up, by seeing both This is what led me to believe there is a lost wakeup.
Our understanding has been that with Line 226 in d7f1b26
So I have actually never seen a case (yet) that the registers are clobbered / lost for otg_fs as long as the int_busy is set. (which the WCH example code also does)
That is a good catch. I think we didn't actually think that through as much when we wrote that assertion.
I see, I did in fact have a misunderstanding of the usb spec. But I guess this is "technically not broken yet" but good point we probably should fix that in the future. |
Ah, I'd glossed over int_busy. That makes sense (as long as there isn't a bug with it or something) As for your problem: I'm not really sure. maybe the waker is not registered somehow?! unfortunately AtomicWaker doesn't give you a way to check. only other guess is an executor bug but that seems unlikely. might have a proper look tomorrow, if I can repro I have a usb analyzer and might be able to get a debugger working. |
I have an analyzer too, I don't believe there is very much interesting going on the bus. If you are interested I'm more than happy to schedule an interactive session / voice call where we can debug this. |
Very interesting, turned off Still unsure what's going on TBH. |
In

    /// Wake a task by `TaskRef`.
    ///
    /// You can obtain a `TaskRef` from a `Waker` using [`task_from_waker`].
    pub fn wake_task(task: TaskRef) {
        let header = task.header();
        if header.state.run_enqueue() {
            // We have just marked the task as scheduled, so enqueue it.
            unsafe {
                let executor = header.executor.get().unwrap_unchecked();
                executor.enqueue(task);
            }
        }
    }

    /// Wake a task by `TaskRef` without calling pend.
    ///
    /// You can obtain a `TaskRef` from a `Waker` using [`task_from_waker`].
    pub fn wake_task_no_pend(task: TaskRef) {
        let header = task.header();
        if header.state.run_enqueue() {
            // We have just marked the task as scheduled, so enqueue it.
            unsafe {
                let executor = header.executor.get().unwrap_unchecked();
                executor.run_queue.enqueue(task);
            }
        }
    }

It looks like the executor.run_queue.enqueue error is not checked. @Codetector1374 found that we have the flag set in the state but no task on the run_queue. Edit: just kidding, the run_queue.enqueue doesn't return an error. |
Hummm looks like either we have a hardware bug in the atomic instruction or there is some bug in embassy_executor Basically
|
gg |
Ugh.... sounds like we don't have atomics.... Thanks..... WCH |
to elaborate, we switched the state/run_queue impl within embassy-executor from atomic to critical section, and the problem stopped reproducing. It could be the atomic impl is wrong, but neither of us could come up with a flaw, and given WCH's track record... |
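For readers unfamiliar with what "atomic to critical section" means for the task state, here is a rough, host-runnable sketch. It is not the embassy-executor source: the `State` struct and the `STATE_SPAWNED` / `STATE_RUN_QUEUED` constants are modeled loosely on the `run_enqueue()` call shown earlier and are assumptions, and `std::sync::Mutex` stands in for a real critical section (on target that would be something like disabling interrupts around the read-modify-write).

```rust
use std::sync::Mutex;

// Hypothetical bit layout, for illustration only.
const STATE_SPAWNED: u32 = 1 << 0;
const STATE_RUN_QUEUED: u32 = 1 << 1;

/// Task state guarded by a lock instead of an atomic read-modify-write.
struct State {
    bits: Mutex<u32>,
}

impl State {
    fn new_spawned() -> Self {
        State { bits: Mutex::new(STATE_SPAWNED) }
    }

    /// Mark the task as queued. Returns `true` only for the caller that
    /// performed the transition, and which must therefore enqueue the task.
    fn run_enqueue(&self) -> bool {
        // The whole check-and-set happens under the lock, so it cannot be
        // interleaved with another caller.
        let mut bits = self.bits.lock().unwrap();
        if *bits & STATE_SPAWNED == 0 || *bits & STATE_RUN_QUEUED != 0 {
            return false;
        }
        *bits |= STATE_RUN_QUEUED;
        true
    }

    /// Clear the queued bit when the executor takes the task off the queue.
    fn run_dequeue(&self) {
        let mut bits = self.bits.lock().unwrap();
        *bits &= !STATE_RUN_QUEUED;
    }
}
```

This also shows why the symptom reported above (flag set in the state, but no task on the run_queue) looks like a lost wakeup: once the queued bit is stuck, every later `run_enqueue()` returns `false` and the task is never enqueued again.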
you know… i was wondering why they removed atomics from some of the new V3 cores. might be why 🙃 (this Exists wrt testing atomics on riscv. never even touched it myself so no idea how hard it would be to run. all that really matters in practice though is “it works to spec” or “nope” so if one actually cared to prove if it doesn’t work it might just be easier to come up with a single counterexample) |
This change introduces a compile error if the toolchain is using a target with atomics. It looks like, at least on the QingKe V4, atomics are suspiciously broken. There is also a feature, `i-understand-that-atomics-are-likely-broken`, that you can enable to remove this compile error. See issue: ch32-rs/ch32-hal#59
This introduces a breaking change where atomics are no longer available. DMA can no longer use atomics, so a critical section is used instead. As discovered in ch32-rs#59, the QingKeV4 atomic implementation is likely broken. As a result we added a compiler check to make sure the atomic extension is disabled in ch32-rs/qingke#8. This change updates the dependency to use the new qingke and removes any reference to `core::sync::atomic` in ch32-hal.
This change removes all use of atomics from ch32-hal. As discovered in ch32-rs#59, the QingKeV4 atomic implementation is likely broken. As a result we added a compiler check to make sure the atomic extension is disabled in ch32-rs/qingke#8. This change updates the dependency to use the new qingke and removes any reference to `core::sync::atomic` in ch32-hal.
This change introduces a compile error if the toolchain is using a target with atomics. It looks like, at least on the QingKe V4, atomics are suspiciously broken. There is also a feature, `unsafe-trust-wch-atomics`, that you can enable to remove this compile error. See issue: ch32-rs/ch32-hal#59
ExplodingWaffle pointed out[1] the host is allowed to send a shorter packet than what we were expecting. Remove the assertions and links to embassy. [1] ch32-rs#59 (comment) Reported-by: Harry Brooke <[email protected]>
This change removes all use of atomics (CAS) from ch32-hal. As discovered in ch32-rs#59, the QingKeV4 atomic implementation is likely broken. As a result we added a compiler check to make sure the atomic extension is disabled in ch32-rs/qingke#8. This change updates the dependency to use the new qingke and removes any reference to `core::sync::atomic` in ch32-hal.
you've probably got this down as a lost cause :) but it still strikes me as odd that they could mess this up and still advertise it as a feature, so i did a bit of reading about atomics on riscv and there is one smoking gun. (one caveat: I haven't seen if your issue is caused by LR/SC or AMOs, but i'm assuming LR/SC because i really can't see what could go wrong with an AMO) exhibit a, unpriv ISA 14.2. "Zalrsc" Extension for Load-Reserved/Store-Conditional Instructions:
exhibit b, priv ISA 3.3.2. Trap-Return Instructions:
it's a little sneaky especially considering "preemptive context switch" isn't actually defined anywhere, but maybe ISRs should be invalidating the reservation set to avoid problems in CAS everywhere else? |
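To make the worry about reservations and ISRs concrete, here is a toy model. It is not a description of QingKe hardware or of the failing code in this issue: `Hart`, `isr`, and `set_bit0_with_interrupt` are made-up names, and the model simply assumes that a reservation can survive an interrupt whose handler modifies the same word. Under that assumption, an LR/SC read-modify-write in the main code silently overwrites the ISR's update unless the reservation is invalidated on the way out of the handler.

```rust
/// Toy model of one hart with a single memory word and a reservation flag.
struct Hart {
    mem: u32,
    reserved: bool,
}

impl Hart {
    /// Load-reserved: read the word and take a reservation.
    fn lr(&mut self) -> u32 {
        self.reserved = true;
        self.mem
    }

    /// Store-conditional: writes only if the reservation is still held.
    fn sc(&mut self, value: u32) -> bool {
        let ok = self.reserved;
        // An SC gives up the reservation whether or not it succeeds.
        self.reserved = false;
        if ok {
            self.mem = value;
        }
        ok
    }
}

/// Interrupt handler that sets bit 1 of the word with a plain store.
/// (Assumption: the plain store leaves the reservation intact.)
fn isr(hart: &mut Hart, clear_reservation_on_return: bool) {
    hart.mem |= 0b10;
    if clear_reservation_on_return {
        hart.reserved = false;
    }
}

/// Main code: set bit 0 with an LR/SC retry loop. The interrupt fires once,
/// between the LR and the SC of the first attempt.
fn set_bit0_with_interrupt(hart: &mut Hart, clear_reservation_on_return: bool) -> u32 {
    let mut fired = false;
    loop {
        let old = hart.lr();
        if !fired {
            fired = true;
            isr(hart, clear_reservation_on_return);
        }
        if hart.sc(old | 0b01) {
            return hart.mem;
        }
    }
}
```

In this model, leaving the reservation alone ends with only bit 0 set (the ISR's bit 1 is lost), while invalidating it on return forces a retry and ends with both bits set.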
Let me try this today, you might be correct. But I think, at least from reading the Linux code, it sounds like it is really only required if we are "saving / restoring" program state on a different "HART". Because on the same hart, with interrupts,
|
Ohhhhhhh I had a "shower thought". I think I understand how this can be messed up: we have the hardware save and restore enabled..... What if ... What if.... the hardware save/restore for the interrupt context is saving and restoring the address reservation? That would allow the second store conditional to succeed! I probably should also try that. |
Actually just tried
neither fixed the problem, unfortunately. Looks like we will just have to disable atomics for now. |
worth a try. i should really just find the time to repro this myself 😅. |
Oh I repro this just mashing a key on my “hid example” and was using tick timer 1 |
Discovered by accident lol. The repro process is kinda funny.
|
Writing this more or less as a PSA, the otgfs / usbfs driver seems to be able to enter a lock up state where an interrupt can fire and call the waker but the waker is not woken up.
currently debugging this issue, feel free to help if interested.