-
Notifications
You must be signed in to change notification settings - Fork 204
Support CoreMgmt over DRTIO #2559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
artiq/firmware/runtime/mgmt.rs
Outdated
ddma_mutex: &Mutex, subkernel_mutex: &Mutex, | ||
routing_table: &drtio_routing::RoutingTable, linkno: u8, | ||
destination: u8, stream: &mut TcpStream, key: &String, value: &Vec<u8>) -> Result<(), Error<SchedError>> { | ||
let mut message = Cursor::new(Vec::with_capacity(key.len() + value.len() + 4 * 2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+ 4 * 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will transmit both key and value through the DRTIO. Both key and value has a length parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, forgot it's not C-strings we're dealing with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically we can derive one of them using the size of total payload, but there is already string and bytes IO in the protocol, so I decided against it
artiq/firmware/satman/main.rs
Outdated
drtioaux::Packet::CoreMgmtSetLogLevelRequest {destination: _destination, .. } => { | ||
forward!(router, _routing_table, _destination, *rank, *self_destination, _repeaters, &packet); | ||
|
||
error!("RISC-V satellite devices do not support buffered logging"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How difficult would it be to implement that? I think it would be useful to have remote log access on RISC-V too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just need to change the UARTLogger to BufferLogger, should be straight-forward.
artiq/firmware/satman/main.rs
Outdated
coremgr.clear_data(); | ||
} | ||
|
||
if succeeded { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have a CoreMgmtReply { succeeded: bool }
packet instead? Just for consistency with other aux packets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found it less symmetric when we deal with requests that has a return payload (e.g. Config read, then it is whatever ConfigReadReply{ .. }
and CoreMgmtReply{ false }
, never would we see CoreMgmtReply{ true }
).
I will change it for consistency.
Satellite now disables transceiver TX before flashing, and reboot on its own after. The log from master device while the satellite is flashing looks like this:
|
artiq/firmware/satman/mgmt.rs
Outdated
drtioaux::send(0, &drtioaux::Packet::CoreMgmtAck)?; | ||
#[cfg(not(soc_platform = "efc"))] | ||
unsafe { | ||
clock::spin_us(10000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the wait there to make sure the message is sent out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. There are no specific reason for picking 10ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10ms sounds like a lot, doesn't drtioaux::send wait until all data is out of the BRAM buffer, and then the encoder/transceiver latency is just a few hundred nanoseconds at most (and some of it might be in parallel with the txenable path)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The delay also compensates the time the master devices need to receive the message.
With a delay of 10us, the master device has a large chance to detect link down before processing the message, so the remote_coremgmt
module would report aux packet error (link down)
.
100us seems to work reliably in a 2 RISC-V devices setup (master <-> satellite).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just implement a reliable handshake then, instead of "seems to work".
I.e. satellite waits for "going down ACK" from master before bringing down the link.
artiq/firmware/satman/mgmt.rs
Outdated
let value = self.current_payload.read_bytes().unwrap(); | ||
|
||
match key.as_str() { | ||
"gateware" | "bootloader" | "firmware" => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assume that all three elements would get flashed at once, like they would if calling artiq_flash
or on Zynq with the whole boot.bin
. Should it reboot after flashing each part, actually? Or should it allow writing to flash without reboot, and the system would have to be rebooted in a separate step?
Especially since it seems master will not do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, just flash everything at once. Developers who need to write only one part or perform other unusual actions would use JTAG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having some other instruction to prompt the flash looks hacky, especially when we implement this under config write
. The config isn't really written in this case.
And there are 3 binaries we need to flash. Maybe we make a new coremgmt
action to specifically deal with this instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely less hacky than a magic delay. The magic delay is also suboptimal and unreliable and when it fails it'll be difficult to debug.
Yes, replacing the firmware is different from config write, and having a separate protocol would be OK.
Is there a sufficient amount of checksumming on the binary payloads before they are written to flash? We have been (very occasionally, but still) seeing what looks like data corruption on the DRTIO links (aux packet error log messages), and potentially writing out corrupted gateware/firmware sounds like a debugging nightmare. |
Side note: I suspect this may be a widespread issue which people just aren't noticing. I do wonder whether having a checksum on DRTIO would be a good idea to make this kind of thing more obvious and easier to debug? |
Just checksum the entire binaries before starting to flash, regardless of what DRTIO does. |
038135a
to
9990012
Compare
artiq/firmware/satman/main.rs
Outdated
drtioaux::Packet::CoreMgmtDropLinkAck { destination: _destination } => { | ||
forward!(router, _routing_table, _destination, *rank, *self_destination, _repeaters, &packet); | ||
|
||
#[cfg(not(soc_platform = "efc"))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not(has_drtio_eem) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
artiq/artiq/firmware/satman/main.rs
Lines 755 to 758 in 333623e
#[cfg(not(soc_platform = "efc"))] | |
unsafe { | |
csr::gt_drtio::txenable_write(0xffffffffu32 as _); | |
} |
This checks not EFC, though I do agree it should check for DRTIO EEM as well.
Probably not an intended change in #2410, as pointed out in #2134. I will make a PR to revert this.
artiq/firmware/satman/mgmt.rs
Outdated
} | ||
|
||
} else { | ||
panic!("CRC failed in SDRAM (actual {:08x}, expected {:08x})", actual_crc, expected_crc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"in SDRAM" makes it sound like this is a SDRAM issue. Just write something like "CRC failed, images have not been written to flash".
artiq/frontend/aqctl_corelog.py
Outdated
@@ -25,6 +25,9 @@ def get_argparser(): | |||
help="Simulation - does not connect to device") | |||
parser.add_argument("core_addr", metavar="CORE_ADDR", | |||
help="hostname or IP address of the core device") | |||
parser.add_argument("-s", "--satellite", default=0, | |||
metavar="DRTIO_ID", type=int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metavar="SATELLITE"
It should match the argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it should be DRTIO_DEST or similar since this does not apply only to satellites.
artiq/frontend/aqctl_corelog.py
Outdated
@@ -25,6 +25,9 @@ def get_argparser(): | |||
help="Simulation - does not connect to device") | |||
parser.add_argument("core_addr", metavar="CORE_ADDR", | |||
help="hostname or IP address of the core device") | |||
parser.add_argument("-s", "--satellite", default=0, | |||
metavar="DRTIO_ID", type=int, | |||
help="the logged DRTIO destination") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"logged" is not very appropriate.
artiq/frontend/artiq_coremgmt.py
Outdated
def artifact_path(this_binary_dir, *path_filename): | ||
if args.srcbuild: | ||
# source tree - use path elements to locate file | ||
return os.path.join(this_binary_dir, *path_filename) | ||
else: | ||
# flat tree | ||
# all files in the same directory, discard path elements | ||
*_, filename = path_filename | ||
return os.path.join(this_binary_dir, filename) | ||
|
||
def convert_gateware(bit_filename): | ||
bin_handle, bin_filename = tempfile.mkstemp( | ||
prefix="artiq_", | ||
suffix="_" + os.path.basename(bit_filename)) | ||
with open(bit_filename, "rb") as bit_file, \ | ||
open(bin_handle, "wb") as bin_file: | ||
bit2bin(bit_file, bin_file) | ||
atexit.register(lambda: os.unlink(bin_filename)) | ||
return bin_filename | ||
|
||
gateware = convert_gateware(artifact_path( | ||
args.directory, "gateware", "top.bit")) | ||
bootloader = artifact_path( | ||
args.directory, "software", "bootloader", "bootloader.bin") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stuff is copied from artiq_flash, can it be shared?
It's also kinda messy, this way we'd only have to clean it up once later.
artiq/frontend/artiq_coremgmt.py
Outdated
t_flash = tools.add_parser("flash", | ||
help="flash the running system") | ||
|
||
p_zynq = t_flash.add_argument("-z", "--zynq", default=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any chance for auto detection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can simply try both riscv and zynq binaries, then go for the one that succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just like runtime/satman is currently detected.
RELEASE_NOTES.rst
Outdated
@@ -19,6 +19,8 @@ ARTIQ-9 (Unreleased) | |||
* Python 3.12 support. | |||
* Compiler can now give automatic suggestions for ``kernel_invariants``. | |||
* Idle kernels now restart when written with ``artiq_coremgmt`` and stop when erased/removed from config. | |||
* Support for coredevice reflashing through new the ``flash`` tool in ``artiq_coremgmt``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: 'through new the'
The runtime firmware currently does not support the termination of EFC's DRTIO connection, we may need to implement reconnect for DRTIO-over-EEM as well. |
Why not? Sounds like a bug/oversight/design issue. |
artiq/frontend/fetch_bin.py
Outdated
if component == "gateware": | ||
path = convert_gateware(path) | ||
|
||
return path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge this file and bit2bin into one new file with an appropriate name.
artiq/frontend/fetch_bin.py
Outdated
prefix="artiq_", suffix="_" + os.path.basename(bit_filename)) | ||
with open(bit_filename, "rb") as bit_file, open(bin_handle, "wb") as bin_file: | ||
bit2bin(bit_file, bin_file) | ||
atexit.register(lambda: os.unlink(bin_filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit hacky that functions would have random atexit side effects like that.
artiq/frontend/fetch_bin.py
Outdated
atexit.register(lambda: os.unlink(bin_filename)) | ||
return bin_filename | ||
|
||
if type(component) == list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be cleaner if the function only accepted lists, potentially containing only one element?
artiq/frontend/artiq_coremgmt.py
Outdated
pass | ||
|
||
if retrieved_bins is None: | ||
raise FileNotFoundError("both risc-v and zynq binaries not found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: probably 'neither risc-v nor zynq'
This avoids reallocation when transfering binaries. Reallocation during flash handshakes could cause DRTIO timeouts.
ec4327a
to
90764b0
Compare
artiq/frontend/flash_tools.py
Outdated
# after | ||
# https://github.com/mfischer/fpgadev-zynq/blob/master/top/python/bit_to_zynq_bin.py | ||
|
||
import struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move import to top with the others.
ARTIQ Pull Request
Description of Changes
This PR implements the followings:
artiq_coremgmt
viaconfig write
similar to Zynq device.Item (1) should belong to its own PR, but I am not sure if we should toe Zynq's line on this.
Coremgmt now requires a DRTIO destination byte. It is currently implemented as mandatory, so it is a breaking change.
Related Issue
Closes #2498
Known Issues
Satellite will be unable to respond to AUX traffic while flashing. Taking down DRTIO link and reboot is WIP. We may want to rethink the new DRTIO messages for this.Resolved.Routing the new reply messages are probably missing.Only needs to be controlled by the root core device. It should only forward packets, routing is unnecessary.Tests
Remote flashing, rebooting, logging (via
artiq_coremgmt
andaqctl_corelog
), and configuration via DRTIO between 2 core devices. Core devices ranges from Kasli 1.1, 2.0 and Kasli-SoC.A corresponding PR will be opened to artiq-zynq soon.
Type of Changes
Steps (Choose relevant, delete irrelevant before submitting)
All Pull Requests
Code Changes
flake8
to check code style (follow PEP-8 style).flake8
has issues with parsing Migen/gateware code, ignore as necessary.Git Logistics
git rebase --interactive
). Merge/squash/fixup commits that just fix or amend previous commits. Remove unintended changes & cleanup. See tutorial.git show
). Format:Licensing
See copyright & licensing for more info.
ARTIQ files that do not contain a license header are copyrighted by M-Labs Limited and are licensed under LGPLv3+.