Skip to content

Support CoreMgmt over DRTIO #2559

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 43 commits into from
Nov 20, 2024
Merged

Conversation

occheung
Copy link
Contributor

@occheung occheung commented Aug 27, 2024

ARTIQ Pull Request

Description of Changes

This PR implements the followings:

  1. Flashing gateware/bootloader/firmware through artiq_coremgmt via config write similar to Zynq device.
  2. Implement corresponding core-mgmt handling for satellites
  3. Create new DRTIO messages for (2).

Item (1) should belong to its own PR, but I am not sure if we should toe Zynq's line on this.
Coremgmt now requires a DRTIO destination byte. It is currently implemented as mandatory, so it is a breaking change.

Related Issue

Closes #2498

Known Issues

  • Satellite will be unable to respond to AUX traffic while flashing. Taking down DRTIO link and reboot is WIP. We may want to rethink the new DRTIO messages for this. Resolved.
  • Routing the new reply messages are probably missing. Only needs to be controlled by the root core device. It should only forward packets, routing is unnecessary.

Tests

Remote flashing, rebooting, logging (via artiq_coremgmt and aqctl_corelog), and configuration via DRTIO between 2 core devices. Core devices ranges from Kasli 1.1, 2.0 and Kasli-SoC.

A corresponding PR will be opened to artiq-zynq soon.

Type of Changes

Type
✨ New feature

Steps (Choose relevant, delete irrelevant before submitting)

All Pull Requests

  • Use correct spelling and grammar.
  • Update RELEASE_NOTES.rst if there are noteworthy changes, especially if there are changes to existing APIs.
  • Close/update issues.

Code Changes

  • Run flake8 to check code style (follow PEP-8 style). flake8 has issues with parsing Migen/gateware code, ignore as necessary.
  • Test your changes or have someone test them. Mention what was tested and how.
  • Add and check docstrings and comments
  • Check, test, and update the unittests in /artiq/test/ or gateware simulations in /artiq/gateware/test

Git Logistics

  • Split your contribution into logically separate changes (git rebase --interactive). Merge/squash/fixup commits that just fix or amend previous commits. Remove unintended changes & cleanup. See tutorial.
  • Write short & meaningful commit messages. Review each commit for messages (git show). Format:
    topic: description. < 50 characters total.
    
    Longer description. < 70 characters per line
    

Licensing

See copyright & licensing for more info.
ARTIQ files that do not contain a license header are copyrighted by M-Labs Limited and are licensed under LGPLv3+.

@sbourdeauducq sbourdeauducq requested a review from Spaqin August 27, 2024 14:41
ddma_mutex: &Mutex, subkernel_mutex: &Mutex,
routing_table: &drtio_routing::RoutingTable, linkno: u8,
destination: u8, stream: &mut TcpStream, key: &String, value: &Vec<u8>) -> Result<(), Error<SchedError>> {
let mut message = Cursor::new(Vec::with_capacity(key.len() + value.len() + 4 * 2));
Copy link
Collaborator

@Spaqin Spaqin Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ 4 * 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will transmit both key and value through the DRTIO. Both key and value has a length parameter.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, forgot it's not C-strings we're dealing with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we can derive one of them using the size of total payload, but there is already string and bytes IO in the protocol, so I decided against it

drtioaux::Packet::CoreMgmtSetLogLevelRequest {destination: _destination, .. } => {
forward!(router, _routing_table, _destination, *rank, *self_destination, _repeaters, &packet);

error!("RISC-V satellite devices do not support buffered logging");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How difficult would it be to implement that? I think it would be useful to have remote log access on RISC-V too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to change the UARTLogger to BufferLogger, should be straight-forward.

coremgr.clear_data();
}

if succeeded {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have a CoreMgmtReply { succeeded: bool } packet instead? Just for consistency with other aux packets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found it less symmetric when we deal with requests that has a return payload (e.g. Config read, then it is whatever ConfigReadReply{ .. } and CoreMgmtReply{ false }, never would we see CoreMgmtReply{ true }).

I will change it for consistency.

@occheung
Copy link
Contributor Author

Satellite now disables transceiver TX before flashing, and reboot on its own after. The log from master device while the satellite is flashing looks like this:

[    19.307458s]  INFO(runtime::mgmt): new connection from <host ip:port>
[    22.075730s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (LinkDown)
[    22.282018s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link is down
[    22.286957s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is down
[    33.694013s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[    37.659062s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 21 packets
[    37.877679s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[    37.885080s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[    37.890785s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128

drtioaux::send(0, &drtioaux::Packet::CoreMgmtAck)?;
#[cfg(not(soc_platform = "efc"))]
unsafe {
clock::spin_us(10000);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the wait there to make sure the message is sent out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There are no specific reason for picking 10ms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10ms sounds like a lot, doesn't drtioaux::send wait until all data is out of the BRAM buffer, and then the encoder/transceiver latency is just a few hundred nanoseconds at most (and some of it might be in parallel with the txenable path)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delay also compensates the time the master devices need to receive the message.
With a delay of 10us, the master device has a large chance to detect link down before processing the message, so the remote_coremgmt module would report aux packet error (link down).
100us seems to work reliably in a 2 RISC-V devices setup (master <-> satellite).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just implement a reliable handshake then, instead of "seems to work".
I.e. satellite waits for "going down ACK" from master before bringing down the link.

let value = self.current_payload.read_bytes().unwrap();

match key.as_str() {
"gateware" | "bootloader" | "firmware" => {
Copy link
Collaborator

@Spaqin Spaqin Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume that all three elements would get flashed at once, like they would if calling artiq_flash or on Zynq with the whole boot.bin. Should it reboot after flashing each part, actually? Or should it allow writing to flash without reboot, and the system would have to be rebooted in a separate step?

Especially since it seems master will not do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just flash everything at once. Developers who need to write only one part or perform other unusual actions would use JTAG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having some other instruction to prompt the flash looks hacky, especially when we implement this under config write. The config isn't really written in this case.

And there are 3 binaries we need to flash. Maybe we make a new coremgmt action to specifically deal with this instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely less hacky than a magic delay. The magic delay is also suboptimal and unreliable and when it fails it'll be difficult to debug.

Yes, replacing the firmware is different from config write, and having a separate protocol would be OK.

@dnadlinger
Copy link
Collaborator

Is there a sufficient amount of checksumming on the binary payloads before they are written to flash? We have been (very occasionally, but still) seeing what looks like data corruption on the DRTIO links (aux packet error log messages), and potentially writing out corrupted gateware/firmware sounds like a debugging nightmare.

@hartytp
Copy link
Collaborator

hartytp commented Aug 29, 2024

seeing what looks like data corruption on the DRTIO links (aux packet error log messages), and potentially writing out corrupted gateware/firmware sounds like a debugging nightmare.

Side note: I suspect this may be a widespread issue which people just aren't noticing. I do wonder whether having a checksum on DRTIO would be a good idea to make this kind of thing more obvious and easier to debug?

@sbourdeauducq
Copy link
Member

Just checksum the entire binaries before starting to flash, regardless of what DRTIO does.

@occheung occheung marked this pull request as ready for review September 20, 2024 09:38
drtioaux::Packet::CoreMgmtDropLinkAck { destination: _destination } => {
forward!(router, _routing_table, _destination, *rank, *self_destination, _repeaters, &packet);

#[cfg(not(soc_platform = "efc"))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not(has_drtio_eem) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#[cfg(not(soc_platform = "efc"))]
unsafe {
csr::gt_drtio::txenable_write(0xffffffffu32 as _);
}

This checks not EFC, though I do agree it should check for DRTIO EEM as well.
Probably not an intended change in #2410, as pointed out in #2134. I will make a PR to revert this.

}

} else {
panic!("CRC failed in SDRAM (actual {:08x}, expected {:08x})", actual_crc, expected_crc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"in SDRAM" makes it sound like this is a SDRAM issue. Just write something like "CRC failed, images have not been written to flash".

@@ -25,6 +25,9 @@ def get_argparser():
help="Simulation - does not connect to device")
parser.add_argument("core_addr", metavar="CORE_ADDR",
help="hostname or IP address of the core device")
parser.add_argument("-s", "--satellite", default=0,
metavar="DRTIO_ID", type=int,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metavar="SATELLITE"
It should match the argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it should be DRTIO_DEST or similar since this does not apply only to satellites.

@@ -25,6 +25,9 @@ def get_argparser():
help="Simulation - does not connect to device")
parser.add_argument("core_addr", metavar="CORE_ADDR",
help="hostname or IP address of the core device")
parser.add_argument("-s", "--satellite", default=0,
metavar="DRTIO_ID", type=int,
help="the logged DRTIO destination")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"logged" is not very appropriate.

Comment on lines 174 to 197
def artifact_path(this_binary_dir, *path_filename):
if args.srcbuild:
# source tree - use path elements to locate file
return os.path.join(this_binary_dir, *path_filename)
else:
# flat tree
# all files in the same directory, discard path elements
*_, filename = path_filename
return os.path.join(this_binary_dir, filename)

def convert_gateware(bit_filename):
bin_handle, bin_filename = tempfile.mkstemp(
prefix="artiq_",
suffix="_" + os.path.basename(bit_filename))
with open(bit_filename, "rb") as bit_file, \
open(bin_handle, "wb") as bin_file:
bit2bin(bit_file, bin_file)
atexit.register(lambda: os.unlink(bin_filename))
return bin_filename

gateware = convert_gateware(artifact_path(
args.directory, "gateware", "top.bit"))
bootloader = artifact_path(
args.directory, "software", "bootloader", "bootloader.bin")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stuff is copied from artiq_flash, can it be shared?
It's also kinda messy, this way we'd only have to clean it up once later.

t_flash = tools.add_parser("flash",
help="flash the running system")

p_zynq = t_flash.add_argument("-z", "--zynq", default=False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance for auto detection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can simply try both riscv and zynq binaries, then go for the one that succeed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just like runtime/satman is currently detected.

@@ -19,6 +19,8 @@ ARTIQ-9 (Unreleased)
* Python 3.12 support.
* Compiler can now give automatic suggestions for ``kernel_invariants``.
* Idle kernels now restart when written with ``artiq_coremgmt`` and stop when erased/removed from config.
* Support for coredevice reflashing through new the ``flash`` tool in ``artiq_coremgmt``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: 'through new the'

@occheung
Copy link
Contributor Author

The runtime firmware currently does not support the termination of EFC's DRTIO connection, we may need to implement reconnect for DRTIO-over-EEM as well.

@sbourdeauducq
Copy link
Member

sbourdeauducq commented Oct 10, 2024

The runtime firmware currently does not support the termination of EFC's DRTIO connection

Why not? Sounds like a bug/oversight/design issue.

if component == "gateware":
path = convert_gateware(path)

return path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge this file and bit2bin into one new file with an appropriate name.

prefix="artiq_", suffix="_" + os.path.basename(bit_filename))
with open(bit_filename, "rb") as bit_file, open(bin_handle, "wb") as bin_file:
bit2bin(bit_file, bin_file)
atexit.register(lambda: os.unlink(bin_filename))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit hacky that functions would have random atexit side effects like that.

atexit.register(lambda: os.unlink(bin_filename))
return bin_filename

if type(component) == list:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be cleaner if the function only accepted lists, potentially containing only one element?

pass

if retrieved_bins is None:
raise FileNotFoundError("both risc-v and zynq binaries not found")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: probably 'neither risc-v nor zynq'

# after
# https://github.com/mfischer/fpgadev-zynq/blob/master/top/python/bit_to_zynq_bin.py

import struct
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move import to top with the others.

@sbourdeauducq sbourdeauducq merged commit 8b335c4 into m-labs:master Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Coremgmt support for satellites
6 participants