Version
v0.10.3-0-g4d11815e6 (and v0.9.3-0-gd09a7dd35)
Describe the bug.
After running a version older than v0.10.x (presumably any such version; confirmed with v0.10.3 as the upgrade target) and then upgrading the site and NICo controller software to the v0.10.x lineage, the software stops working completely and crashes repeatedly on startup. This occurs in any deployment that previously configured NVLink partitioning features; deployments that don't use this functionality are unaffected.
Manual remediations had to be applied to get the software to start again.
Details:
After the upgrade, and after applying the migrations included in the source, the software starts and immediately crashes on a failed JSON decode. It tries to decode raw JSON from a database column in which an expected field is missing:
Error: Database Error: error occurred while decoding: missing field `chassis_serial` file=crates/api-db/src/machine.rs line=299 query=SELECT row_to_json(m.*) FROM (SELECT m.*, sku.device_type as hw_sku_device_type, COALESCE(i.json, '[]') AS interfaces, COALESCE(t.json, '[]') AS topology, COALESCE(bmc.json, t.bmc_info, '{}'::jsonb) AS bmc_info FROM machines m LEFT JOIN machine_skus sku on m.hw_sku = sku.id LEFT JOIN LATERAL ( SELECT json_build_array(json_build_object( 'machine_id', mt2.machine_id, 'topology', mt2.topology, 'created', mt2.created, 'updated', mt2.updated, 'topology_update_needed', mt2.topology_update_needed )) AS json, mt2.topology->'bmc_info' AS bmc_info FROM machine_topologies mt2 WHERE mt2.machine_id = m.id ORDER BY mt2.created DESC LIMIT 1 ) AS t ON true LEFT JOIN LATERAL ( SELECT jsonb_strip_nulls( COALESCE(t.bmc_info, '{}'::jsonb) || jsonb_build_object( 'machine_interface_id', bmc_i.id, 'ip', COALESCE(host(bmc_addr.address), t.bmc_info->>'ip'), 'mac', COALESCE(bmc_i.mac_address::text, t.bmc_info->>'mac') ) ) AS json FROM machine_interfaces bmc_i LEFT JOIN LATERAL ( SELECT a.address FROM machine_interface_addresses a WHERE a.interface_id = bmc_i.id ORDER BY family(a.address), a.address LIMIT 1 ) AS bmc_addr ON true WHERE bmc_i.machine_id = m.id AND bmc_i.interface_type = 'Bmc' ORDER BY bmc_i.created ASC LIMIT 1 ) AS bmc ON true LEFT JOIN LATERAL ( SELECT json_agg(x) AS json FROM ( SELECT i2.*, COALESCE(a.json, '[]') AS addresses, COALESCE(v.json, '[]') AS vendors, ns.network_segment_type FROM machine_interfaces i2 LEFT JOIN LATERAL ( SELECT json_agg(a2.address) AS json FROM machine_interface_addresses a2 WHERE a2.interface_id = i2.id ) AS a ON true LEFT JOIN LATERAL ( SELECT json_agg(d.vendor_string) AS json FROM dhcp_entries d WHERE d.machine_interface_id = i2.id ) AS v ON true INNER JOIN network_segments ns ON ns.id = i2.segment_id WHERE i2.machine_id = m.id AND i2.interface_type != 'Bmc' ) x ) AS i ON true) m INNER JOIN machines ON machines.id = m.id WHERE TRUE.
This seems to originate from here:
- infra-controller/crates/api-model/src/machine/json.rs:106 includes nvlink_info: Option<MachineNvLinkInfo>.
- infra-controller/crates/api-model/src/hardware_info.rs:365 makes MachineNvLinkInfo.chassis_serial mandatory.
The code in question seems to have been added in #1580.
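The failure mode described above can be sketched in a few lines. This is only an illustration in Python, not the project's Rust code: `NvLinkInfo`, `decode_nvlink_info`, and the `placeholder_field` key are hypothetical names, and only `chassis_serial` and the wording of the error come from this report. It assumes the decoder rejects any stored object that lacks the mandatory field, which is what the log message suggests.

```python
import json
from dataclasses import dataclass


@dataclass
class NvLinkInfo:
    """Hypothetical stand-in for MachineNvLinkInfo.

    Only chassis_serial is taken from the report; the real struct
    has other fields that are not modelled here.
    """
    chassis_serial: str  # mandatory: there is no default value


def decode_nvlink_info(raw: str) -> NvLinkInfo:
    """Decode a stored JSON object, failing if chassis_serial is absent."""
    data = json.loads(raw)
    if "chassis_serial" not in data:
        # Mirrors the wording of the error in the log above.
        raise ValueError("missing field `chassis_serial`")
    return NvLinkInfo(chassis_serial=data["chassis_serial"])


# A row written before the upgrade (the key is a made-up placeholder).
old_row = '{"placeholder_field": "x"}'
# The same row after the manual remediation.
fixed_row = '{"placeholder_field": "x", "chassis_serial": "SN123"}'
```

Decoding `old_row` raises, while `fixed_row` decodes. That is why adding the key to existing rows lets the software start again.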
To remediate this, the chassis serial has to be copied manually from another field in the database (the GPU platform_info stored in machine_topologies) into the nvlink_info JSON column. The following SQL snippet does this:
BEGIN;
UPDATE machines m
SET nvlink_info = jsonb_set(
    m.nvlink_info,
    '{chassis_serial}',
    to_jsonb(src.chassis_serial::text),
    true
)
FROM (
    SELECT
        m.id AS machine_id,
        (
            SELECT trim(gpu->'platform_info'->>'chassis_serial')
            FROM machine_topologies mt
            CROSS JOIN LATERAL jsonb_array_elements(
                mt.topology->'discovery_data'->'Info'->'gpus'
            ) AS gpu
            WHERE mt.machine_id = m.id
              AND gpu->'platform_info' IS NOT NULL
              AND nullif(trim(gpu->'platform_info'->>'chassis_serial'), '') IS NOT NULL
            LIMIT 1
        ) AS chassis_serial
    FROM machines m
    WHERE m.nvlink_info IS NOT NULL
      AND coalesce(nullif(trim(m.nvlink_info->>'chassis_serial'), ''), '') = ''
) src
WHERE m.id = src.machine_id
  AND src.chassis_serial IS NOT NULL;
COMMIT;
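For readers who want to check the selection logic before touching a database, here is a minimal Python sketch of what the SQL above does for a single machine. The function names `first_chassis_serial` and `backfill` are hypothetical; the JSON paths are the ones used in the query. It assumes the topology and nvlink_info values have already been loaded as Python dicts. Unlike SQL `trim`, `str.strip` removes all whitespace, not only spaces.

```python
from typing import Optional


def first_chassis_serial(topology: dict) -> Optional[str]:
    """Return the first non-blank GPU chassis serial in a topology, if any."""
    info = (topology.get("discovery_data") or {}).get("Info") or {}
    for gpu in info.get("gpus") or []:
        platform_info = gpu.get("platform_info")
        if not isinstance(platform_info, dict):
            continue
        serial = platform_info.get("chassis_serial")
        if isinstance(serial, str) and serial.strip():
            return serial.strip()
    return None


def backfill(nvlink_info: Optional[dict], topologies: list) -> Optional[dict]:
    """Add chassis_serial to nvlink_info when it is missing or blank.

    Rows without nvlink_info, rows that already carry a serial, and rows
    with no usable source serial are returned unchanged, like the
    WHERE clauses in the SQL.
    """
    if nvlink_info is None:
        return None
    existing = nvlink_info.get("chassis_serial")
    if isinstance(existing, str) and existing.strip():
        return nvlink_info
    for topology in topologies:
        serial = first_chassis_serial(topology)
        if serial is not None:
            return {**nvlink_info, "chassis_serial": serial}
    return nvlink_info
```

As in the SQL, which uses `LIMIT 1` without an `ORDER BY`, the sketch takes the first usable serial it finds, so the choice is arbitrary if GPUs on the same machine report different serials.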
Minimum reproducible example
1. Install any version before v0.10.x
2. Use the software
3. Set up an NVLink partition
4. Install a version v0.10.x or newer
Relevant log output
Other/Misc.
No response