bug: Upgrading from any prior (minor) version to v0.10.3 causes NICo check-fail #2857

Description

@jd-nv

Version

v0.10.3-0-g4d11815e6 (and v0.9.3-0-gd09a7dd35)

Describe the bug.

When a site running a version older than v0.10.x is upgraded, together with the NICo controller software, to the v0.10.x lineage (confirmed with v0.10.3; presumably any v0.10.x release), the software stops working completely and crashes repeatedly on startup. This occurs only in deployments that have previously configured NVLink partitioning features; deployments that do not use this functionality are unaffected.

Manual remediations had to be applied to get the software to start again.


Details:

After the upgrade, and after applying the migrations included in the source, the software starts and then immediately crashes while decoding raw JSON from a database column, because an expected field is missing:

Error: Database Error: error occurred while decoding: missing field `chassis_serial` file=crates/api-db/src/machine.rs line=299 query=SELECT row_to_json(m.*) FROM (SELECT m.*, sku.device_type as hw_sku_device_type, COALESCE(i.json, '[]') AS interfaces, COALESCE(t.json, '[]') AS topology, COALESCE(bmc.json, t.bmc_info, '{}'::jsonb) AS bmc_info FROM machines m LEFT JOIN machine_skus sku on  m.hw_sku = sku.id LEFT JOIN LATERAL ( SELECT json_build_array(json_build_object( 'machine_id', mt2.machine_id, 'topology', mt2.topology, 'created', mt2.created, 'updated', mt2.updated, 'topology_update_needed', mt2.topology_update_needed )) AS json, mt2.topology->'bmc_info' AS bmc_info FROM machine_topologies mt2 WHERE mt2.machine_id = m.id ORDER BY mt2.created DESC LIMIT 1 ) AS t ON true LEFT JOIN LATERAL ( SELECT jsonb_strip_nulls( COALESCE(t.bmc_info, '{}'::jsonb) || jsonb_build_object( 'machine_interface_id', bmc_i.id, 'ip', COALESCE(host(bmc_addr.address), t.bmc_info->>'ip'), 'mac', COALESCE(bmc_i.mac_address::text, t.bmc_info->>'mac') ) ) AS json FROM machine_interfaces bmc_i LEFT JOIN LATERAL ( SELECT a.address FROM machine_interface_addresses a WHERE a.interface_id = bmc_i.id ORDER BY family(a.address), a.address LIMIT 1 ) AS bmc_addr ON true WHERE bmc_i.machine_id = m.id AND bmc_i.interface_type = 'Bmc' ORDER BY bmc_i.created ASC LIMIT 1 ) AS bmc ON true LEFT JOIN LATERAL ( SELECT json_agg(x) AS json FROM ( SELECT i2.*, COALESCE(a.json, '[]') AS addresses, COALESCE(v.json, '[]') AS vendors, ns.network_segment_type FROM machine_interfaces i2 LEFT JOIN LATERAL ( SELECT json_agg(a2.address) AS json FROM machine_interface_addresses a2 WHERE a2.interface_id = i2.id ) AS a ON true LEFT JOIN LATERAL ( SELECT json_agg(d.vendor_string) AS json FROM dhcp_entries d WHERE d.machine_interface_id = i2.id ) AS v ON true INNER JOIN network_segments ns ON ns.id = i2.segment_id WHERE i2.machine_id = m.id AND i2.interface_type != 'Bmc' ) x ) AS i ON true) m INNER JOIN machines ON 
machines.id = m.id WHERE TRUE.

This seems to originate from here:

- infra-controller/crates/api-model/src/machine/json.rs:106 includes nvlink_info: Option<MachineNvLinkInfo>.
  - infra-controller/crates/api-model/src/hardware_info.rs:365 makes MachineNvLinkInfo.chassis_serial mandatory.

The code in question seems to have been added in #1580.
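
The failure mode described above can be illustrated with a minimal sketch. This is a simplified Python model, not the project's Rust code: only the names `nvlink_info`, `MachineNvLinkInfo`, and `chassis_serial` come from this report, while `Machine`, `decode_machine`, and the `domain` key are hypothetical. It assumes that an absent `nvlink_info` decodes fine, but a present `nvlink_info` without `chassis_serial` is rejected.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MachineNvLinkInfo:
    # Mandatory in this sketch, mirroring how the report describes chassis_serial.
    chassis_serial: str


@dataclass
class Machine:
    id: str
    # Optional as a whole: a row without nvlink_info still decodes.
    nvlink_info: Optional[MachineNvLinkInfo] = None


def decode_machine(raw: str) -> Machine:
    """Decode one row; raise ValueError if the mandatory nested field is absent."""
    row = json.loads(raw)
    info = row.get("nvlink_info")
    if info is None:
        return Machine(id=row["id"])
    if "chassis_serial" not in info:
        raise ValueError("missing field `chassis_serial`")
    return Machine(id=row["id"], nvlink_info=MachineNvLinkInfo(info["chassis_serial"]))


# Deployment that never used NVLink partitioning: the row decodes.
no_nvlink = decode_machine('{"id": "m1", "nvlink_info": null}')

# Row written before the upgrade: nvlink_info exists but lacks chassis_serial.
# The "domain" key is a made-up placeholder for whatever older versions stored.
legacy_row = '{"id": "m2", "nvlink_info": {"domain": "d1"}}'
try:
    decode_machine(legacy_row)
    legacy_error = None
except ValueError as exc:
    legacy_error = str(exc)
```

Under these assumptions, only rows that already carry an `nvlink_info` object are affected, which would match the observation that deployments without NVLink partitioning keep working.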

To remediate this, the missing value has to be copied manually from another field in the database into the JSON field. The following SQL snippet does this:

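BEGIN;

-- Backfill nvlink_info.chassis_serial from the GPU platform info stored in
-- machine_topologies, for machines where it is missing or blank.
UPDATE machines m
SET nvlink_info = jsonb_set(
    m.nvlink_info,
    '{chassis_serial}',
    to_jsonb(src.chassis_serial::text),
    true  -- create the key if it does not exist yet
)
FROM (
    SELECT
        m.id AS machine_id,
        (
            -- First GPU entry with a non-blank chassis_serial. There is no
            -- ORDER BY, so the choice is arbitrary if several values exist.
            SELECT trim(gpu->'platform_info'->>'chassis_serial')
            FROM machine_topologies mt
            CROSS JOIN LATERAL jsonb_array_elements(
                mt.topology->'discovery_data'->'Info'->'gpus'
            ) AS gpu
            WHERE mt.machine_id = m.id
              AND gpu->'platform_info' IS NOT NULL
              AND nullif(trim(gpu->'platform_info'->>'chassis_serial'), '') IS NOT NULL
            LIMIT 1
        ) AS chassis_serial
    FROM machines m
    -- Only machines that have nvlink_info but no usable chassis_serial.
    WHERE m.nvlink_info IS NOT NULL
      AND coalesce(nullif(trim(m.nvlink_info->>'chassis_serial'), ''), '') = ''
) src
WHERE m.id = src.machine_id
  AND src.chassis_serial IS NOT NULL;  -- skip machines with no source value

COMMIT;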
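
For readers who want to reason about what the SQL above does before running it, the same selection logic can be sketched on in-memory data. This is a hedged Python illustration, not a replacement for the SQL. The function names `find_chassis_serial` and `backfill_nvlink_info` are hypothetical, and the JSON layout (`discovery_data` → `Info` → `gpus` → `platform_info`) is taken from the snippet itself.

```python
from typing import Any, Dict, List, Optional


def find_chassis_serial(topology: Dict[str, Any]) -> Optional[str]:
    """Return the first non-blank GPU chassis_serial in one topology, or None."""
    discovery = topology.get("discovery_data") or {}
    info = discovery.get("Info") or {}
    gpus = info.get("gpus") or []
    for gpu in gpus:
        platform_info = gpu.get("platform_info") or {}
        # strip(" ") mirrors SQL trim(), which removes spaces only.
        serial = (platform_info.get("chassis_serial") or "").strip(" ")
        if serial:
            return serial
    return None


def backfill_nvlink_info(
    nvlink_info: Optional[Dict[str, Any]],
    topologies: List[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Fill in chassis_serial when it is missing or blank; otherwise change nothing."""
    if nvlink_info is None:
        return None  # machine never used NVLink; the SQL skips these rows
    current = (nvlink_info.get("chassis_serial") or "").strip(" ")
    if current:
        return nvlink_info  # already has a usable value
    for topology in topologies:
        serial = find_chassis_serial(topology)
        if serial is not None:
            return {**nvlink_info, "chassis_serial": serial}
    return nvlink_info  # no source value found; row is left unchanged


example_topology = {
    "discovery_data": {
        "Info": {
            "gpus": [
                {"platform_info": {"chassis_serial": "  "}},
                {"platform_info": {"chassis_serial": " ABC123 "}},
            ]
        }
    }
}
fixed = backfill_nvlink_info({"domain": "d1"}, [example_topology])
```

As in the SQL, rows with no recoverable serial are left untouched, so they would presumably still fail to decode and need separate handling.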

Minimum reproducible example

1. Install any version before v0.10.x
2. Use the software
3. Set up an NVLink partition
4. Install a version v0.10.x or newer

Relevant log output

Other/Misc.

No response

Code of Conduct

  • I agree to follow NVIDIA Infra Controller's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Metadata

Labels: bug, interest/dsx
Status: Backlog