Downgrading firmware on a DGX A100 involves several steps, including preparing the system, enabling the downgrade option, and performing the downgrade. Here's a detailed guide:
**1. Prepare the System**
- Access the BMC: Ensure you have access to the Baseboard Management Controller (BMC) of the DGX A100. This is crucial for managing firmware updates and downgrades.
- Backup Data: Before making any changes, back up all critical data. Firmware downgrades can cause system instability or data loss.
- Download Firmware: Obtain the desired older version of the firmware from the NVIDIA Enterprise Support Portal. Make sure it is compatible with your DGX A100 system.
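Before going further, it can help to confirm that the downloaded package is intact. The sketch below assumes you have a reference SHA-256 checksum for the file from the download source; the function name `verify_firmware_checksum` and the file path passed to it are illustrative, not part of any NVIDIA tool.

```bash
#!/usr/bin/env bash
# Sketch: compare a downloaded firmware package against a known SHA-256 checksum.
# The file path and expected checksum are supplied by the caller (assumptions).

verify_firmware_checksum() {
    local file="$1" expected="$2" actual
    if [ ! -f "$file" ]; then
        echo "missing file: $file" >&2
        return 2
    fi
    # sha256sum prints "<hash>  <filename>"; keep only the hash field.
    actual="$(sha256sum "$file" | awk '{print $1}')"
    if [ "$actual" = "$expected" ]; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH" >&2
        return 1
    fi
}
```

Call it as `verify_firmware_checksum <package file> <expected sha256>` and stop if it returns non-zero.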
**2. Enable the Downgrade Option**
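- Set the `ForceUpdate` Flag: Enable the `ForceUpdate` flag on the BMC so that it accepts older firmware. This procedure follows NVIDIA's DGX H100 firmware downgrade guide [4], so confirm that the `nvfwupd` version on your system supports it. The flag is set with the `nvfwupd` command:
```bash
nvfwupd --target ip=<BMC_IP> user=admin password=admin force_update enable
```
Replace `<BMC_IP>` with the actual IP address of the BMC, and `admin`/`admin` with your own BMC credentials.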
- Verify the Flag Status: Confirm that the `ForceUpdate` flag is set to `True`:
```bash
nvfwupd --target ip=<BMC_IP> user=admin password=admin force_update status
```
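The enable and status calls differ only in their last argument, so a small wrapper can reduce typing mistakes and keep credentials out of shell history. This is a minimal sketch, not an NVIDIA-provided script: the function name `bmc_force_update` and the environment variables `BMC_IP`, `BMC_USER`, `BMC_PW`, and `DRY_RUN` are assumptions made for this example. Only the `nvfwupd` arguments shown above are used.

```bash
#!/usr/bin/env bash
# Sketch: wrap the force_update calls so credentials come from the environment.
# BMC_IP, BMC_USER, BMC_PW and DRY_RUN are names chosen for this example.

bmc_force_update() {
    local action="$1" var
    case "$action" in
        enable|disable|status) ;;
        *) echo "usage: bmc_force_update enable|disable|status" >&2; return 2 ;;
    esac
    # Refuse to run with missing connection details.
    for var in BMC_IP BMC_USER BMC_PW; do
        if [ -z "${!var:-}" ]; then
            echo "$var is not set" >&2
            return 2
        fi
    done
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command with the password masked instead of running it.
        echo "nvfwupd --target ip=$BMC_IP user=$BMC_USER password=*** force_update $action"
    else
        nvfwupd --target "ip=$BMC_IP" "user=$BMC_USER" "password=$BMC_PW" force_update "$action"
    fi
}
```

With `DRY_RUN=1` set, `bmc_force_update enable` only prints the command it would run, which is a cheap way to review the target IP before touching the BMC.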
**3. Perform the Firmware Downgrade**
- Update Firmware: Use the firmware update utility to downgrade the firmware. You can use NVSM, Docker, or the `.run` file, depending on your preference. For example, using the `.run` file:
```bash
sudo ./nvfw-dgxa100_<version>.run update_fw <component>
```
Replace `<version>` with the version you downloaded and `<component>` with the component you want to downgrade.
- NVSM Example: If using NVSM, you might need to set flags like `update_fw`:
```bash
nvsm(/system/localhost/firmware/install)-> set Flags=update_fw
```
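A quick guard before running the update is to check that the target version really is older than the installed one. The sketch below uses GNU `sort -V` for version ordering. The function name `is_older_version` and the version strings in the comment are hypothetical, and real firmware version formats vary by component, so treat the comparison as a convenience check rather than an authoritative one.

```bash
#!/usr/bin/env bash
# Sketch: succeed only when the target version sorts strictly before the
# installed version under GNU "sort -V" ordering.

is_older_version() {
    local target="$1" installed="$2" lowest
    if [ "$target" = "$installed" ]; then
        return 1   # same version: not a downgrade
    fi
    # The first line after version-sorting is the lower of the two.
    lowest="$(printf '%s\n%s\n' "$target" "$installed" | sort -V | head -n 1)"
    [ "$lowest" = "$target" ]
}

# Example with hypothetical version strings:
# is_older_version "1.2.9" "1.2.10" && echo "target is older, proceeding"
```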
**4. Post-Downgrade Steps**
- Disable the `ForceUpdate` Flag: Once the downgrade is complete, disable the `ForceUpdate` flag to prevent unintended downgrades:
```bash
nvfwupd --target ip=<BMC_IP> user=admin password=admin force_update disable
```
- Verify Flag Status: Confirm that the flag is set back to `False`:
```bash
nvfwupd --target ip=<BMC_IP> user=admin password=admin force_update status
```
- Reboot and Test: Reboot the system, then confirm that each downgraded component reports the expected firmware version and that the system is stable.
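After the reboot, the system is unreachable for a while, so any scripted check needs to wait for it to come back. The following is a generic polling sketch: `wait_until` is a hypothetical helper name, and the host name in the commented example is a placeholder.

```bash
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or a timeout expires.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command...>

wait_until() {
    local timeout="$1" interval="$2" deadline
    shift 2
    # SECONDS is a bash builtin counting seconds since the shell started.
    deadline=$(( SECONDS + timeout ))
    while true; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}

# Example (not run here): wait up to 15 minutes for SSH on a placeholder host.
# wait_until 900 10 ssh -o ConnectTimeout=5 -o BatchMode=yes dgx01 true
```

Once the host responds, a health check such as `sudo nvsm show health` is a reasonable first test before returning the system to service.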
**Additional Considerations**
- Power Cycling: If certain components like NVMe drive firmware, FPGA, or CEC1712 were updated during the downgrade process, you may need to perform a DC power cycle using the BMC:
```bash
sudo ipmitool -I lanplus -H ${BMC_IP} -U ${BMC_USER} -P ${BMC_PW} chassis power cycle
```
- Datacenter Operations: If you are managing a large number of DGX A100 systems in a datacenter, consider power cycling systems in batches to avoid triggering power alarms or tripping breakers [1][4].
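One way to batch the power cycles is sketched below. It reuses the `ipmitool` command shown above and pauses between groups of BMC addresses. The function name `power_cycle_in_batches`, the `DRY_RUN` switch (on by default in this sketch, so nothing is power cycled unless you set `DRY_RUN=0`), and the batch size and pause values are all assumptions to adapt to your site's power budget.

```bash
#!/usr/bin/env bash
# Sketch: power cycle a list of BMCs in fixed-size batches with a pause between
# batches. Usage: power_cycle_in_batches <batch_size> <pause_seconds> <bmc_ip...>
# Expects BMC_USER and BMC_PW in the environment; DRY_RUN defaults to 1 (print only).

power_cycle_in_batches() {
    local batch_size="$1" pause="$2" count=0 ip
    shift 2
    if [ "$batch_size" -lt 1 ]; then
        echo "batch size must be at least 1" >&2
        return 2
    fi
    for ip in "$@"; do
        # Pause whenever a full batch has been issued and more hosts remain.
        if [ "$count" -gt 0 ] && [ $(( count % batch_size )) -eq 0 ]; then
            echo "# pausing ${pause}s before next batch"
            if [ "${DRY_RUN:-1}" != "1" ]; then
                sleep "$pause"
            fi
        fi
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Print the command with the password masked instead of running it.
            echo "ipmitool -I lanplus -H $ip -U ${BMC_USER:-} -P *** chassis power cycle"
        else
            ipmitool -I lanplus -H "$ip" -U "$BMC_USER" -P "$BMC_PW" chassis power cycle
        fi
        count=$(( count + 1 ))
    done
}
```

Running it first with the default dry-run setting shows the exact order and grouping before any system loses power.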
Citations:
[1] https://github.com/NVIDIA/deepops/blob/master/docs/deepops/dgx-diagnostic-firmware.md
[2] https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/using-utility.html
[3] https://www.manualslib.com/manual/1925509/Nvidia-Dgx-A100.html
[4] https://docs.nvidia.com/dgx/dgxh100-fw-update-guide/firmware-downgrade.html
[5] https://forums.developer.nvidia.com/t/looking-for-nvidia-dgx-a100-system-firmware-update-utility/241833
[6] https://www.netapp.com/media/19432-nva-1151-design.pdf
[7] https://kb.brightcomputing.com/knowledge-base/how-to-upgrade-dgx-a100-firmware-from-headnode/
[8] https://docs.nvidia.com/dgx/dgxa100-user-guide/updating-restoring-sw.html
[9] https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/dgxa100-fw-update-iso.html