OneFS Pre-upgrade Healthchecks

Another piece of useful functionality that debuted in OneFS 9.9 is the enhanced integration of pre-upgrade healthchecks (PUHC) with the PowerScale non-disruptive upgrade (NDU) process.

Specifically, this feature complements the OneFS NDU framework by adding the ability to run pre-upgrade healthchecks as part of the NDU state machine, while providing a comprehensive view and control of the entire pre-check process. This means that OneFS 9.9 and later can now easily and efficiently include upgrade pre-checks by leveraging the existing healthcheck patch process.

These pre-upgrade healthchecks (PUHC) can be run either as an independent assessment (‘isi upgrade assess’) or as an integral part of a OneFS upgrade. In both scenarios, the same set of pre-upgrade checks is run.

Prior to OneFS 9.9, there was no WebUI support for a pre-upgrade healthcheck assessment. This meant that an independent assessment had to be run from the CLI:

# isi upgrade assess

Additionally, there was no ‘view’ option for the ‘isi upgrade assess’ command. So after starting a pre-upgrade assessment, the only way to see which checks were failing was to parse the upgrade logs, for example with the ‘isi_upgrade_logs’ CLI utility:

# isi_upgrade_logs -h

Usage: isi_upgrade_logs [-a|--assessment][--lnn][--process {process name}][--level {start level,end level][--time {start time,end time][--guid {guid} | --devid {devid}]

 + No parameter this utility will pull error logs for the current upgrade process

 + -a or --assessment - will interrogate the last upgrade assessment run and display the results

 Additional options that can be used in combination with 'isi_upgrade_logs' command:

  --guid     - dump the logs for the node with the supplied guid

  --devid    - dump the logs for the node/s with the supplied devid/s

  --lnn      - dump the logs for the node/s with the supplied lnn/s

  --process  - dump the logs for the node with the supplied process name

  --level    - dump the logs for the supplied level range

  --time     - dump the logs for the supplied time range

  --metadata - dump the logs matching the supplied regex

  --get-fw-report - get firmware report

                    =nfp-devices : Displays report of devices present in NFW package

                    =full        : Displays report of all devices on the node

                    Default value for No option provided is "nfp-devices".

When run with the ‘-a’ flag, ‘isi_upgrade_logs’ queries the archived logs from the latest assessment run:

# isi_upgrade_logs -a

Or by node ID or LNN:

# isi_upgrade_logs --lnn

# isi_upgrade_logs --devid

So, when running healthchecks as part of an upgrade in OneFS 9.8 or earlier, whenever any check failed, typically all that was reported was a generic check ‘hook fail’ alert. For example, a mandatory pre-check failure was reported as follows:

As can be seen, only general pre-upgrade insight was provided, without details such as which specific check(s) were failing.

Similarly from the upgrade logs:

Identifying in upgrade logs that PUHC hook scripts ran:

18 2024-11-05T02:19:21 /usr/sbin/isi_upgrade_agent_d Debug Queueing up hook script: /usr/share/upgrade/event-actions/pre-upgrade-mandatory/isi_puhc_mandatory
18 2024-11-05T02:12:21 /usr/sbin/isi_upgrade_agent_d Debug Queueing up hook script: /usr/share/upgrade/event-actions/pre-upgrade-optional/isi_puhc_optional
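As a rough illustration, the queued PUHC hook scripts can be pulled out of saved upgrade log text with a small shell filter. This is only a sketch: it assumes the entries contain the ‘Queueing up hook script: <path>’ text shown above, and the function name ‘puhc_hooks’ is hypothetical, not a OneFS utility.

```shell
#!/bin/sh
# Hypothetical helper (not part of OneFS): read upgrade log text on stdin
# and print the path of each queued PUHC hook script, one per line.
# Assumes entries contain "Queueing up hook script: <path>" as in the
# sample log entries, and that hook paths contain no spaces.
puhc_hooks() {
    grep -o 'Queueing up hook script: [^ ]*' |
        sed 's/^Queueing up hook script: //' |
        grep 'isi_puhc_'
}
```

For example, with the log text saved to a file of your choosing (‘upgrade.log’ here is a placeholder name), ‘puhc_hooks < upgrade.log’ lists the mandatory and optional hook scripts that were queued. Note that this confirms only that the hooks ran, not which individual checks failed.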

Additionally, when starting an upgrade in OneFS 9.8 or earlier, there was no opportunity to either skip any superfluous optional checks or quiesce any irrelevant or unrelated failing checks.

By way of contrast, OneFS 9.9 now includes the ability to run a pre-upgrade assessment (Precheck) directly from the WebUI via Cluster management > Upgrade > Overview > Start Precheck.

Similarly, a ‘view’ option is also added to the ‘isi upgrade assess’ CLI command syntax in OneFS 9.9. For example:

# isi upgrade assess view

PreCheck Summary:

             Status: Completed with errors - not ready for upgrade
Percentage Complete: 100%
       Completed on: 2024-11-4T21:44:54.938Z

Check Name       Type      LNN(s)  Message
----------------------------------------------------------------------------------------------------------------------------------------------------------------
ifsvar_acl_perms Mandatory -       An underprivileged user (not in wheel group) has access to the ifsvar directory. Run 'chmod -b 770 /ifs/.ifsvar' to reset the permissions back to the default permissions to resolve the security risk. Then, run 'chmod +a# 0 user ese allow traverse /ifs/.ifsvar' to add the system-level SupportAssist User back to the /ifs/.ifsvar ACL.
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Total: 1

Or from the WebUI:

This means that the cluster admin now gets an explicit, first-hand view of which check(s) are failing, plus the appropriate mitigation steps. As such, the time to resolution can often be drastically improved by avoiding the need to manually comb through log files in order to troubleshoot cluster pre-upgrade issues.
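For scripted use, the failing checks can also be extracted from the ‘isi upgrade assess view’ output. The following is a minimal sketch rather than a supported tool: it assumes the output keeps the layout of the CLI example above (a header, a dashed rule, one line per check, and a closing dashed rule) and that check names contain no spaces. The function name ‘failed_checks’ is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of OneFS): read the output of
# 'isi upgrade assess view' on stdin and print "<check name> <type>"
# for every row of the results table.
# Assumes the table body sits between two dashed rules and that each
# check occupies a single line, as in the sample output.
failed_checks() {
    awk '
        /^-+$/ { in_table = !in_table; next }   # dashed rules open and close the table body
        in_table && NF >= 2 { print $1, $2 }    # first two columns: check name and type
    '
}
```

Run as ‘isi upgrade assess view | failed_checks’, this would reduce the sample output shown earlier to a single line, ‘ifsvar_acl_perms Mandatory’, which is easier to feed into monitoring or ticketing scripts than the full table.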

OneFS delineates between mandatory (blocking) and optional (non-blocking) pre-checks:

Evaluation Type   Description
--------------------------------------------------------------------------------
Mandatory PUHC    These checks block an upgrade on failure. The options are either to fix the underlying issue causing the check to fail, or to roll back the upgrade.
Optional PUHC     These can be treated as warnings. On failure, either the underlying condition can be resolved, or the check can be skipped, allowing the upgrade to continue.
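The blocking versus non-blocking distinction can be expressed as a simple decision rule. The sketch below is illustrative only and is not how OneFS implements it: the function name ‘puhc_verdict’ and the result labels BLOCKED, WARNING, and READY are made up for this example, and the input is assumed to be one ‘<check name> <type>’ pair per failed check.

```shell
#!/bin/sh
# Hypothetical helper (not part of OneFS): read "<check name> <type>" pairs
# for failed checks on stdin and print the overall outcome.
# The labels BLOCKED, WARNING and READY are illustrative, not OneFS output.
puhc_verdict() {
    awk '
        $2 == "Mandatory" { mandatory++ }
        $2 == "Optional"  { optional++ }
        END {
            if (mandatory)     print "BLOCKED"   # fix the issue or roll back
            else if (optional) print "WARNING"   # resolve or skip, then continue
            else               print "READY"     # no failed checks reported
        }
    '
}
```

For instance, ‘echo "ifsvar_acl_perms Mandatory" | puhc_verdict’ prints BLOCKED, since a single failed mandatory check is enough to stop the upgrade regardless of how many optional checks pass or fail.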

Also provided is the ability to pick and choose which specific optional checks are run prior to an upgrade. This can also alleviate redundant effort and save considerable overhead.

Architecturally, pre-upgrade health checks operate as follows:

The ‘optional’ and ‘mandatory’ hooks of the Upgrade framework queue up a pre-check evaluation request to the HealthCheck framework. The results are then stored in an assessment database, which allows a comprehensive view of the pre-checks.

The array of upgrade pre-checks is extensive, and the checks are tailored to the target OneFS version. The corresponding checklist can be listed from the CLI:

# isi healthcheck checklists list | grep -i pre_upgrade

pre_upgrade         Checklist to determine pre upgrade cluster health, many items in this list use the target_version parameter

A list of the individual checks can be viewed from the WebUI under Cluster management > Healthcheck > Healthchecks > pre_upgrade:

In the next article in this series, we’ll take a closer look at the management and monitoring of OneFS Pre-upgrade Healthchecks.
