OneFS Fast Reboots

As part of engineering’s ongoing PowerScale ‘always-on’ initiative, OneFS offers a fast reboot service that focuses on decreasing the duration, and lessening the impact, of planned node reboots on clients. It does this by automatically reducing the size of the lock cache on all nodes before a group change event.

By shortening group change windows, this faster reboot service is a significant benefit to cluster upgrades and planned shutdowns, since it helps shrink the window of unavailability for clients connected to a rebooting node.

The fast reboot service is automatically enabled on installation of, or upgrade to, OneFS 9.1, and it requires no further configuration. Be aware, however, that for upgrades it only takes effect when moving from OneFS 9.1 to a later release.

Under the hood, this feature works by proactively de-staging the lock management work and removing it from the client latency path. The time spent on group change activity (handling the locks, negotiating which coordinator has which lock, and so on) is moved to an earlier window in the process. For a planned cluster reboot or shutdown, instead of shuffling locks during the group change window itself, the lazy lock queue is proactively drained for a period of up to five minutes beforehand. This directly benefits OneFS upgrades by shrinking the actual group change: for a typical size cluster, it is reduced to approximately 1 second, down from around 17 seconds in prior releases. Engineering has tested this feature with up to 5 million locks per domain.

There are several useful new and updated sysctls that indicate the status of the reboot service.

First, efs.gmp.group has been enhanced to include both reboot and draining fields, which confirm which node(s) the reboot service is active on and whether locks are being drained:

# sysctl efs.gmp.group

efs.gmp.group: <35baa7> (3) :{ 1-3:0-5, nfs: 3, isi_cbind_d: 1-3, lsass: 1-3, drain: 1, reboot: 1 }
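To pull out just these two fields, the output can be filtered with grep. This is a minimal sketch, assuming the field layout shown above (node lists containing commas would be truncated by this pattern):

# sysctl efs.gmp.group | grep -Eo '(drain|reboot): [0-9-]+'

drain: 1
reboot: 1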

To complement this, the lki_draining sysctl confirms whether draining is still occurring:

# sysctl efs.lk.lki_draining

efs.lk.lki_draining: 1
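If needed, a script can simply wait for draining to finish before proceeding, for example with a small shell loop (a minimal sketch; the 10 second poll interval is arbitrary):

# while [ "$(sysctl -n efs.lk.lki_draining)" -eq 1 ]; do sleep 10; done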

OneFS has around 20 different lock domains, each with its own queue. These queues each contain lazy locks, which are locks that are not currently in use, but are just being held by the node in case it needs to use them again.
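Since OneFS is built on FreeBSD, the wider set of lock-related sysctls can be explored by querying the efs.lk node directly; treat this as an exploratory sketch, as the exact OIDs returned will vary by release:

# sysctl efs.lk | grep -i lazy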

The stats from the various lock domain queues are aggregated and displayed as a current total by the lazy_queue_size sysctl:

# sysctl efs.lk.lazy_queue_size

efs.lk.lazy_queue_size: 460658
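While a drain is underway, this total should fall steadily toward zero. One way to watch it is a simple polling loop (a minimal sketch; the 5 second interval is arbitrary):

# while true; do sysctl -n efs.lk.lazy_queue_size; sleep 5; done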

And finally, the lazy_queue_above_reboot sysctl indicates whether any of the lazy queues are above their reboot threshold:

# sysctl efs.lk.lazy_queue_above_reboot

efs.lk.lazy_queue_above_reboot: 0
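Taken together, these sysctls lend themselves to a quick pre-reboot status check. The following small script is purely illustrative (it just combines the sysctls described above; nothing like it ships with OneFS):

#!/bin/sh
# Illustrative sketch: summarize fast reboot drain status on this node.
echo "draining:              $(sysctl -n efs.lk.lki_draining)"
echo "lazy queue size:       $(sysctl -n efs.lk.lazy_queue_size)"
echo "queue above threshold: $(sysctl -n efs.lk.lazy_queue_above_reboot)"
sysctl efs.gmp.group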

In addition to the sysctls, and to aid with troubleshooting and debugging, the reboot service writes status information, such as details of the locks being drained, to /var/log/isi_shutdown.log.

When a node activates the reboot service, it logs that it is waiting for the lazy queues to be drained, and these messages are repeated every 60 seconds until draining is complete.

Once draining is done, a log message is written confirming that the lazy queues have been drained and that the node is about to reboot or shut down.
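To follow this activity in real time during a planned reboot or shutdown, the log can simply be tailed:

# tail -f /var/log/isi_shutdown.log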

So there you have it: the new faster reboot service and low-impact group changes, marking the next milestone in the OneFS ‘always on’ journey.
