OneFS MultiScan, AutoBalance, & Collect

As we’ve seen throughout the recent file system maintenance job articles, OneFS utilizes file system scans to perform such tasks as detecting and repairing drive errors, reclaiming freed blocks, etc. Since these scans typically involve complex sequences of operations, they are implemented via syscalls and coordinated by the Job Engine. These jobs are generally intended to run as minimally disruptive background tasks in the cluster, using spare or reserved capacity.

FS Maintenance Job Description
AutoBalance Restores node and drive free space balance
Collect Reclaims leaked blocks
FlexProtect Replaces the traditional RAID rebuild process
MediaScan Scrub disks for media-level errors
MultiScan Run AutoBalance and Collect jobs concurrently

In this final article of the series, we’ll turn our attention to MultiScan. This job is a combination of both the of the AutoBalance job, which rebalances data across drives, and the Collect job, which recovers leaked blocks from the filesystem. In addition to reclaiming unused capacity as a result of drive replacements, snapshot and data deletes, etc, MultiScan also helps expose and remediate any filesystem inconsistencies.

The OneFS job engine defines two exclusion sets that govern which jobs can execute concurrently on a cluster. MultiScan straddles both of the job engine’s exclusion sets, with AutoBalance (and AutoBalanceLin) in the restripe set, and Collect in the mark set.

The restriping exclusion set is per-phase instead of per job, which helps to more efficiently parallelize restripe jobs when they don’t need to lock down resources. However, with the marking exclusion set, OneFS can only accommodate a single marking job at any point in time.

MultiScan is an unscheduled job that runs by default at ‘LOW’ impact and executes AutoBalance and Collect simultaneously. It is triggered by cluster group change events, which include node boot, shutdown, reboot, drive replacement, etc. While AutoBalance will execute each time the MultiScan job is triggered, Collect typically won’t be run more often that once every 2 weeks. AutoBalance and/or Collect are typically only run manually if MultiScan has been disabled.

When a new node or drive is added to the cluster, its blocks are almost entirely free, whereas the rest of the cluster is usually considerably more full, capacity-wise. AutoBalance restores the balance of free blocks in the cluster. As such, AutoBalance runs if a cluster’s nodes have a greater than 5% imbalance in capacity utilization. In addition, AutoBalance also fixes recovered writes that occurred due to transient unavailability and also addresses fragmentation.

If the cluster’s nodes contain SSDs, AutoBalanceLin (as opposed to the regular AutoBalance job) runs most efficiently by performing a LIN scan using a flash-backed metadata mirror. When a cluster is unbalanced, there is not an obvious subset of files to filter, since the files to be restriped are the ones which are not using the node or drive with less free space. In the case of an added node or drive, no files will be using it. As a result, almost any file scanned is enumerated for restripe.

As mentioned, the Collect job reclaims leaked blocks using a mark and sweep process. In traditional UNIX systems this function is typically performed by the ‘fsck’ utility. With OneFS, however, the other traditional functions of fsck are not required, since the transaction system keeps the file system consistent. Leaks only affect free space.

Collect’s ‘mark and sweep’ gets its name from the in-memory garbage collection algorithm. First, the in-use blocks and any new allocations are marked with the current generation in the Mark phase. When this is complete, the drives are swept of any blocks which don’t have the current generation in the Sweep phase.

In addition to automatic job execution following a group change event, Multiscan can also be initiated on demand. The following CLI syntax will kick of a manual job run:

# isi job start multiscan

Started job [209]

# isi job list

ID   Type      State   Impact  Pri  Phase  Running Time

--------------------------------------------------------

209  MultiScan Running Low     4    1/4    1s

--------------------------------------------------------

Total: 1

The Multiscan job’s progress can be tracked via a CLI command as follows:

# isi job jobs view 209

               ID: 209

             Type: MultiScan

            State: Running

           Impact: Low

           Policy: LOW

              Pri: 4

            Phase: 1/4

       Start Time: 2021-01-03T20:15:16

     Running Time: 34s

     Participants: 1, 2, 3

         Progress: Collect: 225 LINs, 0 errors

                   AutoBalance: 225 LINs, 0 errors

                   LIN Estimate based on LIN count of 2793 done on Jan 04 20:02:57 2021

                   LIN Based Estimate:  3m 2s Remaining (8% Complete)

                   Block Based Estimate: 5m 48s Remaining (4% Complete)

                   0 errors total

Waiting on job ID: -

      Description: Collect, AutoBalance

The LIN (logical inode) statistics above include both files and directories.

Be aware that the estimated LIN percentage can occasionally be misleading/anomalous. If concerned, verify that the stated total LIN count is roughly in line with the file count for the cluster’s dataset. Even if the LIN count is in doubt, the estimated block progress metric should always be accurate and meaningful.

If the job is in its early stages and no estimation can be given (yet), isi job will instead report its progress as “Started”. Note that all progress is reported per phase, with MultiScan phase 1 being the one where the lion’s share of the work is done. By comparison, phases 2-4 of the job are comparatively short.

A job’s resource usage can be traced from the CLI as such:

# isi job statistics view

     Job ID: 209

      Phase: 1

   CPU Avg.: 11.46%

Memory Avg.

        Virtual: 301.06M

       Physical: 28.71M

        I/O

            Ops: 3513425

          Bytes: 26.760G

Finally, upon completion, the Multiscan job report, detailing all four stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 209

MultiScan[209] phase 1 (2021-01-03T20:02:57)

--------------------------------------------

Elapsed time          307 seconds (5m7s)

Working time          307 seconds (5m7s)

Errors                0

Rebalance/LINs        2793

Rebalance/Files       2416

Rebalance/Directories 377

Rebalance/Errors      0

Rebalance/Bytes       372607773184 bytes (347.018G)

Collect/LINs          2788

Collect/Files         2411

Collect/Directories   377

Collect/Errors        0

Collect/Bytes         130187742208 bytes (121.247G)




MultiScan[209] phase 2 (2021-01-03T20:02:57)

--------------------------------------------

Elapsed time     0 seconds

Working time     0 seconds

Errors           0

LINs traversed   0

LINs processed   0

SINs traversed   0

SINs processed   0

Files seen       0

Directories seen 0

Total bytes      0 bytes




MultiScan[209] phase 3 (2021-01-03T20:02:58)

--------------------------------------------

Elapsed time          1 seconds

Working time          1 seconds

Errors                0

Rebalance/SINs        0

Rebalance/Files       0

Rebalance/Directories 0

Rebalance/Errors      0

Rebalance/Bytes       0 bytes

Collect/SINs          0

Collect/Files         0

Collect/Directories   0

Collect/Errors        0

Collect/Bytes         0 bytes

Unbalanced diskpools  Pool_Name = h600_18tb_3.2tb-ssd_256gb:2, free_blocks = 8693136159, total_blocks = 8715355092

Pool_Name = h600_18tb_3.2tb-ssd_256gb:3, free_blocks = 7259260440, total_blocks = 7262795910







MultiScan[209] phase 4 (2021-01-03T20:03:17)

--------------------------------------------

Elapsed time 19 seconds

Working time 19 seconds

Errors       0

Drives swept 33

LINs freed   0

Inodes freed 128359

Bytes freed  80022503424 bytes (74.527G)

Keys freed   0

Inodes lost  0

Leave a Reply

Your email address will not be published. Required fields are marked *