OneFS FlexProtect

The FlexProtect job is responsible for maintaining the appropriate protection level of data across the cluster. For example, it ensures that a file configured for +2n protection is actually protected at that level. FlexProtect is arguably the most critical of the OneFS maintenance jobs because it governs the cluster’s Mean Time To Repair (MTTR), which has an exponential impact on the Mean Time To Data Loss (MTTDL). Any failure or delay in FlexProtect has a direct impact on the reliability of the OneFS file system.
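The exponential relationship between repair time and data-loss risk can be illustrated with a simplified reliability model. The sketch below is illustrative only, not an official OneFS formula; the drive counts and MTTF figures are hypothetical, and the approximation is the classic one used for double-parity schemes that tolerate two simultaneous failures.

```python
# Illustrative only: a simplified reliability approximation (not an official
# OneFS formula) showing why MTTR has an outsized effect on MTTDL.
# For a scheme tolerating two simultaneous failures, a classic approximation
# is MTTDL ~ MTTF^3 / (N*(N-1)*(N-2) * MTTR^2), so MTTDL scales with
# 1/MTTR^2: halving the repair time roughly quadruples the MTTDL.

def approx_mttdl_hours(mttf_hours: float, mttr_hours: float, n_drives: int) -> float:
    """Approximate MTTDL for a double-parity group of n_drives."""
    return mttf_hours ** 3 / (n_drives * (n_drives - 1) * (n_drives - 2) * mttr_hours ** 2)

# Hypothetical figures: 1M-hour drive MTTF, 36 drives, 12h vs. 24h repair.
fast = approx_mttdl_hours(mttf_hours=1_000_000, mttr_hours=12, n_drives=36)
slow = approx_mttdl_hours(mttf_hours=1_000_000, mttr_hours=24, n_drives=36)
print(f"Halving repair time improves MTTDL by {fast / slow:.0f}x")  # 4x
```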

In addition to FlexProtect, there is also a FlexProtectLin job. FlexProtectLin is run by default when there is a copy of file system metadata available on solid state drive (SSD) storage. FlexProtectLin typically offers significant runtime improvements over its conventional disk-based counterpart.

The primary purpose of FlexProtect is to repair nodes and drives which need to be removed from the cluster. In the case of a cluster group change, for example the addition or removal of a node or drive, OneFS automatically informs the job engine, which responds by starting a FlexProtect job. Any drives and/or nodes to be removed are marked with OneFS’ ‘restripe_from’ capability. The job engine coordinator notices that the group change includes a newly smartfailed device and initiates a FlexProtect job in response.

FlexProtect falls within the job engine’s restriping exclusion set and, similar to AutoBalance, comes in two flavors: FlexProtect and FlexProtectLin.

Run automatically after a drive or node removal or failure, FlexProtect locates any unprotected files on the cluster and repairs them as rapidly as possible. The FlexProtect job runs by default with an impact level of ‘medium’ and a priority level of ‘1’.

The regular version of FlexProtect includes the following six distinct phases:

Job Phase	Description
Drive Scan	The job engine scans the disks for inodes needing repair. If an inode needs repair, the job engine sets the LIN’s ‘needs repair’ flag for use in the next phase.
LIN Verify	This phase scans the OneFS LIN tree to address the limitations of the drive scan.
LIN Re-verify	The prior repair phases can miss protection group and metatree transfers. FlexProtect may have already repaired the destination of a transfer, but not the source. If a LIN is being restriped when a metatree transfer occurs, it is added to a persistent queue, and this phase processes that queue.
Repair	LINs with the ‘needs repair’ flag set are passed to the restriper for repair. This phase needs to progress quickly, so the job engine workers execute in parallel across the cluster.
Check	This phase ensures that all LINs were repaired by the previous phases as expected.
Device Removal	The successfully repaired nodes and drives that were marked ‘restripe from’ at the beginning of phase 1 are removed from the cluster in this phase. Any additional nodes and drives which were subsequently failed remain in the cluster, with the expectation that a new FlexProtect job will handle them shortly.
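The flow through these phases can be pictured as a mark-repair-check pipeline. The following is a hypothetical Python sketch of that ordering, not OneFS code; every name in it is invented for illustration.

```python
# Hypothetical sketch of the FlexProtect phase flow described above
# (not OneFS code): the scan/verify phases mark LINs as needing repair,
# the repair phase restripes them off the failed drives, and the check
# phase confirms nothing flagged was left behind.

def flexprotect(lins, failed_drives):
    """lins: {lin_number: set of drives in use}. Returns the repaired LINs."""
    needs_repair = set()

    # Phases 1-3: drive scan, LIN verify, and LIN re-verify all funnel
    # into the same 'needs repair' set.
    for lin, drives_used in lins.items():
        if drives_used & failed_drives:
            needs_repair.add(lin)

    # Phase 4: repair -- restripe each flagged LIN off the failed drives.
    for lin in needs_repair:
        lins[lin] = lins[lin] - failed_drives

    # Phase 5: check -- no LIN may still reference a failed drive.
    assert all(not (d & failed_drives) for d in lins.values())

    # Phase 6: device removal -- the failed drives can now leave the cluster.
    return needs_repair

repaired = flexprotect({1: {0, 1}, 2: {2, 3}}, failed_drives={1})
print(repaired)  # {1}
```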

Be aware that prior to OneFS 8.2, FlexProtect is the only job allowed to run if a cluster is in degraded mode, such as when a drive has failed. Other jobs are automatically paused and will not resume until FlexProtect has completed and the cluster is healthy again. In OneFS 8.2 and later, FlexProtect does not pause when there is only one temporarily unavailable device in a disk pool, when a device is smartfailed, or for dead devices.

The FlexProtect job executes in userspace and generally repairs any components marked with the ‘restripe from’ bit as rapidly as possible. Within OneFS, a LIN tree reference is placed inside the inode, a logical block. A B-tree describes the mapping between a logical offset and the physical data blocks.

To avoid the overhead of traversing the full path from LIN tree reference -> LIN tree -> B-tree -> logical offset -> data block, FlexProtect leverages the OneFS construct known as the ‘width device list’ (WDL). The WDL enables FlexProtect to perform fast drive scanning of inodes, because the inode contents alone are sufficient to determine whether a restripe is needed. The WDL keeps a list of the drives in use by a particular file; it is stored as an attribute within the inode and is thus protected by mirroring. There are two WDL attributes in OneFS, one for data and one for metadata. The WDL is primarily used by FlexProtect to determine whether an inode references a degraded node or drive. New or replaced drives are automatically added to the WDL as part of new allocations.
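Conceptually, the WDL check reduces to a set intersection: does the union of the file’s data and metadata drive lists overlap the set of degraded drives? The sketch below is hypothetical Python for illustration, not OneFS code.

```python
# Hypothetical illustration of a WDL-style check (not OneFS code).
# Because each inode carries its own width device list, deciding whether a
# file needs restriping is a simple set intersection against the failed
# drives -- no LIN-tree or B-tree traversal is required.

def needs_restripe(data_wdl: set, metadata_wdl: set, failed: set) -> bool:
    """True if the file's data or metadata resides on a degraded drive."""
    return bool((data_wdl | metadata_wdl) & failed)

print(needs_restripe({3, 7, 12}, {3, 7}, failed={7}))  # True
print(needs_restripe({3, 12}, {3}, failed={7}))        # False
```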

As mentioned previously, the FlexProtect job has two distinct variants. In the FlexProtectLin version of the job, the Drive Scan and LIN Verify phases are redundant and therefore removed, while the other phases remain identical. FlexProtectLin is preferred when at least one metadata mirror is stored on SSD, providing substantial job performance benefits.

In addition to automatic job execution after a drive or node removal or failure, FlexProtect can also be initiated on demand. The following CLI syntax will kick off a manual job run:

# isi job start flexprotect
Started job [274]

# isi job list
ID   Type        State   Impact  Pri  Phase  Running Time
274  FlexProtect Running Medium  1    1/6    4s
Total: 1

The FlexProtect job’s progress can be tracked via a CLI command as follows:

# isi job jobs view 274
               ID: 274
             Type: FlexProtect
            State: Succeeded
           Impact: Medium
           Policy: MEDIUM
              Pri: 1
            Phase: 6/6
       Start Time: 2020-12-04T17:13:38
     Running Time: 17s
     Participants: 1, 2, 3
         Progress: No work needed
Waiting on job ID: -
      Description: {"nodes": "{}", "drives": "{}"}

Upon completion, the FlexProtect job report, detailing all six stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 274

FlexProtect[274] phase 1 (2020-12-04T17:13:44)
Elapsed time 6 seconds
Working time 6 seconds
Errors       0
Drives       33
LINs         250
Size         363108486755 bytes (338.171G)
ECCs         0

FlexProtect[274] phase 2 (2020-12-04T17:13:55)
Elapsed time 11 seconds
Working time 11 seconds
Errors       0
LINs         33
Zombies      0

FlexProtect[274] phase 3 (2020-12-04T17:13:55)
Elapsed time 0 seconds
Working time 0 seconds
Errors       0
LINs         0
Zombies      0

FlexProtect[274] phase 4 (2020-12-04T17:13:55)
Elapsed time 0 seconds
Working time 0 seconds
Errors       0
LINs         0
Zombies      0

FlexProtect[274] phase 5 (2020-12-04T17:13:55)
Elapsed time 0 seconds
Working time 0 seconds
Errors       0
Drives       0
LINs         0
Size         0 bytes
ECCs         0

FlexProtect[274] phase 6 (2020-12-04T17:13:55)
Elapsed time       0 seconds
Working time       0 seconds
Errors             0
Nodes marked gone  {}
Drives marked gone {}

While a FlexProtect job is running, the following command will detail which LINs the job engine workers are currently accessing:

# sysctl efs.bam.busy_vnodes | grep isi_job_d
vnode 0xfffff802938d18c0 (lin 0) is fd 11 of pid 2850: isi_job_d
vnode 0xfffff80294817460 (lin 1:0002:0008) is fd 12 of pid 2850: isi_job_d
vnode 0xfffff80294af3000 (lin 1:0002:001a) is fd 20 of pid 2850: isi_job_d
vnode 0xfffff8029c7c7af0 (lin 1:0002:001b) is fd 17 of pid 2850: isi_job_d
vnode 0xfffff802b280dd20 (lin 1:0002:000a) is fd 14 of pid 2850: isi_job_d

Using the ‘isi get -L’ command, a LIN address can be translated to show the actual file name and its path. For example:

# isi get -L 1:0002:0008

A valid path for LIN 0x100020008 is /ifs/.ifsvar/run/isi_job_d.lock


OneFS IntegrityScan

Under normal conditions, OneFS relies on checksums, identity fields, and magic numbers to verify file system health and correctness. Within OneFS, system and data integrity can be subdivided into four distinct phases. Here’s what each of these phases entails:

Phase	Description
Detection	The act of scanning the file system and detecting data block instances that are not what OneFS expects to see at that logical point. Internally, OneFS stores a checksum, or IDI (Isi data integrity), for every allocated block under /ifs.
Enumeration	Notifying the cluster administrator of any file system damage uncovered in the detection phase, for example by logging to the /var/log/idi.log file. A system panic may also occur if the damage identified is not one that OneFS can reasonably recover from.
Isolation	The act of cauterizing the file system, ensuring that any damage identified during the detection phase does not spread beyond the file(s) already affected. This usually involves removing all references to the file(s) from the file system.
Repair	Repairing any damage discovered and removing the corruption from OneFS. Typically a DSR (dynamic sector recovery) is all that is required to rebuild a block that fails IDI.
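At its core, the detection phase amounts to recomputing each block’s checksum and comparing it against the stored value. The sketch below is hypothetical Python for illustration only (it is not OneFS code, and CRC32 stands in for whatever checksum IDI actually uses).

```python
# Hypothetical sketch of IDI-style block verification (not OneFS code;
# CRC32 is used here purely as a stand-in checksum). Each allocated block
# carries a stored checksum; detection recomputes the checksum on read and
# flags any mismatch for the enumeration/isolation/repair phases.

import zlib

def verify_blocks(blocks):
    """blocks: list of (block_data, stored_checksum). Returns bad indices."""
    bad = []
    for i, (data, stored) in enumerate(blocks):
        if zlib.crc32(data) != stored:
            bad.append(i)  # in OneFS this would be logged to /var/log/idi.log
    return bad

good = (b"hello", zlib.crc32(b"hello"))
corrupt = (b"hellp", zlib.crc32(b"hello"))  # bit-flipped data, stale checksum
print(verify_blocks([good, corrupt]))  # [1]
```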


Focused on the detection phase, the primary OneFS tool for uncovering system integrity issues is IntegrityScan. This job is run across the cluster to discover instances of damaged files and to provide an estimate of the spread of the damage.

Unlike traditional ‘fsck’ style file system integrity checking tools (including OneFS’ isi_cpr utility), IntegrityScan is explicitly designed to run while the cluster is fully operational – thereby removing the need for any downtime. It does this by systematically reading every block and verifying its associated checksum. In the event that IntegrityScan detects a checksum mismatch, it generates an alert, logs the error to the IDI logs (/var/log/idi.log), and provides a full report upon job completion.

IntegrityScan is typically run manually if the integrity of the file system is ever in doubt. By default, the job runs at an impact level of ‘Medium’ and a priority of ‘1’, and accesses the file system via a LIN scan. Although IntegrityScan itself may take several hours or days to complete, the file system is online and completely available during this time. Additionally, like all OneFS job engine jobs, IntegrityScan can be re-prioritized, paused, or stopped, depending on its impact to cluster operations. Along with Collect and MultiScan, IntegrityScan is part of the job engine’s Marking exclusion set.

OneFS can only accommodate a single marking job at any point in time. However, since the file system is fully journalled, IntegrityScan is only needed in exceptional situations. There are two principal use cases for IntegrityScan:

  • Identifying and repairing corruption on a production cluster. Certain forms of corruption may be suggestive of a bug, in which case IntegrityScan can be used to determine the scope of the corruption and the likelihood of spreading. It can also fix some forms of corruption.
  • Repairing a file system after a lost journal. This use case is much like traditional fsck. This scenario should be treated with care as it is not guaranteed that IntegrityScan fixes everything. This is a use case that will require additional product changes to make feasible.

IntegrityScan can be initiated manually, on demand. The following CLI syntax will kick off a manual job run:

# isi job start integrityscan
Started job [283]

# isi job list
ID   Type          State   Impact  Pri  Phase  Running Time
283  IntegrityScan Running Medium  1    1/2    1s
Total: 1

With LIN scan jobs, even though the metadata is of variable size, the job engine can fairly accurately predict how much effort will be required to scan all LINs. The IntegrityScan job’s progress can be tracked via a CLI command, as follows:

# isi job jobs view 283
               ID: 283
             Type: IntegrityScan
            State: Running
           Impact: Medium
           Policy: MEDIUM
              Pri: 1
            Phase: 1/2
       Start Time: 2020-12-05T22:20:58
     Running Time: 31s
     Participants: 1, 2, 3
         Progress: Processed 947 LINs and approx. 7464 MB: 867 files, 80 directories; 0 errors
                   LIN & SIN Estimate based on LIN & SIN count of 3410 done on Dec 5 22:00:10 2020 (LIN) and Dec 5 22:00:10 2020 (SIN)
                   LIN & SIN Based Estimate:  1m 12s Remaining (27% Complete)
                   Block Based Estimate: 10m 47s Remaining (4% Complete)
Waiting on job ID: -


The LIN (logical inode) statistics above include both files and directories.

Be aware that the estimated LIN percentage can occasionally be misleading or anomalous. If concerned, verify that the stated total LIN count is roughly in line with the file count for the cluster’s dataset. Even if the LIN count is in doubt, the estimated block progress metric should always be accurate and meaningful.

If the job is in its early stages and no estimation can be given (yet), isi job will instead report its progress as ‘Started’. Note that all progress is reported per phase.
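Remaining-time estimates of this kind follow from straightforward extrapolation of the observed processing rate. The sketch below is a hypothetical illustration, not the job engine’s actual estimator.

```python
# Hypothetical extrapolation sketch (not the Job Engine's actual estimator).
# Given the work completed so far and the elapsed time, remaining time is
# estimated by assuming the observed rate holds for the rest of the work.

def estimate_remaining(done: float, total: float, elapsed_s: float) -> float:
    """Seconds remaining, assuming a constant processing rate."""
    rate = done / elapsed_s            # units of work per second
    return (total - done) / rate

# e.g. 947 of 3410 LINs processed in 31 seconds (figures from the output above)
print(round(estimate_remaining(947, 3410, 31)))  # 81 (seconds)
```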

A job’s resource usage can be traced from the CLI as follows:

# isi job statistics view
     Job ID: 283
      Phase: 1
   CPU Avg.: 30.27%
Memory Avg.
        Virtual: 302.27M
       Physical: 24.04M
            Ops: 2223069
          Bytes: 16.959G

Finally, upon completion, the IntegrityScan job report, detailing both job stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 283
IntegrityScan[283] phase 1 (2020-12-05T22:34:56)
Elapsed time     838 seconds (13m58s)
Working time     838 seconds (13m58s)
Errors           0
LINs traversed   3417
LINs processed   3417
SINs traversed   0
SINs processed   0
Files seen       3000
Directories seen 415
Total bytes      178641757184 bytes (166.373G)

IntegrityScan[283] phase 2 (2020-12-05T22:34:56)
Elapsed time     0 seconds
Working time     0 seconds
Errors           0
LINs traversed   0
LINs processed   0
SINs traversed   0
SINs processed   0
Files seen       0
Directories seen 0
Total bytes      0 bytes

In addition to the IntegrityScan job, OneFS also contains an ‘isi_iscan_report’ utility. This tool collates the errors from the IDI log files (/var/log/idi.log) generated on different nodes and produces a report file which can be used as input to the ‘isi_iscan_query’ tool. Additionally, it reports the number of errors seen for each file containing IDI errors. At the end of the run, the report file can be found at /ifs/.ifsvar/idi/tjob.<pid>/log.repo.

The associated ‘isi_iscan_query’ utility can then be used to parse the log.repo report file and filter by node, time range, or block address (baddr). The syntax for the isi_iscan_query tool is:

/usr/sbin/isi_iscan_query filename [FILTER FIELD] [VALUE]

Where [FILTER FIELD] can be one of:

                node <logical node number> e.g. 1, 2, 3, ...
                timerange <start time> <end time> e.g. 2020-12-05T17:38:02Z 2020-12-06T17:38:56Z
                baddr <block address> e.g. 2,1,185114624:8192