Unstructured Data Quick Tips

OneFS Job Engine Exclusion Sets

In order for the OneFS Job Engine framework to support concurrent job execution, it provides the concept of exclusion sets. These are classes of similar jobs that determine which jobs can run simultaneously.

A job is not required to be part of any exclusion set, and jobs may also belong to multiple exclusion sets. As the graphic above shows, there are currently two exclusion sets that jobs can be part of:

Restriping
Marking.

Let’s first take a look at the restriping exclusion set.

OneFS protects data by writing file blocks across multiple drives on different nodes. This process is known as ‘restriping’ in the OneFS lexicon. The Job Engine defines a restripe exclusion set that contains these jobs that involve file system management, protection and on-disk layout.

The restripe exclusion set contains the following jobs:

Job Name	Job Description	Access Method
AutoBalance	Balances free space in the cluster.	Drive + LIN
AutoBalanceLin	Balances free space in the cluster.	LIN
FlexProtect	Rebuilds and re-protects the file system to recover from a failure scenario.	Drive + LIN
FlexProtectLin	Re-protects the file system.	LIN
MediaScan	Scans drives for media-level errors.	Drive + LIN
MultiScan	Runs Collect and AutoBalance jobs concurrently.	LIN
SetProtectPlus	Applies the default file policy. This job is disabled if SmartPools is activated on the cluster.	LIN
ShadowStoreProtect	Protect shadow stores which are referenced by a LIN with higher requested protection.	LIN
SmartPools	Job that runs and moves data between the tiers of nodes within the same cluster. Also executes the CloudPools functionality if licensed and configured.	LIN
Upgrade	Manage OneFS upgrades.	LIN

The restriping exclusion set is per-phase instead of per job. This helps to more efficiently parallelize restripe jobs when they don’t need to lock down resources. Restriping jobs only block each other when the current phase may perform restriping. This is most evident with MultiScan, whose final phase only sweeps rather than restripes. Similarly, MediaScan, which rarely ever restripes, is usually able to run to completion more without contending with other restriping jobs.

For example, below the two restripe jobs, MediaScan and AutoBalanceLin, are both running their respective first job phases. ShadowStoreProtect, also a restriping job, is in a ‘waiting’ state, blocked by AutoBalanceLin.

Running and queued jobs:

ID    Type               State       Impact  Pri  Phase  Running Time

----------------------------------------------------------------------

26850 AutoBalanceLin     Running     Low     4    1/3    20d 18h 19m 

26910 ShadowStoreProtect Waiting     Low     6    1/1    -           

28133 MediaScan          Running     Low     8    1/8    1d 15h 37m  

----------------------------------------------------------------------

MediaScan restripes in phases 3 and 5 of the job, and only if there are disk errors (ECCs) which require data reprotection. If MediaScan reaches its third job phase with ECCs, it will pause until AutoBalanceLin is no longer running. If MediaScan’s priority were in the range 1-3, it would cause AutoBalanceLin to pause instead.

If two jobs happen to reach their restriping phases simultaneously and the jobs have different priorities, the higher priority job (ie. priority value closer to “1”) will continue to run, and the other will pause. If the two jobs have the same priority, the one already in its restriping phase will continue to run, and the one newly entering its restriping phase will pause.

Next, we’ll examine the marking exclusion set.

The OneFS file system actively marks the blocks which are actually in use by the file system. A good example of this class of job is IntegrityScan, which traverses the live file system, marking every block of every LIN in the cluster to proactively detect and resolve any issues with the structure of data in a cluster.

The jobs that comprise the marking exclusion set are:

Job Name	Job Description	Access Method
Collect	Reclaims disk space that could not be freed due to a node or drive being unavailable while they suffer from various failure conditions.	Drive + LIN
IntegrityScan	Performs online verification and correction of any file system inconsistencies.	LIN
MultiScan	Runs Collect and AutoBalance jobs concurrently.	LIN

Jobs may also belong to both exclusion sets. An example of this is MultiScan, since it includes both AutoBalance and Collect.

Multiple jobs from the same exclusion set will not run at the same time. For example, Collect and IntegrityScan cannot be executed simultaneously, as they are both members of the marking jobs exclusion set. Similarly, MediaScan and SetProtectPlus won’t run concurrently, as they are both part of the restripe exclusion set.

The majority of the jobs do not belong to an exclusion set. These are typically the data services feature support jobs, and they can coexist and contend with any of the other jobs.

Exclusion sets do not change the scope of the individual jobs themselves, so any runtime improvements via parallel job execution are the result of job management and impact control. The Job Engine monitors node CPU load and drive I/O activity per worker thread every twenty seconds to ensure that maintenance jobs do not cause cluster performance problems.

If a job affects overall system performance, Job Engine reduces the activity of maintenance jobs and yields resources to clients. Impact policies limit the system resources that a job can consume and when a job can run. You can associate jobs with impact policies, ensuring that certain vital jobs always have access to system resources.

It’s worth noting that the job engine exclusion sets are pre-defined, and cannot be modified or reconfigured.

OneFS Job Engine Monitoring & Reporting

The OneFS Job Engine provides detailed monitoring and statistics gathering, with insight into jobs and Job Engine. A variety of Job Engine specific metrics are available via the OneFS CLI, including per job disk usage, etc, with CLI command ‘isi job statistics list’.

For example:

# isi job statistics list

Job ID  Phase  CPU Avg.  Virt. Mem. Avg.  Phys. Mem. Avg.  I/O Ops  I/O Bytes

----------------------------------------------------------------------------

87      1      0.52%     77.44M           37.00M           1115337  8.51G

88      1      2.35%     75.84M           36.80M           16192    125.66M

89      1      0.48%     6.01M            2.67M            0        0.00

----------------------------------------------------------------------------

In verbose mode, worker statistics and job level resource usage can also be viewed:

# isi job statistics list -v | more

Job ID: 87

 Phase: 1
 Nodes
             Node: 1
              PID: 38164
              CPU: 0.00% (0.00% min, 4.30% max, 0.40% avg)

       Virtual Mem.: 76.92M (74.38M min, 76.92M max, 76.53M avg)
      Physical Mem.: 36.33M (35.93M min, 36.33M max, 36.13M avg)

           I/O Read: 16634 ops, 129.95M
          I/O Write: 38899 ops, 303.90M
          Workers: 1 (9.00 STW avg.)

             Node: 2
              PID: 66095
              CPU: 0.68% (0.34% min, 4.98% max, 1.22% avg)

       Virtual Mem.: 77.05M (74.38M min, 77.05M max, 76.92M avg)
      Physical Mem.: 42.07M (41.39M min, 42.07M max, 41.97M avg)

           I/O Read: 16602 ops, 129.70M
          I/O Write: 39173 ops, 306.04M
          Workers: 1 (0.00 STW avg.)
…

Additionally, the status of the Job Engine workers is available via the OneFS CLI using the ‘isi job statistics view’ command. For example:

# isi job statistics view --job-id 857

Job ID: 857

Phase: 1
Nodes
  Node: 1
     PID: 26224
     CPU: 7.96% (0.00% min, 28.96% max, 4.60% avg)

 Virtual: 187.23M (187.23M min, 187.23M max, 187.23M avg)
Physical: 19.01M (18.52M min, 19.33M max, 18.96M avg)

    Read: 931043 ops, 7.099G
   Write: 1610213 ops, 12.269G
 Workers: 1 (0.00 STW avg.)

Job events, including pause/resume, waiting, phase completion, job success, failure, etc, are reported under the ‘Job Events’ tab of the WebUI. Additional information for each event is available via the “View Details” button for the appropriate job events entry in the WebUI. These are accessed by navigating to Cluster Management > Job Operations > Job Events.

A comprehensive job report is also provided for each phase of a job. This report contains detailed information on runtime, CPU, drive and memory utilization, the number of data and metadata objects scanned, and other work details or errors specific to the job type.

After a job finishes, you can view a report about the job from the CLI. You need to specify the job ID to view the report for a completed job. The isi job reports list command displays a list of all recent jobs, including job IDs. Run the isi job reports view command with a specific job ID. The following command displays the report of a Collect job with an ID of 857:

# isi job reports view 857

Collect[857] phase 1 (2020-11-14T11:39:57)

------------------------------------------

LIN scan

Elapsed time:              6506 seconds

LINs traversed:            433423

Files seen:                396980

Directories seen:          36439

Errors:                    0

Total blocks:              27357443452 (13678721726 KB)

CPU usage:                 max 28% (dev 1), min 0% (dev 1), avg 4%

Virtual memory size:       max 193300K (dev 1), min 191728K (dev 1), avg 1925

Resident memory size:      max 21304K (dev 1), min 18884K (dev 2), avg 20294K

Read:                      11637860 ops, 95272875008 bytes (90859.3M)

Write:                     20717079 ops, 169663891968 bytes (161804.1M)

While a job is running, an Active Job Details report is also available. This provides contextual information, including elapsed time, current job phase, job progress status, etc.

For inode (LIN) based jobs, progress as an estimated percentage completion is also displayed, based on processed LIN counts.

Detailed and granular job performance information and statistics are now available in a job’s report. These new statistics include per job phase CPU and memory utilization (including minimum, minimum, and average), and total read and write IOPS and throughput.

OneFS performance resource management provides statistics for the resources used by jobs – both cluster-wide and per-node. This information is available via the isi statistics workload CLI command. Available in a ‘top’ format, this command displays the top jobs and processes, and periodically updates the information.

For example, the following syntax shows, and indefinitely refreshes, the top five processes on a cluster:

# isi statistics workload –-limit=5 -–format=top

last update:  2020-11-14T16:45:25 (s)ort: default

CPU   Reads Writes      L2    L3    Node  SystemName        JobType

1.4s  9.1k  0.0         3.5k  497.0 2     Job:  237         IntegrityScan[0]

1.2s  85.7  714.7       4.9k  0.0   1     Job:  238         Dedupe[0]

1.2s  9.5k  0.0         3.5k  48.5  1     Job:  237         IntegrityScan[0]

1.2s  7.4k  541.3       4.9k  0.0   3     Job:  238         Dedupe[0]

1.1s  7.9k  0.0         3.5k  41.6  2     Job:  237         IntegrityScan[0]

The resource statistics tracked per job, per job phase, and per node include CPU, reads, writes, and L2 & L3 cache hits. Unlike the output from the ‘top’ command, this makes it easier to diagnose individual job resource issues, etc.

OneFS Job Engine Management

In this next installment of the Job Engine series, we take a look at the configuration, control and management of jobs.

Although OneFS runs several critical system maintenance jobs automatically when necessary, the majority of the Job Engine’s jobs have no default schedule and can be manually started by a cluster administrator. For example, the Collect job, which reclaims free space that previously could not be freed because the node or drive was unavailable. The following CLI command syntax can be used to run the Collect job with a ‘medium’ impact policy and a higher priority.

# isi job jobs start Collect --policy MEDIUM --priority 2

Started job [9]

When the job starts, a message such as appears. In the output here, [9] represents the job ID number, which can be used as the argument to run other commands on the job.

For example, the following CLI command will cancel this collect job with ID 9:

# isi job jobs cancel 9

Similarly, to start a job from the OneFS WebUI, navigate to Cluster Management > Job Operations > Job Type and click on the ‘start’ button for the desired job. For example, to manually run AutoBalance:

Other jobs such as ComplianceStoreDelete, FilePolicy, FSAnalyze, MediaScan, ShadowStoreDelete, SmartPools, and WormQueue are normally started via a schedule. The default job execution schedule is shown in the table below.

Job Name	Default Job Schedule
AutoBalance	Manual
AutoBalanceLIN	Manual
AVScan	Manual
ChangelistCreate	Manual
CloudPoolsLin/Treewalk	Manual
Collect	Manual
ComplianceStoreDelete	The 2nd Saturday of every month at 12am
Dedupe	Manual
DedupeAssessment	Manual
DomainMark/Tag	Manual
FilePolicy	Every day at 22:00
FlexProtect	Manual
FlexProtectLIN	Manual
FSAnalyze	Every day at 22:00
IndexUpdate	Manual
IntegrityScan	Manual
LinCount	Manual
MediaScan	The 1st Saturday of every month at 12am
MultiScan	Manual
PermissionRepair	Manual
QuotaScan	Manual
SetProtectPlus	Manual
ShadowStoreDelete	Every Sunday at 12:00am
SmartPools	Every day at 22:00
SmartPoolsTree	Manual
SnapRevert	Manual
SnapshotDelete	Manual
TreeDelete	Manual
WormQueue	Every day at 02:00

The full list of jobs and schedules can be viewed via the CLI using the following syntax:

# isi job types list --verbose

This information can also be displayed via the WebUI, by navigating to Cluster Management > Job Operations > Job Types.

System maintenance jobs can be customized for a particular environment and/or workflow by configuring a schedule and modifying the default priority level or impact level for a particular job type. For example, the following CLI syntax can be used to set a schedule for the MediaScan job to run every Saturday morning at 9 AM. Note that the –force option overrides the confirmation step.

# isi job types modify mediascan --schedule 'every Saturday at 09:00' --force

Similarly, the following command removes the schedule for the job:

# isi job types modify mediascan --clear-schedule --force

As such, all subsequent iterations of the MediaScan job type will run with the new settings. However, if a MediaScan job is in progress, it will continue to use the old settings.

Finally, the following syntax modifies the default priority level and impact level for the MediaScan job type:

# isi job types modify mediascan --priority 2 --policy medium

To create or edit a job’s schedule from the WebUI, click on the “View / Edit” button for the desired job, located in the “Actions” column of the “Job Types” WebUI tab above. From here, check the “Scheduled” radio button, and select between a Daily, Weekly, Monthly, or Yearly schedule, as appropriate. For each of these time period options, it’s possible to schedule the job to run either once or multiple times on each specified day.

The Job Engine schedule for certain feature supporting jobs can be configured directly from the feature’s WebUI area, as well as from the Job Engine WebUI management pages. An example of this is Antivirus and the AVScan job.

The OneFS Job Engine can also initiate certain jobs on its own. For example, if the SnapshotIQ process detects that a snapshot has been marked for deletion, it will automatically queue a SnapshotDelete job.

The Job Engine will also execute jobs in response to certain system event triggers. In the case of a cluster group change, for example the addition or subtraction of a node or drive, OneFS automatically informs the job engine, which responds by starting a FlexProtect job. The coordinator notices that the group change includes a newly-smart-failed device and then initiates a FlexProtect job in response.

Job administration and execution can be controlled via the WebUI, the command line interface (CLI), or the OneFS RESTful platform API. For each of these control methods, additional administrative security can be configured using roles-based access control (RBAC). By restricting access via the ISI_PRIV_JOB_ENGINE privilege, it is possible to allow only a sub-set of cluster administrators to configure, manage and execute job engine functionality, as desirable for the security requirements of a particular environment.

When a job is started by any of the methods described above, in addition to starting and stopping, the job can also be paused.

Once paused, the job can also easily be resumed, and execution will continue from where the job left off when it became paused. This is managed by utilizing the Job Engines’ check-pointing system, described below.

Alternatively, this can also be performed from the CLI:

# isi job jobs pause 28

# isi job jobs resume 28

For example, the above syntax will pause and resume the job with ID 28.

OneFS Job Engine Orchestration and Execution

The OneFS job engine is based on a map reduce model. Under the hood, it comprises a delegation hierarchy made up of coordinator, director, manager, and worker processes.

There are other threads which are not illustrated in the graphic, which relate to internal functions, such as communication between the various JE daemons, and collection of statistics. Also, with three jobs running simultaneously, each node would have three manager processes, each with its own number of worker threads.

Once the work is initially allocated, the job engine uses a shared work distribution model in order to execute the work, and each job is identified by a unique Job ID. When a job is launched, whether it’s scheduled, started manually, or responding to a cluster event, the Job Engine spawns a child process from the isi_job_d daemon running on each node. This job engine daemon is also known as the parent process.

The entire job engine’s orchestration is handled by the coordinator, which is a process that runs on one of the nodes in a cluster. Any node can act as the coordinator, and the principle responsibilities include:

Monitoring workload and the constituent nodes’ status
Controlling the number of worker threads per-node and cluster-wide
Managing and enforcing job synchronization and checkpoints

While the actual work item allocation is managed by the individual nodes, the coordinator node takes control, divides up the job, and evenly distributes the resulting tasks across the nodes in the cluster. For example, if the coordinator needs to communicate with a manager process running on node five, it first sends a message to node five’s director, which then passes it on down to the appropriate manager process under its control. The coordinator also periodically sends messages, via the director processes, instructing the managers to increment or decrement the number of worker threads.

The coordinator is also responsible for starting and stopping jobs, and also for processing work results as they are returned during the execution of a job. Should the coordinator process die for any reason, the coordinator responsibility automatically moves to another node.

The coordinator node can be identified via the following CLI command:

# isi job status --verbose | grep Coordinator

Each node in the cluster has a job engine director process, which runs continuously and independently in the background. The director process is responsible for monitoring, governing and overseeing all job engine activity on a particular node, constantly waiting for instruction from the coordinator to start a new job. The director process serves as a central point of contact for all the manager processes running on a node, and as a liaison with the coordinator process across nodes. These responsibilities include:

Manager process creation
Delegating to and requesting work from other peers
Sending and receiving status messages

The manager process is responsible for arranging the flow of tasks and task results throughout the duration of a job. The manager processes request and exchange work with each other and supervise the worker threads assigned to them. At any point in time, each node in a cluster can have up to three manager processes, one for each job currently running. These managers are responsible for overseeing the flow of tasks and task results.

Each manager controls and assigns work items to multiple worker threads working on items for the designated job. Under direction from the coordinator and director, a manager process maintains the appropriate number of active threads for a configured impact level, and for the node’s current activity level. Once a job has completed, the manager processes associated with that job, across all the nodes, are terminated. And new managers are automatically spawned when the next job is moved into execution.

The manager processes on each node regularly send updates to their respective node’s director, which, in turn, informs the coordinator process of the status of the various worker tasks.

Each worker thread is given a task, if available, which it processes item-by-item until the task is complete or the manager un-assigns the task. The status of the nodes’ workers can be queried by running the following CLI command:

# isi job statistics view

In addition to the number of current worker threads per node, a sleep to work (STW) ratio average is also provided, giving an indication of the worker thread activity level on the node.

Towards the end of a job phase, the number of active threads decreases as workers finish up their allotted work and become idle. Nodes which have completed their work items just remain idle, waiting for the last remaining node to finish its work allocation. When all tasks are done, the job phase is considered to be complete and the worker threads are terminated.

As jobs are processed, the coordinator consolidates the task status from the constituent nodes and periodically writes the results to checkpoint files. These checkpoint files allow jobs to be paused and resumed, either proactively, or in the event of a cluster outage. For example, if the node on which the Job Engine coordinator was running went offline for any reason, a new coordinator would be automatically started on another node. This new coordinator would read the last consistency checkpoint file, job control and task processing would resume across the cluster from where it left off, and no work would be lost.

Job engine checkpoint files are stored in ‘results’ and ‘tasks’ subdirectories under the path ‘/ifs/.ifsvar/modules/jobengine/cp/<job_id>/’ for a given job. On large clusters and/or with a job running at high impact, there can be many checkpoint files accessed from all nodes, which may result in contention. Checkpoints are split into sixteen subdirectories under both tasks and results to alleviate this bottleneck.

The Job Engine resource monitoring and execution framework allows jobs to be throttled based on both CPU and disk I/O metrics. The granularity of the resource utilization monitoring data provides the coordinator process with visibility into exactly what is generating IOPS on any particular drive across the cluster. This level of insight allows the coordinator to make very precise determinations about exactly where and how impact control is best applied. As we will see, the coordinator itself does not communicate directly with the worker threads, but rather with the director process, which in turn instructs a node’s manager process for a particular job to cut back threads.

For example, if the job engine is running a low-impact job and CPU utilization drops below the threshold, the worker thread count is gradually increased up to the maximum defined by the ‘low’ impact policy threshold. If client load on the cluster suddenly spikes for some reason, then the number of worker threads is gracefully decreased. The same principal applies to disk I/O, where the job engine will throttle back in relation to both IOPS as well as the number of I/O operations waiting to be processed in any drive’s queue. Once client load has decreased again, the number of worker threads is correspondingly increased to the maximum ‘low’ impact threshold.

In summary, detailed resource utilization telemetry allows the job engine to automatically tune its resource consumption to the desired impact level and customer workflow activity.

Certain jobs, if left unchecked, could consume vast quantities of a cluster’s resources, contending with and impacting client I/O. To counteract this, the Job Engine employs a comprehensive work throttling mechanism which is able to limit the rate at which individual jobs can run. Throttling is employed at a per-manager process level, so job impact can be managed both granularly and gracefully.

Every twenty seconds, the coordinator process gathers cluster CPU and individual disk I/O load data from all the nodes across the cluster. The coordinator uses this information, in combination with the job impact configuration, to decide how many threads may run on each cluster node to service each running job. This can be a fractional number, and fractional thread counts are achieved by having a thread sleep for a given percentage of each second.

Using this CPU and disk I/O load data, every sixty seconds the coordinator evaluates how busy the various nodes are and makes a job throttling decision, instructing the various job engine processes as to the action they need to take. This enables throttling to be sensitive to workloads in which CPU and disk I/O load metrics yield different results. Additionally, there are separate load thresholds tailored to the different classes of drives utilized in OneFS powered clusters, including high speed SAS drives, lower performance SATA disks and flash-based solid-state drives (SSDs).

The Job engine allocates a specific number of threads to each node by default, thereby controlling the impact of a workload on the cluster. If little client activity is occurring, more worker threads are spun up to allow more work, up to a predefined worker limit. For example, the worker limit for a low-impact job might allow one or two threads per node to be allocated, a medium-impact job from four to six threads, and a high-impact job a dozen or more. When this worker limit is reached (or before, if client load triggers impact management thresholds first), worker threads are throttled back or terminated.

For example, a node has four active threads, and the coordinator instructs it to cut back to three. The fourth thread is allowed to finish the individual work item it is currently processing, but then quietly exit, even though the task as a whole might not be finished. A restart checkpoint is taken for the exiting worker thread’s remaining work, and this task is returned to a pool of tasks requiring completion. This unassigned task is then allocated to the next worker thread that requests a work assignment, and processing continues from the restart check-point. This same mechanism applies in the event that multiple jobs are running simultaneously on a cluster.

In situations where the job engine sees the available capacity on one or more disk pools fall below a low space threshold, it engages low space mode. This enables space-saving jobs to run and reclaim space before the job engine or even the cluster become unusable. When the job engine is in low-space mode new jobs will not be started, and any jobs that are not space-saving will be paused. Once free space returns above the low-space threshold, jobs that have been paused for space are resumed.

The space-saving jobs are:

AutoBalance(LIN)
Collect
MultiScan
ShadowStoreDelete
SnapshotDelete
TreeDelete

Once the cluster is no longer space constrained, any paused jobs are automatically resumed.

Not all OneFS Job Engine jobs run equally fast. For example, a job which is based on a file system tree walk will run slower on a cluster with a very large number of small files than on a cluster with a low number of large files. Jobs which compare data across nodes, such as Dedupe, will run more slowly where there are many more comparisons to be made. Many factors play into this, and true linear scaling is not always possible. If a job is running slowly the first step is to discover what the specific context of the job is.

There are three main methods for jobs, and their associated processes, to interact with the file system:

Via metadata, using a LIN scan. An example of this is IntegrityScan, when performing an on-line file system verification.
Traversing the directory structure directly via a tree walk. For example, QuotaScan, when performing quota domain accounting.
Directly accessing the underlying cylinder groups and disk blocks, via a linear drive scan. For example, MediaScan, when looking for bad disk sectors.

Each of these approaches has its pros and cons and will suit particular jobs. The specific access method influences the run time of a job. For instance, some jobs are unaffected by cluster size, others slow down or accelerate with the more nodes a cluster has, and some are highly influenced by file counts and directory depths.

For a number of jobs, particularly the LIN-based ones, the job engine will provide an estimated percentage completion of the job during runtime.

With LIN scans, even though the metadata is of variable size, the job engine can fairly accurately predict how much effort will be required to scan all LINs. The data, however, can be of widely-variable size, and so estimates of how long it will take to process each task will be a best reasonable guess.

For example, the job engine might know that the highest LIN is 1:0009:0000. Assuming the job will start with a single thread on each of three nodes, the coordinator evenly divides the LINs into nine ranges: 1:0000:0000-1:0000:ffff, 1:0001:0000-1:0001:ffff, etc., through 1:0008:0000-1:0009:0000. These nine tasks would then be divided between the three nodes. However, there is no guaranty that each range will take the same time to process. For example, the first range may have fewer actual LINs, as a result of old LINs having been deleted, so complete unexpectedly fast. Perhaps the third range contains a disproportional number of large files and so takes longer to process. And maybe the seventh range has heavy contention with client activity, also resulting in an increased execution time. Despite such variances, the splitting and redistribution of tasks across the node manager processes alleviates this issue, mitigating the need for perfectly-fair divisions at the onset.

Priorities play a large role in job initiation and it is possible for a high priority job to significantly impact the running of other jobs. This is by design, since FlexProtect should be able to run with a greater level of urgency than SmartPools, for example. However, sometimes this can be an inconvenience, which is why the storage administrator has the ability to manually control the impact level and relative priority of jobs.

Certain jobs like FlexProtect have a corresponding job provided with a name suffixed by ‘Lin’, for example FlexProtectLin. This indicates that the job will automatically, where available, use an SSD-based copy of metadata to scan the LIN tree, rather than the drives themselves. Depending on the workflow, this will often significantly improve job runtime performance.

On large clusters with multiple jobs running at high impact, the job coordinator can become bombarded by the volume of task results being sent directly from the worker threads. This is mitigated by certain jobs performing intermediate merging of results on individual nodes and batching delivery of their results to the coordinator. The jobs that support results merging include:

· AutoBalance(Lin)	· MultiScan
· AVScan	· PermissionRepair
· CloudPoolsLin	· QuotaScan
· CloudPoolsTreewalk	· SnapRevert
· Collect	· SnapshotDelete
· FlexProtect(Lin)	· TreeDelete
· LinCount	· Upgrade

OneFS Job Engine Architecture and Overview

Received several comments on the previous post requesting more background information on the Job Engine. So, over the next few blog articles, we’ll try to remedy this by delving into its architecture, idiosyncrasies, and operation.

The OneFS Job Engine runs across the entire cluster and is responsible for dividing and conquering large storage management and protection tasks. To achieve this, it reduces a task into smaller work items and then allocates, or maps, these portions of the overall job to multiple worker threads on each node. Progress is tracked and reported on throughout job execution and a detailed report and status is presented upon completion or termination.

Job Engine includes a comprehensive check-pointing system which allows jobs to be paused and resumed, in addition to stopped and started. It also includes an adaptive impact management system, CPU and drive-sensitive impact control, and the ability to run multiple jobs at once.

The Job Engine typically executes jobs as background tasks across the cluster, using spare or especially reserved capacity and resources. The jobs themselves can be categorized into three primary classes:

Category	Description
File System Maintenance Jobs	These jobs perform background file system maintenance, and typically require access to all nodes. These jobs are required to run in default configurations, and often in degraded cluster conditions. Examples include file system protection and drive rebuilds.
Feature Support Jobs	The feature support jobs perform work that facilitates some extended storage management function, and typically only run when the feature has been configured. Examples include deduplication and anti-virus scanning.
User Action Jobs	These jobs are run directly by the storage administrator to accomplish some data management goal. Examples include parallel tree deletes and permissions maintenance.

Although the file system maintenance jobs are run by default, either on a schedule or in reaction to a particular file system event, any Job Engine job can be managed by configuring both its priority-level (in relation to other jobs) and its impact policy. These are covered in more detail below.

Job Engine jobs often comprise several phases, each of which are executed in a pre-defined sequence. These run the gamut from jobs like TreeDelete, which have just a single phase, to complex jobs like FlexProtect and MediaScan that have multiple distinct phases.

A job phase must be completed in entirety before the job can progress to the next phase. If any errors occur during execution, the job is marked “failed” at the end of that particular phase and the job is terminated.

Each job phase is composed of a number of work chunks, or Tasks. Tasks, which are comprised of multiple individual work items, are divided up and load balanced across the nodes within the cluster. Successful execution of a work item produces an item result, which might contain a count of the number of retries required to repair a file, plus any errors that occurred during processing.

When a Job Engine job needs to work on a large portion of the file system, there are four main methods available to accomplish this:

Inode (LIN) Scan
Tree Walk
Drive Scan
Changelist

The most straightforward access method is via metadata, using a Logical Inode (LIN) Scan. In addition to being simple to access in parallel, LINs also provide a useful way of accurately determining the amount of work required.

A directory tree walk is the traditional access method since it works similarly to common UNIX utilities, such as find – albeit in a far more distributed way. For parallel execution, the various job tasks are each assigned a separate subdirectory tree. Unlike LIN scans, tree walks may prove to be heavily unbalanced, due to varying sub-directory depths and file counts.

Disk drives provide excellent linear read access, so a drive scan can deliver orders of magnitude better performance than a directory tree walk or LIN scan for jobs that don’t require insight into file system structure. As such, drive scans are ideal for jobs like MediaScan, which linearly traverses each node’s disks looking for bad disk sectors.

A fourth class of Job Engine jobs utilize a ‘changelist’, rather than LIN-based scanning. The changelist approach analyzes two snapshots to find the LINs which changed (delta) between the snapshots, and then dives in to determine the exact changes.

The following table provides a comprehensive list of the exposed jobs and operations that the OneFS Job Engine performs, and their file system access methods:

Job Name	Job Description	Access Method
AutoBalance	Balances free space in the cluster.	Drive + LIN
AutoBalanceLin	Balances free space in the cluster.	LIN
AVScan	Virus scanning job that antivirus server(s) run.	Tree
ChangelistCreate	Create a list of changes between two consecutive SyncIQ snapshots	Changelist
CloudPoolsLin	Archives data out to a cloud provider according to a file pool policy.	LIN
CloudPoolsTreewalk	Archives data out to a cloud provider according to a file pool policy.	Tree
Collect	Reclaims disk space that could not be freed due to a node or drive being unavailable while they suffer from various failure conditions.	Drive + LIN
ComplianceStoreDelete	SmartLock Compliance mode garbage collection job.	Tree
Dedupe	Deduplicates identical blocks in the file system.	Tree
DedupeAssessment	Dry run assessment of the benefits of deduplication.	Tree
DomainMark	Associates a path and its contents with a domain.	Tree
DomainTag	Associates a path and its contents with a domain.	Tree
EsrsMftDownload	ESRS managed file transfer job for license files.
FilePolicy	Efficient SmartPools file pool policy job.	Changelist
FlexProtect	Rebuilds and re-protects the file system to recover from a failure scenario.	Drive + LIN
FlexProtectLin	Re-protects the file system.	LIN
FSAnalyze	Gathers file system analytics data that is used in conjunction with InsightIQ.	Changelist
IndexUpdate	Creates and updates an efficient file system index for FilePolicy and FSAnalyze jobs,	Changelist
IntegrityScan	Performs online verification and correction of any file system inconsistencies.	LIN
LinCount	Scans and counts the file system logical inodes (LINs).	LIN
MediaScan	Scans drives for media-level errors.	Drive + LIN
MultiScan	Runs Collect and AutoBalance jobs concurrently.	LIN
PermissionRepair	Correct permissions of files and directories.	Tree
QuotaScan	Updates quota accounting for domains created on an existing directory path.	Tree
SetProtectPlus	Applies the default file policy. This job is disabled if SmartPools is activated on the cluster.	LIN
ShadowStoreDelete	Frees space associated with a shadow store.	LIN
ShadowStoreProtect	Protect shadow stores which are referenced by a LIN with higher requested protection.	LIN
ShadowStoreRepair	Repair shadow stores.	LIN
SmartPools	Job that runs and moves data between the tiers of nodes within the same cluster. Also executes the CloudPools functionality if licensed and configured.	LIN
SmartPoolsTree	Enforce SmartPools file policies on a subtree.	Tree
SnapRevert	Reverts an entire snapshot back to head.	LIN
SnapshotDelete	Frees disk space that is associated with deleted snapshots.	LIN
TreeDelete	Deletes a path in the file system directly from the cluster itself.	Tree
Undedupe	Removes deduplication of identical blocks in the file system.	Tree
Upgrade	Upgrades cluster on a later OneFS release.	Tree
WormQueue	Scan the SmartLock LIN queue	LIN

Note that there are also a few background Job Engine jobs, such as the Upgrade job, which are not exposed to administrative control.

A job impact policy can consist of one or many impact intervals, which are blocks of time within a given week. Each impact interval can be configured to use a single, pre-defined impact-level which specifies the amount of cluster resources to use for a particular cluster operation. The available impact-levels are:

Paused
Low
Medium
High

This degree of granularity allows impact intervals and levels to be configured per job, in order to ensure smooth cluster operation. And the resulting impact policies dictate when a job runs and the resources that a job can consume. The following default job engine impact policies are provided:

Impact policy	Schedule	Impact Level
LOW	Any time of day	Low
MEDIUM	Any time of day	Medium
HIGH	Any time of day	High
OFF_HOURS	Outside of business hours (9AM to 5PM, Mon to Fri), paused during business hours	Low

Be aware that these default impact policies cannot be modified or deleted.

However, new impact policies can be created, either via the “Add an Impact Policy” WebUI button, or by cloning a default policy and then modifying its settings as appropriate.

A mix of jobs with different impact levels will result in resource sharing. Each job cannot exceed the impact levels set for it, and the aggregate impact level cannot exceed the highest level of the individual jobs.

For example:

Job A (HIGH), job B (LOW).

The impact level of job A is HIGH.

The impact level of job B is LOW.

The total impact level of the two jobs combined is HIGH.

Job A (MEDIUM), job B (LOW), job C (MEDIUM).

The impact level of job A is MEDIUM.

The impact level of job B is LOW.

The impact level of job C is MEDIUM.

The total impact level of the three jobs combined is MEDIUM.

Job A (LOW), job B (LOW), job C (LOW), job D (LOW).

The impact level of job A is LOW.

The impact level of job B is LOW.

The impact level of job C is LOW.

The impact level of job D is LOW.

The job that was most recently queued/paused, or has the highest job ID value, will be paused.

The total impact level of the three running jobs, and one paused job, combined is LOW.

A best practice is to keep the default impact and priority settings, where possible, unless there’s a valid reason to change them.

The majority of Job Engine jobs are intended to run with “LOW” impact and execute in the background. Notable exceptions are the FlexProtect jobs, which by default are set at “medium” impact. This allows FlexProtect to quickly and efficiently re-protect data, without critically impacting other user activities.

Job Engine jobs are prioritized on a scale of one to ten, with a lower value signifying a higher priority. This is similar in concept to the UNIX scheduling utility, ‘nice’.

Higher priority jobs always cause lower-priority jobs to be paused, and, if a job is paused, it is returned to the back of the Job Engine priority queue. When the job reaches the front of the priority queue again, it resumes from where it left off. If the system schedules two jobs of the same type and priority level to run simultaneously, the job that was queued first is run first.

Priority takes effect when two or more queued jobs belong to the same exclusion set, or when, if exclusion sets are not a factor, four or more jobs are queued. The fourth queued job may be paused, if it has a lower priority than the three other running jobs.

In contrast to priority, job impact policy only comes into play once a job is running and determines the amount of resources a job can utilize across the cluster. As such, a job’s priority and impact policy are orthogonal to one another.

The FlexProtect(LIN) and IntegrityScan jobs both have the highest job engine priority level of 1, by default. Of these, FlexProtect is the most important, because of its core role in re-protecting data.

All the Job Engine jobs’ priorities are configurable by the cluster administrator. The default priority settings are strongly recommended, particularly for the highest priority jobs mentioned above.

As we saw in the previous blog article, the OneFS Job Engine allows up to three jobs to be run simultaneously. This concurrent job execution is governed by the following criteria:

Job Priority
Exclusion Sets – jobs which cannot run together (ie, FlexProtect and AutoBalance)
Cluster health – most jobs cannot run when the cluster is in a degraded state.

OneFS Job Contention

Got asked the following question from the field recently:

“I kicked off a job on my cluster which seemed to be running happily. When I went back into the UI to check it had stopped and was marked waiting and wouldn’t restart.”

Situations like this occur when a running lower priority job is trumped by higher priority job(s). Since the OneFS Job Engine only allows up to three jobs to be run simultaneously, if a fourth job with a higher priority is started, the lowest of the currently executing jobs will be paused. For example:

# isi job start fsanalyze

Started job [583]

# isi job status

The job engine is running.

Running and queued jobs:

ID   Type       State Impact  Pri  Phase Running Time

---------------------------------------------------------

578  SmartPools Running Low     6   1/2   11s

581  Collect    Running Low     4   1/3   16s

583  FSAnalyze  Running Low     5   1/10  1s

---------------------------------------------------------

Total: 3

In this case, we have three jobs running: SmartPools with a priority of 6, MultiScan with priority 4, and FSAnalyze with priority 5.

Next, we go ahead and start a deduplication job, with a priority value of 4:

# isi job start dedupe

Started job [584]

If we now take a look at the cluster’s job status, we can see that SmartPools job has been put into a waiting state (paused), because of its relative priority. A value of ‘1’ indicates the highest priority job level that OneFS supports, with ‘10’ being the lowest.

# isi job status

The job engine is running.

Running and queued jobs:

ID   Type       State Impact  Pri  Phase Running Time

---------------------------------------------------------

578  SmartPools Waiting Low     6  1/2   11s

581  Collect    Running Low     4  1/3   1m 4s

583  FSAnalyze  Running Low     5  9/10  43s

584  Dedupe     Running Low     4  1/1    -

---------------------------------------------------------

Total: 4

Once the FSAnalyze job has completed, the SmartPools job is automatically restarted again:

# isi job status

The job engine is running.

Running and queued jobs:

ID   Type       State Impact  Pri  Phase Running Time

---------------------------------------------------------

578  SmartPools Running Low     6   1/2  23s

581  Collect    Running Low     4   1/3  5m 9s

584  Dedupe     Running Low     4   1/1  1m 2s

---------------------------------------------------------

Total: 3

Let’s look at this in a bit more detail. So the Job Engine’s concurrent job execution is governed by the following criteria:

Job Priority
Exclusion Sets – jobs which cannot run together (ie, FlexProtect and AutoBalance)
Cluster health – most jobs cannot run when the cluster is in a degraded state.

In addition to the per-job impact controls described above, additional impact management is also provided by the notion of job exclusion sets. For multiple concurrent job execution, exclusion sets, or classes of similar jobs, determine which jobs can run simultaneously. A job is not required to be part of any exclusion set, and jobs may also belong to multiple exclusion sets.

Currently, there are two exclusion sets that jobs can be part of: Restriping and Marking:

Here’s a list of the basic concurrent job combinations that OneFS supports:

1 Restripe Job + 1 Mark Job + 1 Other Job
1 Restripe Job + 2 Other Jobs
1 Mark Job + 2 Other Jobs
1 Mark and Restripe Job + 2 Other Jobs
3 Other Jobs

OneFS marks blocks that are actually in use by the file system. IntegrityScan, for example, traverses the live file system, marking every block of every LIN in the cluster to proactively detect and resolve any issues with the structure of data in a cluster. The jobs that comprise the marking exclusion set are:

Collect
IntegrityScan
MultiScan

OneFS protects data by writing file blocks across multiple drives on different nodes in a process known as ‘restriping’. The Job Engine defines a restripe exclusion set that contains these jobs which involve file system management, protection and on-disk layout. The restripe exclusion set contains the following jobs:

AutoBalance
AutoBalanceLin
FlexProtect
FlexProtectLin
MediaScan
MultiScan
SetProtectPlus
ShadowStoreProtect
SmartPools
Upgrade

The restriping exclusion set is per-phase instead of per job. This helps to more efficiently parallelize restripe jobs when they don’t need to lock down resources.

Restriping jobs only block each other when the current phase may perform restriping. This is most evident with MultiScan, whose final phase only sweeps rather than restripes. Similarly, MediaScan, which rarely ever restripes, is usually able to run to completion more without contending with other restriping jobs.

Running and queued jobs:

ID    Type               State       Impact  Pri  Phase  Running Time

----------------------------------------------------------------------

26850 AutoBalanceLin     Running     Low     4    1/3    20d 18h 19m 

26910 ShadowStoreProtect Waiting     Low     6    1/1    -           

28133 MediaScan          Running     Low     8    1/8    1d 15h 37m  

----------------------------------------------------------------------

MediaScan restripes in phases 3 and 5 of the job, and only if there are disk errors (ECCs) which require data reprotection. If MediaScan reaches phase 3 with ECCs, it will pause until AutoBalanceLin is no longer running. If MediaScan’s priority were in the range 1-3, it would cause AutoBalanceLin to pause instead.

Jobs may also belong to both exclusion sets. An example of this is MultiScan, since it includes both AutoBalance and Collect.

The majority of the jobs do not belong to an exclusion set, as illustrated in the following graphic. These are typically the feature support jobs, as described above, and they can coexist and contend with any of the other jobs.

Looking at our previous example again, where the SmartPools job is paused when FSAnalyze is running:

# isi job list

ID   Type       State Impact  Pri  Phase Running Time

---------------------------------------------------------

578  SmartPools Waiting Low     6  1/2    15s

581  Collect    Running Low     4  1/3    37m

584  Dedupe     Running Low     4  1/1    33m

586  FSAnalyze  Running Low     5  1/10   9s

---------------------------------------------------------

Total: 4

If this is undesirable, FSAnalyze can be manually paused, to allow SmartPools to run unimpeded:

# isi job pause FSAnalyze

# isi job list

ID   Type       State       Impact Pri  Phase  Running Time

-------------------------------------------------------------

578  SmartPools Waiting     Low     6   1/2    15s

581  Collect    Running     Low     4   1/4    38m

584  Dedupe     Running     Low     4   1/1    34m

586  FSAnalyze  User Paused Low     5   1/10   20s

-------------------------------------------------------------

Total: 4

Alternatively, the priority of the SmartPools job can also be elevated to value ‘4’ (or the priority of FSAnalyze lowered to value ‘7’) to permanently prioritize it over the FSAnalyze job. For example:

# isi job types modify SmartPools --priority 4

Are you sure you want to modify the job type SmartPools? (yes/[no]): yes

# isi job types view SmartPools

         ID: SmartPools

Description: Enforce SmartPools file policies. This job requires a SmartPools license.

    Enabled: Yes

     Policy: LOW

   Schedule: every day at 22:00

   Priority: 4

Or, via the WebUI:

Navigate to Job Operations > Job Type and configure the desired priority value edit the appropriate job type’s details:

OneFS FilePolicy Job

Traditionally, OneFS has used the SmartPools jobs to apply its file pool policies. To accomplish this, the SmartPools job visits every file, and the SmartPoolsTree job visits a tree of files. However, the scanning portion of these jobs can result in significant random impact to the cluster and lengthy execution times, particularly in the case of SmartPools job. To address this, OneFS also provides the FilePolicy job, which offers a faster, lower impact method for applying file pool policies than the full-blown SmartPools job.

But first, a quick Job Engine refresher…

As we know, he Job Engine is OneFS’ parallel task scheduling framework, and is responsible for the distribution, execution, and impact management of critical jobs and operations across the entire cluster.

The OneFS Job Engine schedules and manages all the data protection and background cluster tasks: creating jobs for each task, prioritizing them and ensuring that inter-node communication and cluster wide capacity utilization and performance are balanced and optimized. Job Engine ensures that core cluster functions have priority over less important work and gives applications integrated with OneFS – Isilon add-on software or applications integrating to OneFS via the OneFS API – the ability to control the priority of their various functions to ensure the best resource utilization.

Each job, for example the SmartPools job, has an “Impact Profile” comprising a configurable policy and a schedule which characterizes how much of the system’s resources the job will take – plus an Impact Policy and an Impact Schedule. The amount of work a job has to do is fixed, but the resources dedicated to that work can be tuned to minimize the impact to other cluster functions, like serving client data.

Here’s a list of the specific jobs that are directly associated with OneFS SmartPools:

Job	Description
SmartPools	Job that runs and moves data between the tiers of nodes within the same cluster. Also executes the CloudPools functionality if licensed and configured.
SmartPoolsTree	Enforces SmartPools file policies on a subtree.
FilePolicy	Efficient changelist-based SmartPools file pool policy job.
IndexUpdate	Creates and updates an efficient file system index for FilePolicy job.
SetProtectPlus	Applies the default file policy. This job is disabled if SmartPools is activated on the cluster.

In conjunction with the IndexUpdate job, FilePolicy improves job scan performance for by using a ‘file system index’, or changelist, to find files needing policy changes, rather than a full tree scan.

Avoiding a full treewalk dramatically decreases the amount of locking and metadata scanning work the job is required to perform, reducing impact on CPU and disk – albeit at the expense of not doing everything that SmartPools does. The FilePolicy job enforces just the SmartPools file pool policies, as opposed to the storage pool settings. For example, FilePolicy does not deal with changes to storage pools or storage pool settings, such as:

Restriping activity due to adding, removing, or reorganizing node pools.
Changes to storage pool settings or defaults, including protection.
Packing small files into shadow store containers.

However, the majority of the time SmartPools and FilePolicy perform the same work. Disabled by default, FilePolicy supports the full range of file pool policy features, reports the same information, and provides the same configuration options as the SmartPools job. Since FilePolicy is a changelist-based job, it performs best when run frequently – once or multiple times a day, depending on the configured file pool policies, data size and rate of change.

The FilePolicy job can be invoked from the OneFS CLI with the following configuration options:

Config Option	Details
–directory-only	Skip processing of regular files.
–ingest	Fast mode for quickly setting policies on directories. Alias for –directory-only –policy-only.
–nop	Dry run, calculate what would be done.
–policy-only	Apply policies but skip restriping.

Job schedules can easily be configured from the OneFS WebUI by navigating to Cluster Management > Job Operations, highlighting the desired job and selecting ‘View\Edit’. The following example illustrates configuring the IndexUpdate job to run every six hours at a LOW impact level with a priority value of 5:

When enabling and using the FilePolicy and IndexUpdate jobs, the recommendation is to continue running the SmartPools job as well, but at a reduced frequency (monthly).

In addition to running on a configured schedule, the FilePolicy job can also be executed manually.

FilePolicy requires access to a current index. In the event that the IndexUpdate job has not yet been run, attempting to start the FilePolicy job will fail with the error below. Instructions in the error message will be displayed, prompting to run the IndexUpdate job first. Once the index has been created, the FilePolicy job will run successfully. The IndexUpdate job can be run several times daily (ie. every six hours) to keep the index current and prevent the snapshots from getting large.

Consider using the FilePolicy job with the job schedules below for workflows and datasets with the following characteristics:

Data with long retention times
Large number of small files
Path-based File Pool filters configured
Where FSAnalyze job is already running on the cluster (InsightIQ monitored clusters)
There is already a SnapshotIQ schedule configured
When the SmartPools job typically takes a day or more to run to completion at LOW impact

For clusters without the characteristics described above, the recommendation is to continue running the SmartPools job as usual and to not activate the FilePolicy job.

The following table provides a suggested job schedule when deploying FilePolicy:

Job	Schedule	Impact	Priority
FilePolicy	Every day at 22:00	LOW	6
IndexUpdate	Every six hours, every day	LOW	5
SmartPools	Monthly – Sunday at 23:00	LOW	6

Since no two clusters are the same, so this suggested job schedule may require additional tuning to meet the needs of a specific environment.

Note that when clusters running older OneFS versions and the FSAnalyze job are upgraded to OneFS 8.2.x or later, the legacy FSAnalyze index and snapshots are removed and replaced by new snapshots the first time that IndexUpdate is run. The new index stores considerably more file and snapshot attributes than the old FSA index. Until the IndexUpdate job effects this change, FSA keeps running on the old index and snapshots.

OneFS Re-protection and Cluster Expansion

Received the following question from the field, which seemed like it would make a useful blog article:

“I have an 8 node A2000 cluster with 12TB drives that has a recommended protection level of +2d:1n. I need to add capacity because the cluster is already 87% full so will be adding a new half chassis/node pair. According to the sizer at 10 nodes the cluster’s recommended protection level changes to +3d:1n1d. What’s the quickest way to go about this?”

Essentially, this boils down to whether the protection level should be changed before or after adding the new nodes.

The principle objective here is efficiency, by limiting the amount of protection and layout work that OneFS has to perform. In this case, both node addition and a change in protection level require that the cluster’s restriper daemon to run. This entails two long running operations: To balance data evenly across the cluster and to increase the data protection.

If the new A2000s are added first, and then the cluster protection is changed to the recommend level for the new configuration, the process would look like:

1) Add a new node pair to the cluster

2) Let rebalance finish

3) Configure the data protection level to +3d:1n1d

4) Allow the restriper to complete the re-protection

However, by altering the protection level first, all the data restriping can be performed more efficiently and in a single step:

1) Change the protection level setting to +3d:1n1d

2) Add nodes (immediately after changing the protection level)

3) Let rebalance finish

In addition to reducing the amount of work the cluster has to do, this streamlined process also has the benefit of getting data re-protected at the new recommended level more quickly.

OneFS protects and balances data by writing file blocks across multiple drives on different nodes. This process is known as ‘restriping’ in OneFS jargon. The Job Engine defines a restripe exclusion set that contains those jobs which involve file system management, protection and on-disk layout. The restripe set encompasses the following jobs:

Job	Description
Autobalance(Lin)	Balance free space in a cluster
FlexProtect(Lin)	Scans file system after device failure to ensure all files remain protected
MediaScan	Locate and clear media-level errors from disks
MultiScan	Runs AutoBalance and Collect jobs concurrently
SetProtectPlus	Applies the default file policy (unless SmartPools is activated)
ShadowStoreProtect	Protect shadow stores
SmartPools	Protects and moves data between tiers of nodes within cluster
Upgrade	Manages OneFS version upgrades

Each of the Job Engine jobs has an associated restripe goal, which can be displayed with the following command:

# isi_gconfig -t job-config | grep restripe_goal

The different restriper functions operate as follows, where each in the path is a superset of the previous:

The following table describes the action and layout goal of each restriper function:

Function	Detail	Goal
Retune	Always restripe using the retune layout goal. Originally intended to optimize layout for performance, but has instead become a synonym for ‘force restripe’.	LAYOUT_RETUNE
Rebalance	Attempt to balance utilization between drives etc. Also address all conditions implied by REPROTECT.	LAYOUT_REBALANCE
Reprotect	Change the protection level to more closely match the policy if the current cluster state allows wider striping or more mirrors. Re-evaluate the disk pool policy and SSD strategy. Also address all conditions implied by REPAIR.	LAYOUT_REPAIR
Repair	Replaces any references to restripe_from (down/smartfailed) components. Also fixes recovered writes.	LAYOUT_REPAIR

Here’s how the various Job Engine jobs (as reported by the isi_gconfig –t job-config command above) align with the four restriper goals:

The retune goal moves the current disk pool to the bottom of the list, increasing the likelihood (but not guaranteeing) that another pool will be selected as the restripe target. This is useful, for example, in the event of a significant drive loss in one of the disk pools that make up the node’s pool (eg. disk pool 4 suffers loss of 2+ drives and it becomes > 90% full). Using a retune goal more ‘quickly’ forces rebalance to the other pools.

So, an efficient approach to the earlier cluster expansion scenario is to change protection and then add the new node. A procedure for this is as follows:

Reconfigure the protection level to the recommended setting for the appropriate node pool(s). This can be done from the WebUI by navigating to File System > Storage Pools > SmartPools and editing the appropriate node pool(s):

2. Especially for larger clusters (ie. twenty nodes or more), of if there’s a mix of node hardware generations, it’s helpful to do some prep work upfront prior to adding a node. Prior to adding node(s):

1. Image any new node to the same OneFS version that the cluster is running.
2. Ensure that any new nodes have the correct versions of node and drive firmware, plus any patches that may have been added, before joining to the cluster.
3. If the nodes are from different hardware generations or configuration, ensure that they fit within the Node Compatibility requirements for the cluster’s OneFS version.

3. Set the Job Engine daemon to ‘disable’ an hour or so prior to adding new node(s) to help ensure a clean node join. This can be done with the following command:

# isi services –a isi_job_d disable

4. Add the new node(s) and verify the healthy state of the expanded cluster:

5. Confirm there are no un-provisioned drives:

# disi –I diskpools ls | grep –i “Unprovisioned drives”

6. Check that the node(s) joined the existing pools

# isi storagepool list

7. Restart the Job Engine:

# isi services –a isi_job_d enable

8. After adding all nodes, the recommendation for a cluster with SSDs is to run AutoBalanceLin with an impact policy of ‘LOW’ or OFF_HOURS. For example:

# isi job jobs start autobalancelin --policy LOW

9. To ensure the restripe is going smoothly, monitor the disk IO (‘DiskIn’ and ‘DiskOut’ counters) using the following command:

# isi statistics system –nall –-oprates –-nohumanize

Monitor the ‘DiskIn’ and ‘DiskOut’ counters. Between 2500-5000 disk IOPS is pretty healthy for nodes containing SSDs.

10. Additionally, cancelling Mediascan and/or MultiScan and pausing FSanalysis will reduce resource contention and allow the AutoBalanceLIN job to complete more efficiently.

# isi job jobs cancel mediascan

# isi job jobs cancel multiscan

# isi job jobs pause fsanalyze

Finally, it’s worth periodically running and reviewing the OneFS health check reports – especially pre and post configuring changes and adding new nodes to the cluster. These can be found an run from the WebUI by navigating to Cluster Management > HealthCheck > HealthChecks.

The OneFS Healthcheck diagnostic tool will help verify that the OneFS configuration is as expected and verify there are no cluster issues. The important checks to run include basic, job engine, cluster capacity, pre-upgrade, and performance.

OneFS Audit Log Purging

OneFS audit logs can grow quickly based on a customer’s audit configuration. As such, audit logs often need to be trimmed for space or for regulatory compliance purposes. OneFS 9.1 introduces the ability to automatically and non-disruptively purge both configuration and protocol audit log files. In releases prior to OneFS 9.1, there was no easy way to remove audit logs without stopping and restarting the audit service. Additionally, it involved a somewhat complex and potentially error-prone procedure. The steps must be followed explicitly, otherwise it may lead to data unavailability. OneFS audit logs are stored and compressed binary files which are located under /ifs/.ifsvar/audit/logs, and each log file can grow to a maximum of 1 GB.

Audit log purging provides a simple, efficient method to trim audit log entries. Log purging is applied to all nodes in the cluster, either automatically and periodically based on a retention policy, or manually. Configuration is via the OneFS CLI or platform API, and an automated purging policy is based on a time range, or ‘retentionperiod(by days)’. When run automatically, the log purger runs once an hour as a background process with minimal performance impact.

The OneFS 9.1 ‘isi audit settings’ command set enables the configuration and control of automatic log purging. The configuration parameters can be viewed with the following syntax:

# isi audit settings global view

Protocol Auditing Enabled: No

            Audited Zones: -

          CEE Server URIs: -

                 Hostname:

  Config Auditing Enabled: No

    Config Syslog Enabled: No

    Config Syslog Servers: -

  Protocol Syslog Servers: -

     Auto Purging Enabled: No

         Retention Period: 180

The following CLI command will enable automatic purging:

# isi audit settings global modify --auto-purging-enabled=yes

You are enabling the automatic log purging.

Automatic log purging will run in background to delete audit log files.

Please check the retention period before enabling automatic log purging.

Are you sure you want to do this?? (yes/[no]): yes

This change can be verified with the following syntax:

# isi audit settings global view | grep -i purging

     Auto Purging Enabled: Yes

Similarly, the retention period can also be configured from the OneFS CLI as follows. In this case, it’s being changed to 90 days from its default of 180 days:

# isi audit settings global modify --retention-period=90

# isi audit settings global view | grep -i retention

         Retention Period: 365

Manual deletion uses the same underlying mechanism as auto deletion, the only difference being that it is initiated manually. The isi audit logs delete CLI command can be used to purge the audit logs prior to a specified date. For example, the following syntax will manually purge the audit logs of entries prior to 1st April 2020:

# isi audit logs delete --before=2020-04-01

You are going to delete the audit logs before 2020-04-01.

Are you sure you want to do this?? (yes/[no]): yes

The purging request has been triggered.

The following CLI command can be run to monitor and verify the activity of the manual purging process:

# isi audit logs check

Purging Status:

             Using Before Value: 2019-10-01

Currently Manual Purging Status: COMPLETED

The removal of protocol or configuration audit logs for a particular time period can be verified with the OneFS audit event viewer command line utility. For example, the following syntax will display any protocol audit events between 1^st January 2020 and 31^st March 2019:

# isi_audit_viewer -t protocol -s "2020-01-01 12:00:00" -e "2020-03-31 12:00:00"

There are a couple of things to keep in mind when configuring and using audit logfile purging:

If syslog forwarding is enabled and is unable to forward quick enough, it can potentially block the purging process. This could potentially result in remaining, un-purged audit log entries, even if they fall outside of the retention period. The same bottlenecking challenge also could potentially occur with the CEE forwarder.

Performance-wise, depending on the quantity of audit data to be purged, the initial deletion may be lengthy and somewhat resource intensive. However, subsequent deletions will typically be much faster and with negligible performance impact.

The following CLI commands can be useful in determining how logging in general is progressing on a cluster. The ‘isi_audit_progress’ syntax will display both last consumed and logged event timestamps. Note that the command is node-local, so will need to be run with ‘isi_for_array’ to get a full cluster report:

# isi_for_array -s isi_audit_progress -t protocol CEE_FWD

tme-1:  Last consumed event time: '2022-12-02 10:16:24'

tme-1:  Last logged event time:   '2022-12-02 10:16:27'

tme-2:  Last consumed event time: '2022-12-02 10:16:19'

tme-2:  Last logged event time:   '2022-12-02 10:16:24'

tme-3:  Last consumed event time: '2022-12-02 10:16:31'

tme-3:  Last logged event time:   '2022-12-02 10:16:31'

Next, the following CLI command will confirm that nodes are happily forwarding to the CEE server(s):

# isi statistics query current list --keys=node.audit.cee.export.rate --nodes all

   Node  node.audit.cee.export.rate

-----------------------------------

      1                 107.200000

      2                  89.400000

      3                    94.100000

average                  96.900000

-----------------------------------

OneFS CELOG – Part 2

In the previous article in this series, we looked at an overview of CELOG – OneFS’ cluster event log and alerting infrastructure. For this blog post, we’ll focus in on CELOG’s configuration and management.

CELOG’s CLI is integrated with OneFS’ RESTful platform API and roles-based access control (RBAC) and is based around the ‘isi event’ series of commands. These are divided into three main groups:

Command Group	Description
Alerting	Alerts: Manage rules for reporting on groups of correlated events Channels: Manage channels for sending alerts
Monitoring and Control	Events: List and view events Groups: Manage groups of correlated event occurrences Test: Create test events
Configuration	Settings: Manage maintenance window, data retention and storage limits

The isi event alerts command set allow for the viewing, creation, deletion, and modification of alert conditions. These are the rules that specify how sets of event groups are reported.

As such, alert conditions combine a set of event groups or event group categories with a condition and a set of channels (isi event channels). When any of the specified event group conditions are met, an alert fires and is dispatched via the specified channel(s).

An alert condition comprises:

The threshold or condition under which alerts should be sent.
The event groups and/or categories it applies to (there is a special value ‘all’).
The channels through which alerts should be sent.

The channels must already exist and cannot be deleted while in use by any alert conditions. Some alert conditions have additional parameters, including: NEW, NEW_EVENTS, ONGOING, SEVERITY_INCREASE, SEVERITY_DECREASE and RESOLVED.

An alert condition may also possess a duration, or ‘transient period’. If this is configured, then no event group which is active (ie. not resolved) for less than this period of time will be reported upon via the alert condition that specifies it.

Note: The same event group may be reported upon under other alert conditions that do not specify a transient period or specify a different one.

The following command creates an alert named ExternalNetwork, sets the alert condition to NEW, the source event group to ID 100010001 (SYS_DISK_VARFULL), the channel to TechSupport, sets the severity level to critical, and the maximum alert limit to 5:

# isi event alerts create ExternalNetwork NEW –add_eventgroup 100010001 --channel TechSupport --severity critical --limit 5

Or, from the WebUI by browsing to Cluster Management > Events and Alerts > Alerts:

Similarly, the following will add the event group ID 123456 to the ExternalNetwork alert, and only send alerts for event groups with critical severity:

# isi event alerts modify ExternalNetwork -–add-eventgroup 123456 --severity critical

Channels are the routes via which alerts are sent, and include any necessary routing, addressing, node exclusion information, etc. The isi event channels command provides create, modify, delete, list and view options for channels.

The supported communication methods include:

SMTP
SNMPA
ConnectEMC

The following command creates the channel ‘TechSupport’ used in the example above, and sets its type to EMCConnect:

# isi event channels create TechSupport connectemc

Note that ESRS connectivity must be enabled prior to configuring a ‘connectemc’ channel.

Conversely, a channel can easily be removed with the following syntax:

# isi event channels delete TechSupport

Or from the WebUI by browsing to Cluster Management > Events and Alerts > Alerts:

For SMTP, a valid email server is required. This can be checked with the ‘isi email view’ command. If the “SMTP relay address” field is empty, this can be configured by running something along the lines of:

# isi email settings modify -–mail-relay=mail.mycompany.com

The following syntax modifies the channel named TechSupport, changing the SMTP username to admin, and resetting the SMTP password:

# isi event channels modify TechSupport --smtp-username admin -–smtp-password p@ssw0rd

SNMP traps are sent by running either ‘snmpinform’ or ‘snmptrap’ with appropriate parameters (agent, community, etc). To configure a cluster to send SNMP traps in order for a network monitoring system (NMS) to receive them, from the WebUI navigate to Dashboard > Events > Notification Rules > Add Rule and create a rule with Recipients = SNMP, and enter the correct values for ‘Community’ and ‘Host’ appropriate for your NMS.

The isi event events list command displays events with their ID, time of occurrence, severity, logical node number, event group and message. The events for a single event group occurrence can be listed using the –eventgroup-id parameter.

To identify the instance ID of the event that you want to view, run the following command:

# isi event events list

To view the details of a specific event, run the isi event events view command and specify the event instance ID. The following displays the details for an event with the instance ID of 6.201114:

# isi event events view 6.201114

           ID: 6.201114

Eventgroup ID: 4458074

   Event Type: 400040017

      Message: (policy name: cb-policy target: 10.245.109.130) SyncIQ encountered a filesystem error. Failure due to file system error(s): Could not sync stub file 103f40aa8: Input/output error

        Devid: 6

          Lnn: 6

         Time: 2020-10-26T12:23:14

     Severity: warning

        Value: 0.0

The list of event groups can be filtered by cause, begin (occurred after this time), end (occurred before this time), resolved, ignored or event count (event group occurrences with at least as many events as specified). By default only event group occurrences which are not ignored will be shown.

The configurations options are to set or revert the ignore status and to set the resolved status. Be warned that an event group marked as resolved cannot be reverted.

For example, the following example command modifies event group ID 10 to a status ‘ignored’:

# isi event groups modify 10 --ignored true

Note: If desired, the isi event group bulk command will set all event group occurrences to either ignore or resolved. Use sparingly!

The isi events settings view command displays the current values of all settings, whereas the modify command allows any of them to be reconfigured.

The configurable options include:

Config Option	Detail
retention-days	Retention period for data concerning resolved event groups in days
storage-limit	The amount of cluster storage that CELOG is allowed to consume – measured in millionths of the total on the cluster (megabytes per terabyte of total storage). Values of 1-100 are allowed (up to one ten thousandth of the total storage), however there is a 1GB floor for small clusters.
maintenance-start maintenance-duration	These two should always be used together to specify a maintenance period during which no alerts will be generated. This is intended to suppress alerts during periods of maintenance when they are likely to be false alarms.
heartbeat-interval	CELOG runs a periodic (once daily, by default) self test by sending a heartbeat event from each node which is reported via the system ‘Heartbeat Self-Test’ channel. Any failures are logged in /var/log/messages.

The following syntax alters the number of days that resolved event groups are saved to 90, and increases the storage limit for event data to 5MB for every 1TB of total cluster storage:

# isi event settings modify --retention-days 90 --storage-limit 5

A maintenance window can be configured to discontinue alerts while performing maintenance on your cluster.

For example, the following command schedules a maintenance window that starts at 11pm on October 27, 2020, and lasts for one day:

# isi event settings modify --maintenance-start 2020-10-27T23:00:00 --maintenance-duration 1D

Maintenance periods and retention settings can also be configured from the WebUI by browsing to Cluster Management > Events and Alerts > Settings:

The isi event test command is provided in order to validate the communication path and confirm that events are getting transmitted correctly. The following generates a test alert with the message “Test msg from OneFS”:

# isi event test create "Test msg from OneFS"

Here are the log files that CELOG uses for its various purposes:

Log File	Description
/var/log/isi_celog_monitor.log	System monitoring and event creation
/var/log/isi_celog_capture.log	First stop recording of events, attachment generation
/var/log/isi_celog_analysis.log	Assignment of events to eventgroups
/var/log/isi_celog_reporting.log	Evaluation of alert conditions and sending of alert requests
/var/log/isi_celog_alerting.log	Sending of alerts
/var/log/messages	Heartbeat failures
/ifs/.ifsvar/db/celog_alerting/<channel>/fail.log	Failure messages from alert sending
/ifs/.ifsvar/db/celog_alerting/<channel>/sent.log	Alerts sent via <channel>

These logs can be invaluable for troubleshooting the various components of OneFS events and alerting.

As mentioned previously, CELOG combines multiple events into a single event group. This allows an incident to be communicated and managed as a single, coherent issue. A similar process occurs for multiple instances of the same event. As such, the following deduplication rules apply to these broad categories of events, namely:

Event Category	Descriptions
Repeating Events	Events firing repeatedly before the elapsed time-out value will be condensed into a “… is triggering often” event
Sensor	Multiple events by hardware sensors on a node within a given time-frame will be combined as a “hardware problems” event
Disk	Multiple events generated by a specific disk will be coalesced into a logical “disk problems” event
Network	Various issues may exist, depending on the type of connectivity problem between nodes of a cluster: § When a node cannot contact any other nodes in the cluster, each of its connection errors will be condensed into a “node cannot contact cluster” event. § When a node is not reachable by the rest of the cluster nodes, cluster will combine the connection errors as a “cluster cannot reach node X“ event § When clusters split into chunks, each set of nodes will report connection errors coalesced as a “nodes X, Y, Z cannot contact nodes A, B, C“ event. § When cluster re-forms, events will again be combined into a single logical “cluster split into N groups: { A, B, C }, { X, Y, Z }, …” event. § When connectivity between all nodes is restored and cluster is reformed, the events will be condensed into a single “All nodes lost internal connectivity” event.
Reboot	If a node is unreachable even after a defined time elapses after a reboot, further connection errors will be coalesced as a “node did not rejoin cluster after reboot” event.

So CELOG is your cluster guardian – continuously monitoring the health and performance of the hardware, software, and services – and generating events when situations occur that might require your attention.