OneFS Job Engine and Parallel Restriping – Part 4

In the final article in this series, we take a look at the configuration and management of parallel restriping. To support this, OneFS 9.7 includes a new ‘isi job settings’ CLI command set, allowing the parallel restriper configuration to be viewed and modified. By default, no changes are made to the Job Engine upon upgrade to 9.7, so the legacy behavior allowing only a single restripe job to run at any point in time is preserved. This is reflected in the new ‘isi job settings’ CLI syntax:

# isi job settings view

Parallel Restriper Mode: Off

However, once a OneFS 9.7 upgrade has been committed, the parallel restriper can be configured in one of three modes:

Mode     Description
-------  ----------------------------------------------------------------
Off      Default: legacy restripe exclusion set behavior, with only one
         restripe job permitted.
Partial  FlexProtect/FlexProtectLin runs alone, but all other restripers
         can run together.
All      No restripe job exclusions, beyond the overall limit of three
         concurrently running jobs.
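The mode semantics in the table above can be sketched as a simple admission check. This is purely illustrative – the function and constant names are hypothetical, not the actual Job Engine implementation:

```python
# Illustrative sketch of the parallel restriper admission rules described
# above -- not the actual OneFS Job Engine implementation.

MAX_CONCURRENT_JOBS = 3  # overall Job Engine limit

def restripe_allowed(mode, new_job, running_restripers):
    """Return True if 'new_job' may run alongside 'running_restripers'."""
    if len(running_restripers) >= MAX_CONCURRENT_JOBS:
        return False
    if mode == "off":
        # Legacy behavior: only one restripe job at a time.
        return len(running_restripers) == 0
    if mode == "partial":
        # FlexProtect/FlexProtectLin runs alone; all others may coexist.
        # (In practice, starting FlexProtect pauses the other restripers
        # rather than being refused.)
        flexprotect = {"FlexProtect", "FlexProtectLin"}
        if new_job in flexprotect:
            return len(running_restripers) == 0
        return not any(j in flexprotect for j in running_restripers)
    # mode == "all": no restripe exclusions beyond the concurrency limit.
    return True
```

Note that in ‘partial’ mode the actual Job Engine does not refuse to start FlexProtect; it pauses the other restriping jobs so that FlexProtect runs alone.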

For example, the following CLI command can be used to configure ‘partial’ parallel restriping support:

# isi job settings modify --parallel_restriper_mode=partial

# isi job settings view

Parallel Restriper Mode: Partial

As such, restriping jobs such as SmartPools and MultiScan can run in parallel in ‘partial’ mode, as shown in the following cluster’s CLI output:

# isi job jobs list

ID   Type       State   Impact  Policy  Pri  Phase  Running Time

-----------------------------------------------------------------

3166 MultiScan  Running Low     LOW     4    1/4    17d 8h 5m

3790 SmartPools Running Low     LOW     6    1/2    5d 17h 16m

-----------------------------------------------------------------

Total: 2

However, if a FlexProtect job is started while a cluster is in ‘partial’ mode, all other restriping jobs are automatically paused. For example:

# isi job jobs start FlexProtect

Started job [4088]

# isi job jobs start FlexProtect

Started job [4114]

# isi job jobs list

ID   Type        State              Impact  Policy  Pri  Phase  Running Time

-----------------------------------------------------------------------------

3790 SmartPools  Waiting            Low     LOW     6    1/2    36s

3166 MultiScan   Running -> Waiting Low     LOW     4    1/4    28s

4114 FlexProtect Waiting            Medium  MEDIUM  1    1/6    -

-----------------------------------------------------------------------------

Total: 3

# isi job jobs list

ID   Type        State   Impact  Policy  Pri  Phase  Running Time

------------------------------------------------------------------

3166 MultiScan   Waiting Low     LOW     4    1/4    17d 8h 7m

3790 SmartPools  Waiting Low     LOW     6    1/2    5d 17h 17m

4088 FlexProtect Running Medium  MEDIUM  1    1/6    2s

------------------------------------------------------------------

Total: 3

Similarly, the removal of all restripe job exclusions can be configured with the following CLI syntax:

# isi job settings modify --parallel_restriper_mode=all

This allows any of the restriping jobs, including FlexProtect, to run in parallel up to the Job Engine limit of three concurrent jobs. For example, MultiScan and SmartPools are both running below:

# isi job jobs list

ID   Type       State   Impact  Policy  Pri  Phase  Running Time

-----------------------------------------------------------------

3166 MultiScan  Waiting Low     LOW     4    1/4    17d 8h 7m

3790 SmartPools Waiting Low     LOW     6    1/2    5d 17h 17m

-----------------------------------------------------------------

Total: 2

# isi job settings view

Parallel Restriper Mode: All

If the FlexProtect job is then started, all three restriping jobs are allowed to run concurrently:

# isi job jobs start FlexProtect

Started job [4089]

# isi job jobs list

ID   Type        State   Impact  Policy  Pri  Phase  Running Time

------------------------------------------------------------------

3166 MultiScan   Running Low     LOW     4    1/4    17d 8h 8m

3790 SmartPools  Running Low     LOW     6    1/2    5d 17h 18m

4089 FlexProtect Running Medium  MEDIUM  1    1/6    3s

------------------------------------------------------------------

Total: 3

Furthermore, the restripe jobs, including FlexProtect, can be run with the desired priority and impact settings. For example:

# isi job jobs start FlexProtect --policy LOW --priority 6

Started job [4100]

# isi job jobs list

ID   Type        State   Impact  Policy  Pri  Phase  Running Time

------------------------------------------------------------------

4097 SmartPools  Running Medium  MEDIUM  1    1/2    1m 42s

4098 MultiScan   Running Medium  MEDIUM  1    1/4    1m 13s

4100 FlexProtect Running Low     LOW     6    1/6    -

------------------------------------------------------------------

Total: 3

If necessary, the Job Engine can always be easily reverted to its default restripe exclusion set behavior, with only one restripe job permitted, as follows:

# isi job settings modify --parallel_restriper_mode=off

Note that a user account with the PRIV_JOB_ENGINE RBAC privilege is required to configure the parallel restripe settings.

Similar to other Job Engine configuration, the parallel restripe settings are stored in gconfig under the core.parallel_restripe_mode tree.

Like any multi-threaded or parallel architecture, contending restriping jobs may lock LINs for extended periods as a result of larger range locks. Also, since restriping jobs by their nature move blocks around, they tend to be quite hard on drives. As such, multiple restripers running in parallel have the potential to impact cluster performance, and potentially client I/O (protocol throughput, and so on) – especially if the contending restripe jobs are run at a MEDIUM impact level.

Also note that the new parallel restripe mode only applies to waiting jobs, or jobs transitioning between phases. Typically, if you attempt to start a second job with restripe exclusion enabled, that second job will be placed into a ‘waiting’ state. If parallel restripe is then enabled, the second job will be re-evaluated, and promoted to a ‘running’ state. However, if both jobs are running and parallel restripe is then disabled, the second job will not automatically be paused. Instead, intervention from a cluster admin would be needed to manually pause that job, if desired.

Note too that restripe exclusion is on a per-job-phase basis. For example, the MultiScan job has four phases. The first three can restripe, while the fourth does not. As such, a different restriping job (e.g. SmartPools or FlexProtect) will not conflict with MultiScan’s fourth phase. There’s also no need to run AutoBalance and a restriping MultiScan at the same time since they do exactly the same thing.
Additionally, unless there’s a really valid reason to, a good practice is to avoid running AutoBalance or MultiScan while FlexProtect is running. Re-protecting the cluster is usually of considerably more importance than correct balance, so allowing FlexProtect to consume any available resources while it’s running is typically a prudent move.

When troubleshooting the parallel restriper, note that the Job Engine coordinator logs to both isi_job_d.log and /var/log/messages, writing both the initial value and any subsequent configuration change. This can be a good thing to check if unexpectedly high drive load is encountered – perhaps someone inadvertently enabled parallel restripe, or forgot to disable it again after an intended short-term configuration change.

OneFS Job Engine and Parallel Restriping – Part 3

One of the issues is that, in trying to keep the cluster healthy, jobs such as FlexProtect, MultiScan, and AutoBalance are run, often in degraded conditions. And these maintenance jobs are conflicting with customer assigned jobs like SmartPools, in particular.

In order to run restripe jobs in parallel, the Job Engine makes use of multi-writer. Within the OneFS locking hierarchy, multi-writer allows a cluster to support concurrent writes to the same file from multiple writer threads. This granular write locking is achieved by sub-dividing the file into separate regions and granting exclusive data write locks to these individual ranges, as opposed to the entire file. This process allows multiple clients, or write threads, attached to a node to simultaneously write to different regions of the same file.

Concurrent writes to a single file need more than just supporting data locks for ranges. Each writer also needs to update a file’s metadata attributes such as timestamps, block count, etc.

A mechanism for managing inode consistency is also needed, since OneFS is based on the concept of a single inode lock per file.

In addition to the standard shared read and exclusive write locks, OneFS also provides the following locking primitives, via journal deltas, to allow multiple threads to simultaneously read or write a file’s metadata attributes:

Lock Type   Description
----------  ----------------------------------------------------------------
Exclusive   A thread can read or modify any field in the inode. When the
            transaction is committed, the entire inode block is written to
            disk, along with any extended attribute blocks.
Shared      A thread can read, but not modify, any inode field.
DeltaWrite  A thread can modify any inode fields which support delta writes.
            These operations are sent to the journal as a set of deltas when
            the transaction is committed.
DeltaRead   A thread can read any field which cannot be modified by inode
            deltas.

These locks allow separate threads to have a Shared lock on the same LIN, or for different threads to have a DeltaWrite lock on the same LIN. However, it is not possible for one thread to have a Shared lock and another to have a DeltaWrite. This is because the Shared thread cannot perform a coherent read of a field which is in the process of being modified by the DeltaWrite thread.
The DeltaRead lock is compatible with both the Shared and DeltaWrite lock. Typically the filesystem will attempt to take a DeltaRead lock for a read operation, and a DeltaWrite lock for a write, since this allows maximum concurrency as all these locks are compatible.

Here’s what the write lock compatibilities look like:

Held \ Requested  Exclusive  Shared  DeltaWrite  DeltaRead
----------------  ---------  ------  ----------  ---------
Exclusive         No         No      No          No
Shared            No         Yes     No          Yes
DeltaWrite        No         No      Yes         Yes
DeltaRead         No         Yes     Yes         Yes
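As a rough sketch, these compatibility rules can be encoded as a simple lookup – illustrative only, not the actual OneFS lock manager implementation:

```python
# Sketch of the inode lock compatibility rules described above. The lock
# names mirror the table; the real OneFS lock manager is far more involved.

COMPATIBLE = {
    ("Shared", "Shared"), ("Shared", "DeltaRead"),
    ("DeltaRead", "Shared"), ("DeltaRead", "DeltaRead"),
    ("DeltaRead", "DeltaWrite"), ("DeltaWrite", "DeltaRead"),
    ("DeltaWrite", "DeltaWrite"),
    # Exclusive is compatible with nothing, so it never appears here.
}

def locks_compatible(held, requested):
    """Can 'requested' be granted on a LIN while 'held' is in place?"""
    return (held, requested) in COMPATIBLE
```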

OneFS protects data by writing file blocks (restriping) across multiple drives on different nodes. The Job Engine defines a ‘restripe set’ comprising jobs which involve file system management, protection and on-disk layout. The restripe set contains the following jobs:

  • AutoBalance & AutoBalanceLin
  • FlexProtect & FlexProtectLin
  • FilePolicy
  • MediaScan
  • MultiScan
  • SetProtectPlus
  • SmartPools
  • Upgrade

Note that OneFS multi-writer ranges are not a fixed size; instead, they are tied to layout/protection groups, so they are typically in the megabyte size range.

The number of threads that can write to the same file concurrently, from the filesystem perspective, is only limited by file size. However, NFS file handle affinity (FHA) comes into play from the protocol side, and so the default is typically eight threads per node.

The clients themselves do not apply for granular write range locks in OneFS, since multi-writer operation is completely invisible to the protocol. Multi-writer uses proprietary locking which OneFS performs to coordinate filesystem operations. As such, multi-writer is distinct from byte-range locking that application code would call, or even oplocks/leases which the client protocol stack would call.

Depending on the workload, multi-writer can improve performance by allowing for more concurrency. Unnecessary contention should be avoided as a general rule. For example:

  • Avoid placing unrelated data in the same directory. Use multiple directories instead. Even if it is related, split it up if there are many entries.
  • Similarly, use multiple files. Even if the data is ultimately related, from a performance/scalability perspective, having each client use its own file and then combining them as a final stage is the correct way to architect for performance.

Multi-writer for restripe, introduced in OneFS 8.0, allows multiple restripe worker threads to operate on a single file concurrently. This in turn improves read/write performance during file re-protection operations, plus helps reduce the window of risk (MTTDL) during drive Smartfails, etc. This is particularly true for workflows consisting of large files, while one of the above restripe jobs is running. Typically, the larger the files on the cluster, the more benefit multi-writer for restripe will offer.

With multi-writer for restripe, an exclusive lock is no longer required on the LIN during the actual restripe of data. Instead, OneFS tries to use a delta write lock to update the cursors used to track which parts of the file need restriping. This means that a client application or program should be able to continue to write to the file while the restripe operation is underway. An exclusive lock is only required for a very short period of time while a file is set up to be restriped.  A file will have fixed widths for each restripe lock, and the number of range locks will depend on the quantity of threads and nodes which are actively restriping a single file.

Prior to the multi-writer feature work, back in OneFS 8.0 (‘Riptide’), it was simply unsafe to run multiple restripe jobs. Since then, it has been possible for these jobs to contend. However, these are often the jobs whose performance customers complain about. So an abundance of caution was exercised, and field feedback gathered, before engineering made the decision to allow parallel restriping.

On committing a OneFS 9.7 upgrade, the default mode is to change nothing and retain the restriping exclusion set and its single job restriction. However, a new CLI configuration option is now provided, allowing a cluster admin with the PRIV_JOB_ENGINE RBAC privilege to enable parallel restripe, if so desired.

There is no WebUI option to configure parallel restripe at this point – just CLI and platform API for now.

Most of the restriping jobs impact the cluster more heavily than desirable. So, depending on how loaded the cluster is, it was prudent to continue with the exclusion set as default, and allow the customer to make changes appropriate to their environment.

OneFS Job Engine and Parallel Restriping – Part 2

The Job Engine resource monitoring and execution framework allows jobs to be throttled based on both CPU and disk I/O metrics. The granularity of the resource utilization monitoring data provides the coordinator process with visibility into exactly what is generating IOPS on any particular drive across the cluster. This level of insight allows the coordinator to make very precise determinations about exactly where and how impact control is best applied. As we will see, the coordinator itself does not communicate directly with the worker threads, but rather with the director process, which in turn instructs a node’s manager process for a particular job to cut back threads.

For example, if the job engine is running a low-impact job and CPU utilization drops below the threshold, the worker thread count is gradually increased up to the maximum defined by the ‘low’ impact policy threshold. If client load on the cluster suddenly spikes for some reason, then the number of worker threads is gracefully decreased. The same principle applies to disk I/O, where the job engine will throttle back in relation to both IOPS and the number of I/O operations waiting to be processed in any drive’s queue. Once client load has decreased again, the number of worker threads is correspondingly increased to the maximum ‘low’ impact threshold.

In summary, detailed resource utilization telemetry allows the job engine to automatically tune its resource consumption to the desired impact level and customer workflow activity.

Certain jobs, if left unchecked, could consume vast quantities of a cluster’s resources, contending with and impacting client I/O. To counteract this, the Job Engine employs a comprehensive work throttling mechanism which is able to limit the rate at which individual jobs can run. Throttling is employed at a per-manager process level, so job impact can be managed both granularly and gracefully.

Every twenty seconds, the coordinator process gathers cluster CPU and individual disk I/O load data from all the nodes across the cluster. The coordinator uses this information, in combination with the job impact configuration, to decide how many threads may run on each cluster node to service each running job. This can be a fractional number, and fractional thread counts are achieved by having a thread sleep for a given percentage of each second.

Using this CPU and disk I/O load data, every sixty seconds the coordinator evaluates how busy the various nodes are and makes a job throttling decision, instructing the various job engine processes as to the action they need to take. This enables throttling to be sensitive to workloads in which CPU and disk I/O load metrics yield different results. Additionally, there are separate load thresholds tailored to the different classes of drives utilized in OneFS powered clusters, including high speed SAS drives, lower performance SATA disks and flash-based solid-state drives (SSDs).
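The fractional thread counts mentioned above can be modeled as a duty cycle: an allocation of, say, 2.25 threads means two full-time workers plus one worker that runs for a quarter of each second. A minimal sketch, with hypothetical helper names:

```python
import time

# Sketch of the fractional worker-thread idea described above: a thread
# allocation of e.g. 2.25 is realized as two full threads plus one thread
# that sleeps for 75% of every second. Purely illustrative.

def split_thread_allocation(threads):
    """Split a fractional allocation into (full_threads, duty_cycle)."""
    full = int(threads)
    duty_cycle = threads - full          # fraction of each second to work
    return full, duty_cycle

def fractional_worker_tick(duty_cycle, do_work, period=1.0):
    """Work for duty_cycle of the period, then sleep the remainder."""
    deadline = time.monotonic() + period * duty_cycle
    while time.monotonic() < deadline:
        do_work()
    time.sleep(period * (1.0 - duty_cycle))
```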

The Job engine allocates a specific number of threads to each node by default, thereby controlling the impact of a workload on the cluster. If little client activity is occurring, more worker threads are spun up to allow more work, up to a predefined worker limit. For example, the worker limit for a low-impact job might allow one or two threads per node to be allocated, a medium-impact job from four to six threads, and a high-impact job a dozen or more. When this worker limit is reached (or before, if client load triggers impact management thresholds first), worker threads are throttled back or terminated.

For example, a node has four active threads, and the coordinator instructs it to cut back to three. The fourth thread is allowed to finish the individual work item it is currently processing, but then quietly exit, even though the task as a whole might not be finished. A restart checkpoint is taken for the exiting worker thread’s remaining work, and this task is returned to a pool of tasks requiring completion. This unassigned task is then allocated to the next worker thread that requests a work assignment, and processing continues from the restart check-point. This same mechanism applies in the event that multiple jobs are running simultaneously on a cluster.
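The cutback behavior described above can be modeled as a shared pool of tasks with restart checkpoints. The class and field names here are hypothetical stand-ins, not the real Job Engine structures:

```python
from collections import deque

# Illustrative model of the cutback behavior described above: an exiting
# worker checkpoints its remaining work and returns the task to a shared
# pool, where the next free worker picks it up.

class TaskPool:
    def __init__(self, tasks):
        self.unassigned = deque(tasks)   # tasks awaiting a worker

    def request_work(self):
        return self.unassigned.popleft() if self.unassigned else None

    def checkpoint(self, task, progress):
        """Record a restart checkpoint and requeue the remaining work."""
        task["resume_from"] = progress
        self.unassigned.append(task)

pool = TaskPool([{"range": (0, 1000), "resume_from": 0}])
task = pool.request_work()
# Coordinator says cut back: finish the current item, checkpoint the rest.
pool.checkpoint(task, progress=400)
resumed = pool.request_work()        # next worker resumes from item 400
```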

Not all OneFS Job Engine jobs run equally fast. For example, a job which is based on a file system tree walk will run slower on a cluster with a very large number of small files than on a cluster with a low number of large files.  Jobs which compare data across nodes, such as Dedupe, will run more slowly where there are many more comparisons to be made.  Many factors play into this, and true linear scaling is not always possible. If a job is running slowly the first step is to discover what the specific context of the job is.

There are four main methods for jobs, and their associated processes, to interact with the file system:

Method      Description
----------  ----------------------------------------------------------------
LIN Scan    Via metadata, using a LIN scan. An example of this is the
            IntegrityScan job, when performing an on-line file system
            verification.
Tree Walk   Traversing the directory structure directly via a tree walk. For
            example, the SmartPoolsTree restriping job, when enacting file
            pool policies on a file system subtree.
Drive Scan  Directly accessing the underlying cylinder groups and disk
            blocks, via a linear drive scan. For example, the MediaScan
            restriping job, when looking for bad disk sectors.
Changelist  Scanning a list of changed files. For example, the FilePolicy
            restriping job, which, in conjunction with IndexUpdate, provides
            an efficient SmartPools file pool policy job.

Each of these approaches has its fortes and drawbacks and will suit particular jobs. The specific access method influences the run time of a job. For instance, some jobs are unaffected by cluster size, others slow down or accelerate with the more nodes a cluster has, and some are highly influenced by file counts and directory depths.

For a number of jobs, particularly the LIN-based ones, the job engine will provide an estimated percentage completion of the job during runtime (see figure 20 below).

With LIN scans, even though the metadata is of variable size, the job engine can fairly accurately predict how much effort will be required to scan all LINs. The data, however, can be of widely-variable size, and so estimates of how long it will take to process each task will be a best reasonable guess.

For example, the job engine might know that the highest LIN is 1:0009:0000. Assuming the job will start with a single thread on each of three nodes, the coordinator evenly divides the LINs into nine ranges: 1:0000:0000-1:0000:ffff, 1:0001:0000-1:0001:ffff, etc., through 1:0008:0000-1:0009:0000. These nine tasks would then be divided between the three nodes. However, there is no guarantee that each range will take the same time to process. For example, the first range may have fewer actual LINs, as a result of old LINs having been deleted, and so completes unexpectedly fast. Perhaps the third range contains a disproportionate number of large files and so takes longer to process. And maybe the seventh range has heavy contention with client activity, also resulting in an increased execution time. Despite such variances, the splitting and redistribution of tasks across the node manager processes alleviates this issue, mitigating the need for perfectly-fair divisions at the outset.
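The even division of the LIN space described above can be sketched as follows, modeling LINs as plain integers for simplicity (the real LIN format and task distribution are more involved):

```python
# Sketch of the LIN range division described above: the coordinator splits
# the LIN space below the highest LIN into equal, contiguous ranges for
# the worker threads.

def split_lin_ranges(highest_lin, n_ranges):
    """Divide [0, highest_lin) into n_ranges contiguous (start, end) tasks."""
    step = highest_lin // n_ranges
    ranges = []
    for i in range(n_ranges):
        start = i * step
        end = highest_lin if i == n_ranges - 1 else (i + 1) * step
        ranges.append((start, end))
    return ranges

# Nine tasks of width 0x10000, mirroring the example in the text:
tasks = split_lin_ranges(9 * 0x10000, 9)
```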

Priorities play a large role in job initiation and it is possible for a high priority job to significantly impact the running of other jobs.  This is by design, since FlexProtect should be able to run with a greater level of urgency than SmartPools, for example. However, sometimes this can be an inconvenience, which is why the storage administrator has the ability to manually control the impact level and relative priority of jobs.

Certain jobs like FlexProtect have a corresponding job provided with a name suffixed by ‘Lin’, for example FlexProtectLin. This indicates that the job will automatically, where available, use an SSD-based copy of metadata to scan the LIN tree, rather than the drives themselves. Depending on the workflow, this will often significantly improve job runtime performance.

In situations where the job engine sees the available capacity on one or more disk pools fall below a low space threshold, it engages low space mode. This enables space-saving jobs to run and reclaim space before the job engine or even the cluster become unusable. When the job engine is in low-space mode new jobs will not be started, and any jobs that are not space-saving will be paused. Once free space returns above the low-space threshold, jobs that have been paused for space are resumed.

The space-saving jobs are:

  • AutoBalance(LIN)
  • Collect
  • MultiScan
  • ShadowStoreDelete
  • SnapshotDelete
  • TreeDelete

Once the cluster is no longer space constrained, any paused jobs are automatically resumed.
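The low-space behavior described above amounts to a simple gating rule. A sketch, with a hypothetical threshold value:

```python
# Illustrative sketch of the low-space gating described above: when free
# space drops below the threshold, only space-saving jobs may run or start.

SPACE_SAVING_JOBS = {
    "AutoBalance", "AutoBalanceLin", "Collect", "MultiScan",
    "ShadowStoreDelete", "SnapshotDelete", "TreeDelete",
}

def job_action(job, free_pct, low_space_threshold_pct=5.0):
    """Decide what happens to a job under the low-space rules.
    The threshold value here is a hypothetical placeholder."""
    if free_pct >= low_space_threshold_pct:
        return "run"                      # normal operation
    return "run" if job in SPACE_SAVING_JOBS else "pause"
```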

Until OneFS 9.7, the Job Engine had two clearly defined ‘exclusion sets’ for classes of jobs that could potentially cause performance or data integrity issues if run together. These exclusion sets ensure that job phases with overlapping exclusion sets do not run at the same time, with the lower priority job left waiting.

The first of these is the marking exclusion set, which includes Collect and IntegrityScan. This set is strictly enforced, since OneFS can only permit a single marking job to run at a time without risking corruption.

The other is the restripe exclusion set, the focus of this Job Engine enhancement. The restripe set comprises the jobs that move /ifs data blocks around for repair, balance, tiering, etc., in a process known as ‘restriping’ in the OneFS vernacular. These jobs include FlexProtect, MediaScan, AutoBalance, and SmartPools, plus its sidekick, FilePolicy. Restriping typically has three specific goals:

Goal       Description
---------  -----------------------------------------------------------------
Repair     Ensures that files have the proper protection after the loss of a
           storage device.
Reprotect  Moves files and reprotects them based on their file pool policy,
           while repairing at the same time, if needed.
Rebalance  Ensures the correct placement of a file's blocks to balance the
           drives based on the file's policy and protection settings.

The fundamental responsibility of the jobs within the Restripe exclusion set is to ensure that the data on /ifs is protected at the desired level, balanced across nodes, and properly accounted for. It does this by running various file system maintenance jobs either manually, via a predefined schedule, or based on a cluster event, like a group change. These jobs include:

MultiScan

The MultiScan job, which combines the functionality of AutoBalance and Collect, is automatically run after a group change which adds a device to the cluster. AutoBalance(Lin) and/or Collect are only run manually if MultiScan has been disabled.

In addition to group change notifications, MultiScan is also started when:

  • Data is unbalanced within one or more disk pools, which triggers MultiScan to start the AutoBalance phase only.
  • When drives have been unavailable for long enough to warrant a Collect job, which triggers MultiScan to start both its AutoBalance and Collect phases.

AutoBalance

The goal of the AutoBalance job is to ensure that each node has the same amount of data on it, in order to balance data evenly across the cluster. AutoBalance, along with the Collect job, is run after any cluster group change, unless there are any storage nodes in a “down” state.

Upon visiting each file, AutoBalance performs the following two operations:

  • File level rebalancing
  • Full array rebalancing

For file level rebalancing, AutoBalance evenly spreads data across the cluster’s nodes in order to achieve balance within a particular file. And with full array rebalancing, AutoBalance moves data between nodes to achieve an overall cluster balance within a 5% delta across nodes.
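The full-array balance target described above (a 5% delta across nodes) can be expressed as a simple check over per-node utilization:

```python
# Sketch of the full-array balance check described above: the cluster is
# considered balanced when node utilization stays within a 5% delta.

def is_balanced(node_used_pct, max_delta=5.0):
    """True if the spread between the most- and least-full nodes is
    within max_delta percentage points."""
    return (max(node_used_pct) - min(node_used_pct)) <= max_delta
```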

There is also an AutoBalanceLin job available, which can be run in place of AutoBalance when the cluster has a metadata copy available on SSD, providing an expedited job runtime. The following CLI syntax will enable the AutoBalanceLin job:

# isi_gconfig -t job-config jobs.common.lin_based_jobs = True

Collect

The Collect job is responsible for locating unused inodes and data blocks across the file system. Collect runs by default after a cluster group change, in conjunction with AutoBalance, as part of the MultiScan job.

In its first phase, Collect performs a marking job, scanning all the inodes (LINs) and identifying their associated blocks. Collect marks all the blocks which are currently allocated and in use, and any unmarked blocks are identified as candidates to be freed for reuse, so that the disk space they occupy can be reclaimed and re-allocated. All metadata must be read in this phase in order to mark every reference, and must be done completely, to avoid sweeping in-use blocks and introducing allocation corruption.

Collect’s second phase scans all the cluster’s drives and performs the freeing up, or sweeping, of any unmarked blocks so that they can be reused.
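The two Collect phases described above follow the classic mark-and-sweep pattern, which can be sketched as follows (the data structures here are simplified stand-ins for the real on-disk ones):

```python
# Minimal mark-and-sweep sketch of the two Collect phases described above:
# phase 1 marks every block reachable from an inode, phase 2 sweeps the rest.

def collect(inodes, all_blocks):
    # Phase 1 (mark): walk every inode and mark its referenced blocks.
    marked = set()
    for blocks in inodes.values():
        marked.update(blocks)
    # Phase 2 (sweep): any unmarked block is freed for reuse.
    freed = set(all_blocks) - marked
    return freed

inodes = {"lin1": {1, 2}, "lin2": {3}}
freed = collect(inodes, all_blocks={1, 2, 3, 4, 5})
```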

MediaScan

MediaScan’s role within the file system protection framework is to periodically check for and resolve drive bit errors across the cluster. This proactive data integrity approach helps guard against a phenomenon known as ‘bit rot’, and the resulting specter of hardware induced silent data corruption.

MediaScan is run as a low-impact, low-priority background process, based on a predefined schedule (monthly, by default).

First, MediaScan’s search and repair phase checks the disk sectors across all the drives in a cluster and, where necessary, utilizes OneFS’ dynamic sector repair (DSR) process to resolve any ECC sector errors that it encounters. For any ECC errors which can’t immediately be repaired, MediaScan will first try to read the disk sector again several times in the hopes that the issue is transient, and the drive can recover. Failing that, MediaScan will attempt to restripe files away from irreparable ECCs. Finally, the MediaScan summary phase generates a report of the ECC errors found and corrected.
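The search and repair flow described above can be sketched as follows, with the three recovery actions passed in as hypothetical callables:

```python
# Sketch of the MediaScan ECC-handling flow described above: attempt
# dynamic sector repair, retry the read in case the error is transient,
# and finally restripe the file away from the bad sector. The callables
# are hypothetical placeholders.

def handle_ecc(repair_sector, read_sector, restripe_away, retries=3):
    if repair_sector():              # dynamic sector repair (DSR) first
        return "repaired"
    for _ in range(retries):
        if read_sector():            # the error may be transient
            return "recovered"
    restripe_away()                  # last resort: move data off the sector
    return "restriped"
```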

IntegrityScan

The IntegrityScan job is responsible for examining the entire live file system for inconsistencies. It does this by systematically reading every block and verifying its associated checksum. Unlike traditional ‘fsck’ style file system integrity checking tools, IntegrityScan is designed to run while the cluster is fully operational, thereby removing the need for any downtime. In the event that IntegrityScan detects a checksum mismatch, it generates an alert, logs the error to the IDI logs, and provides a full report upon job completion.

IntegrityScan is typically run manually if the integrity of the file system is ever in doubt. Although the job itself may take several days or more to complete, the file system is online and completely available during this time. Additionally, like all phases of the OneFS job engine, IntegrityScan can be prioritized, paused or stopped, depending on the impact to cluster operations.

FlexProtect

The FlexProtect job is responsible for maintaining the appropriate protection level of data across the cluster. For example, it ensures that a file which is configured to be protected at +2n is actually protected at that level. Given this, FlexProtect is arguably the most critical of the OneFS maintenance jobs, because it represents the Mean Time To Repair (MTTR) of the cluster, which has an exponential impact on MTTDL. Any failure or delay has a direct impact on the reliability of the OneFS file system.

In addition to FlexProtect, there is also a FlexProtectLin job. FlexProtectLin is run by default when there is a copy of file system metadata available on solid state drive (SSD) storage. FlexProtectLin typically offers significant runtime improvements over its conventional disk based counterpart.

As such, the primary purpose of FlexProtect is to repair nodes and drives which need to be removed from the cluster. In the case of a cluster group change, for example the addition or subtraction of a node or drive, OneFS automatically informs the job engine, which responds by starting a FlexProtect job. Any drives and/or nodes to be removed are marked with OneFS’ ‘restripe_from’ capability. The job engine coordinator notices that the group change includes a newly-smart-failed device and then initiates a FlexProtect job in response.

FlexProtect falls within the job engine’s restriping exclusion set and, similar to AutoBalance, comes in two flavors: FlexProtect and FlexProtectLin.

Run automatically after a drive or node removal or failure, FlexProtect locates any unprotected files on the cluster, and repairs them as rapidly as possible.  The FlexProtect job runs by default with an impact level of ‘medium’ and a priority level of ‘1’, and includes six distinct job phases:


Job Phase       Description
--------------  --------------------------------------------------------------
Drive Scan      The job engine scans the disks for inodes needing repair. If
                an inode needs repair, the job engine sets the LIN's 'needs
                repair' flag for use in the next phase.
LIN Verify      This phase scans the OneFS LIN tree to address the
                limitations of the drive scan.
LIN Re-verify   The prior repair phases can miss protection group and
                metatree transfers. FlexProtect may have already repaired the
                destination of a transfer, but not the source. If a LIN is
                being restriped when a metatree transfer occurs, it is added
                to a persistent queue, and this phase processes that queue.
Repair          LINs with the 'needs repair' flag set are passed to the
                restriper for repair. This phase needs to progress quickly,
                and the job engine workers perform parallel execution across
                the cluster.
Check           This phase ensures that all LINs were repaired by the
                previous phases as expected.
Device Removal  The successfully repaired nodes and drives that were marked
                'restripe from' at the beginning of phase 1 are removed from
                the cluster in this phase. Any additional nodes and drives
                which were subsequently failed remain in the cluster, with
                the expectation that a new FlexProtect job will handle them
                shortly.

Be aware that, prior to OneFS 8.2, FlexProtect was the only job allowed to run if a cluster was in degraded mode, such as when a drive had failed. Other jobs were automatically paused and did not resume until FlexProtect had completed and the cluster was healthy again. In OneFS 8.2 and later, FlexProtect does not pause when there is only one temporarily unavailable device in a disk pool, when a device is smartfailed, or for dead devices.

The FlexProtect job executes in userspace and generally repairs any components marked with the ‘restripe from’ bit as rapidly as possible. Within OneFS, a LIN tree reference is stored inside the inode, a logical block, and a B-tree describes the mapping between each logical offset and the physical data blocks:

In order for FlexProtect to avoid the overhead of traversing the whole path from the LIN tree reference -> LIN tree -> B-tree -> logical offset -> data block, it leverages the OneFS construct known as the ‘Width Device List’ (WDL). The WDL enables FlexProtect to perform fast drive scanning of inodes, because the inode contents alone are sufficient to determine whether a restripe is needed. The WDL keeps a list of the drives in use by a particular file, and is stored as an attribute within the inode, and is thus protected by mirroring. There are two WDL attributes in OneFS, one for data and one for metadata. The WDL is primarily used by FlexProtect to determine whether an inode references a degraded node or drive. New or replaced drives are automatically added to the WDL as part of new allocations.
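
To make the idea concrete, here’s a minimal, purely illustrative Python sketch (not OneFS code; the function name and drive IDs are invented) of how a per-inode device list allows a repair decision without ever touching the B-tree or data blocks:

```python
# Conceptual model only: an inode carries two Width Device Lists (data
# and metadata). A scanner can decide "does this file need restripe?"
# purely from inode contents, by intersecting the WDLs with the set of
# degraded drive IDs.

def needs_restripe(wdl_data, wdl_meta, degraded):
    """Return True if the inode's data or metadata WDL references
    any degraded drive ID."""
    return bool((set(wdl_data) | set(wdl_meta)) & set(degraded))

# Example: file striped across drives 3, 7, 11; drive 7 is smartfailed.
print(needs_restripe([3, 7, 11], [3], {7}))   # True
print(needs_restripe([3, 11], [3], {7}))      # False
```

The key point the sketch captures is that the check is a cheap set intersection on inode metadata, rather than a walk of the file’s block map.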

As mentioned previously, the FlexProtect job has two distinct variants. In the FlexProtectLin version of the job, the Drive Scan and LIN Verify phases are redundant and therefore removed, while the other phases remain identical. FlexProtectLin is preferred when at least one metadata mirror is stored on SSD, providing substantial job performance benefits.

In addition to automatic job execution after a drive or node removal or failure, FlexProtect can also be initiated on demand. The following CLI syntax will kick off a manual job run:

# isi job start flexprotect
Started job [274]

# isi job list
ID   Type        State   Impact  Pri  Phase  Running Time
----------------------------------------------------------
274  FlexProtect Running Medium  1    1/6    4s
----------------------------------------------------------
Total: 1

The FlexProtect job’s progress can be tracked via a CLI command as follows:

# isi job jobs view 274
               ID: 274
             Type: FlexProtect
            State: Succeeded
           Impact: Medium
           Policy: MEDIUM
              Pri: 1
            Phase: 6/6
       Start Time: 2020-12-04T17:13:38
     Running Time: 17s
     Participants: 1, 2, 3
         Progress: No work needed
Waiting on job ID: -
      Description: {"nodes": "{}", "drives": "{}"}

Upon completion, the FlexProtect job report, detailing all six stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view <job_id>

OneFS Job Engine and Parallel Restriping

One of the cluster’s functional areas that sees feature enhancement love in the new OneFS 9.7 release is the Job Engine. Specifically, the ability to support multiple restriping jobs.

As you’re probably aware, the Job Engine is a OneFS service, or daemon, that runs cluster housekeeping jobs, storage services, plus a variety of user-initiated data management tasks. As such, the Job Engine performs a diverse, and not always complementary, set of roles. On one hand it attempts to keep the cluster healthy and balanced, while mitigating performance impact, and still allowing customers to perform on-demand large parallel cluster-wide deletes, full-tree permissions management, data tiering, etc.

At a high level, this new OneFS 9.7 parallel restriping feature enables the Job Engine to run multiple restriping jobs at the same time. Restriping in OneFS is the process whereby filesystem blocks are moved around for repair, balance, tiering, etc. These restriping jobs include FlexProtect, MediaScan, AutoBalance, MultiScan, SmartPools, etc.

As such, an example of parallel restriping could be running SmartPools alongside MultiScan, helping to unblock a data tiering workflow which was stuck behind an important cluster maintenance job. The following OneFS 9.7 example shows the FlexProtectLin, MediaScan, and SmartPools restriping jobs running concurrently:

# isi job jobs list
ID   Type           State   Impact  Policy  Pri  Phase  Running Time
---------------------------------------------------------------------
2273 MediaScan      Running Low     LOW     8    1/8    7h 57m
2275 SmartPools     Running Low     LOW     6    1/2    9m 44s
2305 FlexProtectLin Running Medium  MEDIUM  1    1/4    10s
---------------------------------------------------------------------
Total: 3

By way of contrast, in releases prior to OneFS 9.7, only a single restriping job could run at any point in time, and any additional restriping jobs were automatically placed in a waiting state. But before getting into the details of the parallel restriping feature, a quick review of the Job Engine, and its structure and function, could be useful.

In OneFS, the Job Engine runs across the entire cluster and is responsible for dividing and conquering large storage management and protection tasks. To achieve this, it reduces a task into smaller work items and then allocates, or maps, these portions of the overall job to multiple worker threads on each node. Progress is tracked and reported on throughout job execution and a detailed report and status is presented upon completion or termination.

A comprehensive check-pointing system allows jobs to be paused and resumed, in addition to stopped and started. Additionally, the Job Engine also includes an adaptive impact management system, CPU and drive-sensitive impact control, and the ability to run up to three jobs at once.

Jobs are executed as background tasks across the cluster, using spare or specially reserved capacity and resources, and can be categorized into three primary classes:

Category Description
File System Maintenance Jobs These jobs perform background file system maintenance, and typically require access to all nodes. These jobs are required to run in default configurations, and often in degraded cluster conditions. Examples include file system protection and drive rebuilds.
Feature Support Jobs The feature support jobs perform work that facilitates some extended storage management function, and typically only run when the feature has been configured. Examples include deduplication and anti-virus scanning.
User Action Jobs These jobs are run directly by the storage administrator to accomplish some data management goal. Examples include parallel tree deletes and permissions maintenance.

Although the file system maintenance jobs are run by default, either on a schedule or in reaction to a particular file system event, any Job Engine job can be managed by configuring both its priority-level (in relation to other jobs) and its impact policy.

Job Engine jobs often comprise several phases, each of which is executed in a pre-defined sequence. For instance, jobs like TreeDelete comprise a single phase, whereas more complex jobs like FlexProtect and MediaScan have multiple distinct phases.

A job phase must be completed in entirety before the job can progress to the next phase. If any errors occur during execution, the job is marked “failed” at the end of that particular phase and the job is terminated.

Each job phase is composed of a number of work chunks, or Tasks. Tasks, which are comprised of multiple individual work items, are divided up and load balanced across the nodes within the cluster. Successful execution of a work item produces an item result, which might contain a count of the number of retries required to repair a file, plus any errors that occurred during processing.
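
As a rough illustration of this division of labor, the following Python sketch splits a phase’s work items into fixed-size tasks and deals them round-robin across nodes. The helper and its chunking policy are invented for illustration, not Job Engine internals:

```python
# Hypothetical sketch of "divide and conquer": a phase's work items are
# grouped into tasks (chunks), and tasks are load balanced across the
# nodes of the cluster.

def divide_phase(work_items, task_size, node_count):
    # Chunk the work items into tasks of task_size items each.
    tasks = [work_items[i:i + task_size]
             for i in range(0, len(work_items), task_size)]
    # Deal the tasks round-robin across nodes 1..node_count.
    allocation = {n: [] for n in range(1, node_count + 1)}
    for i, task in enumerate(tasks):
        allocation[i % node_count + 1].append(task)
    return allocation

# 10 work items, tasks of 3 items, spread over a 3-node cluster:
alloc = divide_phase(list(range(10)), 3, 3)
print(alloc)  # node 1 receives two tasks, nodes 2 and 3 one task each
```

In the real Job Engine the allocation is dynamic (nodes request and exchange work as they finish), but the task/work-item decomposition follows the same shape.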

When a Job Engine job needs to work on a large portion of the file system, there are four main methods available to accomplish this. The most straightforward access method is via metadata, using a Logical Inode (LIN) Scan. In addition to being simple to access in parallel, LINs also provide a useful way of accurately determining the amount of work required.

A directory tree walk is the traditional access method since it works similarly to common UNIX utilities, such as find – albeit in a far more distributed way. For parallel execution, the various job tasks are each assigned a separate subdirectory tree. Unlike LIN scans, tree walks may prove to be heavily unbalanced, due to varying sub-directory depths and file counts.

Disk drives provide excellent linear read access, so a drive scan can deliver orders of magnitude better performance than a directory tree walk or LIN scan for jobs that don’t require insight into file system structure. As such, drive scans are ideal for jobs like MediaScan, which linearly traverses each node’s disks looking for bad disk sectors.

A fourth class of Job Engine jobs utilize a ‘changelist’, rather than LIN-based scanning. The changelist approach analyzes two snapshots to find the LINs which changed (delta) between the snapshots, and then dives in to determine the exact changes.
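
Conceptually, the changelist approach is a diff of two snapshot views. The sketch below uses a simple LIN-to-mtime map as a stand-in for real snapshot metadata; it is an illustration of the idea, not OneFS code:

```python
# Toy changelist: compare two snapshot views (LIN -> modification time)
# to find the LINs that changed, were created, or were removed between
# the two snapshots.

def changelist(snap_old, snap_new):
    changed = [lin for lin in snap_old
               if lin in snap_new and snap_old[lin] != snap_new[lin]]
    created = [lin for lin in snap_new if lin not in snap_old]
    removed = [lin for lin in snap_old if lin not in snap_new]
    return changed, created, removed

old = {0x101: 10, 0x102: 20, 0x103: 30}
new = {0x101: 10, 0x102: 25, 0x104: 5}
print(changelist(old, new))  # ([258], [260], [259])
```

Once the delta LIN set is known, only those LINs need deeper inspection, which is what makes the changelist method far cheaper than a full LIN scan for incremental workloads.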

Architecturally, the job engine is based on a delegation hierarchy comprising coordinator, director, manager, and worker processes.

There are other threads which are not included in the diagram above, which relate to internal functions, such as communication between the various JE daemons, and collection of statistics. Also, with three jobs running simultaneously, each node would have three manager processes, each with its own number of worker threads.

Once the work is initially allocated, the job engine uses a shared work distribution model in order to execute the work, and each job is identified by a unique Job ID. When a job is launched, whether it’s scheduled, started manually, or responding to a cluster event, the Job Engine spawns a child process from the isi_job_d daemon running on each node. This job engine daemon is also known as the parent process.

The entire job engine’s orchestration is handled by the coordinator, which is a process that runs on one of the nodes in a cluster. Any node can act as the coordinator, and the principal responsibilities include:

  • Monitoring workload and the constituent nodes’ status
  • Controlling the number of worker threads per-node and cluster-wide
  • Managing and enforcing job synchronization and checkpoints

While the actual work item allocation is managed by the individual nodes, the coordinator node takes control, divides up the job, and evenly distributes the resulting tasks across the nodes in the cluster. For example, if the coordinator needs to communicate with a manager process running on node five, it first sends a message to node five’s director, which then passes it on down to the appropriate manager process under its control. The coordinator also periodically sends messages, via the director processes, instructing the managers to increment or decrement the number of worker threads.
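
The coordinator-to-director-to-manager relay described above can be modeled with a toy example. The class names and message strings below are invented purely for illustration of the delegation path, and bear no relation to the actual daemons:

```python
# Toy model of the delegation hierarchy: the coordinator never talks to
# a manager directly; it addresses the node's director, which delivers
# the message to the manager for the relevant job.

class Manager:
    def __init__(self, workers=4):
        self.workers = workers

    def handle(self, msg):
        # Worker thread count is adjusted on instruction from above.
        if msg == "increment":
            self.workers += 1
        elif msg == "decrement":
            self.workers -= 1
        return self.workers

class Director:
    def __init__(self, node):
        self.node = node
        self.managers = {}          # job_id -> Manager

    def deliver(self, job_id, msg):
        return self.managers[job_id].handle(msg)

# Coordinator -> node 5's director -> manager for job 274:
directors = {n: Director(n) for n in range(1, 6)}
directors[5].managers[274] = Manager()
print(directors[5].deliver(274, "increment"))  # 5
```

The indirection matters because the director is the single per-node point of contact, so the coordinator never needs to track individual manager processes.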

The coordinator is also responsible for starting and stopping jobs, and for processing work results as they are returned during the execution of a job. Should the coordinator process die for any reason, the coordinator responsibility automatically moves to another node.

The coordinator node can be identified via the following CLI command:

# isi job status --verbose | grep Coordinator

Each node in the cluster has a job engine director process, which runs continuously and independently in the background. The director process is responsible for monitoring, governing and overseeing all job engine activity on a particular node, constantly waiting for instruction from the coordinator to start a new job. The director process serves as a central point of contact for all the manager processes running on a node, and as a liaison with the coordinator process across nodes. These responsibilities include:

  • Manager process creation
  • Delegating to and requesting work from other peers
  • Sending and receiving status messages

The manager process is responsible for arranging the flow of tasks and task results throughout the duration of a job. The manager processes request and exchange work with each other and supervise the worker threads assigned to them. At any point in time, each node in a cluster can have up to three manager processes, one for each job currently running.

Each manager controls and assigns work items to multiple worker threads working on items for the designated job. Under direction from the coordinator and director, a manager process maintains the appropriate number of active threads for a configured impact level, and for the node’s current activity level. Once a job has completed, the manager processes associated with that job, across all the nodes, are terminated. And new managers are automatically spawned when the next job is moved into execution.

The manager processes on each node regularly send updates to their respective node’s director, which, in turn, informs the coordinator process of the status of the various worker tasks.

Each worker thread is given a task, if available, which it processes item-by-item until the task is complete or the manager un-assigns the task. The status of the nodes’ workers can be queried by running the CLI command “isi job statistics view”. In addition to the number of current worker threads per node, a sleep to work (STW) ratio average is also provided, giving an indication of the worker thread activity level on the node.

Towards the end of a job phase, the number of active threads decreases as workers finish up their allotted work and become idle. Nodes which have completed their work items just remain idle, waiting for the last remaining node to finish its work allocation. When all tasks are done, the job phase is considered to be complete and the worker threads are terminated.

As jobs are processed, the coordinator consolidates the task status from the constituent nodes and periodically writes the results to checkpoint files. These checkpoint files allow jobs to be paused and resumed, either proactively, or in the event of a cluster outage. For example, if the node on which the Job Engine coordinator was running went offline for any reason, a new coordinator would be automatically started on another node. This new coordinator would read the last consistency checkpoint file, job control and task processing would resume across the cluster from where it left off, and no work would be lost.
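
A highly simplified sketch of that checkpoint-and-resume cycle follows. The file layout and field names here are assumptions for illustration, not the actual OneFS checkpoint format:

```python
# Simplified checkpointing sketch: the coordinator periodically persists
# consolidated task state; a replacement coordinator reads the last
# checkpoint and resumes only the tasks not yet completed.
import json
import os
import tempfile

def write_checkpoint(path, job_id, phase, done_tasks):
    with open(path, "w") as f:
        json.dump({"job": job_id, "phase": phase, "done": done_tasks}, f)

def resume_from_checkpoint(path, all_tasks):
    with open(path) as f:
        cp = json.load(f)
    done = set(cp["done"])
    return [t for t in all_tasks if t not in done]

cp_file = os.path.join(tempfile.mkdtemp(), "cp.json")
write_checkpoint(cp_file, 274, 4, done_tasks=[0, 1, 2])
print(resume_from_checkpoint(cp_file, all_tasks=[0, 1, 2, 3, 4]))  # [3, 4]
```

Because completed work is recorded durably, a coordinator failover costs at most the work performed since the last checkpoint write.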

Job engine checkpoint files are stored in ‘results’ and ‘tasks’ subdirectories under the path ‘/ifs/.ifsvar/modules/jobengine/cp/<job_id>/’ for a given job. On large clusters and/or with a job running at high impact, there can be many checkpoint files accessed from all nodes, which may result in contention. Checkpoints are split into sixteen subdirectories under both tasks and results to alleviate this bottleneck.

PowerScale OneFS 9.7

Dell PowerScale is already powering up the holiday season with the launch of the innovative OneFS 9.7 release, which shipped today (13th December 2023). This new 9.7 release is an all-rounder, introducing PowerScale innovations in cloud, performance, security, and ease of use.

Enhancements to APEX File Storage for AWS

After the debut of APEX File Storage for AWS earlier this year, OneFS 9.7 extends and simplifies the PowerScale in-the-public-cloud offering, delivering more features on more instance types across more regions.

In addition to providing the same OneFS software platform on-prem and in the cloud, and customer-managed for full control, APEX File Storage for AWS in OneFS 9.7 sees a 60% capacity increase, providing linear capacity and performance scaling up to six SSD nodes and 1.6 PiB per namespace/cluster, and up to 10GB/s reads and 4GB/s writes per cluster. This can make it a solid fit for traditional file shares and home directories, vertical workloads like M&E, healthcare, life sciences, finserv, and next-gen AI, ML and analytics applications.

PowerScale’s scale-out architecture can be deployed on customer-managed AWS EC2 and EBS infrastructure, providing the scale and performance needed to run a variety of unstructured workflows in the public cloud. Plus, OneFS 9.7 adds an ‘easy button’ for streamlined AWS infrastructure provisioning and deployment.

Once in the cloud, existing PowerScale investments can be further leveraged by accessing and orchestrating your data through the platform’s multi-protocol access and APIs.

This includes the common OneFS control plane (CLI, WebUI, and platform API), and the same enterprise features: Multi-protocol, SnapshotIQ, SmartQuotas, Identity management, etc.

With OneFS 9.7, APEX File Storage for AWS also sees the addition of support for HDFS and FTP protocols, in addition to NFS, SMB, and S3. Plus granular performance prioritization and throttling is also enabled with SmartQoS, allowing admins to configure limits on the maximum number of protocol operations that NFS, S3, SMB, or mixed protocol workloads can consume on an APEX File Storage for AWS cluster.

Security

With data integrity and protection being top of mind in this era of unprecedented cyber threats, OneFS 9.7 brings a bevy of new features and functionality to keep your unstructured data and workloads more secure than ever. These new OneFS 9.7 security enhancements help address US Federal and DoD mandates, such as FIPS 140-2 and DISA STIGs – in addition to general enterprise data security requirements. Included in the new OneFS 9.7 release is a simple cluster configuration backup and restore utility, address space layout randomization, and single sign-on (SSO) lookup enhancements.

Data mobility

On the data replication front, SmartSync sees the introduction of GCP as an object storage target in OneFS 9.7, in addition to ECS, AWS and Azure. The SmartSync data mover allows flexible data movement and copying, incremental resyncs, push and pull data transfer, and one-time file to object copy.

Performance improvements

Building on the streaming read performance delivered in a prior release, OneFS 9.7 also unlocks dramatic write performance enhancements, particularly for the all-flash NVMe platforms – plus infrastructure support for future node hardware platform generations. A sizable boost in throughput to a single client helps deliver performance for the most demanding GenAI workloads, particularly for the model training and inferencing phases. Additionally, the scale-out cluster architecture enables performance to scale linearly as GPUs are increased, allowing PowerScale to easily support AI workflows from small to large.

Cluster support for InsightIQ 5.0

The new InsightIQ 5.0 software expands PowerScale monitoring capabilities, including a new user interface, automated email alerts and added security. InsightIQ 5.0 is available today for all existing and new PowerScale customers at no additional charge. These innovations are designed to simplify management, expand scale and security and automate operations for PowerScale performance monitoring for AI, GenAI and all other workloads.

In summary, OneFS 9.7 brings the following new features and functionality to the Dell PowerScale ecosystem:

Feature Info
Cloud · APEX File Storage for AWS 60% capacity increase
· Streamlined and automated APEX provisioning and deployment
· HDFS, FTP, and SmartQoS support
Simplicity · Job Engine restripe parallelization
· Cluster support for InsightIQ 5.0
· SmartSync GCP support
Performance · Write performance improvements for NVMe-based all-flash platforms
· Infrastructure support for next generation all-flash node hardware platforms
Security · Cluster configuration backup and restore
· Address space layout randomization
· Single sign-on (SSO) lookup enhancements

We’ll be taking a deeper look at these new features and functionality in blog articles over the course of the next few weeks.

Meanwhile, the new OneFS 9.7 code is available on the Dell Online Support site, as both an upgrade and reimage file, allowing both installation and upgrade of this new release.

OneFS and Client Bandwidth Measurement with iPerf

Sometimes in a storage admin’s course of duty there’s a need to quickly and easily assess the bandwidth between a PowerScale cluster and client. The ubiquitous iPerf tool is a handy utility for taking active measurements of the maximum achievable bandwidth between a PowerScale cluster and client, across the node’s front-end IP network(s).

iPerf was developed by NLANR/DAST as a modern alternative for measuring maximum TCP and UDP bandwidth performance. iPerf is a flexible tool, allowing the tuning of various parameters and UDP characteristics, and reporting network performance stats including bandwidth, delay jitter, datagram loss, etc.

In contrast to the classic iPerf (typically version 2.x), a newer and more feature-rich iPerf3 version is also available. Unlike the classic incarnation, iPerf3 is primarily developed and maintained by ESnet and the Lawrence Berkeley National Laboratory, and made available under BSD licensing. Note that iPerf3 neither shares code nor provides backwards compatibility with the classic iPerf.

Additional optional features of iPerf3 include:

  • CPU affinity setting
  • IPv6 flow labeling
  • SCTP
  • TCP congestion algorithm settings
  • Sendfile / zerocopy
  • Socket pacing
  • Authentication

Both iPerf and iPerf3 are available preinstalled on OneFS, and can be useful for measuring and verifying anticipated network performance prior to running any performance benchmark. The standard ‘iperf’ CLI command automatically invokes the classic (v2) version:

# iperf -v
iperf version 2.0.4 (7 Apr 2008) pthreads

Within OneFS, the iPerf binary can be found in the /usr/local/bin/ directory on each node:

# whereis iperf
iperf: /usr/local/bin/iperf /usr/local/man/man1/iperf.1.gz

Whereas the enhanced iPerf version 3 uses the ‘iperf3’ CLI syntax, and also lives under /usr/local/bin:

# iperf3 -v
iperf 3.4 (cJSON 1.5.2)

# whereis iperf3
iperf3: /usr/local/bin/iperf3 /usr/local/man/man1/iperf3.1.gz

For Linux and Windows clients, iPerf binaries can also be downloaded and installed from the following location:

https://iperf.fr/

The iPerf source code is also available at Sourceforge for those ‘build-your-own’ aficionados among us:

http://sourceforge.net/projects/iperf/

Under the hood, iPerf allows the configuration and tuning of a variety of buffering and timing parameters across both TCP and UDP, and with support for IPv4 and IPv6 environments. For each test, iPerf reports the maximum bandwidth, loss, and other salient metrics.

More specifically, iPerf supports the following features:

Attribute Details
TCP · Measure bandwidth
· Report MSS/MTU size and observed read sizes
· Supports SCTP multi-homing and redundant paths for reliability and resilience
UDP · Client can create UDP streams of specified bandwidth
· Measure packet loss
· Measure delay jitter
· Supports multicast
Platform support · Windows, Linux, MacOS, BSD UNIX, Solaris, Android, VxWorks
Concurrency · Client and server can support multiple simultaneous connections (-P flag)
· iPerf3 server accepts multiple simultaneous connections from the same client
Duration · Can be configured to run for a specified time (-t flag), in addition to a set amount of data (-n and -k flags)
· Server can be run as a daemon (-D flag)
Reporting · Can display periodic, intermediate bandwidth, jitter, and loss reports at configurable intervals (-i flag)

When it comes to running iPerf, the most basic use case is testing a single connection from a client to a node on the cluster. This can be initiated as follows:

On the cluster node, the following CLI command will initiate the iPerf server:

# iperf -s

Similarly, on the client, the following CLI syntax will target the iPerf server on the cluster node:

# iperf -c <server_IP>

For example, with a FreeBSD client with IP address 10.11.12.9 connecting to a cluster node at 10.10.11.12:

# iperf -c 10.10.11.12
------------------------------------------------------------
Client connecting to 10.10.11.12, TCP port 5001
TCP window size:   131 KByte (default)
------------------------------------------------------------
[  3] local 10.11.12.9 port 65001 connected with 10.10.11.12 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  31.8 GBytes  27.3 Gbits/sec

And from the cluster node:

# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:   128 KByte (default)
------------------------------------------------------------
[  4] local 10.10.11.12 port 5001 connected with 10.11.12.9 port 65001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  31.8 GBytes  27.3 Gbits/sec

As indicated in the above output, iPerf uses a default window size of 128KB. Also note that the classic iPerf (v2) uses TCP port 5001 by default on OneFS. As such, this port must be open on any and all firewalls and/or packet filters situated between client and node for the above to work. Similarly, iPerf3 defaults to TCP 5201, and the same open port requirements between clients and cluster apply.
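
Since a blocked control port is a common reason an iPerf test fails to start, a quick reachability check can save troubleshooting time. The following generic Python snippet (not part of iPerf or OneFS) tests whether a TCP port accepts connections:

```python
# Generic pre-flight check: verify the iPerf control port is reachable
# before starting a test. Defaults noted in the text: TCP 5001 for
# classic iPerf, TCP 5201 for iPerf3.
import socket

def port_open(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example against a local listener (substitute a node IP and 5001/5201
# in practice):
srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # ephemeral port stands in for 5001/5201
srv.listen(1)
print(port_open("127.0.0.1", srv.getsockname()[1]))  # True
srv.close()
```

If the check fails against a cluster node, inspect any firewalls or packet filters between client and node before suspecting iPerf itself.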

Here’s the output from the same configuration but using iPerf3:

For example, from the server:

# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 10.11.12.9, port 12543
[  5] local 10.10.11.12 port 5201 connected to 10.11.12.9 port 55439
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  3.22 GBytes  27.7 Gbits/sec
[  5]   1.00-2.00   sec  3.59 GBytes  30.9 Gbits/sec
[  5]   2.00-3.00   sec  3.52 GBytes  30.3 Gbits/sec
[  5]   3.00-4.00   sec  3.95 GBytes  33.9 Gbits/sec
[  5]   4.00-5.00   sec  4.07 GBytes  34.9 Gbits/sec
[  5]   5.00-6.00   sec  4.10 GBytes  35.2 Gbits/sec
[  5]   6.00-7.00   sec  4.14 GBytes  35.6 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-7.00   sec  27.8 GBytes  34.1 Gbits/sec                  receiver
iperf3: the client has terminated
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

And from the client:

# iperf3 -c 10.10.11.12
Connecting to host 10.10.11.12, port 5201
[  5] local 10.11.12.9 port 55439 connected to 10.10.11.12 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.22 GBytes  27.7 Gbits/sec    0    316 KBytes
[  5]   1.00-2.00   sec  3.59 GBytes  30.9 Gbits/sec    0    316 KBytes
[  5]   2.00-3.00   sec  3.52 GBytes  30.3 Gbits/sec    0    504 KBytes
[  5]   3.00-4.00   sec  3.95 GBytes  33.9 Gbits/sec    2    671 KBytes
[  5]   4.00-5.00   sec  4.07 GBytes  34.9 Gbits/sec    0    671 KBytes
[  5]   5.00-6.00   sec  4.10 GBytes  35.2 Gbits/sec    1    664 KBytes
[  5]   6.00-7.00   sec  4.14 GBytes  35.6 Gbits/sec    0    664 KBytes
^C[  5]   7.00-7.28   sec  1.17 GBytes  35.6 Gbits/sec    0    664 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-7.28   sec  27.8 GBytes  32.8 Gbits/sec    3             sender
[  5]   0.00-7.28   sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated
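
For scripted testing, iPerf3 can also emit machine-readable JSON with the ‘-J’ flag, which is easier to post-process than the human-readable table. The sketch below pulls the end-of-test summary rates from such output; the key names reflect iPerf3’s documented JSON layout, but treat them as an assumption to verify against your iPerf3 version:

```python
# Parse an iperf3 -J result: the end-of-test summary lives under the
# "end" object, with sender/receiver totals in "sum_sent" and
# "sum_received" (key names per iperf3's JSON output; verify locally).
import json

def summarize(iperf3_json):
    end = json.loads(iperf3_json)["end"]
    to_gbit = lambda bps: bps / 1e9
    return (to_gbit(end["sum_sent"]["bits_per_second"]),
            to_gbit(end["sum_received"]["bits_per_second"]))

# Minimal synthetic payload in the same shape as the run shown above:
sample = json.dumps({"end": {
    "sum_sent":     {"bits_per_second": 32.8e9},
    "sum_received": {"bits_per_second": 34.1e9}}})
print(summarize(sample))  # (32.8, 34.1)
```

Capturing the JSON (`iperf3 -J -c <node_IP> > result.json`) makes repeated runs trivial to compare over time.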

Regarding iPerf CLI syntax, the following options are available in each version of the tool:

Options Description iPerf iPerf3
<none> Default settings X
–authorized-users-path Path to the configuration file containing authorized users credentials to run iperf tests (if built with OpenSSL support) X
-A Set the CPU affinity, if possible (Linux, FreeBSD, and Windows only). X
-b Set target bandwidth/bitrate  to n bits/sec (default 1 Mbit/sec). Requires UDP (-u). X X
-B Bind to <host>, an interface or multicast address X X
-c Run in client mode, connecting to <host> X X
-C Compatibility; for use with older versions – does not sent extra msgs X
-C Set the congestion control algorithm (Linux and FreeBSD only) X
–cport Bind data streams to a specific client port (for TCP and UDP only, default is to use an ephemeral port) X
–connect-timeout Set timeout for establishing the initial control connection to the server, in milliseconds.  Default behavior is the OS’ timeout for TCP connection establishment. X
-d Simultaneous bi-directional bandwidth X
-d Emit debugging output X
-D Run the server as a daemon X X
–dscp Set the IP DSCP bits X
-f Format to report: Kbits/Mbits/Gbits/Tbits X
-F Input the data to be transmitted from a file X X
–forceflush Force flushing output at every interval, to avoid buffering when sending output to pipe. X
–fq-rate Set a rate to be used with fair-queueing based socket-level

pacing, in bits per second.

X
–get-server-output Get the output from the server.  The output format is determined by the server (ie. JSON ‘-j’) X
-h Help X X
-i Interval: Pause n seconds between periodic bandwidth reports. X X
-I Input the data to be transmitted from stdin X
-I Write a file with the process ID X
-J Output in JSON format X
-k Number of blocks (packets) to transmit (instead of -t or -n) X
-l Length of buffer to read or write.  For TCP tests, the default value is 128KB.  With UDP, iperf3 tries to dynamically determine a reasonable sending size based on the path MTU; if that cannot be determined it uses 1460 bytes as a sending size. For SCTP tests, the default size is 64KB. X
-L Set length read/write buffer (defaults to 8 KB) X
-L Set the IPv6 flow label X
–logfile Send output to a log file. X
-m Print TCP maximum segment size (MTU – TCP/IP header) X
-M Set TCP maximum segment size (MTU – 40 bytes) X X
-n number of bytes to transmit (instead of -t) X X
-N Set TCP no delay, disabling Nagle’s Algorithm X X
–nstreams Set number of SCTP streams. X
-o Output the report or error message to a specified file X
-O Omit the first n seconds of the test, to skip past the TCP slow-start period. X
-p Port: set server port to listen on/connect to X X
-P Number of parallel client threads to run X X
–pacing-timer Set pacing timer interval in microseconds (default 1000 microseconds, or 1 ms).  This controls iperf3’s internal pacing timer for the -b/–bitrate option. X
-r Bi-directional bandwidth X
-R Reverse the direction of a test, so that the server sends data to the client X
–rsa-private-key-path Path to the RSA private key (not password-protected) used to decrypt authentication credentials from the client (if built with OpenSSL support). X
–rsa-public-key-path Path to the RSA public key used to encrypt authentication credentials (if built with OpenSSL support) X
-s Run iPerf in server mode X X
-S Set the IP type of service. X
–sctp use SCTP rather than TCP (FreeBSD and Linux) X
-t Time in seconds to transmit for (default 10 secs) X X
-T Time-to-live, for multicast (default 1) X
-T Prefix every output line with this title string X
-u Use UDP rather than TCP. X X
-U Run in single threaded UDP mode X
--username Username to use for authentication to the iperf server (if built with OpenSSL support). The password will be prompted for interactively when the test is run. X
-v Print version information and quit X X
-V Set the domain to IPv6 X
-V Verbose – give more detailed output X
-w TCP window size (socket buffer size) X X
-x Exclude C(connection), D(data), M(multicast), S(settings), V(server) reports X
-X Bind SCTP associations to a specific subset of links using sctp_bindx X
-y If set to C or c, report results as CSV (comma separated values) X
-Z Set TCP congestion control algorithm (Linux only) X
-Z Use a ‘zero copy’ method of sending data, such as sendfile instead of the usual write. X
-1 Handle one client connection, then exit. X
-4 Only use IPv4 X
-6 Only use IPv6 X

To run the iPerf server across all nodes in a cluster, it can be initiated in conjunction with the OneFS ‘isi_for_array’ CLI utility, as follows:

# isi_for_array iperf -s

Bidirectional testing can also sometimes be a useful sanity-check, with OneFS acting as the client pointing to a client OS running the server instance of iPerf. For example:

# iperf -c 10.10.11.205 -i 5 -t 60 -P 4

Start the iperf client on a Linux client connecting to one of the PowerScale nodes.

# iperf -c 10.10.1.100

For a Windows client, the same CLI syntax, issued from the command shell (cmd.exe), can be used to start the iperf client and connect to a PowerScale node. For example:

C:\Users\pocadmin\Downloads\iperf-2.0.9-win64\iperf-2.0.9-win64>iperf.exe -c 10.10.0.196

iPerf Write Testing

When it comes to write performance testing, the following CLI syntax can be used on the client to execute a write speed (Client -> Cluster) test:

# iperf -P 8 -c <clusterIP>

Note that the ‘-P’ flag designates parallel client threads, allowing the iPerf thread count to be matched to the number of physical CPU cores (not hyper-threads) available to the client.
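On a Linux client, the physical core count can be derived from lscpu, where each unique Core,Socket pair corresponds to one physical core. A brief sketch, with the iperf invocation left commented out and <clusterIP> as a placeholder:

```shell
# Count physical cores (one unique Core,Socket pair per core),
# ignoring hyper-threaded siblings, then use the count for -P.
CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
echo "Physical cores: $CORES"
# iperf -P "$CORES" -c <clusterIP>
```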

Similarly, the following CLI command can be used on the client to initiate a read speed (Client <– Cluster) test:

# iperf -P 8 -R -c <clusterIP>

Below is an example command from a Linux VM to a single PowerScale node.  Testing was repeated from each Linux client to each node in the cluster to validate results and verify consistent network performance. Using the cluster nodes as the server, the bandwidth tested to ~ 7.2Gbps per VM. (Note that, in this case, the VM limit is 8.0 Gbps):

# iperf -c onefs-node1 -i 5 -t 60 -P 4

------------------------------------------------------------

Client connecting to isilon-node1, TCP port 5001

TCP window size: 94.5 KByte (default)

------------------------------------------------------------

[  4] local 10.10.0.205 port 44506 connected with 172.16.0.5 port 5001

[SUM]  0.0-60.0 sec  50.3 GBytes  7.20 Gbits/sec

Two Linux VMs were also tested running iPerf in parallel, in order to maximize utilization of the ExpressRoute network link. This test involved dual iPerf writes from the Linux clients to separate cluster nodes.

[admin@Linux64GB16c-3 ~]$ iperf -c onefs-node3 -i 5 -t 40 -P 4

[SUM]  0.0-40.0 sec  22.5 GBytes  4.83 Gbits/sec 

[admin@linux-vm2 ~]$ iperf -c onefs-node2 -i 5 -t 40 -P 4

[SUM]  0.0-40.0 sec  22.1 GBytes  4.75 Gbits/sec

As can be seen from the results of the iPerf tests, writes split evenly from the Linux clients to the cluster nodes, while collectively saturating the bandwidth of the Azure ExpressRoute link.
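As a quick arithmetic check on the figures above, note that iperf reports the transfer total in binary gigabytes (GiB) but the rate in decimal Gbits/sec. The single-stream result from the earlier run can be reproduced as follows:

```shell
# Reproduce iperf's reported rate from its [SUM] line:
# 50.3 GBytes (binary GiB) transferred over 60 seconds.
awk 'BEGIN {
  bytes = 50.3 * 1073741824        # GiB -> bytes
  gbps  = bytes * 8 / 60 / 1e9     # bits over 60 s, in decimal Gbits
  printf "%.2f Gbits/sec\n", gbps
}'
# Prints: 7.20 Gbits/sec
```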

OneFS HTTP Services and Security

To facilitate granular HTTP security configuration, OneFS provides an option to disable nonessential HTTP components selectively. Disabling a specific component’s service still allows other essential services on the cluster to continue to run unimpeded. In OneFS 9.4 and later, the following nonessential HTTP services may be disabled:

Service Description
PowerScaleUI The OneFS WebUI configuration interface.
Platform-API-External External access to the OneFS platform API endpoints.
Rest Access to Namespace (RAN) RESTful access via HTTP to a cluster’s /ifs namespace.
RemoteService Remote Support and In-Product Activation.
SWIFT (deprecated) Deprecated object access to the cluster via the SWIFT protocol. This has been replaced by the S3 protocol in OneFS.

Each of these services may be enabled or disabled independently via the CLI or platform API by a user account with the ISI_PRIV_HTTP RBAC privilege.

The ‘isi http services’ CLI command set can be used to view and modify these nonessential HTTP services:

# isi http services list

ID                    Enabled

------------------------------

Platform-API-External Yes

PowerScaleUI          Yes

RAN                   Yes

RemoteService         Yes

SWIFT                 No

------------------------------

Total: 5
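For scripted audits, this tabular output can be filtered with standard tools. A minimal sketch, with the sample listing above stubbed into a function so the filter can be exercised off-cluster (on a real cluster, substitute the actual ‘isi http services list’ command for the stub):

```shell
# Report any nonessential HTTP service that is still enabled.
# The sample listing is stubbed in for off-cluster testing.
isi_http_services_list() {
cat <<'EOF'
ID                    Enabled
------------------------------
Platform-API-External Yes
PowerScaleUI          Yes
RAN                   Yes
RemoteService         Yes
SWIFT                 No
------------------------------
Total: 5
EOF
}
# Print the name of every service whose Enabled column reads 'Yes'.
isi_http_services_list | awk '$2 == "Yes" { print $1 }'
```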

For example, remote HTTP access to the OneFS /ifs namespace can easily be disabled as follows:

# isi http services modify RAN --enabled=0

You are about to modify the service RAN. Are you sure? (yes/[no]): yes

Similarly, a subset of the HTTP configuration settings can also be viewed and edited via the WebUI by navigating to Protocols > HTTP settings.

The implications and impact of disabling each of these services are as follows:

Service Disabling Impacts
WebUI The WebUI is completely disabled, and access attempts (default TCP port 8080) are denied with the following warning:

“Service Unavailable. Please contact Administrator.”

If the WebUI is re-enabled, the external platform API service (Platform-API-External) is also started if it is not running. Note that disabling the WebUI does not affect the PlatformAPI service.

Platform API External API requests to the cluster are denied, and the WebUI is disabled, since it uses the Platform-API-External service.

Note that the Platform-API-Internal service is not impacted if/when the Platform-API-External is disabled, and internal pAPI services continue to function as expected.

If the Platform-API-External service is re-enabled, the WebUI will remain inactive until the PowerScaleUI service is also enabled.

RAN If RAN is disabled, the WebUI components for File System Explorer and File Browser are also automatically disabled.

From the WebUI, attempts to access the OneFS file system explorer (File System > File System Explorer) fail with the following warning message:

“Browse is disabled as RAN service is not running. Contact your administrator to enable the service.”

This same warning is also displayed when attempting to access any other WebUI components that require directory selection.

RemoteService If RemoteService is disabled, the WebUI components for Remote Support and In-Product Activation are disabled.

In the WebUI, going to Cluster Management > General Settings and selecting the Remote Support tab displays the following message:

“The service required for the feature is disabled. Contact your administrator to enable the service.”

In the WebUI, going to Cluster Management > Licensing and scrolling to the License Activation section displays the following message: The service required for the feature is disabled. Contact your administrator to enable the service.

SWIFT Deprecated object protocol and disabled by default.

OneFS HTTP configuration can be displayed from the CLI via the ‘isi http settings view’ command:

# isi http settings view

            Access Control: No

      Basic Authentication: No

    WebHDFS Ran HTTPS Port: 8443

                       Dav: No

         Enable Access Log: Yes

                     HTTPS: No

 Integrated Authentication: No

               Server Root: /ifs

                   Service: disabled

           Service Timeout: 8m20s

          Inactive Timeout: 15m

           Session Max Age: 4H

Httpd Controlpath Redirect: No

Similarly, HTTP configuration can be managed and changed using the ‘isi http settings modify’ CLI syntax.

For example, to reduce the maximum session age from 4 to 2 hours:

# isi http settings view | grep -i age

           Session Max Age: 4H

# isi http settings modify --session-max-age=2H

# isi http settings view | grep -i age

           Session Max Age: 2H

The full set of configuration options for ‘isi http settings’ includes:

Option Description
--access-control <boolean> Enable Access Control Authentication for the HTTP service. Access Control Authentication requires at least one type of authentication to be enabled.
--basic-authentication <boolean> Enable Basic Authentication for the HTTP service.
--webhdfs-ran-https-port <integer> Configure the data services port for the HTTP service.
--revert-webhdfs-ran-https-port Set value to system default for --webhdfs-ran-https-port.
--dav <boolean> Comply with Class 1 and 2 of the DAV specification (RFC 2518) for the HTTP service. All DAV clients must go through a single node; DAV compliance is not met via SmartConnect, or via two or more node IPs.
--enable-access-log <boolean> Enable writing to a log when the HTTP server is accessed.
--https <boolean> Enable the HTTPS transport protocol for the HTTP service.
--integrated-authentication <boolean> Enable Integrated Authentication for the HTTP service.
--server-root <path> Document root directory for the HTTP service. Must be within /ifs.
--service (enabled | disabled | redirect | disabled_basicfile) Set the HTTP service state: enabled, disabled, redirect (to the WebUI), or disabled_basicfile.
--service-timeout <duration> Amount of time (in seconds) the server will wait for certain events before failing a request. A value of 0 uses the Apache default timeout.
--revert-service-timeout Set value to system default for --service-timeout.
--inactive-timeout <duration> Set the HTTP RequestReadTimeout directive for both the WebUI and HTTP service.
--revert-inactive-timeout Set value to system default for --inactive-timeout.
--session-max-age <duration> Set the HTTP SessionMaxAge directive for both the WebUI and HTTP service.
--revert-session-max-age Set value to system default for --session-max-age.
--httpd-controlpath-redirect <boolean> Enable or disable WebUI redirection to the HTTP service.

Note that, while the OneFS S3 service uses HTTP, it is considered as a tier-1 protocol, and as such is managed via its own ‘isi s3’ CLI command set and corresponding WebUI area. For example, the following CLI command will force the cluster to only accept encrypted HTTPS/SSL traffic on TCP port 9999 (rather than the default TCP port 9021):

# isi s3 settings global modify --https-only 1 --https-port 9999

# isi s3 settings global view

         HTTP Port: 9020

        HTTPS Port: 9999

        HTTPS only: Yes

S3 Service Enabled: Yes

Additionally, the S3 service can be disabled entirely with the following CLI syntax:

# isi services s3 disable

The service 's3' has been disabled.

Or from the WebUI under Protocols > S3 > Global settings.

OneFS Additional Security Hardening – Part 3

As mentioned in previous articles in this series, applying a hardening profile is one of multiple tasks that are required in order to configure a STIG-compliant PowerScale cluster. These include:

Component Tasks
Audit Configure remote syslog servers for auditing.
Authentication Configure secure auth provider, SecurityAdmin account, and default restricted shell.
CELOG Create event channel for security officers and system admin to monitor /root and /var partition usage, audit service, security verification, and account creation.
MFA & SSO Enable and configure multi-factor authentication and single sign-on.
NTP Configure secure NTP servers with SHA256 keys.
SMB Configure SMB global settings and defaults.

Enable SMB encryption on shares.

SNMP Enable SNMP and configure SNMPv3 settings.
SyncIQ Configure SyncIQ to use CA certificates so both the source and target clusters (primary and secondary DSCs) have both Server Authentication and Client Authentication set in their Extended Key Usages fields.

In this final article in the series, we’ll cover the security configuration details for SyncIQ replication using the OneFS CLI.

SyncIQ Setup

SyncIQ supports over-the-wire, end-to-end encryption for data replication, protecting and securing in-flight data between clusters. A global setting enforces encryption on all incoming and outgoing SyncIQ policies.

  1. First, on the source cluster, which is also the primary DSC (Digital Signature Certificate), add the CA (Certificate Authority) certificate(s) to the certificate store.
# isi certificate authority import [ca certificate path]

Where:

Item Description
[ca certificate path] The path to the CA certificate file.

Note that SyncIQ certificates for both the source and target clusters (aka primary and secondary DSC respectively) must have both ‘Server Authentication’ and ‘Client Authentication’ set in their ‘Extended Key Usages’ fields.

Repeat as necessary, and include root and intermediate CA certificates for both the source and target, plus the OCSP (Online Certificate Status Protocol) issuer:

  • source cluster
  • target cluster
  • OCSP issuer

To prevent unauthorized access to the private key/certificate, ensure the certificate and private key files are deleted/removed once all necessary import steps have been successfully completed.
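Each certificate’s Extended Key Usage fields can be verified with openssl before import. In this sketch, a throwaway self-signed certificate is generated inline purely for illustration (requires OpenSSL 1.1.1 or later for the -addext/-ext options); in practice, point the final command at the actual PEM file:

```shell
# Generate a demo cert carrying both required EKUs (illustration only),
# then confirm Server and Client Authentication are both present.
TMP=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=synciq-demo" \
  -addext "extendedKeyUsage=serverAuth,clientAuth" \
  -keyout "$TMP/key.pem" -out "$TMP/cert.pem" 2>/dev/null
openssl x509 -in "$TMP/cert.pem" -noout -ext extendedKeyUsage
rm -rf "$TMP"
```

The check should report both ‘TLS Web Server Authentication’ and ‘TLS Web Client Authentication’; if either is missing, the certificate will not satisfy the SyncIQ EKU requirement described above.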

 

  2. Next, on the source cluster (primary DSC), add the source cluster certificate and its private key to the SyncIQ server certificate store. This can be accomplished with the following CLI syntax:
# isi sync certificates server import [source certificate path] [source certificate key path]

Where:

Item Description
[source certificate path] The path to the source certificate file (in PEM or DER format).
[source certificate key path] The path to the source certificate private key file.

Once again, to prevent unauthorized access to the private key/certificate, remove the certificate and private key files once import has been completed successfully.

 

  3. On the source cluster (primary DSC), set the cluster certificate to the certificate imported above.

Find certificate ID:

# isi certificate server list -v

Then configure cluster certificate ID:

# isi sync settings modify --cluster-certificate-id [certificate_id]

Where:

Item Description
[certificate id] The ID of the cluster certificate.

 

  4. On the source cluster (primary DSC), add the target cluster’s (secondary DSC) certificate as a peer certificate.
# isi sync certificates peer import [target certificate path]

Where:

Item Description
[target certificate path] The path to the target cluster/secondary DSC certificate file.

To prevent unauthorized access to the private key/certificate, remove certificate and private key files once done with all necessary import steps.

 

  5. On the source cluster (primary DSC), configure the global Online Certificate Status Protocol (OCSP) issuer ID and address settings.
# isi sync settings modify

 --ocsp-issuer-certificate-id=[ocsp issuer certificate id]

 --ocsp-address=[OCSP server URI]

Where:

Item Description
[ocsp issuer certificate id] The ID of the certificate as registered in the PowerScale certificate manager.
[OCSP server URI] The URI of the OCSP responder.

To find the OCSP issuer certificate ID:

# isi certificate authority list -v

This assumes that the OCSP issuer certificate file has already been successfully imported into the PowerScale certificate manager.

 

  6. On the target cluster (secondary DSC), add the CA certificate(s) to the certificate store.
# isi certificate authority import [ca certificate path]

Where:

Item Description
[ca certificate path] The path to the CA certificate file.

Repeat as necessary, including the root and intermediate CA certificates for:

  • source cluster
  • target cluster
  • OCSP issuer

To prevent unauthorized access to the private key/certificate, remove certificate and private key files once done with all necessary import steps.

  7. On the target cluster (secondary DSC), add the target cluster certificate and its private key to the SyncIQ server certificate store.

# isi sync certificates server import [target certificate path] [target certificate key path]

Where:

Item Description
[target certificate path] The path to the target certificate file (in PEM or DER format).
[target certificate key path] The path to the target certificate private key file.

To prevent unauthorized access to the private key/certificate, remove the certificate and private key files once done with all necessary import steps.

 

  8. On the target cluster (secondary DSC), set the cluster certificate to the certificate imported above.

First, retrieve the certificate ID:

# isi certificate server list -v

Next, configure the cluster certificate ID:

# isi sync settings modify --cluster-certificate-id [certificate_id]

Where:

Item Description
[certificate id] The ID of the cluster certificate

 

  9. On the target cluster (secondary DSC), add the source cluster’s (primary DSC) certificate as a peer certificate.
# isi sync certificates peer import [source certificate path]

Where:

Item Description
[source certificate path] The path to the source cluster/primary DSC certificate file.

To prevent unauthorized access to the private key/certificate, remove certificate and private key files once done with all necessary import steps.

  10. On the target cluster (secondary DSC), configure the global Online Certificate Status Protocol (OCSP) settings.

# isi sync settings modify

 --ocsp-issuer-certificate-id=[ocsp issuer certificate id]

 --ocsp-address=[OCSP server URI]

Where:

Item Description
[ocsp issuer certificate id] The ID of the certificate as registered in the PowerScale certificate manager.
[OCSP server URI] The URI of the OCSP responder.

To find the OCSP issuer certificate ID:

# isi certificate authority list -v

This assumes that the OCSP issuer certificate file has already been imported into the PowerScale certificate manager.

  11. Finally, for any pre-existing policies, configure the following OCSP settings on the source cluster (primary DSC).
# isi sync policies modify [policy name]

 --ocsp-issuer-certificate-id=[ocsp issuer certificate id]

 --ocsp-address=[OCSP server URI]

Where:

Item Description
[ocsp issuer certificate id] The ID of the certificate as registered in the PowerScale certificate manager.
[OCSP server URI] The URI of the OCSP responder.

To find the OCSP issuer certificate ID:

# isi certificate authority list -v

At this point, the SyncIQ certificate configuration work should be complete.

OneFS Additional Security Hardening – Part 2

As mentioned in previous articles in this series, applying a hardening profile is one of multiple tasks that are required in order to configure a STIG-compliant PowerScale cluster. These include:

Component Tasks
Audit Configure remote syslog servers for auditing.
Authentication Configure secure auth provider, SecurityAdmin account, and default restricted shell.
CELOG Create event channel for security officers and system admin to monitor /root and /var partition usage, audit service, security verification, and account creation.
MFA & SSO Enable and configure multi-factor authentication and single sign-on.
NTP Configure secure NTP servers with SHA256 keys.
SMB Configure SMB global settings and defaults.

Enable SMB encryption on shares.

SNMP Enable SNMP and configure SNMPv3 settings.
SyncIQ Configure SyncIQ to use CA certificates so both the source and target clusters (primary and secondary DSCs) have both Server Authentication and Client Authentication set in their Extended Key Usages fields.

In this article, we’ll cover the specific configuration requirements and details of the NTP, SMB, SNMP components using the OneFS CLI.

NTP Setup

  1. When implementing a secure configuration for the OneFS NTP service, create an NTP key file and populate it with NTP server key hashes.

To add secure NTP servers to the OneFS configuration, first create an NTP keys file. This can be accomplished via the following CLI syntax:

# echo "[key index] sha256 [SHA hash]" > [keyfile]

Where:

Item Description
[key index] The index (increasing from 1) of the key hash.
[SHA hash] The SHA256 hash identifying the NTP server.
[keyfile] The path to the NTP key file.

Append as many additional key entries as are necessary. The ntp.keys(5) man page provides detailed information on the NTP key file format.
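The steps above can be sketched as follows. The key values here are placeholders rather than real key material, and a temporary path is used so the sketch runs anywhere; on a cluster the file would typically live under /ifs (for example, /ifs/ntp.keys):

```shell
# Build an NTP keys file with two SHA256 entries (placeholder hashes,
# not real key material) and restrict its permissions.
KEYFILE=$(mktemp)
printf '%s\n' \
  '1 sha256 0f1e2d3c4b5a69788796a5b4c3d2e1f00f1e2d3c4b5a69788796a5b4c3d2e1f0' \
  '2 sha256 aabbccddeeff00112233445566778899aabbccddeeff00112233445566778899' \
  > "$KEYFILE"
chmod 600 "$KEYFILE"
grep -c sha256 "$KEYFILE"   # Prints: 2
```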

  2. Next, configure OneFS to use this NTP key file.
# isi ntp settings modify --key-file /ifs/ntp.keys
  3. The following CLI syntax can be used to configure NTP servers.
# isi ntp servers create [server hostname/IP] --key [key index]

Where:

Item Description
[server hostname/IP] The fully qualified domain name (FQDN) or IP address of the NTP server.
[key index] The key used by this particular server in the NTP keys file configured above.

Note that STIG requirements explicitly state that more than one (1) NTP server is required for compliance.

SMB Setup

  1. Deploying SMB in a hardened environment typically involves enabling SMB3 encryption, security signatures, and disabling unencrypted access to shares. To accomplish this, first configure the global settings and defaults as follows.
# isi smb settings global modify --support-smb3-encryption true
 --enable-security-signatures true --require-security-signatures true
 --reject-unencrypted-access true


# isi_gconfig registry.Services.lwio.Parameters.Drivers.srv.SupportSmb1=0


# isi_gconfig registry.Services.lwio.Parameters.Drivers.rdr.Smb1Enabled=0
  2. Next, update the per-share SMB settings to enable SMB encryption.
# isi smb shares modify [share_name] --smb3-encryption-enabled true
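To apply the per-share setting across many shares, the share names can be scripted. A sketch in which a simplified stand-in for ‘isi smb shares list’ output is stubbed in (the real OneFS listing format may differ), and the modify commands are echoed as a dry run rather than executed:

```shell
# Dry run: print an 'isi smb shares modify' command for every share.
# Replace the stub below with the real listing command on a cluster.
isi_smb_shares_list() {
cat <<'EOF'
Share Name  Path
--------------------------
data        /ifs/data
home        /ifs/home
--------------------------
Total: 2
EOF
}
# Skip the header, separator, and Total lines; emit one command per share.
isi_smb_shares_list |
  awk 'NR > 2 && $1 !~ /^-/ && $1 != "Total:" { print $1 }' |
  while read -r share; do
    echo "isi smb shares modify $share --smb3-encryption-enabled true"
  done
```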

SNMP Setup

  1. The following CLI command can be used to enable the OneFS SNMP v3 service and configure its settings and password.
# isi snmp settings modify --service=true --snmp-v3-access=true --snmp-v3-password=[password]

In the next and final article in this series, we’ll focus on the remaining topic in the list: namely, secure SyncIQ configuration.

OneFS Additional Security Hardening – Part 1

When configuring security hardening on OneFS 9.5 or later, one thing to note is that, even with the STIG profile activated, not all the rules are automatically marked as ‘applied’. Specifically:

# isi hardening report view STIG | grep "Not Applied"

check_stig_celog_alerts                        Cluster   Not Applied Military Unique Deployment Guide manually configured CELOG settings.

check_synciq_default_ocsp_settings             Cluster   Not Applied /sync/settings/:cluster_certificate_id

check_synciq_policy_ocsp_settings              Cluster   Not Applied /sync/policies/:ocsp_issuer_certificate_id

check_multiple_ntp_servers_configured          Cluster   Not Applied /protocols/ntp/servers:total

set_auth_webui_sso_mfa_idp                     Cluster   Not Applied auth/providers/saml-services/idps/System

set_auth_webui_sso_mfa_sp_host                 Cluster   Not Applied auth/providers/saml-services/sp?zone=System:hostname

Applying a hardening profile is one of multiple tasks that are required in order to configure a STIG-compliant PowerScale cluster. These include:

Component Tasks
Audit Configure remote syslog servers for auditing.
Authentication Configure secure auth provider, SecurityAdmin account, and default restricted shell.
CELOG Create event channel for security officers and system admin to monitor /root and /var partition usage, audit service, security verification, and account creation.
MFA & SSO Enable and configure multi-factor authentication and single sign-on.
NTP Configure secure NTP servers with SHA256 keys.
SMB Configure SMB global settings and defaults.

Enable SMB encryption on shares.

SNMP Enable SNMP and configure SNMPv3 settings.
SyncIQ Configure SyncIQ to use CA certificates so both the source and target clusters (primary and secondary DSCs) have both Server Authentication and Client Authentication set in their Extended Key Usages fields.

Over the course of the next two blog articles, we’ll cover the specific configuration requirements and details of each of these components via the OneFS CLI.

In this article, we’ll focus on the first four tasks in the list: audit, authentication, CELOG, and MFA & SSO setup.

Audit Setup

  1. To set up secure auditing, first configure the remote syslog server(s). Note that, while OneFS differentiates between configuration, protocol, and system auditing, these can all be sent to the same central syslog server(s). When complete, these syslog servers can be added to the OneFS audit configuration via the following CLI syntax:
# isi audit settings global modify --config-syslog-servers=[server FQDN/IP] --protocol-syslog-servers=[server FQDN/IP] --system-syslog-servers=[server FQDN/IP]
  2. Also consider adding the cluster certificate to the audit settings for mutual Transport Layer Security (TLS) authentication.
# isi audit certificates syslog import [certificate_path] [key_path]

To prevent unauthorized access to the private key/certificate, the recommendation is to remove certificate and private key files once the necessary import steps have been completed.

Authentication Setup

  1. Set the default shell for any new users created in the Local Provider.
# isi auth local modify System --login-shell=/usr/local/restricted_shell/bin/restricted_shell.py
  2. Next, configure the remote authentication provider. This could be Kerberos, LDAP, or Active Directory. For more information, see the OneFS 9.5 CLI Administration Guide.

Note that all Active Directory users must have an e-mail address configured for them for use with ADFS multi-factor authentication (MFA).

Every Active Directory user must have a home directory created on the cluster, containing the correct public key in ~/.ssh/authorized_keys for the certificate presented by SSH clients (SecureCRT, PuTTY-CAC, etc).

If using Active Directory, the recommendation is to enable LDAP encryption, commonly referred to as ‘LDAP sign and seal’. For example:

# isi auth ads modify [provider-name] --ldap-sign-and-seal true

Additionally, the ‘machine password lifespan’ should be configured to a value of 60 days or less:

# isi auth ads modify [provider-name] --machine-password-lifespan=60D

Where [provider-name] is the name of the chosen Active Directory provider.

  3. Finally, identify a remote-authenticated user and assign them administrative privileges.
# isi auth roles modify SecurityAdmin --add-user [username]

# isi auth roles modify SystemAdmin --add-user [username]

Where [username] is the name of the chosen administrative user.

CELOG Setup

  1. For CELOG security setup, create an event channel for the required ISSO/SA alerts and configure the appropriate event thresholds.

The following events need to send alerts on a channel monitored by an organization’s Information Systems Security Officers (ISSOs) or System Administrators (SAs):

Event ID Event
100010001 The /var partition is near capacity.
100010002 The /var/crash partition is near capacity.
100010003 The root partition is near capacity.
400160002 Audit system cannot provide service.
400160005 Audit daemon failed to persist events.
400200001 Security verification check failed.
400200002 Security verification successfully ran.
400260000 User account(s) created/updated/removed.

The event channel can be created as follows:

# isi event channels create [channel name] [type] [options]

Next, the thresholds for the above event IDs can be set:

# isi event thresholds modify 100010001 --info 74 --warn 75

# isi event thresholds modify 100010002 --warn 75

# isi event thresholds modify 100010003 --warn 75

# isi event alerts create [event name 1] NEW [channel name]

 --eventgroup 100010001  --eventgroup 100010002 --eventgroup 100010003

 --eventgroup 400160002 --eventgroup 400160005 --eventgroup 400200001

 --eventgroup 400200002 --eventgroup 400260000



# isi event alerts create [event name 2] SEVERITY_INCREASE [channel name]

 --eventgroup 100010001 --eventgroup 100010002 --eventgroup 100010003

 --eventgroup 400160002 --eventgroup 400160005 --eventgroup 400200001

 --eventgroup 400200002 --eventgroup 400260000

Where:

Item Description
[channel name] The name of the newly configured event channel.
[event name 1] and [event name 2] The names of the events that will trigger alerts when a new event occurs or when an event increases in severity, respectively.
Multi-Factor Authentication (MFA)/Single Sign-On (SSO) Setup

  1. First, configure the SSO service provider. This can be achieved as follows:
# isi auth sso sp modify --hostname=[node IP or cluster FQDN]

Where [node IP or cluster FQDN] is the IP address of a node in the PowerScale cluster or the fully qualified domain name (FQDN) of the PowerScale cluster.

  2. Next, configure the Identity Provider (IdP) as follows:
# isi auth sso idps create [name] [options]
  3. Enable MFA/SSO.
# isi auth sso settings modify --sso-enabled=true

At this point, we’ve covered the configuration and setup of the first four components in the list.

In the next article in this series, we’ll focus on the remaining topics: namely, secure NTP, SMB, SNMP, and SyncIQ configuration.