OneFS Job Engine SmartThrottling Architecture

Prior to SmartThrottling, the native Job Engine resource monitoring and processing framework allowed jobs to be throttled based on both CPU and disk I/O metrics. This legacy process still operates in OneFS 9.8 when SmartThrottling is not running. The coordinator itself does not communicate directly with the worker threads, but rather with the director process, which in turn instructs a node’s manager process for a particular job to cut back threads.

For example, if the Job Engine is running a job with LOW impact and CPU utilization drops below the threshold, the worker thread count is gradually increased up to the maximum defined by the LOW impact policy threshold. If client load on the cluster suddenly spikes, the number of worker threads is gracefully decreased. The same principle applies to disk I/O, where the Job Engine throttles back in relation to both IOPS as well as the number of I/O operations waiting to be processed in any drive’s queue. Once client load has decreased again, the number of worker threads is correspondingly increased to the maximum LOW impact threshold.

Every 20 seconds, the coordinator process gathers cluster CPU and individual disk I/O load data from all the nodes across the cluster. The coordinator uses this information, in combination with the job impact configuration, to determine how many threads may run on each cluster node to service each running job. This number can be fractional, and fractional thread counts are achieved by having a thread sleep for a given percentage of each second.
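As a rough illustration of the fractional thread idea, the following Python sketch splits an allocation into whole threads plus one part-time thread, and works out how long that part-time thread would work and sleep in each second. The function names and the simple one-second duty cycle are assumptions made for illustration; they are not the Job Engine's actual implementation.

```python
import math


def split_thread_allocation(allocation: float) -> tuple[int, float]:
    """Split a fractional thread allocation into full threads and a leftover fraction."""
    if allocation < 0:
        raise ValueError("allocation must be non-negative")
    full = math.floor(allocation)
    return full, allocation - full


def duty_cycle(fraction: float, period_s: float = 1.0) -> tuple[float, float]:
    """Return (work_seconds, sleep_seconds) for a thread active for `fraction` of each period."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    work = fraction * period_s
    return work, period_s - work


# Hypothetical allocation of 2.25 threads on one node.
full, frac = split_thread_allocation(2.25)
work_s, sleep_s = duty_cycle(frac)
print(f"{full} full threads; part-time thread works {work_s:.2f}s and sleeps {sleep_s:.2f}s per second")
```

In this model, an allocation of 2.25 threads means two threads run continuously while a third sleeps for three quarters of every second.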

Using this CPU and disk I/O load data, every 60 seconds the coordinator evaluates how busy the various nodes are and makes a job throttling decision, instructing the various Job Engine processes as to the action they need to take. This enables throttling to be sensitive to workloads in which CPU and disk I/O load metrics yield different results. Additionally, separate load thresholds are tailored to the different classes of drives used in OneFS-powered clusters, including high-speed SAS drives, lower-performance SATA disks, and flash-based solid state drives (SSDs).

The Job Engine allocates a specific number of threads to each node by default, thereby controlling the impact of a workload on the cluster. If little client activity is occurring, more worker threads are spun up to allow more work, up to a predefined worker limit. For example, the worker limit for a LOW impact job might allow one or two threads per node to be allocated, a MEDIUM impact job from four to six threads, and a HIGH impact job a dozen or more. When this worker limit is reached (or before, if client load triggers impact management thresholds first), worker threads are throttled back or terminated.

For example, a node has four active threads, and the coordinator instructs it to cut back to three. The fourth thread is allowed to finish the individual work item it is currently processing, but then quietly exits, even though the task as a whole might not be finished. A restart checkpoint is taken for the exiting worker thread’s remaining work, and this task is returned to a pool of tasks requiring completion. This unassigned task is then allocated to the next worker thread that requests a work assignment, and processing continues from the restart checkpoint. This same mechanism applies in the event that multiple jobs are running simultaneously on a cluster.
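The cut-back behavior described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the Task class, the run_worker function, and the stop_after parameter (standing in for the coordinator's cut-back instruction) are all hypothetical names, and a task is modeled as a plain list of work items.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    items: list[str]
    next_index: int = 0  # restart checkpoint: index of the first unprocessed item


def run_worker(task: Task, pool: deque, stop_after: int) -> list[str]:
    """Process work items one at a time; exit once `stop_after` items are done.

    The cut-back is only honored between items, so the item in progress
    always completes. Unfinished work goes back to the pool with its checkpoint.
    """
    processed = []
    while task.next_index < len(task.items):
        processed.append(task.items[task.next_index])
        task.next_index += 1
        if len(processed) >= stop_after:
            break
    if task.next_index < len(task.items):
        pool.append(task)  # return the task, checkpoint included, for another worker
    return processed


pool: deque = deque()
task = Task("example-task", ["item-a", "item-b", "item-c", "item-d"])
first = run_worker(task, pool, stop_after=2)          # this worker is throttled back
second = run_worker(pool.popleft(), pool, stop_after=10)  # next worker resumes from the checkpoint
print(first, second)
```

The second worker picks up at the checkpoint rather than at the start of the task, so no item is processed twice.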

In contrast to this legacy Job Engine impact management process, SmartThrottling instead draws its metrics from the OneFS Partitioned Performance (PP) framework. This framework is the same telemetry source that SmartQoS uses to limit client protocol operations.

Under the hood, SmartThrottling operates as follows:

  1. First, Partitioned Performance directly monitors the cluster resource usage at the IRP layer, paying attention to the latencies of the critical client protocol load.
  2. Based on these PP metrics, the Job Engine then attempts to maintain latencies within a specified threshold.
  3. If the latencies approach the configured upper bound, PP directs the Job Engine to stop increasing the amount of work performed.
  4. If the latencies exceed those thresholds, then the Job Engine actively reduces the amount of work performed by quiescing job worker threads as necessary.
  5. There’s also a secondary throttling mechanism for situations when no protocol load exists, to prevent the Job Engine from commandeering all the cluster resources. This backup throttling monitors the drives, just in case there’s something else going on that’s causing the disks to become overloaded – and similarly attempts to maintain disk IO health within set limits.
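The decision logic in the steps above can be sketched as follows. This is an illustrative model only: the Action enum, the throttle_action function, the 0.8 "approach" ratio, and the numeric limits are assumptions, not values or interfaces taken from OneFS.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    INCREASE = "increase"   # headroom available: ramp up job work
    HOLD = "hold"           # approaching the limit: stop increasing
    DECREASE = "decrease"   # over the limit: quiesce worker threads


def throttle_action(
    protocol_latency_ms: Optional[float],
    latency_limit_ms: float,
    disk_load: float,
    disk_limit: float,
    approach_ratio: float = 0.8,  # hypothetical definition of "approaching"
) -> Action:
    # Steps 1-4: use client protocol latency when protocol load exists.
    # Step 5: with no protocol load, fall back to the drive metric.
    if protocol_latency_ms is not None:
        value, limit = protocol_latency_ms, latency_limit_ms
    else:
        value, limit = disk_load, disk_limit

    if value > limit:
        return Action.DECREASE
    if value >= approach_ratio * limit:
        return Action.HOLD
    return Action.INCREASE


print(throttle_action(5.0, 10.0, disk_load=0.2, disk_limit=1.0))
print(throttle_action(9.0, 10.0, disk_load=0.2, disk_limit=1.0))
print(throttle_action(None, 10.0, disk_load=1.5, disk_limit=1.0))
```

A job's impact setting would, in this model, translate into different limits and a different ramp rate for each impact level.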

The SmartThrottling thresholds, and the rate of ramping up or down the amount of work, differ based on the impact setting of a specific job. The actual Job impact configuration remains unchanged from earlier releases, and can still be set to Low, Medium, or High. And each job still has the same default impact level and priority, which can be further adjusted if desired.

Note that, since SmartThrottling is a newly introduced feature, it is disabled by default in OneFS 9.8 out of an abundance of caution, and needs to be manually enabled if you want it to run.

In the next article in this series, we’ll dig into the details of configuring and managing SmartThrottling.

OneFS Job Engine SmartThrottling

Within a PowerScale cluster, the OneFS Job Engine framework performs the background maintenance work on the cluster. It’s always there, but jobs come and go, and are run as necessary. Some of them are scheduled and executed automatically by the cluster, while others are run manually by cluster admins. Some of these jobs are very time critical, like FlexProtect, whose responsibility it is to reprotect data and help maintain a cluster’s availability and durability SLAs. Other jobs are less essential and perform general maintenance work, some optimizations, feature support, etc. And these can typically run with less criticality and a lower impact.

Some cluster administrators are blissfully unaware of the Job Engine’s existence, as it does its thing discreetly behind the scenes, while others are distinctly more familiar with it.

The Job Engine uses the same set of resources as any clients accessing the cluster. So the Job Engine has to manage how much CPU, memory, disk IO, etc, it uses, to avoid impinging upon client workloads. Obviously, if it consumes too much, client workloads will start to slow down, negatively impacting customer productivity. The Job Engine manages its impact on client activity based on a set of internal disk IO and CPU metrics. But, until now, it has not paid attention to client load performance directly. So for protocol activity, the Job Engine in OneFS 9.7 and earlier does not monitor whether or not the latencies of protocol operations increase due to the jobs it’s running. And unfortunately, sometimes this results in client workloads being impacted more than desired. So OneFS 9.8 attempts to directly address this undesirable situation.

At its core, SmartThrottling is the Job Engine’s new automatic impact management framework.

As such, it intelligently prioritizes primary client traffic, while automatically using any spare resources for cluster housekeeping.

It does this by dynamically throttling jobs forward and backward. And this means enhanced impact policy effectiveness, and improved predictability for cluster maintenance and data management tasks.

The read and write latencies of critical client protocol load are monitored, and SmartThrottling uses these metrics to keep the latencies within specified thresholds. As they approach the limit, the Job Engine stops increasing its work, and if latency exceeds the thresholds, it actively reduces the amount of work the jobs perform.

SmartThrottling also monitors the cluster’s drives and similarly maintains disk IO health within set limits. The actual job impact configuration remains unchanged in OneFS 9.8, and each job still has the same default level and priority as in prior releases.

But before we get into the nitty gritty of SmartThrottling, first, a quick Job Engine refresher.

The OneFS Job Engine itself is based on a delegation hierarchy made up of coordinator, director, manager, and worker processes.

Once the work is initially allocated, the Job Engine uses a shared work distribution model to process the work, and a unique Job ID identifies each job. When a job is launched, whether it is scheduled, started manually, or responding to a cluster event, the Job Engine spawns a child process from the isi_job_d daemon running on each node. This Job Engine daemon is also known as the parent process.

The Job Engine’s orchestration and job execution is handled by the coordinator process. Any node can act as the coordinator, and its principal responsibilities include:

  • Monitoring workload and the constituent nodes’ status
  • Controlling the number of worker threads per node and clusterwide
  • Managing and enforcing job synchronization and checkpoints

While the individual nodes manage the actual work item allocation, the coordinator node takes control, divvies up the job, and evenly distributes the resulting tasks across the nodes in the cluster. The coordinator also periodically sends messages, through the director processes, instructing the managers to increment or decrement the number of worker threads as appropriate.

The coordinator is also responsible for starting and stopping jobs, and for processing work results as they are returned during job processing. Should it die for any reason, the coordinator responsibility automatically moves to another node.

Each node in the cluster has a Job Engine director process, which runs continuously and independently in the background. The director process is responsible for monitoring, governing, and overseeing all Job Engine activity on a particular node, constantly waiting for instruction from the coordinator to start a new job. The director process serves as a central point of contact for all the manager processes running on a node and as a liaison with the coordinator process across nodes. These responsibilities include manager process creation, delegating to and requesting work from other peers, and communicating status.

The manager process is responsible for arranging the flow of tasks and task results throughout the duration of a job. The various manager processes request and exchange work with each other and supervise the worker threads assigned to them. At any time, each node in a cluster can have up to three manager processes, one for each job currently running.

Each manager controls and assigns work items to multiple worker threads working on items for the designated job. Under direction from the coordinator and director, a manager process maintains the appropriate number of active threads for a configured impact level, and for the node’s current activity level. Once a job has been completed, the manager processes associated with that job, across all the nodes, are terminated. New managers are automatically spawned when the next job begins.

The manager processes on each node regularly send updates to their respective node’s director, which, in turn, informs the coordinator process of the status of the various worker tasks.

Each worker thread is given a task, if available, which it processes item-by-item until the task is complete or the manager unassigns the task. You can query the status of the nodes’ workers by running the CLI command isi job statistics view. In addition to the number of current worker threads per node, the query also provides a sleep-to-work (STW) ratio average, giving an indication of the worker thread activity level on the node.

Towards the end of a job phase, the number of active threads decreases as workers finish their allotted work and become idle. Nodes that have completed their work items remain idle, waiting for the last remaining node to finish its work allocation. When all tasks are done, the job phase is considered to be complete, and the worker threads are terminated.

As jobs are processed, the coordinator consolidates the task status from the constituent nodes and periodically writes the results to checkpoint files. These checkpoint files allow jobs to be paused and resumed, either proactively or in the event of a cluster outage. For example, if the node on which the Job Engine coordinator is running goes offline for any reason, a new coordinator automatically starts on another node. This new coordinator reads the last consistency checkpoint file, job control and task processing resume across the cluster, and no work is lost.

Each Job Engine job has an associated impact policy, dictating when a job runs and the resources that a job can consume. The default Job Engine impact policies are as follows:

Impact policy | Schedule | Impact level
LOW | Any time of day | Low
MEDIUM | Any time of day | Medium
HIGH | Any time of day | High
OFF_HOURS | Outside of business hours (9 a.m. to 5 p.m., Monday to Friday), paused during business hours | Low

While these default impact policies cannot be modified or deleted, additional custom impact policies can be manually created as needed.

A mix of jobs with different impact levels results in resource sharing. No job can exceed the impact level set for it, and the aggregate impact level cannot exceed the highest level of the individual jobs.

In addition to the impact level, each Job Engine job also has a priority. These are based on a scale of one to ten, with a lower value signifying a higher priority. This is similar in concept to the UNIX ‘nice’ scheduling utility.

Higher-priority jobs cause lower-priority jobs to be paused. If a job is paused, it is returned to the back of the Job Engine priority queue. When the job reaches the front of the priority queue again, it resumes from where it left off. If the system schedules two jobs of the same type and priority level to run simultaneously, the job that was queued first runs first.

Priority takes effect when two or more queued jobs belong to the same exclusion set, or when, if exclusion sets are not a factor, four or more jobs are queued. The fourth queued job may be paused if it has a lower priority than the three other running jobs.
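To make the priority rule concrete, here is a minimal Python sketch of the case where exclusion sets are not a factor. It assumes a limit of three concurrently running jobs, ranks jobs by priority value (lower wins), and breaks ties by queue order. The schedule function and MAX_RUNNING constant are hypothetical names, and the four jobs are simply assumed not to share an exclusion set; the priorities come from the default table in this article.

```python
MAX_RUNNING = 3  # assumption: at most three jobs run concurrently


def schedule(queue: list[tuple[str, int]]) -> tuple[list[str], list[str]]:
    """Given (job name, priority) pairs in queue order, return (running, paused).

    A lower priority value means a higher priority; ties go to the job queued first.
    """
    ranked = sorted(enumerate(queue), key=lambda entry: (entry[1][1], entry[0]))
    running = [name for _, (name, _) in ranked[:MAX_RUNNING]]
    paused = [name for _, (name, _) in ranked[MAX_RUNNING:]]
    return running, paused


# Jobs listed in the order they were queued.
queue = [("FSAnalyze", 6), ("QuotaScan", 6), ("TreeDelete", 4), ("SnapshotDelete", 2)]
running, paused = schedule(queue)
print("running:", running)
print("paused:", paused)
```

Here QuotaScan is the job that waits: it has the same priority as FSAnalyze but was queued later, and both other jobs outrank it.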

In contrast to priority, job impact policy only comes into play once a job is running and determines the resources a job can use across the cluster.

The FlexProtect, FlexProtectLIN, and IntegrityScan jobs have the highest Job Engine priority level of 1, by default. Of these, the FlexProtect jobs, having the core role of reprotecting data, are the most important.

All Job Engine job priorities are configurable by the cluster administrator. The default priority settings are strongly recommended, particularly for the highest-priority jobs.

The default impact policy and relative priority settings for the range of Job Engine jobs are as follows. Typically, the elevated impact jobs are also run at an increased priority. Note that the recommendation is to keep the default impact and priority settings, where possible, unless there is a compelling reason to change them.

Job name Impact policy Priority
AutoBalance LOW 4
AutoBalanceLIN LOW 4
AVScan LOW 6
ChangelistCreate LOW 5
Collect LOW 4
ComplianceStoreDelete LOW 6
Deduplication LOW 4
DedupeAssessment LOW 6
DomainMark LOW 5
DomainTag LOW 6
FilePolicy LOW 6
FlexProtect MEDIUM 1
FlexProtectLIN MEDIUM 1
FSAnalyze LOW 6
IndexUpdate LOW 5
IntegrityScan MEDIUM 1
MediaScan LOW 8
MultiScan LOW 4
PermissionRepair LOW 5
QuotaScan LOW 6
SetProtectPlus LOW 6
ShadowStoreDelete LOW 2
ShadowStoreProtect LOW 6
ShadowStoreRepair LOW 6
SmartPools LOW 6
SmartPoolsTree MEDIUM 5
SnapRevert LOW 5
SnapshotDelete MEDIUM 2
TreeDelete MEDIUM 4
WormQueue LOW 6

The majority of Job Engine jobs are intended to run in the background with LOW impact. Notable exceptions are the FlexProtect jobs, which by default are set at MEDIUM impact. This allows FlexProtect to quickly and efficiently reprotect data without critically affecting other user activities.

In the next article in this series, we’ll delve into the architecture and operation of SmartThrottling.

PowerScale OneFS 9.8

It’s launch season here at Dell, and PowerScale is already scaling up spring with the introduction of the innovative OneFS 9.8 release, which shipped today (9th April 2024). This new 9.8 release has something for everyone, introducing PowerScale innovations in cloud, performance, serviceability, and ease of use.

APEX File Storage for Azure

After the debut of APEX File Storage for AWS last year, OneFS 9.8 amplifies PowerScale’s presence in the public cloud by introducing APEX File Storage for Azure.

In addition to providing the same OneFS software platform on-prem and in the cloud, and customer-managed for full control, APEX File Storage for Azure in OneFS 9.8 provides linear capacity and performance scaling from four up to eighteen SSD nodes and up to 3PB per cluster. This can make it a solid fit for AI, ML and analytics applications, as well as traditional file shares and home directories, and vertical workloads like M&E, healthcare, life sciences, and financial services.

PowerScale’s scale-out architecture can be deployed on customer managed AWS and Azure infrastructure, providing the capacity and performance needed to run a variety of unstructured workflows in the public cloud.

Once in the cloud, existing PowerScale investments can be further leveraged by accessing and orchestrating your data through the platform’s multi-protocol access and APIs.

This includes the common OneFS control plane (CLI, WebUI, and platform API), and the same enterprise features: Multi-protocol, SnapshotIQ, SmartQuotas, Identity management, etc.

Simplicity and Efficiency

OneFS 9.8 SmartThrottling is an automated impact control mechanism for the job engine, allowing the cluster to automatically throttle job resource consumption if it exceeds pre-defined thresholds, in order to prioritize client workloads.

OneFS 9.8 also delivers automatic on-cluster core file analysis, and SmartLog provides an efficient, granular log file gathering and transmission framework. Both of these new features help dramatically reduce the effort and time to resolution of cluster issues.

Performance

OneFS 9.8 also adds NFSv4.1 over Remote Direct Memory Access (RDMA) support for applications and clients, allowing substantially higher throughput performance, especially for single connection and read intensive workloads such as machine learning and generative AI model training, while also reducing both cluster and client CPU utilization. It also provides the foundation for interoperability with NVIDIA’s GPUDirect.

NFSv4.1 over RDMA in OneFS 9.8 leverages the RoCEv2 network protocol. OneFS CLI and WebUI configuration options include global enablement, IP pool configuration, and filtering and verification of RoCEv2-capable network interfaces. NFS over RDMA is available on all PowerScale platforms containing Mellanox ConnectX network adapters on the front end, with a choice of 25, 40, or 100 Gigabit Ethernet connectivity. The OneFS user interface helps easily identify which of a cluster’s NICs support RDMA.

Under the hood, OneFS 9.8 also introduces efficiencies such as lock sharding and parallel thread handling, delivering a substantial performance boost for streaming write heavy workloads, such as generative AI inferencing and model training. Performance scales linearly as compute is increased, keeping GPUs busy, and allowing PowerScale to easily support AI and ML workflows from small to large. Plus 9.8 also includes infrastructure support for future node hardware platform generations.

Multipath Client Driver

The addition of a new Multipath Client Driver helps expand PowerScale’s role in Dell’s strategic collaboration with NVIDIA, delivering the first and only end-to-end large scale AI system. This is based on the PowerScale F710 platform, in conjunction with PowerEdge XE9680 GPU servers, and NVIDIA’s Spectrum-X Ethernet switching platform, to optimize performance and throughput at scale.

In summary, OneFS 9.8 brings the following new features to the Dell PowerScale ecosystem:

 

Cloud
  • APEX File Storage for Azure.
  • Up to 18 SSD nodes and 3PB per cluster.

Simplicity
  • Job Engine SmartThrottling.
  • Source-based routing for IPv6 networks.

Performance
  • NFSv4.1 over RDMA.
  • Streaming write performance enhancements.
  • Infrastructure support for next generation all-flash node hardware platform.

Serviceability
  • Automatic on-cluster core file analysis.
  • SmartLog efficient, granular log file gathering.

 

We’ll be taking a deeper look at this new functionality in blog articles over the course of the next few weeks.

Meanwhile, the new OneFS 9.8 code is available on the Dell Online Support site, both as an upgrade and reimage file, allowing installation and upgrade of this new release.

OneFS Cluster Quorum – Part 2

The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance. This means that, when faced with a network partition, a choice must be made between consistency and availability. OneFS does not compromise on consistency, so a mechanism is required to manage a cluster’s transient state.

In order for a cluster to properly function and accept data writes, a quorum of nodes must be active and responding. A quorum is defined as a simple majority: a cluster with N nodes must have ⌊N/2⌋+1 nodes online in order to allow writes.

OneFS uses a quorum to prevent ‘split-brain’ conditions that can be introduced if the cluster should temporarily divide into two clusters. By following the quorum rule, the architecture guarantees that regardless of how many nodes fail or come back online, if a write takes place, it can be made consistent with any previous writes that have ever taken place.

Within OneFS, quorum is a property of the group management protocol (GMP) group which helps enforce consistency across node disconnects. As we saw in the previous article, since both nodes and drives in OneFS may be readable, but not writable, OneFS actually has two quorum properties:

Read quorum is represented by ‘efs.gmp.has_quorum’ and write quorum by ‘efs.gmp.has_super_block_quorum’. For example:

# sysctl efs.gmp.has_quorum

efs.gmp.has_quorum: 1

# sysctl efs.gmp.has_super_block_quorum

efs.gmp.has_super_block_quorum: 1

The value of ‘1’ for each of the above confirms that the cluster currently has both read and write quorum respectively.

In OneFS, a group is basically a list of nodes and drives which are currently participating in the cluster. Any nodes that are not in a cluster’s main quorum group form one or more separate groups. As such, the main purpose of the OneFS Group Management Protocol (GMP) is to help create and maintain a group of synchronized nodes. Having a consistent view of the cluster state is critical, since initiators need to know which nodes and drives are available to write to, etc.

The group of nodes with quorum is referred to as the ‘majority side’. Conversely, any node group without quorum is termed a ‘minority side’.

There can only be one majority group, but there may be multiple minority groups. A group which has one or more components in a failed state is called ‘degraded’. The degraded property is frequently used as an optimization to avoid checking the capabilities of each component. The term ‘degraded’ is also used to refer to components without their maximum capabilities.

The following table lists and describes the various terminology associated with OneFS groups and quorum.

Term | Definition
Degraded | A group which has one or more components in a failed state is called ‘degraded’.
Dynamic Aspect | The dynamic aspect refers to the state (and health) of nodes and drives which may change.
GMP | Group Management Protocol, which helps create and maintain a group of synchronized nodes. Having a consistent view of the cluster state is critical, since initiators need to know which nodes and drives are available to write to, etc.
Group | A group is a given set of nodes which have synchronized state.
Majority side | A group of nodes with quorum is referred to as the ‘majority side’. By definition, there can only be one majority group.
Minority side | Any group of nodes without quorum is a ‘minority side’. There may be multiple minority groups.
Quorum group | A group of nodes with quorum, referred to as the ‘majority side’.
Static Aspect | The static aspect is the composition of the cluster, which is stored in the array.xml file.

Under normal operating conditions, every node and its disks are part of the current group, which can be shown by running sysctl efs.gmp.group on any node of the cluster. For example, the complex group output from a 93 node cluster:

# sysctl efs.gmp.group

efs.gmp.group: <d70af9> (93) :{ 1-14:0-14, 15:0-13, 16-19:0-14, 20:0-13, 21-28,30-33:0-14, 34:0-4,6-10,12-14, 35-36:0-14, 37-48:0-19, 49-60:0-14, 61-62:0-13, 63-81:0-14, 82:0-7,9-14, 83-87:0-14, 88:0-13, 89-91:0-14, 92:0-1,3-14, down: 29, soft_failed: 29, read_only: 29, smb: 1-28,30-92, nfs: 1-28,30-92, swift: 1-28,30-92, all_enabled_protocols: 1-28,30-92, isi_cbind_d: 1-28,30-92, lsass: 1-28,30-92, s3: 1-28,30-92, external_connectivity: 1-28,30-92 }

As can be seen above, protocol and external network participation is also reported, in addition to the overall state of the nodes and drives in the group.

For more verbose output, the efs.gmp.current_info sysctl yields extensive current GMP information.

# sysctl efs.gmp.current_info

So a quorum group, as reported by GMP, consists of two parts:

Group component | Description
Sequence number | Provides identification for the group
Membership list | Describes the group

The sequence number in the example above is: <d70af9>

Next, the membership list shows the group members within brackets. For example, { 1-4:0-14 … } represents a four node pool, with Array IDs 1 through 4. Each node contains 15 drives, numbered zero through 14.

  • The numbers before the colon in the group membership list represent the participating Array IDs.
  • The numbers after the colon represent Drive IDs.
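The membership notation described above can be expanded programmatically. The following Python sketch parses the node:drive entries of a membership list into a mapping of Array ID to drive IDs. The function names are hypothetical, and the parsing rules (entries separated by a comma and space, ranges within an entry separated by a bare comma, flag entries such as "down: 29" skipped) are assumptions inferred from the sample output in this article rather than a documented format.

```python
def expand_ranges(spec: str) -> list[int]:
    """Expand a range list such as '0-4,6-10,12-14' into individual integers."""
    values = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        values.extend(range(int(lo), int(hi or lo) + 1))
    return values


def parse_membership(members: str) -> dict[int, list[int]]:
    """Map each Array ID in a membership list to its list of Drive IDs."""
    result = {}
    for entry in members.split(", "):
        nodes, sep, drives = entry.partition(":")
        if not sep or not nodes[0].isdigit():
            continue  # skip flag entries such as 'down: 29' or 'smb: 1-28,30-92'
        for node in expand_ranges(nodes):
            result[node] = expand_ranges(drives)
    return result


# Fragments taken from the group output shown earlier in this article.
group = parse_membership("1-4:0-14, 34:0-4,6-10,12-14, down: 29")
print(sorted(group))   # Array IDs present
print(group[34])       # drives 5 and 11 are absent from node 34's entry
```

A gap in a drive range, as with node 34 here, is a quick way to spot drives that are not currently participating in the group.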

Note that node IDs differ from Logical Node Numbers (LNNs), which are the node numbers that occur within node names and are displayed by isi stat.

GMP distributes a variety of state information about nodes and drives, from identifiers to usage statistics. The most fundamental of these is the composition of the cluster, or ‘static aspect’ of the group, which is stored in the array.xml file. The array.xml file also includes info such as the ID, GUID, and whether the node is diskless or storage, plus attributes not considered part of the static aspect, such as internal IP addresses.

Similarly, the state of a node’s drives is stored in the drives.xml file, along with a flag indicating whether the drive is an SSD. Whereas GMP manages node states directly, drive states are actually managed by the ‘drv’ module, and broadcast via GMP. A significant difference between nodes and drives is that, for nodes, the static aspect is distributed to every node in the array.xml file, since every node needs this information in order to define the cluster and form connections. In contrast, drives.xml is only stored locally on a node, so when a node goes down, other nodes have no method to obtain its drive configuration. Drive information may be cached by the GMP, but it is not available if that cache is cleared.

Conversely, ‘dynamic aspect’ refers to the state of nodes and drives which may change. These states indicate the health of nodes and their drives to the various file system modules – plus whether or not components can be used for particular operations. For example, a soft-failed node or drive should not be used for new allocations. These components can be in one of seven states:

Node State | Description
Dead | The component is not allowed to come back to the UP state and should be removed.
Down | The component is not responding.
Gone | The component has been removed.
Read-only | This state only applies to nodes.
Soft-failed | The component is in the process of being removed.
Stalled | A drive is responding slowly.
Up | The component is responding.

Note that a node or drive may go from ‘down, soft-failed’ to ‘up, soft-failed’ and back. These flags are persistently stored in the array.xml file for nodes and the drives.xml file for drives.

Group and drive state information allows the various file system modules to make timely and accurate decisions about how they should utilize nodes and drives. For example, when reading a block, the selected mirror should be on a node and drive where a read can succeed (if possible). File system modules use the GMP to test for node and drive capabilities, which include:

Capability | Description
Readable | Drives on this node may be read.
Restripe From | Move blocks away from the node.
Writable | Drives on this node may be written to.

Access levels define which states may be used only ‘as a last resort’, that is, states for which access should be avoided unless necessary. The access levels, in order of increased access, are as follows:

Access Level | Description
Normal | The default access level.
Read stalled | Allows reading from stalled drives.
Modify stalled | Allows writing to stalled drives.
Read soft-fail | Allows reading from soft-failed nodes and drives.
Never | Indicates a group state never supports the capability.

Drive state and node state capabilities are shown in the following tables. As shown, the only group states affected by increasing access levels are soft-failed and stalled.

 

Minimum Access Level for Capabilities Per Node State

Node States | Readable | Writable | Restripe From
UP | Normal | Normal | No
UP, Smartfail | Soft-fail | Never | Yes
UP, Read-only | Normal | Never | No
UP, Smartfail, Read-only | Soft-fail | Never | Yes
DOWN | Never | Never | No
DOWN, Smartfail | Never | Never | Yes
DOWN, Read-only | Never | Never | No
DOWN, Smartfail, Read-only | Never | Never | Yes
DEAD | Never | Never | Yes

 

Minimum Access Level for Capabilities Per Drive State

Drive States | Minimum Access Level to Read | Minimum Access Level to Write | Restripe From
UP | Normal | Normal | No
UP, Smartfail | Soft-fail | Never | Yes
DOWN | Never | Never | No
DOWN, Smartfail | Never | Never | Yes
DEAD | Never | Never | Yes
STALLED | Read_Stalled | Modify_Stalled | No

 

OneFS depends on a consistent view of a cluster’s group state. For example, some decisions, such as choosing lock coordinators, are made assuming all nodes have the same coherent notion of the cluster.

Group changes originate from multiple sources, depending on the particular state. Drive group changes are initiated by the drv module. Service group changes are initiated by processes opening and closing service devices. Each group change creates a new group ID, comprising a node ID and a group serial number. This group ID can be used to quickly determine whether a cluster’s group has changed, and is invaluable for troubleshooting cluster issues, by identifying the history of group changes across the nodes’ log files.

GMP provides coherent cluster state transitions using a process similar to two-phase commit, with the up and down states for nodes being directly managed by the GMP. The Remote Block Manager (RBM) code provides the communication channel that connects devices in OneFS. When a node mounts /ifs, it initializes the RBM in order to connect to the other nodes in the cluster, and uses it to exchange GMP information, negotiate locks, and access data on the other nodes.

When a group change occurs, a cluster-wide process writes a message describing the new group membership to /var/log/messages on every node. Similarly, if a cluster ‘splits’, the newly-formed sub-clusters behave in the same way: each node records its group membership to /var/log/messages. When a cluster splits, it breaks into multiple clusters (multiple groups). This is rarely, if ever, a desirable event. A cluster is defined by its group members. Nodes or drives which lose sight of other group members no longer belong to the same group and therefore no longer belong to the same cluster.

The ‘grep’ CLI utility can be used to view group changes from one node’s perspective, by searching /var/log/messages for the expression ‘new group’. This will extract the group change statements from the logfile. The output from this command may be lengthy, so it can be piped to the ‘tail’ command to limit it to the desired number of lines. For example, to get the last two group changes from the local node’s log:

# grep -i 'new group' /var/log/messages | tail -n 2

2024-03-25T16:47:22.114319+00:00 <0.4> TME1-8(id8) /boot/kernel.amd64/kernel: [gmp_info.c:2690](pid 63964="kt: gmp-drive-updat")(tid=101253) new group: <d70aac> (93) { 1-14:0-14, 15:0-13, 16-19:0-14, 20:0-13, 21-28,30-33:0-14, 34:0-4,6-10,12-14, 35-36:0-14, 37-48:0-19, 49-60:0-14, 61-62:0-13, 63-81:0-14, 82:0-7,9-14, 83-87:0-14, 88:0-13, 89-91:0-14, 92:0-1,3-14, down: 29, read_only: 29 }

2024-03-26T15:34:57.131337+00:00 <0.4> TME1-8(id8) /boot/kernel.amd64/kernel: [gmp_info.c:2690](pid 88332="kt: gmp-config")(tid=101526) new group: <d70aed> (93) { 1-14:0-14, 15:0-13, 16-19:0-14, 20:0-13, 21-28,30-33:0-14, 34:0-4,6-10,12-14, 35-36:0-14, 37-48:0-19, 49-60:0-14, 61-62:0-13, 63-81:0-14, 82:0-7,9-14, 83-87:0-14, 88:0-13, 89-91:0-14, 92:0-1,3-14, down: 29, soft_failed: 29, read_only: 29 }
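As a rough illustration of how two such entries can be compared, the following Python sketch extracts the group ID and the trailing properties (such as ‘down’ and ‘read_only’) from each line and reports what differs. The helper names are hypothetical, the sample lines are abbreviated, and the line format is inferred only from the two log entries above, so treat this as a starting point rather than a supported parser.

```python
import re

# Pattern inferred from the sample entries above (an assumption, not a documented format).
GROUP_RE = re.compile(r"new group: <(?P<gid>[0-9a-f]+)> \(\d+\) \{ (?P<body>.*) \}")

def parse_group_change(line):
    """Return (group_id, properties) for one 'new group' log line."""
    match = GROUP_RE.search(line)
    if match is None:
        raise ValueError("not a group change line")
    props = {}
    for entry in match.group("body").split(", "):
        # Node entries look like '1-14:0-14'; properties look like 'down: 29'.
        if ": " in entry:
            name, value = entry.split(": ", 1)
            props[name] = value
    return match.group("gid"), props

def diff_properties(old, new):
    """Map each property that differs to its (old, new) values."""
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }

# Abbreviated versions of the two entries shown above.
first = "new group: <d70aac> (93) { 1-14:0-14, 15:0-13, down: 29, read_only: 29 }"
second = "new group: <d70aed> (93) { 1-14:0-14, 15:0-13, down: 29, soft_failed: 29, read_only: 29 }"

gid_a, props_a = parse_group_change(first)
gid_b, props_b = parse_group_change(second)
print(gid_a, "->", gid_b, diff_properties(props_a, props_b))
```

For the two entries above, the only property that differs is ‘soft_failed’, which appears for node 29 in the second group change.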

OneFS and Cluster Quorum

I recently received a couple of enquiries about the role and effects of cluster quorum in OneFS, so thought it might be useful to revisit this, and associated concepts, in an article.

The premise was this:

A 3 node cluster at +2d:1n or +1n protection can run fine in a degraded mode with only two active nodes and one failed node:

Given the above, shouldn’t a 4 node cluster at +2n also be able to sustain a two node failure and run fine in degraded state with two active nodes?

Spoiler alert: The answer is no, and the reason is the OneFS cluster quorum requirement.

So what’s going on here?

In order for a cluster to properly function and accept data writes, a quorum of nodes must be active and responding. A quorum is defined as a simple majority: a cluster with N nodes must have ⌊N/2⌋+1 nodes online in order to allow writes. For example, in a seven-node cluster, four nodes would be required for a quorum. If a node or group of nodes is up and responsive, but is not a member of a quorum, it runs in a read-only state.
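The majority rule is simple enough to check by hand, but a small Python sketch makes the premise above concrete. The function names here are purely illustrative and are not part of OneFS.

```python
def quorum_size(total_nodes):
    """Simple majority: floor(N/2) + 1 nodes."""
    return total_nodes // 2 + 1

def has_quorum(total_nodes, nodes_online):
    """True if enough nodes are online to allow writes."""
    return nodes_online >= quorum_size(total_nodes)

print(quorum_size(7))    # 4
print(has_quorum(3, 2))  # True: two of three nodes is a majority
print(has_quorum(4, 2))  # False: two of four nodes is not a majority
```

This is why the three-node cluster keeps running with one node down, whereas the four-node cluster with two nodes down does not.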

OneFS uses a quorum to prevent ‘split-brain’ conditions that can be introduced if the cluster should temporarily divide into two clusters. By following the quorum rule, the architecture guarantees that regardless of how many nodes fail or come back online, if a write takes place, it can be made consistent with any previous writes that have ever taken place. The quorum also dictates the number of nodes required in order to move to a given data protection level. For an erasure-code-based protection-level of 𝑁+𝑀, the cluster must contain at least 2𝑀+1 nodes. For example, a minimum of five nodes is required for a +2n configuration:

This allows for a simultaneous loss of two nodes while still maintaining a quorum of three nodes for the cluster to remain fully operational.

If a cluster does drop below quorum, the file system will automatically be placed into a protected, read-only state, denying writes, but still allowing read access to the available data.

Within OneFS, quorum is a property of the group management protocol (GMP) group which helps enforce consistency across node disconnects. It is very similar to the common definition of quorum in distributed systems. It can be shown that requiring ⌊𝑁/2⌋+1 replicas to be available can guarantee that no updates are lost. Quorum performs this specific purpose within OneFS.

Since both nodes and drives in OneFS may be readable, but not writable, OneFS actually has two quorum properties:

Type | Description
Read quorum | Read quorum is defined as having at least ⌊𝑁/2⌋+1 nodes readable.
Write quorum | Write quorum is defined as having at least ⌊𝑁/2⌋+1 nodes writable.

Under the hood, OneFS read quorum is represented by the sysctl ‘efs.gmp.has_quorum’, and write quorum by ‘efs.gmp.has_super_block_quorum’. For example:

# sysctl efs.gmp.has_quorum

efs.gmp.has_quorum: 1

# sysctl efs.gmp.has_super_block_quorum

efs.gmp.has_super_block_quorum: 1

In the above example, the value of ‘1’ for each confirms that the cluster currently has both read and write quorum respectively.

Note that any nodes that are not in a cluster’s main quorum group may themselves form one or more separate groups. A group of nodes with quorum is referred to as the ‘majority side’. Similarly, any node group without quorum is termed a ‘minority side’. By definition, there can only be one majority group, but there may be multiple minority groups. A group which has one or more components in a failed state is called ‘degraded’. The degraded property is frequently used as an optimization to avoid checking the capabilities of each component. The term ‘degraded’ is also used to refer to components without their maximum capabilities.

For example, consider the earlier 4-node cluster example with a protection level of +2n and two nodes down. Even though the protection level can theoretically sustain two node failures, the minimum cluster size has been violated, hence the cluster cannot write due to lack of quorum. The following table lists various OneFS protection levels and their associated minimum cluster or pool sizes and quorum counts:

FEC Protection level | Failure Tolerance | Minimum Cluster/Pool Size | Minimum Quorum Size
+1 | Tolerate failure of 1 drive or 1 node | 3 nodes | 2 nodes
+2 | Tolerate failure of 2 drives or 2 nodes | 5 nodes | 3 nodes
+3 | Tolerate failure of 3 drives or 3 nodes | 7 nodes | 4 nodes
+4 | Tolerate failure of 4 nodes | 9 nodes | 5 nodes
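The size columns in this table follow directly from the 2M+1 rule and the simple-majority quorum definition. As a quick sanity check, here is a short Python sketch with illustrative function names that recomputes them for +1 through +4:

```python
def min_cluster_size(m):
    """Minimum nodes for +M protection: 2M + 1."""
    return 2 * m + 1

def min_quorum_size(m):
    """Simple majority of the minimum cluster size."""
    return min_cluster_size(m) // 2 + 1

for m in range(1, 5):
    print(f"+{m}: {min_cluster_size(m)} nodes, quorum of {min_quorum_size(m)}")
# +1: 3 nodes, quorum of 2
# +2: 5 nodes, quorum of 3
# +3: 7 nodes, quorum of 4
# +4: 9 nodes, quorum of 5
```

Note that at the minimum cluster size, losing M nodes leaves exactly M+1 nodes, which is precisely the quorum.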

The OneFS Job Engine also includes a process called Collect, which acts as an orphaned block collector. If a cluster splits during a write operation, some blocks that were allocated for the file may need to be re-allocated on the quorum side. This will ‘orphan’ allocated blocks on the non-quorum side. When the cluster re-merges, the Job Engine’s Collect job locates these orphaned blocks through a parallelized mark-and-sweep scan and reclaims them as free space for the cluster.

File system operations typically query a GMP group several times before completing. A group may change over the course of an operation, but the operation needs a consistent view. This is provided by the group info, which is the primary interface modules use to query group state.

The efs.gmp.group sysctl can be queried to determine the current group state of a cluster. For example:

# sysctl efs.gmp.group

efs.gmp.group: <8f8f4b> (92) :{ 1-14:0-14, 15:0-13, 16-19:0-14, 20:0-13, 21-33:0-14, 34:0-4,6-10,12-14, 35-36:0-14, 37-48:0-19, 49-60:0-14, 61-62:0-13, 63-81:0-14, 82:0-7,9-14, 83-87:0-14, 88:0-13, 89-91:0-14, 92:0-1,3-14, smb: 1-92, nfs: 1-92, swift: 1-92, all_enabled_protocols: 1-92, isi_cbind_d: 1-92, lsass: 1-92, s3: 1-92, external_connectivity: 1-92 }

As shown in this large cluster example above, the output includes the GMP’s group state, but also information about services provided by nodes in the cluster. This allows nodes in the cluster to discover when services change state on other nodes and take the appropriate action when this happens. An example is SMB lock expiry, which uses GMP service information to clean up locks held by other nodes when the service owning the lock goes down.
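For scripting purposes, the group string can be unpacked into per-node drive lists and per-service node lists. The Python sketch below is one possible approach. It assumes that an entry such as ‘1-14:0-14’ means nodes 1 through 14, each with drives 0 through 14, and that entries of the form ‘name: nodes’ list the nodes providing that service or property. The function names are hypothetical and the sample string is abbreviated.

```python
def expand_ranges(spec):
    """Expand a range list such as '0-4,6-10,12' into a list of integers."""
    values = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            values.extend(range(int(low), int(high) + 1))
        else:
            values.append(int(part))
    return values

def parse_group(output):
    """Split a group string into {node: drives} and {property: nodes}."""
    body = output[output.index("{") + 1 : output.rindex("}")].strip()
    drives, properties = {}, {}
    for entry in body.split(", "):
        if ": " in entry:
            # Service or state entry, for example 'smb: 1-92'.
            name, nodes = entry.split(": ", 1)
            properties[name] = expand_ranges(nodes)
        else:
            # Node entry, for example '34:0-4,6-10,12-14'.
            nodes, bays = entry.split(":", 1)
            for node in expand_ranges(nodes):
                drives[node] = expand_ranges(bays)
    return drives, properties

sample = "efs.gmp.group: <8f8f4b> (92) :{ 1-2:0-14, 34:0-4,6-10,12-14, smb: 1-2,34, nfs: 1-2 }"
drives, properties = parse_group(sample)
print(len(drives[34]), properties["smb"])  # 13 [1, 2, 34]
```

A node whose drive list has gaps, such as node 34 here, can then be flagged for closer inspection.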

Additional detailed current GMP state information can be gleaned from the output of the following sysctl:

# sysctl efs.gmp.current_info

Processes change the service state in GMP by opening and closing service devices. A particular service will transition from down to up in the GMP group when it opens the file descriptor for a device. Closing the service file descriptor will trigger a group change that reports the service as down. A process can explicitly close the file descriptor if it chooses, but most often the file descriptor will remain open for the duration of the process and closed automatically by the kernel when it terminates.

OneFS depends on a consistent view of a cluster’s group state. For example, in addition to read and write quorum, other decisions, such as choosing lock coordinators, are made assuming all nodes have the same coherent notion of the cluster.

As such, an understanding of OneFS quorum, groups, and their related group change messages allows you to determine the current health of a cluster – as well as reconstruct the cluster’s history when troubleshooting issues that involve cluster stability, network health, and data integrity.

Group changes originate from multiple sources, depending on the particular state. Drive group changes are initiated by the drv module. Service group changes are initiated by processes opening and closing service devices. Each group change creates a new group ID, comprising a node ID and a group serial number. This group ID can be used to quickly determine whether a cluster’s group has changed, and is invaluable for troubleshooting cluster issues, by identifying the history of group changes across the nodes’ log files.

GMP provides coherent cluster state transitions using a process similar to two-phase commit, with the up and down states for nodes being directly managed by the GMP. The Remote Block Manager (RBM) provides the communication channel that connects devices in OneFS. When a node mounts /ifs, it initializes the RBM in order to connect to the other nodes in the cluster, and uses it to exchange GMP info, negotiate locks, and access data on the other nodes.

Before /ifs is mounted, a ‘cluster’ is just a list of MAC and IP addresses in array.xml, managed by ibootd when nodes join or leave the cluster. When mount_efs is called, it must first determine what it’s contributing to the file system, based on the information in drives.xml. After a cluster (re)boot, the first node to mount /ifs is immediately placed into a group on its own, with all other nodes marked down. As the Remote Block Manager (RBM) forms connections, the GMP merges the connected nodes, enlarging the group until the full cluster is represented. A group transaction in which nodes transition to up is called a ‘merge’, whereas one in which a node transitions to down is called a ‘split’.
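The merge and split terminology can be illustrated with a toy Python model that compares the set of up nodes before and after a group change. This is purely a conceptual sketch with made-up names and node numbers, not a representation of how GMP is implemented.

```python
def classify_transition(old_up, new_up):
    """Label a group change by comparing the sets of nodes that are up."""
    labels = []
    if new_up - old_up:
        labels.append("merge")  # at least one node transitioned to up
    if old_up - new_up:
        labels.append("split")  # at least one node transitioned to down
    return labels

# After a reboot, node 1 mounts /ifs first and starts in a group on its own.
group = {1}
for connected in ({2, 3}, {4}):
    new_group = group | connected
    print(sorted(new_group), classify_transition(group, new_group))
    group = new_group

# Node 4 then drops out of the group.
print(classify_transition(group, group - {4}))
```

The first two changes print ‘merge’ as the group grows to four nodes, and the final change prints ‘split’.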

OneFS Software-Defined Persistent Memory Journal

Unlike previous platforms, which used NVDIMMs, the F710 and F210 nodes use a 32GB Software Defined Persistent Memory (SDPM) solution to provide persistent storage for the OneFS journal. This change also has the benefit of freeing up the DIMM slot that the NVDIMM occupied on previous platforms.

But before we get into the details, first, a quick refresher on the OneFS journal.

A primary challenge for any storage system is providing performance and ACID (atomicity, consistency, isolation, and durability) guarantees using commodity drives. Drives only support the atomicity of a single sector write, yet complex file system operations frequently update several blocks in a single transaction. For example, a rename operation must modify both the source and target directory blocks. If the system crashes or loses power during an operation that updates multiple blocks, the file system will be inconsistent if some updates are visible and some are not.

The journal is among the most critical components of a PowerScale node. When OneFS writes to a drive, the data goes straight to the journal, allowing for a fast reply.

OneFS uses journalling to ensure consistency across both disks locally within a node and disks across nodes.

Block writes go to the journal first, and a transaction must be marked as ‘committed’ in the journal before returning success to the file system operation. Once the transaction is committed, the change is guaranteed to be stable. If the node crashes or loses power, the changes can still be applied from the journal at mount time via a ‘replay’ process. The journal uses a battery-backed persistent storage medium, such as NVRAM, in order to be available after a catastrophic node event such as a crash or power loss. It must also be:

Journal Performance Characteristic | Description
High throughput | All blocks (and therefore all data) go through the journal, so it cannot become a bottleneck.
Low latency | Transaction state changes are often in the latency path multiple times for a single operation, particularly for distributed transactions.

The OneFS journal mostly operates at the physical level, storing changes to physical blocks on the local node. This is necessary because all initiators in OneFS have a physical view of the file system, and therefore issue physical read and write requests to remote nodes. The OneFS journal supports both 512 byte and 8KiB block sizes, for storing written inodes and blocks respectively.

By design, the contents of a node’s journal are only needed in a catastrophe, such as when memory state is lost. For fast access during normal operation, the journal is mirrored in RAM. Thus, any reads come from RAM and the physical journal itself is write-only in normal operation. The journal contents are read at mount time for replay. In addition to providing fast stable writes, the journal also improves performance by serving as a write-back cache for disks. When a transaction is committed, the blocks are not immediately written to disk. Instead, the write to disk is delayed until the journal space is needed. This allows the I/O scheduler to perform write optimizations such as reordering and clustering blocks. This also allows some writes to be elided when another write to the same block occurs quickly, or the write is otherwise unnecessary, such as when the block is freed.

So the OneFS journal provides the initial stable storage for all writes and does not release a block until it is guaranteed to be stable on a drive. This process involves multiple steps and spans both the file system and operating system. The high-level flow is as follows:

Step | Operation | Description
1 | Transaction prep | A block is written on a transaction, for example a write_block message is received by a node. An asynchronous write is started to the journal. The transaction prepare step will wait until all writes on the transaction complete.
2 | Journal delayed write | The transaction is committed. Now the journal issues a delayed write. This simply marks the buffer as dirty.
3 | Buffer monitoring | A daemon monitors the number of dirty buffers and issues the write to the drive upon reaching its threshold.
4 | Write completion notification | The journal receives an upcall indicating that the write is complete.
5 | Threshold reached | Once journal space runs low or an idle timeout expires, the journal issues a cache flush to the drive to ensure the write is stable.
6 | Flush to disk | When cache flush completes, all writes completed before the cache flush are known stable. The journal frees the space.
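To make the flow above more tangible, here is a deliberately simplified Python model of a write-back journal. It is a conceptual toy only: the class, its methods, and the slot-based capacity are all assumptions made for illustration and bear no relation to the actual OneFS implementation. It does, however, show why a second write to the same block before the drive write is issued costs only one drive write, and why journal space is freed only after a flush.

```python
class ToyJournal:
    """Toy model of the six-step flow above (illustrative, not OneFS code)."""

    def __init__(self, capacity):
        self.capacity = capacity  # number of journal slots
        self.entries = {}         # block -> data currently held in the journal
        self.dirty = set()        # committed, drive write not yet issued
        self.written = set()      # drive write complete, not yet flushed
        self.drive = {}           # contents known to be stable on the drive
        self.drive_writes = 0

    def commit(self, block, data):
        # Steps 1-2: the write lands in the journal and the buffer is marked dirty.
        if block not in self.entries and len(self.entries) >= self.capacity:
            # Journal space has run out, so force the remaining steps.
            self.write_dirty()
            self.flush()
        self.entries[block] = data
        self.dirty.add(block)
        self.written.discard(block)

    def write_dirty(self):
        # Steps 3-4: dirty buffers are written to the drive and completions noted.
        self.drive_writes += len(self.dirty)
        self.written |= self.dirty
        self.dirty.clear()

    def flush(self):
        # Steps 5-6: after a cache flush, completed writes are stable and space is freed.
        for block in self.written:
            self.drive[block] = self.entries.pop(block)
        self.written.clear()

journal = ToyJournal(capacity=2)
journal.commit(7, "first")
journal.commit(7, "second")  # overwrites block 7 before any drive write is issued
journal.write_dirty()
journal.flush()
print(journal.drive, journal.drive_writes)  # {7: 'second'} 1
```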

The F710 and F210 see the introduction of Dell’s VOSS M.2 SSD drive as the non-volatile device for the SDPM journal vault. The SDPM itself comprises two main elements:

Component | Description
BBU | The battery backup unit (BBU) pack supplies temporary power to the CPUs and memory, allowing them to perform a backup in the event of a power loss.
Vault | A 32GB M.2 NVMe to which the system memory is vaulted.

While the BBU is self-contained, the M.2 NVMe vault is housed within a VOSS module, and both components are easily replaced if necessary.

The following CLI command confirms the 32GB size of the SDPM journal in the F710 and F210 nodes:

# grep -r supported_size /etc/psi/psf

/etc/psi/psf/MODEL_F210/journal/JOURNAL_SDPM/journal-1.0-psi.conf:             supported_size = 34359738368;

/etc/psi/psf/MODEL_F710/journal/JOURNAL_SDPM/journal-1.0-psi.conf:             supported_size = 34359738368;

/etc/psi/psf/journal/JOURNAL_NVDIMM_1x16GB/journal-1.0-psi.conf:               supported_size = 17179869184;
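The supported_size values are reported in bytes. As a quick check, the following Python sketch converts the output above into GiB. The function name is illustrative, and the regular expression assumes the line layout shown in this particular output.

```python
import re

GREP_OUTPUT = """\
/etc/psi/psf/MODEL_F210/journal/JOURNAL_SDPM/journal-1.0-psi.conf:             supported_size = 34359738368;
/etc/psi/psf/MODEL_F710/journal/JOURNAL_SDPM/journal-1.0-psi.conf:             supported_size = 34359738368;
/etc/psi/psf/journal/JOURNAL_NVDIMM_1x16GB/journal-1.0-psi.conf:               supported_size = 17179869184;
"""

def journal_sizes_gib(text):
    """Map each config file path to its supported journal size in GiB."""
    sizes = {}
    for line in text.splitlines():
        match = re.match(r"(?P<path>\S+):\s+supported_size = (?P<size>\d+);", line)
        if match:
            sizes[match.group("path")] = int(match.group("size")) / 2**30
    return sizes

for path, gib in journal_sizes_gib(GREP_OUTPUT).items():
    print(f"{gib:.0f} GiB  {path}")
```

This yields 32 GiB for both SDPM entries and 16 GiB for the NVDIMM entry.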

The basic SDPM operation is illustrated in the diagram below:

Essentially, the protected memory state held in the node’s DDR5 RDIMMs, including any uncommitted writes, passes through the memory controller and the CPU and caching hierarchy, and is then vaulted to the non-volatile M.2 within the VOSS module.

The VOSS M.2 module itself comprises the following parts:

In the event of a failure, this entire carrier assembly is replaced, rather than just the M.2 itself.

Note that with the new VOSS, M.2 firmware upgrades are now managed by iDRAC using DUP, rather than by OneFS and the DSP as in prior PowerScale platforms.

Both the BBU and VOSS module are located at the front of the chassis, and are connected to the motherboard and power source as depicted by the red and blue lines in the following graphic:

Additionally, with OneFS 9.7, given the low latency IO characteristics of the drives, the PowerScale NVMe-based all-flash nodes also now have a write operation fast path direct to SSD for newly allocated blocks as shown below:

This is a major performance boost, particularly for streaming write workloads, and we’ll explore this more closely in a future article.

PowerScale F710 Platform Node

In this article, we’ll turn our focus to the new PowerScale F710 hardware node that was launched a couple of weeks back. Here’s where this new platform lives in the current hardware hierarchy:

The PowerScale F710 is a high-end all-flash platform that utilizes dual-socket 4th gen Xeon processors with 512GB of memory and ten NVMe drives, all contained within a 1RU chassis. Thus, the F710 offers a substantial hardware evolution from previous generations, while also focusing on environmental sustainability by reducing power consumption and carbon footprint, and still delivering blistering performance. This makes the F710 an ideal candidate for demanding workloads such as M&E content creation and rendering, high concurrency and low latency workloads such as chip design (EDA) and high frequency trading, and all phases of generative AI workflows.

An F710 cluster can comprise between 3 and 252 nodes. Inline data reduction, which incorporates compression, dedupe, and single instancing, is also included as standard to further increase the effective capacity.

The F710 is based on the 1U R660 PowerEdge server platform, with dual socket Intel Sapphire Rapids CPUs. Front-end networking options include 25 GbE or 100 GbE, with 100 GbE for the back-end network. As such, the F710’s core hardware specifications are as follows:

Attribute | F710 Spec
Chassis | 1RU Dell PowerEdge R660
CPU | Dual socket, 24 core Intel Sapphire Rapids 6442Y @2.6GHz
Memory | 512GB Dual rank DDR5 RDIMMs (16 x 32GB)
Journal | 1 x 32GB SDPM
Front-end network | 2 x 100GbE or 25GbE
Back-end network | 2 x 100GbE
NVMe SSD drives | 10

These node hardware attributes can be easily viewed from the OneFS CLI via the ‘isi_hw_status’ command. Also note that, at the current time, the F710 is only available in a 512GB memory configuration.

Starting at the business end of the node, the front panel allows the user to join an F710 to a cluster and displays the node’s name once it has successfully joined.

Removing the top cover, the internal layout of the F710 chassis is as follows:

The Dell ‘Smart Flow’ chassis is specifically designed for balanced airflow, and enhanced cooling is primarily driven by four dual-fan modules. Additionally, the redundant power supplies also contain their own air flow apparatus and can be easily replaced from the rear without opening the chassis.

For storage, each PowerScale F710 node contains ten NVMe SSDs, which are currently available in the following capacities and drive styles:

Standard drive capacity SED-FIPS drive capacity SED-non-FIPS drive capacity
3.84 TB TLC 3.84 TB TLC
7.68 TB TLC 7.68 TB TLC
15.36 TB QLC Future availability 15.36 TB QLC
30.72 TB QLC Future availability 30.72 TB QLC

Note that 15.36TB and 30.72TB SED-FIPS drive options are planned for future release.

Drive subsystem-wise, the PowerScale F710 1RU chassis is fully populated with ten NVMe SSDs. These are housed in drive bays spread across the front of the node as follows:

This is in contrast to, and provides improved density over its predecessor, the F600, which contains eight NVMe drives per node.

The NVMe drive connectivity is across PCIe lanes, and these drives use the NVMe and NVD drivers. The NVD is a block device driver that exposes an NVMe namespace like a drive, and is what most OneFS operations act upon. Each NVMe drive has a /dev/nvmeX, /dev/nvmeXnsX, and /dev/nvdX device entry, and the locations are displayed as ‘bays’. Details can be queried with OneFS CLI drive utilities such as ‘isi_radish’ and ‘isi_drivenum’. For example:

# isi_drivenum

Bay  0   Unit 15     Lnum 9     Active      SN:S61DNE0N702037   /dev/nvd5

Bay  1   Unit 14     Lnum 10    Active      SN:S61DNE0N702480   /dev/nvd4

Bay  2   Unit 13     Lnum 11    Active      SN:S61DNE0N702474   /dev/nvd3

Bay  3   Unit 12     Lnum 12    Active      SN:S61DNE0N702485   /dev/nvd2

Bay  4   Unit 19     Lnum 5     Active      SN:S61DNE0N702031   /dev/nvd9

Bay  5   Unit 18     Lnum 6     Active      SN:S61DNE0N702663   /dev/nvd8

Bay  6   Unit 17     Lnum 7     Active      SN:S61DNE0N702726   /dev/nvd7

Bay  7   Unit 16     Lnum 8     Active      SN:S61DNE0N702725   /dev/nvd6

Bay  8   Unit 23     Lnum 1     Active      SN:S61DNE0N702718   /dev/nvd1

Bay  9   Unit 22     Lnum 2     Active      SN:S61DNE0N702727   /dev/nvd10

Moving to the back of the chassis, the rear of the F710 contains the power supplies, network, and management interfaces, which are arranged as follows:

The F710 nodes are available in the following networking configurations, with a 25Gb or 100Gb Ethernet front-end and a 100Gb Ethernet back-end:

Front-end NIC | Back-end NIC | F710 NIC Support
100GbE | 100GbE | Yes
100GbE | 25GbE | No
25GbE | 100GbE | Yes
25GbE | 25GbE | No

Note that, like the F210, an InfiniBand backend is not supported on the F710 at the current time.

Compared with its F600 predecessor, the F710 sees a number of hardware performance upgrades. These include a move to PCIe Gen5, Gen4 NVMe, DDR5 memory, Sapphire Rapids CPUs, and a new software-defined persistent memory (SDPM) file system journal. Also, the 1GbE management port has moved to LAN-on-Motherboard (LOM), whereas the DB9 serial port is now on a RIO card. Firmware-wise, the F710 and OneFS 9.7 require a minimum of NFP 12.0.

In terms of performance, the new F710 provides a considerable leg up on both the previous generation F600 and F600 prime. This is particularly apparent with NFSv3 streaming reads, as can be seen below:

Given its additional drives (ten SSDs versus eight for the F600s) plus this performance disparity, the F710 does not currently have any other compatible node types. This means that, unlike the F210, the minimum F710 configuration requires the addition of a three node pool.

PowerScale F210 Platform Node

In this article, we’ll take a quick peek at the new PowerScale F210 hardware platform that was released last week. Here’s where this new node sits in the current hardware hierarchy:

The PowerScale F210 is an entry level, performant, all-flash platform that utilizes NVMe SSDs and a single-socket CPU 1U PowerEdge platform with 128GB of memory per node. The ideal use cases for the F210 include high performance workflows, such as M&E, EDA, AI/ML, and other HPC applications.

An F210 cluster can comprise between 3 and 252 nodes, each of which contains four 2.5” drive bays populated with a choice of 1.92TB, 3.84TB, or 7.68TB TLC, or 15.36TB QLC enterprise NVMe SSDs. Inline data reduction, which incorporates compression, dedupe, and single instancing, is also included as standard and enabled by default to further increase the effective capacity.

The F210 is based on the 1U R660 PowerEdge server platform, with a single socket Intel Sapphire Rapids CPU.

The node’s front panel has limited functionality compared to older platform generations and simply allows the user to join a node to a cluster and display the node name once the node has successfully joined.

An F210 node’s serial number can be found either by viewing /etc/isilon_serial_number or via the following CLI command syntax. For example:

# isi_hw_status | grep SerNo
  SerNo: HVR3FZ3

The serial number reported by OneFS will match that of the service tag attached to the physical hardware and the /etc/isilon_system_config file will report the appropriate node type. For example:

# cat /etc/isilon_system_config
PowerScale F210

Under the hood, the F210’s core hardware specifications are as follows:

Attribute | F210 Spec
Chassis | 1RU Dell PowerEdge R660
CPU | Single socket, 12 core Intel Sapphire Rapids 4410Y @2GHz
Memory | 128GB Dual rank DDR5 RDIMMs (8 x 16GB)
Journal | 1 x 32GB SDPM
Front-end network | 2 x 100GbE or 25GbE
Back-end network | 2 x 100GbE or 25GbE
NVMe SSD drives | 4

The node hardware attributes can be gleaned from OneFS by running the ‘isi_hw_status’ CLI command. For example:

f2101-1# isi_hw_status -c

  HWGen: PSI

Chassis: POWEREDGE (Dell PowerEdge)

    CPU: GenuineIntel (2.00GHz, stepping 0x000806f8)

   PROC: Single-proc, 12-HT-core

    RAM: 102488403968 Bytes

   Mobo: 0MK29P (PowerScale F210)

  NVRam: NVDIMM (NVDIMM) (8192MB card) (size 8589934592B)

 DskCtl: NONE (No disk controller) (0 ports)

 DskExp: None (No disk expander)

PwrSupl: PS1 (type=AC, fw=00.1B.53)

PwrSupl: PS2 (type=AC, fw=00.1B.53)

The health of the CPU and power supplies can be quickly verified as follows:

# isi_hw_status -s

Power Supplies OK

Power Supply PS1 good

Power Supply PS2 good

CPU Operation (raw 0x881B0000)  = Normal

Additionally, the ‘-A’ flag (All) can be used with ‘isi_hw_status’ to query a plethora of hardware and environmental information.

Node and drive firmware versions can also be checked with the ‘isi_firmware_tool’ utility. For example:

f2101-1# isi_firmware_tool --check

Ok

f2101-1# isi_firmware_tool --show

Thu Oct 26 11:42:32 2023 - Drive_Support_v1.46.tgz

Thu Oct 26 11:42:58 2023 - IsiFw_Package_v11.7qa1.tar

The internal layout of the F210 chassis with the risers removed is as follows:

The cooling is primarily driven by four dual-fan modules, which can be easily accessed and replaced as follows:

Additionally, the power supplies also contain their own air flow apparatus, and can be easily replaced from the rear without opening the chassis.

For storage, each PowerScale F210 node contains four NVMe SSDs, which are currently available in the following capacities and drive styles:

Standard drive capacity SED-FIPS drive capacity SED-non-FIPS drive capacity
1.92 TB TLC 1.92 TB TLC

3.84 TB TLC 3.84 TB TLC

7.68 TB TLC 7.68 TB TLC

15.36 TB QLC Future availability 15.36 TB QLC

Note that a 15.36TB SED-FIPS drive option is planned for future release. Additionally, the 1.92TB drives in the F210 can also be short-stroke formatted for node compatibility with F200s containing 960GB SSD drives. More on this later in the article.

The F210’s NVMe SSDs populate the drive bays on the left front of the chassis, as illustrated in the following front view (with bezel removed):

Drive subsystem-wise, OneFS provides NVMe support across PCIe lanes, and the SSDs use the NVMe and NVD drivers. The NVD is a block device driver that exposes an NVMe namespace like a drive, and is what most OneFS operations act upon. Each NVMe drive has a /dev/nvmeX, /dev/nvmeXnsX, and /dev/nvdX device entry, and the locations are displayed as ‘bays’. Details can be queried with OneFS CLI drive utilities such as ‘isi_radish’ and ‘isi_drivenum’. For example:

f2101-1# isi_drivenum
Bay 0   Unit 3      Lnum 0     Active      SN:BTAC2263000M15PHGN   /dev/nvd3
Bay 1   Unit 2      Lnum 2     Active      SN:BTAC226206VB15PHGN   /dev/nvd2
Bay 2   Unit 0      Lnum 1     Active      SN:BTAC226206R515PHGN   /dev/nvd0
Bay 3   Unit 1      Lnum 3     Active      SN:BTAC226207ER15PHGN   /dev/nvd1
Bay 4   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A
Bay 5   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A
Bay 6   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A
Bay 7   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A
Bay 8   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A
Bay 9   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A

As shown, the four NVMe drives occupy bays 0-3, with the remaining six bays unoccupied. These four drives and their corresponding PCI bus addresses can also be viewed via the following CLI command:
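If this bay listing needs to be consumed by a script, it can be parsed fairly easily. The Python sketch below uses a regular expression inferred from the column layout in the output above (shortened here to six bays). The function and field names are hypothetical, and the layout may differ on other platforms or releases.

```python
import re

DRIVENUM_OUTPUT = """\
Bay 0   Unit 3      Lnum 0     Active      SN:BTAC2263000M15PHGN   /dev/nvd3
Bay 1   Unit 2      Lnum 2     Active      SN:BTAC226206VB15PHGN   /dev/nvd2
Bay 2   Unit 0      Lnum 1     Active      SN:BTAC226206R515PHGN   /dev/nvd0
Bay 3   Unit 1      Lnum 3     Active      SN:BTAC226207ER15PHGN   /dev/nvd1
Bay 4   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A
Bay 5   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A
"""

# Column layout inferred from the sample output above (an assumption).
LINE_RE = re.compile(
    r"Bay\s+(\d+)\s+Unit\s+(\S+)\s+Lnum\s+(\S+)\s+(\S+)\s+SN:(\S+)\s+(\S+)"
)

def parse_drivenum(text):
    """Return one dict per bay line in the listing."""
    drives = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            bay, unit, lnum, state, serial, device = match.groups()
            drives.append({"bay": int(bay), "unit": unit, "lnum": lnum,
                           "state": state, "serial": serial, "device": device})
    return drives

drives = parse_drivenum(DRIVENUM_OUTPUT)
active = [d["bay"] for d in drives if d["state"] == "Active"]
print(active)  # [0, 1, 2, 3]
```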

f2101-1# pciconf -l | grep nvme
nvme0@pci0:155:0:0:     class=0x010802 card=0x219c1028 chip=0x0b608086 rev=0x00 hdr=0x00
nvme1@pci0:156:0:0:     class=0x010802 card=0x219c1028 chip=0x0b608086 rev=0x00 hdr=0x00
nvme2@pci0:157:0:0:     class=0x010802 card=0x219c1028 chip=0x0b608086 rev=0x00 hdr=0x00
nvme3@pci0:158:0:0:     class=0x010802 card=0x219c1028 chip=0x0b608086 rev=0x00 hdr=0x00

Comprehensive details and telemetry for individual drives are available via the ‘isi_radish’ CLI command using their /dev/nvdX device entry. For example, for /dev/nvd0:

f2101-1# isi_radish -a /dev/nvd0
Drive log page ca: Intel Vendor Unique SMART Log
              Key                              Attribute                                         Field                                                 Value
============================== ======================================== 
(5.0) (4.0)=(171) (0.0)              Program Fail Count                 Normalized Value                                        100
(5.0) (4.0)=(171) (0.1)                                                 Raw Value                                               0
(5.0) (4.0)=(172) (0.0)              Erase Fail Count                   Normalized Value                                        100
(5.0) (4.0)=(172) (0.1)                                                 Raw Value                                               0
(5.0) (4.0)=(173) (2.0)              Wear Leveling Count                Normalized Value                                        100
(5.0) (4.0)=(173) (2.1)                                                 Min. Erase Cycle                                        2
(5.0) (4.0)=(173) (2.2)                                                 Max. Erase Cycle                                        14
(5.0) (4.0)=(173) (2.3)                                                Avg. Erase Cycle                                        5
(5.0) (4.0)=(184) (1.0)              End to End Error Detection Count   Raw Value                                               0
(5.0) (4.0)=(234) (3.0)              Thermal Throttle Status            Percentage                                              0
(5.0) (4.0)=(234) (3.1)                                                 Throttling event count                                  0
(5.0) (4.0)=(243) (1.0)              PLL Lock Loss Count                Raw Value                                               0
(5.0) (4.0)=(244) (1.0)              NAND sectors written divided by .. Raw Value                                               3281155
(5.0) (4.0)=(245) (1.0)              Host sectors written divided by .. Raw Value                                               1445498
(5.0) (4.0)=(246) (1.0)              System Area Life Remaining         Raw Value                                               0
Drive log page de: DellEMC Unique Log Page

              Key                              Attribute                                         Field                                                 Value
============================== ======================================== ======================================================= ==================================================
(6.0)                            DellEMC Unique Log Page                Log Page Revision                                       2
(6.1)                                                                   System Aread Percent Used                               0
(6.2)                                                                   Max Temperature Seen                                    48
(6.3)                                                                   Media Total Bytes Written                               110097292328960
(6.4)                                                                   Media Total Bytes Read                                  176548657233920
(6.5)                                                                   Host Total Bytes Read                                   164172138545152
(6.6)                                                                   Host Total Bytes Written                                48502864347136
(6.7)                                                                   NAND Min. Erase Count                                   2
(6.8)                                                                   NAND Avg. Erase Count                                   5
(6.9)                                                                   NAND Max. Erase Count                                   14
(6.10)                                                                  Media EOL PE Cycle Count                                3000
(6.11)                                                                  Device Raw Capacity                                     15872
(6.12)                                                                  Total User Capacity                                     15360
(6.13)                                                                  SSD Endurance                                           4294967295
(6.14)                                                                  Command Timeouts                                        18446744073709551615
(6.15)                                                                  Thermal Throttle Count                                  0
(6.16)                                                                  Thermal Throttle Status                                 0
(6.17)                                                                  Short Term Write Amplification                          192
(6.18)                                                                  Long Term Write Amplification                           226
(6.19)                                                                  Born on Date                                            06212022
(6.20)                                                                  Assert Count                                            0
(6.21)                                                                  Supplier firmware-visible hardware revision             5
(6.22)                                                                  Subsystem Host Read Commands                            340282366920938463463374607431768211455
(6.23)                                                                  Subsystem Busy Time                                     340282366920938463463374607431768211455
(6.24)                                                                  Deallocate Command Counter                              0
(6.25)                                                                  Data Units Deallocated Counter                          165599450
Log Sense data (Bay 2/nvd0 ) --
Supported log pages 0x1 0x2 0x3 0x4 0x5 0x6 0x80 0x81

SMART/Health Information Log
============================
Critical Warning State:         0x00
 Available spare:               0
 Temperature:                   0
 Device reliability:            0
 Read only:                     0
 Volatile memory backup:        0
Temperature:                    307 K, 33.85 C, 92.93 F
Available spare:                100
Available spare threshold:      10
Percentage used:                0
Data units (512,000 byte) read: 320648767
Data units written:             94732208
Host read commands:             3779434531
Host write commands:            1243274334
Controller busy time (minutes): 33
Power cycles:                   93
Power on hours:                 2718
Unsafe shutdowns:               33
Media errors:                   0
No. error info log entries:     0
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   0
Total Time For Temperature 2:   0

SMART status is threshold NOT exceeded (Bay 2/nvd0 )
NAND Write Amplification: 2.269913, (Bay 2/nvd0 )

Error Information Log
=====================
No error entries found
Bay 2/nvd0  is Dell Ent NVMe SED P5316 RI 15.36TB FW:1.2.0 SN:BTAC226206R515PHGN, 30001856512 blks

                Attr                          Value
=================================== =========================
NAND Bytes Written                  3281155
Host Bytes Written                  1445498

Drive Attributes: (Bay 2/nvd0 )
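
The NAND write amplification value reported in the output above can be cross-checked against the two write counters in the attribute table. The following is a minimal sketch, assuming the reported figure is simply NAND writes divided by host writes (an inference from the numbers, not something the tool states):

```python
# Counter values copied from the drive output above.
nand_written = 3281155   # "NAND Bytes Written" (also attribute 244 raw value)
host_written = 1445498   # "Host Bytes Written" (also attribute 245 raw value)

# Assumption: write amplification = NAND writes / host writes.
write_amplification = nand_written / host_written
print(f"NAND write amplification: {write_amplification:.6f}")
```

The ratio works out to roughly 2.27, which agrees with the 2.269913 reported for this drive.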

In contrast, the rear of the F710 chassis contains the power supplies, network, and management interfaces, which are laid out as follows:

The F210 nodes are available in the following networking configurations, with a 25/100Gb Ethernet back-end and 25/100Gb Ethernet front-end:

Front-end NIC   Back-end NIC   F210 NIC Support
100GbE          100GbE         Yes
100GbE          25GbE          No
25GbE           100GbE         Yes
25GbE           25GbE          Yes

Note that there is currently no support for an F210 InfiniBand back-end in OneFS 9.7.

These NICs and their PCI bus addresses can be determined via the ‘pciconf’ CLI command, as follows:

f2101-1# pciconf -l | grep mlx
mlx5_core0@pci0:23:0:0: class=0x020000 card=0x005815b3 chip=0x101d15b3 rev=0x00 hdr=0x00
mlx5_core1@pci0:23:0:1: class=0x020000 card=0x005815b3 chip=0x101d15b3 rev=0x00 hdr=0x00
mlx5_core2@pci0:111:0:0:        class=0x020000 card=0x005815b3 chip=0x101d15b3 rev=0x00 hdr=0x00
mlx5_core3@pci0:111:0:1:        class=0x020000 card=0x005815b3 chip=0x101d15b3 rev=0x00 hdr=0x00
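
For scripted inventory checks, output like the above can be parsed into a simple lookup table. The following is a rough sketch, not a OneFS utility: the `parse_pciconf` helper and its regular expression are illustrative assumptions based only on the four sample lines shown.

```python
import re

# Sample lines copied from the 'pciconf -l | grep mlx' output above.
SAMPLE = """\
mlx5_core0@pci0:23:0:0: class=0x020000 card=0x005815b3 chip=0x101d15b3 rev=0x00 hdr=0x00
mlx5_core1@pci0:23:0:1: class=0x020000 card=0x005815b3 chip=0x101d15b3 rev=0x00 hdr=0x00
mlx5_core2@pci0:111:0:0:        class=0x020000 card=0x005815b3 chip=0x101d15b3 rev=0x00 hdr=0x00
mlx5_core3@pci0:111:0:1:        class=0x020000 card=0x005815b3 chip=0x101d15b3 rev=0x00 hdr=0x00
"""

# device@pciD:B:S:F: key=value key=value ...
LINE_RE = re.compile(r"^(?P<dev>\w+)@(?P<addr>pci\d+:\d+:\d+:\d+):\s+(?P<fields>.*)$")

def parse_pciconf(text):
    """Return {device_name: {"address": ..., "class": ..., "chip": ..., ...}}."""
    devices = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a device line
        fields = dict(kv.split("=", 1) for kv in match.group("fields").split())
        devices[match.group("dev")] = {"address": match.group("addr"), **fields}
    return devices

nics = parse_pciconf(SAMPLE)
for name, info in nics.items():
    print(name, info["address"], info["chip"])
```

The extracted addresses (for example pci0:23:0:0) use the same form as the ‘PCI Device Name’ field reported by mlxfwmanager, which makes it straightforward to correlate the two outputs.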

Similarly, the NIC hardware details and firmware versions can be viewed as follows:

f2101-1# mlxfwmanager
Device #1:
----------
  Device Type:      ConnectX6DX
  Part Number:      0F6FXM_08P2T2_Ax
  Description:      Mellanox ConnectX-6 Dx Dual Port 100 GbE QSFP56 Network Adapter
  PSID:             DEL0000000027
  PCI Device Name:  pci0:23:0:0
  Base GUID:        a088c20300052a3c
  Base MAC:         a088c2052a3c
  Versions:         Current        Available
     FW             22.36.1010     N/A
     PXE            3.6.0901       N/A
     UEFI           14.29.0014     N/A
  Status:           No matching image found

Device #2:
----------
  Device Type:      ConnectX6DX
  Part Number:      0F6FXM_08P2T2_Ax
  Description:      Mellanox ConnectX-6 Dx Dual Port 100 GbE QSFP56 Network Adapter
  PSID:             DEL0000000027
  PCI Device Name:  pci0:111:0:0
  Base GUID:        a088c2030005194c
  Base MAC:         a088c205194c
  Versions:         Current        Available
     FW             22.36.1010     N/A
     PXE            3.6.0901       N/A
     UEFI           14.29.0014     N/A
  Status:           No matching image found

Performance-wise, the new F210 is a relative powerhouse compared to the F200. This is especially true for NFSv3 streaming reads, as can be seen below:

OneFS node compatibility provides the ability to have similar node types and generations within the same node pool. In OneFS 9.7, compatibility between the F210 nodes and the previous generation F200 platform is supported.

Component   F200                                     F210
Platform    R640                                     R660
Drives      4 x SAS SSD                              4 x NVMe SSD
CPU         Intel Xeon Silver 4210 (Cascade Lake)    Intel Xeon Silver 4410Y (Sapphire Rapids)
Memory      96GB DDR4                                96GB DDR5

This compatibility facilitates the addition of individual F210 nodes to an existing node pool comprising three or more F200s if desired, rather than creating a new F210 node pool, despite the different drive subsystems across the two platforms and the performance profiles above. Because of these differences, however, the F210/F200 node compatibility is slightly more nuanced, and the F210 NVMe SSDs are considered ‘soft restriction’ compatible with the F200 SAS SSDs. Additionally, 1.92TB is the smallest drive capacity option available for the F210, and the only supported drive configuration for F200 compatibility.

In compatibility mode, the 1.92TB drives will be short-stroke formatted, resulting in a 960GB capacity per drive. Also note that, while the F210 is node pool compatible with the F200, the F210 is effectively throttled to match the performance envelope of the F200s, so a performance degradation is experienced.

When an F210 is added to the F200 node pool, OneFS will display the following WebUI warning message alerting to this ‘soft restriction’:

And similarly from the CLI:

PowerScale All-flash F710 and F210 Platform Nodes

Hot on the heels of the recent OneFS 9.7 release comes the launch of two new PowerScale F-series hardware offerings. Between them, these new F710 and F210 all-flash nodes add some major horsepower to the PowerScale stable.

Built atop the latest generation of Dell’s PowerEdge R660 platform, the F710 and F210 each boast a range of Gen4 NVMe SSD capacities, paired with a Sapphire Rapids CPU, a generous helping of DDR5 memory, and PCIe Gen5 100GbE front and back-end network connectivity – all housed within a compact, power-efficient 1RU form factor chassis.

Here’s where these new nodes sit in the current hardware hierarchy:

As illustrated in the greyed out region of the above chart, these new nodes refresh the current F600 and F200 platforms, and further extend PowerScale’s price-performance envelope.

The PowerScale F210 and F710 nodes offer a substantial hardware evolution from previous generations, while also focusing on environmental sustainability, reducing power consumption and carbon footprint. Housed in a 1RU ‘Smart Flow’ chassis for balanced airflow and enhanced cooling, both new platforms offer greater density than their F600 and F200 predecessors – the F710 now accommodating ten NVMe SSDs per node and 25% greater density, and the F210 now offering NVMe drives with a 15.36 TB option, and doubling the F200’s maximum density. Both platforms also include in-line compression and deduplication by default, further increasing their capacity headroom and effective density. Plus, using Intel’s 4th gen Xeon Sapphire Rapids CPUs results in 19% lower cycles-per-instruction, while PCIe Gen 5 quadruples throughput over Gen 3, and the latest DDR5 DRAM offers greater speed and bandwidth – all netting up to 90% higher performance per watt. Additionally, the F710 and F210 debut a new 32 GB Software Defined Persistent Memory (SDPM) file system journal, in place of NVDIMM-n in prior platforms, thereby saving a DIMM slot on the motherboard too.

On the OneFS side, the recently launched 9.7 release delivers a dramatic performance bump – particularly for the all-flash platforms. OneFS 9.7 benefits from latency-improving enhancements to its locking infrastructure and protocol heads – plus ‘direct write’ non-cached IO, which we will explore in a future article.

This combination of generational hardware upgrades plus OneFS 9.7 software advancements results in dramatic performance gains for the F710 and F210 – particularly for streaming reads and writes, which see a 2x or greater improvement over the prior F600 and F200 platforms. This makes the F710 and F210 ideal candidates for demanding workloads such as M&E content creation and rendering, high concurrency and low latency workloads such as chip design (EDA), high frequency trading, and all phases of generative AI workflows, etc.

Scalability-wise, both platforms require a minimum of three nodes to form a cluster (or node pool), with up to a maximum of 252 nodes, and the basic specs for the new nodes include:

Component                 PowerScale F710                                 PowerScale F210
CPU                       Dual-socket Intel Sapphire Rapids, 2.6GHz, 24C  Single-socket Intel Sapphire Rapids, 2GHz, 12C
Memory                    512GB DDR5 DRAM                                 128GB DDR5 DRAM
SSDs per node             10 x NVMe SSDs                                  4 x NVMe SSDs
Raw capacities per node   38.4TB to 307TB                                 7.7TB to 61TB
Drive options             3.84TB, 7.68TB TLC and 15.36TB, 30.72TB QLC     1.92TB, 3.84TB, 7.68TB TLC and 15.36TB QLC
Front-end network         2 x 100GbE or 25GbE                             2 x 100GbE or 25GbE
Back-end network          2 x 100GbE                                      2 x 100GbE or 25GbE
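
The raw capacity ranges in the table follow directly from the drive count and the smallest and largest drive options. As a quick worked check (the dictionary layout and helper name below are purely illustrative):

```python
# Drive counts and drive size options (in TB) taken from the table above.
platforms = {
    "F710": {"drives": 10, "sizes_tb": [3.84, 7.68, 15.36, 30.72]},
    "F210": {"drives": 4, "sizes_tb": [1.92, 3.84, 7.68, 15.36]},
}

def raw_capacity_range(drives, sizes_tb):
    """Raw capacity per node with the smallest and the largest drive option."""
    return drives * min(sizes_tb), drives * max(sizes_tb)

for model, spec in platforms.items():
    low, high = raw_capacity_range(spec["drives"], spec["sizes_tb"])
    print(f"{model}: {low:.2f}TB to {high:.2f}TB raw per node")
```

Allowing for rounding, the results line up with the 38.4TB to 307TB and 7.7TB to 61TB ranges quoted for the F710 and F210 respectively.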

Note that, while the F210 can coexist with the F200 in the same node pool, the F710 does not currently have any node pool compatibility peers.

Over the next couple of articles, we’ll dig into the technical details of each of the new platforms. But, in summary, when combined with OneFS 9.7, the new PowerScale all-flash F710 and F210 platforms quite simply deliver on efficiency, flexibility, performance, and scalability.

OneFS and Externally Managed Network Pools – Management and Monitoring

In the first article in this series, we took a look at the overview and architecture of the OneFS 9.7 externally managed network pools feature. Now, we’ll turn our focus to its management and monitoring.

From a cluster security point of view, the externally managed IP service has opened up a potential new attack vector whereby a rogue DHCP server could provide bad data. As such, the recommendation is to configure a firewall around this new OneFS DHCP service to ensure that the cluster is protected. While the OneFS firewall could in theory provide this protection, in order to know what the DHCP server is, the cluster first has to discover and talk to the DHCP server and get its IP. It would be somewhat paradoxical (and insecure) to create a firewall rule only after having already talked to and trusted the DHCP server.

The following table contains recommended configuration settings for the AWS firewall.

Setting             Value
Name                e.g. ‘DHCP’
Type                ‘ingress’
From Port           67
To Port             68
Protocol            UDP
CIDR Blocks         <cluster_gateway>/32
IPv6 CIDR Blocks    []
Security Group ID   // customer specific
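
As an illustration, the settings above map onto the `IpPermissions` structure that the boto3 EC2 client’s `authorize_security_group_ingress` call accepts. The sketch below only builds and prints that structure. The gateway address 10.0.0.1 and the `dhcp_ingress_rule` helper are hypothetical, and the actual API call is left as a comment since it needs credentials and a real security group ID.

```python
import ipaddress

def dhcp_ingress_rule(cluster_gateway, description="DHCP"):
    """Build an ingress permission allowing UDP 67-68 from the gateway only."""
    # ip_address() raises ValueError if the gateway is not a valid address.
    cidr = f"{ipaddress.ip_address(cluster_gateway)}/32"
    return {
        "IpProtocol": "udp",
        "FromPort": 67,
        "ToPort": 68,
        "IpRanges": [{"CidrIp": cidr, "Description": description}],
        "Ipv6Ranges": [],  # IPv6 is not used, matching the empty list in the table
    }

rule = dhcp_ingress_rule("10.0.0.1")  # hypothetical gateway address
print(rule)

# Applying the rule (not run here):
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="<security_group_id>", IpPermissions=[rule])
```

Restricting the source to a /32 means only the expected gateway can deliver DHCP traffic to the cluster interfaces.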

Note that, as mentioned in the first article in this series, there are currently a couple of instances of unsupported networking functionality in the APEX file services for AWS offering, as compared to on-prem OneFS, and these include:

  • IPv6 support
  • VLANs
  • Link aggregation
  • NFS over RDMA

These limitations for externally managed network pools are highlighted in red below, and are read-only settings since they are managed by the cloud provider (interfaces and IPs).

Externally managed network pools can only be created by the system with OneFS 9.7 and therefore pools cannot be manually reconfigured either to or from externally managed – even by root.

In general, manual IP configuration is protected in order to guard against accidental misconfiguration. However, cluster admins may occasionally be required to manually configure the IPs in the network pool, which can be performed with the ‘isi network pool modify’ command plus the inclusion of the ‘--force’ flag:

# isi network pool modify subnet0.pool0 --ranges <ip_add_range> --force

Note that AWS has a maximum threshold for the number of IPs that can be configured per network interface based on AMI instance type. If this limit is exceeded, AWS will prevent the IP address from being configured, resulting in a potential data unavailability event.  OneFS 9.7 now prevents most instances of IP oversubscription at configuration time in order to ensure availability during a 1/3 cluster outage.

While OneFS accounts for externally managed, static, and dynamic IPs, as well as SSIPs, it is unable to account for unevenly allocated dynamic IPs, and is therefore unable to prevent all instances of oversubscription.

OneFS also displays an informative error message if an oversubscribed configuration is attempted. For example, using an AMI instance type of ‘m5d.large’:

# isi network pool modify subnet0.pool0 --ranges 10.20.30.203-10.20.30.254

AWS only allows node 2 (instance type AWS=m5d.large) to have a maximum of 10 IPv4 addresses configured. In a degraded state, the requested configuration will result in node 2 attempting to configure 28 addresses, which will leave 18 address(es) unavailable. To resolve this, consider increasing the number of nodes in dynamic pools or reducing the number of IPv4 addresses.
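
To give a feel for the kind of arithmetic behind this check, here is a deliberately simplified sketch. It is not OneFS’s actual algorithm: it assumes a hypothetical three-node cluster, an even spread of the pool’s IPs across the surviving nodes when a third of the nodes are down, and it ignores the static, externally managed, and SSIP addresses that OneFS also counts. As a result, its figures will not match the message above exactly.

```python
import ipaddress

def range_size(first, last):
    """Number of addresses in an inclusive IPv4 range."""
    return int(ipaddress.ip_address(last)) - int(ipaddress.ip_address(first)) + 1

def degraded_load(total_ips, node_count, per_node_limit):
    """IPs per surviving node, and the excess over the limit, with 1/3 of nodes down."""
    surviving = node_count - node_count // 3
    per_node = -(-total_ips // surviving)  # ceiling division
    return per_node, max(0, per_node - per_node_limit)

requested = range_size("10.20.30.203", "10.20.30.254")
per_node, unavailable = degraded_load(requested, node_count=3, per_node_limit=10)
print(f"{requested} IPs requested; {per_node} per surviving node; "
      f"{unavailable} over the per-node limit")
```

The per-node limit of 10 comes from the error message; the node count is an assumption made for the example.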

When it comes to troubleshooting externally managed pools, there are two log files which are useful to check. Namely:

  • /var/log/dhclient.log
  • /var/log/isi_smartconnect

The first of these is a dedicated dhclient.log file for the new dhclient instance that OneFS 9.7 introduces. In contrast, the IP Merger and IP Reporter modules will output to the isi_smartconnect log.

There are also a handful of relevant system files that are worth being aware of, and these include:

  • /var/db/dhclient/lease.ena1
  • /ifs/.ifsvar/modules/flexnet/ip_reporter/DHCP/node.
  • /ifs/.ifsvar/modules/flexnet/pool_members/groupnet.1.subnet.1.pool.1
  • /ifs/.ifsvar/modules/smartconnect/resource/workers/ip_merger

The first of these, lease.ena1, is an append log maintained by dhclient, so the most recent lease in there is the one that SmartConnect is looking at. Note that there may be other lease files in the system, but only the lease files in /var/db/dhclient are relevant and viewed by SmartConnect. OneFS has a special configuration for dhclient to ensure this.

The IP reports live in the /ifs/.ifsvar/modules/flexnet directory. The pool_members directory has been present in OneFS for a number of years now. And OneFS now coordinates the IP merger with the file under the ./smartconnect/resource/workers/ directory.
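
Because the lease file is append-only, finding the current lease amounts to reading the last ‘lease { … }’ block. The following sketch illustrates the idea using made-up sample data in the standard dhclient lease file layout. The addresses, timestamps, and the `latest_lease` helper are all hypothetical, and this is not how SmartConnect itself is implemented.

```python
import re

# Hypothetical contents of a dhclient lease file: two appended leases.
SAMPLE_LEASES = """
lease {
  interface "ena1";
  fixed-address 10.0.1.25;
  option dhcp-lease-time 3600;
  renew 2 2024/01/09 10:30:00;
  expire 2 2024/01/09 11:00:00;
}
lease {
  interface "ena1";
  fixed-address 10.0.1.25;
  option dhcp-lease-time 3600;
  renew 2 2024/01/09 11:30:00;
  expire 2 2024/01/09 12:00:00;
}
"""

LEASE_BLOCK = re.compile(r"lease\s*\{(.*?)\}", re.DOTALL)

def latest_lease(text):
    """Return the fields of the last lease block as a dict, or None if empty."""
    blocks = LEASE_BLOCK.findall(text)
    if not blocks:
        return None
    fields = {}
    for statement in blocks[-1].split(";"):
        parts = statement.split(None, 1)
        if not parts:
            continue  # blank fragment after the final semicolon
        key, value = parts[0], parts[1] if len(parts) > 1 else ""
        if key == "option":  # "option <name> <value>": use the option name as key
            key, _, value = value.partition(" ")
        fields[key] = value.strip().strip('"')
    return fields

lease = latest_lease(SAMPLE_LEASES)
print(lease["interface"], lease["fixed-address"], lease["expire"])
```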

As for useful CLI commands, these include the following:

# isi_smartconnect_client action -a wake-ip-reporter

The ‘isi_smartconnect_client’ CLI utility, which can be used to interact with the SmartConnect daemon, gets an additional ‘wake-ip-reporter’ action in OneFS 9.7. Under normal circumstances, the IP Reporter only checks the contents of the lease file every five minutes. However, ‘wake-ip-reporter’ instructs IP Reporter to check the lease file immediately. So if, for example, dhclient has restarted for some reason, IP Reporter can be awoken and forced to read the lease, rather than waiting for its next scheduled check.
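
Conceptually, this is the familiar pattern of a periodic poller that can also be woken on demand. The sketch below illustrates that general pattern with a `threading.Event`. The `LeasePoller` class is hypothetical and says nothing about how IP Reporter is actually built.

```python
import threading

CHECK_INTERVAL_SECONDS = 5 * 60  # the five-minute interval described above

class LeasePoller:
    """Waits for either the scheduled interval or an explicit wake-up."""

    def __init__(self, interval=CHECK_INTERVAL_SECONDS):
        self.interval = interval
        self._wake = threading.Event()

    def wake(self):
        """Request an immediate check (analogous to 'wake-ip-reporter')."""
        self._wake.set()

    def wait_for_next_check(self):
        """Block until the next check is due and report what triggered it."""
        woken = self._wake.wait(timeout=self.interval)
        self._wake.clear()
        return "woken" if woken else "scheduled"

poller = LeasePoller(interval=0.05)   # short interval so the demo finishes quickly
print(poller.wait_for_next_check())   # no wake-up pending: the timeout elapses
poller.wake()
print(poller.wait_for_next_check())   # wake-up pending: returns immediately
```

The first call reports a scheduled check after the timeout elapses, while the second returns straight away because a wake-up was requested beforehand.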

Additionally, the following ‘log_level’ command arguments can be used to change the logging level of SmartConnect to the desired verbosity:

# isi_smartconnect_client log_level [-l | -r]

Note that, in OneFS 9.7, this no longer requires changing the Flexnet config file, as was necessary in prior releases.

Instead, this log level is reset when the process dies or the ‘-r’ argument is passed. It’s worth noting that this command does not operate cluster-wide. Rather, it just affects the current instance of SmartConnect running on the local node.

Another thing to be aware of when a cluster is using externally managed pools is that networking is dependent on, and can be impacted by, the availability of AWS’ DHCP servers. While the leased IP never changes, the leases themselves have an expiration of an hour. As such, if OneFS is unable to reach the DHCP server to renew, it may lose its Primary IPs. While this is often outside the realm of control, the OneFS CELOG event service will fire a critical warning alert (SW_SC_DHCP_LEASE_REBIND) before a primary IP expires. This alert will contain the following event description:

DHCP server has not responded to requests to renew lease on <interface>. Attempting to contact other DHCP servers. If we are unable to renew the lease, the IP address <ip_address> will be removed at expiry.
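
For context on the timing, a DHCP client normally tries to renew with its original server at time T1 and, failing that, starts rebinding with any available server at T2. RFC 2131 gives defaults of 50% and 87.5% of the lease duration for these timers. The sketch below applies those defaults to the one-hour lease mentioned above. Whether the DHCP servers in a given environment override T1 and T2 is not covered here, so treat the figures as indicative only.

```python
LEASE_SECONDS = 60 * 60  # one-hour lease, as described above

def default_dhcp_timers(lease_seconds):
    """RFC 2131 default renewal (T1) and rebinding (T2) times, in seconds."""
    return {
        "renew_t1": lease_seconds // 2,       # 0.5 x lease duration
        "rebind_t2": lease_seconds * 7 // 8,  # 0.875 x lease duration
        "expire": lease_seconds,
    }

timers = default_dhcp_timers(LEASE_SECONDS)
for name, seconds in timers.items():
    print(f"{name}: {seconds // 60}m {seconds % 60}s after the lease is granted")
```

With these defaults, the client would begin rebinding 52 minutes 30 seconds into the hour, leaving a fairly short window before the address expires.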

For example:

In addition to the above alert, there are several log messages that give a good indication of what may be amiss. These, and their resolution info, are summarized in the following table:

Log message: Unable to merge IP 1.2.3.4 on ext-1 from devid 1 – no matching pool found
Description: The IP is not configured in any network pool.
Resolution: Add the IP to the primary IP pool.

Log message: Unable to parse lease on NIC: ena1. Attempting to retrieve new lease
Description: The lease file generated by dhclient could not be read.
Resolution: None should be required. OneFS will automatically back up the old lease file and restart dhclient.

Log message: Lease on NIC: ena1 not found
Description: The lease file does not exist for the specified interface.
Resolution: OneFS will automatically restart dhclient.

Log message: Unexpected error comparing IP Reports. Attempting rewrite
Description: OneFS tries to dedupe writes by comparing the newly generated IP report with what is on disk.
Resolution: In the event of a failure, OneFS will just overwrite the report.

Log message: No IP Report received from DHCP External Manager
Description: OneFS is unable to determine its IP from the DHCP leases. It will continue retrying, but is currently unable to report an IP.
Resolution: If the issue persists, check on dhclient to ensure it is operating correctly.

Log message: Failed to write IP Report node. for DHCP to disk:
Description: OneFS is unable to report its IP to /ifs, so the IP merger is unable to update Flexnet/IP Assignments with this information.
Resolution: Check why SmartConnect is unable to write to /ifs. Is it read only?