PowerScale P100 & B100 Accelerators

In addition to a variety of software features, OneFS 9.3 also introduces support for two new PowerScale accelerator nodes. Based on the 1RU Dell PE R640 platform, these include the:

  • PowerScale P100 performance accelerator
  • PowerScale B100 backup accelerator.

Other than a pair of low capacity SSD boot drives, neither the B100 nor the P100 nodes contain any local storage or journal. Both accelerators are fully compatible with clusters containing current PowerScale and Gen6+ nodes, as well as the previous generation of Isilon Gen5 platforms. Also, unlike storage nodes, which must be added in pools of three or four similar nodes, a single P100 or B100 can be added to a cluster.

The P100 accelerator nodes can simply, and cost effectively, augment the CPU, RAM, and bandwidth of a network or compute-bound cluster without significantly increasing its capacity or footprint.

Since the accelerator nodes contain no local storage but have a sizable RAM footprint, they provide a substantial L1 cache; all data is fetched from other storage nodes. Cache aging is based on a least recently used (LRU) eviction policy. The P100 is available in two memory configurations, with either 384GB or 768GB of DRAM per node, and also supports both inline compression and deduplication.
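As a rough illustration of the LRU aging policy described above, the following Python sketch (purely illustrative, not OneFS code) evicts the least recently used entry once the cache is full:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch: reads refresh an entry's age, and the
    least recently used entry is evicted when capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None  # cache miss: data would be fetched from a storage node
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

A low-churn, read-heavy workload keeps refreshing the same entries, which is why this style of cache suits the streaming workloads described below.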

In particular, the P100 accelerator can provide significant benefit to serialized, read-heavy, streaming workloads by virtue of its substantial, low-churn L1 cache, helping to increase throughput and reduce latency. For example, a typical scenario for P100 addition could be a small all-flash cluster supporting a video editing workflow that is looking for a performance and/or front-end connectivity enhancement, but no additional capacity.

On the backup side, the PowerScale B100 contains a pair of 16Gb Fibre Channel ports, enabling direct or two-way NDMP backup from a cluster directly to tape or VTL, or across an FC fabric.

The B100 backup accelerator integrates seamlessly with current DR infrastructure, as well as with leading data backup and recovery software technologies, to satisfy the availability and recovery SLA requirements of a wide variety of workloads. The B100 can be added to a cluster containing current and prior generation all-flash, hybrid, and archive nodes.

The B100 aids overall cluster performance by offloading NDMP backup traffic directly to the FC ports and reducing CPU and memory consumption on storage nodes, thereby minimizing impact on front end workloads. This can be of particular benefit to clusters that have been using gen-6 nodes populated with FC cards. In these cases, a simple, non-disruptive addition of B100 node(s) will free up compute resources on the storage nodes, both improving client workload performance and shrinking NDMP backup windows.

Finally, the hardware specs for the new PowerScale P100 and B100 accelerator platforms are as follows:

Component (per node)    | P100                                              | B100
OneFS release           | 9.3 or later                                      | 9.3 or later
Chassis                 | PowerEdge R640                                    | PowerEdge R640
CPU                     | 20 cores (dual socket @ 2.4GHz)                   | 20 cores (dual socket @ 2.4GHz)
Memory                  | 384GB or 768GB                                    | 384GB
Front-end I/O           | Dual port 10/25GbE or dual port 40/100GbE         | Dual port 10/25GbE or dual port 40/100GbE
Back-end I/O            | Dual port 10/25GbE, dual port 40/100GbE, or dual port QDR InfiniBand | Dual port 10/25GbE, dual port 40/100GbE, or dual port QDR InfiniBand
Fibre Channel           | N/A                                               | Dual port 16Gb FC
Journal                 | N/A                                               | N/A
Data reduction support  | Inline compression and dedupe                     | Inline compression and dedupe
Power supply            | Dual redundant 750W 100-240V, 50/60Hz             | Dual redundant 750W 100-240V, 50/60Hz
Rack footprint          | 1RU                                               | 1RU
Cluster addition        | Minimum one node, single node increments          | Minimum one node, single node increments

 

OneFS S3 Protocol Enhancements

OneFS 9.3 adds some useful features to its S3 object protocol stack, including:

  • Chunked Upload
  • Delete Multiple Objects support
  • Non-slash delimiter support for ListObjects/ListObjectsV2

When uploading data to OneFS via S3, there are two types of uploading options for authenticating requests using the S3 Authorization header:

  • Transfer payload in a single chunk
  • Transfer payload in multiple chunks (chunked upload)

Applications that typically use the chunked upload option by default include Restic, Flink, Datadobi, and the AWS S3 Java SDK. The new 9.3 release enables these and other applications to work seamlessly with OneFS.

Chunked upload, as the name suggests, facilitates breaking the data payload into smaller units, or chunks, for more efficient upload. These can be fixed or variable-size, and chunking aids performance by avoiding reading the entire payload just to calculate the signature. Instead, a seed signature is calculated for the first chunk using only the request headers. The second chunk contains the signature for the first chunk, and each subsequent chunk contains the signature for the preceding one. At the end of the upload, a zero-byte chunk is transmitted containing the last chunk's signature. This protocol feature is described in more detail in the AWS S3 chunked upload documentation.
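The signature chaining described above can be sketched in a few lines of Python. Note this is a deliberately simplified illustration of the chaining idea only: the real AWS SigV4 streaming string-to-sign also includes the request date, credential scope, and other fields, and the key derivation is more involved.

```python
import hashlib
import hmac

def chunk_signatures(signing_key, seed_signature, chunks):
    """Illustrative sketch of AWS-style chunk signature chaining.
    The seed signature is derived from the request headers only;
    each chunk's signature then incorporates the previous chunk's
    signature, and a final zero-byte chunk terminates the upload."""
    prev_sig = seed_signature
    sigs = []
    for chunk in chunks + [b'']:  # final zero-byte chunk carries the last signature
        # Simplified string-to-sign: previous signature + hash of this chunk
        string_to_sign = prev_sig + hashlib.sha256(chunk).hexdigest()
        prev_sig = hmac.new(signing_key, string_to_sign.encode(),
                            hashlib.sha256).hexdigest()
        sigs.append(prev_sig)
    return sigs
```

Because each signature depends only on the previous one plus the current chunk's hash, the client never has to read the whole payload before starting to send.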

The AWS S3 DeleteObjects API enables the deletion of multiple objects from a bucket using a single HTTP request. If you know the object keys that you wish to delete, the DeleteObjects API provides an efficient alternative to sending individual delete requests, reducing per-request overhead.

For example, the following python code can be used to delete the three objects file1, file2, and file3 from bkt01 in a single operation:

import boto3

# Set host IP, user access ID, and secret key
HOST='192.168.198.10'  # Your SmartConnect name or cluster IP goes here
USERNAME='1_s3test_accid'  # Your access ID
USERKEY='WttVbuRv60AXHiVzcYn3b8yZBtKc'   # Your secret key
URL = 'http://{}:9020'.format(HOST)

session = boto3.Session()
s3client = session.client(
    service_name='s3',
    aws_access_key_id=USERNAME,
    aws_secret_access_key=USERKEY,
    endpoint_url=URL,
    use_ssl=False,
    verify=False)

bkt_name='bkt01'
response = s3client.delete_objects(
    Bucket=bkt_name,
    Delete={
        'Objects': [
            {'Key': 'file1'},
            {'Key': 'file2'},
            {'Key': 'file3'}
        ]
    }
)
print(response)

Note that Boto3, the AWS SDK for Python, is used in the code above. Boto3 can be installed on a Linux client via pip (i.e. # pip install boto3).

Another S3 feature that’s added in OneFS 9.3 is non-slash delimiter support. The AWS S3 data model is a flat structure with no physical hierarchy of directories or folders: A bucket is created, under which objects are stored. However, AWS S3 does make provision for a logical hierarchy using object key name prefixes and delimiters to support a rudimentary concept of folders, as described in Amazon S3 Delimiter and Prefix. In prior OneFS releases, only a slash (‘/’) was supported as a delimiter. However, the new OneFS 9.3 release now expands support to include non-slash delimiters for listing objects in buckets. Also, the new delimiter can comprise multiple characters.

To illustrate this, take the keys “a/b/c”, “a/bc/e”, and “abc”:

  • If the delimiter is “b” with no prefix, “a/b” and “ab” are returned as the common prefix.
  • With delimiter “b” and prefix “a/b”, “a/b/c” and “a/bc/e” will be returned.

The delimiter can also have either ‘no slash’ or ‘slash’ at the end. For example, “abc”, “/”, “xyz/” are all supported. However, “a/b”, “/abc”, “//” are invalid.
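The grouping rules above can be emulated with a short Python sketch. This is a simplified model of S3 ListObjects delimiter handling for illustration, not the OneFS implementation:

```python
def list_with_delimiter(keys, prefix='', delimiter=''):
    """Simplified model of S3 ListObjects delimiter grouping: keys
    containing the delimiter after the prefix are rolled up into
    common prefixes; the remaining keys are returned as objects."""
    contents, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        idx = rest.find(delimiter) if delimiter else -1
        if idx >= 0:
            # Roll up everything through the first delimiter occurrence
            common.add(key[:len(prefix) + idx + len(delimiter)])
        else:
            contents.append(key)
    return contents, sorted(common)
```

Running this against the example keys reproduces the groupings described above: delimiter “b” with no prefix yields common prefixes “a/b” and “ab”, while delimiter “b” with prefix “a/b” returns “a/b/c” and “a/bc/e” as keys.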

In the following example, three objects (file1, file2, and file3) are uploaded from a Linux client to a cluster via the OneFS S3 protocol with object keys, and stored under the following topology:

# tree bkt1

bkt1

├── dir1
│   ├── file2
│   └── sub-dir1
│       └── file3
└── file1

2 directories, 3 files

These objects can be listed using ‘sub’ as the delimiter value by running the following python code:

import boto3

# Set host IP, user access ID, and secret key
HOST='192.168.198.10'  # Your SmartConnect name or cluster IP goes here
USERNAME='1_s3test_accid'  # Your access ID
USERKEY='WttVbuRv60AXHiVzcYn3b8yZBtKc'   # Your secret key
URL = 'http://{}:9020'.format(HOST)

session = boto3.Session()
s3client = session.client(
    service_name='s3',
    aws_access_key_id=USERNAME,
    aws_secret_access_key=USERKEY,
    endpoint_url=URL,
    use_ssl=False,
    verify=False)

bkt_name='bkt1'
response = s3client.list_objects(
    Bucket=bkt_name,
    Delimiter='sub'
)
print(response)

The keys ‘file1’ and ‘dir1/file2’ are returned in the ‘Contents’ list, and ‘dir1/sub’ is returned as a common prefix:

{'ResponseMetadata': {'RequestId': '564950507', 'HostId': '', 'HTTPStatusCode': 200, 'HTTPHeaders': {'connection': 'keep-alive', 'x-amz-request-id': '564950507', 'content-length': '796'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Marker': '', 'Contents': [{'Key': 'dir1/file2', 'LastModified': datetime.datetime(2021, 11, 24, 16, 15, 6, tzinfo=tzutc()), 'ETag': '"d41d8cd98f00b204e9800998ecf8427e"', 'Size': 0, 'StorageClass': 'STANDARD', 'Owner': {'DisplayName': 's3test', 'ID': 's3test'}}, {'Key': 'file1', 'LastModified': datetime.datetime(2021, 11, 24, 16, 10, 43, tzinfo=tzutc()), 'ETag': '"d41d8cd98f00b204e9800998ecf8427e"', 'Size': 0, 'StorageClass': 'STANDARD', 'Owner': {'DisplayName': 's3test', 'ID': 's3test'}}], 'Name': 'bkt1', 'Prefix': '', 'Delimiter': 'sub', 'MaxKeys': 1000, 'CommonPrefixes': [{'Prefix': 'dir1/sub'}]}

OneFS 9.3 also delivers significant improvements to the S3 multi-part upload functionality. In prior OneFS versions, each constituent piece of an upload was written to a separate file, and all the parts were concatenated on a completion request. As such, the concatenation process could take a significant duration for large files.

With the new OneFS 9.3 release, multi-part upload instead writes data directly into a single file, so completion is near-instant. The multiple parts are consecutively numbered, and all have the same size except for the final one. Since no re-upload or concatenation is required, the process incurs lower overhead as well as being significantly quicker.
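The reason equal-sized parts make direct single-file writes possible is simple offset arithmetic: part n of size s lands at byte offset (n-1) * s, so each part can be written into place as it arrives, in any order. The sketch below is an illustration of that idea, not the OneFS implementation:

```python
def part_offset(part_number, part_size):
    """Parts are numbered from 1; with equal-sized parts, a part's
    byte offset in the final file follows from its number alone."""
    return (part_number - 1) * part_size

def assemble(parts, part_size):
    """Write (part_number, data) pairs directly into a single buffer,
    in any arrival order; only the final part may be short."""
    total = max(part_offset(n, part_size) + len(data) for n, data in parts)
    buf = bytearray(total)
    for n, data in parts:
        off = part_offset(n, part_size)
        buf[off:off + len(data)] = data
    return bytes(buf)
```

Because the destination offset of every part is known up front, no concatenation pass is needed on completion.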

OneFS 9.3 also includes improved handling of inter-level directories. For example, if ‘a/b’ is put on a cluster via S3, the directory ‘a’ is created implicitly. In previous releases, if ‘b’ was then deleted, the directory ‘a’ remained and was treated as an object. However, with OneFS 9.3, the directory is still created and left, but is now identified as an inter-level directory. As such, it is not shown as an object via either ‘Get Bucket’ or ‘Get Object’. With 9.3, an S3 client can now remove a bucket if it only has inter-level directories. In prior releases, this would have failed with a ‘bucket not empty’ error. However, the multi-protocol behavior is unchanged, so a directory created via another OneFS protocol, such as NFS, is still treated as an object. Similarly, if an inter-level directory was created on a cluster prior to a OneFS 9.3 upgrade, that directory will continue to be treated as an object.

OneFS Virtual Hot Spare

There have been several recent questions from the field around how a cluster manages space reservation and pre-allocation of capacity for data repair and drive rebuilds.

OneFS provides a mechanism called Virtual Hot Spare (VHS), which helps ensure that node pools maintain enough free space to successfully re-protect data in the event of drive failure.

Although globally configured, Virtual Hot Spare actually operates at the node pool level so that nodes with different size drives reserve the appropriate VHS space. This helps ensure that, while data may move from one disk pool to another during repair, it remains on the same class of storage. VHS reservations are cluster wide and configurable as either a percentage of total storage (0-20%) or as a number of virtual drives (1-4). To achieve this, the reservation mechanism allocates a fraction of the node pool’s VHS space in each of its constituent disk pools.

No space is reserved for VHS on SSDs unless the entire node pool consists of SSDs. This means that a failed SSD may have data moved to HDDs during repair, but without adding additional configuration settings. This avoids reserving an unreasonable percentage of the SSD space in a node pool.

The default for new clusters is for Virtual Hot Spare to have both “subtract the space reserved for the virtual hot spare…” and “deny new data writes…” enabled with one virtual drive. On upgrade, existing settings are maintained.

It is strongly encouraged to keep Virtual Hot Spare enabled on a cluster, and a best practice is to configure 10% of total storage for VHS. If VHS is disabled and you upgrade OneFS, VHS will remain disabled. If VHS is disabled on your cluster, first check to ensure the cluster has sufficient free space to safely enable VHS, and then enable it.

VHS can be configured via the OneFS WebUI or CLI, and is always available, regardless of whether SmartPools has been licensed on a cluster.

From the CLI, the cluster’s VHS configuration is part of the storage pool settings, and can be viewed with the following syntax:

# isi storagepool settings view

     Automatically Manage Protection: files_at_default

Automatically Manage Io Optimization: files_at_default

Protect Directories One Level Higher: Yes

       Global Namespace Acceleration: disabled

       Virtual Hot Spare Deny Writes: Yes

        Virtual Hot Spare Hide Spare: Yes

      Virtual Hot Spare Limit Drives: 1

     Virtual Hot Spare Limit Percent: 10

             Global Spillover Target: anywhere

                   Spillover Enabled: Yes

        SSD L3 Cache Default Enabled: Yes

                     SSD Qab Mirrors: one

            SSD System Btree Mirrors: one

            SSD System Delta Mirrors: one

Similarly, the following command will set the cluster’s VHS space reservation to 10%.

# isi storagepool settings modify --virtual-hot-spare-limit-percent 10

Bear in mind that reservations for virtual hot sparing will affect spillover. For example, if VHS is configured to reserve 10% of a pool’s capacity, spillover will occur at 90% full.
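The relationship between the VHS reserve and the spillover point can be expressed as simple arithmetic. The sketch below is illustrative only; actual OneFS accounting operates per disk pool, and the assumption that the larger of the percentage and virtual-drive reservations applies is ours, not stated in the product documentation:

```python
def vhs_reserve_tb(total_tb, drive_tb, limit_percent=0, limit_drives=0):
    """VHS reserve expressed either as a percentage of total storage
    (0-20%) or as a number of virtual drives (1-4). Assumption for
    illustration: the larger of the two reservations applies."""
    by_percent = total_tb * limit_percent / 100.0
    by_drives = limit_drives * drive_tb
    return max(by_percent, by_drives)

def spillover_point_tb(total_tb, reserve_tb):
    """Writes spill over to another pool once usage reaches
    total capacity minus the VHS reserve."""
    return total_tb - reserve_tb
```

For example, with a 10% VHS reserve on a 100TB pool, spillover occurs at the 90TB mark, matching the 90% full figure above.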

Spillover allows data that is being sent to a full pool to be diverted to an alternate pool. Spillover is enabled by default on clusters that have more than one pool. If you have a SmartPools license on the cluster, you can disable Spillover. However, it is recommended that you keep Spillover enabled. If a pool is full and Spillover is disabled, you might get a “no space available” error but still have a large amount of space left on the cluster.

If the cluster is inadvertently configured to allow data writes to the reserved VHS space, an informational warning will be displayed in the SmartPools WebUI.

There is also no requirement for reserved space for snapshots in OneFS. Snapshots can use as much or as little of the available file system space as needed.

A snapshot reserve can be configured if preferred, although this is an accounting reservation rather than a hard limit, and is not a recommended best practice. If desired, the snapshot reserve can be set via the OneFS command line interface (CLI) using the ‘isi snapshot settings modify --reserve’ command.

For example, the following command will set the snapshot reserve to 10%:

# isi snapshot settings modify --reserve 10

It’s worth noting that the snapshot reserve does not constrain the amount of space that snapshots can use on the cluster: snapshots can consume a greater percentage of storage capacity than is specified by the snapshot reserve.

Additionally, when using SmartPools, snapshots can be stored on a different node pool or tier than the one the original data resides on.

For example, snapshots taken on a performance-aligned tier can be physically housed on a more cost-effective archive tier.

OneFS NFSv4.1 Trunking

As part of the new OneFS 9.3 release’s support for NFSv4.1 and NFSv4.2, the NFS session model is now incorporated into the OneFS NFS stack, allowing clients to leverage trunking and its associated performance benefits. Similar to multi-pathing in the SMB3 world, NFS trunking enables the use of multiple connections between a client and the cluster in order to dramatically widen the I/O path.

OneFS 9.3 supports both session and client ID trunking:

  • Client ID trunking is the association of multiple sessions per client.

  • Session trunking involves multiple connections per mount.

A connection, which represents a socket, exists within an object called a channel, and there can be many sessions associated with a channel. The fore channel carries client-to-cluster communication, and the back channel cluster-to-client.

Each channel has a set of configuration values that affect a session’s connections. With a few exceptions, the cluster must respect client-negotiated values. Typically, the configuration value meanings are the same for both the fore and back channels, although the defaults are typically significantly different for each.
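One way to picture this negotiation is as the server clamping each client-proposed channel attribute to its own limit. This is a simplified model for illustration (the actual CREATE_SESSION negotiation rules are defined in RFC 5661); the hypothetical server limits below echo the fore channel attributes shown in the isi_nfs4mgmt dump later in this article:

```python
# Hypothetical server limits, for illustration only; they mirror the
# fore channel attributes seen in the isi_nfs4mgmt output below.
SERVER_FORE_LIMITS = {
    'max operations': 8,
    'max requests': 64,
    'max request size': 1048576,
    'max response size': 1048576,
}

def negotiate(client_proposed, server_limits):
    """Return effective channel attributes: each client proposal is
    clamped to the server's limit; unproposed values take the limit."""
    return {attr: min(client_proposed.get(attr, limit), limit)
            for attr, limit in server_limits.items()}
```

In this model a client asking for more operations per compound than the server permits simply gets the server's maximum back.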

Also, be aware that there can be only one client per session, but multiple sessions per client. Here’s what combined session and client ID trunking looks like:

Most Linux flavors support session trunking via the ‘nconnect’ option to the ‘mount’ command, which is available in kernel version 5.3 and later. However, support for client ID trunking is fairly nascent across the current Linux distributions. As such, we’ll focus on session trunking for the remainder of this article.

So let’s walk through a simple example of configuring NFS v4.1 and session trunking in OneFS 9.3.

The first step is to enable the NFS service, if it’s not already running, and select the desired protocol versions. This can be done from the CLI via the following command syntax:

# isi services nfs enable
# isi nfs settings global modify --nfsv41-enabled=true --nfsv42-enabled=true

Next, create an NFS export:

# isi nfs exports create --paths=/ifs/data

When using NFSv4.x, the domain name should be uniform across both the cluster and client(s). The NFSv4.x domain is presented as user@domain or group@domain pairs in ‘getattr’ and ‘setattr’ operations, for example. If the domain does not match, new and existing files will appear as owned by user ‘nobody’ on the cluster.

The cluster’s NFSv4.x domain can be configured via the CLI using the ‘isi nfs settings zone modify’ command as follows:

# isi nfs settings zone modify --nfsv4-domain=nfs41test --zone=System

Once the cluster is configured, the next step is to prepare the NFSv4.1 client(s). As mentioned previously, Linux clients running the 5.3 kernel or later can use the nconnect mount option to configure session trunking.

Note that the current maximum limit of client-server connections opened by nconnect is 16. If unspecified, this value defaults to 1.

The following example uses an Ubuntu 21.04 client with the Linux 5.11 kernel. The Linux client will need the ‘nfs-common’ package installed in order to obtain the necessary nconnect binaries and libraries. If not already present, this can be installed as follows:

# sudo apt-get install nfs-common nfs-kernel-server

Next, edit the client’s /etc/idmapd.conf and add the appropriate NFSv4.x domain:

# cat /etc/idmapd.conf

[General]

Verbosity = 0

Pipefs-Directory = /run/rpc_pipefs

# set your own domain here, if it differs from FQDN minus hostname

Domain = nfs41test

[Mapping]

Nobody-User = nobody

Nobody-Group = nogroup

NFSv4.x clients use the nfsidmap daemon for the NFSv4.x ID <-> name mapping translation, and the following CLI commands will restart the nfs-idmapd daemon and confirm that it’s happily running:

# systemctl restart nfs-idmapd
# systemctl status nfs-idmapd

 nfs-idmapd.service - NFSv4 ID-name mapping service

     Loaded: loaded (/lib/systemd/system/nfs-idmapd.service; static)

     Active: active (running) since Thu 2021-11-18 19:47:01 PDT; 6s ago

    Process: 2611 ExecStart=/usr/sbin/rpc.idmapd $RPCIDMAPDARGS (code=exited, status=0/SUCCESS)

   Main PID: 2612 (rpc.idmapd)

      Tasks: 1 (limit: 4595)

     Memory: 316.0K

     CGroup: /system.slice/nfs-idmapd.service

             └─2612 /usr/sbin/rpc.idmapd

Nov 18 19:47:01 ubuntu systemd[1]: Starting NFSv4 ID-name mapping service...

Nov 18 19:47:01 ubuntu systemd[1]: Started NFSv4 ID-name mapping service.

The domain value can also be verified by running the nfsidmap command as follows:

# sudo nfsidmap -d

nfs41test

Next, mount the cluster’s NFS export via NFSv4.1, v4.2, and trunking, as desired. For example, the following syntax will establish an NFSv4.1 mount using 4 trunked sessions, specified via the nconnect argument:

# sudo mount -t nfs -vo nfsvers=4.1,nconnect=4 10.1.128.10:/ifs/data/ /mnt/nfs41

This can be verified on the client side by running netstat and grepping for port 2049; the output in this case confirms the four TCP connections established for the above mount, as expected:

# netstat -ant4 | grep 2049

tcp        0      0 0.0.0.0:2049            0.0.0.0:*               LISTEN    

tcp        0      0 10.1.128.131:857     10.1.128.10:2049     ESTABLISHED

tcp        0      0 10.1.128.131:681     10.1.128.10:2049     ESTABLISHED

tcp        0      0 10.1.128.131:738     10.1.128.10:2049     ESTABLISHED

tcp        0      0 10.1.128.131:959     10.1.128.10:2049     ESTABLISHED

Similarly, from the cluster side, the NFS connections can be checked with the OneFS ‘isi_nfs4mgmt’ CLI command. The command output includes the client ID, NFS version, session ID, etc.

# isi_nfs4mgmt --list

ID                 Vers  Conn  SessionId  Client Address  Port  O-Owners  Opens Handles L-Owners

456576977838751506  4.1   n/a   4          10.1.128.131    959   0         0     0       0

The OneFS isi_nfs4mgmt CLI command also includes a ‘--dump’ flag which, when used with the client ID as the argument, displays the details of a client mount, such as the TCP port, NFSv4.1 channel attributes, auth type, etc.

# isi_nfs4mgmt --dump=456576977838751506

Dump of client 456576977838751506

  Open Owners (0):

Session ID: 4

Forward Channel

Connections:

             Remote: 10.1.128.131.959    Local: 10.1.128.10.2049

             Remote: 10.1.128.131.738    Local: 10.1.128.10.2049

             Remote: 10.1.128.131.857    Local: 10.1.128.10.2049

             Remote: 10.1.128.131.681    Local: 10.1.128.10.2049

Attributes:

             header pad size                  0

             max operations                   8

             max request size           1048576

             max requests                    64

             max response size          1048576

             max response size cached      7584


Slots Used/Available: 1/63

         Cache Contents:

             0)  SEQUENCE


Back Channel

Connections:

             Remote: 10.1.128.131.959    Local: 10.1.128.10.2049

Attributes:

             header pad size                  0

             max operations                   2

             max request size              4096

             max requests                    16

             max response size             4096

             max response size cached         0

Security Attributes:

         AUTH_SYS:

             gid                              0

             uid                              0


Summary of Client 456576977838751506:

  Long Name (hex): 0x4c696e7578204e465376342e31207562756e74752e312f3139322e3136382e3139382e313000

  Long Name (ascii): Linux.NFSv4.1.ubuntu.1/10.1.128.10.

  State: Confirmed

  Open Owners: 0

  Opens: 0

  Open Handles: 0

  Lock Owners: 0

  Sessions: 1

Full JSON dump can be found at /var/isi_nfs4mgmt/nfs_clients.dump_2021-11-18T15:25:18

Be aware that session trunking is not permitted across access zones, since a session represents a single authentication context and access zones can have different auth levels. Similarly, session trunking is disallowed across dynamic IP addresses.

OneFS NFSv4.1 and v4.2 Support

The NFSv4.1 spec introduced several new features and functions to the NFSv4 protocol standard, as defined in RFC 5661 and covered in Section 1.8 of the RFC. Certain features are listed as ‘required’, meaning they must be implemented by an NFS server to claim RFC standard compliance. Other features are denoted as ‘recommended’ or ‘optional’ and may be supported at the server’s discretion; they are not required to claim RFC compliance.

OneFS 9.3 introduces support for both NFSv4.1 and NFSv4.2 by implementing all of the ‘required’ features defined in RFC 5661, with the exception of the Secret State Verifier (SSV). SSV is currently not supported by any open source Linux distribution, and most server implementations do not support it either.

The following chart illustrates the supported NFS operations in the new OneFS 9.3 release:

Both NFSv4.1 and v4.2 use the existing OneFS NFSv4.0 I/O stack, and NFSv4.2 is a superset of NFSv4.1, with all of the new features being optional.

Note that NFSv4.2 is a true minor version and does not make any changes to handshake, mount, or caching mechanisms. Therefore an unfeatured NFSv4.2 mount is functionally equivalent to an NFSv4.1 mount. As such, OneFS enables clients to mount exports and access data via NFSv4.2, even though the 4.2 operations have yet to be implemented.

Architecturally, the new NFSv4.1 features center around a new handshake mechanism and cache state, which is created around connections and connection management.

NFSv4.1 formalizes the notion of a replay cache, which is one-to-one with a channel. This replay cache, or duplicate request cache, tracks recent transactions and resends the cached response rather than performing the operation again. As such, performance also benefits from the avoidance of unnecessary work.
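The replay cache behavior can be sketched as follows. This is a simplified model keyed by session and slot, for illustration only, not the OneFS implementation:

```python
class ReplayCache:
    """Simplified NFSv4.1-style duplicate request cache: each
    (session, slot) pair remembers the last sequence ID and its
    reply. A retransmission carrying the same sequence ID receives
    the cached reply instead of re-executing the operation."""
    def __init__(self):
        self.slots = {}

    def handle(self, session_id, slot_id, seq_id, execute):
        key = (session_id, slot_id)
        cached = self.slots.get(key)
        if cached is not None and cached[0] == seq_id:
            return cached[1]          # replay: return cached response
        reply = execute()             # new request: perform the operation
        self.slots[key] = (seq_id, reply)
        return reply
```

A retransmitted request (for example, after a dropped reply) is thus answered from the cache, so non-idempotent operations are not repeated.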

Existing NFSv4.0 I/O routines are used alongside new NFSv4.1 handshake and state management routines such as EXCHANGE_ID, CREATE_SESSION, and DESTROY_SESSION, while some of the older handshake mechanisms, such as SETCLIENTID and SETCLIENTID_CONFIRM, are deprecated.

In NFSv4.1, explicit client disconnect allows a client to notify the server that it wishes to disconnect, and to have the server destroy all of its state. By contrast, in 4.0, client disconnect is implied and relies on timeouts.

While the idea of a lock reclamation grace period was implied in NFSv4.0, the NFSv4.1 and 4.2 RFC explicitly defines lock failover. So if a client attaches to a server that it does not recognize or have a prior connection to, it will automatically attempt to reclaim locks using the LKF protocol lock grace period mechanism.

Connection tracking is also implemented in NFSv4.1, allowing a server to keep track of the connections under each session channel, which is required for trunking.

Performance-wise, NFSv4.0 and NFSv4.1 are very similar across a single TCP connection. However, with NFSv4.1, Linux clients can now utilize trunking to enjoy the performance advantages of multiplexing. We’ll be taking a closer look at session and client ID trunking in the next blog article in this series.

The NFS service is disabled by default in OneFS, but can easily be started and configured from either the CLI or WebUI. Linux clients automatically mount using the highest protocol version available, so NFSv4.1 and NFSv4.2 are disabled by default on install or upgrade to OneFS 9.3 to avoid impacting existing environments. If particular NFS version(s) are desired, they should be specified in the mount syntax.

The NFSv4.1 or v4.2 protocol versions can be easily enabled from the OneFS CLI, for example:

# isi services nfs enable
# isi nfs settings global modify --nfsv41-enabled=true --nfsv42-enabled=true

Or from the WebUI, by navigating to Protocols > NFS > Global Settings and checking both the service enablement box and the desired protocol versions:

Next, create an NFS export via either the WebUI or the CLI:

# isi nfs exports create --paths=/ifs/data

When using NFSv4.x, the domain name should be uniform across both the cluster and client(s). The NFSv4.x domain is presented as user@domain or group@domain pairs in ‘getattr’ and ‘setattr’ operations, for example. If the domain does not match, new and existing files will appear as owned by user ‘nobody’ on the cluster. The cluster’s NFSv4.x domain can be configured via the CLI using the ‘isi nfs settings zone modify’ command as follows:

# isi nfs settings zone modify --nfsv4-domain=nfs41test --zone=System

Or from the WebUI by navigating to Protocols > NFS > Zone settings.

On the Linux client side, the NFSv4 domain can be configured by editing the /etc/idmapd.conf file:

# cat /etc/idmapd.conf

[General]

Verbosity = 0

Pipefs-Directory = /run/rpc_pipefs

# set your own domain here, if it differs from FQDN minus hostname

Domain = nfs41test

[Mapping]

Nobody-User = nobody

Nobody-Group = nogroup

NFSv4.x clients use the nfsidmap daemon for the NFSv4.x ID <-> name mapping translation, so ensure the daemon is running correctly after configuring the NFSv4.x domain. The following CLI commands will restart the nfs-idmapd daemon and confirm that it’s happily running:

# systemctl restart nfs-idmapd
# systemctl status nfs-idmapd

 nfs-idmapd.service - NFSv4 ID-name mapping service

     Loaded: loaded (/lib/systemd/system/nfs-idmapd.service; static)

     Active: active (running) since Thu 2021-11-18 19:47:01 PDT; 6s ago

    Process: 2611 ExecStart=/usr/sbin/rpc.idmapd $RPCIDMAPDARGS (code=exited, status=0/SUCCESS)

   Main PID: 2612 (rpc.idmapd)

      Tasks: 1 (limit: 4595)

     Memory: 316.0K

     CGroup: /system.slice/nfs-idmapd.service

             └─2612 /usr/sbin/rpc.idmapd


Nov 18 19:47:01 ubuntu systemd[1]: Starting NFSv4 ID-name mapping service...

Nov 18 19:47:01 ubuntu systemd[1]: Started NFSv4 ID-name mapping service.

The domain value can also be checked by running the nfsidmap command as follows:

# sudo nfsidmap -d

nfs41test

Next, mount the NFS export via NFSv4.1 or NFSv4.2, or both versions, as desired:

# sudo mount -t nfs -vo nfsvers=4.1 10.1.128.10:/ifs/data /mnt/nfs41/

Netstat can be used as follows to verify the established NFS TCP connection and its associated port.

# netstat -ant4 | grep 2049

tcp        0      0 0.0.0.0:2049            0.0.0.0:*               LISTEN    

tcp        0      0 10.1.128.131:996     10.1.128.10:2049     ESTABLISHED

From the cluster’s CLI, the NFS connections can be checked with ‘isi_nfs4mgmt’.  The isi_nfs4mgmt CLI tool has been enhanced in OneFS 9.3, and new functionality includes:

  • Expanded reporting, including sessions, channels, and connections
  • A summary reporting the NFS version of each client connection
  • The ability for a cluster admin to open or lock a session
  • Viewing of cache state without creating a coredump

When used with the ‘list’ flag, the ‘isi_nfs4mgmt’ command output includes the client ID, NFS version, session ID, etc.

# isi_nfs4mgmt --list

ID                 Vers  Conn  SessionId  Client Address  Port  O-Owners  Opens Handles L-Owners

605157838779675654  4.1   n/a   2          10.1.128.131    959   0         0     0       0

A full JSON dump can be found at /var/isi_nfs4mgmt/nfs_clients.dump_2021-11-18T15:25:18

In summary, OneFS 9.3 adds support for both NFSv4.1 and v4.2, implements new functionality, lays the groundwork for additional future functionality, and delivers NFS trunking, which we’ll explore in the next article.

OneFS Secure Boot

Secure Boot is an industry security standard introduced in UEFI (Unified Extensible Firmware Interface) version 2.3.1, which ensures that only authorized EFI binaries are loaded by firmware during the boot process. As such, it helps secure a system from malicious early boot code “rootkit” or “bootkit” vulnerabilities, providing a trusted execution environment for the OS and user applications. This is of increasing importance to security-conscious users in these unpredictable times.

In OneFS 9.3, the familiar boot path components remain in place, but are enhanced via the addition of new code and libraries to deliver Secure Boot functionality. Specifically, the BIOS is updated, and the bootloader and kernel are modified for verifying signatures. Secure Boot only runs at boot time, and uses public key cryptography to verify the signatures of signed code and establish a chain of trust to the platform, loading only trusted/validated code.

Introduction of Secure Boot necessitates that all OneFS releases from 9.3 onwards now be signed. If Secure Boot is disabled or unsupported in BIOS, no signature verification is performed, and the feature is considered dormant. OneFS 9.3 Secure Boot goes beyond the standard UEFI framework to include OneFS kernel and modules. As such:

UEFI Secure Boot + OS Secure Boot = Secure Boot

The UEFI infrastructure performs EFI signature validation and binary loading within UEFI Secure Boot, and the BSD ‘veriexec’ function is used to perform signature verification in both the loader.efi and kernel.

Public key cryptography is used to validate the signature of signed code before it can be loaded and executed. Under the hood, Dell’s Code Signing Service (CSS) is used to sign all EFI binaries (platform FW), as well as the OneFS bootloader, kernel, and kernel modules.
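The chain-of-trust concept can be illustrated with a toy Python sketch. This is not OneFS code, and for brevity it stands in bare SHA-256 digests for the real public-key signatures; the point is simply that each boot stage vouches for the next, so tampering anywhere breaks the chain:

```python
import hashlib

# Toy chain-of-trust sketch (NOT OneFS code): bare SHA-256 digests stand in
# for the real public-key signatures used by Secure Boot.

def digest(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

kernel = b"kernel image"                            # hypothetical payloads
loader = b"loader.efi|" + digest(kernel).encode()   # loader vouches for kernel
firmware_db = {digest(loader)}                      # pre-enrolled in the BIOS

def boot(fw_db, loader_blob, kernel_blob):
    # Firmware verifies the loader; the loader then verifies the kernel.
    if digest(loader_blob) not in fw_db:
        return False
    return loader_blob.endswith(digest(kernel_blob).encode())

print(boot(firmware_db, loader, kernel))       # True: chain intact
print(boot(firmware_db, loader, b"bootkit"))   # False: kernel tampered with
```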

With OneFS 9.3 and later, automated code signing is now an integral part of the OneFS build pipeline infrastructure, which includes an interface to the Dell CSS for signing keys access. Keys and signature databases are pre-enrolled in the BIOS image and there is intentionally no interface between OneFS and key management to eliminate security risks.

The OneFS Security Configuration Guide includes a recommendation, and instructions, for configuring a BIOS GUI admin password.

In OneFS 9.3, the Secure Boot feature requires the following prerequisites:

  • Isilon A2000 node platform
  • OneFS 9.3.0.0 or greater
  • Node Firmware Package (NFP) version 11.3 or greater

Be aware that a cluster must be upgrade-committed to OneFS 9.3 prior to upgrading the NFP to v11.3.

A PowerScale cluster will happily contain a mix of Secure Boot-enabled and disabled nodes, and no additional license is required in order to activate the feature. Indeed, Secure Boot can be enabled or disabled on an A2000 node at any point without any future implications. Additional PowerScale hardware platforms will also be added to the Secure Boot support matrix in future releases.

Since Secure Boot is node-local, it does necessitate individual configuration of each applicable node in a cluster. The configuration process also requires a reboot, so a suitable maintenance window will need to be planned for enabling or disabling Secure Boot. As a security best practice, it is strongly recommended to configure a BIOS admin password in order to restrict node access. Since Secure Boot is only executed at boot time, it has no impact on cluster performance.

The Secure Boot feature can be easily enabled on an A2000 node as follows:

  1. First, ensure that the cluster is running OneFS 9.3 and Node Firmware Package 11.3 or greater.
  2. Next, run the following CLI commands to enable the PowerScale Secure Boot feature:
# ipmitool raw 0x30 0x12 0x08 0x13 0x01 0x53 0x55 0x42 0x54

# ipmitool raw 0x30 0x11 0x04 0x00 0x08 0x13 0x01

  08 13 01 53 55 42 54

The output ‘08 13 01 53 55 42 54’ indicates successful command execution.
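Incidentally, the four trailing payload bytes (0x53 0x55 0x42 0x54) are printable ASCII, which a quick Python one-liner confirms:

```python
# The trailing payload bytes 0x53 0x55 0x42 0x54 decode to ASCII "SUBT".
print(bytes.fromhex("53554254").decode("ascii"))   # SUBT
```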

# ipmitool raw 0x30 0x12 0x0C 0x13 0x01 0x01

# ipmitool raw 0x30 0x11 0x01 0x00 0x0C 0x13 0x01

  0c 13 01 01

Similarly, the output ‘0c 13 01 01’ indicates successful command execution.

  3. Finally, reboot the node to apply the PowerScale Secure Boot feature.

The following sysctl CLI command can be run to verify that secure boot is enabled on the node:

# sysctl security.mac.veriexec.state

security.mac.veriexec.state: loaded active enforce locked

Since Secure Boot configuration is node-local, this procedure will need to be performed on each A2000 node in the cluster.

Be aware that once the PowerScale Secure Boot feature is enabled on a node, it cannot be reimaged via PXE. However, reimaging from a USB drive is supported. If a node does require PXE reimaging, first disable Secure Boot, reimage, and then re-enable Secure Boot when completed.

Conversely, disabling the PowerScale Secure Boot feature can only be performed from the BIOS interface, which involves interrupting a node’s boot sequence. This ensures that Secure Boot can only be disabled if you have both physical and administrator access to the node. Again, disabling the feature also necessitates performing the process on each Secure Boot enabled node. The following procedure will disable the PowerScale Secure Boot feature on an A2000 node:

1. During the early stages of the A2000’s boot sequence, press ‘F2’ or ‘DEL’ to enter the BIOS setup menu, navigate to the ‘Security’ tab, and select the “Secure Boot” option.

2. Next, set the “Secure Boot” entry from ‘Enabled’ to ‘Disabled’ to deactivate the PowerScale Secure Boot feature.

3. Finally, press ‘ESC’ to return to the main menu, navigate to the ‘Save & Exit’ tab, and select the ‘Save Changes and Exit’ option.

Once Secure Boot is disabled, the A2000 node will continue to boot after exiting the BIOS.

In future OneFS releases, Secure Boot will be expanded to encompass additional PowerScale platforms.

 

OneFS Data Inlining – Performance and Monitoring

In the second of this series of articles on data inlining, we’ll shift the focus to monitoring and performance.

The storage efficiency potential of inode inlining can be significant for data sets comprising large numbers of small files, which would have required a separate inode and data blocks for housing these files prior to OneFS 9.3.

Latency-wise, write performance for inlined files is typically comparable to, or slightly better than, that of regular files, because OneFS does not have to allocate and protect extra blocks. The same is true for reads, since OneFS doesn’t have to search for and retrieve any blocks beyond the inode itself. This frees up space in the OneFS read caching layers, as well as on disk, in addition to requiring fewer CPU cycles.

The following diagram illustrates the levels of indirection a file access request takes to get to its data. Unlike a standard file, an inline file will skip the later stages of the path which involve the inode metatree redirection to the remote data blocks.

Access starts with the Superblock, which is located at multiple fixed block addresses on every drive in the cluster.  The Superblock contains the address locations of the LIN Master block, which contains the root of the LIN B+ Tree (LIN table).  The LIN B+Tree maps logical inode numbers to the actual inode addresses on disk, which, in the case of an inlined file, also contains the data.  This saves the overhead of finding the address locations of the file’s data blocks and retrieving data from them.
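The difference between the two paths can be sketched in Python. This is purely illustrative pseudo-structure, not OneFS internals; the point is that an inlined file resolves entirely at the inode, while a regular file needs a further metatree redirection and data-block read:

```python
# Illustrative lookup sketch (not OneFS internals): superblock -> LIN tree ->
# inode, with an extra metatree/data-block hop only for non-inlined files.
LIN_TABLE = {
    101: {"inlined": True,  "data": b"tiny file contents"},
    102: {"inlined": False, "metatree": {0: "block@0x1f00"}},
}
DISK_BLOCKS = {"block@0x1f00": b"regular file contents"}

def read_file(lin):
    inode = LIN_TABLE[lin]          # LIN B+ tree resolves LIN -> inode
    if inode["inlined"]:
        return inode["data"]        # data lives in the inode: no further I/O
    addr = inode["metatree"][0]     # metatree redirection...
    return DISK_BLOCKS[addr]        # ...then a separate data-block read

print(read_file(101))   # b'tiny file contents'
print(read_file(102))   # b'regular file contents'
```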

For hybrid nodes with sufficient SSD capacity, using the metadata-write SSD strategy will automatically place inlined small files on flash media. However, since the SSDs on hybrid nodes default to 512 byte formatting, when using metadata read/write strategies, these SSD metadata pools will need to have the ‘--force-8k-inodes’ flag set in order for files to be inlined. This can be a useful performance configuration for small file HPC workloads, such as EDA, for data that is not residing on an all-flash tier. But keep in mind that forcing 8KB inodes on a hybrid pool’s SSDs will considerably reduce the available inode capacity compared with the default 512 byte inode configuration.

The OneFS ‘isi_drivenum’ CLI command can be used to verify the drive block sizes in a node. For example, below is the output for a PowerScale Gen6 H-series node, showing drive bay 1 containing an SSD with 4KB physical formatting and a 512 byte logical size, and bays A to E containing hard disks (HDDs) with both 4KB logical and physical formatting.

# isi_drivenum -bz

Bay 1  Physical Block Size: 4096     Logical Block Size:   512

Bay 2  Physical Block Size: N/A     Logical Block Size:   N/A

Bay A0 Physical Block Size: 4096     Logical Block Size:   4096

Bay A1 Physical Block Size: 4096     Logical Block Size:   4096

Bay A2 Physical Block Size: 4096     Logical Block Size:   4096

Bay B0 Physical Block Size: 4096     Logical Block Size:   4096

Bay B1 Physical Block Size: 4096     Logical Block Size:   4096

Bay B2 Physical Block Size: 4096     Logical Block Size:   4096

Bay C0 Physical Block Size: 4096     Logical Block Size:   4096

Bay C1 Physical Block Size: 4096     Logical Block Size:   4096

Bay C2 Physical Block Size: 4096     Logical Block Size:   4096

Bay D0 Physical Block Size: 4096     Logical Block Size:   4096

Bay D1 Physical Block Size: 4096     Logical Block Size:   4096

Bay D2 Physical Block Size: 4096     Logical Block Size:   4096

Bay E0 Physical Block Size: 4096     Logical Block Size:   4096

Bay E1 Physical Block Size: 4096     Logical Block Size:   4096

Bay E2 Physical Block Size: 4096     Logical Block Size:   4096

Note that the SSD disk pools used in PowerScale hybrid nodes that are configured for meta-read or meta-write SSD strategies use 512 byte inodes by default. This can significantly save space on these pools, as they often have limited capacity. However, it will prevent data inlining from occurring. By contrast, PowerScale all-flash nodepools are configured by default for 8KB inodes.

The OneFS ‘isi get’ CLI command provides a convenient method to verify which size inodes are in use in a given node pool. The command’s output includes both the inode mirror size and the inline status of a file.

When it comes to efficiency reporting, OneFS 9.3 provides three CLI tools for validating and reporting the presence and benefits of data inlining, namely:

  1. The ‘isi statistics data-reduction’ CLI command has been enhanced to report inlined data metrics, including both a capacity saved and an inlined data efficiency ratio:
# isi statistics data-reduction

                      Recent Writes Cluster Data Reduction

                           (5 mins)

--------------------- ------------- ----------------------

Logical data                 90.16G                 18.05T

Zero-removal saved                0                      -

Deduplication saved           5.25G                624.51G

Compression saved             2.08G                303.46G

Inlined data saved            1.35G                  2.83T

Preprotected physical        82.83G                 14.32T

Protection overhead          13.92G                  2.13T

Protected physical           96.74G                 26.28T




Zero removal ratio         1.00 : 1                      -

Deduplication ratio        1.06 : 1               1.03 : 1

Compression ratio          1.03 : 1               1.02 : 1

Data reduction ratio       1.09 : 1               1.05 : 1

Inlined data ratio         1.02 : 1               1.20 : 1

Efficiency ratio           0.93 : 1               0.69 : 1

--------------------- ------------- ----------------------

Be aware that the effect of data inlining is not included in the data reduction ratio because it is not actually reducing the data in any way – just relocating it and protecting it more efficiently.  However, data inlining is included in the overall storage efficiency ratio.

The ‘inline data saved’ value represents the count of files which have been inlined multiplied by 8KB (inode size).  This value is required to make the compression ratio and data reduction ratio correct.
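This calculation is easy to cross-check against the ‘isi_cstats’ counters shown in the next section. For example:

```python
# 'Inlined data saved' = inlined file count x 8KB inode size.
inlined_files = 379948336              # "Total inlined files" from isi_cstats
savings_gib = inlined_files * 8192 / 2**30
print(round(savings_gib))              # 2899, matching "Total inlined data savings"
```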

  2. The ‘isi_cstats’ CLI command now includes the accounted number of inlined files within /ifs, which is displayed by default in its console output.
# isi_cstats

Total files                 : 397234451

Total inlined files         : 379948336

Total directories           : 32380092

Total logical data          : 18471 GB

Total shadowed data         : 624 GB

Total physical data         : 26890 GB

Total reduced data          : 14645 GB

Total protection data       : 2181 GB

Total inode data            : 9748 GB




Current logical data        : 18471 GB

Current shadowed data       : 624 GB

Current physical data       : 26878 GB

Snapshot logical data       : 0 B

Snapshot shadowed data      : 0 B

Snapshot physical data      : 32768 B


Total inlined data savings  : 2899 GB

Total inlined data ratio    : 1.1979 : 1

Total compression savings   : 303 GB

Total compression ratio     : 1.0173 : 1

Total deduplication savings : 624 GB

Total deduplication ratio   : 1.0350 : 1

Total containerized data    : 0 B

Total container efficiency  : 1.0000 : 1

Total data reduction ratio  : 1.0529 : 1

Total storage efficiency    : 0.6869 : 1


Raw counts

{ type=bsin files=3889 lsize=314023936 pblk=1596633 refs=81840315 data=18449 prot=25474 ibyte=23381504 fsize=8351563907072 iblocks=0 }

{ type=csin files=0 lsize=0 pblk=0 refs=0 data=0 prot=0 ibyte=0 fsize=0 iblocks=0 }

{ type=hdir files=32380091 lsize=0 pblk=35537884 refs=0 data=0 prot=0 ibyte=1020737587200 fsize=0 iblocks=0 }

{ type=hfile files=397230562 lsize=19832702476288 pblk=2209730024 refs=81801976 data=1919481750 prot=285828971 ibyte=9446188553728 fsize=17202141701528 iblocks=379948336 }

{ type=sdir files=1 lsize=0 pblk=0 refs=0 data=0 prot=0 ibyte=32768 fsize=0 iblocks=0 }

{ type=sfile files=0 lsize=0 pblk=0 refs=0 data=0 prot=0 ibyte=0 fsize=0 iblocks=0 }
  3. The ‘isi get’ CLI command can be used to determine whether a file has been inlined. The output reports a file’s logical ‘size’, but indicates that it consumes zero physical, data, and protection blocks. There is also an ‘inlined data’ attribute further down in the output that confirms the file is inlined.
# isi get -DD file1


* Size:              2

* Physical Blocks:  0

* Phys. Data Blocks: 0

* Protection Blocks: 0

* Logical Size:      8192


PROTECTION GROUPS

* Dynamic Attributes (6 bytes):

*

ATTRIBUTE           OFFSET SIZE

Policy Domains      0      6


INLINED DATA

0,0,0:8192[DIRTY]#1

*************************************************

So, in summary, some considerations and recommended practices for data inlining in OneFS 9.3 include the following:

  • Data inlining is opportunistic and is only supported on node pools with 8KB inodes.
  • No additional software, hardware, or licenses are required for data inlining.
  • There are no CLI or WebUI management controls for data inlining.
  • Data inlining is automatically enabled on applicable nodepools after an upgrade to OneFS 9.3 is committed.
  • However, data inlining will only occur for new writes and OneFS 9.3 will not perform any inlining during the upgrade process. Any applicable small files will instead be inlined upon their first write.
  • Since inode inlining is automatically enabled globally on clusters running OneFS 9.3, OneFS will recognize any diskpools with 512 byte inodes and transparently avoid inlining data on them.
  • In OneFS 9.3, data inlining will not be performed on regular files during tiering, truncation, upgrade, etc.
  • CloudPools Smartlink stubs, sparse files, and writable snapshot files are also not candidates for data inlining in OneFS 9.3.
  • OneFS shadow stores will not apply data inlining. As such:
    • Small file packing will be disabled for inlined data files.
    • Cloning will work as expected with inlined data files.
    • Deduplication will not be applied to inlined data files, and files that have already been deduplicated will not be subsequently inlined.
  • Certain operations may cause data inlining to be reversed, such as moving files from an 8KB diskpool to a 512 byte diskpool, forcefully allocating blocks on a file, sparse punching, etc.

OneFS Small File Data Inlining

OneFS 9.3 introduces a new filesystem storage efficiency feature which stores a small file’s data within the inode, rather than allocating additional storage space. The principal benefits of data inlining in OneFS include:

  • Reduced storage capacity utilization for small file datasets, generating an improved cost per TB ratio.
  • Dramatically improved SSD wear life.
  • Potential read and write performance improvements for small files.
  • Zero configuration, adaptive operation, and full transparency at the OneFS file system level.
  • Broad compatibility with other OneFS data services, including compression and deduplication.

Data inlining explicitly avoids allocation during write operations since small files do not require any data or protection blocks for their storage. Instead, the file content is stored directly in unused space within the file’s inode. This approach is also highly flash media friendly since it significantly reduces the quantity of writes to SSD drives.

OneFS inodes, or index nodes, are a special class of data structure that store file attributes and pointers to file data locations on disk.  They serve a similar purpose to traditional UNIX file system inodes, but also have some additional, unique properties. Each file system object, whether it be a file, directory, symbolic link, alternate data stream container, shadow store, etc, is represented by an inode.

Within OneFS, SSD node pools in F series all-flash nodes always use 8KB inodes. For hybrid and archive platforms, inodes are either 512 bytes or 8KB in size, determined by the physical and logical block sizes of the hard drives or SSDs in a node pool. There are three different styles of drive formatting used in OneFS nodes, depending on the manufacturer’s specifications:

Drive Formatting Characteristics
4Kn (native) •       A native 4Kn drive has both a physical and logical block size of 4096B.
512n (native) •       A drive that has both physical and logical size of 512 is a native 512B drive.
512e (emulated) •       A 512e (512 byte-emulated) drive has a physical block size of 4096, but a logical block size of 512B.
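These three formatting styles map directly onto the physical and logical block sizes reported by OneFS; as a simple illustration:

```python
# Classify a drive's formatting style from its block sizes, per the table above.
def drive_format(physical: int, logical: int) -> str:
    if (physical, logical) == (4096, 4096):
        return "4Kn (native)"
    if (physical, logical) == (512, 512):
        return "512n (native)"
    if (physical, logical) == (4096, 512):
        return "512e (emulated)"
    raise ValueError("unrecognized drive formatting")

print(drive_format(4096, 512))   # 512e (emulated)
```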

If the drives in a cluster’s nodepool are native 4Kn formatted, by default the inodes on this nodepool will be 8KB in size.  Alternatively, if the drives are 512e formatted, then inodes by default will be 512B in size. However, they can also be reconfigured to 8KB in size if the ‘force-8k-inodes’ setting is set to true.

A OneFS inode is composed of several sections. These include a static area, which is typically 134 bytes in size and contains fixed-width, commonly used attributes like POSIX mode bits, owner, and file size. Next, the regular inode contains a metatree cache, which is used to translate a file operation directly into the appropriate protection group. However, for inline inodes, the metatree is no longer required, so data is stored directly in this area instead. Following this is a preallocated dynamic inode area where the primary attributes, such as OneFS ACLs, protection policies, embedded B+ Tree roots, timestamps, etc, are cached. Lastly, there is a sector where the IDI checksum code is stored.
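A candidate file must therefore fit in the space the metatree cache would otherwise occupy. The sketch below illustrates the idea; the static area size comes from the text above, but the dynamic area and IDI sector byte counts are purely illustrative assumptions, not published OneFS constants:

```python
INODE_SIZE = 8192      # 8KB inode
STATIC_AREA = 134      # fixed-width attributes (from the text above)
DYNAMIC_AREA = 2048    # illustrative assumption, not a published constant
IDI_SECTOR = 512       # illustrative assumption, not a published constant

def fits_inline(file_size: int) -> bool:
    # A file is an inlining candidate only if its data fits where the
    # metatree cache would otherwise live.
    available = INODE_SIZE - STATIC_AREA - DYNAMIC_AREA - IDI_SECTOR
    return file_size <= available

print(fits_inline(2000))   # True under these assumed sizes
print(fits_inline(8000))   # False
```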

When a file write coming from the writeback cache, or coalescer, is determined to be a candidate for data inlining, it goes through a fast write path in BSW. Compression will be applied, if appropriate, before the inline data is written to storage.

The read path for inlined files is similar to that for regular files. However, if the file data is not already available in the caching layers, it is read directly from the inode, rather than from separate disk blocks as with regular files.

Protection for inlined data operates the same way as for other inodes and involves mirroring. OneFS uses mirroring as protection for all metadata because it is simple and does not require the additional processing overhead of erasure coding. The number of inode mirrors is determined by the nodepool’s achieved protection policy, as per the table below:

OneFS Protection Level Number of Inode Mirrors
+1n 2 inodes per file
+2d:1n 3 inodes per file
+2n 3 inodes per file
+3d:1n 4 inodes per file
+3d:1n1d 4 inodes per file
+3n 4 inodes per file
+4d:1n 5 inodes per file
+4d:2n 5 inodes per file
+4n 5 inodes per file
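Expressed as a simple lookup, the table above becomes:

```python
# Inode mirror counts by achieved protection level, per the table above.
INODE_MIRRORS = {
    "+1n": 2,
    "+2d:1n": 3, "+2n": 3,
    "+3d:1n": 4, "+3d:1n1d": 4, "+3n": 4,
    "+4d:1n": 5, "+4d:2n": 5, "+4n": 5,
}

print(INODE_MIRRORS["+2d:1n"])   # 3 inode mirrors per file
```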

Unlike file inodes above, directory inodes, which comprise the OneFS single namespace, are mirrored at one level higher than the achieved protection policy. The root of the LIN Tree is the most critical metadata type and is always mirrored at 8x.

Data inlining is automatically enabled by default on all 8KB formatted nodepools for clusters running OneFS 9.3, and does not require any additional software, hardware, or product licenses in order to operate. Its operation is fully transparent and, as such, there are no OneFS CLI or WebUI controls to configure or manage inlining.

In order to upgrade to OneFS 9.3 and benefit from data inlining, the cluster must be running OneFS 8.2.1 or later. A full upgrade commit to OneFS 9.3 is required before inlining becomes operational.

Be aware that data inlining in OneFS 9.3 does have some notable caveats. Specifically, data inlining will not be performed in the following instances:

  • When upgrading to OneFS 9.3 from an earlier release which does not support inlining.
  • During restriping operations, such as SmartPools tiering, when data is moved from a 512 byte diskpool to an 8KB diskpool.
  • Writing CloudPools SmartLink stub files.
  • On file truncation down to non-zero size.
  • Sparse files (for example, NDMP sparse punch files) where allocated blocks are replaced with sparse blocks at various file offsets.
  • For files within a writable snapshot.

Similarly, in OneFS 9.3 the following operations may cause data inlining to be undone, or spilled:

  • Restriping from an 8KB diskpool to a 512 byte diskpool.
  • Forcefully allocating blocks on a file (for example, using the POSIX ‘posix_fallocate’ system call).
  • Sparse punching a file.
  • Enabling CloudPools BCM on a file.

These caveats will be addressed in a future release.

OneFS Job Execution and Node Exclusion

The majority of the OneFS job engine’s jobs have no default schedule and are typically manually started by a cluster administrator or process. Other jobs such as FSAnalyze, MediaScan, ShadowStoreDelete, and SmartPools, are normally started via a schedule. The job engine can also initiate certain jobs on its own. For example, if the SnapshotIQ process detects that a snapshot has been marked for deletion, it will automatically queue a SnapshotDelete job.

The Job Engine will also execute jobs in response to certain system event triggers. In the case of a cluster group change, for example the addition or subtraction of a node or drive, OneFS automatically informs the job engine, which responds by starting a FlexProtect job. The coordinator notices that the group change includes a newly-smart-failed device and then initiates a FlexProtect job in response.

Job administration and execution can be controlled via the WebUI, CLI, or platform API, and a job can be started, stopped, paused, and resumed, with this managed via the job engine’s checkpointing system. For each of these control methods, additional administrative security can be configured using role-based access control (RBAC).

The job engine’s impact control and work throttling mechanism is able to limit the rate at which individual jobs can run. Throttling is employed at a per-manager process level, so job impact can be managed both granularly and gracefully.

Every twenty seconds, the coordinator process gathers cluster CPU and individual disk I/O load data from all the nodes across the cluster. The coordinator uses this information, in combination with the job impact configuration, to decide how many threads may run on each cluster node to service each running job. This can be a fractional number, and fractional thread counts are achieved by having a thread sleep for a given percentage of each second.
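A fractional thread count of, say, 2.25 can be delivered by running two threads flat out plus a third that sleeps for 75% of each second. A minimal sketch of this duty-cycle idea (illustrative only, not job engine code):

```python
# Illustrative duty-cycle sketch: express an allowed fractional thread count
# as per-thread fractions of each second spent working (the remainder is slept).
def duty_cycles(allowed_threads: float) -> list:
    full = int(allowed_threads)
    frac = round(allowed_threads - full, 6)
    cycles = [1.0] * full        # these threads work the whole second
    if frac > 0:
        cycles.append(frac)      # one extra thread works only this fraction
    return cycles

print(duty_cycles(2.25))   # [1.0, 1.0, 0.25]
```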

Using this CPU and disk I/O load data, every sixty seconds the coordinator evaluates how busy the various nodes are and makes a job throttling decision, instructing the various job engine processes as to the action they need to take. This enables throttling to be sensitive to workloads in which CPU and disk I/O load metrics yield different results. Additionally, there are separate load thresholds tailored to the different classes of drives utilized in OneFS powered clusters, from capacity optimized SATA disks to flash-based SSDs.

However, up through OneFS 9.2, a job engine job was an all or nothing entity. Whenever a job ran, it involved the entire cluster – regardless of individual node type, load, or condition. As such, any nodes that were overloaded or in a degraded state could still impact the execution ability of the job at large.

To address this, OneFS 9.3 provides the capability to exclude one or more nodes from participating in running a job. This allows the temporary removal of any nodes with high load, or other issues, from the job execution pool so that jobs do not become stuck. Configuration is via the OneFS CLI and gconfig and is global, such that it applies to all jobs on startup. However, the exclusion configuration is not dynamic, and once a job is started with the final node set, there is no further reconfiguration permitted. So if a participant node is excluded, it will remain excluded until the job has completed. Similarly, if a participant needs to be excluded, the current job will have to be cancelled and a new job started. Any nodes can be excluded, including the node running the job engine’s coordinator process. The coordinator will still monitor the job, it just won’t spawn a manager for the job.

The list of participating nodes for a job are computed in three phases:

  1. Query the cluster’s GMP group.
  2. Call job.get_participating_nodes to get a subset from the GMP group.
  3. Remove the nodes listed in core.excluded_participants from the subset.
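These three phases reduce to straightforward set arithmetic, as this illustrative Python sketch shows:

```python
# Illustrative sketch of the three-phase participant computation.
def participating_nodes(gmp_group, job_subset, excluded):
    candidates = set(gmp_group) & set(job_subset)   # phases 1 and 2
    return candidates - set(excluded)               # phase 3

# e.g. a five-node cluster with nodes 1-3 excluded:
print(sorted(participating_nodes({1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}, {1, 2, 3})))
# [4, 5]
```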

The CLI syntax for configuring an excluded nodes list on a cluster is as follows (in this example, excluding nodes one through three):

# isi_gconfig -t job-config core.excluded_participants="{1,2,3}"

The ‘excluded_participants’ are entered as a comma-separated devid value list with no spaces, specified within curly braces and double quotes. All excluded nodes must be specified in full, since there’s no aggregation. Note that, while the excluded participant configuration will be displayed via gconfig, it is not reported as part of the ‘sysctl efs.gmp.group’ output.

A job engine node exclusion configuration can be easily reset to avoid excluding any nodes by assigning the “{}” value.

# isi_gconfig -t job-config core.excluded_participants="{}"

A ‘core.excluded_participant_percent_warn’ parameter defines the threshold percentage of removed nodes.

# isi_gconfig -t job-config core.excluded_participant_percent_warn

core.excluded_participant_percent_warn (uint) = 10

This parameter defaults to 10%, above which a CELOG event warning is generated.

Any number of nodes can be removed from the job group, and a CELOG informational event will report which nodes have been excluded. If too many nodes have been removed, exceeding the threshold set by the gconfig parameter above, CELOG will fire a warning event. If nodes are excluded that are not part of the GMP group, a different warning event will trigger.

If all nodes are removed, a CLI/pAPI error will be returned, the job will fail, and a CELOG warning will fire. For example:

# isi job jobs start LinCount

Job operation failed: The job had no participants left. Check core.excluded_participants setting and make sure there is at least one node to run the job:  Invalid argument

# isi job status

10   LinCount         Failed    2021-10-24T20:45:23

------------------------------------------------------------------

Total: 9

Note, however, that the following core system maintenance jobs will continue to run across all nodes in a cluster even if a node exclusion has been configured:

  • AutoBalance
  • Collect
  • FlexProtect
  • MediaScan
  • MultiScan

OneFS 9.3 Introduction

Arriving hot on the heels of the PowerScale H700 & H7000 hybrid chassis and A300 & A3000 archive platforms that debuted last month, the new PowerScale OneFS 9.3 release shipped on Monday, 18th October 2021. This new 9.3 release brings with it an array of new features and functionality, including:

Feature Info
NVMe SED support for PowerScale All-flash ·         FIPS-certified Self-Encrypting Drives (SEDs) support for PowerScale F600 & F900 nodes
Writable Snapshots ·         Enables the creation and management of a space and time efficient, modifiable copy of a regular OneFS snapshot.
NFS v4.1 and v4.2 Support ·         Connectivity support for versions 4.1 and 4.2 of the NFS file access protocol.
Long filename support ·         Provision for file names up to 1024 bytes, allowing support for long names in UTF-8 multi-byte character sets.
Inline data in inodes ·         Data efficiency feature allowing a small file’s data to be stored in unused space within its inode block.
HDFS ACLs ·         Provide support for HDFS-4685 access control lists, allowing users to manage permissions on their datasets from Hadoop clients.
Job engine exclusions ·         Allow Job engine jobs to be run on a defined subset of a cluster’s nodes.
CloudPools Recall ·         Improved CloudPools file recall & rehydrate performance.
Safe SMB client disconnects ·         Allows SMB clients the opportunity to flush their caches prior to being disconnected.
S3 protocol enhancements ·         Added support for S3 chunked upload, multi-object delete, and non-slash delimiters for lists.

The new OneFS 9.3 code is available on the Dell EMC Online Support site as both a reimage file for configuring new clusters with 9.3, and an upgrade package for legacy clusters.

For upgrading existing clusters, the recommendation is to open a Service Request with Dell EMC Support to schedule an upgrade. To provide a consistent and positive upgrade experience, Dell EMC is offering assisted upgrades to OneFS 9.3 at no cost to customers with a valid support contract. Please refer to Knowledge Base article KB544296 for additional information on how to initiate the upgrade process.

We’ll also be taking a deeper look at these new OneFS 9.3 features in blog articles over the course of the next few weeks.