OneFS Drain-based Upgrade

In the previous blog article, we looked at the mechanism by which OneFS enables non-disruptive upgrades. For NFS users, rolling node upgrades is typically a pretty seamless event since client connections can be dynamically moved and rebalanced across the other nodes. However, for SMB clients, rolling upgrades can be more impactful.

During an upgrade workflow, nodes will reboot, and the SMB protocol service must be stopped temporarily. This results in a brief disruption for Windows clients  connected to the rebooting node. To solve this, OneFS 9.2 introduces a drain-based upgrade feature, which provides a mechanism by which nodes are prevented from rebooting or restarting protocol services until all SMB clients have disconnected from the node. A single SMB client that does not disconnect can cause the upgrade to be delayed indefinitely, so the cluster administrator is provided with options to reboot the node despite persisting clients.

The drain-based upgrade supports both OneFS, firmware and combined upgrades, and can be configured and managed via the OneFS WebUI, CLI, and RESTful platform API.

  • SMB protocol
  • OneFS upgrades
  • Firmware upgrades
  • Cluster reboots
  • Combined upgrades (OneFS and Firmware)

Drain-based upgrade is predicated upon the parallel upgrade workflow, covered in a previous article, and which offers accelerated upgrades for large clusters by working across OneFS neighborhoods, or fault domains. By concurrently upgrading a node per neighborhood, the more node neighborhoods there are within a cluster the more parallel activity can occur.

For simplicities sake, assume a PowerScale cluster comprising five H700 chassis, divided into two neighborhoods, each containing ten nodes.

The following CLI command can be used to identify the correlation between the cluster’s nodes and OneFS neighborhoods, or failure domains:

# sysctl efs.lin.lock.initiator.coordinator_weights

Once the drain-based upgrade is started, a maximum of one node from each neighborhood will get the reservation which allows the nodes to upgrade simultaneously. OneFS will not reboot these nodes until the number of SMB clients is “0”. In this example Node 1 and Node 8 get the reservation for upgrading at the same time. However, there is one SMB connection to Node 1 and two SMB connections to Node 8. Neither of these nodes will be able to reboot until their SMB connection count gets to “0”. At this point, there are three options available:

Drain Action Description
Wait Wait until the SMB connection count reaches “0” or it hits the drain timeout value. The drain timeout value isa configurable parameter for each upgrade process. It is the maximum waiting period. If drain timeout is set to “0”, it means wait forever.
Delay drain Add the node into the delay list to delay client draining. The upgrade process will continue on another node in this neighborhood. After all the non-delayed nodes are upgraded, OneFS will rewind to the node in the delay list.
Skip drain Stop waiting for clients to migrate away from the draining node and reboot immediately.

 

The following CLI command can be used to confirm whether there are any active SMB connections. In this case, node 1 has one connected Windows client:

# isi statistics query current --keys=node.clientstats.connected.smb

 Node  node.clientstats.connected.smb

-------------------------------------

    1                               1

-------------------------------------

The ‘isi upgrade’ CLI command syntax can be used to perform the drain-based upgrade, and now includes flags for configuring drain-timeout and alert-timeout. In this example setting each to value 60 minutes and 45 minutes respectively. As such, if there is still an SMB connection after 45 minutes, a CELOG alert will be triggered to notify the cluster administrator. And after an hour, any remaining SMB connections will be dropped and the node upgrade reboot will continue.

# isi upgrade start --parallel --skip-optional --install-image-path=/ifs /data/<installation-file-name> --drain-timeout=60m --alert-timeout=45m

From the OneFS WebUI, the same can be achieved by navigating to Upgrade under Cluster management. In the example below, the WebUI indicates that node 1 is waiting for a draining SMB client. The response can be either to Skip or Delay.

If ‘Delay’ is selected, node one pauses to allow the remaining active client connection to drain:

After ‘Skip’ is chosen, the active client count is reduced to 0 and upgrade continues:

Here, the WebUI reports that node 2 has completed upgraded and is in the process of rebooting:

Once all nodes have completed, the upgrade can be committed by running the following command:

# isi upgrade cluster commit

Or from the WebUI:

Finally, confirm that the current version of OneFS is correct by running the following command:

# isi version

OneFS Non-disruptive Upgrades

When it comes to updating the OneFS version on a cluster, there are three primary options:

Of these, the simultaneous reboot is fast but disruptive, in that all the cluster’s nodes are upgraded and restarted in unison.

The other two options, rolling and parallel, are non-disruptive upgrades (NDUs), which allow a storage admin to upgrade a cluster while their end users continue to access data.

During the rolling upgrade process, one node at a time is updated to the new code, and the active clients attached to it are automatically migrated to other nodes in the cluster. Partial upgrade is also permitted, whereby a subset of cluster nodes can be upgraded, and the subset of nodes may also be grown during the upgrade. OneFS also allows an upgrade to be paused and resumed enabling customers to span upgrades over multiple smaller Maintenance Windows.

However, for larger clusters, OneFS also offers a parallel upgrade option. Parallel upgrade provides upgrade efficiency within node pools on clusters with multiple neighborhoods (availability zones), allowing the simultaneous upgrading of a node per neighborhood until the pool is complete . By doing this, the upgrade duration is dramatically reduced, while ensuring that end-users still continue to have full access to their data.

The parallel upgrade option avoids rebooting nodes unless a Diskpools DB reservation can be taken on that node. Each node runs the pre-upgrade optional and mandatory steps in lockstep. Nodes will not proceed to the MarkUpgrading state until the pre-upgrade checks have run successfully on all nodes. Once a node has reached the MarkUpgrading state, it will proceed through the upgrade hooks without regard for the completion state of hook on other nodes (ie not in lockstep).

Given that OneFS’ parallel upgrade option can dramatically improve the OneFS upgrade efficiency without impacting the data availability, the following formula can be used to estimate the duration of the parallel upgrade:

𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 = (𝑝𝑒𝑟 𝑛𝑜𝑑𝑒 𝑢𝑝𝑔𝑟𝑎𝑑𝑒 𝑑𝑢𝑟𝑎𝑡𝑖𝑜𝑛) × (ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑛𝑜𝑑𝑒𝑠 𝑝𝑒𝑟 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟ℎ𝑜𝑜𝑑)

In the above formula:

  • The first parameter – per node upgrade duration – is around 20 minutes on average.
  • The second parameter – the highest number of nodes per neighborhood – can be obtained by running the following CLI command:
# sysctl efs.lin.lock.initiator.coordinator_weights

For example, consider a 150 node OneFS cluster. In an ideal layout, there would be 15 neighborhoods, each containing ten nodes. Neighborhood 1 would comprise nodes 1 to 10, neighborhood 2, nodes 11 to 20, and so on and so forth.

During the parallel upgrade, the upgrade framework will pick at most one node from each neighborhood, to run the upgrading job simultaneously. So in this case, node 1 from neighborhood 1st, node 11 from neighborhood 2nd, node 21 from neighborhood 3rd and etc will be upgraded at the same time. Considering, they are all in different neighborhoods or failure domain, it will not impact the current running workload.  After the first pass completes, it will go to the 2nd pass and then 3rd and etc.

So, in the 150 node example above, the estimated duration of the parallel upgrade is 200 minutes:

𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 = (𝑝𝑒𝑟 𝑛𝑜𝑑𝑒 𝑢𝑝𝑔𝑟𝑎𝑑𝑒 𝑑𝑢𝑟𝑎𝑡𝑖𝑜𝑛) × (ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑛𝑜𝑑𝑒𝑠 𝑝𝑒𝑟 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟ℎ𝑜𝑜𝑑) = 20 × 10 = 200 𝑚𝑖𝑛𝑢𝑡𝑒𝑠

Under the hood, the OneFS non-disruptive upgrade system consists an UpgradeAgent and UpgradeSupervisor components.

The UpgradeAgent is a daemon that runs on every node. The UpgradeAgent’s role is to continually attempt to advance the upgrade process through to completion. It accomplishes this doing two things:

  1. Ensuring that an UpgradeSupervisoris running somewhere on the cluster by (a) checking to see if an upgrade is in progress and (b) waiting for its time slot, grabbing a lock file and then attempting to launch a supervisor.
  2. Receiving messages from any actively running UpgradeSupervisorand taking action on those messages.

The UpgradeSupervisor is a short-lived process which assesses the current state of the cluster and then takes action to advance the progress of the upgrade. The UpgradeSupervisor is stateless. It collects the persistent state of each node from that node’s UpgradeAgent using a status message. It also collects any information persistent on a cluster-wide basis. After reconstructing the current state of the upgrade process, it will then take action to affect the progress of the upgrade by dispatching an action message to the appropriate UpgradeAgent.

Since isi upgrade is an asynchronous process, the nodes in the cluster take turns running the controlling process. As such, the process that starts the upgrade does not run the upgrade but only sets it up. So when an ‘isi upgrade’ CLI command is run it will return fairly quickly. This also means that can’t stop the upgrade by stopping one process. Instead, a stop and restart option is provided using the ‘isi upgrade pause’ and ‘isi upgrade resume’ CLI commands.

Parallel upgrades are easily configured from the OneFS CLI by navigating to Cluster Management > Upgrade, and selecting ‘Parallel upgrade’ from the Upgrade type drop-down menu:

This can also be kicked-off from the OneFS command line using the following CLI syntax:

# isi upgrade start --parallel <upgrade_image>

Similarly, to start a rolling upgrade, which is the default, run:

# isi upgrade cluster start <upgrade_image>

The following CLI syntax will initiate a simultaneous upgrade:

# isi upgrade cluster start --simultaneous <upgrade_image>

Note that the upgrade framework always defaults to a rolling upgrade. Caution is advised when using the CLI to perform a simultaneous upgrade and the scheduling ‘type’ must be specified, i.e., –rolling, –simultaneous or –parallel

For example:

# isi upgrade cluster start /ifs/install.tar isi upgrade cluster start <code_path>

Since OneFS supports the ability to roll back to the previous version, in-order to complete an upgrade it must be committed.

isi upgrade cluster commit

Up until the time an upgrade is committed, an upgrade can be rolled back to the prior version as follows.

isi upgrade cluster rollback

The isi upgrade view CLI command can be used to monitor how the upgrade is progressing:

# isi upgrade viewisi upgrade view -i/--interactive

The following command will provide more detailed/verbose output:

# isi_upgrade_status

A faster, simpler version of isi_upgrade_status is also available:

# isi_upgrade_node_state-a (aggregate the latest hook update for each node)-devid=<X,Y,E-F>  (filter and display by devid)-lnn=<X-Y,A,C> (filter and display by LNN)-ts (time sort entries)

If the end of a maintenance window is reached but the cluster is not fully upgraded, the upgrade process can be quiesced and then restarted using the following CLI commands:

# isi upgrade pause
# isi upgrade resume

For example:

# isi upgrade pause

You are about to pause the running process, are you sure?  (yes/[no]):

yes

The process will be paused once the current step completes.

The current operation can be resumed with the command:

# isi upgrade resume

Note that pausing is not immediate: The upgrade will remain in a “Pausing” state until the currently
upgrading node is completed. Additional nodes will not be upgraded until the upgrade process is resumed.

The ‘pausing’ state can be viewed with the following commands: ‘isi upgrade view’ and ‘isi_upgrade_status’. Note that a rollback can be initiated either during ‘Pausing’ or ‘Paused’ states. Also, be aware that the ‘isi upgrade pause’ command has no effect when performing a simultaneous OneFS upgrade.

A rolling reboot can be initiated from the CLI on a subset of cluster nodes using the ‘isi upgrade rolling-reboot’ syntax and the ‘–nodes’ flag specifying the desired LNNs for upgrade:

# isi upgrade rolling-reboot --help

Description:

    Perform a Rolling Reboot of cluster.

Required Privileges:

    ISI_PRIV_SYS_UPGRADE

Usage:

    isi upgrade cluster rolling-reboot

        [--nodes <integer_range_list>]

        [--force]

        [{--help | -h}]

Options:

    --nodes <integer_range_list>

        List of comma (1,3,7) or dash (1-7) specified node LNNs to select. "all"

        can also be used to select all the cluster nodes at any given time.

  Display Options:

    --force

        Do not ask confirmation.

    --help | -h

        Display help for this command.

This ‘isi upgrade view’ syntax provides better visibility, status and progress of the rolling reboot process. For example:

# isi upgrade view

Upgrade Status:

Current Upgrade Activity: RollingReboot

   Cluster Upgrade State: committed

   Upgrade Process State: Not started

      Current OS Version: 9.2.0.0

      Upgrade OS Version: N/A

        Percent Complete: 0%

Nodes Progress:

     Total Cluster Nodes: 3

       Nodes On Older OS: 3

          Nodes Upgraded: 0

Nodes Transitioning/Down: 0

LNN  Progress  Version  Status

---------------------------------

1    100%        9.2.0.0  committed

2    rebooting   Unknown  non-responsive

3    0%          9.2.0.0  committed

Due to the duration of OneFS upgrades on larger clusters, it can sometimes be unclear if an OS upgrade is actually progressing or has stalled. To address this, if an upgrade is not making progress after fifteen minutes, the upgrade framework automatically sends a SW_UPGRADE_NODE_NON_RESPONSIVE alert via CELOG. For example:

# isi event events list

ID        Occurred       Sev    Lnn   Eventgroup ID  Message                                                                     

---------------------------------------------------------------------------------------------------------------

2.1805  06/14 04:33  C       2      1087               Excessive Time executing a Hook on Node: 3
# isi status

...

Critical Events:

Time                 LNN  Event                          

---------------     ----  ------------------------------ 

06/14 05:16:30  2    Excessive Time executing a ... 

...

The isi_upgrade_logs command also provides detailed upgrade tracking and debugging data.

Usage: isi_upgrade_logs [-a|--assessment][--lnn][--process {process name}][--level {start level,end level][--time {start time,end time][--guid {guid} | --devid {devid}]

 + No parameter this utility will pull error logs for the current upgrade process

 + -a or --assessment - will interrogate the last upgrade assessment run and display the results

The following arguments enable filtering to help extract the desired upgrade information:

Filter CMD Flag Description
–guid dump the logs for the node with the supplied guid
–devid dump the logs for the node/s with the supplied devid/s
–lnn dump the logs for the node/s with the supplied lnn/s
–process dump the logs for the node with the supplied process name
–level dump the logs for the supplied level range
–time dump the logs for the supplied time range
–metadata dump the logs matching the supplied regex

For example, to display all of the logs generated by isi_upgrade_agent_d on the node with LNN1:

# isi_upgrade_logs --lnn 1 --process /usr/sbin/isi_upgrade_agent_d  
…
1  2021-06-14T18:06:15  /usr/sbin/isi_upgrade_agent_d  Debug  Starting /usr/share/upgrade/event-actions/pre-upgrade-optional/read_only_node_check.py

1  2021-06-14T23:59:59  /usr/sbin/isi_upgrade_agent_d  Debug  Starting /usr/share/upgrade/event-actions/pre-upgrade-optional/isi_upgrade_checker

1  2021-06-14T18:06:15  /usr/sbin/isi_upgrade_agent_d  Debug  Starting /usr/share/upgrade/event-actions/pre-upgrade-optional/volcopy_check

1  2021-06-14T18:06:15  /usr/sbin/isi_upgrade_agent_d  Debug  Starting /usr/share/upgrade/event-actions/pre-upgrade-optional/empty

1  2021-06-14T18:06:15  /usr/sbin/isi_upgrade_agent_d  Debug  Starting Hook [/usr/share/upgrade/event-actions/pre-upgrade-optional/read_only_node_check.py]
 …

Note that the ‘–process’ flag requires the full name including path to be specified, as it is displayed in the logs.

For example, the following CLI syntax displays a list all of the Upgrade-related process names that have logged to LNN 1:

# isi_upgrade_logs --lnn 1 | awk ‘{print $3}’ | sort | uniq

These process names can then be added to the ‘–process’ argument.

OneFS CELOG Alerts and Events WebUI

The OneFS 9.2 release introduced a number of OneFS usability enhancements for managing cluster events and alerts. This new functionality makes it considerably simpler to filter events chronologically, categorize by their status, filter by the severity, search the event history, resolve, suppress or ignore bulk events, and manage scheduled maintenance windows.

For example, you can easily categorize, identify, and filter events by using the following criteria:

Action Detail
Show ·         Show events for:

–    Today

–    This week

–    This month

–    Custom range/

–    All

Categorize ·         Categorize events by their status:

–    Active

–    Ignored

–    Resolved

–    All

Filter ·         Filter events by severity:

–    Emergency

–    Critical

–    Warning

–    Information

Search ·         Search for specific event(s) in the event history
Resolve ·         Resolve bulk events
Ignore ·         Ignore bulk events

The new WebUI page for event group history can be accessed by navigating to Cluster management > Events and Alerts. For example:

With OneFS 9.2, CELOG maintenance mode can also easily be manually enabled and disabled. During a maintenance window, the system will continue to log events but not generate alerts. However, all events that occurred during the maintenance window can then be reviewed upon disabling maintenance mode. Active event groups will automatically resume generating alerts when the scheduled maintenance period ends. For example, to enable CELOG maintenance mode, in the OneFS WebUI select Cluster Management > Events and Alerts > Alert management tab and click on the ‘Enable CELOG maintenance mode’ button. In the prompt window, select ‘Enable CELOG maintenance mode’ as follows:

Create an Alert channel. Either SMTP or SNMP can be configured for the alert channel communication, and can be created by selecting ‘Create channel’:

To create an alert rule, click the tab Alerting rule and then click the button Create alert rule. In the prompt window, fill the Rule name, set the Rule condition to NEW, apply it to all the Alert categories, and attached it to the channel you have just created.

Events can be created using CLI syntax similar to the following:

# /usr/bin/isi_celog/celog_send_events.py -o 940100002

Heap looptimes: [-1]

running -1 [940100002]

1612871342 :: Sending eventids [940100002] with specifier None

940100002 message is OneFS {version} is currently running on unsupported nodes (devid(s) {devids}). {msg}.

1.195 (70368744177859) corresponds to eventid 940100002

Out of events to run. Exiting.
# /usr/bin/isi_celog/celog_send_events.py -o 940100001

Heap looptimes: [-1]

running -1 [940100001]

1612872343 :: Sending eventids [940100001] with specifier None

940100001 message is OneFS {version} is currently running and is not supported on this hardware: {msg}.

1.196 (70368744177860) corresponds to eventid 940100001

Out of events to run. Exiting.

During maintenance mode, OneFS will still show the event but there will be no associated alert. In this example, there is no SMTP alert email triggered.

The following CLI syntax can also be used to filter all the events which happened while the cluster was in CELOG maintenance mode:

# isi event groups list --maintenance-mode=true

ID   Started     Ended       Causes Short                           Lnn  Events  Severity

------------------------------------------------------------------------------------------

16   02/09 11:49 --          HW_CLUSTER_ONEFS_VERSION_NOT_SUPPORTED 1    1       critical

17   02/09 12:05 02/09 12:19 HW_ONEFS_VERSION_NOT_SUPPORTED         1    1       critical

------------------------------------------------------------------------------------------

Click the “Disable CELOG maintenance mode’ button. and select one of the following from the display window:

    1. View event details
    2. Ignore event
    3. Resolve event

In this example, the event HW_ONEFS_VERSION_NOT_SUPPORTED is marked resolved by clicking Action and Resolve event.

After the CELOG maintenance mode is disabled, you will get the email notification only for HW_CLUSTER_ONEFS_VERSION_NOT_SUPPORTED. The event which has been marked resolved will not trigger any notification.

When an event type is suppressed, it prevents an event from alerting on all configured CELOG channels. However, the event will still be displayed in the event group history.

To suppress an event type, click the button Suppress for a specific event under Event type ID tab. In this example, both 930100006 and 930100005 have been suppressed.

Create several events, for example by using CLI commands such as the following:

# /usr/bin/isi_celog/celog_send_events.py -o 930100006

Heap looptimes: [-1]

running -1 [930100006]

1612873812 :: Sending eventids [930100006] with specifier None

930100006 message is {sensor} out of spec in chassis {chassis} slot {slot}.

1.200 (70368744177864) corresponds to eventid 930100006

Out of events to run. Exiting.
# /usr/bin/isi_celog/celog_send_events.py -o 930100005

Heap looptimes: [-1]

running -1 [930100005]

1612873817 :: Sending eventids [930100005] with specifier None

930100005 message is {sensor} out of spec in chassis {chassis} slot {slot}.

1.201 (70368744177865) corresponds to eventid 930100005

Out of events to run. Exiting.

To list all the events in the suppressed list, use the following CLI syntax:

# isi event suppress list

ID        Name

—————————-

930100005 HWMON_ANY_DISCRETE

930100006 HWMON_ANY_METERS

—————————-

These suppressed events will only show in the event history and will not trigger any notification in any channels.

The desired event types can be un-suppressed by clicking pertinent the Un-suppress button(s).

 

OneFS External Key Management for Data Encryption

Data-at-rest data is inactive content that is physically housed on a cluster, or other storage medium. Protecting this data with cryptography ensures that it’s guarded against theft, in the event that drives or nodes are removed from a PowerScale cluster. Data-at-rest encryption (DARE) is a requirement for federal and industry regulations ensuring data is encrypted when it is stored. OneFS has provided DARE solutions for many years through self-encrypting drives (SEDs) and, until now, an internal key management system.

The new OneFS 9.2 release introduces External Key Management (EKM) support for encrypted clusters, through the key management interoperability protocol, or KMIP. This enables offloading of the Master Key from a node to an External Key Manager, such as SKLM, SafeNet or Vormetric. Enhanced security is inherently provided through the separation of key manager from cluster, since node(s) cannot be rebooted without the keys. It also supports the secure transport of nodes. Additionally, centralized key management is also made available for multiple SED clusters, and provides the option to migrate existing keys from a cluster’s internal key store.

EKM provides enhanced security through the separation of the key manager from the cluster, enabling the secure transport of nodes, and helping organizations to meet regulatory compliance and corporate data at rest security requirements.

In order to use the OneFS 9.2 External Key Manager feature, clearly a cluster with self-encrypting drives is needed. Additionally, for the SED drives to be unlocked and user data made available, each node in the cluster must first contact the KMIP server to obtain the master encryption key from the server. Nodes in the cluster cannot boot without contacting the KMIP server first. Note that clusters without all their nodes connected to a front-end network (NANON) do not support External Key Management.

For the server, a KMIP Compliant Server supporting KMIP version 1.2 or greater is needed, such as:

Vendor Key Manager & Version
Dell EMC CloudLink Center 6.0
Gemalto Gemalto KeySecure 8.7 k150v

Gemalto KeySecure 8.7 k170v

IBM Secure Key Lifecycle Manager (SKLM) 2.6.0.2

Secure Key Lifecycle Manager (SKLM) 2.7.0.0

Secure Key Lifecycle Manager (SKLM) 3.0.0

Thales E-Security KeyAuthority 4.0

Also:

  • KMIP Storage Array with SEDS Profile Version 1.0
  • KMIP server host/port information
  • 509 PKI for TLS mutual authentication
    • Certificate authority bundle
    • Client certificate and private key

External Key Management can be configured on OneFS 9.2 as follows:

  1. Obtain the KMIPs Server and Client Certificates. Copy both certificates to the cluster and make a note of the file names and location.
  2. From the OneFS web interface, select Access > Key Management

Alternatively, this can also be accomplished via the OneFS CLI:

# isi keymanager kmip servers create
  1. From the WebUI Key Management page, enter the KMIP server information and specify the filename with the server/client certificates’ location. If the KMIP has a client certificate password specified, enter that here. Check the “Enable Key Management” box and click Submit.

  1. Next, OneFS contacts the KMIP and confirms the connection or displays any errors.
  2. Once the KMIP server is added, the keys can now be migrated. Click the Keys tab to display all current Master Keys on the cluster. Click on Migrate all to migrate the keys to the KMIP server. From the “Migrate all” pop-up, click Migrate to start the migration.

  1. The key migration process may take several minutes or more to complete depending on the cluster and network utilization. During this time, a “Migration in process” message is displayed.

  1. Once the process is complete, and “Migration Successful” message is displayed:

The following OneFS key management CLI commands are also available:

To configure external KMIP servers:

# isi keymanager kmip servers -h     

Description:

    Configure external KMIP servers.

Required Privileges:

    ISI_PRIV_KEY_MANAGER

Usage:

    isi keymanager kmip servers <action>

        [--timeout <integer>]

        [{--help | -h}]

Actions:

    create    Configure a new external KMIP server.

    delete    Delete a external KMIP server.

    modify    Modify a external KMIP server.

    list      View a list of configured external KMIP servers.

    view      View a single external KMIP server.

Options:

  Display Options:

    --timeout <integer>

        Number of seconds for a command timeout (specified as 'isi --timeout NNN <command>').

    --help | -h

        Display help for this command.

See 'isi keymanager kmip servers <subcommand> --help' for more information on a specific subcommand.

To manage SED keystore settings:

# isi keymanager sed settings -h

Description:

    Manage self-encrypting drive keystore settings.

Required Privileges:

    ISI_PRIV_KEY_MANAGER

Usage:

    isi keymanager sed settings <action>

        [{--help | -h}]

Actions:

    modify    Modify SED settings

    view      View current SED settings.

Options:

  Display Options:

    --help | -h

        Display help for this command.

See 'isi keymanager sed settings <subcommand> --help' for more information on a specific subcommand.

And to report keymanager SED status:

# isi keymanager sed status    

 Node Status  Location  Remote Key ID  Error Info(if any)
----------------------------------------------------------
  1   REMOTE  Server    F84B50640CABD44B9D5F75427C2B5E

  2   REMOTE  Server    24285969BD8804A9A61EE39D99573B

  3   REMOTE   Server    7D561B1CA89B72B891B21BF834097F  
-----------------------------------------------------------
Total: 3

Bigdata File Formats Support on Dell EMC ECS 3.6

This article describes the Dell EMC ECS’s support for Apache Hadoop file formats in terms of disk space utilization. To determine this, we will use Apache Hive service to create and store different file format tables and analyze the disk space utilization by each table on the ECS storage.

Apache Hive supports several familiar file formats used in Apache Hadoop. Hive can load and query different data files created by other Hadoop components such as PIG, Spark, MapReduce, etc. In this article, we will check Apache Hive file formats such as TextFile, SequenceFIle, RCFile, AVRO, ORC and Parquet formats. Cloudera Impala also supports these file formats.

To begin with, let us understand a bit about these Bigdata File formats. Different file formats and compression codes work better for different data sets in Hadoop, the main objective of this article is to determine their supportability on DellEMC ECS storage which is a S3 compatible object store for Hadoop cluster.

Following are the Hadoop file formats

Test File: This is a default storage format. You can use the text format to interchange the data with another client application. The text file format is very common for most of the applications. Data is stored in lines, with each line being a record. Each line is terminated by a newline character(\n).

The test format is a simple plane file format. You can use the compression (BZIP2) on the text file to reduce the storage spaces.

Sequence File: These are Hadoop flat files that store values in binary key-value pairs. The sequence files are in binary format and these files can split. The main advantage of using the sequence file is to merge two or more files into one file.

RC File: This is a row columnar file format mainly used in Hive Datawarehouse, offers high row-level compression rates. If you have a requirement to perform multiple rows at a time, then you can use the RCFile format. The RCFile is very much like the sequence file format. This file format also stores the data as key-value pairs.

AVRO File: AVRO is an open-source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and a program written in any programming language. Avro is one of the popular file formats in Big Data Hadoop based applications.

ORC File: The ORC file stands for Optimized Row Columnar file format. The ORC file format provides a highly efficient way to store data in the Hive table. This file system was designed to overcome limitations of the other Hive file formats. The Use of ORC files improves performance when Hive is reading, writing, and processing data from large tables.

More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

Parquet File: Parquet is a column-oriented binary file format. The parquet is highly efficient for the types of large-scale queries. Parquet is especially good for queries scanning particular columns within a particular table. The Parquet table uses compression Snappy, gzip; currently Snappy by default.

More information on the Parquet file format: https://parquet.apache.org/documentation/latest/

Please note for below testing Cloudera CDP Private Cloud Base 7.1.6 Hadoop cluster is used.

Disk Space Utilization on Dell EMC ECS

What is the space on the disk that is used for these formats in Hadoop on Dell EMC ECS? Saving on disk space is always a good thing, but it can be hard to calculate exactly how much space you will be used with compression. Every file and data set is different, and the data inside will always be a determining factor for what type of compression you’ll get. The text will compress better than binary data. Repeating values and strings will compress better than pure random data, and so forth.

As a simple test, we took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html. The compressed bz2 download measures at 108.5 Mb, and uncompressed at 657.5 Mb. We then uploaded the data to Dell EMC ECS through s3a protocol, and created an external table on top of the uncompressed data set:

Copy the original dataset to Hadoop cluster
[root@hop-kiran-n65 ~]# ll
total 111128
-rwxr-xr-x 1 root root 113753229 May 28 02:25 2008.csv.bz2
-rw-------. 1 root root 1273 Oct 31 2020 anaconda-ks.cfg
-rw-r--r--. 1 root root 36392 Dec 15 07:48 docu99139
[root@hop-kiran-n65 ~]# hadoop fs -put ./2008.csv.bz2 s3a://hive.ecs.bucket/diff_file_format_db/bz2/
[root@hop-kiran-n65 ~]# hadoop fs -ls s3a://hive.ecs.bucket/diff_file_format_db/bz2/
Found 1 items
-rw-rw-rw- 1 root root 113753229 2021-05-28 02:00 s3a://hive.ecs.bucket/diff_file_format_db/bz2/2008.csv.bz2
[root@hop-kiran-n65 ~]#
From Hadoop Compute Node, create a database with data location on ECS bucket and create an external table for the flights data uploaded to ECS bucket location.
DROP DATABASE IF EXISTS diff_file_format_db CASCADE;

CREATE database diff_file_format_db COMMENT 'Holds all the tables data on ECS bucket' LOCATION 's3a://hive.ecs.bucket/diff_file_format_db' ;
USE diff_file_format_db;

Create external table flight_arrivals_txt_bz2 (
year int,
month int,
DayofMonth int,
DayOfWeek int,
DepTime int,
CRSDepTime int,
ArrTime int,
CRSArrTime int,
UniqueCarrier string,
FlightNum int,
TailNum string,
ActualElapsedTime int,
CRSElapsedTime int,
AirTime int,
ArrDelay int,
DepDelay int,
Origin string,
Dest string,
Distance int,
TaxiIn int,
TaxiOut int,
Cancelled int,
CancellationCode int,
Diverted int,
CarrierDelay string,
WeatherDelay string,
NASDelay string,
SecurityDelay string,
LateAircraftDelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location 's3a://hive.ecs.bucket/diff_file_format_db/bz2/';
The total number of records in this master table is
select count(*) from flight_arrivals_txt_bz2 ;

+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+
Similarly, create different file format tables using the master table

To create different file formats files by simply specifying ‘STORED AS FileFormatName’ option at the end of a CREATE TABLE Command.

Create external table flight_arrivals_external_orc stored as ORC as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_parquet stored as Parquet as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_textfile stored as textfile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_sequencefile stored as sequencefile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_rcfile stored as rcfile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_avro stored as avro as select * from flight_arrivals_txt_bz2;
Disk space utilization of the tables

Now, let us compare the disk usage on ECS of all the files from Hadoop compute nodes.

[root@hop-kiran-n65 ~]# hadoop fs -du -h s3a://hive.ecs.bucket/diff_file_format_db/ | grep flight_arrivals
597.7 M 597.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_avro
93.5 M 93.5 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_orc
146.6 M 146.6 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_parquet
403.1 M 403.1 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_rcfile
751.1 M 751.1 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_sequencefile
670.7 M 670.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_textfile
[root@hop-kiran-n65 ~]#

Summary

From the below table we can conclude that Dell EMC ECS as S3 storage supports all the Hadoop file formats and provides the same disk utilization as with the traditional HDFS storage.

Compressed Percentage lower is batter and Compression ratio higher is better.

Format
Size
Compressed%
Compressed Ratio
CSV (Text) 670.7 M
BZ2 108.5 M 16.18% 83.82%
ORC 93.5 M 13.94% 86.06%
Parquet 146.6 M 21.85% 78.15%
RC FIle 403.1 M 60.10% 39.90%
AVRO 597.7 M 89.12% 10.88%
Sequence 751.1 M 111.97% -11.87%

Here the default settings and values were used to create all the different file format tables, there were no other optimizations done for this testing. Each file format ships with many options and optimizations to compress the data, only the defaults that ship CDP pvt cloud base 7.1.6 were used.

 

 

 

 

 

 

 

OneFS Cluster Configuration Export & Import

OneFS 9.2 introduces the ability to export a cluster’s configuration, which can then be used to perform a configuration restore to the original cluster, or to an alternate cluster that also supports this feature. A configuration export and import can be performed via either the OneFS CLI or platform API, and encompasses the following OneFS components for configuration backup and restore:

  • NFS
  • SMB
  • S3
  • NDMP
  • HTTP
  • Quotas
  • Snapshots

The underlying architecture comprises four layers , and the process flow is as follows:

Each layer of the architecture is a follows:

 Component Description
User Interface Allows users to submit operations with multiple choices, such as REST, CLI, or WebUI.
pAPI Handler Performs different actions according to the requests flowing in
Config Manager Core layer executing different jobs which are called by PAPI handlers.
Database Lightweight database manage asynchronous jobs, tracing state and receiving task data.

By default, configuration backup and restore files reside at:

File Location
Backup JSON file: /ifs/data/Isilon_Support/config_mgr/backup/<JobID>/<component>_<JobID>.json
Restore JSON file: /ifs/data/Isilon_Support/config_mgr/restore/<JobID>/<component>_<JobID>.json

The log file for configuration manager is located at /var/log/config_mgr.log and can be useful to monitor the progress of a config backup and restore.

So let’s take a look at this cluster configuration management process:

The following procedure steps through the export and import of a cluster’s NFS and SMB configuration – within the same cluster:

  1. Open an SSH connection to any node in the cluster and log in using the root account.
  2. Create several SMB shares and NFS exports using the following CLI command
# isi smb shares create --create-path --name=test --path=/ifs/test

# isi smb shares create --create-path --name=test2 --path=/ifs/test2

# isi nfs exports create --paths=/ifs/test

# isi nfs exports create --paths=/ifs/test2
  1. Export the NFS and SMB configuration using the following CLI command
# isi cluster config exports create --components=nfs,smb --verbose

As indicated in the output below, the job ID for this export task is ‘ PScale-20210524105345’

Are you sure you want to export cluster configuration? (yes/[no]): yes

This may take a few seconds, please wait a moment

Created export task ' PScale-20210524105345'
  1. To view the results of the export operation, use the following CLI command:
# isi cluster config exports view PScale-20210524105345

As displayed in the output below, the backup JSON files are located at /ifs/data/Isilon_Support/config_mgr/backup/PScale-20210524105345

     ID: PScale-20210524105345

 Status: Successful

   Done: ['nfs', 'smb']

 Failed: []

Pending: []

Message:

   Path: /ifs/data/Isilon_Support/config_mgr/backup/PScale-20210524105345
  1. The JSON files can be viewed under /ifs/data/Isilon_Support/config_mgr/backup/ PScale-20210524105345. OneFS will generate a separate configuration backup JSON file for each component (ie. SMB and NFS in this example):
# ls /ifs/data/Isilon_Support/config_mgr/backup/PScale-20210524105345 backup_readme.json              nfs_PScale-20210524105345.json  smb_PScale-20210524105345.json
  1. Delete all the SMB shares and NFS exports using the following commands:
# isi smb shares delete test

# isi smb shares delete test2

# isi nfs exports delete 9

# isi nfs exports delete 10
  1. Use the following CLI command to restore the SMB and NFS configuration:
# isi cluster config imports create PScale-20210524105345 --components=smb,nfs
  1. From the output below, the import job ID is ‘ PScale-20210524105345’
Are you sure you want to import cluster configuration? (yes/[no]): yes

This may take a few seconds, please wait a moment

Created import task ' PScale-20210524105345'
  1. To view the restore results, use the following command:
# isi cluster config imports view PScale-20210524105345

       ID: PScale-20210524110659

Export ID: PScale-20210524105345

   Status: Successful

     Done: ['nfs', 'smb']

   Failed: []

  Pending: []

  Message:

     Path: /ifs/data/Isilon_Support/config_mgr/restore/ PScale-20210524110659
  1. Verify that the SMB shares and NFS exports are restored:
# isi smb shares list

Share Name  Path

----------------------

test        /ifs/test

test2       /ifs/test2

----------------------

Total: 2
# isi nfs exports list

ID   Zone   Paths      Description

-----------------------------------

11   System /ifs/test

12   System /ifs/test2

-----------------------------------

Total: 2

A WebUI management component for this feature will be included in a future release, as will the ability to run a diff, or comparison, between two exported configurations .

PowerScale F900 All-flash NVMe Node

In this article, we’ll take a quick peek at the new PowerScale F900 hardware platform that was released last week. Here’s where this new node sits in the current PowerScale hardware hierarchy:

The PowerScale F900 is a high-end all-flash platform that utilizes NVMe SSDs and a dual-CPU 2U PowerEdge platform with 736GB of memory per node.  The ideal use cases for the F900 include high performance workflows, such as M&E, EDA, AI/ML, and other HPC applications and next gen workloads.

An F900 cluster can comprise between 3 and 252 nodes, each of which contains twenty four 2.5” drive bays populated with a choice of 1.92TB, 3.84TB, 7,68TB, or 15.36TB enterprise NVMe SSDs, and netting up to 181TB of RAM and 91PB of all-flash storage per cluster. Inline data reduction, which incorporates compression, dedupe, and single instancing, is also included as standard to further increase the effective capacity.

The F900 is based on the 2U Dell R740 PowerEdge server platform, with dual socket Intel CPUs, as follows:

Description PowerScale F900 

(PE R740xd platform w/ NVMe SSDs)

Minimum # of nodes in a cluster 3
Raw capacity per minimum sized cluster (3 nodes) 138TB to 1080TB

Drive capacity options:

1.92 TB, 3.82 TB, 7.68 TB, or 15.36 TB

SSD Drives in min. sized cluster 24 x 3 = 72
Rack Unit (RU) per min. cluster 6 RU
Processor Dual socket Intel Xeon Processor Gold 6240R (2.2GHz, 24C)
Memory per node 736 GB per node
Front-End Connectivity 2 x 10/25GbE or 2 x 40/100GbE
Back-end Connectivity 2 x 40/100GbE or

2 x QDR Infiniband (IB) for interoperability to previous generation clusters

Or, as reported by OneFS:

# isi_hw_status -ic
SerNo: 5FH9K93
Config: PowerScale F900
ChsSerN: 5FH9K93
ChsSlot: n/a
FamCode: F
ChsCode: 2U
GenCode: 00
PrfCode: 9
Tier: 7
Class: storage
Series: n/a
Product: F900-2U-Dual-736GB-2x100GE QSFP+-45TB SSD
HWGen: PSI
Chassis: POWEREDGE (Dell PowerEdge)
CPU: GenuineIntel (2.39GHz, stepping 0x00050657)
PROC: Dual-proc, 24-HT-core
RAM: 789523222528 Bytes
Mobo: 0YWR7D (PowerScale F900)
NVRam: NVDIMM (NVDIMM) (8192MB card) (size 8589934592B)
DskCtl: NONE (No disk controller) (0 ports)
DskExp: None (No disk expander)
PwrSupl: PS1 (type=AC, fw=00.1D.7D)
PwrSupl: PS2 (type=AC, fw=00.1D.7D)

The F900 nodes are available in two networking configurations, with either a 10/25GbE or 40/100GbE front-end, plus a standard 100GbE or QDR Infiniband back-end for each.

The 40G and 100G connections are actually four lanes of 10G and 25G respectively, allowing switches to ‘breakout’ a QSFP port into 4 SFP ports. While this is automatic on the Dell back-end switches, some front-end switches may need configuring.

Drive subsystem-wise, the PowerScale F900 has twenty four total drive bays spread across the front of the chassis:

Under the hood on the F900, OneFS provides support NVMe across PCIe lanes, and the SSDs use the NVMe and NVD drivers. The NVD is a block device driver that exposes an NVMe namespace like a drive and is what most OneFS operations act upon, and each NVMe drive has a /dev/nvmeX, /dev/nvmeXnsX and /dev/nvdX device entry  and the locations are displayed as ‘bays’. Details can be queried with OneFS CLI drive utilities such as ‘isi_radish’ and ‘isi_drivenum’. For example:

# isi devices drive list
Lnn Location Device Lnum State Serial
------------------------------------------------------
1 Bay 0 /dev/nvd15 9 HEALTHY S61DNE0N702037
1 Bay 1 /dev/nvd14 10 HEALTHY S61DNE0N702480
1 Bay 2 /dev/nvd13 11 HEALTHY S61DNE0N702474
1 Bay 3 /dev/nvd12 12 HEALTHY S61DNE0N702485
1 Bay 4 /dev/nvd19 5 HEALTHY S61DNE0N702031
1 Bay 5 /dev/nvd18 6 HEALTHY S61DNE0N702663
1 Bay 6 /dev/nvd17 7 HEALTHY S61DNE0N702726
1 Bay 7 /dev/nvd16 8 HEALTHY S61DNE0N702725
1 Bay 8 /dev/nvd23 1 HEALTHY S61DNE0N702718
1 Bay 9 /dev/nvd22 2 HEALTHY S61DNE0N702727
1 Bay 10 /dev/nvd21 3 HEALTHY S61DNE0N702460
1 Bay 11 /dev/nvd20 4 HEALTHY S61DNE0N700350
1 Bay 12 /dev/nvd3 21 HEALTHY S61DNE0N702023
1 Bay 13 /dev/nvd2 22 HEALTHY S61DNE0N702162
1 Bay 14 /dev/nvd1 23 HEALTHY S61DNE0N702157
1 Bay 15 /dev/nvd0 0 HEALTHY S61DNE0N702481
1 Bay 16 /dev/nvd7 17 HEALTHY S61DNE0N702029
1 Bay 17 /dev/nvd6 18 HEALTHY S61DNE0N702033
1 Bay 18 /dev/nvd5 19 HEALTHY S61DNE0N702478
1 Bay 19 /dev/nvd4 20 HEALTHY S61DNE0N702280
1 Bay 20 /dev/nvd11 13 HEALTHY S61DNE0N702166
1 Bay 21 /dev/nvd10 14 HEALTHY S61DNE0N702423
1 Bay 22 /dev/nvd9 15 HEALTHY S61DNE0N702483
1 Bay 23 /dev/nvd8 16 HEALTHY S61DNE0N702488
------------------------------------------------------
Total: 24

Or for the details of a particular drive:

# isi devices drive view 15
Lnn: 1
Location: Bay 15
Lnum: 0
Device: /dev/nvd0
Baynum: 15
Handle: 346
Serial: S61DNE0N702481
Model: Dell Ent NVMe AGN RI U.2 1.92TB
Tech: NVME
Media: SSD
Blocks: 3750748848
Logical Block Length: 512
Physical Block Length: 512
WWN: 363144304E7024810025384500000003
State: HEALTHY
Purpose: STORAGE
Purpose Description: A drive used for normal data storage operation
Present: Yes
Percent Formatted: 100
# isi_radish -a /dev/nvd0

Bay 15/nvd0   is Dell Ent NVMe AGN RI U.2 1.92TB FW:2.0.2 SN:S61DNE0N702481, 3750748848 blks

Log Sense data (Bay 15/nvd0  ) --

Supported log pages 0x1 0x2 0x3 0x4 0x5 0x6 0x80 0x81

SMART/Health Information Log

============================

Critical Warning State:         0x00

 Available spare:               0

 Temperature:                   0

 Device reliability:            0

 Read only:                     0

 Volatile memory backup:        0

Temperature:                    310 K, 36.85 C, 98.33 F

Available spare:                100

Available spare threshold:      10

Percentage used:                0

Data units (512,000 byte) read: 3804085

Data units written:             96294

Host read commands:             29427236

Host write commands:            480646

Controller busy time (minutes): 7

Power cycles:                   36

Power on hours:                 774

Unsafe shutdowns:               31

Media errors:                   0

No. error info log entries:     0

Warning Temp Composite Time:    0

Error Temp Composite Time:      0

Temperature Sensor 1:           310 K, 36.85 C, 98.33 F

Temperature 1 Transition Count: 0

Temperature 2 Transition Count: 0

Total Time For Temperature 1:   0

Total Time For Temperature 2:   0

SMART status is threshold NOT exceeded (Bay 15/nvd0  )

Error Information Log

=====================

No error entries found

The F900 nodes’ front panel has limited functionality compared to older platform generations and will simply allow the user to join a node to a cluster and display the node name after the node has successfully joined the cluster.

Similar to legacy Gen6 platforms, a PowerScale node’s serial number can be found either by viewing /etc/isilon_serial_number or running the ‘isi_hw_status | grep SerNo’ CLI command syntax. The serial number reported by OneFS will match that of the service tag attached to the physical hardware and the /etc/isilon_system_config file will report the appropriate node type. For example:

# cat /etc/isilon_system_config

PowerScale F900

OneFS 9.2 and PowerScale F900 Introduction

It’s release season here and we’re delighted to introduce both PowerScale OneFS 9.2 and the new PowerScale F900 all-flash NVMe node.

The PowerScale F900 will be the highest performing platform in the PowerScale portfolio. It’s based on the Dell R740xd platform, and features dual socked 24-core 2.2GHz Intel Xeon Gold CPU, 736 GB of RAM, 100Gb Ethernet or QDR Infiniband backend, and twenty four 2.5 inch NVMe drives per 2U node. These drives are available in 1.9TB, 3.8TB, 7.4TB and 15TB sizes, yielding 46TB, 92TB, 184TB, and 360TB raw node capacities respectively, allowing the F900 to deliver up to 93PB of raw NVMe all-flash capacity per cluster. ​

A recent Forrester Total Economic Indicator (TEI) study showed that the F900 can deliver an ROI of up to 374% and a payback period of less than 6 months. ​Plus it can be consumed either as an appliance or as an APEX Data Storage Service.

The F900 can scale from 3 to 252 nodes per cluster, and inline data reduction is enabled by default to further extend the effective capacity and efficiency of this platform.

With the latest OneFS 9.2, we have also powered up the F600 and F200, launched last year. There’s higher performance with up to 70% increase in sequential reads for F600 and up to 25% for sequential reads for the F200. Plus customers also get more flexibility through new drive options, and the ability to non-disruptively add these nodes to existing Isilon clusters. Finally, customers get data-at-rest encryption through self-encrypting drives (SED) on F200.

OneFS 9.2 also introduces Remote Direct Memory Access support for applications and clients with NFS over RDMA, and allows substantially higher throughput performance, especially for single connection and read intensive workloads such as M&E edit and playback and machine learning – while also reducing both cluster and client CPU utilization. It also provides a foundation for future OneFS interoperability with NVIDIA’s GPUDirect.

Specifically, OneFS 9.2 supports NFSv3 over RDMA by leveraging the ROCEv2 network protocol (also known as Routable RoCE or RRoCE). New OneFS CLI and WebUI configuration options have been added, including global enablement, and IP pool configuration, filtering and verification of RoCEv2 capable network interfaces. Be aware that neither ROCEv1 nor NFSv4 over RDMA are supported in the OneFS 9.2 release. And IPv6 is also unsupported when using NFSv3 over RDMA

NFS over RDMA is available on all PowerScale which contain Mellanox ConnectX network adapters on the front end with either 25, 40, or 100 Gig Ethernet connectivity. The ‘isi network interfaces list’ CLI command can be used to easily identify which of a cluster’s NICs support RDMA.

The new 9.2 release introduces External Key Management support for encrypted clusters, through the key management interoperability protocol, or KMIP, which enables offloading of the Master Key from a node to an External Key Manager, such as SKLM, SafeNet or Vormetric. This allows centralized key management for multiple SED clusters, and includes an option to migrate existing keys from a cluster’s internal key store.

This feature provides enhanced security through the separation of the key manager from the cluster, enabling the secure transport of nodes, and helping organizations to meet regulatory compliance and corporate data at rest security requirements

Configuration is via either the WebUI or CLI, and, in order to test the External Key Manager feature, a PowerScale cluster with self-encrypting drives will be required:

In addition to external key management for SEDs, OneFS 9.2 introduces several other Security & Compliance features, including Administrator-only Log Access, where Security and Federal requirements mandate limiting access to configuration and log information to administrators only for /ifsvar, /var/log, /boot, and a variety of /etc config files and subdirectories.

Also, in OneFS 9.2, the HTTP Basic Authentication scheme will be disabled by default, on new installs requiring session-based authentication. This only impacts the API and RAN endpoints of the web server, including /platform, /object, and /namespace on TCP port 8080. The regular HTTP protocol access on TCP 80 and 443 remains unchanged.

9.2 also introduces a new roles-based administration privilege, ISI_PRIV_RESTRICTED_AUTH, intended for help-desk admins that don’t require full ISI_PRIV_AUTH privileges. This means that an admin with ISI_PRIV_RESTRICTED_AUTH can only modify users and groups with the same or fewer privileges.

While IPv6 has been available in OneFS for several releases now, 9.2 introduces support to meet the stringent USGv6 security requirements for United States Government deployments. In particular, the USGv6 feature implements both Router Advertisements to update the IPv6 default gateway, and Duplicate Address Detection to detect conflicting IP addresses. SmartConnect DNS is also enhanced to detect DAD for the SmartConnect Service IP, allowing it to log and remove an SSIP if a duplicate is detected.

There are also several serviceability-related enhancements in this new release. As part of OneFS’ always-on initiative, 9.2 introduces Drain Based Upgrades, where nodes are prevented from rebooting or restarting protocol services until all SMB clients have disconnected from the node. Since a single SMB client that does not disconnect could cause the upgrade to be delayed indefinitely, options are available to reboot the node, despite persisting clients.

OneFS 9.2 sees a redesign of the CELOG WebUI for improved usability. This makes it simple to filter events chronologically, categorize by their status, filter by the severity, easily search the event history, resolve, suppress or ignore bulk events, and more easily manage scheduled maintenance windows.

9.2 also introduces the ability to export a cluster’s configuration, which can then be used to perform a config restore to either the original or a different cluster. This can be performed either from the CLI or platform API, and includes the configuration for the core protocols (NFS, SMB, S3 and HDFS) plus Snapshots, Quotas, and NDMP backup,

Another feature of OneFS 9.2 is S3 ETag Consistency. Unlike AWS, if the MD5 checksum is not specified in an S3 client request, OneFS generates a unique string for that file as an ETag in response, which can cause issues with some applications. Therefore, 9.2 now allows admins to specify if the MD5 should be calculated and verified.

In 9.2, Energy Star efficiency data is now retrieved through the IPMI interface, and reported via the CLI, allowing cluster admins and compliance engineers to query a cluster’s inlet temperatures and power consumption.

With OneFS 9.2, In-line data reduction is extended to include the new F900 platform. OneFS in-line data reduction substantially increases a cluster’s storage density, and helps eliminate management burden, while seamlessly boosting efficiency and lowering the TCO. The in-line data reduction write pipeline comprises three main phases:

  • Zero block removal
  • In-line dedupe
  • In-line compression

And, like everything OneFS, it scales linearly across a cluster, as additional nodes are added.

We’ll be looking more closely at these new features and functionality over the course of the next few blog articles.

OneFS SnapRevert Job

There have been a couple of recent inquiries from the field about the SnapRevert job.

For context, SnapRevert is one of three main methods for restoring data from a OneFS snapshot. The options are:

Method Description
Copy Copying specific files and directories directly from the snapshot
Clone Cloning a file from the snapshot
Revert Reverting the entire snapshot via the SnapRevert job

Copying a file from a snapshot duplicates that file, which roughly doubles the amount of storage space it consumes. Even if the original file is deleted from HEAD, the copy of that file will remain in the snapshot. Cloning a file from a snapshot also duplicates that file. Unlike a copy, however, a clone does not consume any additional space on the cluster – unless either the original file or clone is modified.

However, the most efficient of these approaches is the SnapRevert job, which automates the restoration of an entire snapshot to its top level directory. This allows for quickly reverting to a previous, known-good recovery point – for example in the event of virus outbreak. The SnapRevert job can be run from the Job Engine WebUI, and requires adding the desired snapshot ID.

There are two main components to SnapRevert:

  • The file system domain that the objects are put into.
  • The job that reverts everything back to what’s in a snapshot.

So what exactly is a SnapRevert domain? At a high level, a domain defines a set of behaviors for a collection of files under a specified directory tree. The SnapRevert domain is described as a ‘restricted writer’ domain, in OneFS parlance. Essentially, this is a piece of extra filesystem metadata and associated locking that prevents a domain’s files being written to while restoring a last known good snapshot.

Because the SnapRevert domain is essentially just a metadata attribute placed onto a file/directory, a best practice is to create the domain before there is data. This avoids having to wait for DomainMark (the aptly named job that marks a domain’s files) to walk the entire tree, setting that attribute on every file and directory within it.

The SnapRevert job itself actually uses a local SyncIQ policy to copy data out of the snapshot, discarding any changes to the original directory.  When the SnapRevert job completes, the original data is left in the directory tree.  In other words, after the job completes, the file system (HEAD) is exactly as it was at the point in time that the snapshot was taken.  The LINs for the files/directories don’t change, because what’s there is not a copy.

SnapRevert can be manually run from the OneFS WebUI by navigating to Cluster Management > Job Operations > Job Types > SnapRevert and clicking the ‘Start Job’ button.

Additionally, the job’s impact policy and relative priority can also be adjusted, if desired:

Before a snapshot is reverted, SnapshotIQ creates a point-in-time copy of the data that is being replaced. This enables the snapshot revert to be undone later, if necessary.

Additionally, individual files, rather than entire snapshots, can also be restored in place using the isi_file_revert command line utility.

# isi_file_revert
usage:
isi_file_revert -l lin -s snapid
isi_file_revert -p path -s snapid
-d (debug output)
-f (force, no confirmation)

This can help drastically simplify virtual machine management and recovery, for example.

Before creating snapshots, it’s worth considering that reverting a snapshot requires that a SnapRevert domain exist for the directory that is being restored. As such, it is recommended that you create SnapRevert domains for those directories while the directories are empty. Creating a domain for an empty (or sparsely populated) directory takes considerably less time.

Files may belong to multiple domains. Each file stores a set of domain IDs indicating which domain they belong to in their inode’s extended attributes table. Files inherit this set of domain IDs from their parent directories when they are created or moved. The domain IDs refer to domain settings themselves, which are stored in a separate system B-tree. These B-tree entries describe the type of the domain (flags), and various other attributes.

As mentioned, a Restricted-Write domain prevents writes to any files except by threads that are granted permission to do so. A SnapRevert domain that does not currently enforce Restricted-Write shows up as “(Writable)” in the CLI domain listing.

Occasionally, a domain will be marked as “(Incomplete)”. This means that the domain will not enforce its specified behavior. Domains created by job engine are incomplete if not all of the files that are part of the domain are marked as being members of that domain. Since each file contains a list of domains of which it is a member, that list must be kept up to date for each file. The domain is incomplete until each file’s domain list is correct.

In addition to SnapRevert, OneFS also currently uses domains for SyncIQ replication and SnapLock immutable archiving.

A SnapRevert domain needs to be created on a directory before it can be reverted to a particular point in time snapshot. As mentioned before, the recommendation is to create SnapRevert domains for a directory while the directory is empty.

The root path of the SnapRevert domain must be the same root path of the snapshot. For example, a domain with a root path of /ifs/data/marketing cannot be used to revert a snapshot with a root path of /ifs/data/marketing/archive.

For example, for snaphsot DailyBackup_04-27-2021_12:00 which is rooted at /ifs/data/marketing/archive:

  1. First, set the SnapRevert domain by running the DomainMark job (which marks all the files):
# isi job jobs start domainmark --root /ifs/data/marketing --dm-type SnapRevert
  1. Verify that the domain has been created:
# isi_classic domain list –l

In order to restore a directory back to the state it was in at the point in time when a snapshot was taken, you need to:

  • Create a SnapRevert domain for the directory.
  • Create a snapshot of a directory.

To accomplish this:

  1. First, identify the ID of the snapshot you want to revert by running the isi snapshot snapshots view command and picking your PIT (point in time).

For example:

# isi snapshot snapshots view DailyBackup_04-27-2021_12:00

ID: 38

Name: DailyBackup_04-27-2021_12:00

Path: /ifs/data/marketing

Has Locks: No

Schedule: daily

Alias: -

Created: 2021-04-27T12:00:05

Expires: 2021-08-26T12:00:00

Size: 0b

Shadow Bytes: 0b

% Reserve: 0.00%

% Filesystem: 0.00%

State: active
  1. Revert to a snapshot by running the isi job jobs start command. The following command reverts to snapshot ID 38 named DailyBackup_04-27-2021_12:00:
# isi job jobs start snaprevert --snapid 38

This can also be done from the WebUI, by navigating to Cluster Management > Job Operations > Job Types > SnapRevert and clicking the ‘Start Job’ button.

OneFS automatically creates a snapshot right before the SnapRevert process reverts the specified directory tree. The naming convention for these snapshots is of the form: <snapshot_name>.pre_revert.*

# isi snap snap list | grep pre_revert
39 DailyBackup_04-27-2021_12:00.pre_revert.1655328160 /ifs/data/marketing

This allows for an easy roll-back of a SnapRevert if the desired results are not achieved.

Note that, if a domain is currently preventing the modification or deletion of a file, a protection domain cannot be created on a
directory that contains that file. For example, if files under /ifs/data/smartlock are set to a WORM state by a
SmartLock domain, OneFS will not allow a SnapRevert domain to be created on /ifs/data/.

If desired or required, SnapRevert domains can also be deleted using the job engine CLI. For example, to delete the SnapRevert domain at /ifs/data/marketing:

# isi job jobs start domainmark --root /ifs/data/marketing --dm-type SnapRevert --delete

How To Configure NFS over RDMA

Starting from OneFS 9.2.0.0, NFSv3 over RDMA is introduced for better performance. Please refer to Chapter 6 of OneFS NFS white paper for the technical details. This article provides guidance on using the NFSv3 over RDMA feature with your OneFS clusters. Note that the OneFS NFSv3 over RDMA functionality requires that any clients are ROCEv2 capable. As such, client-side configuration is also needed.

OneFS Cluster configuration

To use NFSv3 over RDMA, your OneFS cluster hardware must meet requirements:

  • Node type: All Gen6 (F800/F810/H600/H500/H400/A200/A2000), F200, F600, F900
  • Front end network: Mellanox ConnectX-3 Pro, ConnectX-4 and ConnectX-5 network adapters that deliver 25/40/100 GigE speed.

1. Check your cluster network interfaces have ROCEv2 capability by running the following command and noting the interfaces that report ‘SUPPORTS_RDMA_RRoCE’. This check is only available on the CLI.

# isi network interfaces list -v

2. Create an IP pool that contains ROCEv2 capable network interface.

(CLI)

# isi network pools create --id=groupnet0.40g.40gpool1 --ifaces=1:40gige- 1,1:40gige-2,2:40gige-1,2:40gige-2,3:40gige-1,3:40gige-2,4:40gige-1,4:40gige-2 --ranges=172.16.200.129-172.16.200.136 --access-zone=System --nfsv3-rroce-only=true

(WebUI) Cluster management –> Network configuration

3. Enable NFSv3 over RDMA feature by running the following command.

(CLI)

# isi nfs settings global modify --nfsv3-enabled=true --nfsv3-rdma-enabled=true

(WebUI) Protocols –> UNIX sharing(NFS) –> Global settings

4. Enable OneFS cluster NFS service by running the following command.

(CLI)

# isi services nfs enable

(WebUI) See step 3

5. Create NFS export by running the following command. The –map-root-enabled=false is used to disable NFS export root-squash for testing purpose, which allows root user to access OneFS cluster data via NFS.

(CLI)

# isi nfs exports create --paths=/ifs/export_rdma --map-root-enabled=false

(WebUI) Protocols –> UNIX sharing (NFS) –> NFS exports

NFSv3 over RDMA client configuration

Note: As the client OS and Mellanox NICs may vary in your environment, you need to look for your client OS documentation and Mellanox documentation for the accurate and detailed configuration steps. This section only demonstrates an example configuration using our in-house lab equipment.

To use NFSv3 over RDMA service of OneFS cluster, your NFSv3 client hardware must meet requirements:

  • RoCEv2 capable NICs: Mellanox ConnectX-3 Pro, ConnectX-4, ConnectX-5, and ConnectX-6
  • NFS over RDMA Drivers: Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED) or OS Distributed inbox driver. It is recommended to install Mellanox OFED driver to gain the best performance.

If you just want to have a functional test on the NFSv3 over RDMA feature, you can set up Soft-RoCE for your client.

Set up a RDMA capable client on physical machine

In the following steps, we are using the Dell PowerEdge R630 physical server with CentOS 7.9 and Mellanox ConnectX-3 Pro installed.

  1. Check OS version by running the following command:
# cat /etc/redhat-release

CentOS Linux release 7.9.2009 (Core)

 

2. Check the network adapter model and information. From the output, we can find the ConnectX-3 Pro is installed, and the network interfaces are named 40gig1 and 40gig2.

# lspci | egrep -i --color 'network|ethernet'

01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

03:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

05:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

05:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

# lshw -class network -short

H/W path

==========================================================

/0/102/2/0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 40gig1&nbsp;&nbsp;&nbsp;&nbsp; network&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MT27520 Family [ConnectX-3 Pro]

/0/102/3/0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; network&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 82599ES 10-Gigabit SFI/SFP+ Network Connection

/0/102/3/0.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; network&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 82599ES 10-Gigabit SFI/SFP+ Network Connection

/0/102/1c.4/0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1gig1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; network&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I350 Gigabit Network Connection

/0/102/1c.4/0.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1gig2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; network&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I350 Gigabit Network Connection

/3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 40gig2&nbsp;&nbsp;&nbsp;&nbsp; network&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ethernet interface

3. Find the suitable Mellanox OFED driver version from Mellanox website. As of MLNX_OFED v5.1, ConnectX-3 Pro are no longer supported and can be utilized through MLNX_OFED LTS version. See Figure 3. If you are using ConnectX-4 and above, you can use the latest Mellanox OFED version.

  • MLNX_OFED LTS Download

An important note: the NFSoRDMA module was removed from the Mellanox OFED 4.0-2.0.0.1 version, then it was added again in Mellanox OFED 4.7-3.2.9.0 version. Please refer to Release Notes Change Log History for the details.

4. Download the MLNX_OFED 4.9-2.2.4.0 driver for ConnectX-3 Pro to your client.

5. Extract the driver package, find the “mlnxofedinstall” script to install the driver. As of MLNX_OFED v4.7, NFSoRDMA driver is no longer installed by default. In order to install it over a supported kernel, add the “–with-nfsrdma” installation option to the “mlnxofedinstall” script. Firmware update is skipped in this example, please update it as needed.

#  ./mlnxofedinstall --with-nfsrdma --without-fw-update

Logs dir: /tmp/MLNX_OFED_LINUX.19761.logs

General log file: /tmp/MLNX_OFED_LINUX.19761.logs/general.log

Verifying KMP rpms compatibility with target kernel...

This program will install the MLNX_OFED_LINUX package on your machine.

Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.

Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.

Do you want to continue?[y/N]:y

Uninstalling the previous version of MLNX_OFED_LINUX

rpm --nosignature -e --allmatches --nodeps mft

Starting MLNX_OFED_LINUX-4.9-2.2.4.0 installation ...

Installing mlnx-ofa_kernel RPM

Preparing...                          ########################################

Updating / installing...

mlnx-ofa_kernel-4.9-OFED.4.9.2.2.4.1.r########################################

Installing kmod-mlnx-ofa_kernel 4.9 RPM
...

Preparing...                          ########################################

mpitests_openmpi-3.2.20-e1a0676.49224 ########################################

Device (03:00.0):

03:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

Link Width: x8

PCI Link Speed: 8GT/s

Installation finished successfully.

Preparing...                          ################################# [100%]

Updating / installing...

1:mlnx-fw-updater-4.9-2.2.4.0      ################################# [100%]

Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf

Skipping FW update.

To load the new driver, run:

# /etc/init.d/openibd restart

6. Load the new driver by running the following command. Unload all module that is in use prompted by the command.

# /etc/init.d/openibd restart

Unloading HCA driver:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [&nbsp; OK&nbsp; ]

Loading HCA driver and Access Layer:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [&nbsp; OK&nbsp; ]<br>

7. Check the driver version to ensure the installation is successful.

# ethtool -i 40gig1

driver: mlx4_en

version: 4.9-2.2.4

firmware-version: 2.36.5080

expansion-rom-version:

bus-info: 0000:03:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: no

supports-register-dump: no

supports-priv-flags: yes

8. Check the NFSoRDMA module is also installed. If you are using a driver downloaded from server vendor website (like Dell PowerEdge server) rather than Mellanox website, the NFSoRDMA module may not be included in the driver package. You must obtain the NFSoRDMA module from Mellanox driver package and install it.

# yum list installed | grep nfsrdma

kmod-mlnx-nfsrdma.x86_64&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.0-OFED.5.0.2.1.8.1.g5f67178.rhel7u8

9. Mount NFS export with RDMA protocol.

#&nbsp; mount -t nfs -vo nfsvers=3,proto=rdma,port=20049 172.16.200.29:/ifs/export_rdma /mnt/export_rdma

mount.nfs: timeout set for Tue Feb 16 21:47:16 2021

mount.nfs: trying text-based options 'nfsvers=3,proto=rdma,port=20049,addr=172.16.200.29'

Useful reference for Mellanox OFED documentation:

Set up Soft-RoCE client for functional test only

Soft-RoCE (also known as RXE) is a software implementation of RoCE that allows RoCE to run on any Ethernet network adapter whether it offers hardware acceleration or not. Soft-RoCE is released as part of upstream kernel 4.8 (or above). It is intended for users who wish to test RDMA on software over any 3rd party adapters.

In the following example configuration, we are using CentOS 7.9 virtual machine to configure Soft-RoCE. Since Red Hat Enterprise Linux 7.4, the Soft-RoCE driver is already merged into the kernel.

1. Install required software packages.

# yum install -y nfs-utils rdma-core libibverbs-utils

2. Start Soft-RoCE.

# rxe_cfg start

3. Get status, which will display ethernet interfaces

# rxe_cfg status

rdma_rxe module not loaded

Name   Link  Driver  Speed  NMTU  IPv4_addr        RDEV  RMTU

ens33  yes   e1000          1500  192.168.198.129

4. Verify RXE kernel module is loaded by running the following command, ensure that you see rdma_rxe in the list of modules.

# lsmod | grep rdma_rxe

rdma_rxe              114188  0

ip6_udp_tunnel         12755  1 rdma_rxe

udp_tunnel             14423  1 rdma_rxe

ib_core               255603  13 rdma_cm,ib_cm,iw_cm,rpcrdma,ib_srp,ib_iser,ib_srpt,ib_umad,ib_uverbs,rdma_rxe,rdma_ucm,ib_ipoib,ib_isert

5. Create a new RXE device/interface by using rxe_cfg add <interface from rxe_cfg status>.

# rxe_cfg add ens33

6. Check status again, make sure the rxe0 was added under RDEV (rxe device)

# rxe_cfg status

Name   Link  Driver  Speed  NMTU  IPv4_addr        RDEV  RMTU

ens33  yes   e1000          1500  192.168.198.129  rxe0  1024  (3)

7. Mount NFS export with RDMA protocol.

# mount -t nfs -o nfsvers=3,proto=rdma,port=20049 172.16.200.29:/ifs/export_rdma /mnt/export_rdma

You can refer to Red Hat Enterprise Linux configuring Soft-RoCE for more details.