OneFS and QLC Drive Support

Another significant feature of the recent OneFS 9.4 release is support for quad-level cell (QLC) flash media. Specifically, the PowerScale F900 and F600 all-flash NVMe platforms are now available with 15.4TB and 30.7TB QLC NVMe drives.

These new QLC drives deliver a compelling combination of capacity, performance, reliability and affordability – and will be particularly beneficial for workloads such as artificial intelligence, machine and deep learning, and media and entertainment environments.

The details of the new QLC drive options for the F600 and F900 platforms are as follows:

PowerScale Node | Chassis specs (per node) | Raw capacity (per node) | Max raw capacity (252-node cluster)

F900 | 2U with 24 NVMe SSD drives | 737.28TB with 30.72TB QLC, or 368.64TB with 15.36TB QLC | 185.79PB with 30.72TB QLC, or 92.83PB with 15.36TB QLC

F600 | 1U with 8 NVMe SSD drives | 245.76TB with 30.72TB QLC, or 122.88TB with 15.36TB QLC | 61.93PB with 30.72TB QLC, or 30.96PB with 15.36TB QLC

This means an F900 cluster with the 30.7TB QLC drives can now scale up to a whopping 185.79PB in size, with a nice linear performance ramp!
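
As a quick sanity check of those numbers, the raw capacity math can be reproduced with a one-liner (decimal units assumed, i.e. 1PB = 1000TB):

# awk 'BEGIN { printf "%.2f TB per node, %.2f PB per 252-node cluster\n", 24*30.72, 24*30.72*252/1000 }'

737.28 TB per node, 185.79 PB per 252-node cluster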

So the new QLC drives double the all-flash capacity footprint, as compared to previous generations – while delivering robust environmental efficiencies in consolidated rack space, power and cooling. What’s more, PowerScale F600 and F900 nodes containing QLC drives can deliver the same level of performance as TLC drives, thereby delivering vastly superior economics and value. As illustrated below, QLC nodes performed at parity or slightly better than TLC nodes for throughput benchmarks and SPEC workloads.

The above graphs show the comparative peak random throughput per-node for both QLC and TLC.

QLC-based F600 and F900 nodes can be rapidly and non-disruptively integrated into existing PowerScale clusters, allowing seamless data lake expansion and accommodation of new workloads.

Compatibility-wise, there are a couple of key points to be aware of. If attempting to add a QLC drive to a non-QLC node, or vice versa, the unsupported drive will be blocked with the ‘WRONG_TYPE’ error. However, QLC and non-QLC nodes will happily coexist in different pools within the same cluster. But attempting to merge storage node pools with differing media classes will output the error ‘All nodes in the nodepool must have compatible [HDD|SSD] drive technology’.

From the WebUI, the ‘drive details’ pop-up window displays ‘NVME, SSD, QLC’ as the ‘Connection and media type’.  This can be viewed by navigating to Hardware configuration > Drives and selecting ‘View details’ for the desired drive:

The WebUI SmartPools summary, available by browsing to Storage pools > SmartPools, also incorporates ‘QLC’ into the pool name:

Similarly, in the ‘node pool details’:

From the OneFS CLI, existing commands that display DSP (drive support package), PSI (platform support infrastructure), and storage and node pool information now include a new ‘media_class’ string, ‘QLC’. For example:

# isi storagepool nodepools ls

ID   Name                Nodes  Node Type IDs  Protection Policy  Manual

-------------------------------------------------------------------------

1    f600_15tb-ssd-qlc_736gb 1      1              +2d:1n             No

                         2

                         3

-------------------------------------------------------------------------

Total: 1

 

# isi storagepool nodetypes ls

ID   Product Name                               Nodes  Manual

--------------------------------------------------------------

1    F600-1U-Dual-736GB-2x25GE SFP+-15TB SSD QLC 1      No

                                                 2

                                                 3

--------------------------------------------------------------

Total: 1

OneFS 9.4 also introduces new model and vendor class fields, providing a dynamic and extensible path to determine what drive statistics and information to gather, how to capture them, and how to display them – in preparation for future drive technologies. For example:

# isi_radish -a

Bay 0/nvd15 is Dell Ent NVMe P5316 RI U.2 30.72TB FW:0.0.8 SN:BTAC1436043630PGGN, 60001615872 blks

Log Sense data (Bay 0/nvd15 ) –

Supported log pages 0x1 0x2 0x3 0x4 0x5 0x6 0x80 0x81

SMART/Health Information Log

============================

Critical Warning State: 0x00

Available spare: 0

Temperature: 0

Device reliability: 0

Read only: 0

Volatile memory backup: 0

Temperature: 297 K, 23.85 C, 74.93 F

Available spare: 100

Available spare threshold: 10

Percentage used: 0

Data units (512,000 byte) read: 1619199

Data units written: 10075777

Host read commands: 67060074

Host write commands: 4461942671

Controller busy time (minutes): 1

Power cycles: 21

Power on hours: 420

Unsafe shutdowns: 18

Media errors: 0

No. error info log entries: 0

Warning Temp Composite Time: 0

Error Temp Composite Time: 0

Temperature 1 Transition Count: 0

Temperature 2 Transition Count: 0

Total Time For Temperature 1: 0

Total Time For Temperature 2: 0

Finally, PowerScale F600 and F900 nodes must be running OneFS 9.4 and the latest DSP in order to support QLC drives. In the event of a QLC drive failure, it must be replaced with another QLC drive. Additionally, any attempts to downgrade a QLC node to a version prior to OneFS 9.4 will be blocked.
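
A quick way to confirm that a node meets these prerequisites is to check the OneFS release and installed drives from the CLI. For example (these are standard OneFS commands, and on recent releases the installed drive support package should show up in the cluster’s patch list; exact output varies by release):

# isi version

# isi devices drive list

# isi upgrade patches list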

OneFS SmartSync Management and Diagnostics

In this final article of the series, we’ll look at SmartSync’s diagnostic tools and performance, and review some of its idiosyncrasies and coexistence with other OneFS features.

But first, performance. Unlike SyncIQ, which operates solely on a push model, SmartSync allows pull replication, too. This can be an incredibly useful performance option for environments that grow organically. As demand for replication on a source cluster increases, the additional compute and network load needs to be considered. Push replication, especially with multiple targets, can generate a significant load on the source cluster, as shown in CPU graphs in the following graphic:

In extreme cases, replication traffic resource utilization can potentially impact client workloads as data is pushed to the target. On the other hand, enabling a pull replication model for a dataset can drastically reduce the impact on the source cluster’s CPU utilization by offloading replication overhead to the target cluster. This can be seen in the following graphs:

For single-source dataset environments with numerous targets, pull replication can free up the source cluster’s compute and network resources, which can then be used more beneficially for client IO. However, if the target cluster is a capacity-optimized archive cluster without CPU and/or network resources to spare, a pull policy model, rather than the traditional push, may not be an option. In such cases, SmartSync also allows its policies to be limited, or throttled, in order to reduce system and/or network resource impacts from replication. SmartSync throttling comes in two flavors: bandwidth throttling and CPU throttling.

  1. Bandwidth throttling is specified through a set of netmask rules plus a maximum throughput limit, and is configured via the ‘isi dm throttling’ CLI syntax:
# isi dm throttling bw-rules create NETMASK --netmask [subnet] --bw-limit=[bytes]

Bandwidth limits are specified in bytes for a specific subnet and netmask. For example:

# isi dm throttling bw-rules create NETMASK --netmask 10.20.100.0/24 --bw-limit=$((20*1024*1024))

In this case, the bandwidth limit of 20MB (20*1024*1024 bytes) is applied to the 10.20.100.0 class C subnet. The throttling configuration change can be verified as follows:

# isi dm throttling bw-rules list

ID Rule Type Netmask Bw Limit

------------------------------------------

0 NETMASK 10.20.100.0/24 20.00MB

------------------------------------------

Total: 1

 

  2. Compute-wise, SmartSync policies by default can consume up to 30% of a node’s CPU cycles if its total CPU usage is less than 90%. If the node’s total CPU utilization reaches 90%, or if the SmartSync consumption reaches 30% of the total CPU, SmartSync automatically throttles its CPU consumption.

Additional CPU throttling is specified through ‘allowed CPU percentage’ and ‘backoff CPU percentage’ limits, which are also configured via the ‘isi dm throttling’ CLI command syntax.

The ‘Allowed CPU Threshold’ parameter sets the CPU cycles that SmartSync is always allowed to use, regardless of the overall node CPU usage. If the system’s CPU usage crosses the ‘System CPU Load Threshold’ and SmartSync uses more than the ‘Allowed CPU Threshold’, it will then throttle CPU utilization to remain at or below the ‘Allowed CPU Threshold’.

For example, to set the allowed CPU threshold to 20% and the system CPU threshold to 80%:

# isi dm throttling settings view

    Allowed CPU Threshold: 30

System CPU Load Threshold: 90

# isi dm throttling settings modify --allowed-cpu-threshold 20 --system-cpu-load-threshold 80

# isi dm throttling settings view

    Allowed CPU Threshold: 20

System CPU Load Threshold: 80
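
To revert to the default limits described above, the same modify command can simply be rerun with the original values:

# isi dm throttling settings modify --allowed-cpu-threshold 30 --system-cpu-load-threshold 90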

Additionally, SmartSync performance is also aided by a scalable run-time engine which spans the cluster, spins up threads (fibers) on demand, and uses asynchronous IO to process replication tasks (chunks). Batch operations are used for efficient small file, attribute, and data block transfer. Namespace contention avoidance, efficient snapshot utilization, and separation of dataset creation from transfer are salient design features of both the baseline and incremental sync algorithms. Plus, the availability of a pull transfer model can significantly reduce the impact on a source cluster, if needed.

On the caveats and considerations front, SmartSync v1 in OneFS 9.4 does have some notable limitations to be cognizant of. Failover and failback of a SmartSync policy is not currently supported, nor is an option to allow writes on the target cluster. However, with copy policies, the dataset is available for read and write on the target once replication is complete, provided the ‘--copy-create-dataset-on-target=false’ option is specified. These limitations will be lifted in a future OneFS release. But for now, if required, repeat-copy data on the target platform may be copied out of the SmartSync data mover snapshot.

Other interoperability considerations include:

Component Interoperability
ADS/resource forked files With CloudCopy, only the main file is stored; alternate data streams/resource forks are skipped when encountered.
Cloud copy-back Not supported unless data was created by a OneFS Datamover.
Cloud incrementals Unsupported for file->object transfers. One-time copy to/from cloud only.
CloudPools CloudPools Smartlink stub files are not supported.
Compression Compression for replication transfer is not supported.
Failover/failback policy Failover and failback option is not available, nor is an option to allow writes on the target cluster.
File metadata With CloudCopy, only POSIX UID, GID, atime, mtime, and ctime are copied.
File name encoding With CloudCopy, all encodings are converted to UTF-8.
Hadoop TDE SmartSync does not support the replication of the TDE domain and keys, rendering TDE encrypted data on the target cluster inaccessible.
Hard links With CloudCopy, hard links are not preserved, and a file/object is created for each link.
Inline data reduction Inline compressed and/or deduped data is rehydrated, decompressed, and transferred uncompressed to the target cluster.
Large files (4TB to 16TB) Supported up to the cloud provider’s maximum object size. SmartSync policies only connect with target clusters that also have large file support enabled.
RBAC SmartSync administrative access is assigned through the ISI_PRIV_DATAMOVER privilege
SFSE SFSE containerized small files are unpacked on the source cluster before replication.
SmartDedupe Deduplicated files are rehydrated back to their original size prior to replication.
SmartLock Compliance mode clusters are not supported with SmartSync.
SnapshotIQ Tightly integrated; uses snapshots for incrementals and re-baselining.
Sparse files With CloudCopy, sparse regions of files are written out as zeros.
Special files With CloudCopy, special files are skipped when encountered.
Symbolic links With CloudCopy, symlinks are skipped when encountered.
SyncIQ SmartSync and SyncIQ replication both happily coexist. An active SyncIQ license is required for both.

When it comes to monitoring and troubleshooting SmartSync, there are a variety of diagnostic tools available. These include:

Component | Tools | Issue
Logging | /var/log/isi_dm.log; /var/log/messages; /ifs/data/Isilon_Support/datamover/transfer_failures/baseline_failures_<jobid> | General SmartSync info and triage.
Accounts | isi dm accounts list/view | Authentication, trust and encryption.
CloudCopy | S3 Browser (e.g. CloudBerry), Microsoft Azure Storage Explorer | Cloud access and connectivity.
Dataset | isi dm dataset list/view | Dataset creation and health.
File system | isi get | Inspect replicated files and objects.
Jobs | isi dm jobs list/view; isi_datamover_job_status -jt | Job and task execution, auto-pausing, completion, control, and transfer.
Network | isi dm throttling bw-rules list/view; isi_dm network ping/discover | Network connectivity and throughput.
Policies | isi dm policies list/view; isi dm base-policies list/view | Copy and dataset policy execution and transfer.
Service | isi services -a isi_dm_d <enable/disable> | Daemon configuration and control.
Snapshots | isi snapshot snapshots list/view | Snapshot execution and access.
System | isi dm throttling settings | CPU load and system performance.

SmartSync info and errors are typically written to /var/log/isi_dm.log and /var/log/messages, while DM job transfer failures generate a log specific to the job ID under /ifs/data/Isilon_Support/datamover/transfer_failures.

Once a policy is running, the job status is reported via ‘isi dm jobs list’. Once complete, job histories are available by running ‘isi dm historical-jobs list’. More details for a specific job can be gleaned from the ‘isi dm jobs view’ command, using the pertinent job ID from the list output above. Additionally, the ‘isi_datamover_job_status’ command with the job ID as an argument will also supply detailed information about a specific job.

Once running, a DM job can be further controlled via the ‘isi dm jobs modify’ command, and available actions include cancel, partial-completion, pause, or resume.
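
As a rough sketch of that workflow (the job ID below is a placeholder, and the ‘--job-control-request’ flag name is an assumption based on the ‘Job Control Request’ column in the jobs list output, so verify the exact option with ‘isi dm jobs modify --help’ on the cluster):

# isi dm jobs view <job-id>

(pause the job; flag name assumed, check 'isi dm jobs modify --help')

# isi dm jobs modify <job-id> --job-control-request=PAUSE

(resume the job once the underlying issue has been resolved)

# isi dm jobs modify <job-id> --job-control-request=RESUME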

If a certificate authority (CA) is not correctly configured on a PowerScale cluster, the SmartSync daemon will not start, even though accounts and policies can still be configured. Be aware that the failed policies will not be reported via ‘isi dm jobs list’ or ‘isi dm historical-jobs list’ since they never started. Instead, an improperly configured CA is reported in the /var/log/isi_dm.log as follows:

Certificates not correctly installed, Data Mover service sleeping: At least one CA must be installed: No such file or directory from dm_load_certs_from_store (/b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/rpc/dm_tls.cpp:197 ) from dm_tls_init (/b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/rpc/dm_tls.cpp:279 ): Unable to load certificate information

Once a CA and identity are correctly configured, the SmartSync service automatically activates. Next, SmartSync attempts a handshake with the target cluster. If the CA or identity is mis-configured, the handshake process fails, and generates an entry in /var/log/isi_dm.log. For example:

2022-06-30T12:38:17.864181+00:00 GEN-HOP-NOCL-RR-1(id1) isi_dm_d[52758]: [0x828c0a110]: /b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/acct_mon.cpp:dm_acctmon_try_ping:348: [Fiber 3778] ping for account guid: 0000000000000000c4000000000000000000000000000000, result: dead

Note that the full handshake error detail is logged if the SmartSync service (isi_dm_d) is set to log at the ‘info’ or ‘debug’ level using isi_ilog:

# isi_ilog -a isi_dm_d --level info+

Valid ilog levels include:

fatal error err notice info debug trace

error+ err+ notice+ info+ debug+ trace+

A copy or repeat-copy policy requires an available dataset for replication before running. If a dataset has not been successfully created prior to the copy or repeat-copy policy job starting for the same base path, the job is paused. In the following example, the base path of the copy policy is not the same as that of the dataset policy, hence the job fails with a “path doesn’t match…” error.

# ls -l /ifs/data/Isilon_Support/datamover/transfer_failures

total 9

-rw-rw----   1 root  wheel  679  June 29 10:56 baseline_failure_10

# cat /ifs/data/Isilon_Support/datamover/transfer_failures/baseline_failure_10

Task_id=0x00000000000000ce, task_type=root task ds base copy, task_state=failed-fatal path doesn’t match dataset base path: ‘/ifs/test’ != ‘/ifs/data/repeat-copy’:

from bc_task_initialize_dsh (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/ds_base_copy

from dmt_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/ds_base_copy_root_task

from dm_txn_execute_internal (/b/mnt/src/isilon/lib/isi_dm/isi_dm_base/src/txn.cp

from dm_txn_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm_base/src/txn.cpp:2274)

from dmp_task_spark_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/task_runner.

Once any errors for a policy have been resolved, the ‘isi dm jobs modify’ command can be used to resume the job.

OneFS SmartSync Configuration

In the first blog of this series, we looked at OneFS SmartSync’s architecture and attributes. Next, we’ll delve into the configuration side of things, and walk through a basic setup.

Since there’s no SmartSync WebUI yet in OneFS 9.4, the bulk of the SmartSync configuration is performed via the ‘isi dm’ CLI tool, which contains the following principal subcommands:

Subcommand Description
isi dm accounts Manage Datamover accounts. An active SyncIQ license is required to create Datamover accounts.
isi dm base-policies Manage Datamover base policies. Base policies are templates that provide common values to groups of related concrete Datamover policies. For example, a base policy can override the run schedule of a concrete policy.
isi dm certificates Manage Datamover certificates.
isi dm config Show Datamover manual configuration.
isi dm datasets Show Datamover dataset information.
isi dm historical-jobs Manage Datamover historical jobs.
isi dm jobs Manage Datamover jobs.
isi dm policies Manage Datamover policies. A policy can be either CREATION (creates/replicates a dataset, either once or on a schedule) or COPY (defines a one-time copy of a dataset to or from a remote system).
isi dm throttling Manage Datamover bandwidth and CPU throttling. Bandwidth throttling rules can be configured for each Datamover job.

The high-level view of the SmartSync setup and configuration process is as follows:

 

  1. The first step involves installing or upgrading the cluster to OneFS 9.4. SmartSync replication is handled by the ‘isi_dm_d’ service, which is disabled by default and needs to be enabled prior to configuring and using SmartSync. This can be easily accomplished with the following CLI syntax:
# isi services -a isi_dm_d

Service 'isi_dm_d' is disabled.

# isi services -a isi_dm_d enable

The service 'isi_dm_d' has been enabled.

 

  2. SmartSync uses TLS (transport layer security, or SSL) and, as such, requires trust to be established between the source and target clusters. In addition to a Certificate Authority (CA) and Certificate Identity (CI) for authorization and authentication, both clusters also require encryption to be enabled in order for the isi_dm_d service to run. The best practice is to use a local CA to sign each cluster’s CI, but self-signed certificates can be used instead in the absence of a suitable CA.

Before creating accounts, certificates must be generated and copied to the appropriate clusters. The following Certificate Authorities (CA) and trust hierarchies are required:

Requirement Description
TLS certificates ● A mutually authenticated TLS handshake is required. Authorization, authentication, and encryption are provided by TLS certificates.

● TLS certificates are always required for daemon startup and all communication between Datamover engines.

● Encryption can be disabled, but authorization and authentication cannot be disabled.

Certificate Authorities (CA) ● One or more Certificate Authorities (CA) are required on each Datamover system.

● Dell recommends that customers use a new, Datamover-specific CA for signing Datamover identity certificates.

● The CA that signs an identity certificate does not need to be installed on the system that the identity certificate is installed on. Two systems trust each other if they have the CAs that signed each other’s identity certificates.

Identity certificates ● The certificate that provides authentication of the identity claimed.

● Exactly one identity certificate must exist on each Datamover system.

● Identity certificates are signed by one of the CAs deployed on the systems that the system is going to communicate with.

Trust hierarchies ● Two systems trust each other if they have the CAs that signed each other’s identity certificates.

● There is no concept of unidirectional trust—trust is entirely mutual.

The following steps can be used to generate and copy the pertinent TLS certificates to the source and target Datamover clusters:

Step 1 (Source): Generate a Certificate Authority (CA).

# openssl genrsa -out ca-s.key 4096

# openssl req -x509 -new -nodes -key ca-s.key -sha256 -days 1825 -out ca-s.pem

Step 2 (Source): Copy the source cluster’s CA to the target cluster.

# scp ca-s.pem [Target Cluster IP]:/root

Step 3 (Source): Generate a Certificate Identity (CI).

# openssl genrsa -out identity-s.key 4096

# openssl req -new -key identity-s.key -out identity-s.csr

Step 4 (Source): Create the CI extensions file on the source cluster.

# cat << EOF > identity-s.ext
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage=digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
EOF

Step 5 (Source): Sign the source cluster’s CI with the source cluster’s CA.

# openssl x509 -req -in identity-s.csr -CA ca-s.pem -CAkey ca-s.key -CAcreateserial -out identity-s.crt -days 825 -sha256 -extfile identity-s.ext

Step 6 (Target): Generate a CA on the target cluster.

# openssl genrsa -out ca-t.key 4096

# openssl req -x509 -new -nodes -key ca-t.key -sha256 -days 1825 -out ca-t.pem

Step 7 (Target): Copy the target cluster’s CA to the source cluster.

# scp ca-t.pem [Source Cluster IP]:/root

Step 8 (Target): Generate a CI on the target cluster.

# openssl genrsa -out identity-t.key 4096

# openssl req -new -key identity-t.key -out identity-t.csr

Step 9 (Target): Create the CI extensions file on the target cluster.

# cat << EOF > identity-t.ext
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage=digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
EOF

Step 10 (Target): Sign this CI with the target cluster’s CA.

# openssl x509 -req -in identity-t.csr -CA ca-t.pem -CAkey ca-t.key -CAcreateserial -out identity-t.crt -days 825 -sha256 -extfile identity-t.ext
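
Before copying the certificates into place, it can be worth confirming that each identity certificate actually verifies against the CA that signed it. A quick check on the source cluster, using standard openssl syntax, would be:

# openssl verify -CAfile ca-s.pem identity-s.crt

identity-s.crt: OK

The same check applies on the target cluster, using ca-t.pem and identity-t.crt.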

 

  3. Next, the various CAs and CIs are installed across the two clusters.
Step 1 (Source): Install the source cluster’s CA.

# isi dm certificates ca create "$PWD"/ca-s.pem --name <source-cluster-ca>

Step 2 (Source): Install the target cluster’s CA.

# isi dm certificates ca create "$PWD"/ca-t.pem --name <target-cluster-ca>

Step 3 (Source): Install the source cluster’s CI.

# isi dm certificates identity create "$PWD"/identity-s.crt --certificate-key-path "$PWD"/identity-s.key --name <source-cluster-identity>

Step 4 (Target): Install the target cluster’s CA.

# isi dm certificates ca create "$PWD"/ca-t.pem --name <target-cluster-ca>

Step 5 (Target): Install the source cluster’s CA.

# isi dm certificates ca create "$PWD"/ca-s.pem --name <source-cluster-ca>

Step 6 (Target): Install the target cluster’s CI.

# isi dm certificates identity create "$PWD"/identity-t.crt --certificate-key-path "$PWD"/identity-t.key --name <target-cluster-identity>

Note that the certificates must be located under /ifs when performing the import, otherwise an error similar to the following will be returned:

Invalid certificate path: /root/ca-s.pem [CERTS_CERT_INVALID]
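
For example, staging the certificate files under /ifs before importing them avoids this error. Any /ifs path will do; /ifs/data/Isilon_Support is used here simply because it already exists on the cluster:

# cp /root/ca-s.pem /root/ca-t.pem /root/identity-s.crt /root/identity-s.key /ifs/data/Isilon_Support/

# isi dm certificates ca create /ifs/data/Isilon_Support/ca-s.pem --name source-cluster-ca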

At this point, encryption is now configured on the source and target clusters.

 

  4. By default, a local account, ‘DM Local Account’, is already configured. The ‘isi dm accounts list’ command can be used to display it:
# isi dm accounts list

ID                                               Name             URI             Account Type  Auth Mode   Local Network Pool  Remote Network Pool

----------------------------------------------------------------------------------------------------------------------------------------------------

0060167118de5018ab62800ce595db9bdb40000000000000 DM Local Account dm://[::1]:7722 DM            CERTIFICATE

----------------------------------------------------------------------------------------------------------------------------------------------------

Total: 1

The following steps illustrate configuring a push policy from the source cluster. Note that a single account can be used for both a push and pull policy, depending on the replication topology. After encryption is configured, the next step is to add a replication account to the source cluster, pointing replication to a target cluster.

On the source cluster, add a replication account using the ‘isi dm accounts create’ CLI command:

# isi dm accounts create DM dm://[Target Cluster IP]:7722 ['target-acct']

If desired, local and remote SmartConnect pools can be specified for the source and target clusters, respectively, with the --local-network-pool and --remote-network-pool flags.
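
For example, the account creation command might look like the following, where the pool names are placeholders for the cluster’s actual subnet and pool names:

# isi dm accounts create DM dm://[Target Cluster IP]:7722 target-acct --local-network-pool subnet0:pool0 --remote-network-pool subnet0:pool0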

The ‘isi dm accounts list’ command can be used to verify successful account creation:

# isi dm accounts list

ID                                               Name             URI             Account Type  Auth Mode   Local Network Pool  Remote Network Pool

----------------------------------------------------------------------------------------------------------------------------------------------------

f8f21e66c32476412b621d182495f22d3e31000000000000 DM Local Account dm://[::1]:7722 DM            CERTIFICATE

000c38b4ga22e3810d53ff27449b285b98c8000000000000 rmt-acct          dm://10.20.50.130:7722 DM

----------------------------------------------------------------------------------------------------------------------------------------------------

Total: 2

In the above, the ‘DM Local Account’ is the source cluster’s account, and ‘rmt-acct’ is the target cluster’s account, plus IP address.

 

  5. Two policies are needed here. First, the ‘isi dm policies create’ CLI command can be run with the ‘CREATION’ policy option in order to create a dataset. The syntax for this command to run at ‘normal’ priority is:
# isi dm policies create [Policy Name] NORMAL true CREATION --creation-account-id=[DM local account] --creation-base-path= --creation-dataset-retention-period= --creation-dataset-reserve= --creation-dataset-expiry-action=DELETE --recurrence="cron expression" --start-time="YYYY-MM-DD HH:MM:SS"

The configuration parameters for the ‘isi dm policies create’ command include:

Parameter Description
policy-type Specifies the type of policy. Options are:

● CREATION —the process of creating the dataset

● COPY —used for one-time data transfers

● REPEAT_COPY —used for repeated transfers

● EXPIRATION —how long the snapshot is stored

priority Assigns a priority to this policy. The options are: LOW | NORMAL | HIGH.
true Specifies that the policy is enabled.
creation-account-id The DM local account ID specified in the isi dm accounts list command.
creation-base-path For SmartSync this specifies the directory path or file for the dataset. For cloud copy, this specifies the object store key prefix.
creation-dataset-retention-period How long the dataset is retained in seconds before expiration.
creation-dataset-reserve How many datasets to keep in reserve that are protected from expiration, irrespective of the creation-dataset-retention-period.
creation-dataset-expiry-action Specifies what happens with the dataset after expiration. With OneFS 9.4, the only expiration option is DELETE.
recurrence How often the policy runs.
start-time The date and time when the policy runs. If a prior date is entered, the policy runs immediately.

The following CLI command creates a Datamover CREATION policy named createTestDataset. The policy creates a dataset with the base filepath /ifs/test/dm/data1. The creation account is the local Datamover account. The dataset expires 1,500 seconds (25 minutes) after its creation, after which it is deleted. The policy starts running December 1, 2022, at 12pm.

# isi dm policies create --name=createTestDataset --enabled=true --priority=low --policy-type=CREATION --creation-base-path=/ifs/test/dm/data1 --creation-account-id=local --creation-dataset-expiry-action=DELETE --creation-dataset-retention-period=1500 --start-time "2022-12-01 12:00:00"

To list the Datamover policies:

# isi dm policies list

ID   Validity  Name              Enabled  Disabled By DM  Priority  Policy Type  Base Policy ID  Date Times  Recurrence  Start Time          Parent Exec Policy ID

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 Yes createTestDataset Yes No LOW    CREATION - - - 2022-12-01 12:00:00 -                 

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

The ‘isi dm policies view’ CLI syntax can be used to inspect details of a policy, in this case ‘createTestDataset’ with an ID of 1 above:

# isi dm policies view 1

                        ID: 1

                  Validity: Yes

                      Name: createTestDataset

                   Enabled: Yes

            Disabled By DM: No

                  Priority: LOW

                   Run Now: No

            Base Policy ID: -

     Parent Exec Policy ID: -

                  Schedule

                    Date Times: -

                    Recurrence: -

                    Start Time: 2022-12-01 12:00:00

Policy Specific Attributes

                        Policy Type: CREATION

                    Creation Policy

                           Account ID: local

                            Base Path: /ifs/test/dm/data1

                            Retention

               Dataset Retention Period: 1500

                        Dataset Reserve: 2

                  Dataset Expiry Action: DELETE

In addition to the newly configured CREATION policy, a COPY policy is also required to perform the data move. This can be created as follows:

# isi dm policies create archive-restore NORMAL true COPY --copy-source-base-path=/ifs/test/dm/data1 --copy-create-dataset-on-target=true --copy-base-source-account-id=f8f21e66c32476412b621d182495f22d3e31000000000000 --copy-base-target-account-id=000c38b4ga22e3810d53ff27449b285b98c8000000000000 --copy-base-target-base-path=/ifs/test/dm/data1 --copy-base-target-dataset-type=FILE --copy-base-dataset-retention-period=3600 --copy-base-dataset-reserve=2 --copy-base-policy-dataset-expiry-action=DELETE

Confirm both the COPY and CREATION Datamover policies are present:

# isi dm policies list

ID   Validity  Name              Enabled  Disabled By DM  Priority  Policy Type  Base Policy ID  Date Times  Recurrence  Start Time          Parent Exec Policy ID

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 Yes createTestDataset Yes No LOW    CREATION -      -     -     -                 

2 Yes archive-restore   Yes No NORMAL COPY     -      -     -     –

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

  6. The next step is to run the CREATION policy (ID = 1) in order to create the dataset:
# isi dm policies modify 1 --run-now=true

The running job can be inspected as follows:

# isi dm jobs list

ID Job Type Job Priority Job Policy ID Job Control Request Job Start Time Job End Time Job State Job State Flags

----------------------------------------------------------------------------------------------------------------

201 DATASET_CREATION_JOB NORMAL  1  NONE  2022-06-23T14:52:22 2022-06-23T14:53:04 finishing   No failure

----------------------------------------------------------------------------------------------------------------

Total: 1

Once the job has completed, the ‘isi dm historical-jobs list’ CLI command allows the dataset creation policy’s status to be queried.

# isi dm historical-jobs list

ID Job Type Job Priority Job Policy ID Job Control Request Job Start Time Job End Time Job State Job State Flags

----------------------------------------------------------------------------------------------------------------

201 DATASET_CREATION_JOB NORMAL  1  NONE  2022-06-23T14:52:22 2022-06-23T14:54:51 finished   No failure

----------------------------------------------------------------------------------------------------------------

Total: 1

Finally, run the COPY policy (ID = 2) to replicate the dataset from the source to target cluster:

# isi dm policies modify 2 --run-now=true

# isi dm jobs list

ID Job Type Job Priority Job Policy ID Job Control Request Job Start Time Job End Time Job State Job State Flags

----------------------------------------------------------------------------------------------------------------

202 DATASET_BASELINE_COPY_JOB NORMAL  2  NONE   2022-06-23T14:55:11 2022-06-23T14:56:48 running   No failure

----------------------------------------------------------------------------------------------------------------

Total: 1

When the COPY job has completed, the ‘historical-jobs’ output will now show both the CREATION and COPY job details:

# isi dm historical-jobs list

ID Job Type Job Priority Job Policy ID Job Control Request Job Start Time Job End Time Job State Job State Flags

----------------------------------------------------------------------------------------------------------------

202 DATASET_BASELINE_COPY_JOB NORMAL  2  NONE   2022-06-23T14:55:11 2022-06-23T14:57:06 finished   No failure

201 DATASET_CREATION_JOB NORMAL  1  NONE  2022-06-23T14:52:22 2022-06-23T14:54:51 finished   No failure

----------------------------------------------------------------------------------------------------------------

Total: 2

Once created, the new dataset can be inspected via the ‘isi dm datasets list’ command output:

# isi dm datasets list

ID Dataset State Dataset Type Dataset Base Path Dataset Subpaths Dataset Creation Time Dataset Expiry Action Dataset Retention Period

-------------------------------------------------------------------------------------------------------------------------------------

1  COMPLETE FILE  /ifs/test/dm/data1   - 2022-06-23T14:54:51  DELETE  2022-06-23T15:54:51

-------------------------------------------------------------------------------------------------------------------------------------

Total: 1

To view Datamover policies:

# isi dm policies view

Note that the procedure above configures push replication of a dataset from a source to target. Conversely, to perform a pull from the target cluster, the replication account is instead added to the target cluster, and with the source cluster’s IP used:

# isi dm accounts create DM dm://[Source Cluster IP]:7722 ['source-acct']

Object data replication to public cloud or Dell ECS targets can also be configured with the ‘isi dm accounts create’ CLI command, but does require a couple of additional parameters, namely:

Parameter Description
Object store type AWS_S3, Azure, or ECS_S3
URI {http,https}://hostname:port/bucketname
Auth Access ID, Secret Key
Proxy Optional proxy information

For example:

# isi dm accounts create AWS_S3 https://aws-host:5555/bucket dm-account-name --authmode CLOUD --access-id aws-access-id --secret-key aws-secret-key

Be aware that a dataset must be available before a copy, or repeat-copy data replication policy runs, or the policy will fail.

Behind the scenes, dataset creation leverages a SnapshotIQ snapshot, which can be inspected via the ‘isi snapshot snapshots list’ command. These DM dataset snapshots are easily recognizable due to their ‘isi_dm’ prefixed naming convention.
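
For example, to quickly list any Datamover-generated snapshots on a cluster:

# isi snapshot snapshots list | grep isi_dm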

In the final article in this series, we’ll take a look at SmartSync management, monitoring, and troubleshooting.

OneFS SmartSync Datamover

Amongst the bevy of new functionality introduced in OneFS 9.4 is SmartSync v1, and we’ll be taking a look at this new replication product over the course of the next couple of blog articles.

So, the new SmartSync Datamover enables flexible data movement and copying, incremental resyncs, and push and pull data transfer of file data between PowerScale clusters. Additionally, SmartSync CloudCopy also enables the copying of file-to-object data from a source cluster to a cloud object storage target. Cloud object targets include AWS S3 and Microsoft Azure, as well as Dell ECS.

Having a variety of target destination options allows multiple copies of a dataset to be stored across locations and regions, both on and off-prem, providing increased data resilience and the ability to rapidly recover from catastrophic events.

CloudCopy uses HTTP as the data replication transport layer to cloud storage, while cluster-to-cluster SmartSync leverages a proprietary RPC-based messaging system. In addition to the replication of the actual data, SmartSync also preserves the common file attributes including Windows ACLs, POSIX permissions and attributes, creation times, extended attributes, alternate data streams, etc.

In order to use SmartSync, SyncIQ must be licensed and active across all nodes in the cluster. Additionally, a cluster account with the ISI_PRIV_DATAMOVER privilege is needed in order to configure and run SmartSync data mover policies. While file-to-file replication requires SmartSync to be running on both source and target clusters, for CloudCopy transfers to/from cloud storage only the OneFS 9.4 cluster requires the SmartSync data mover; no data mover is needed on the cloud side. Be aware that inbound TCP port 7722 must be open across any intermediate gateways and firewalls to allow SmartSync replication to occur.
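
Before configuring accounts, a simple TCP probe from the source cluster is an easy way to verify that the port is reachable. For example, assuming netcat is available on the node (the target address below is a placeholder):

# nc -z -w 5 [Target Cluster IP] 7722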

Under the hood, replication is handled by the ‘isi_dm_d’ service, which is disabled by default, and needs to be enabled prior to configuring and using SmartSync. SmartSync uses TLS (transport layer security, or SSL) and, as such, requires trust to be established between the source and target clusters. In addition to a Certificate Authority (CA) and Certificate Identity (CI) for authorization and authentication, both clusters also require encryption to be enabled in order for the isi_dm_d service to run. The best practice is to use a local CA to sign each cluster’s CI,  but self-signed certificates can be used instead in the absence of a suitable CA.

The SmartSync Datamover has a purpose-built, integrated job execution engine, and Datamovers are executed on each cluster node in cooperative mode.

Shared Key-Value Stores (KVS) are used for jobs/tasks distribution, and extra indexing is implemented for quick lookups by task state, task type, and alive time. There are no dependencies or communication between tasks, and job cancellation and pausing is handled by posting a ‘request’ into a job record (request polling).

Within the SmartSync hierarchy, accounts define the connections to remote systems, policies define the replication configurations, and jobs perform the work:

Component Details
Accounts Datamover accounts:
- URI, e.g. dm://remotenas.isln.com:7722
- Local and remote network pools defining nodes/interfaces to use for data transfer
- Client and server certificates to enable TLS
CloudCopy accounts:
- Account type (AWS S3, ECS S3, Azure)
- URI, e.g. https://cloudcluster.isln.com:9002/cloudbucket
- Credentials
Policies
- Dataset creation policy
- Dataset copy policy
- Dataset repeat copy policy
- Dataset expiration policy
Jobs Runtime entities created based on policy schedules. There are two major types of data transfer jobs:
- Baseline jobs for initial transfers, and
- Incremental jobs for subsequent transfers between FILE Datamover systems.
Tasks Spawned by jobs; the individual chunks of work that a job must perform. There is no 1-to-1 relationship to their associated files.

 

SmartSync Datasets are self-contained, independent entities. Once created, they’re assigned globally-unique IDs, and backed by file system snapshots on PowerScale. Parent-child relationships are used for incremental transfers, and a handshake determines the exact changeset to be transferred.

As demand for replication on a source cluster increases, the additional compute and network load needs to be considered. Multiple targets can generate a significant demand on the source cluster, with replication traffic contending with client workloads as data is pushed to the target. Fortunately, SmartSync allows a target cluster to pull the dataset, thereby minimizing the resource impacts on the source.

For single-source dataset environments with numerous targets, pull replication can be incredibly useful, allowing the source cluster’s resources to be focused on client IO. In addition to both push and pull replication, SmartSync also supports a variety of topologies, such as fan-out, chaining, etc.

SmartSync provides enhanced replication failure resilience, minimizing replication times even when a job runs into an error. Rather than failing an entire replication job if an error is encountered, requiring a manual restart, SmartSync instead places the job into a paused state, and presents three options:

  1. Cancel the job altogether.
  2. Resolve the errors and resume the job.
  3. Complete a partial replication.

With option 3, the portion of the dataset already transferred is retained, thereby decreasing the subsequent job’s work and execution time.

The SmartSync architecture intentionally decouples source cluster snapshot creation (dataset creation) from the actual data replication transfer to the target, allowing each to run independently via separate policies configured for each. This helps mitigate the disruptive chain effect of a failure early in the snapshot process. Additionally, SmartSync offers parent-child policies which launch a replication job only after successful snapshot creation, providing an alternative to recurrence in situations where it’s unclear how long a previous policy may take to complete.

With SmartSync, ‘re-baselining’ (a full resync) is not required for source and target clusters which already contain an earlier version of a dataset. For example, in the following three-cluster DR topology, cluster A replicates to B, and B replicates to C:

A parent-child relationship means that, if cluster B becomes unavailable, the cluster A to C policy would not require a new baseline. Instead, clusters A and C’s datasets are compared via a handshake, enabling only the changed data blocks to be transferred, thereby minimizing replication overhead. This is particularly beneficial for environments with large datasets, significantly shrinking RPO and RTO times and increasing DR readiness.

When setting up a SmartSync 3-way relationship, be sure to use a single dataset creation policy when configuring datasets on the same path. If there are separate dataset creation policies for each relationship, B and C will have different datasets (snapshots) with different dataset IDs. In this case, if A dies it would be impossible to establish an incremental sync relationship between B and C on those datasets, since the incremental transfer won’t be able to ‘connect’ the dataset IDs between B and C.

SmartSync allows subsequent incremental data movement by managing and re-transferring failed file transfers. Similarly, Dataset reconnect enables systems with common base datasets to establish instant incremental syncs. SmartSync also proactively locks the SnapshotIQ snapshots it generates, providing better protection and separation between Datamover and other cluster snapshots.

Other SmartSync features and functionality includes:

Feature Details
Bandwidth throttling Set of netmask rules. Limits are per-node.
CPU throttling Allowed and Backoff CPU percentages.
Base policies Template providing common values to groups of related policies (schedule, source base path, enable/disable, etc.). For example, disabling a base policy affects all linked concrete policies.
Concrete policy Predefined set of fields from the base policy.
Incremental reconnect Ability to run incrementals between systems with common base datasets but no prior replication relationship.
Unconnected nodes (NANON) Active accounts are monitored by each node. No work allocation to nodes without network access.
Snapshot locking Avoids accidental snapshot deletion, with subsequent re-base-lining.

 


Performance-wise, SmartSync is powered by a scalable run-time engine which spans the cluster, spins up threads (fibers) on demand, and uses asynchronous IO to process replication tasks (chunks). Batch operations are used for efficient small file, attribute, and data block transfer. Namespace contention avoidance, efficient snapshot utilization, and separation of dataset creation from transfer are salient design features of both the baseline and incremental sync algorithms. Plus, the availability of a pull transfer model can significantly reduce the impact on a source cluster, if needed.

The streamlined baseline and incremental file transfer jobs operate as follows:

On the CloudCopy side, the SmartSync copy format provides regular file representation, plus browsability and usability of file system data in the cloud. That said, as compared to the file-to-file Datamover, there are certain CloudCopy considerations and limitations to be aware of, such as the lack of incremental copy support. These include:

CloudCopy Caveats Details
ADS files Skipped when encountered.
Hardlinks An object will be created for each link (i.e. links are not preserved).
Symlinks Skipped when encountered.
Directories An object is created for each directory.
Special files Skipped when encountered.
Metadata Only POSIX mode bits, UID, GID, atime, mtime, ctime are preserved.
Filename encodings Converted to UTF-8.
Path Path relative to root copy directory is used as object key.
Large files An error is returned for files larger than the cloud provider’s maximum object size.
Long filenames File names exceeding 256 bytes are compressed.
Long paths Junction points are created when paths exceed 1024 bytes, to redirect where objects are being stored.
Sparse files Sparse sections are not preserved and are written out fully as zeros.

As mentioned earlier, there are also some prerequisites to address before running SmartSync. First, the source and target(s) must be running OneFS 9.4, with SyncIQ licensed across the cluster. Additionally, the identity certificates and a shared CA must be present in order to communicate with a peer Datamover.

In the next article in this series, we’ll turn our attention to the configuration and use of SmartSync.

OneFS System Partition Hygiene

Like most UNIX-derived operating systems, OneFS uses several system partitions in addition to the /ifs data storage partition, including:

Partition Description
/ Root partition containing all the data to start up and run the system, and which contains the base OneFS software image.
/dev Device files partition. Drives, for example, are accessed through block device files such as /dev/ad0.
/ifs Clustered filesystem partition, which spans all of a cluster’s nodes. Includes /ifs/.ifsvar.
/usr Partition for user programs.
/var Partition to store variable data, such as log files, etc. In OneFS, this partition is mostly used for /var/run and /var/log.
/var/crash The crash partition is configured for binary dumps.

One advantage of having separate partitions rather than one big chunk of space is that different parts of the OS are somewhat protected from each other. For example, if /var fills up, it doesn’t affect the root / partition.

While OneFS automatically performs the vast majority of its system housekeeping, occasionally the OneFS /var partition on one or more of a cluster’s nodes will fill up, typically as the result of heavy log writing activity and/or the presence of corefile(s). If /var reaches 75%, 85%, or 95% of capacity, a CELOG event is automatically fired and an alert sent.

The following CLI command will provide a view of /var usage across the cluster:

# isi_for_array -s "du -h /var | sort -n | tail -n10"

The typical resolution for this scenario is to rotate the logfiles under /var/log. If, after log rotation, the /var partition returns to a normal usage level, reviewing the list of recently written logs will usually determine if a specific log is rotating frequently/excessively. Log rotation will usually resolve the full-partition issue by compressing or removing large logs and old logs, thereby automatically reducing partition usage.
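
Since rotation is handled by the standard FreeBSD newsyslog utility, it can also be triggered manually on the affected node. For example, the following forces rotation of all eligible logs, even if their trim conditions have not yet been met (-F forces the rotation, -v prints what is being rotated):

# newsyslog -Fv
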
The ‘df -i’ CLI command, run on the node that reported the error, will display the details of the /var partition. For example:

# df -i | grep var | grep -v crash
Filesystem 1K-blocks Used Avail Capacity iused ifree %iused Mounted on
/dev/mirror/var0 1013068 49160 882864 5% 1650 139276  92% /var

If the percentage used value is 90% or higher, as above, the recommendation is to reduce the number of files in the /var partition. To remove files that do not belong in the /var partition, first run the following ‘find’ command on the node that generated the alert. This will display any files in the /var partition that are greater than 5 MB in size:

# find -x /var -type f -size +10000 -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

The output will show any large files that do not typically belong in the /var partition. These could include artifacts such as OneFS install packages, cluster log gathers, packet captures, or other user-created files. Remove the files or move them to the /ifs directory. If you are unsure which, if any, files are viable candidates for removal, contact Dell Support for assistance.
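
For instance, if the find output flagged a hypothetical leftover log gather sitting in /var/tmp, it could be relocated out of the partition as follows (the filename here is purely illustrative):

# mv /var/tmp/old_log_gather.tgz /ifs/data/Isilon_Support/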

The ‘fstat’ CLI command is a useful tool for listing the open files on a node or in a directory, or for displaying the files that were opened by a particular process. This information can be invaluable for determining if a process is holding a large file open. For example, a node’s open files can be displayed as follows:

# fstat

A list of the open files can help in monitoring the processes that are writing large files.

Using the ‘-f’ flag will narrow the fstat output to a particular directory:

# fstat -f <directory_path>

Similarly, to list the files opened by a particular process:

# fstat -p <pid>

If there are no open files found in the /var directory, it is entirely possible that a large file has become unlinked and is consuming space because one or more processes have the file open. The fstat command can be used to confirm this, as follows:

# fstat -f /var | grep var

If a process is holding a file open, output similar to the following is displayed:

root lwio 98281 4 /var 69612 -rw------- 100120000 rw

Here, the lwio daemon (PID 98281) has a file open that is approximately 100 MB (100120000 bytes) in size. The file’s inode number, 69612, can be used to retrieve its name:

# find -x /var -inum 69612 -print

/var/log/lwiod.log

If a process is holding a large file open and its inode cannot be found, the file is considered to be ‘unlinked’. In this case, the recourse is typically to restart the offending process. Note that, before stopping and restarting a process, consider any possible negative consequences. For example, stopping the OneFS SMB daemon, lwiod, in the example above would potentially disconnect SMB users.

If neither of the suggestions above resolves the issue, the logfile’s rollover file size limit can be reduced and the file itself compressed. To do this, first create a backup of the /etc/newsyslog.conf file as follows:

# cp /etc/newsyslog.conf /ifs/newsyslog.conf
# cp /etc/newsyslog.conf /etc/newsyslog.bak

Next, open the /ifs/newsyslog.conf file in emacs, vi, or the editor of your choice, and locate the following line:

/var/log/wtmp 644 3 * @01T05 B

Change the line to:

/var/log/wtmp 644 3 10000 @01T05 ZB

These changes instruct the system to roll over the /var/log/wtmp file when it reaches 10 MB and then to compress the file with gzip. Save and close the /ifs/newsyslog.conf file, and then run the following command to copy the updated ‘newsyslog.conf’ file to the remaining nodes on the cluster:

# isi_for_array 'cp /ifs/newsyslog.conf /etc/newsyslog.conf'

If other logs are rotating frequently, or if the preceding solutions do not resolve the issue, run the isi_gather_info command to gather logs, and then contact Dell Support for assistance.

There are several options available to stop processes and create a corefile under OneFS:

CLI Command Description
gcore Generate a core dump file of the running process without actually killing it.
kill -6 Stop a single process and get a core dump file
killall -6 Stop all processes and get a core dump file
kill -9 Force a process to stop

The ‘gcore’ CLI command can generate a core dump file from a running process without actually killing it. First, the ‘ps’ CLI command can be used to find and display the process ID (PID) for a running process:

# ps -auxww | egrep 'USER|lsass' | grep -v grep

USER     PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
root   68547  0.0  0.3 150464 38868 ??   S    Sun11PM   0:06.87 lw-container lsass (lsass)

In the above example, the PID for the lsass process is 68547. Next, the ‘gcore’ CLI command can be used to generate a core dump of this PID and write the output to a location of choice, in this example a file aptly named ‘lsass.core’.

 # gcore -c /ifs/data/Isilon_Support/lsass.core 68547

# ls -lsia /ifs/data/Isilon_Support/lsass.core
4297467006 58272 -rw-------     1 root  wheel  239280128 Jun 10 19:10 /ifs/data/Isilon_Support/lsass.core

Typically, the /ifs/data/Isilon_Support directory provides an excellent location to write the coredump to. Clearly, /var is not a great choice, since the partition is likely already full.

Finally, when the coredump has been written, the ‘isi_gather_info’ tool can be used to coalesce both the core file and the pertinent cluster logs into a convenient tarfile.

# isi_gather_info --local-only -f /ifs/data/Isilon_Support/lsass.core

# ls -lsia /ifs/data/Isilon_Support | grep -i gather
4298180122    26 -rw-r--r-- +    1 root  wheel         19 Jun 10 15:44 last_full_gather

The resulting log set, ‘/ifs/data/Isilon_Support/last_full_gather’, is then ready for upload to Dell Support for further investigation and analysis.

OneFS and CloudPools Upgrades

OneFS’ non-disruptive upgrade (NDU) functionality allows administrators to upgrade a cluster while their end user community continues to access data without error or interruption. During the OneFS rolling upgrade process, one node at a time is updated to the new code, and the active clients attached to it are automatically migrated to other nodes in the cluster. Partial upgrade is also supported, allowing a subset of cluster nodes to be upgraded; this subset may also be grown during the upgrade. OneFS also permits an upgrade to be paused and resumed, enabling customers to span cluster upgrades across multiple smaller maintenance windows.

OneFS CloudPools v1.0 originally debuted back in the OneFS 8.0 release. The next major update, CloudPools v2.0, was delivered in the OneFS 8.2 release, and introduced significant enhancements which included:

  • Support for AWS signature authentication version 4.
  • Network statistics per CloudPools account or file pool policy.
  • Support for Alibaba Cloud and Amazon C2S public cloud providers.
  • Full integration of CloudPools and data services like Snapshot, Sparse file handling, Quota, AVScan and WORM.
  • NDMP and SyncIQ support
  • Non-Disruptive Upgrade (NDU) support

CloudPools, like its SmartPools counterpart, uses the OneFS file pool policy engine to designate which data on a cluster should reside on which tier, or be archived to a cloud storage target. If files match the criteria specified in a file pool policy, the content of those files is moved to cloud storage when the job runs. Under the hood, CloudPools uses ‘SmartLink’ files within the /ifs namespace, each of which contains information about where to retrieve the file’s cloud-tiered data blocks. In CloudPools 1.0, these SmartLink v1 files, often referred to as ‘stubs’, do not behave like normal files. By contrast, the SmartLink v2 files in CloudPools 2.0 are more like traditional files, each containing pointers to the CloudPools target where the data resides.

When a CloudPools 1.0 cluster is upgraded to OneFS 8.2 or later, a ‘changeover’ process is automatically initiated upon upgrade commit. This process is responsible for converting the v1 SmartLink files to v2, ensuring a seamless transition from CloudPools 1.0 to 2.0.

The following table outlines the upgrade paths available when transitioning from CloudPools 1.0 to 2.0:

Current OneFS Version Upgrade to OneFS 8.2 Upgrade to OneFS 8.2.1 with 5/2020 RUPs Upgrade to OneFS 8.2.2 with 5/2020 RUPs Upgrade to OneFS 9.x
OneFS 8.0 Discouraged Viable Recommended Highly recommended
OneFS 8.1 Discouraged Viable Recommended Highly recommended

In a SyncIQ environment with unidirectional replication, the SyncIQ target cluster should be upgraded before the source cluster. Conversely, for bi-directional replication, the recommendation is to disable SyncIQ on both the source and target, and upgrade both clusters simultaneously.

The following CLI commands can be run on both the source and target clusters to verify and capture their storage account, CloudPools, file pool policy, and SyncIQ configurations:

# isi cloud accounts list -v 

# isi cloud pools list -v 

# isi filepool policies list -v 

# isi sync policies list -v 
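For instance, a minimal sketch to capture all four outputs into a single file for later before-and-after comparison (the output path is illustrative):

for cmd in "cloud accounts" "cloud pools" "filepool policies" "sync policies"; do
    echo "== isi $cmd list -v ==" >> /ifs/data/Isilon_Support/pre_upgrade_config.txt
    isi $cmd list -v >> /ifs/data/Isilon_Support/pre_upgrade_config.txt
done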

SyncIQ can be re-enabled on both source and target once the OneFS upgrades have been committed on both clusters. Be aware that the SmartLink conversion process can take considerable time, depending on the number of SmartLink files and the processing power of the target cluster.

Note that there is no need to stop the SyncIQ and/or SnapshotIQ services during the upgrade in a SyncIQ environment with unidirectional replication. However, since SyncIQ must resynchronize all converted stub files, it might take some time to process all the changes.

The ‘isi cloud job view <job ID>’ CLI command can be used to check the status of a SmartLink upgrade process. For example, to view job ID 6:

# isi cloud job view 6

ID: 6

Description: Update SmartLink file formats

Effective State: running

Type: smartlink-upgrade

Operation State: running

Job State: running

Create Time: 2022-05-23T14:20:26

State Change Time: 2022-05-17T09:56:08

Completion Time: -

Job Engine Job: -

Job Engine State: -

Total Files: 21907433

Total Canceled: 0

Total Failed: 61

Total Pending: 318672

Total Staged: 0

Total Processing: 48

Total Succeeded: 21588652

Note that the CloudPools recall jobs will not run during an active SmartLink upgrade or conversion.
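Since the conversion can run for a long time on large datasets, a simple polling loop can be used to watch progress. This is a sketch only, using job ID 6 from the example above; the ten-minute interval is arbitrary:

while true; do
    date
    isi cloud job view 6 | egrep 'Effective State|Total Pending|Total Succeeded|Total Failed'
    sleep 600
done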

CloudPools 2.0 supports AWS signature authentication version 4 (v4), in addition to version 2 (v2). Version 4 is generally preferred, since it provides an additional level of security. However, be aware that legacy CloudPools cloud storage accounts cannot use v4 in the ‘upgraded’ state if the release running prior to the OneFS 8.2.0 upgrade did not support v4. A patch is available for OneFS 8.1.2 to add v4 authentication support, as a workaround for this issue.

While CloudPools 2.0 supports write-back in a snapshot, it does not support archiving and recalling files in the snapshot directory. If legacy file data exists in a snapshot on a cluster running OneFS 8.1.2 or earlier, the storage space that data consumes cannot be released after upgrading to OneFS 8.2, since CloudPools 2.0 does not support archiving files in snapshots to the cloud.

OneFS non-disruptive upgrades can be easily managed from the WebUI by navigating to Cluster Management > Upgrade, and selecting the desired ‘Upgrade type’ from the drop-down menu. For example:

Rolling upgrades can be initiated from the OneFS CLI with the following syntax:

# isi upgrade cluster start <upgrade_image>

Since OneFS supports the ability to roll back to the previous version, an upgrade must be committed in order to complete it:

# isi upgrade cluster commit

Up until the time an upgrade is committed, it can be rolled back to the prior version as follows:

# isi upgrade cluster rollback

The ‘isi upgrade view’ CLI command can be used to monitor how the upgrade is progressing:

# isi upgrade view -i/--interactive

The following command will provide more detailed/verbose output:

# isi_upgrade_status

A faster, simpler version of isi_upgrade_status is also available in OneFS 8.2.2 and later:

isi_upgrade_node_state
    -a                 (aggregate the latest hook update for each node)
    -devid=<X,Y,E-F>   (filter and display by devid)
    -lnn=<X-Y,A,C>     (filter and display by LNN)
    -ts                (time sort entries)
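For example, a minimal sketch that samples the per-node upgrade state once a minute during a maintenance window. Combining the ‘-a’ and ‘-ts’ flags is assumed to be supported, and the interval is arbitrary:

while true; do
    isi_upgrade_node_state -a -ts
    sleep 60
done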

If the end of a maintenance window is reached but the cluster is not fully upgraded, the upgrade process can be quiesced and then restarted using the following CLI commands:

# isi upgrade pause
# isi upgrade resume

For example:

# isi upgrade pause

You are about to pause the running process, are you sure?  (yes/[no]):

yes

The process will be paused once the current step completes.

The current operation can be resumed with the command:

`isi upgrade resume`

Note that pausing is not immediate: the upgrade will remain in a ‘Pausing’ state until the node currently being upgraded has completed. Additional nodes will not be upgraded until the upgrade process is resumed.

The ‘pausing’ state can be viewed with the following commands: ‘isi upgrade view’ and ‘isi_upgrade_status’. Note that a rollback can be initiated either during ‘Pausing’ or ‘Paused’ states. Also, be aware that the ‘isi upgrade pause’ command has no effect when performing a simultaneous OneFS upgrade.

A rolling reboot can be initiated from the CLI on a subset of cluster nodes using the ‘isi upgrade rolling-reboot’ syntax and the ‘--nodes’ flag, specifying the desired LNNs for reboot:

# isi upgrade rolling-reboot --help

Description:

    Perform a Rolling Reboot of cluster.


Required Privileges:

    ISI_PRIV_SYS_UPGRADE


Usage:

    isi upgrade cluster rolling-reboot

        [--nodes <integer_range_list>]

        [--force]

        [{--help | -h}]


Options:

    --nodes <integer_range_list>

        List of comma (1,3,7) or dash (1-7) specified node LNNs to select. "all"

        can also be used to select all the cluster nodes at any given time.


  Display Options:

    --force

        Do not ask confirmation.

    --help | -h

        Display help for this command.

The ‘isi upgrade view’ syntax provides visibility into the status and progress of the rolling reboot process. For example:

# isi upgrade view


Upgrade Status:


Current Upgrade Activity: RollingReboot

   Cluster Upgrade State: committed

   Upgrade Process State: Not started

      Current OS Version: 9.2.0.0

      Upgrade OS Version: N/A

        Percent Complete: 0%


Nodes Progress:


     Total Cluster Nodes: 3

       Nodes On Older OS: 3

          Nodes Upgraded: 0

Nodes Transitioning/Down: 0


LNN  Progress  Version  Status

---------------------------------

1    100%        9.2.0.0  committed

2    rebooting   Unknown  non-responsive

3    0%          9.2.0.0  committed

OneFS CloudPools Statistics Reporting

For the longest time, the statistics from CloudPools accounts and policies were recorded but were only available via internal tools. In OneFS 9.4, these metrics are now easily accessible and presented via a new CLI ‘cloud’ option, within the familiar ’isi statistics’ CLI command set. This allows cluster administrators to gain insight into cloud accounts and policies for planning or troubleshooting CloudPools related activities.

There is no setup or configuration required in order to use CloudPools statistics, and statistics for all new and existing CloudPools accounts and policies will automatically be collected and reported upon upgrade to OneFS 9.4.

The syntax for the new ‘isi statistics cloud’ CLI command is as follows:

Usage:

    isi statistics cloud <action>

        [--account <str>]

        [--policy <str>]

        [{--nodes | -n} <NODE>]

        [{--degraded | -d}]

        [--nohumanize]

        [{--interval | -i} <float>]

        [{--repeat | -r} <integer>]

        [{--limit | -l} <integer>]

        [--long]

        [--output ((Timestamp|time) | (Account|acct) | (Policy|pol) |

          (Cluster-GUID|guid|cluster) | In | Out | Reads | Writes |

          (Deletions|deletes|del) | (Cloud|vendor) | (A-Key|account_key|key) |

          (P-ID|policy_id|id) | Node)]

        [--sort ((Timestamp|time) | (Account|acct) | (Policy|pol) |

          (Cluster-GUID|guid|cluster) | In | Out | Reads | Writes |

          (Deletions|deletes|del) | (Cloud|vendor) | (A-Key|account_key|key) |

          (P-ID|policy_id|id) | Node)]

        [--format (table | json | csv | list | top)]

        [{--no-header | -a}]

        [{--no-footer | -z}]

        [{--verbose | -v}]

        [{--help | -h}]


The following CloudPools metrics and information are gathered and reported by the ‘isi statistics cloud list’ CLI command:

Name Description
In Bytes in.
Out Bytes out.
Reads Number of Reads.
Writes Number of writes.
Deletions Number of deletions.
Timestamp Date and time.
GUID Cluster global unique identifier.
Cloud Cloud vendor.
A-Key Cloud account key.
P-ID Cloud policy identifier.
Node Node number.

Standard CLI options are available for the command, including JSON, table, and CSV output. The comprehensive list of these includes:

Option Description
--account Identify the account to view. Specify the account name or a phrase to match. Default is ‘all’, which will select all accounts.
--policy Identify the policy to view. Specify the policy name or a phrase to match. Default is ‘none’, which will select no policies.
--nodes Specify node(s) for which statistics should be reported.
--degraded Continue to report if some nodes do not respond.
--nohumanize Output raw numbers without conversion to units.
--interval Wait <INTERVAL> seconds before refreshing the display.
--repeat Print the requested data <REPEAT> times (-1 for infinite).
--limit Number of statistics to display.
--long Display all possible columns.
--output Output specified column(s):

·         (Timestamp|time) | (Account|acct) | (Policy|pol) | (Cluster-GUID|guid|cluster) | In | Out | Reads | Writes | (Deletions|deletes|del) | (Cloud|vendor) | (A-Key|account_key|key) | (P-ID|policy_id|id) | Node

--sort Sort data by the specified comma-separated field(s) above. Prepend ‘asc:’ or ‘desc:’ to a field to change the sort order.
--format Display statistics in table, JSON, CSV, list or top format.
--no-header Do not display headers in CSV or table formats.
--no-footer Do not display table summary footer information.
--verbose Display more detailed information.
--help Display help for this command.

In addition to sorting and filtering results, the CloudPools statistics can also be separated by node.

In its simplest form, the OneFS 9.4 ‘isi statistics cloud list’ syntax returns the following information, in this case for a three node cluster:

# isi statistics cloud list

Account Policy In    Out      Reads Writes Deletions Cloud Node

---------------------------------------------------------------

S3       0.0B  510.2KB      0      4         0  AWS   2    

S3       0.0B    0.0KB      0      0         0  AWS   1    

S3       0.0B    0.0KB      0      0         0  AWS   3    

---------------------------------------------------------------

More detailed output can be obtained by including the ‘--long’ command argument. Additionally, the ‘-r’ flag will repeat the output the number of times specified, at an interval specified by the ‘-i’ flag. For example:

# isi statistics cloud list -i 3 --account s3 --policy s3policy --long -r 2
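The statistics can also be exported for offline analysis. For example, the following sketch appends raw, per-node numbers to a CSV file, using only flags from the usage output above (the output path is illustrative):

# isi statistics cloud list --long --nohumanize --format csv --no-footer >> /ifs/data/Isilon_Support/cloud_stats.csv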

Architecturally, the new CloudPools statistics reporting infrastructure utilizes existing OneFS daemons and systems.

Under the hood, the isi_cpool_io_d service scans the CloudPools accounts and policies every three minutes and registers or unregisters them as appropriate, while new data is sent directly to the isi_cpool_sysctl service. Raw cloud metrics are passed to the isi_stats service, which generates the ordered stats keys reported by the ‘isi statistics cloud list’ CLI utility and the platform API. For example:

# https://<node_ip>:8080/platform/statistics/summary/cloud
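The same data can also be retrieved programmatically, for instance with curl against the platform API endpoint above. Substitute a node IP and valid credentials; the ‘root’ user here is purely illustrative:

# curl -k -u root https://<node_ip>:8080/platform/statistics/summary/cloud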

While no new log files are introduced in support of the CloudPools statistics framework, the ‘isi_gather_info’ logfile coalescing utility does now include cloud account and policy information and statistics in OneFS 9.4, which will be particularly useful for Dell Support when troubleshooting CloudPools issues.

In order to access the new CloudPools statistics, all of the cluster’s nodes must be running OneFS 9.4 or later. Also, be aware of the current limitations, which include no file-based statistics (such as the number and size of files archived and recalled) and no way to limit statistics to a specific time period. Additionally:

  • No statistics available before OneFS 9.4 upgrade for existing CloudPools accounts since isi statistics wasn’t tracking metrics in prior releases.
  • CloudPools statistics does not include cloud object cache stats. However, these can be displayed as follows:
# isi statistics query history --keys cluster.cloudpools.object_cache.stats

Or via the ‘isi_test_cpool_stats’ CLI command, for example:

# isi_test_cpool_stats -Q --objcache
------------------------------------------------
Object Cache Counters
------------------------------------------------
 object_cache_hits: 22
 object_cache_misses: 2
 object_cache_overwrite: 0
 object_cache_drop: 0
 header_cache_drop: 0
 data_cache_drop: 0
 data_cache_timeout: 0
 header_cache_timeout: 0
 object_cache_bypass: 0
{{ cmo_cache_hits: 0}}
{{ data_cache_hits: 22}}
{{ header_cache_hits: 0}}
{{ data_cache_range_hits: 22}}
{{ data_cache_range_misses: 0}}
------------------------------------------------

OneFS and Self-encrypting Drives

Self-encrypting drives (SEDs) are secure storage devices which transparently encrypt all on-disk data using an internal key and a drive access password. OneFS uses nodes populated with SED drives to provide data-at-rest encryption (DARE), thereby preventing unauthorized data access. Encrypting data at rest ensures that the data is protected from theft or other malicious activity in the event drives or nodes are removed from a PowerScale cluster. Ensuring that data is encrypted when stored is mandated by federal and many industry regulations, and OneFS DARE satisfies a number of compliance requirements, including U.S. Federal FIPS 140-2 Level 2 and PCI-DSS v2.0 section 3.4.

All data that is written to a DARE cluster is automatically encrypted the moment it is written and decrypted when it is read. The stored data is encrypted with a 256-bit AES data encryption key (DEK), and OneFS controls data access by combining the drive authentication key (AK) with the data encryption keys.

OneFS supports data-at-rest encryption using SEDs across all-flash SSD nodes, as well as HDD-based hybrid and archive platforms. However, all nodes in a DARE cluster must be of the self-encrypting drive type, and mixed SED and non-SED node clusters are not supported.

SEDs have the ability to be ‘locked’ or ‘unlocked’ over configurable ranges of logical block addresses (LBAs) known as ‘bands’. The bands on OneFS storage drives cover the /ifs and reserved partition, but leave a small amount at the beginning and end of the drive unlocked for the partition table. When OneFS formats a drive, isi_sed ‘takes ownership’ of it, which refers to setting a password on the drive and storing it in the keystore. Similarly, ‘releasing ownership’ refers to resetting the drive back to a known password: the MSID, which is a password provided by the manufacturer and can be read from the drive itself. Releasing ownership means that isi_sed will be able to use the MSID to authenticate to the drive and take ownership of it again if need be. It’s worth noting that changing these passwords changes the drive’s internal encryption key, which will scramble all data on the drive.

New PowerScale encryption nodes, and the SED drives they contain, initially arrive in an ‘unowned’, factory-fresh state, with encryption disabled and no encryption keys present on the drives or node. On initialization, a randomized internal drive encryption key is first generated by the drive’s embedded encryption hardware. This key is used by the drive hardware to encrypt all incoming data before writing it to disk, and to decrypt any disk data being read by the node.

Next, a drive control key or drive access password is generated via the OneFS key manager process. This password is used each time the drive is accessed by the node. Without the password, the drive is completely inaccessible. With encryption now configured, the drive is in a secure, owned state and is ready to be formatted.

The data on self-encrypting drives is rendered inaccessible in the following conditions:

Event Description
Smartfail When a self-encrypting drive is smartfailed, drive authentication keys are deleted, making the drive unreadable. When you smartfail and then remove a drive, it is cryptographically erased. NOTE: Smartfailing a drive is the preferred method for removing a self-encrypting drive.
Power loss When a self-encrypting drive loses power, the drive locks to prevent unauthorized access. When power is restored, data is again accessible when the appropriate drive authentication key is provided.
Network loss When a cluster using external key management loses network connection to the external key management server, the drives are locked until the network connection is restored.
Password Loss If a SED drive’s internal key or drive access password is lost, the drive data is rendered permanently inaccessible, and the drive must be reset and reformatted in order to be repurposed.

If a SED drive is tampered with, for example by interrupting the formatting process or removing the drive from a powered-on node, the node will automatically delete its drive access password from the keystore database where the drive access passwords are stored. If the internal drive key, drive access password, or both are lost or deleted, all of the data on the drive becomes permanently inaccessible and unreadable. This process is referred to as cryptographic erasure, as the data still exists, but can’t be decrypted. The drive is subsequently unusable, and it must be manually reverted to the unowned state by using its Physical Security ID (PSID). The PSID is a unique, static, 32-character key that is embedded in each drive at the factory. PSIDs are printed on the drive’s label, and can be retrieved only by physically removing the drive from the node and reading its label. After the PSID is entered in the OneFS command-line interface at the manual reversion prompt, all of the drive data is deleted and the SED drive is returned to an unowned state.

SEDs include a few additional isi_drive_d user states, as compared to regular drives.

SED State Description
SED_ERROR OneFS could not unlock or use the drive, typically because of a bad password or drive communication error.
ERASE The drive finished a smartfail, but OneFS was unable to release the drive, so just deleted its password.
INSECURE The drive isn’t owned by the node, but was unlocked and would have otherwise gone to ACTIVE.

Generally, these will only be reported in error cases. SEDs that are working properly and unlocked should behave like any other drives, and when running and in-use will show up as HEALTHY. That said, the most common error state is drives that show up as SED_ERROR. However, this just indicates that drive_d encountered anything other than SUCCESS or DRIVE_UNOWNED when attempting to unlock the drive.

To help debug a SED issue, determine whether the drive is currently owned (non-MSID password) or unowned (default MSID password) by testing with the ‘isi_sed drivedisplay’ CLI command. If the drive is unowned, or the password does not work, you’ll likely need to PSID revert the drive.

Syntax Description
isi_sed drivedisplay <drive> Displays the drive’s current state of ownership. This is often the most directly helpful, since it should be obvious based on this whether error states are legitimate or not. If the command succeeds, you should be told either: drive is unowned, drive owned by this node, drive owned by another node.

·         ‘Drive owned by another node’ is the error case to expect if the node had a key and lost it, or if drives have been moved from node to node. This means the drive is locked and the node does not have a password – you’ll have to revert the drive or restore the lost passwords.

·         ‘Drive owned by this node’ is the expected state in most situations, meaning the node holds a working password for the drive in its keystore.

·         ‘Drive unowned’ can be either an expected state or an error case. Drives that are released or reverted should be in this state; drives that have been formatted and are in use should not. This means the drive is using the MSID (factory default) password.

isi_sed release <drive> Releases ownership of this drive. This will set the drive’s password back to the default, the drive’s MSID. This can be helpful for cases such as moving drives around – this will release a drive so that any other node/isi_sed can take ownership of the drive.
isi_sed revert <drive> This command is useful for lost passwords, etc. The revert operation will allow you to enter the PSID found on the drive, and with that reset the drive to a factory-fresh MSID-unlocked state. Another way to do this is to attempt an “isi dev -a format -d :[baynum of drive to revert]” – you will be prompted for the PSID, and drive_d will attempt to do the revert for you.

A healthy SED drive can easily be securely erased by simply Smartfailing the drive. Once the Smartfail process has completed, the node deletes the drive access password from the keystore and the drive deletes its internal encryption key. At this point, the data is inaccessible and is considered cryptographically erased, and the drive is reset back to its original ‘unowned’ state. The drive can then be reused once a new encryption key has been generated, or safely returned to Dell, without any risk of the vendor or a third party accessing the data.

In order to ensure that data on a SED is unreadable, during a successful Smartfail OneFS cryptographically erases data by changing the DEK and blocks read and write access to existing data by removing the access key. However, if the drive fails to respond during Smartfail and OneFS cannot perform cryptographic erasure, access to the drive’s data is still blocked by deleting the access key.

Smartfail State   DEK erased and reset   AK erased and reset   Cryptographic erasure   Data inaccessible
Replace           Yes                    Yes                   Yes                     Yes
Erase             No                     Yes                   No                      Yes

For a defective SED drive, the completion of the Smartfail process prompts the node to delete the drive access password from the keystore.

To erase all SED drives in a single node that is being removed from a cluster, smartfail the node from the cluster. All drives will be automatically released and cryptographically erased by the node when the smartfail process completes.

On completion of the Smartfail process, a node automatically reboots to the configuration wizard. The /var/log/isi_sed log will contain a ‘release_ownership’ message for each drive as it goes through the Smartfail process, confirming it is in a ‘replace’ state. For example:

2022-05-25T22:45:56Z <1.6>H400-SED-4 isi_sed[46365]: Command: release_ownership, drive bays: 1

2022-06-25T22:46:39Z <1.6>H400-SED-4 isi_sed[46365]: Bay 1: Dev da1, HITACHI H5SMM328 CLAR800, SN 71V0G6SX, WWN 5000cca09c00d57f: release_ownership: Success

The following CLI command can be used to cryptographically erase a SED by smartfailing the drive, in this case drive 10 on node 1:

# isi devices -a smartfail -d 1:10

The status of an ‘owned’ drive, in this case /dev/da10, is reported as such:

# isi_sed drive da10 -v

Drive Status: Bay 10: WWN 5000c50056252af4
Drive Model SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04
Key      Key      MSID     Drive            Drive
Exists   Works    Works    State            Status
=======  =======  =======  ===============  ===============
Yes      Yes      No       OWNED            OWNED
Drive owned by node
Bay 10: Dev da10, SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04, WWN 5000c50056252af4: drive_display: Success

This drive can easily be ‘released’ as follows:

# isi_sed release da10 -v

Drive Status: Before release_ownership (Restore drive to factory default state):
Drive Status: Bay 10: WWN 5000c50056252af4
Drive Model SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04
Key      Key      MSID     Drive            Drive
Exists   Works    Works    State            Status
=======  =======  =======  ===============  ===============
Yes      Yes      No       OWNED            OWNED
Drive owned by node
Bay 10: Dev da10, SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04, WWN 5000c50056252af4: release_ownership: Success

# isi_sed drive da10 -v

Drive Status: Bay 10: WWN 5000c50056252af4

Drive Model SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04

Key      Key      MSID     Drive            Drive

Exists   Works    Works    State            Status

=======  =======  =======  ===============  ===============

No       No       Yes      UNOWNED          UNOWNED

Fresh unowned drive

Firmware Port Lock: Enabled, AutoLock: On Power Loss




              Auth Keys:               ReadLock   WriteLock  Auto   LBA       LBA

              MSID Curr Futr Chng Unkn Enb  Set   Enb  Set   Lock   Start     Size

============  ======================== ========== ========== =====  ========= =========

SID           Y    --   --   --

EraseMaster   Y    --   --   --

BandMaster0   Y    --   --   --        N    N     N    N     N      0         0

BandMaster1   Y    --   --   --        N    N     N    N     N      0         0

BandMaster2   Y    --   --   --        N    N     N    N     N      0         0

BandMaster3   Band Disabled

BandMaster4   Band Disabled

BandMaster5   Band Disabled

BandMaster6   Band Disabled

BandMaster7   Band Disabled

BandMaster8   Band Disabled

BandMaster9   Band Disabled

BandMaster10  Band Disabled

BandMaster11  Band Disabled

BandMaster12  Band Disabled

BandMaster13  Band Disabled

BandMaster14  Band Disabled

Bay 10: Dev da10, SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04, WWN 5000c50056252af4: drive_display: Success

Similarly, all of the SEDs in a node can be erased by Smartfailing the entire node, in this case node 2:

# isi devices -a smartfail -d <node>

The ‘isi_reformat_node’ CLI tool can be used to either reimage or reformat a single node or an entire cluster, thereby erasing all of its SED drives. Either a reformat or reimage will first ‘release’ the drives and then delete the node keystore. Even if a drive fails to properly release, it will still be cryptographically erased, since its drive access password is deleted along with the rest of the keystore during the process. However, note that any SED drives in nodes destined for redeployment elsewhere which are currently in an unreleased state must be manually reverted using their PSID before they can be used again.

# isi_reformat_node

The node will be automatically formatted. To erase all of the SEDs in an entire cluster, log in to each individual node as root and issue the above ‘isi_reformat_node’ command.

A drive that has been cryptographically erased can be verified as follows. First, use the ‘isi_drivenum’ CLI command to display the device names of the cluster’s drives. For example:
# isi_drivenum

Bay  1   Unit 0      Lnum 30    Active      SN:9VNX0JA02433     /dev/da1

Bay  2   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A

Bay  A0   Unit 13     Lnum 17    Active      SN:0BHHH2TF         /dev/da14

Bay  A1   Unit 29     Lnum 1     Active      SN:0BHHHJRF         /dev/da30

Bay  A2   Unit 1      Lnum 29    Active      SN:0BHHH73F         /dev/da2

Bay  A3   Unit 16     Lnum 14    Active      SN:0BHHDL6F         /dev/da17

Bay  A4   Unit 2      Lnum 28    Active      SN:0BHHH7VF         /dev/da3

Bay  A5   Unit 17     Lnum 13    Active      SN:0BHHDYNF         /dev/da18

Bay  B0   Unit 30     Lnum 0     Active      SN:0BHKUBNH         /dev/da31

Bay  B1   Unit 14     Lnum 16    Active      SN:0BHHEBVF         /dev/da15

Bay  B2   Unit 18     Lnum 12    Active      SN:0BHDH7JF         /dev/da19

Bay  B3   Unit 3      Lnum 27    Active      SN:0BHHE6VF         /dev/da4

Bay  B4   Unit 19     Lnum 11    Active      SN:0BHDH9VF         /dev/da20

Bay  B5   Unit 4      Lnum 26    Active      SN:0BHHEEEF         /dev/da5

Bay  C0   Unit 15     Lnum 15    Active      SN:0BHHDLMF         /dev/da16

Bay  C1   Unit 26     Lnum 4     Active      SN:0BHHDNUF         /dev/da27

Bay  C2   Unit 5      Lnum 25    Active      SN:0BHHDL2F         /dev/da6

Bay  C3   Unit 20     Lnum 10    Active      SN:0BHHDKTF         /dev/da21

Bay  C4   Unit 6      Lnum 24    Active      SN:0BHHHGVF         /dev/da7

Bay  C5   Unit 21     Lnum 9     Active      SN:0BHHH4XF         /dev/da22

Bay  D0   Unit 27     Lnum 3     Active      SN:0BHHDKYF         /dev/da28

Bay  D1   Unit 11     Lnum 19    Active      SN:0BHHH9EF         /dev/da12

Bay  D2   Unit 22     Lnum 8     Active      SN:0BHHDL4F         /dev/da23

Bay  D3   Unit 7      Lnum 23    Active      SN:0BHHDWEF         /dev/da8

Bay  D4   Unit 23     Lnum 7     Active      SN:0BHHDSXF         /dev/da24

Bay  D5   Unit 8      Lnum 22    Active      SN:0BHHDKVF         /dev/da9

Bay  E0   Unit 12     Lnum 18    Active      SN:0BHHH9PF         /dev/da13

Bay  E1   Unit 28     Lnum 2     Active      SN:0BHHHGEF         /dev/da29

Bay  E2   Unit 9      Lnum 21    Active      SN:0BHHHAAF         /dev/da10

Bay  E3   Unit 24     Lnum 6     Active      SN:0BHHE06F         /dev/da25

Bay  E4   Unit 10     Lnum 20    Active      SN:0BHHDZTF         /dev/da11

Bay  E5   Unit 25     Lnum 5     Active      SN:0BHHDRBF         /dev/da26

Note that the drive device names are displayed in the format ‘/dev/da#’, where ‘#’ is a number. Only the ‘da#’ portion is needed for the isi_sed CLI syntax.

For example, to query the state of SED drive ‘da10’:

# isi_sed drive da10

Note that this command may take 30 seconds or longer to complete.
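To check every drive in a node rather than a single device, the device names reported by ‘isi_drivenum’ can be fed to ‘isi_sed’ in a loop. This is a sketch only, which assumes the output format shown above, and it may take several minutes since each query can take 30 seconds or more:

for dev in `isi_drivenum | awk '{print $NF}' | grep '^/dev/da' | sed 's|/dev/||'`; do
    echo "=== $dev ==="
    isi_sed drive $dev
done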

Finally, the data on the drive has been cryptographically erased if the ‘Drive State’ and ‘Drive Status’ columns display a status of UNOWNED, and ‘Fresh unowned drive’ appears in the line below the table. The drive has been reset and its internal encryption key has been destroyed, cryptographically erasing the drive. For example:

# isi_sed drive da10

Drive Status: Bay 10: WWN 5000c50056252af4

Drive Model SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04

Key      Key      MSID     Drive            Drive

Exists   Works    Works    State            Status

=======  =======  =======  ===============  ===============

No       No       Yes      UNOWNED          UNOWNED

Fresh unowned drive

Bay 10: Dev da10, SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04, WWN 5000c50056252af4: drive_display: Success

If the Drive State and Drive Status columns display a status of AUTH FAILED, this indicates that the drive password (AK) is no longer present in the node keystore. For example:

# isi_sed drive da10

Drive Status: Bay 10: WWN 5000c50056252af4

Drive Model SEAGATE ST330006CLAR3000, SN 71V0G6SX 00009330KYE04

Key      Key      MSID     Drive            Drive

Exists   Works    Works    State            Status

=======  =======  =======  ===============  ===============

No       No       Yes      AUTH FAILED      AUTH FAILED

Since the password is not stored anywhere else, the drive is now inaccessible until it is manually reverted.

If a drive is removed from a running node, OneFS automatically assumes that the drive has failed and initiates the Smartfail process. If the drive is reinserted before the Smartfail process completes, the ‘add’ and ‘stopfail’ commands can be run manually in order to bring the drive back online and return it to a healthy state. However, if the Smartfail process has completed before the drive is reinserted, running the stopfail command will be ineffective, since the drive access password for the removed drive has been deleted from the node’s keystore and the drive is considered cryptographically erased.

However, if the drive is reinserted and added back to the cluster after it has been smartfailed, OneFS will report it as being in the SED_ERROR state because the drive still contains encrypted data but the drive access password no longer exists in the node’s keystore. Although the data on the drive is inaccessible, the drive can be reverted to an unowned state by using its PSID. At this point, the drive can then be reused.

When necessary, a SED drive can be cryptographically erased and reset to a factory-fresh state, either by issuing it the ‘release_ownership’ command, or by sending the ‘revert_factory_default’ command. For example, using drive /dev/da10:

# isi_sed release_ownership da10

Or:

# isi_sed revert_factory_default da10

The release command requires the drive password in order to run, whereas the revert command requires the drive’s physical PSID. If the drive password is still known and functional, the node can release the drive after the Smartfail process completes, or during a node reimage, without requiring manual intervention. If the drive password is lost or no longer functional, the revert command must be used instead, and the PSID must be entered manually.

If a SED drive becomes inaccessible for any reason, such as mishandling, malfunction, intentional or accidental release/revert, or loss of the data access password, the drive data cannot be retrieved. Traditional data recovery techniques, such as direct media access and platter extraction, are ineffective on a SED drive since the data is encrypted, and the encryption key cannot be extracted from the drive hardware.

Performance-wise, there is no significant difference in read or write performance between SED and non-SED drives. All data encryption and decryption is done at line speed by dedicated AES encryption hardware that is embedded in the drive.

Format times for SED nodes may vary, but 90 minutes or more is the average for most 4TB SED nodes. The larger the drives, the longer the format process will take to complete. SED nodes take much longer to format than nodes with regular drives, because each drive must be fully overwritten with random data as part of the encryption initialization process. This is an industry-standard step in all full-disk encryption processes that is necessary to help secure the encrypted data against brute-force attacks on the encryption key, and this step cannot be skipped.

OneFS provides drive formatting progress information, which is displayed as a completion percentage for each drive.

It is important to avoid interrupting a formatting process on a SED node. Inadvertently doing so will immediately make all the drives in the node unusable, necessitating a manual revert for each individual drive using its PSID, before the format process can be restarted.

# isi_sed revert /dev/da1

Bear in mind that this can be a somewhat cumbersome process, which can take several hours.

OneFS Cbind and DNS Caching

OneFS cbind is the distributed DNS cache daemon for a PowerScale cluster. As such, its primary role is to accelerate domain name lookups on the cluster, particularly for NFS workloads, which can frequently involve a large number of lookup requests, especially when using netgroups. Cbind itself is logically separated into two halves:

Component Description
Gateway cache The entries a node refreshes from the DNS server.
Local cache The entries a node refreshes from the Gateway node.

Cbind’s architecture helps to distribute the cache and associated DNS workload across all nodes in the cluster, and the daemon runs as a OneFS service under the purview of MCP and the /etc/mcp/sys/services/isi_cbind_d control script:

# isi services -a | grep -i bind

   isi_cbind_d          Bind Cache Daemon                        Enabled

On startup the cbind daemon, isi_cbind_d, reads its configuration from the cbind_config.gc gconfig file. If needed, configuration changes can be made using the ‘isi network dnscache’ or ‘isi_cbind’ CLI tools.

The cbind daemon also supports multi-tenancy across the cluster, with each tenant’s groupnet being allocated its own completely independent DNS cache, and multiple client interfaces to separate DNS requests from different groupnets. Cbind uses the 127.42.x.x address range and can be accessed by client applications across the entire range. The lower 16 bits of the address are set by the client to the groupnet ID for the query. For example, if the client is trying to query the DNS servers on the groupnet with ID 5, it will send the DNS query to 127.42.0.5.
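This addressing scheme can be exercised directly. For example, a name can be resolved against a specific groupnet’s cache by pointing a standard resolver at the corresponding 127.42.0.x address. This is a sketch; the hostname and groupnet ID are illustrative, and it assumes the nslookup utility is available on the node:

# nslookup somehost.example.com 127.42.0.5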

Under the hood, the cbind daemon comprises two DNS query/response containers, or ‘stallsets’:

Component Description
DNS stallset The DNS stallset is a collection of DNS stalls which encapsulate a single DNS server and a list of DNS queries which have been sent to the DNS servers and are waiting for a response.
Cluster stallset The cluster stallset is similar to the DNS stallset, except the cluster stalls encapsulate the connection to another node in the cluster, known as the gateway node. It also holds a list of DNS queries which have been forwarded to the gateway node and are waiting for a response.

Contained within a stallset are the stalls themselves, which store the actual DNS requests and responses. The DNS stallset provides a separate stall for each DNS server that cbind has been configured to use, and requests are handled via a round-robin algorithm. Similarly, for the cluster stallset, there is a stall for each node within the cluster. The index of the cluster stallset is the gateway node’s (devid – 1).

The cluster stallset entry for the node that is running the daemon is treated as a special case, known as ‘L1 mode’, because the gateway for these DNS requests is the node executing the code. Requests on the gateway stall also have an entry on the DNS stallset representing the request to the external DNS server. All other actively participating cluster stallset entries are referred to as ‘L2+L1’ mode. However, if a node cannot reach DNS, it is moved to L2 mode to prevent it from being used by the other nodes. An associated log entry is written to /var/log/isi_cbind_d.log, of the form:

isi_cbind_d[6204]: [0x800703800]bind: Error sending query to dns:10.21.25.11: Host is down

In order to support large clusters, cbind uses a consistent hash to determine the gateway node to cache a request and the appropriate cluster stallset to use. This consistent hashing algorithm, which decides on which node to cache an entry, is designed to minimize the number of entry transfers as nodes are added/removed, while also reducing the number of threads and UDP ports used. To illustrate cbind’s consistent hashing, consider the following three node cluster:

In this scenario, when the cbind service on Node 3 becomes active, one third each of the gateway cache from node 1 and 2 respectively gets transferred to node 3. Similarly, if node 3’s cbind service goes down, its gateway cache is divided equally between nodes 1 and 2. For a DNS request on node 3, the node first checks its local cache. If the entry is not found, it will automatically query the gateway (for example, node 2). This means that even if node 3 cannot talk to the DNS server directly, it can still cache the entries from a different node.

So, upon startup, a node’s cbind process attempts to contact, or ‘ping’, the DNS servers. Once a reply is received, the cbind moves into an up state and notifies GMP that the isi_cbind_d service is running on this node. GMP, in turn, then informs the cbind processes across the rest of the cluster that the node is up and available.

Conversely, after several DNS requests to an external server fail for a given node, or its isi_cbind_d process is terminated, GMP is notified that the isi_cbind_d service is down on that node. GMP then notifies the cluster that the node is down. When the cbind process on another node receives this notification, its consistent hash algorithm is updated to reflect that the node is down. The cluster stallset is not informed of this change; instead, the DNS requests that have changed gateways will eventually time out and be deleted.

As such, the cbind request and response processes can be summarized as follows:

  1. A client on the node sends a DNS query on the additional loopback address 127.42.x.x which is received by cbind.
  2. The cbind daemon uses the consistent hash algorithm to calculate the gateway value of the DNS query and uses the gateway to index the cluster stallset.
  3. If there is a cache hit, a response is sent to the client and the transaction is complete.
  4. Otherwise, the DNS query is placed in the cluster stallset using the gateway as the index. If this is the gateway node then the request is sent to the external DNS server, otherwise the DNS request is forwarded to the gateway node.
  5. When the DNS server or gateway replies, another thread receives the DNS response and matches it to the query on the list. The response is forwarded to the client and the cluster stallset is updated.

Similarly, when a request is forwarded to the gateway node:

  1. The cbind daemon receives the request, calculates the gateway value of the DNS query using the consistent hash algorithm, and uses the gateway to index the cluster stallset.
  2. If there is a cache hit, a response is returned to the remote cbind process and the transaction is complete.
  3. Otherwise, the DNS query is placed in the cluster stallset using the gateway as the index and the request is sent to the external DNS server.
  4. When the DNS server or gateway returns, another thread receives the DNS response and matches it to the query on the list. The response is forwarded to the calling node and the cluster stallset is updated.

If necessary, cbind DNS caching can be enabled or disabled via the ‘isi network groupnets’ command set, allowing the cache to be managed per groupnet:

# isi network groupnets modify --id=<groupnet-name> --dns-cache-enabled=<true/false>

The global ‘isi network dnscache’ command set can be useful for inspecting the cache configuration and limits:

# isi network dnscache view

Cache Entry Limit: 65536

  Cluster Timeout: 5

      DNS Timeout: 5

    Eager Refresh: 0

   Testping Delta: 30

  TTL Max Noerror: 3600

  TTL Min Noerror: 30

 TTL Max Nxdomain: 3600

 TTL Min Nxdomain: 15

    TTL Max Other: 60

    TTL Min Other: 0

 TTL Max Servfail: 3600

 TTL Min Servfail: 300

 The following table describes these DNS cache parameters, which can be manually configured if desired.

Setting Description
TTL No Error Minimum Specifies the lower boundary on time-to-live for cache hits (default value=30s).
TTL No Error Maximum Specifies the upper boundary on time-to-live for cache hits (default value=3600s).
TTL Non-existent Domain Minimum Specifies the lower boundary on time-to-live for nxdomain (default value=15s).
TTL Non-existent Domain Maximum Specifies the upper boundary on time-to-live for nxdomain (default value=3600s).
TTL Other Failures Minimum Specifies the lower boundary on time-to-live for non-nxdomain failures (default value=0s).
TTL Other Failures Maximum Specifies the upper boundary on time-to-live for non-nxdomain failures (default value=60s).
TTL Lower Limit For Server Failures Specifies the lower boundary on time-to-live for DNS server failures(default value=300s).
TTL Upper Limit For Server Failures Specifies the upper boundary on time-to-live for DNS server failures (default value=3600s).
Eager Refresh Specifies the lead time to refresh cache entries that are nearing expiration (default value=0s).
Cache Entry Limit Specifies the maximum number of entries that the DNS cache can contain (default value=65536 entries).
Test Ping Delta Specifies the delta for checking the cbind cluster health (default value=30s).

 Also, if necessary, the cache can be globally flushed via the following CLI syntax:

# isi network dnscache flush -v

Flush complete.

OneFS also provides the ‘isi_cbind’ CLI utility, which can be used to communicate with the cbind daemon. This utility supports regular CLI syntax, plus an interactive mode where commands are prompted for. Interactive mode can be entered by invoking the utility without an argument, for example:

# isi_cbind

cbind:

cbind: quit

#

The following command options are available:

# isi_cbind help

        clear           - clear server statistics

        dump            - dump internal server state

        exit            - exit interactive mode

        flush           - flush cache

        quit            - exit interactive mode

        set             - change volatile settings

        show            - show server settings or statistics

        shutdown        - orderly server shutdown

An individual groupnet’s cache can be flushed as follows, in this case targeting the ‘client1’ groupnet:

# isi_cbind flush groupnet client1

Flush complete.

Note that all the cache settings are global and, as such, will affect all groupnet DNS caches.

The cache statistics are available via the following CLI syntax, for example:

# isi_cbind show cache

  Cache:

    entries:                 10         - entries installed in the cache

    max_entries:            338         - entries allocated, including for I/O and free list

    expired:                  0         - entries that reached TTL and were removed from the cache

    probes:                 508         - count of attempts to match an entry in the cache

    hits:                   498 (98%)   - count of times that a match was found

    updates:                  0         - entries in the cache replaced with a new reply

    response_time:     0.000005         - average turnaround time for cache hits

These cache stats can be cleared as follows:

# isi_cbind clear cache
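A quick way to confirm that a flush has taken effect is to clear the counters, flush the cache, and then re-check the entry and hit counts. This sketch uses only the commands described in this section:

# isi_cbind clear cache
# isi network dnscache flush -v
# isi_cbind show cache | grep -E 'entries|hits'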

Similarly, the DNS statistics can be viewed with the ‘show dns’ argument:

# isi_cbind show dns

  DNS server 1: (dns:10.21.25.10)

    queries:                862         - queries sent to this DNS server

    responses:              862 (100%)  - responses that matched a pending query

    spurious:             17315 (2008%) - responses that did not match a pending query

    dropped:              17315 (2008%) - responses not installed into the cache (error)

    timeouts:                 0 (  0%)  - times no response was received in time

    response_time:     0.001917         - average turnaround time from request to reply

  DNS server 2: (dns:10.21.25.11)

    queries:                861         - queries sent to this DNS server

    responses:              860 ( 99%)  - responses that matched a pending query

    spurious:             17314 (2010%) - responses that did not match a pending query

    dropped:              17314 (2010%) - responses not installed into the cache (error)

    timeouts:                 1 (  0%)  - times no response was received in time

    response_time:     0.001603         - average turnaround time from request to reply


When running isi_cbind_d, the following additional options are available and can be invoked with the corresponding CLI flags:

Option Flag Description
Debug -d Set debug flag(s) to log specific components. The flags are a comma-separated list drawn from the following components:

all     Log all components.

cache   Log information relating to cache data.

cluster  Log information relating to cluster data.

flow    Log information relating to flow data.

lock    Log information relating to lock data.

link    Log information relating to link data.

memory  Log information relating to memory data.

network  Log information relating to network data.

refcount  Log information relating to cache object refcount data.

timing  Log information relating to cache timing data.

external   Special debug option to provide off-node DNS service.

Output -f The daemon will not detach from the controlling terminal and will print debugging messages to stderr.
Dump to -D Target file for isi_cbind dump output.
Port -p Uses specified port instead of default NS port of 53.

The isi_cbind_d process logs messages to syslog or to stderr, depending on the daemon’s mode. The log level can be changed by sending it a SIGUSR2 signal, which will toggle the debug flag to maximum or back to the original setting. For example:

# kill -USR2 `cat /var/run/isi_cbind_d.pid`
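For a quick debug capture, the flag can be toggled up, a short log sample taken, and then toggled back again. This is a sketch; the sample size is arbitrary:

# kill -USR2 `cat /var/run/isi_cbind_d.pid`
# tail -n 50 /var/log/isi_cbind_d.log
# kill -USR2 `cat /var/run/isi_cbind_d.pid`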

Also, when troubleshooting cbind, the following files can provide useful information:

File Description
/var/run/isi_cbind_d.pid the pid of the currently running process
/var/crash/isi_cbind_d.dump output file for internal state and statistics
/var/log/isi_cbind_d.log syslog output file
/etc/gconfig/cbind_config.gc configuration file
/etc/resolv.conf bind resolver configuration file

Additionally, the internal state data of isi_cbind_d can be dumped to a file specified with the -D option, described in the table above.

Astute observers will also notice the presence of an additional loopback address at 127.42.0.1:

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128 zone 1
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4 zone 1
        inet 127.0.0.1 netmask 0xff000000 zone 1
        inet 127.42.0.1 netmask 0xffff0000 zone 1
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
# grep 127 /etc/resolv.conf
nameserver 127.42.0.1
# sockstat | grep "127.42.0.1:53"

root     isi_cbind_ 4078  7  udp4   127.42.0.1:53         *:*

This entry is used to ensure that outbound DNS queries are intercepted by cbind, which then either answers them from its cache or reaches out to the DNS servers, based on the groupnet configuration. Standard outbound lookups use the default groupnet, and authentication (Auth) lookups are forwarded to the appropriate groupnet’s DNS servers.

OneFS NANON

Another functionality enhancement that debuts in the OneFS 9.4 release is increased support for clusters with partial front-end connectivity. In OneFS parlance, these are known as NANON clusters, the acronym abbreviating ‘Not All Nodes On Network’. Today, every PowerScale node in the portfolio includes both front-end and back-end network interfaces. Both of a node’s redundant backend network ports, either Ethernet or Infiniband, must be active and connected to the supplied cluster switches at all times, since these form a distributed systems bus and handle all the intra-cluster communication. However, while the typical cluster topology has all nodes connected to all the frontend client network(s), this is not always possible or even desirable. In certain scenarios, there are distinct benefits to not connecting all the nodes to the front-end network.

But first, some background. Imagine an active archive workload, for example. The I/O and capacity requirements of the workload’s active component can be satisfied by an all-flash F600 pool, while the inactive archive data is housed on a pool of capacity-optimized A3000 nodes. In this case, not connecting the A3000 nodes to the front-end network saves on switch ports, isolates the archive data pool from client I/O, and simplifies the overall configuration, while also potentially increasing security.

Such NANON cluster configurations are increasing in popularity, as customers elect not to connect the archive nodes in larger clusters to save cost and complexity, reduce load on capacity-optimized platforms, and create physically secure, air-gapped solutions. The recent introduction of the PowerScale P100 and B100 accelerator nodes also increases a cluster’s front-end connectivity flexibility.

The above NANON configuration is among the simplest of the partially connected cluster architectures. In this example, the deployment consists of five PowerScale nodes with only three of them connected to the network. The network is assumed to have full access to all necessary infrastructure services and client access.

More complex topologies often include separate client and management networks, dedicated replication networks, and multi-tenant or other segregated front-end solutions, and typically fall into the NANOAN, or Not All Nodes On All Networks, category. For example:

The management network can be assigned to Subnet0 on the cluster nodes with a gateway priority of 10 (i.e. the default gateway), and the client network to Subnet1 with a gateway priority of 20. This would route all outbound traffic through the management network. Static routes or source-based routing (SBR) can be configured to direct traffic to the appropriate gateway if issues arise with client traffic routing through the management network.

In this replication topology, nodes 1 through 3 on the source cluster are used for client connectivity, while nodes 4 and 5 on both the source and target clusters are dedicated for SyncIQ replication traffic.

Other more complex examples, such as multi-tenant cluster topologies, can be deployed to support workloads requiring connectivity to multiple physical networks.

The above topology can be configured with a management Groupnet containing Subnet0, and additional Groupnets, each with a subnet, for the client networks. For example:

# isi network groupnets list

ID         DNS Cache Enabled  DNS Search      DNS Servers   Subnets

--------------------------------------------------------------------

Client1    1                  c1.isilon.com   10.231.253.14 subnet1

Client2    1                  c2.isilon.com   10.231.254.14 subnet2

Client3    1                  c3.isilon.com   10.231.255.14 subnet3

Management 1                  mgt.isilon.com  10.231.252.14 subnet0

--------------------------------------------------------------------

Total: 4

Or from the WebUI via Cluster management > Network configuration > External network

The connectivity details of a particular subnet and pool can be queried with the ‘isi network pools status <groupnet.subnet.pool>’ CLI command, which provides details of node connectivity, as well as protocol health and general node state. For example, querying the management groupnet (Management.Subnet0.Pool0) for the six node cluster above, we see that nodes 1-4 are externally connected, whereas nodes 5 and 6 are not:

# isi network pools status Management.subnet0.pool0

Pool ID: Management.subnet0.pool0

SmartConnect DNS Overview:

       Resolvable: 4/6 nodes resolvable

Needing Attention: 2/6 nodes need attention

        SC Subnet: Management.subnet0


Nodes Needing Attention:

              LNN: 5

SC DNS Resolvable: False

       Node State: Up

        IP Status: Doesn't have any usable IPs

 Interface Status: 0/1 interfaces usable

Protocols Running: True

        Suspended: False

--------------------------------------------------------------------------------

              LNN: 6

SC DNS Resolvable: False

       Node State: Up

        IP Status: Doesn't have any usable IPs

 Interface Status: 0/1 interfaces usable

Protocols Running: True

        Suspended: False

There are three core OneFS components that have been enhanced in OneFS 9.4 in order to better support NANON configurations on a cluster. These are:

Name Component Description
Group Management GMP_SERVICE_EXT_CONNECTIVITY Allows GMP (Group Management Protocol) to report the cluster nodes’ external connectivity status.
MCP process isi_mcp Monitors for any GMP changes and, when detected, will try to start or stop the affected service(s) under its control.
SmartConnect isi_smartconnect_d Cluster’s network configuration and connection management service. If the SmartConnect daemon decides a node is NANON, OneFS will log the cluster’s status with GMP.

Here’s the basic architecture and interrelationship of these services:

The GMP external connectivity status is available via the ‘sysctl efs.gmp.group’ CLI command output.

For example, take a three node cluster with all nodes’ front-end interfaces connected:

GMP confirms that all three nodes have external connectivity, as indicated by the new ‘external_connectivity’ field:

# sysctl efs.gmp.group

efs.gmp.group: <79c9d1> (3) :{ 1-3:0-5, all_enabled_protocols: 1-3, isi_cbind_d: 1-3, lsass: 1-3, external_connectivity: 1-3 }

This new external connectivity status is also incorporated into a new ‘Ext’ column in the ‘isi status’ CLI command output, as indicated by a ‘C’ for connected or an ‘N’ for not connected. For example:

# isi status -q

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage

ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size

---+---------------+-----+---+-----+-----+-----+-----------------+-----------

  1|10.219.64.11   | OK  | C |25.9M| 2.1M|28.0M|10.2T/23.2T( 44%)|

  2|10.219.64.12   | OK  | C | 840K| 123M| 124M|10.2T/23.2T( 44%)|

  3|10.219.64.13   | OK  | C | 225M| 466M| 691M|10.2T/23.2T( 44%)|

---+---------------+-----+---+-----+-----+-----+-----------------+-----------

Cluster Totals:              |  n/a|  n/a|  n/a|30.6T/69.6T( 37%)|

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only

           External Network Fields: C = Connected, N = Not Connected

Take the following three node NANON cluster:

GMP confirms that only nodes 1 and 3 are connected to the front-end network. Similarly, the absence of node 2 from the ‘external_connectivity’ field (and the other protocol fields) indicates that this node has no external connectivity:

# sysctl efs.gmp.group

efs.gmp.group: <79c9d1> (3) :{ 1-3:0-5, all_enabled_protocols: 1,3, isi_cbind_d: 1,3, lsass: 1,3, external_connectivity: 1,3 }

Similarly, the ‘isi status’ CLI output reports that node 2 is not connected, denoted by an ‘N’, in the ‘Ext’ column:

# isi status -q

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage

ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size

---+---------------+-----+---+-----+-----+-----+-----------------+-----------

  1|10.219.64.11   | OK  | C | 9.9M|12.1M|22.0M|10.2T/23.2T( 44%)|

  2|10.219.64.12   | OK  | N |    0|    0|    0|10.2T/23.2T( 44%)|

  3|10.219.64.13   | OK  | C | 440M| 221M| 661M|10.2T/23.2T( 44%)|

---+---------------+-----+---+-----+-----+-----+-----------------+-----------

Cluster Totals:              |  n/a|  n/a|  n/a|30.6T/69.6T( 37%)|

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only

           External Network Fields: C = Connected, N = Not Connected

Under the hood, OneFS 9.4 adds a new SmartConnect network module that evaluates whether a node has front-end network connectivity. This module leverages the GMP_SERVICE_EXT_CONNECTIVITY service and, by default, polls the node’s network settings every five minutes. SmartConnect’s evaluation criteria for network connectivity are as follows:

VLAN | VLAN IP | Interface | Interface IP | NIC | Network
(any) | (any) | Up | No | Up | No
(any) | (any) | Up | Yes | Up | Yes
Enabled | Yes | (any) | (any) | Up | Yes
(any) | (any) | (any) | (any) | Down | No
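The raw inputs to this assessment, such as interface status, assigned IP addresses, and VLAN configuration, can also be inspected manually with the standard networking CLI. For example, ‘isi network interfaces list’ reports each node’s front-end interfaces, their status, and any IPs assigned to them:

# isi network interfaces list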

OneFS 9.4 also adds an option to MCP, the master control process, which allows it to prevent certain services from being started if there is no external network. As such, the two services in 9.4 that now fall under MCP’s new NANON purview are:

Service | Daemon | Description
Audit | isi_audit_cee | Auditing of system configuration and protocol access events on the cluster.
SRS | isi_esrs_d | Allows remote cluster monitoring and support through Secure Remote Services (SRS).
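To verify whether these daemons are enabled and running, for instance when checking MCP’s behavior on a node that has lost external connectivity, the service list and process table can be consulted. Note that ‘isi services -a’ includes hidden services in its output:

# isi services -a | egrep 'isi_audit_cee|isi_esrs_d'

# ps -auxw | egrep 'isi_audit_cee|isi_esrs_d' | grep -v egrep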

There are two new MCP configuration tags introduced to control service execution depending on external network connectivity:

Tag | Description
require-ext-network | Delay start of the service if there is no external network connectivity.
stop-on-ext-network-loss | Halt the service if external network connectivity is lost.

These tags are used in the MCP service control scripts, located under /etc/mcp/sys/services. For example, in the SRS script:

# cat /etc/mcp/sys/services/isi_esrs_d

<?xml version="1.0"?>

<service name="isi_esrs_d" enable="0" display="1" ignore="0" options="require-quorum,stop-on-ext-network-loss">

      <isi-meta-tag id="isi_esrs_d">

            <mod-attribs>enable ignore display</mod-attribs>

      </isi-meta-tag>

      <description>ESRS Service Daemon</description>

      <process name="isi_esrs_d" pidfile="/var/run/isi_esrs_d.pid"

               startaction="start" stopaction="stop"

               depends="isi_tardis_d/isi_tardis_d"/>

      <actionlist name="start">

            <action>/usr/bin/isi_run -z 1 /usr/bin/isi_esrs_d</action>

      </actionlist>

      <actionlist name="stop">

            <action>/bin/pkill -F /var/run/isi_esrs_d.pid</action>

      </actionlist>

</service>

This MCP NANON control will be expanded to additional OneFS services over the course of subsequent releases.

When it comes to troubleshooting NANON configurations, the MCP, SmartConnect, and general syslog log files can provide valuable connectivity messages and timestamps. The pertinent log files are:

  • /var/log/messages
  • /var/log/isi_mcp
  • /var/log/isi_smartconnect
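For example, the following searches can help locate recent connectivity state changes and associated service start/stop activity. The patterns shown are illustrative rather than exact log message strings:

# grep -i connect /var/log/isi_smartconnect

# egrep -i 'network|start|stop' /var/log/isi_mcp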