OneFS and QLC Drive Support

Another significant feature of the recent OneFS 9.4 release is support for quad-level cell (QLC) flash media. Specifically, the PowerScale F900 and F600 all-flash NVMe platforms are now available with 15.4TB and 30.7TB QLC NVMe drives.

These new QLC drives offer a balance of capacity, performance, reliability, and affordability, and will be particularly beneficial for workloads such as artificial intelligence, machine learning, and deep learning, as well as for media and entertainment environments.

The details of the new QLC drive options for the F600 and F900 platforms are as follows:

PowerScale node  Chassis specs (per node)     Raw capacity (per node)      Max raw capacity (252-node cluster)
F900             2U with 24 NVMe SSD drives   737.28TB with 30.72TB QLC    185.79PB with 30.72TB QLC
                                              368.64TB with 15.36TB QLC    92.89PB with 15.36TB QLC
F600             1U with 8 NVMe SSD drives    245.76TB with 30.72TB QLC    61.93PB with 30.72TB QLC
                                              122.88TB with 15.36TB QLC    30.96PB with 15.36TB QLC

This means an F900 cluster with the 30.7TB QLC drives can now scale up to a whopping 185.79PB of raw capacity, with performance ramping linearly as nodes are added!
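The capacity figures are simple multiplication: drive bays per node times drive size, then times 252 nodes. The minimal Python sketch below recomputes them, assuming decimal units (1TB = 1,000GB, 1PB = 1,000TB) and truncating cluster totals to two decimal places, which appears to be how the PB values are shown. It is an arithmetic check only, not a sizing tool; usable capacity after data protection overhead will be lower.

```python
# Recompute raw capacity for F600/F900 nodes populated with QLC drives.
# Assumption: decimal units; sizes are kept as integer GB so the math is exact.
MAX_NODES = 252                                   # maximum cluster size quoted in the table
DRIVE_BAYS = {"F900": 24, "F600": 8}              # NVMe drives per node
QLC_DRIVE_GB = {"30.72TB": 30_720, "15.36TB": 15_360}


def node_raw_gb(bays: int, drive_gb: int) -> int:
    """Raw capacity of a single node in GB."""
    return bays * drive_gb


def cluster_raw_gb(bays: int, drive_gb: int, nodes: int = MAX_NODES) -> int:
    """Raw capacity of a cluster of identical nodes in GB."""
    return node_raw_gb(bays, drive_gb) * nodes


def pb_truncated(gb: int) -> float:
    """Convert GB to PB, truncated (not rounded) to two decimal places."""
    return (gb // 10_000) / 100


if __name__ == "__main__":
    for platform, bays in DRIVE_BAYS.items():
        for label, drive_gb in QLC_DRIVE_GB.items():
            node_tb = node_raw_gb(bays, drive_gb) / 1_000
            cluster_pb = pb_truncated(cluster_raw_gb(bays, drive_gb))
            print(f"{platform} with {label} QLC: {node_tb:.2f}TB per node, "
                  f"{cluster_pb:.2f}PB at {MAX_NODES} nodes")
```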

So the new QLC drives double the maximum all-flash capacity compared to previous generations, while providing robust environmental efficiencies in consolidated rack space, power, and cooling. What's more, PowerScale F600 and F900 nodes containing QLC drives can deliver the same level of performance as those with triple-level cell (TLC) drives, thereby delivering vastly superior economics and value. As illustrated below, QLC nodes performed on par with, or slightly better than, TLC nodes in throughput benchmarks and SPEC workloads.

The graphs above show the comparative peak random throughput per node for both QLC and TLC.

QLC-based F600 and F900 nodes can be rapidly and non-disruptively integrated into existing PowerScale clusters, allowing seamless data lake expansion and accommodation of new workloads.

Compatibility-wise, there are a couple of key points to be aware of. Any attempt to add a QLC drive to a non-QLC node, or vice versa, is blocked, and the unsupported drive is flagged with a 'WRONG_TYPE' error. However, QLC and non-QLC nodes will happily coexist in different pools within the same cluster. But any attempt to merge storage node pools with differing media classes fails with the error 'All nodes in the nodepool must have compatible [HDD|SSD] drive technology'.
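The two compatibility rules can be expressed as simple checks. The Python sketch below is purely illustrative: the `Node` type, the function names, and the media-class labels are hypothetical and do not correspond to any OneFS API. Only the 'WRONG_TYPE' result and the pool-merge error text come from the behavior described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical node record; media_class is e.g. "QLC" or "TLC"."""
    name: str
    media_class: str


def check_drive_insert(node: Node, drive_media_class: str) -> Optional[str]:
    """Return 'WRONG_TYPE' if a QLC drive meets a non-QLC node (or vice versa)."""
    if (node.media_class == "QLC") != (drive_media_class == "QLC"):
        return "WRONG_TYPE"
    return None


def check_pool_merge(nodes: List[Node], drive_kind: str = "SSD") -> Optional[str]:
    """Return the merge error if the combined pool would mix media classes."""
    if len({n.media_class for n in nodes}) > 1:
        return (f"All nodes in the nodepool must have compatible "
                f"{drive_kind} drive technology")
    return None


if __name__ == "__main__":
    qlc_node = Node("f600-1", "QLC")
    print(check_drive_insert(qlc_node, "TLC"))  # WRONG_TYPE
    print(check_pool_merge([qlc_node, Node("f600-4", "TLC")]))
```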

In the WebUI, the 'drive details' pop-up window displays 'NVME, SSD, QLC' as the 'Connection and media type'. This can be viewed by navigating to Hardware configuration > Drives and selecting 'View details' for the desired drive.

The WebUI SmartPools summary, available by browsing to Storage pools > SmartPools, also incorporates 'QLC' into the pool name.

The same 'QLC' designation also appears in the 'node pool details' view.

From the OneFS CLI, existing commands that display DSP (drive support package), PSI (platform support infrastructure), and storage pool and node pool information now report a new 'media_class' value, 'QLC'. For example:

# isi storagepool nodepools ls
ID   Name                     Nodes  Node Type IDs  Protection Policy  Manual
-----------------------------------------------------------------------------
1    f600_15tb-ssd-qlc_736gb  1      1              +2d:1n             No
                              2
                              3
-----------------------------------------------------------------------------
Total: 1

# isi storagepool nodetypes ls
ID   Product Name                                 Nodes  Manual
---------------------------------------------------------------
1    F600-1U-Dual-736GB-2x25GE SFP+-15TB SSD QLC  1      No
                                                  2
                                                  3
---------------------------------------------------------------
Total: 1
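When scripting against the CLI, QLC pools can be picked out of the 'isi storagepool nodepools ls' listing. This Python sketch assumes the plain-text table layout shown above (one row per pool, with additional node numbers on continuation lines) and that QLC pools carry 'qlc' in their name; the function names are hypothetical. A production script would be more robust using a structured output format, if the command offers one.

```python
from typing import Dict, List

# Sample text in the layout of the nodepools listing above.
SAMPLE = """\
ID   Name                     Nodes  Node Type IDs  Protection Policy  Manual
-----------------------------------------------------------------------------
1    f600_15tb-ssd-qlc_736gb  1      1              +2d:1n             No
                              2
                              3
-----------------------------------------------------------------------------
Total: 1
"""


def parse_nodepools(text: str) -> Dict[str, List[int]]:
    """Map each node pool name to the list of node numbers in it."""
    pools: Dict[str, List[int]] = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header, separator rules, and the total line.
        if not fields or line.startswith(("ID", "-", "Total")):
            continue
        if line[0].isdigit():            # new pool row: ID, name, first node, ...
            current = fields[1]
            pools[current] = [int(fields[2])]
        elif current is not None and fields[0].isdigit():
            pools[current].append(int(fields[0]))  # continuation: another node
    return pools


def qlc_pools(pools: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Keep only pools whose name marks them as QLC (an assumption)."""
    return {name: nodes for name, nodes in pools.items() if "qlc" in name.lower()}


if __name__ == "__main__":
    print(qlc_pools(parse_nodepools(SAMPLE)))
```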

OneFS 9.4 also introduces new model and vendor class fields, providing a dynamic and extensible way to determine which drive statistics and information to gather, how to capture them, and how to display them, in preparation for future drive technologies. For example:

# isi_radish -a
Bay 0/nvd15 is Dell Ent NVMe P5316 RI U.2 30.72TB FW:0.0.8 SN:BTAC1436043630PGGN, 60001615872 blks
Log Sense data (Bay 0/nvd15 ) –
Supported log pages 0x1 0x2 0x3 0x4 0x5 0x6 0x80 0x81

SMART/Health Information Log
============================
Critical Warning State: 0x00
  Available spare: 0
  Temperature: 0
  Device reliability: 0
  Read only: 0
  Volatile memory backup: 0
Temperature: 297 K, 23.85 C, 74.93 F
Available spare: 100
Available spare threshold: 10
Percentage used: 0
Data units (512,000 byte) read: 1619199
Data units written: 10075777
Host read commands: 67060074
Host write commands: 4461942671
Controller busy time (minutes): 1
Power cycles: 21
Power on hours: 420
Unsafe shutdowns: 18
Media errors: 0
No. error info log entries: 0
Warning Temp Composite Time: 0
Error Temp Composite Time: 0
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1: 0
Total Time For Temperature 2: 0
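The NVMe 'Data units' counters are expressed in units of 512,000 bytes (1,000 blocks of 512 bytes), as the label in the output indicates, so converting them to bytes is a single multiplication. The Python sketch below parses the 'key: value' lines of the SMART/Health section and performs that conversion. The parser is a simple assumption about the text layout, and the function names are hypothetical; duplicate keys (such as the two 'Temperature' lines) keep the last value seen.

```python
DATA_UNIT_BYTES = 512_000  # one NVMe data unit = 1,000 x 512-byte blocks

# Excerpt of the SMART/Health Information Log shown above.
SAMPLE = """\
Data units (512,000 byte) read: 1619199
Data units written: 10075777
Power on hours: 420
Media errors: 0
"""


def parse_smart(text: str) -> dict:
    """Map each 'key: value' line to a stripped string value (last one wins)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def bytes_read(fields: dict) -> int:
    """Total bytes read, derived from the data-units-read counter."""
    return int(fields["Data units (512,000 byte) read"]) * DATA_UNIT_BYTES


def bytes_written(fields: dict) -> int:
    """Total bytes written, derived from the data-units-written counter."""
    return int(fields["Data units written"]) * DATA_UNIT_BYTES


if __name__ == "__main__":
    f = parse_smart(SAMPLE)
    print(f"read {bytes_read(f) / 1e12:.2f}TB, "
          f"written {bytes_written(f) / 1e12:.2f}TB")
```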

Finally, PowerScale F600 and F900 nodes must be running OneFS 9.4 and the latest DSP in order to support QLC drives. A failed QLC drive must be replaced with another QLC drive. Additionally, any attempt to downgrade a QLC node to a OneFS version prior to 9.4 will be blocked.

OneFS SmartSync Management and Diagnostics

In this final blog of the series, we'll look at SmartSync's performance and diagnostic tools, and review some of its idiosyncrasies and its coexistence with other OneFS features.

But first, performance. Unlike SyncIQ, which operates solely on a push model, SmartSync also supports pull replication. This can be an incredibly useful performance option for environments that grow organically. As demand for replication on a source cluster increases, the additional compute and network load needs to be considered. Push replication, especially with multiple targets, can generate significant load on the source cluster, as the CPU graphs in the following graphic show.

In extreme cases, the resources consumed by replication traffic can impact client workloads as data is pushed to the target. On the other hand, enabling a pull replication model for a dataset can drastically reduce CPU utilization on the source cluster by offloading the replication overhead to the target cluster, as the following graphs show.

For single-source dataset environments with numerous targets, pull replication can free up the source cluster's compute and network resources, which can then be devoted to client IO. However, if the target cluster is a capacity-optimized archive cluster without CPU and/or network resources to spare, a pull policy model, rather than the traditional push, may not be an option. In such cases, SmartSync also allows its policies to be limited, or throttled, in order to reduce the system and/or network resource impact of replication. SmartSync throttling comes in two flavors: bandwidth and CPU throttling.

  1. Bandwidth throttling is specified through a set of netmask rules plus a maximum throughput limit, and is configured via the ‘isi dm throttling’ CLI syntax:
# isi dm throttling bw-rules create NETMASK --netmask [subnet] --bw-limit=[bytes]

Bandwidth limits are specified in bytes for a specific subnet and netmask. For example:

# isi dm throttling bw-rules create NETMASK --netmask 10.20.100.0/24 --bw-limit=$((20*1024*1024))

In this case, a bandwidth limit of 20MB (20*1024*1024 = 20,971,520 bytes) is applied to the 10.20.100.0/24 (class C) subnet. The throttling configuration change can be verified as follows:

# isi dm throttling bw-rules list
ID  Rule Type  Netmask         Bw Limit
------------------------------------------
0   NETMASK    10.20.100.0/24  20.00MB
------------------------------------------
Total: 1
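The --bw-limit value is just the byte count produced by the shell arithmetic, and, on the reading assumed here, a rule covers traffic to addresses that fall inside its netmask. A small Python sketch of both calculations using the standard ipaddress module; the function names are hypothetical, and the sketch does not attempt to model how SmartSync resolves overlapping rules.

```python
import ipaddress


def bw_limit_bytes(megabytes: int) -> int:
    """Same arithmetic as the shell's $((20*1024*1024)): binary MB to bytes."""
    return megabytes * 1024 * 1024


def rule_covers(netmask: str, address: str) -> bool:
    """True if address lies inside the rule's subnet, e.g. "10.20.100.0/24"."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(netmask)


if __name__ == "__main__":
    print(bw_limit_bytes(20))
    print(rule_covers("10.20.100.0/24", "10.20.100.57"))
    print(rule_covers("10.20.100.0/24", "10.20.101.1"))
```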


  2. Compute-wise, by default SmartSync policies can consume up to 30% of a node's CPU cycles, provided the node's total CPU usage is below 90%. If the node's total CPU utilization reaches 90%, or if SmartSync's consumption reaches 30% of the total CPU, SmartSync automatically throttles its CPU consumption.

Additional CPU throttling is specified through ‘allowed CPU percentage’ and ‘backoff CPU percentage’ limits, which are also configured via the ‘isi dm throttling’ CLI command syntax.

To specify the share of total node CPU that SmartSync is allowed to use, use the --allowed-cpu-threshold option. To specify the node's overall CPU level at which SmartSync begins to throttle, use the --system-cpu-load-threshold option. For each option, specify the new threshold without the percentage sign. For example, to set the allowed CPU threshold to 20% and the system CPU threshold to 80%:

# isi dm throttling settings view
    Allowed CPU Threshold: 30
System CPU Load Threshold: 90

# isi dm throttling settings modify --allowed-cpu-threshold 20 --system-cpu-load-threshold 80

# isi dm throttling settings view
    Allowed CPU Threshold: 20
System CPU Load Threshold: 80
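The compute throttling trigger described above reduces to two threshold comparisons. This Python model is a simplification for illustration: the field names mirror the CLI option names, but the class and function are hypothetical and say nothing about how aggressively SmartSync actually backs off once throttling starts.

```python
from dataclasses import dataclass


@dataclass
class ThrottleSettings:
    allowed_cpu_threshold: int = 30       # max % of node CPU SmartSync may use
    system_cpu_load_threshold: int = 90   # node-wide CPU % at which SmartSync backs off


def should_throttle(settings: ThrottleSettings,
                    smartsync_cpu_pct: float,
                    node_cpu_pct: float) -> bool:
    """Simplified model: throttle once either threshold is reached."""
    return (smartsync_cpu_pct >= settings.allowed_cpu_threshold
            or node_cpu_pct >= settings.system_cpu_load_threshold)


if __name__ == "__main__":
    defaults = ThrottleSettings()
    tuned = ThrottleSettings(allowed_cpu_threshold=20, system_cpu_load_threshold=80)
    for sm, node in [(25, 70), (30, 70), (10, 90)]:
        print(sm, node, should_throttle(defaults, sm, node),
              should_throttle(tuned, sm, node))
```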

SmartSync performance is also aided by a scalable run-time engine that spans the cluster, spins up threads (fibers) on demand, and uses asynchronous IO to process replication tasks (chunks). Batch operations are used for efficient transfer of small files, attributes, and data blocks. Namespace contention avoidance, efficient snapshot utilization, and separation of dataset creation from transfer are salient design features of both the baseline and incremental sync algorithms. Plus, the availability of a pull transfer model can significantly reduce the impact on a source cluster, if needed.
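To make the batching idea concrete, the following toy Python planner groups small files into batched tasks and splits large files into fixed-size chunks. It is not SmartSync's algorithm: the size thresholds, the plan_tasks name, and the task tuples are hypothetical, and it ignores the concurrent fiber scheduling and asynchronous IO that the real engine uses.

```python
from typing import List, Tuple

FileEntry = Tuple[str, int]  # (path, size in bytes)


def plan_tasks(files: List[FileEntry],
               small_file_max: int = 128 * 1024,
               batch_max_files: int = 64,
               chunk_size: int = 8 * 1024 * 1024) -> list:
    """Group small files into batches and split large files into chunks.

    Returns ("batch", (path, ...)) and ("chunk", (path, offset, length))
    tasks. All thresholds are hypothetical defaults for illustration only.
    """
    tasks = []
    batch: List[str] = []
    for path, size in files:
        if size <= small_file_max:
            batch.append(path)
            if len(batch) == batch_max_files:   # batch full: emit it as one task
                tasks.append(("batch", tuple(batch)))
                batch = []
        else:
            for offset in range(0, size, chunk_size):  # one task per chunk
                tasks.append(("chunk", (path, offset, min(chunk_size, size - offset))))
    if batch:                                   # flush the final partial batch
        tasks.append(("batch", tuple(batch)))
    return tasks


if __name__ == "__main__":
    demo = [("a.txt", 10), ("b.txt", 20), ("big.bin", 250), ("c.txt", 5)]
    for task in plan_tasks(demo, small_file_max=50, batch_max_files=2, chunk_size=100):
        print(task)
```

Batching amortizes per-operation overhead across many tiny files, while chunking lets a single large file be spread across multiple workers.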

On the caveats and considerations front, SmartSync v1 in OneFS 9.4 does have some notable limitations to be cognizant of. Specifically, SmartSync policies do not currently support failover and failback, nor is there an option to allow writes on the target cluster. However, for copy policies, the dataset becomes available for read and write once replication to the target platform is complete, provided the '--copy-createdataset-on-target=false' option is specified. These limitations will be lifted in a future OneFS release. But for now, if required, repeat-copy data on the target platform may be copied out of the SmartSync data mover snapshot.

Other interoperability considerations include:

Component                   Interoperability with SmartSync
ADS/resource forked files   Main file stored; alternate data streams/resource forks skipped when encountered.
Cloud copy-back             Not supported unless the data was created by a OneFS Datamover.
Cloud incrementals          Unsupported for file->object transfers. One-time copy to/from cloud only.
CloudPools                  CloudPools SmartLink stub files are not supported.
Compression                 Compression for replication transfer is not supported.
Failover/failback policy    Failover and failback options are not available, nor is an option to allow writes on the target cluster.
File metadata               POSIX UID, GID, atime, mtime, and ctime are copied.
File name encoding          All encodings are converted to UTF-8.
Hadoop TDE                  SmartSync does not support replication of the TDE domain and keys, rendering TDE-encrypted data on the target cluster inaccessible.
Hard links                  Hard links are not preserved. A file/object is created for each link.
Inline data reduction       Inline compressed and/or deduped data is rehydrated, decompressed, and transferred uncompressed to the target cluster.
Large files (4TB to 16TB)   Supported up to the cloud provider's maximum object size. SmartSync policies only connect with target clusters that also have large file support enabled.
RBAC                        SmartSync administrative access is assigned through the ISI_PRIV_DATAMOVER privilege.
SFSE                        SFSE containerized small files are unpacked on the source cluster before replication.
SmartDedupe                 Deduplicated files are rehydrated back to their original size prior to replication.
SmartLock                   Compliance mode clusters are not supported with SmartSync.
SnapshotIQ                  Tightly integrated; uses snapshots for incrementals and re-baselining.
Sparse files                Sparse regions of files are written out as zeros.
Special files               Skipped when encountered.
Symbolic links              Skipped when encountered.
SyncIQ                      SmartSync and SyncIQ replication happily coexist. An active SyncIQ license is required for both.

When it comes to monitoring and troubleshooting SmartSync, there are a variety of diagnostic tools available. These include:

Component    Tools                                       Issue
Logging      /var/log/isi_dm.log                         General SmartSync info and triage.
             /var/log/messages
             /ifs/data/Isilon_Support/datamover/transfer_failures/baseline_failure_<jobid>
Accounts     isi dm accounts list/view                   Authentication, trust, and encryption.
CloudCopy    S3 browser (e.g., CloudBerry),              Cloud access and connectivity.
             Microsoft Azure Storage Explorer
Dataset      isi dm dataset list/view                    Dataset creation and health.
File system  isi get                                     Inspect replicated files and objects.
Jobs         isi dm jobs list/view                       Job and task execution, auto-pausing, completion, control, and transfer.
             isi_datamover_job_status -jt
Network      isi dm throttling bw-rules list/view        Network connectivity and throughput.
             isi_dm network ping/discover
Policies     isi dm policies list/view                   Copy and dataset policy execution and transfer.
             isi dm base-policies list/view
Service      isi services -a isi_dm_d <enable/disable>   Daemon configuration and control.
Snapshots    isi snapshot snapshots list/view            Snapshot execution and access.
System       isi dm throttling settings                  CPU load and system performance.

SmartSync info and errors are typically written to /var/log/isi_dm.log and /var/log/messages, while DM job transfer failures generate a log specific to the job ID under /ifs/data/Isilon_Support/datamover/transfer_failures.

Once a policy is running, its job status is reported via 'isi dm jobs list'. Once the job completes, its history is available by running 'isi dm historical-jobs list'. More details for a specific job can be gleaned from the 'isi dm jobs view' command, using the pertinent job ID from the list output. Additionally, the 'isi_datamover_job_status' command, with the job ID as an argument, also supplies detailed information about a specific job.

Once running, a DM job can be further controlled via the 'isi dm jobs modify' command, with available actions including cancel, partial-completion, pause, and resume.

If a certificate authority (CA) is not correctly configured on a PowerScale cluster, the SmartSync daemon will not start, even though accounts and policies can still be configured. Be aware that the resulting failed policies are not reported via 'isi dm jobs list' or 'isi dm historical-jobs list', since they never started. Instead, the improperly configured CA is reported in /var/log/isi_dm.log as follows:

Certificates not correctly installed, Data Mover service sleeping: At least one CA must be installed: No such file or directory from dm_load_certs_from_store (/b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/rpc/dm_tls.cpp:197 ) from dm_tls_init (/b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/rpc/dm_tls.cpp:279 ): Unable to load certificate information

Once a CA and identity are correctly configured, the SmartSync service activates automatically. Next, SmartSync attempts a handshake with the target cluster. If the CA or identity is misconfigured, the handshake fails and generates an entry in /var/log/isi_dm.log. For example:

2022-06-30T12:38:17.864181+00:00 GEN-HOP-NOCL-RR-1(id1) isi_dm_d[52758]: [0x828c0a110]: /b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/acct_mon.cpp:dm_acctmon_try_ping:348: [Fiber 3778] ping for account guid: 0000000000000000c4000000000000000000000000000000, result: dead
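The two log signatures above can be searched for programmatically. A minimal Python triage sketch, assuming the message text matches the examples quoted here (the strings may differ between OneFS releases); the triage function is hypothetical. It flags a missing CA and collects the GUIDs of accounts whose ping result is 'dead'.

```python
import re
from typing import Iterable

CA_ERROR = "Certificates not correctly installed"
# Matches e.g. "ping for account guid: 0000...0000, result: dead"
PING_RE = re.compile(
    r"ping for account guid: (?P<guid>[0-9a-fA-F]+), result: (?P<result>\w+)")

# Shortened copies of the two log lines quoted above.
SAMPLE_LOG = [
    "Certificates not correctly installed, Data Mover service sleeping: "
    "At least one CA must be installed",
    "isi_dm_d[52758]: [Fiber 3778] ping for account guid: "
    "0000000000000000c4000000000000000000000000000000, result: dead",
]


def triage(lines: Iterable[str]) -> dict:
    """Scan isi_dm.log lines for a missing CA and for dead account pings."""
    findings = {"ca_missing": False, "dead_accounts": set()}
    for line in lines:
        if CA_ERROR in line:
            findings["ca_missing"] = True
        match = PING_RE.search(line)
        if match and match.group("result") == "dead":
            findings["dead_accounts"].add(match.group("guid"))
    return findings


if __name__ == "__main__":
    print(triage(SAMPLE_LOG))
```

On a cluster, the same function could be pointed at the log file itself, for example with `open("/var/log/isi_dm.log")` passed as the iterable of lines.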

Note that the full handshake error detail is logged if the SmartSync service (isi_dm_d) is set to log at the ‘info’ or ‘debug’ level using isi_ilog:

# isi_ilog -a isi_dm_d --level info+

Valid ilog levels include:

fatal error err notice info debug trace

error+ err+ notice+ info+ debug+ trace+

A copy or repeat-copy policy requires an available dataset for replication before it can run. If a dataset for the same base path has not been successfully created before the copy or repeat-copy policy job starts, the job is paused. In the following example, the base path of the copy policy differs from that of the dataset policy, so the job fails with a "path doesn't match…" error.

# ls -l /ifs/data/Isilon_support/Datamover/transfer_failures
Total 9
-rw-rw----   1 root  wheel  679  June 29 10:56 baseline_failure_10

# cat /ifs/data/Isilon_support/Datamover/transfer_failures/baseline_failure_10
Task_id=0x00000000000000ce, task_type=root task ds base copy, task_state=failed-fatal path doesn't match dataset base path: '/ifs/test' != '/ifs/data/repeat-copy':
from bc_task_initialize_dsh (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/ds_base_copy
from dmt_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/ds_base_copy_root_task
from dm_txn_execute_internal (/b/mnt/src/isilon/lib/isi_dm/isi_dm_base/src/txn.cp
from dm_txn_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm_base/src/txn.cpp:2274)
from dmp_task_spark_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/task_runner.
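Because each transfer-failure file leads with a key/value summary line, the fields that matter for triage can be extracted with a regular expression. The Python sketch below assumes the first-line format shown in the example, with straight single quotes around the paths; the parse_failure name is hypothetical. Pulling out the two quoted paths makes the base-path mismatch explicit.

```python
import re
from typing import Optional

# First line of a baseline_failure_<jobid> file, modeled on the example.
SAMPLE = ("Task_id=0x00000000000000ce, task_type=root task ds base copy, "
          "task_state=failed-fatal path doesn't match dataset base path: "
          "'/ifs/test' != '/ifs/data/repeat-copy':")

RECORD_RE = re.compile(
    r"task_id=(?P<task_id>0x[0-9a-f]+), "
    r"task_type=(?P<task_type>[^,]+), "
    r"task_state=(?P<state>\S+)\s*(?P<message>.*)",
    re.IGNORECASE,  # the file writes "Task_id" with a capital T
)
QUOTED_PATH_RE = re.compile(r"'(/[^']*)'")  # absolute paths in single quotes


def parse_failure(line: str) -> Optional[dict]:
    """Split a failure summary line into its fields plus any quoted paths."""
    m = RECORD_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["paths"] = QUOTED_PATH_RE.findall(rec["message"])
    return rec


if __name__ == "__main__":
    rec = parse_failure(SAMPLE)
    print(rec["task_id"], rec["state"], rec["paths"])
```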

Once any errors for a policy have been resolved, the ‘isi dm jobs modify’ command can be used to resume the job.