OneFS Logfile Collection with isi-gather-info

The previous blog article outlining the investigation and troubleshooting of OneFS deadlocks and hang-dumps generated several questions about OneFS logfile gathering. So it seemed like a germane topic to explore in an article.

The OneFS ‘isi_gather_info’ utility has long been a cluster staple for collecting and collating the context and configuration that primarily aids support in the identification and resolution of bugs and issues. As such, it is arguably OneFS’ primary support tool and, in terms of actual functionality, it performs the following roles:

  1. Executes many commands, scripts, and utilities on the cluster, and saves their results.
  2. Gathers all these files into a single ‘gzipped’ package.
  3. Transmits the gather package back to Dell via several optional transport methods.

By default, a log gather tarfile is written to the /ifs/data/Isilon_Support/pkg/ directory. It can also be uploaded to Dell via the following means:

Transport Mechanism | Description | TCP Port
ESRS | Uses Dell EMC Secure Remote Support (ESRS) for gather upload. | 443/8443
FTP | Uses FTP to upload the completed gather. | 21
HTTP | Uses HTTP to upload the gather. | 80/443

More specifically, the ‘isi_gather_info’ CLI command syntax includes the following options:

Option | Description
--upload <boolean> | Enable gather upload.
--esrs <boolean> | Use ESRS for gather upload.
--gather-mode (incremental | full) | Type of gather: incremental, or full.
--http-insecure-upload <boolean> | Enable insecure HTTP upload on completed gather.
--http-upload-host <string> | HTTP host to use for HTTP upload.
--http-upload-path <string> | Path on HTTP server to use for HTTP upload.
--http-upload-proxy <string> | Proxy server to use for HTTP upload.
--http-upload-proxy-port <integer> | Proxy server port to use for HTTP upload.
--clear-http-upload-proxy-port | Clear the proxy server port used for HTTP upload.
--ftp-upload <boolean> | Enable FTP upload on completed gather.
--ftp-upload-host <string> | FTP host to use for FTP upload.
--ftp-upload-path <string> | Path on FTP server to use for FTP upload.
--ftp-upload-proxy <string> | Proxy server to use for FTP upload.
--ftp-upload-proxy-port <integer> | Proxy server port to use for FTP upload.
--clear-ftp-upload-proxy-port | Clear the proxy server port used for FTP upload.
--ftp-upload-user <string> | FTP user to use for FTP upload.
--ftp-upload-ssl-cert <string> | Specifies the SSL certificate to use in an FTPS connection.
--ftp-upload-insecure <boolean> | Whether to attempt a plain text FTP upload.
--ftp-upload-pass <string> | Password for the FTP user to use for FTP upload.
--set-ftp-upload-pass | Specify the FTP upload password interactively.
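
For example, the following syntax would run a full gather and skip the upload, leaving the tarfile under /ifs/data/Isilon_Support/pkg/ (a hypothetical invocation combining the flags documented above, assuming the booleans accept true/false):

# isi_gather_info --gather-mode full --upload false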

Once the gather arrives at Dell, it is automatically unpacked by a support process and analyzed using the ‘logviewer’ tool.

Under the hood, there are two principal components responsible for running a gather. These are:

Component | Description
Overlord | The manager process, triggered by the user, which oversees all the isi_gather_info tasks that are executed on a single node.
Minion | The worker process, which runs a series of commands (specified by the overlord) on a specific node.

The ‘isi_gather_info’ utility is primarily written in Python, with its configuration under the purview of MCP, and RPC services provided by the isi_rpc_d daemon.

For example:

# isi_gather_info &
# ps -auxw | grep -i gather
root   91620    4.4  0.1 125024  79028  1  I+   16:23        0:02.12 python /usr/bin/isi_gather_info (python3.8)
root   91629    3.2  0.0  91020  39728  -  S    16:23        0:01.89 isi_rpc_d: isi.gather.minion.minion.GatherManager (isi_rpc_d)
root   93231    0.0  0.0  11148   2692  0  D+   16:23        0:00.01 grep -i gather

The overlord uses isi_rdo (the OneFS remote command execution daemon) to start up the minion processes and informs them of the commands to be executed via an ephemeral XML file, typically stored at /ifs/.ifsvar/run/<uuid>-gather_commands.xml. The minion then spins up an executor and a command for each entry in the XML file.
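
While a gather is running, the presence of this ephemeral command file can be confirmed from the CLI (the UUID portion of the filename varies per gather):

# ls /ifs/.ifsvar/run/ | grep gather_commands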

The parallel process executor (the default) acts as a pool, triggering commands until a specified number are running concurrently. The commands themselves take care of the running and processing of results, checking frequently to ensure the timeout threshold has not been passed.

The executor also keeps track of which commands are currently running, and how many are complete, and writes them to a file so that the overlord process can display useful information. Once complete, the executor returns the runtime information to the minion, which records the benchmark file. The executor will also safely shut itself down if the isi_gather_info lock file disappears, such as if the isi_gather_info process is killed.

During a gather the minion returns nothing to the overlord process, since the output of its work is written to disk.

Architecturally, the ‘gather’ process comprises an eight-phase workflow:

The details of each phase are as follows:

Phase | Description
1. Setup | Reads the arguments passed in, as well as any config files on disk, and sets up the config dictionary that will be used throughout the rest of the codebase. Most of the code for this step is contained in isilon/lib/python/gather/igi_config/configuration.py. This is also the step where the program is most likely to exit, if any config arguments turn out to be invalid.
2. Run local | Executes all the cluster commands, which are run on the same node that is starting the gather. All these commands run in parallel (up to the current parallelism value). This is typically the second longest running phase.
3. Run nodes | Executes the node commands across all of the cluster’s nodes. This runs on each node, and while these commands run in parallel (up to the current parallelism value), they do not run in parallel with the local step.
4. Collect | Ensures all of the results end up on the overlord node (the node that started the gather). If the gather is using /ifs, this is very fast, but if not, it needs to SCP all the node results to a single node.
5. Generate extra files | Generates nodes_info and package_info.xml. These two files are present in every gather and carry important metadata about the cluster.
6. Packing | Packs (tars and gzips) all the results. This is typically the longest running phase, often by an order of magnitude.
7. Upload | Transports the tarfile package to its specified destination. Depending on the geographic location, this phase may also be lengthy.
8. Cleanup | Cleans up any intermediate files that were created on the cluster. This phase runs even if the gather fails or is interrupted.

Since the isi_gather_info tool is primarily intended for troubleshooting clusters with issues, it runs as root (or compadmin in compliance mode), as it needs to be able to execute under degraded conditions (e.g. without GMP, during upgrade, or under cluster splits). Given these atypical requirements, isi_gather_info is built as a stand-alone utility, rather than using the platform API for data collection.

The time it takes to complete a gather is typically determined by cluster configuration, rather than size. For example, a gather on a small cluster with a large number of NFS exports can take significantly longer than one on a large cluster with a simple NFS configuration. Incremental gathers are not recommended, since the base that’s required to check against may have been deleted from the log store. By default, gathers only persist for two weeks in the log processor.

On completion of a gather, a tarred and gzipped logset is generated and placed under the cluster’s /ifs/data/Isilon_Support/pkg directory by default. A standard gather tarfile unpacks to the following top-level structure:

# du -sh *
536M    IsilonLogs-powerscale-f900-cl1-20220816-172533-3983fba9-3fdc-446c-8d4b-21392d2c425d.tgz
320K    benchmark
 24K    celog_events.xml
 24K    command_line
128K    complete
449M    local
 24K    local.log
 24K    nodes_info
 24K    overlord.log
 83M    powerscale-f900-cl1-1
 24K    powerscale-f900-cl1-1.log
119M    powerscale-f900-cl1-2
 24K    powerscale-f900-cl1-2.log
134M    powerscale-f900-cl1-3
 24K    powerscale-f900-cl1-3.log

In this case, for a three-node F900 cluster, the compressed tarfile is 536 MB in size. The bulk of the data, which is primarily CLI command output, logs, and sysctl output, is contained in the ‘local’ and individual node directories (powerscale-f900-cl1-*). Each node directory contains a tarfile, varlog.tar, containing all the pertinent logfiles for that node.
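
For example, the per-node logfiles can be listed directly from the unpacked gather with standard tar syntax (the node directory name here matches the output above):

# tar -tvf powerscale-f900-cl1-1/varlog.tar | more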

The root directory of the tarfile includes the following:

Item | Description
benchmark | Runtimes for all commands executed by the gather.
celog_events.xml | Info about the customer (name, phone, email, etc.), plus significant information about the cluster and individual nodes, including: cluster/node names, node serial numbers, configuration ID, OneFS version info, and events.
command_line | Syntax of the gather commands run.
complete | Lists of completed commands run across the cluster and on individual nodes.
local | See below.
nodes_info | General information about the nodes, including the node ID, the IP address, the node name, and the logical node number.
overlord.log | Gather execution and issue log.
package_info.xml | Cluster version details, GUID, S/N, and customer info (name, phone, email, etc).

Notable contents of the ‘local’ directory (the output of all the cluster-wide commands executed on the node running the gather) include:

Local Contents Item | Description
isi_alerts_history | A list of all alerts that have ever occurred on the cluster, including: the event ID (which consists of the number of the initiating node and the event number), the times that the alert was issued and resolved, the severity, the logical node number(s) of the node(s) to which the alert applies, and the message contained in the alert.
isi_job_list | Information about Job Engine processes, including job names, enabled status, priority, policy, and descriptions.
isi_job_schedule | A schedule of when Job Engine processes run, including the job name, the schedule for a job, and the next time that a run of the job will occur.
isi_license | The current license status of all of the modules.
isi_network_interfaces | State and configuration of all the cluster’s network interfaces.
isi_nfs_exports | Configuration detail for all the cluster’s NFS exports.
isi_services | Listing of all the OneFS services and whether they are enabled or disabled. More detailed configuration for each service is contained in separate files, e.g. for SnapshotIQ: snapshot_list, snapshot_schedule, snapshot_settings, snapshot_usage, and writable_snapshot_list.
isi_smb | Detailed configuration info for all the cluster’s SMB shares.
isi_stat | Overall status of the cluster, including networks, drives, etc.
isi_statistics | CPU, protocol, and disk IO stats.
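
Since these are plain text files, they can be inspected directly from an unpacked gather. For example, to check the captured license state:

# cat local/isi_license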

The contents of each per-node directory include:

Node Contents Item | Description
df | Output of the df command.
du | Output of the du command. Unfortunately this runs ‘du -h’, which reports capacity in ‘human readable’ form but makes the output harder to sort.
isi_alerts | A list of outstanding alerts on the node.
ps and ps_full | Lists of all running processes at the time isi_gather_info was executed.

As the isi_gather_info command runs, status is provided via the interactive CLI session:

# isi_gather_info
Configuring
    COMPLETE
running local commands
    IN PROGRESS \
Progress of local
[########################################################  ]
147/152 files written  \
Some active commands are: ifsvar_modules_jobengine_cp, isi_statistics_heat, ifsvar_modules

Once the gather has completed, the location of the tarfile on the cluster itself is reported as follows:

# isi_gather_info
Configuring
    COMPLETE
running local commands
    COMPLETE
running node commands
    COMPLETE
collecting files
    COMPLETE
generating package_info.xml
    COMPLETE
tarring gather
    COMPLETE
uploading gather
    COMPLETE
Path to the tar-ed gather is:
/ifs/data/Isilon_Support/pkg/IsilonLogs-h5001-20220830-122839-23af1154-779c-41e9-b0bd-d10a026c9214.tgz

If the gather upload services are unavailable, errors will be displayed on the console, per below:

…
uploading gather
    FAILED
        ESRS failed - ESRS has not been provisioned
        FTP failed - pycurl error: (28, 'Failed to connect to ftp.isilon.com port 21 after 81630 ms: Operation timed out')

OneFS Deadlocks and Hang-dumps – Part 3

As we’ve seen previously in this series, very occasionally a cluster can become deadlocked and remain in an unstable state until the affected node(s), or sometimes the entire cluster, is rebooted or panicked. However, in addition to the data gathering discussed in the prior article, there are additional troubleshooting steps that can be explored by the more adventurous cluster admin – particularly with regard to investigating a LIN lock.

Lock Domain | Resource | Description
LIN | LIN | Every object in the OneFS filesystem (file, directory, internal special LINs) is indexed by a logical inode number (LIN). A LIN provides an extra level of indirection, providing pointers to the mirrored copies of the on-disk inode. This domain is used to provide mutual exclusion around classic BSD vnode operations. Operations that require a stable view of data take a read lock, which allows other readers to operate simultaneously but prevents modification. Operations that change data take a write lock that prevents others from accessing that directory while the change is taking place.

The approach outlined below can be useful in identifying the problematic thread(s) and/or node(s), and helping to diagnose and resolve a cluster-wide deadlock.

As a quick refresher, the various OneFS locking components include:

Locking Component | Description
Coordinator | A coordinator node arbitrates locking within the cluster for a particular subset of resources. The coordinator only maintains the lock types held and wanted by the initiator nodes.
Domain | Refers to the specific lock attributes (recursion, deadlock detection, memory use limits, etc.) and context for a particular lock application. There is one definition of owner, resource, and lock types, and only locks within a particular domain may conflict.
Initiator | The node requesting a lock on behalf of a thread. The initiator must contact the coordinator of a resource in order to acquire the lock. The initiator may grant a lock locally for types which are subordinate to the type held by the node. For example, with shared-exclusive locking, an initiator which holds an exclusive lock may grant either a shared or exclusive lock locally.
Lock Type | Determines the contention among lockers. A shared or read lock does not contend with other types of shared or read locks, while an exclusive or write lock contends with all other types. Lock types include: Advisory, Anti-virus, Data, Delete, LIN, Mark, Oplocks, Quota, Share Mode, SMB byte-range, Snapshot, and Write.
Locker | Identifies the entity which acquires a lock.
Owner | A locker which has successfully acquired a particular lock. A locker may own multiple locks of the same or different type as a result of recursive locking.
Resource | Identifies a particular lock. Lock acquisition only contends on the same resource. The resource ID is typically a LIN, to associate locks with files.
Waiter | Has requested a lock but has not yet been granted or acquired it.

So the basic data that will be required for a LIN lock investigation is as follows:

Data | Description
<Waiter-LNN> | Node number with the largest ‘started’ value.
<Waiting-Address> | Address of the <Waiter-LNN> node above.
<LIN> | LIN from the ‘resource =’ field of <Waiter-LNN>.
<Block-Address> | Block address from the ‘resource =’ field of <Waiter-LNN>.
<Locker-Node> | Node that owns the lock for the <LIN>; has a non-zero value for ‘owner_count’.
<Locker-Address> | Address of the <Locker-Node>.

As such, the following process can be used to help investigate a LIN lock:

The details for each step above are as follows:

  1. First, execute the following CLI syntax from any node in the cluster to view the LIN lock ‘oldest_waiter’ info:
# isi_for_array -X 'sysctl efs.lin.lock.initiator.oldest_waiter | grep -E "address|started"' | grep -v "exited with status 1"

Querying the ‘efs.lin.lock.initiator.oldest_waiter’ sysctl returns a deluge of information, for example:

# sysctl efs.lin.lock.initiator.oldest_waiter
efs.lin.lock.initiator.oldest_waiter: resource = 1:02ab:002c
waiter = {
    address = 0xfffffe8ff7674080
    locker = 0xfffffe99a52b4000
    type = shared
    range = [all]
    wait_type = wait ok
    refcount_type = stacking
    probe_id = 818112902
    waiter_id = 818112902
    probe_state = done
    started = 773086.923126 (29.933031 seconds ago)
    queue_in = 0xfffff80502ff0f08
    lk_completion_callback = kernel:lk_lock_callback+0
    waiter_type = sync
    created by:
      Stack: --------------------------------------------------
      kernel:lin_lock_get_locker+0xfe
      kernel:lin_lock_get_locker+0xfe
      kernel:bam_vget_stream_invalid+0xe5
      kernel:bam_vget_stream_valid_pref_hint+0x51
      kernel:bam_vget_valid+0x21
      kernel:bam_event_oprestart+0x7ef
      kernel:ifs_opwait+0x12c
      kernel:amd64_syscall+0x3a6
      --------------------------------------------------

The pertinent areas of interest for this exercise are the ‘address’ and ‘started’ (wait time) fields.

If the ‘started’ value is short (i.e. less than 90 seconds), or there is no output returned, then this is potentially an MDS lock issue (which can be investigated via the ‘efs.mds.block_lock.initiator.oldest_waiter’ sysctl).
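
The MDS lock domain can be interrogated in the same fashion:

# sysctl efs.mds.block_lock.initiator.oldest_waiter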

  2. From the above output, examine the results with ‘started’ lines and find the one with the largest value for ‘(###.### seconds ago)’. The node number (<Waiter-LNN>) of this entry is the one of interest.
  3. Next, examine the ‘address =’ entries and find the one with that same node number (<Waiting-Address>).

Note that if there are multiple entries per node, this could indicate a multiple shared lock with another exclusive lock waiting.

  4. Query the LIN for the waiting address on the correct node using the following CLI syntax:
# isi_for_array -n<Waiter-LNN> 'sysctl efs.lin.lock.initiator.active_entries | egrep "resource|address|owner_count" | grep -B5 <Waiting-Address>'
  5. The LIN for this issue is shown in the ‘resource =’ field of the above output. Use the following command to find which node owns the lock on that LIN:
# isi_for_array -X 'sysctl efs.lin.lock.initiator.active_entries | egrep "resource|owner_count"' | grep -A1 <LIN>

Parse the output from this command to find the entry that has a non-zero value for ‘owner_count’. This is the node that owns the lock for this LIN (<Locker-Node>).

  6. Run the following command to find which thread owns the lock on the LIN:
# isi_for_array -n<Locker-Node> 'sysctl efs.lin.lock.initiator.active_entries | grep -A10 <LIN>'
  7. The ‘locker =’ field will provide the thread address (<Locker-Addr>) of the thread holding the lock on the LIN. The following CLI syntax can be used to find the associated process and stack details for this thread:
# isi_for_array -n<Locker-Node> 'sysctl kern.proc.all_stacks | grep -B1 -A20 <Locker-Addr>'
  8. The output will provide the stack and process details. Depending on the process and stack information available from the previous command output, you may be able to terminate (i.e. kill -9) the offending process in order to clear the deadlock issue.

Usually within a couple of minutes of killing the offending process, the cluster will become responsive again. The ‘isi get -L’ CLI command can be used to help determine which file was causing the issue, possibly giving some insight as to the root cause.
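
For example, using the LIN from the earlier sysctl output (the output will obviously vary by file):

# isi get -L 1:02ab:002c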

Please note that if you are unable to identify an individual culprit process, or are unsure of your findings, contact Dell Support for assistance.

OneFS Deadlocks and Hang-dumps – Part 2

As mentioned in the previous article in this series, hang-dumps can occur under the following circumstances:

Type | Description
Transient | The time to obtain the lock was long enough to trigger a hang-dump, but the lock is eventually granted. This is the less serious situation. The symptoms are typically general performance degradation, but the cluster is still responsive and able to progress.
Persistent | The issue typically requires significant remedial action, such as node reboots. This is usually indicative of a bug in OneFS, although it could also be caused by hardware issues, where hardware becomes unresponsive and OneFS waits indefinitely for it to recover.

Certain normal OneFS operations, such as those involving very large files, have the potential to trigger a hang-dump with no long-term ill effects. However, in some situations the thread or process waiting for the lock to be freed, or ‘waiter’, is never actually granted the lock on the file. In such cases, users may be impacted.

If a hang-dump is generated as a result of a LIN lock timeout (the most likely scenario), this indicates that at least one thread in the system has been waiting for a LIN lock for over 90 seconds. The system hang can involve a single thread, or sometimes multiple threads, for example blocking a batch job. The system hang could be affecting interactive session(s), in which case the affected cluster users will likely experience performance impacts.

Specifically, in the case of a LIN lock timeout, if the LIN number is available, it can be easily mapped back to its associated filename using the ‘isi get’ CLI command.

# isi get -L <lin #>

However, a LIN which is still locked may necessitate waiting until the lock is freed before getting the name of the file.

By default, OneFS hang-dump files are written to the /var/crash directory as compressed text files. During a hang-dump investigation, Dell support typically utilizes internal tools to analyze the logs from all of the nodes and generate a graph to show the lock interactions between the lock holders (the thread or process that is holding the file) and lock waiters. The analytics are per-node and include a full dump of the lock state as seen by the local node, the stack of each thread in the system, plus a variety of other diagnostics including memory usage, etc. Since OneFS source-code access is generally required in order to interpret the stack traces, Dell Support can help investigate the hang-dump log file data, which can then be used to drive further troubleshooting.

A deadlocked cluster may exhibit one or more of the following symptoms:

  • Clients are unable to communicate with the cluster via SMB, NFS, SSH, etc.
  • The WebUI is unavailable and/or commands executed from the CLI fail to start or complete.
  • Processes cannot be terminated, even with SIGKILL (kill -9).
  • Degraded cluster performance is experienced, with low or no CPU/network/disk usage.
  • Inability to access files or folders under /ifs.

In order to recover from a deadlock, Dell support’s remediation will sometimes require panicking or rebooting a cluster. In such instances, thorough diagnostic information gathering should be performed prior to this drastic step. Without this diagnostic data, it will often be impossible to determine the root cause of the deadlock. If the underlying cause of the deadlock is not corrected, rebooting the cluster and restarting the service may not resolve the issue.

The following steps can be run in order to gather data that will be helpful in determining the cause of a deadlock:

  1. First, verify that there are no indeterminate journal transactions. If there are indeterminate journal transactions found, rebooting or panicking nodes will not resolve the issue.
# isi_for_array -X 'sysctl efs.journal.indeterminate_txns'
1: efs.journal.indeterminate_txns: 0
2: efs.journal.indeterminate_txns: 0
3: efs.journal.indeterminate_txns: 0

For each node, if the output of the above command returns zero, this indicates its journal is intact and all transactions are complete. Note that if the output is anything other than zero, the cluster contains indeterminate transactions, and Dell support should be engaged before any further troubleshooting is performed.

2. Next, check the /var/crash directory for any recently created hang-dump files:

# isi_for_array -s 'ls -l /var/crash | grep -i hang'

Scan the /var/log/messages file for any recent references to ‘LOCK TIMEOUT’:

# isi_for_array -s 'egrep -i "lock timeout|hang" /var/log/messages | grep $(date +%Y-%m-%d)'
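
Note that the next two steps write their output under /ifs/data/Isilon_Support/deadlock-data/, so ensure this directory exists first. Since /ifs is a single cluster-wide namespace, it only needs to be created once, from any node:

# mkdir -p /ifs/data/Isilon_Support/deadlock-data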

3. Collect the output from the ‘fstat’ CLI command, which identifies active files:

# isi_for_array -s 'fstat -m > /ifs/data/Isilon_Support/deadlock-data/fstat_$(hostname).txt'&

4. Record the Group Management Protocol (GMP) merge lock state:

# isi_for_array -s 'sysctl efs.gmp.merge_lock_state > /ifs/data/Isilon_Support/deadlock-data/merge_lock_state_$(hostname).txt'

5. Finally, run an ‘isi diagnostics gather’ to capture relevant cluster data, and send the resulting zipped tarfile to Dell Support (via ESRS, FTP, etc):

# isi diagnostics gather start
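
The progress of the diagnostics gather can then be tracked via its ‘status’ subcommand:

# isi diagnostics gather status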

A cluster reboot can be accomplished via an SSH connection as root to any node in the cluster, as follows:

# isi config
Welcome to the PowerScale configuration console.
Copyright (c) 2001-2022 Dell Inc. All Rights Reserved.

Enter 'help' to see list of available commands.
Enter 'help <command>' to see help for a specific command.
Enter 'quit' at any prompt to discard changes and exit.

        Node build: Isilon OneFS 9.4.0.0 B_MAIN_2978(RELEASE)
        Node serial number: JACNT194540666

TME1 >>> reboot all

!! You are about to reboot the entire cluster
Are you sure you wish to continue? [no] yes

Alternatively, the following CLI syntax can be used to reboot a single node from an SSH connection to it:

# kldload reboot_me

Or to reboot the cluster:

# isi_for_array -x$(isi_nodes -L %{lnn}) 'kldload reboot_me'

Note that simply shutting down or rebooting the affected node(s) (or the entire cluster), while typically the quickest path to get up and running again, will not generate the core files required for debugging. If a root cause analysis is desired, these node(s) will need to be panicked in order to generate a dump of all active threads.

Only perform a node panic under the direct supervision of Dell Support! Be aware that panics bypass a number of important node shutdown functions, including unmounting /ifs. However, a panic will generate additional kernel core information which is typically required by Dell Support in order to perform a thorough diagnosis. In situations where the entire cluster needs to be panicked, the recommendation is to start with the highest numbered node and work down to the lowest. For each node that’s panicked, the debug information is written to the /var/crash directory, and can be identified by the ‘vmcore’ prefix.

If instructed by Dell Support to do so, the ‘isi_rbm_panic’ CLI command can be used to panic a node, with the argument being the logical node number (LNN) of the desired node to target. For example, to panic a node with LNN=2:

# isi_rbm_panic 2

If in any doubt, the following CLI syntax will return the corresponding node ID and node LNN for each node in the cluster:

# isi_nodes %{id} , %{lnn}

OneFS Deadlocks and Hang-dumps

A principal benefit of the OneFS distributed file system is its ability to efficiently coordinate operations that happen on separate nodes. File locking enables multiple users or processes to access data concurrently and safely. Since all nodes in a PowerScale cluster operate on the same single-namespace file system simultaneously, mutual exclusion mechanisms are required for it to function correctly. For reading data, this is a fairly straightforward process involving shared locks. With writes, however, things become more complex and require exclusive locking, since data must be kept consistent.

Under the hood, the locks OneFS uses to provide consistency inside the file system (internal) are separate from the file locks provided for consistency between applications (external). This allows OneFS to reallocate a file’s metadata and data blocks while the file itself is locked by an application. This is the premise of OneFS autobalance, reprotection and tiering, where restriping is performed behind the scenes in small chunks, in order to minimize disruption.

OneFS has a fully distributed lock manager (DLM) that marshals locks across all the nodes in a cluster. This locking manager allows for multiple lock types to support both file system locks as well as cluster-coherent protocol-level locks, such as SMB share mode locks or NFS advisory-mode locks. The DLM distributes the lock data across all the nodes in the cluster. In a mixed cluster, the DLM will balance the memory utilization so that the lower-power nodes are not bullied.

OneFS includes the following lock domains:

Lock Domain | Resource | Description
Advisory Lock | LIN | The advisory lock domain (advlock) implements local POSIX advisory locks and NFS NLM locks.
Anti-virus | LIN, snapID | The AV domain implements locks used by OneFS’ ICAP Antivirus feature.
Data | LIN, LBN | The datalock lock domain implements locks on regions of data within a file. By reducing the locking granularity to below the file level, this enables simultaneous writers to multiple sections of a single file.
Delete | LIN | The ref lock domain exists to enable POSIX delete-on-close semantics. In a POSIX filesystem, unlinking an open file does not remove the space associated with the file until every thread accessing that file closes the file.
ID Map | B-tree key | The idmap database contains mappings between POSIX (UID, GID) and Windows (SID) identities. This lock domain provides concurrency control for the idmap database.
LIN | LIN | Every object in the OneFS filesystem (file, directory, internal special LINs) is indexed by a logical inode number (LIN). A LIN provides an extra level of indirection, providing pointers to the mirrored copies of the on-disk inode. This domain is used to provide mutual exclusion around classic BSD vnode operations. Operations that require a stable view of data take a read lock, which allows other readers to operate simultaneously but prevents modification. Operations that change data take a write lock that prevents others from accessing that directory while the change is taking place.
MDS | Lowest baddr | All metadata in the OneFS filesystem is mirrored for protection. Operations involving reads/writes of such metadata are protected using locks in the Mirrored Data Structure (MDS) domain.
Oplocks | LIN, snapID | The oplock lock domain implements the underlying support for opportunistic locks and leases.
Quota | Quota Domain ID | The quota domain is used to implement concurrency control for quota domain records.
Share mode | LIN, snapID | The share_mode_lock domain is used to implement the Windows share mode locks.
SMB byte-range | CBRL | The cbrl lock domain implements support for byte-range locking.

Each lock type implements a set of key/value pairs, and can additionally support a ‘byte range’ (a pair of file offsets) and a ‘user data’ block.

In addition to managing locks on files, DLM also orchestrates access to the storage drives. Multiple lock domains exist within the lock manager, including advisory file locks (advlock), mirrored metadata operations (MDS locks), and logical inode number locks (LIN locks) for operations involving file system objects that have an inode (i.e. files or directories). Of these, LIN locks constitute the lion’s share of OneFS locking issues.

Much like OS-level locking, DLM’s operations are typically invisible to end users. That said, DLM sets a maximum time to wait to obtain a lock. When this threshold is exceeded, OneFS automatically triggers a diagnostic information-gathering process, or hang-dump. Note that the triggering of a hang-dump is not necessarily indicative of an issue, but should definitely prompt some further investigation. Hang-dumps will be covered in more depth in a forthcoming blog article in this series.

So what exactly is a deadlock? When one or more processes have obtained locks on resources, a point can be reached at which each process prevents another from obtaining a lock, and none of the processes can proceed. This condition is known as a deadlock.

In the image above, thread 1 has an exclusive lock on Resource A, but also requires Resource B in order to complete execution. Since the Resource B lock is unavailable, thread 1 will have to wait for thread 2 to release its lock on Resource B. At the same time, thread 2 has obtained an exclusive lock on Resource B, but requires Resource A to finish execution. Since the Resource A lock is unavailable, if thread 2 attempts to lock Resource A, both processes will wait indefinitely for each other.

Any multi-process file system architecture that involves locking has the potential for deadlocks if any thread needs to acquire more than one lock at the same time. There are typically two general approaches to handling this scenario:

Option | Description
Avoidance | Attempt to ensure the code cannot deadlock. This approach involves mechanisms such as consistently acquiring locks in the same order. It is generally challenging, not always practical, and can have adverse performance implications for fast path code.
Acceptance | Acknowledge that deadlocks will occur and develop appropriate handling methods.

While OneFS ultimately takes the latter approach, it strives to ensure that deadlocks don’t occur. However, under rare conditions, it is more efficient to manage deadlocks by simply breaking and reestablishing the locks.

In OneFS parlance, a ‘hang-dump’ is a cluster event, resulting from the cluster detecting a ‘hang’ condition, during which the isi_hangdump_d service generates a set of log files. This typically occurs as a result of merge lock timeouts and deadlocks, and while hang-dumps are usually triggered automatically, they can also be manually initiated if desired.

OneFS monitors each lock domain and has a built-in ‘soft timeout’ (the amount of time in which OneFS can generally be expected to satisfy a lock request) associated with each. If a thread holding a lock blocks another thread’s attempt to obtain a conflicting lock type for longer than the soft timeout period, a hang-dump is triggered to collect a substantial quantity of system state information for potential diagnostic purposes, including the lock state of each domain, plus the stack traces of every thread on each node in the cluster.

When a thread is blocked for an extended period of time, any client that is waiting for the work that the thread is performing is also blocked. The external symptoms resulting from this can include:

  • Open applications stop taking input but do not shut down.
  • Open windows or dialogues cannot be closed.
  • The system cannot be restarted normally because it does not respond to commands.
  • A node does not respond to client requests.

Hang-dumps can occur as a result of the following conditions:

Type | Description
Transient | The time to obtain the lock was long enough to trigger a hang-dump, but the lock is eventually granted. This is the less serious situation. The symptoms are typically general performance degradation, but the cluster is still responsive and able to progress.
Persistent | The issue typically requires impactful remedial action, such as node reboots. This is usually indicative of either a defect in OneFS or a hardware issue, where a cluster component becomes unresponsive and OneFS waits indefinitely for it to recover.

Note that a OneFS hang-dump is not necessarily indicative of a serious issue.

OneFS System Logging and Ilog

The OneFS ilog service is a general logging facility for the cluster, allowing applications and services to rapidly decide if and where to log messages, based on the currently active logging configuration. Historically, OneFS used syslog directly or via custom wrappers, and the isi_ilog daemon provides the features common to those wrappers plus an array of other capabilities. These include runtime modification of logging settings, the ability to log to file, syslog, and/or stderr, additional context (‘component’, ‘job’, and ‘thread_id’ alongside the message), and default fall-back to syslog.

Under the hood, there are actually two different ilog components; kernel ilog and userspace ilog.

Kernel ilog controls log verbosity at runtime, avoiding the need to install a new kernel module to enable more log detail, and allowing such detailed logging to be enabled only at certain times. Ilog defines six logging levels: Error, Warning, Notice, Info, Debug, and Trace, with ‘error’, ‘warning’, and ‘notice’ being written to /var/log/messages in the default configuration. The user interface to kernel ilog is through sysctl variables, each of which can be set to any combination of the logging levels.

Userspace ilog, while conceptually similar to the kernel implementation, lacks the single memory space and per-boot permanence of sysctl variables. User-space processes may start and terminate arbitrarily, and there may also be multiple processes running for a given service or app. Consequently, user-space ilog uses a gconfig file and shared memory to implement run-time changes to logging levels.

Runtime control of OneFS services’ logging is via the ‘isi_ilog’ CLI tool, which enables:

  • Adjusting logging levels
  • Defining tags which enable log lines with matching tags
  • Logging by file or file and line number
  • Adding or disabling logging to a file
  • Enabling or disabling logging to syslog
  • Throttling of logging, so repeated messages are emitted no more than once every N seconds.

For userspace ilog, when an application or service using ilog starts up, its logging settings are loaded from the ilog gconfig tree, and a small chunk of shared memory is opened and logically linked to that config. When ilog’s logging configuration is modified via the CLI, the gconfig tree is updated and a counter in the shared memory is incremented.

The OneFS applications and services that are currently integrated with ilog include:

Service | Daemons
API | PAPI, isi_rsapi_d
Audit | isi_audit_d, isi_audit_syslog, isi_audit_purge_helper
Backend network | isi_lbfo_d
CloudPools | isi_cpool_d
Cluster monitoring | isi_array_d, isi_paxos
Configuration store | isi_tardis_d, isi_tardis_gcfg_d
DNS | isi_dnsiq_d, isi_cbind_d
Drive | isi_drive_d, isi_drive_repurpose_d
Diagnostics | isi_diags_d
Fast delete | isi_trash_d
Healthchecks | isi_protohealth_d
IPMI management | isi_ipmi_mgmt_d
Migration | isi_vol_copy, isi_restore
NDMP Backup | isi_ndmp_d
NFS | isi_nfs_convert, isi_netgroup_d
Remote assist | isi_esrs_d, isi_esrs_api
SED Key Manager | isi_km_d
Services manager | isi_mcp_d
SmartLock Compliance | isi_comp_d
SmartSync | isi_dm_d
SyncIQ | siq_bandwidth, siq_generator, siq_pworker, siq_pworker_hasher, siq_stf_diff, siq_sworker, siq_sworker_hasher, siq_sworker_tmonitor, siq_coord, siq_sched, siq_sched_rotate_reports
Upgrade Signing | isi_catalog_init

The ilog logging level provides for three types of capabilities:

  1. Severity (which maps to syslog severity)
  2. Special
  3. Custom

The ilog severity level settings map to syslog as follows:

Ilog Severity Level | Syslog Mapping
IL_FATAL | Maps to LOG_CRIT. Calls exit after the message is logged.
IL_ERR | Maps to LOG_ERR.
IL_NOTICE | Maps to LOG_INFO.
IL_INFO | Maps to LOG_INFO.
IL_DEBUG | Maps to LOG_DEBUG.
IL_TRACE | Maps to LOG_DEBUG.

For example, the following CLI command will set the NDMP service to log at the ‘info’ level:

# isi_ilog -a isi_ndmp_d --level info

Note that logging levels do not work quite like syslog, as each level is separate. Specifically, if an application’s criteria are set to log messages at the ‘IL_DEBUG’ level, it will log only those debug messages, and not messages at any higher severity. To log at a level plus all higher severity levels, ilog allows ‘PLUS’ (--level <level>+) combination settings.
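
For example, to configure the NDMP service to log at ‘info’ and all higher severity levels:

# isi_ilog -a isi_ndmp_d --level info+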

Logging configuration is per named application, not per process, and settings are managed on a per-node basis. Any cluster-wide ilog criteria changes will require the use of the ‘isi_for_array’ CLI utility.

Be aware that syslog is still the standard target for logging, and /etc/mcp/templates/syslog.conf (rather than /etc/syslog.conf) is used to enable syslogging. If ‘use_syslog’ is set to true but syslog.conf is not modified, syslog entries will not be created. When syslog is enabled, if ‘log_file’ points to the same syslog file, duplicate log entries will occur: one from syslog and one from the log file.

Other isi_ilog CLI commands include:

List all apps:

# isi_ilog -L

Print settings for an app:

# isi_ilog -a <service_name> -p

Set application level to info:

# isi_ilog -a <service_name> --level info

Turn off syslog logging for application:

# isi_ilog -a <service_name> --syslog off

Turn on logging to a file for a service:

# isi_ilog -a <service_name> --file /ifs/logs/<service_name>.log

Of the various services that use ilog, OneFS auditing is among the most popular. As such, it has its own configuration through the ‘isi audit’ CLI command set, or from the WebUI via Cluster management > Auditing:

Additionally, the ‘isi audit settings global’ CLI command is used to enable and disable cluster auditing, as well as to configure retention periods, remote CEE and syslog services, etc.

# isi audit settings global view
Protocol Auditing Enabled: Yes
            Audited Zones: System, az1
          CEE Server URIs: -
                 Hostname:
  Config Auditing Enabled: Yes
    Config Syslog Enabled: Yes
    Config Syslog Servers: 10.20.40.240
  Protocol Syslog Servers: 10.20.40.240
     Auto Purging Enabled: No
         Retention Period: 180

Additionally, the various audit event attributes can be viewed and modified via the ‘isi audit settings’ CLI command.

# isi audit settings view
            Audit Failure: create_file, create_directory, open_file_write, open_file_read, close_file_unmodified, close_file_modified, delete_file, delete_directory, rename_file, rename_directory, set_security_file, set_security_directory
            Audit Success: create_file, create_directory, open_file_write, open_file_read, close_file_unmodified, close_file_modified, delete_file, delete_directory, rename_file, rename_directory, set_security_file, set_security_directory
      Syslog Audit Events: create_file, create_directory, open_file_write, open_file_read, close_file_unmodified, close_file_modified, delete_file, delete_directory, rename_file, rename_directory, set_security_file, set_security_directory
Syslog Forwarding Enabled: Yes

To configure syslog forwarding, review the zone specific audit settings and ensure syslog audit events (for local) are set and syslog forwarding is enabled (for remote).

Note that the ‘isi audit settings’ CLI command defaults to the ‘system’ zone unless the ‘–zone’ flag is specified. For example, to view the configuration for the ‘az1’ access zone, which in this case is set to non-forwarding:

# isi audit settings view --zone=az1
            Audit Failure: create_file, create_directory, open_file_write, open_file_read, close_file_unmodified, close_file_modified, delete_file, delete_directory, rename_file, rename_directory, set_security_file, set_security_directory
            Audit Success: create_file, create_directory, open_file_write, open_file_read, close_file_unmodified, close_file_modified, delete_file, delete_directory, rename_file, rename_directory, set_security_file, set_security_directory
      Syslog Audit Events: create_file, create_directory, open_file_write, open_file_read, close_file_unmodified, close_file_modified, delete_file, delete_directory, rename_file, rename_directory, set_security_file, set_security_directory
Syslog Forwarding Enabled: No
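
Forwarding for that zone can then be enabled via the corresponding ‘modify’ command (assuming the flag name mirrors the displayed field):

# isi audit settings modify --zone=az1 --syslog-forwarding-enabled=true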

The cluster’s /etc/syslog.conf file should include the IP address of the server that’s being forwarded to (in this example, a Linux box at 10.20.40.240):

!audit_config
*.*                                             /var/log/audit_config.log
*.*                                             @10.20.40.240

!audit_protocol
*.*                                             /var/log/audit_protocol.log
*.*                                             @10.20.40.240

Output on the remote host will be along the lines of:

Jul 31 17:46:40 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|OPEN|SUCCESS|1442207|FILE|CREATED|4314890714|/ifs/test/audit_test2.doc
Jul 31 17:46:43 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|CLOSE|SUCCESS|FILE|0:0|0:0|4314890714|/ifs/test/audit_test2.doc
Jul 31 17:46:43 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|OPEN|SUCCESS|129|FILE|OPENED|4314890714|/ifs/test/audit_test2.doc
Jul 31 17:46:43 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|CLOSE|SUCCESS|FILE|0:0|0:0|4314890714|/ifs/test/audit_test2.doc.txt
Jul 31 17:46:43 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|RENAME|SUCCESS|FILE|4314890714|/ifs/test/audit_test2.doc.txt|/ifs/test/audit_test.txt
Jul 31 17:46:44 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|OPEN|FAILED:3221225524|129|FILE|DOES_NOT_EXIST||/ifs/test/audit_test2.doc
Jul 31 17:46:45 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|CLOSE|SUCCESS|FILE|0:0|0:0|4314890714|/ifs/test/audit_test2.doc
Jul 31 17:46:45 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|OPEN|SUCCESS|1179785|FILE|OPENED|4314890714|/ifs/test/audit_test3.txt
Jul 31 17:46:45 isln-tme-1 (id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|CLOSE|SUCCESS|FILE|0:0|0:0|4314890714|/ifs/test/audit_test3.txt
Jul 31 17:46:45 isln-tme-1 syslogd last message repeated 6 times
Jul 31 17:46:51 isln-tme-1 (id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|OPEN|SUCCESS|1180063|FILE|OPENED|4314890714|/ifs/test/audit_test3.txt
Jul 31 17:46:51 isln-tme-1 (id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|CLOSE|SUCCESS|FILE|0:0|0:0|4314890714|/ifs/test/audit_test3.txt
Jul 31 17:46:51 isln-tme-1(id1) audit_protocol[2188] S-1-22-1-0|0|System|1|10.20.40.1|SMB|CLOSE|SUCCESS|FILE|0:0|5:1|4314890714|/ifs/test/audit_test3.txt

OneFS and QLC Drive Support

Another significant feature of the recent OneFS 9.4 release is support for quad-level cell (QLC) flash media. Specifically, the PowerScale F900 and F600 all-flash NVMe platforms are now available with 15.4TB and 30.7TB QLC NVMe drives.

These new QLC drives offer a compelling blend of capacity, performance, reliability, and affordability – and will be particularly beneficial for workloads such as artificial intelligence, machine and deep learning, and for media and entertainment environments.

The details of the new QLC drive options for the F600 and F900 platforms are as follows:

PowerScale Node | Chassis specs (per node) | Raw capacity (per node) | Max raw capacity (252-node cluster)
F900 | 2U with 24 NVMe SSD drives | 737.28TB with 30.72TB QLC; 368.6TB with 15.36TB QLC | 185.79PB with 30.72TB QLC; 92.83PB with 15.36TB QLC
F600 | 1U with 8 NVMe SSD drives | 245.76TB with 30.72TB QLC; 122.88TB with 15.36TB QLC | 61.93PB with 30.72TB QLC; 30.96PB with 15.36TB QLC

This means an F900 cluster with the 30.7TB QLC drives can now scale up to a whopping 185.79PB in size, with a nice linear performance ramp!

So the new QLC drives double the all-flash capacity footprint, as compared to previous generations – while delivering robust environmental efficiencies in consolidated rack space, power and cooling. What’s more, PowerScale F600 and F900 nodes containing QLC drives can deliver the same level of performance as TLC drives, thereby delivering vastly superior economics and value. As illustrated below, QLC nodes performed at parity or slightly better than TLC nodes for throughput benchmarks and SPEC workloads.

The above graphs show the comparative peak random throughput per-node for both QLC and TLC.

QLC-based F600 and F900 nodes can be rapidly and non-disruptively integrated into existing PowerScale clusters, allowing seamless data lake expansion and the accommodation of new workloads.

Compatibility-wise, there are a couple of key points to be aware of. If attempting to add a QLC drive to a non-QLC node, or vice versa, the unsupported drive will be blocked with the ‘WRONG_TYPE’ error. However, QLC and non-QLC nodes will happily coexist in different pools within the same cluster. But attempting to merge storage node pools with differing media classes will output the error ‘All nodes in the nodepool must have compatible [HDD|SSD] drive technology’.

From the WebUI, the ‘drive details’ pop-up window displays ‘NVME, SSD, QLC’ as the ‘Connection and media type’.  This can be viewed by navigating to Hardware configuration > Drives and selecting ‘View details’ for the desired drive:

The WebUI SmartPools summary, available by browsing to Storage pools > SmartPools, also incorporates ‘QLC’ into the pool name:

Similarly, in the ‘node pool details’:

From the OneFS CLI, existing commands displaying DSP (drive support package), PSI (platform support infrastructure), and storage and node pools display a new ‘media_class’ string, ‘QLC’. For example:

# isi storagepool nodepools ls
ID   Name                Nodes  Node Type IDs  Protection Policy  Manual
-------------------------------------------------------------------------
1    f600_15tb-ssd-qlc_736gb 1      1              +2d:1n             No
                         2
                         3
-------------------------------------------------------------------------
Total: 1

# isi storagepool nodetypes ls
ID   Product Name                               Nodes  Manual
--------------------------------------------------------------
1    F600-1U-Dual-736GB-2x25GE SFP+-15TB SSD QLC 1      No
                                                 2
                                                 3
--------------------------------------------------------------
Total: 1

OneFS 9.4 also introduces new model and vendor class fields, providing a dynamic and extensible path to determine what drive statistics and information to gather, how to capture them, and how to display them – in preparation for future drive technologies. For example:

# isi_radish -a
Bay 0/nvd15 is Dell Ent NVMe P5316 RI U.2 30.72TB FW:0.0.8 SN:BTAC1436043630PGGN, 60001615872 blks
Log Sense data (Bay 0/nvd15) –
Supported log pages 0x1 0x2 0x3 0x4 0x5 0x6 0x80 0x81
SMART/Health Information Log
============================
Critical Warning State: 0x00
 Available spare: 0
 Temperature: 0
 Device reliability: 0
 Read only: 0
 Volatile memory backup: 0
Temperature: 297 K, 23.85 C, 74.93 F
Available spare: 100
Available spare threshold: 10
Percentage used: 0
Data units (512,000 byte) read: 1619199
Data units written: 10075777
Host read commands: 67060074
Host write commands: 4461942671
Controller busy time (minutes): 1
Power cycles: 21
Power on hours: 420
Unsafe shutdowns: 18
Media errors: 0
No. error info log entries: 0
Warning Temp Composite Time: 0
Error Temp Composite Time: 0
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1: 0
Total Time For Temperature 2: 0

Finally, PowerScale F600 and F900 nodes must be running OneFS 9.4 and the latest DSP in order to support QLC drives. In the event of a QLC drive failure, it must be replaced with another QLC drive. Additionally, any attempts to downgrade a QLC node to a version prior to OneFS 9.4 will be blocked.

OneFS SmartSync Management and Diagnostics

In this final blog in the series, we’ll look at SmartSync’s diagnostic tools and performance, plus review some of its idiosyncrasies and its coexistence with other OneFS features.

But first, performance. Unlike SyncIQ, which operates solely on a push model, SmartSync allows pull replication, too. This can be an incredibly useful performance option for environments that grow organically. As demand for replication on a source cluster increases, the additional compute and network load needs to be considered. Push replication, especially with multiple targets, can generate a significant load on the source cluster, as shown in CPU graphs in the following graphic:

In extreme cases, replication traffic resource utilization can potentially impact client workloads as data is pushed to the target. On the other hand, enabling a pull replication model for a dataset can drastically reduce the resource impacts on the source cluster’s CPU utilization by offloading replication overhead to the target cluster. This can be seen in the following graphs:

For single-source dataset environments with numerous targets, pull replication can free up the source cluster’s compute and network resources, which can then be used more beneficially for client IO. However, if the target cluster is a capacity-optimized archive cluster without CPU and/or network resources to spare, a pull policy model, rather than the traditional push, may not be an option. In such cases, SmartSync also allows its policies to be limited, or throttled, in order to reduce system and/or network resource impacts from replication. SmartSync throttling comes in two flavors: bandwidth throttling and CPU throttling.

  1. Bandwidth throttling is specified through a set of netmask rules plus a maximum throughput limit, and is configured via the ‘isi dm throttling’ CLI syntax:
# isi dm throttling bw-rules create NETMASK --netmask <subnet> --bw-limit=<bytes>

Bandwidth limits are specified in bytes for a specific subnet and netmask. For example:

# isi dm throttling bw-rules create NETMASK --netmask 10.20.100.0/24 --bw-limit=$((20*1024*1024))

In this case, the bandwidth limit of 20MB (20*1024*1024 bytes) is applied to the 10.20.100.0 class C subnet. The throttling configuration change can be verified as follows:

# isi dm throttling bw-rules list
ID Rule Type Netmask Bw Limit
------------------------------------------
0 NETMASK 10.20.100.0/24 20.00MB
------------------------------------------
Total: 1
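
To remove a rule, the rule ID from the listing above can be referenced (assuming a ‘delete’ subcommand analogous to ‘create’ and ‘list’):

# isi dm throttling bw-rules delete 0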

 

  2. Compute-wise, SmartSync policies can by default consume up to 30% of a node’s CPU cycles while the node’s total CPU usage is below 90%. If the node’s total CPU utilization reaches 90%, or if SmartSync’s consumption reaches 30% of the total CPU, SmartSync automatically throttles its CPU consumption.

Additional CPU throttling is specified through ‘allowed CPU percentage’ and ‘backoff CPU percentage’ limits, which are also configured via the ‘isi dm throttling’ CLI command syntax.

The ‘Allowed CPU Threshold’ parameter sets the always-allowed CPU cycles that SmartSync may use, regardless of the overall node CPU usage. If the system’s CPU usage crosses the ‘System CPU Load Threshold’ and SmartSync is using more than the ‘Allowed CPU Threshold’, SmartSync will throttle its CPU utilization to remain at or below the ‘Allowed CPU Threshold’.

For example, to set the allowed CPU threshold to 20% and the system CPU threshold to 80%:

# isi dm throttling settings view
    Allowed CPU Threshold: 30
System CPU Load Threshold: 90

# isi dm throttling settings modify --allowed-cpu-threshold 20 --system-cpu-load-threshold 80

# isi dm throttling settings view
    Allowed CPU Threshold: 20
System CPU Load Threshold: 80

Additionally, SmartSync performance is aided by a scalable run-time engine which spans the cluster, spins up threads (fibers) on demand, and uses asynchronous IO to process replication tasks (chunks). Batch operations are used for efficient small file, attribute, and data block transfer. Namespace contention avoidance, efficient snapshot utilization, and separation of dataset creation from transfer are salient design features of both the baseline and incremental sync algorithms. Plus, the availability of a pull transfer model can significantly reduce the impact on a source cluster, if needed.

On the caveats and considerations front, SmartSync v1 in OneFS 9.4 does have some notable limitations to be cognizant of. Notably, failover and failback of a SmartSync policy is not currently supported, nor is an option to allow writes on the target cluster. However, for copy policies, the dataset is available for read and write on the target once replication completes, if the ‘--copy-createdataset-on-target=false’ option is specified. These limitations will be lifted in a future OneFS release, but for now, if required, repeat-copy data on the target platform may be copied out of the SmartSync data mover snapshot.

Other interoperability considerations include:

Component Interoperability
ADS/resource forked files With CloudCopy, only the main file is stored; alternate data streams/resource forks are skipped when encountered.
Cloud copy-back Not supported unless data was created by a OneFS Datamover.
Cloud incrementals Unsupported for file->object transfers. One-time copy to/from cloud only.
CloudPools CloudPools Smartlink stub files are not supported.
Compression Compression for replication transfer is not supported.
Failover/failback policy Failover and failback option is not available, nor is an option to allow writes on the target cluster.
File metadata With CloudCopy, only POSIX UID, GID, atime, mtime, and ctime are copied.
File name encoding With CloudCopy, all encodings are converted to UTF-8.
Hadoop TDE SmartSync does not support the replication of the TDE domain and keys, rendering TDE encrypted data on the target cluster inaccessible.
Hard links With CloudCopy, hard links are not preserved, and a file/object is created for each link.
Inline data reduction Inline compressed and/or deduped data is rehydrated, decompressed, and transferred uncompressed to the target cluster.
Large files (4TB to 16TB) Supported up to the cloud provider’s maximum object size. SmartSync policies only connect with target clusters that also have large file support enabled.
RBAC SmartSync administrative access is assigned through the ISI_PRIV_DATAMOVER privilege.
SFSE SFSE containerized small files are unpacked on the source cluster before replication.
SmartDedupe Deduplicated files are rehydrated back to their original size prior to replication.
SmartLock Compliance mode clusters are not supported with SmartSync.
SnapshotIQ Tightly integrated; uses snapshots for incrementals and re-baselining.
Sparse files With CloudCopy, sparse regions of files are written out as zeros.
Special files With CloudCopy, special files are skipped when encountered.
Symbolic links With CloudCopy, symlinks are skipped when encountered.
SyncIQ SmartSync and SyncIQ replication both happily coexist. An active SyncIQ license is required for both.

When it comes to monitoring and troubleshooting SmartSync, there are a variety of diagnostic tools available. These include:

Component Tools Issue
Logging ·         /var/log/isi_dm.log
·         /var/log/messages
·         /ifs/data/Isilon_Support/datamover/transfer_failures/baseline_failures_<jobid>
General SmartSync info and triage.
Accounts ·         isi dm accounts list/view Authentication, trust and encryption.
CloudCopy ·         S3 Browser (e.g. CloudBerry), Microsoft Azure Storage Explorer Cloud access and connectivity.
Dataset ·         isi dm dataset list/view Dataset creation and health.
File system ·         isi get Inspect replicated files and objects.
Jobs ·         isi dm jobs list/view
·         isi_datamover_job_status -jt
Job and task execution, auto-pausing, completion, control, and transfer.
Network ·         isi dm throttling bw-rules list/view
·         isi_dm network ping/discover
Network connectivity and throughput.
Policies ·         isi dm policies list/view
·         isi dm base-policies list/view
Copy and dataset policy execution and transfer.
Service ·         isi services -a isi_dm_d <enable/disable> Daemon configuration and control.
Snapshots ·         isi snapshot snapshots list/view Snapshot execution and access.
System ·         isi dm throttling settings CPU load and system performance.

SmartSync info and errors are typically written to /var/log/isi_dm.log and /var/log/messages, while DM jobs transfer failures generate a log specific to the job ID under /ifs/data/Isilon_Support/datamover/transfer_failures.

Once a policy is running, its job status is reported via ‘isi dm jobs list’. Once complete, job histories are available by running ‘isi dm historical-jobs list’. More details for a specific job can be gleaned from the ‘isi dm jobs view’ command, using the pertinent job ID from the list output above. Additionally, the ‘isi_datamover_job_status’ command, with the job ID as an argument, will also supply detailed information about a specific job.

Once running, a DM job can be further controlled via the ‘isi dm jobs modify’ command, and available actions include cancel, partial-completion, pause, or resume.

If a certificate authority (CA) is not correctly configured on a PowerScale cluster, the SmartSync daemon will not start, even though accounts and policies can still be configured. Be aware that the failed policies will not be reported via ‘isi dm jobs list’ or ‘isi dm historical-jobs list’ since they never started. Instead, an improperly configured CA is reported in the /var/log/isi_dm.log as follows:

Certificates not correctly installed, Data Mover service sleeping: At least one CA must be installed: No such file or directory from dm_load_certs_from_store (/b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/rpc/dm_tls.cpp:197 ) from dm_tls_init (/b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/rpc/dm_tls.cpp:279 ): Unable to load certificate information

Once a CA and identity are correctly configured, the SmartSync service automatically activates. Next, SmartSync attempts a handshake with the target cluster. If the CA or identity is mis-configured, the handshake process fails, and generates an entry in /var/log/isi_dm.log. For example:

2022-06-30T12:38:17.864181+00:00 GEN-HOP-NOCL-RR-1(id1) isi_dm_d[52758]: [0x828c0a110]: /b/mnt/src/isilon/lib/isi_dm/isi_dm_remote/src/acct_mon.cpp:dm_acc tmon_try_ping:348: [Fiber 3778] ping for account guid: 0000000000000000c4000000000000000000000000000000, result: dead

Note that the full handshake error detail is logged if the SmartSync service (isi_dm_d) is set to log at the ‘info’ or ‘debug’ level using isi_ilog:

# isi_ilog -a isi_dm_d --level info+

Valid ilog levels include:

fatal error err notice info debug trace

error+ err+ notice+ info+ debug+ trace+
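For example, to raise isi_dm_d logging to the ‘debug’ level and above while investigating a failing handshake, using the same isi_ilog syntax as above:

# isi_ilog -a isi_dm_d --level debug+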

A copy or repeat-copy policy requires an available dataset for replication before running. If a dataset has not been successfully created prior to the copy or repeat-copy policy job starting for the same base path, the job is paused. In the following example, the base path of the copy policy is not the same as that of the dataset policy, hence the job fails with a “path doesn’t match…” error.

# ls -l /ifs/data/Isilon_support/Datamover/transfer_failures

Total 9

-rw-rw----   1 root  wheel  679  June 29 10:56 baseline_failure_10

# cat /ifs/data/Isilon_support/Datamover/transfer_failures/baseline_failure_10

Task_id=0x00000000000000ce, task_type=root task ds base copy, task_state=failed-fatal path doesn’t match dataset base path: ‘/ifs/test’ != /ifs/data/repeat-copy’:

from bc_task)initialize_dsh (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/ds_base_copy

from dmt_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/ds_base_copy_root_task

from dm_txn_execute_internal (/b/mnt/src/isilon/lib/isi_dm/isi_dm_base/src/txn.cp

from dm_txn_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm_base/src/txn.cpp:2274)

from dmp_task_spark_execute (/b/mnt/src/isilon/lib/isi_dm/isi_dm/src/task_runner.

Once any errors for a policy have been resolved, the ‘isi dm jobs modify’ command can be used to resume the job.

OneFS SmartSync Configuration

In the first blog of this series, we looked at OneFS SmartSync’s architecture and attributes. Next, we’ll delve into the configuration side of things, and walk through a basic setup.

Since there’s no SmartSync WebUI yet in OneFS 9.4, the bulk of the SmartSync configuration is performed via the ‘isi dm’ CLI tool, which contains the following principal subcommands:

Subcommand Description
isi dm accounts Manage Datamover accounts. An active SyncIQ license is required to create Datamover accounts.
isi dm base-policies Manage Datamover base policies. Base policies are templates that provide common values to groups of related concrete Datamover policies. For example, a base policy can be defined to override the run schedule of a concrete policy.
isi dm certificates Manage Datamover certificates.
isi dm config Show Datamover Manual Configuration.
isi dm datasets Show Datamover Dataset Information.
isi dm historical-jobs Manage Datamover historical jobs.
isi dm jobs Manage Datamover jobs.
isi dm policies Manage Datamover policies. Policies can be either:

CREATION – Creates/replicates a dataset, either once or on a schedule.

COPY – Defines a one-time copy of a dataset to or from a remote system

isi dm throttling Manage Datamover bandwidth and CPU throttling. Bandwidth throttling rules can be configured for each Datamover job.

The high-level view of the SmartSync setup and configuration process is as follows:

 

  1. The first step involves installing or upgrading the cluster to OneFS 9.4. SmartSync replication is handled by the ‘isi_dm_d’ service, which is disabled by default and needs to be enabled prior to configuring and using SmartSync. This can be easily accomplished with the following CLI syntax:
# isi services -a isi_dm_d

Service 'isi_dm_d' is disabled.

# isi services -a isi_dm_d enable

The service 'isi_dm_d' has been enabled.

 

  2. SmartSync uses TLS (transport layer security, or SSL) and, as such, requires trust to be established between the source and target clusters. In addition to a Certificate Authority (CA) and Certificate Identity (CI) for authorization and authentication, both clusters also require encryption to be enabled in order for the isi_dm_d service to run. The best practice is to use a local CA to sign each cluster’s CI, but self-signed certificates can be used instead in the absence of a suitable CA.

Before creating accounts, certificates must be generated and copied to the appropriate clusters. The following Certificate Authorities (CA) and trust hierarchies are required:

Requirement Description
TLS certificates ● A mutually authenticated TLS handshake is required. Authorization, authentication, and encryption are provided by TLS certificates.

● TLS certificates are always required for daemon startup and all communication between Datamover engines.

● Encryption can be disabled, but authorization and authentication cannot be disabled.

Certificate Authorities (CA) ● One or more Certificate Authorities (CA) are required on each Datamover system.

● Dell recommends that customers use a new, Datamover-specific CA for signing Datamover identity certificates.

● The CA that signs an identity certificate does not need to be installed on the system that the identity certificate is installed on. Two systems trust each other if they have the CAs that signed each other’s identity certificates.

Identity certificates ● The certificate that provides authentication of the identity claimed.

● Exactly one identity certificate must exist on each Datamover system.

● Identity certificates are signed by one of the CAs deployed on the systems that the system is going to communicate with.

Trust hierarchies ● Two systems trust each other if they have the CAs that signed each other’s identity certificates.

● There is no concept of unidirectional trust—trust is entirely mutual.

The following steps can be used to generate and copy the pertinent TLS certificates to the source and target Datamover clusters:

Step Cluster Action Commands
1 Source Generate Certificate Authority (CA). # openssl genrsa -out ca-s.key 4096

# openssl req -x509 -new -nodes -key ca-s.key -sha256 -days 1825 -out ca-s.pem

 

2 Source Copy source cluster’s CA to target cluster. # scp ca-s.pem [Target Cluster IP]:/root
3 Source Generate Certificate Identity (CI). # openssl genrsa -out identity-s.key 4096

# openssl req -new -key identity-s.key -out identity-s.csr

4 Source Create a CI extension file on source cluster. # cat << EOF > identity-s.ext
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage=digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
EOF

5 Source Sign source cluster’s CI with source cluster’s CA. # openssl x509 -req -in identity-s.csr -CA ca-s.pem -CAkey ca-s.key -CAcreateserial -out identity-s.crt -days 825 -sha256 -extfile identity-s.ext
6 Target Generate a CA on target cluster. # openssl genrsa -out ca-t.key 4096

# openssl req -x509 -new -nodes -key ca-t.key -sha256 -days 1825 -out ca-t.pem

7 Target Copy target cluster CA to source cluster. # scp ca-t.pem [Source Cluster IP]:/root
8 Target Generate CI on target cluster. # openssl genrsa -out identity-t.key 4096

# openssl req -new -key identity-t.key -out identity-t.csr

9 Target Create a CI extension file on target cluster. # cat << EOF > identity-t.ext
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage=digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
EOF
10 Target Sign this CI with target cluster’s CA. # openssl x509 -req -in identity-t.csr -CA ca-t.pem -CAkey ca-t.key -CAcreateserial -out identity-t.crt -days 825 -sha256 -extfile identity-t.ext
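Before installing the certificates, each signed identity can be sanity-checked against its CA with standard openssl tooling. For example, on the source cluster:

# openssl verify -CAfile ca-s.pem identity-s.crt

identity-s.crt: OK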

 

  3. Next, the various CAs and CIs are installed across the two clusters.
Step Cluster Action Command
1 Source Install source cluster’s CA. # isi dm certificates ca create "$PWD"/ca-s.pem --name <source-cluster-ca>
2 Source Install target cluster’s CA. # isi dm certificates ca create "$PWD"/ca-t.pem --name <target-cluster-ca>
3 Source Install source cluster’s CI. # isi dm certificates identity create "$PWD"/identity-s.crt --certificate-key-path "$PWD"/identity-s.key --name <source-cluster-identity>
4 Target Install target cluster’s CA. # isi dm certificates ca create "$PWD"/ca-t.pem --name <target-cluster-ca>
5 Target Install source cluster’s CA. # isi dm certificates ca create "$PWD"/ca-s.pem --name <source-cluster-ca>
6 Target Install target cluster’s CI. # isi dm certificates identity create "$PWD"/identity-t.crt --certificate-key-path "$PWD"/identity-t.key --name <target-cluster-identity>

Note that the certificates must be located under /ifs when performing the import, otherwise an error similar to the following will be returned:

Invalid certificate path: /root/ca-s.pem [CERTS_CERT_INVALID]
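A simple remedy is to stage the certificate files under /ifs before running the import. For example, using an illustrative staging directory:

# mkdir -p /ifs/data/Isilon_Support/dm_certs

# cp ca-s.pem ca-t.pem identity-s.crt identity-s.key /ifs/data/Isilon_Support/dm_certs

# cd /ifs/data/Isilon_Support/dm_certs

With the working directory under /ifs, the "$PWD"-based import commands above will resolve to valid paths.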

At this point, encryption is now configured on the source and target clusters.

 

  4. By default, a local account, ‘DM Local Account’, is already configured. The ‘isi dm accounts list’ command can be used to display it:
# isi dm accounts list

ID                                               Name             URI             Account Type  Auth Mode   Local Network Pool  Remote Network Pool

----------------------------------------------------------------------------------------------------------------------------------------------------

0060167118de5018ab62800ce595db9bdb40000000000000 DM Local Account dm://[::1]:7722 DM            CERTIFICATE

----------------------------------------------------------------------------------------------------------------------------------------------------

Total: 1

The following steps illustrate configuring a push policy from the source cluster. Note that a single account can be used for both a push and pull policy, depending on the replication topology. After encryption is configured, the next step is to add a replication account to the source cluster, pointing replication to a target cluster.

On the source cluster, add a replication account using the ‘isi dm accounts create’ CLI command:

# isi dm accounts create DM dm://[Target Cluster IP]:7722 'target-acct'

If desired, local and remote SmartConnect pools can be specified for the source and target clusters, respectively, with the --local-network-pool and --remote-network-pool flags.
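For example, the following illustrative invocation (with hypothetical ‘subnet0:pool0’ pool names, which should be substituted with the clusters’ actual network pools) restricts replication traffic to specific pools:

# isi dm accounts create DM dm://[Target Cluster IP]:7722 'target-acct' --local-network-pool subnet0:pool0 --remote-network-pool subnet0:pool0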

The ‘isi dm accounts list’ command can be used to verify successful account creation:

# isi dm accounts list

ID                                               Name             URI             Account Type  Auth Mode   Local Network Pool  Remote Network Pool

----------------------------------------------------------------------------------------------------------------------------------------------------

f8f21e66c32476412b621d182495f22d3e31000000000000 DM Local Account dm://[::1]:7722 DM            CERTIFICATE

000c38b4ga22e3810d53ff27449b285b98c8000000000000 rmt-acct          dm://10.20.50.130:7722 DM

----------------------------------------------------------------------------------------------------------------------------------------------------

Total: 2

In the above, the ‘DM Local Account’ is the source cluster’s account, and ‘rmt-acct’ is the target cluster’s account, plus IP address.

 

  5. Two policies are needed here. First, the ‘isi dm policies create’ CLI command can be run with the ‘CREATION’ policy option in order to create a dataset. The syntax for this command to run at ‘normal’ priority is:
# isi dm policies create [Policy Name] NORMAL true CREATION --creation-account-id=[DM local account] --creation-base-path= --creation-dataset-retention-period= --creation-dataset-reserve= --creation-dataset-expiry-action=DELETE --recurrence="cron expression" --start-time="YYYY-MM-DD HH:MM:SS"

The configuration parameters for the ‘isi dm policies create’ command include:

Parameter Description
policy-type Specifies the type of policy. Options are:

● CREATION —the process of creating the dataset

● COPY —used for one-time data transfers

● REPEAT_COPY —used for repeated transfers

● EXPIRATION —how long the snapshot is stored

priority Assigns a priority to this policy. The options are: LOW | NORMAL | HIGH.
true Specifies that the policy is enabled.
creation-account-id The DM local account ID specified in the isi dm accounts list command.
creation-base-path For SmartSync this specifies the directory path or file for the dataset. For cloud copy, this specifies the object store key prefix.
creation-dataset-retention-period How long the dataset is retained in seconds before expiration.
creation-dataset-reserve How many datasets to keep in reserve, protected from expiration, irrespective of the creation-dataset-retention-period.
creation-dataset-expiry-action Specifies what happens with the dataset after expiration. With OneFS 9.4, the only expiration option is DELETE.
recurrence How often the policy runs.
start-time The date and time at which the policy runs. If a prior date is entered, the policy runs immediately.
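Note that the ‘--recurrence’ value takes a standard cron expression. For example, a hypothetical nightly run at 10:30 PM would be expressed as:

--recurrence="30 22 * * *"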

The following CLI command creates a Datamover CREATION policy named ‘createTestDataset’. The policy creates a dataset with the base filepath /ifs/test/dm/data1. The creation account is the local Datamover account. The dataset expires 1,500 seconds (25 minutes) after its creation, after which it is deleted. The policy starts running December 1, 2022, at 12:00 PM.

# isi dm policies create --name=createTestDataset --enabled=true --priority=low --policy-type=CREATION --creation-base-path=/ifs/test/dm/data1 --creation-account-id=local --creation-dataset-expiry-action=DELETE --creation-dataset-retention-period=1500 --start-time "2022-12-01 12:00:00"

To list the Datamover policies:

# isi dm policies list

ID   Validity  Name              Enabled  Disabled By DM  Priority  Policy Type  Base Policy ID  Date Times  Recurrence  Start Time          Parent Exec Policy ID

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 Yes createTestDataset Yes No LOW    CREATION - - - 2022-12-01 12:00:00 -                 

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

The ‘isi dm policies view’ CLI syntax can be used to inspect details of a policy, in this case ‘createTestDataset’ with an ID of 1 above:

# isi dm policies view 1

                        ID: 1

                  Validity: Yes

                      Name: createTestDataset

                   Enabled: Yes

            Disabled By DM: No

                  Priority: LOW

                   Run Now: No

            Base Policy ID: -

     Parent Exec Policy ID: -

                  Schedule

                    Date Times: -

                    Recurrence: -

                    Start Time: 2022-12-01 12:00:00

Policy Specific Attributes

                        Policy Type: CREATION

                    Creation Policy

                           Account ID: local

                            Base Path: /ifs/test/dm/data1

                            Retention

               Dataset Retention Period: 1500

                        Dataset Reserve: 2

                  Dataset Expiry Action: DELETE

In addition to the newly configured CREATION policy, a COPY policy is also required to perform the data move. This can be created as follows:

# isi dm policies create archive-restore NORMAL true COPY --copy-source-base-path=/ifs/test/dm/data1 --copy-create-dataset-on-target=true --copy-base-source-account-id=f8f21e66c32476412b621d182495f22d3e31000000000000 --copy-base-target-account-id=000c38b4ga22e3810d53ff27449b285b98c8000000000000 --copy-base-target-base-path=/ifs/test/dm/data1 --copy-base-target-dataset-type=FILE --copy-base-dataset-retention-period=3600 --copy-base-dataset-reserve=2 --copy-base-policy-dataset-expiry-action=DELETE

Confirm both the COPY and CREATION Datamover policies are present:

# isi dm policies list

ID   Validity  Name              Enabled  Disabled By DM  Priority  Policy Type  Base Policy ID  Date Times  Recurrence  Start Time          Parent Exec Policy ID

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 Yes createTestDataset Yes No LOW    CREATION -      -     -     -                 

2 Yes archive-restore   Yes No NORMAL COPY     -      -     -     –

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

  6. The next step is to run the CREATION policy (ID = 1) in order to create the dataset:
# isi dm policies modify 1 --run-now=true

The running job can be inspected as follows:

# isi dm jobs list

ID Job Type Job Priority Job Policy ID Job Control Request Job Start Time Job End Time Job State Job State Flags

----------------------------------------------------------------------------------------------------------------

201 DATASET_CREATION_JOB NORMAL  1  NONE  2022-06-23T14:52:22 2022-06-23T14:53:04 finishing   No failure

----------------------------------------------------------------------------------------------------------------

Total: 1

Once the job has completed, the ‘isi dm historical-jobs list’ CLI command allows the dataset creation policy’s status to be queried.

# isi dm historical-jobs list

ID Job Type Job Priority Job Policy ID Job Control Request Job Start Time Job End Time Job State Job State Flags

----------------------------------------------------------------------------------------------------------------

201 DATASET_CREATION_JOB NORMAL  1  NONE  2022-06-23T14:52:22 2022-06-23T14:54:51 finished   No failure

----------------------------------------------------------------------------------------------------------------

Total: 1

Finally, run the COPY policy (ID = 2) to replicate the dataset from the source to target cluster:

# isi dm policies modify 2 --run-now=true

# isi dm jobs list

ID Job Type Job Priority Job Policy ID Job Control Request Job Start Time Job End Time Job State Job State Flags

----------------------------------------------------------------------------------------------------------------

202 DATASET_BASELINE_COPY_JOB NORMAL  2  NONE   2022-06-23T14:55:11 2022-06-23T14:56:48 running   No failure

----------------------------------------------------------------------------------------------------------------

Total: 1

When the COPY job has completed, the ‘historical-jobs’ output will now show both the CREATION and COPY job details:

# isi dm historical-jobs list

ID Job Type Job Priority Job Policy ID Job Control Request Job Start Time Job End Time Job State Job State Flags

----------------------------------------------------------------------------------------------------------------

202 DATASET_BASELINE_COPY_JOB NORMAL  2  NONE   2022-06-23T14:55:11 2022-06-23T14:57:06 finished   No failure

201 DATASET_CREATION_JOB NORMAL  1  NONE  2022-06-23T14:52:22 2022-06-23T14:54:51 finished   No failure

----------------------------------------------------------------------------------------------------------------

Total: 2

Once created, the new dataset can be inspected via the ‘isi dm datasets list’ command output:

# isi dm datasets list

ID Dataset State Dataset Type Dataset Base Path Dataset Subpaths Dataset Creation Time Dataset Expiry Action Dataset Retention Period

-------------------------------------------------------------------------------------------------------------------------------------

1  COMPLETE FILE  /ifs/test/dm/data1   - 2022-06-23T14:54:51  DELETE  2022-06-23T15:54:51

-------------------------------------------------------------------------------------------------------------------------------------

Total: 1

The details of a specific Datamover policy can be viewed at any time:

# isi dm policies view <policy ID>

Note that the procedure above configures push replication of a dataset from a source to target. Conversely, to perform a pull from the target cluster, the replication account is instead added to the target cluster, and with the source cluster’s IP used:

# isi dm accounts create DM dm://[Source Cluster IP]:7722 'source-acct'

Object data replication to public cloud or Dell ECS targets can also be configured with the ‘isi dm accounts create’ CLI command, but does require a couple of additional parameters, namely:

Parameter Description
Object store type AWS_S3, Azure, or ECS_S3
URI {http,https}://hostname:port/bucketname
Auth Access ID, Secret Key
Proxy Optional proxy information

For example:

# isi dm accounts create AWS_S3 https://aws-host:5555/bucket dm-account-name --authmode CLOUD --access-id aws-access-id --secret-key aws-secret-key
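Similarly, as a purely illustrative sketch, a Dell ECS account would follow the same pattern with the ECS_S3 object store type (the hostname, port, bucket, and credentials below are placeholders):

# isi dm accounts create ECS_S3 https://ecs-host:9021/bucket dm-ecs-account --authmode CLOUD --access-id ecs-access-id --secret-key ecs-secret-key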

Be aware that a dataset must be available before a copy, or repeat-copy data replication policy runs, or the policy will fail.

Behind the scenes, dataset creation leverages a SnapshotIQ snapshot, which can be inspected via the ‘isi snapshot list’ command. These DM dataset snapshots are easily recognizable due to their ‘isi_dm’ prefixed naming convention.
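For example, the Datamover-generated snapshots can be picked out of the overall snapshot inventory as follows:

# isi snapshot snapshots list | grep isi_dm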

In the final article in this series, we’ll take a look at SmartSync management, monitoring, and troubleshooting.

OneFS SmartSync Datamover

Amongst the bevy of new functionality introduced in OneFS 9.4 is SmartSync v1, and we’ll be taking a look at this new replication product over the course of the next couple of blog articles.

So, the new SmartSync Datamover enables flexible data movement and copying, incremental resyncs, and push and pull data transfer of file data between PowerScale clusters. Additionally, SmartSync CloudCopy also enables the copying of file-to-object data from a source cluster to a cloud object storage target. Cloud object targets include AWS S3 and Microsoft Azure, as well as Dell ECS.

Having a variety of target destination options allows multiple copies of a dataset to be stored across locations and regions, both on and off-prem, providing increased data resilience and the ability to rapidly recover from catastrophic events.

CloudCopy uses HTTP as the data replication transport layer to cloud storage, while cluster-to-cluster SmartSync leverages a proprietary RPC-based messaging system. In addition to replicating the actual data, SmartSync also preserves common file attributes, including Windows ACLs, POSIX permissions and attributes, creation times, extended attributes, alternate data streams, etc.

In order to use SmartSync, SyncIQ must be licensed and active across all nodes in the cluster. Additionally, a cluster account with the ISI_PRIV_DATAMOVER privilege is needed in order to configure and run SmartSync Datamover policies. While file-to-file replication requires SmartSync to be running on both source and target clusters, CloudCopy transfers to/from cloud storage only require the SmartSync Datamover on the OneFS 9.4 cluster; no Datamover is needed on the cloud systems. Be aware that the inbound TCP 7722 port must be open across any intermediate gateways and firewalls to allow SmartSync replication to occur.

Under the hood, replication is handled by the ‘isi_dm_d’ service, which is disabled by default, and needs to be enabled prior to configuring and using SmartSync. SmartSync uses TLS (transport layer security, or SSL) and, as such, requires trust to be established between the source and target clusters. In addition to a Certificate Authority (CA) and Certificate Identity (CI) for authorization and authentication, both clusters also require encryption to be enabled in order for the isi_dm_d service to run. The best practice is to use a local CA to sign each cluster’s CI,  but self-signed certificates can be used instead in the absence of a suitable CA.

The SmartSync Datamover has a purpose-built, integrated job execution engine, and Datamovers execute on each cluster node in cooperative mode.

Shared key-value stores (KVS) are used for job and task distribution, with extra indexing implemented for quick lookups by task state, task type, and alive time. There are no dependencies or communication between tasks, and job cancellation and pausing are handled by posting a ‘request’ into a job record (request polling).

Within the SmartSync hierarchy, accounts define the connections to remote systems, policies define the replication configurations, and jobs perform the work:

Component Details
Accounts Datamover accounts:

–          URI, e.g. dm://remotenas.isln.com:7722

–          Local and remote network pools defining nodes/interfaces to use for data transfer

–          Client and server certificates to enable TLS

CloudCopy accounts:

–          Account type (AWS S3, ECS S3, Azure)

–          URI, e.g. https://cloudcluster.isln.com:9002/cloudbucket

–          Credentials

Policies –          Dataset creation policy

–          Dataset copy policy

–          Dataset repeat copy policy

–          Dataset expiration policy

Jobs Runtime entities created based on policy schedules. There are two major types of data transfer jobs:

–         Baseline jobs for initial transfers and

–         Incremental jobs for subsequent transfers between FILE Datamover systems.

Tasks Spawned by jobs and are the individual chunks of work that a job must perform. No 1-to-1 relationship to their associated files.

 

SmartSync Datasets are self-contained, independent entities. Once created, they’re assigned globally-unique IDs, and backed by file system snapshots on PowerScale. Parent-child relationships are used for incremental transfers, and a handshake determines the exact changeset to be transferred.

As demand for replication on a source cluster increases, the additional compute and network load needs to be considered. Multiple targets can generate a significant demand on the source cluster, with replication traffic contending with client workloads as data is pushed to the target. Fortunately, SmartSync allows a target cluster to pull the dataset, thereby minimizing the resource impacts on the source.

For single-source dataset environments with numerous targets, pull replication can be incredibly useful, allowing the source cluster’s resources to remain focused on client IO. In addition to both push and pull replication, SmartSync also supports a variety of topologies, such as fan-out, chaining, etc.

SmartSync provides enhanced replication failure resilience, minimizing replication times even when a job runs into an error. Rather than failing an entire replication job if an error is encountered, requiring a manual restart, SmartSync instead places the job into a paused state, and presents three options:

  1. Cancel the job altogether.
  2. Resolve the errors and resume the job.
  3. Complete a partial replication.

With option 3, the portion of the dataset already transferred is retained, thereby decreasing the subsequent job’s work and execution time.

The SmartSync architecture intentionally decouples source cluster snapshot creation (dataset creation) from the actual replication transfer to the target, allowing each to run independently via separately configured policies. This helps mitigate the disruptive chain effect of a failure early in the snapshot process. Additionally, SmartSync offers parent-child policies, which launch a replication job only after successful snapshot creation, providing an alternative to recurrence in situations where it’s unclear how long a previous policy may take to complete.

With SmartSync, ‘re-baselining’ (a full resync) is not required for source-target clusters which already contain an earlier version of a dataset. For example, consider a three-cluster DR topology in which cluster A replicates to B, and B replicates to C.

A parent-child relationship means that, if cluster B becomes unavailable, the cluster A to C policy would not require a new baseline. Instead, clusters A and C’s datasets are compared via a handshake, enabling only the changed data blocks to be transferred, thereby minimizing replication overhead. This is particularly beneficial for environments with large datasets, significantly shrinking RPO and RTO times and increasing DR readiness.

When setting up a SmartSync three-way relationship, be sure to use a single dataset creation policy when configuring datasets on the same path. If there are separate dataset creation policies for each relationship, B and C will have different datasets (snapshots) with different dataset IDs. In this case, if A dies, it would be impossible to establish an incremental sync relationship between B and C on those datasets, since the incremental transfer won’t be able to ‘connect’ the dataset IDs between B and C.

SmartSync allows subsequent incremental data movement by managing and re-transferring failed file transfers. Similarly, Dataset reconnect enables systems with common base datasets to establish instant incremental syncs. SmartSync also proactively locks the SnapshotIQ snapshots it generates, providing better protection and separation between Datamover and other cluster snapshots.

Other SmartSync features and functionality includes:

Feature Details
Bandwidth throttling Set of netmask rules. Limits are per-node.
CPU throttling Allowed and Backoff CPU percentages.
Base policies Template providing common values to groups of related policies (schedule, source base path, enable/disable, etc). I.e., disabling a base policy affects all linked concrete policies.
Concrete policy Predefined set of fields from the base policy.
Incremental reconnect Ability to run incrementals between systems with common base datasets but no prior replication relationship.
Unconnected nodes (NANON) Active accounts are monitored by each node. No work allocation to nodes without network access.
Snapshot locking Avoids accidental snapshot deletion, with subsequent re-baselining.

 


Performance-wise, SmartSync is powered by a scalable run-time engine, spanning the cluster, which spins up threads (fibers) on demand and uses asynchronous IO to process replication tasks (chunks). Batch operations are used for efficient small file, attribute, and data block transfer. Namespace contention avoidance, efficient snapshot utilization, and separation of dataset creation from transfer are salient design features of both the baseline and incremental sync algorithms. Plus, the availability of a pull transfer model can significantly reduce the impact on a source cluster, if needed.


On the CloudCopy side, the SmartSync copy format provides regular file representation, browsability, and usability of file system data in the cloud. That said, as compared to the file-to-file Datamover, there are certain CloudCopy considerations and limitations to be aware of, such as the lack of incremental copies. These include:

CloudCopy Caveats Details
ADS files Skipped when encountered.
Hardlinks An object will be created for each link (i.e. links are not preserved).
Symlinks Skipped when encountered.
Directories An object is created for each directory.
Special files Skipped when encountered.
Metadata Only POSIX mode bits, UID, GID, atime, mtime, ctime are preserved.
Filename encodings Converted to UTF-8.
Path Path relative to root copy directory is used as object key.
Large files An error is returned for files larger than the cloud provider’s maximum object size.
Long filenames File names exceeding 256 bytes are compressed.
Long paths Junction points are created to redirect where objects are stored when paths exceed 1024 bytes.
Sparse files Sparse sections are not preserved and are written out fully as zeros.

As mentioned earlier, there are also some prerequisites to address before running SmartSync. First, the source and target(s) must be running OneFS 9.4, with SyncIQ licensed across the cluster. Additionally, the identity certificates and a shared CA must be present in order to communicate with a peer Datamover.

In the next article in this series, we’ll turn our attention to the configuration and use of SmartSync.

OneFS System Partition Hygiene

Like most UNIX-derived operating systems, OneFS uses several system partitions in addition to the /ifs data storage partition, including:

Partition Description
/ Root partition containing all the data to start up and run the system, and which contains the base OneFS software image.
/dev Device files partition. Drives, for example, are accessed through block device files such as /dev/ad0.
/ifs Clustered filesystem partition, which spans all of a cluster’s nodes. Includes /ifs/.ifsvar.
/usr Partition for user programs.
/var Partition to store variable data, such as log files, etc. In OneFS, this partition is mostly used for /var/run and /var/log.
/var/crash The crash partition is configured for binary dumps.

One advantage of having separate partitions rather than one big chunk of space is that different parts of the OS are somewhat protected from each other. For example, if /var fills up, it doesn’t affect the root / partition.

While OneFS automatically performs the vast majority of its system housekeeping, occasionally the /var partition on one or more of a cluster’s nodes will fill up, typically as the result of heavy log writing activity and/or the presence of corefile(s). If /var reaches 75%, 85%, or 95% of capacity, a CELOG event is automatically fired and an alert sent.

The following CLI command will provide a view of /var usage across the cluster (note the ‘du -k’, so that the kilobyte counts sort numerically):

# isi_for_array -s "du -k /var | sort -n | tail -n 10"

The typical resolution for this scenario is to rotate the logfiles under /var/log. If, after log rotation, the /var partition returns to a normal usage level, reviewing the list of recently written logs will usually determine if a specific log is rotating frequently/excessively. Log rotation will usually resolve the full-partition issue by compressing or removing large logs and old logs, thereby automatically reducing partition usage.
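Since OneFS log rotation is driven by newsyslog, a rotation pass can typically be forced immediately rather than waiting for the next scheduled run. A hedged example, assuming the standard FreeBSD newsyslog flags (‘-F’ forces rotation, ‘-v’ reports what is rotated):

# newsyslog -Fv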
The ‘df -i’ CLI command, run on the node that reported the error, will display the details of the /var partition. For example:

# df -i | grep var | grep -v crash
Filesystem 1K-blocks Used Avail Capacity iused ifree %iused Mounted on
/dev/mirror/var0 1013068 49160 882864 5% 1650 139276  92% /var

If the percentage of inodes used (‘%iused’) is 90% or higher, as above, the recommendation is to reduce the number of files in the /var partition. To find files that do not belong in /var, first run the following ‘find’ command on the node that generated the alert. This will display any files in the /var partition greater than approximately 5 MB (10,000 512-byte blocks) in size:

# find -x /var -type f -size +10000 -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

The output will show any large files that do not typically belong in the /var partition. These could include artifacts such as OneFS install packages, cluster log gathers, packet captures, or other user-created files. Remove such files or move them to the /ifs directory. If you are unsure which, if any, files are viable candidates for removal, contact Dell Support for assistance.

The ‘fstat’ CLI command is a useful tool for listing the open files on a node or in a directory, or for displaying the files opened by a particular process. This information can be invaluable for determining whether a process is holding a large file open. For example, a node’s open files can be displayed as follows:

# fstat

A list of the open files can help in monitoring the processes that are writing large files.

Using the ‘-f’ flag narrows the fstat output to a particular directory:

# fstat -f <directory_path>

Similarly, to list the files opened by a particular process:

# fstat -p <pid>
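For example, to check what the lwio daemon is holding open, its PID can first be found with ‘ps’ and then passed to fstat:

# ps -auxww | grep lwio | grep -v grep

# fstat -p 98281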

If there are no open files found in the /var directory, it is entirely possible that a large file has become unlinked and is consuming space because one or more processes have the file open. The fstat command can be used to confirm this, as follows:

# fstat -f /var | grep var

If a process is holding a file open, output similar to the following is displayed:

root lwio 98281 4 /var 69612 -rw------- 100120000 rw

Here, the lwio daemon (PID 98281) has a file open that is approximately 100 MB in size (100120000 bytes). The file’s inode number, 69612, can be used to retrieve its name:

# find -x /var -inum 69612 -print

/var/log/lwiod.log

If a process is holding a large file open and its inode cannot be found, the file is considered to be ‘unlinked’. In this case, the recourse is typically to restart the offending process. Note that, before stopping and restarting a process, consider any possible negative consequences. For example, stopping the OneFS SMB daemon, lwiod, in the example above would potentially disconnect SMB users.

If neither of the suggestions above resolves the issue, the logfile’s rollover file size limit can be reduced and the file itself compressed. To do this, first create a backup of the /etc/newsyslog.conf file as follows:

# cp /etc/newsyslog.conf /ifs/newsyslog.conf
# cp /etc/newsyslog.conf /etc/newsyslog.bak

Next, open the /ifs/newsyslog.conf file in emacs, vi, or the editor of choice, and locate the following line:

/var/log/wtmp 644 3 * @01T05 B

Change the line to:

/var/log/wtmp 644 3 10000 @01T05 ZB

These changes instruct the system to roll over the /var/log/wtmp file when it reaches 10 MB and then to compress the file with gzip. Save and close the /ifs/newsyslog.conf file, and then run the following command to copy the updated ‘newsyslog.conf’ file to the remaining nodes on the cluster:

# isi_for_array 'cp /ifs/newsyslog.conf /etc/newsyslog.conf'

If other logs are rotating frequently, or if the preceding solutions do not resolve the issue, run the isi_gather_info command to gather logs, and then contact Dell Support for assistance.

There are several options available to stop processes and create a corefile under OneFS:

CLI Command Description
gcore Generate a core dump file of the running process without actually killing it.
kill -6 Stop a single process and generate a core dump file.
killall -6 Stop all processes of a given name and generate core dump files.
kill -9 Force a process to stop.

The ‘gcore’ CLI command can generate a core dump file from a running process without actually killing it. First, the ‘ps’ CLI command can be used to find and display the process ID (PID) for a running process:

# ps -auxww | egrep 'USER|lsass' | grep -v grep

USER     PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
root   68547  0.0  0.3 150464 38868 ??   S    Sun11PM   0:06.87 lw-container lsass (lsass)

In the above example, the PID for the lsass process is 68547. Next, the ‘gcore’ CLI command can be used to generate a core dump of this PID and write the output to a location of choice, in this example a file aptly named ‘lsass.core’.

 # gcore -c /ifs/data/Isilon_Support/lsass.core 68547

# ls -lsia /ifs/data/Isilon_Support/lsass.core
4297467006 58272 -rw-------     1 root  wheel  239280128 Jun 10 19:10 /ifs/data/Isilon_Support/lsass.core

Typically, the /ifs/data/Isilon_Support directory provides an excellent location to write the coredump to. Clearly, /var is not a great choice, since the partition is likely already full.

Finally, when the coredump has been written, the ‘isi_gather_info’ tool can be used to coalesce both the core file and pertinent cluster logs into a convenient tarfile.

# isi_gather_info --local-only -f /ifs/data/Isilon_Support/lsass.core

# ls -lsia /ifs/data/Isilon_Support | grep -i gather
4298180122    26 -rw-r--r-- +    1 root  wheel         19 Jun 10 15:44 last_full_gather

The resulting log set, ‘/ifs/data/Isilon_Support/last_full_gather’, is then ready for upload to Dell Support for further investigation and analysis.