OneFS and Software Journal Mirroring – Management and Troubleshooting

Software journal mirroring (SJM) in OneFS 9.11 delivers critical file system support to meet the reliability requirements for PowerScale platforms with high capacity flash drives. By keeping a synchronized and consistent copy of the journal on another node, and automatically recovering the journal from it upon failure, enabling SJM can reduce the node failure rate by around three orders of magnitude – while also boosting storage efficiency by negating the need for a higher level of on-disk FEC protection.

SJM is enabled by default for the applicable platforms on new clusters. So for clusters including F710 or F910 nodes with large QLC drives that ship with 9.11 installed, SJM will be automatically activated.

SJM adds a mirroring scheme, which provides the redundancy for the journal’s contents.

This is where /ifs updates are sent to a node’s local, or primary, journal as usual. But they’re also synchronously replicated, or mirrored, to another node’s journal, too – referred to as the ‘buddy’.

Every node in an SJM-enabled pool is dynamically assigned a buddy node, and if a new SJM-capable node is added to the cluster, it’s automatically paired up with a buddy. These buddies are unique for every node in the cluster.

SJM’s automatic recovery scheme can use a buddy journal’s contents to re-form the primary node’s journal. And this recovery mechanism can also be applied manually if a journal device needs to be physically replaced.

The introduction of SJM slightly changes the node recovery options in OneFS 9.11, which now include an additional method for restoring the journal.

This means that if a node within an SJM-enabled pool ends up at the ‘stop_boot’ prompt, before falling back to SmartFail, the available options in order of desirability are:

Order Option Description
1 Automatic journal recovery OneFS first attempts to recover automatically from the local journal copy.
2 Automatic journal mirror recovery OneFS attempts a SyncBack recovery from the buddy node’s journal.
3 Manual SJM recovery Dell Support can attempt a manual SJM recovery, particularly in scenarios where a bug or issue in the software journal mirroring feature itself is inhibiting automatic recovery.
4 SmartFail OneFS quarantines the node, places it into a read-only state, and reprotects the data by distributing it to other devices.

While SJM is available upon upgrade commit to OneFS 9.11, it is not automatically activated. So any F710 or F910 SJM-capable node pools that originally shipped with OneFS 9.10 installed will require SJM to be manually enabled after their upgrade to 9.11.

If SJM is not activated on a cluster with capable node pools running OneFS 9.11, a CELOG alert will be raised, encouraging the customer to enable it. This CELOG alert contains information about the administrative actions required to enable SJM. Additionally, a pre-upgrade check in OneFS 9.11 prevents any existing cluster with nodes containing 61TB drives that shipped with OneFS 9.9 or older installed from upgrading directly to 9.11 until the affected nodes have been USB-reimaged and their journals reformatted.

For SJM-capable clusters that do not have journal mirroring enabled, the CLI command (and platform API endpoint) to activate SJM operates at the nodepool level. Each SJM-capable pool will need to be enabled separately via the ‘isi storagepool nodepools modify’ CLI command, supplying the pool name and the new ‘--sjm-enabled’ argument:

# isi storagepool nodepools modify <name> --sjm-enabled true

Note that this new syntax is applicable only to nodepools containing SJM-capable nodes.
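For example, to enable SJM on a hypothetical SJM-capable pool named ‘f710_qlc_pool’ (the pool name here is purely illustrative):

# isi storagepool nodepools modify f710_qlc_pool --sjm-enabled true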

Similarly, to query the SJM status on a cluster’s nodepools:

# isi storagepool nodepools list -v | grep -e 'SJM' -e 'Name:'

And to check a cluster’s nodes for SJM capabilities:

# isi storagepool nodetypes list -v | grep -e 'Product' -e 'Capable'

So there are a couple of considerations with SJM that should be borne in mind. As mentioned previously, any SJM-capable nodes that are upgraded from OneFS 9.10 will not have SJM enabled by default. If, after upgrading to 9.11, a capable pool remains in an SJM-disabled state, a CELOG warning will be raised informing that the data may be under-protected and hence its reliability lessened. The CELOG event will include the recommended remedial action: administrative intervention is required either to enable SJM on the affected node pool (ideally), or alternatively to increase the protection level to meet the same reliability goal.

So how impactful is SJM to protection overhead on an SJM-capable node pool/cluster? The following table shows the protection layout, both with and without SJM, for the F710 and F910 nodes containing 61TB drives:

Node type Drive Size Journal Mirroring +2d:1n +3d:1n1d +2n +3n
F710 61TB SDPM 3 4-6 7-34 35-252
F710 61TB SDPM SJM 4-16 17-252
F910 61TB SDPM 3 5-19 20-252
F910 61TB SDPM SJM 3-16 17-252

Taking the F710 with 61TB drives example above, without SJM, +3n protection is required at 35 nodes and above. In contrast, with SJM enabled, the +3d:1n1d protection level suffices all the way up to the current maximum cluster size of 252 nodes.

Generally, beyond enabling it on any capable pools after upgrading to 9.11, SJM just does its thing and does not require active administration or management. However, with a corresponding buddy journal for every primary node, there may be times when a primary and its buddy become unsynchronized. Clearly, this would mean that mirroring is not functioning correctly and a SyncBack recovery attempt would be unsuccessful. OneFS closely monitors for this scenario, and will fire a CELOG event to alert the cluster admin in the event that journal syncing and/or mirroring are not working properly.

Possible causes for this include the buddy remaining disconnected, or in a read-only state, for a protracted period of time, or a software bug or issue that is preventing successful mirroring. This results in a CELOG warning being raised for the buddy of the specific node, with the suggested administrative action included in the event contents.

Also, be aware that SJM-capable and non-SJM-capable nodes can be placed in the same nodepool if needed, but only if SJM is disabled on that pool – and the protection increased correspondingly.

The following chart illustrates the overall operational flow of SJM:

SJM is a core file system feature, so the bulk of its errors and status changes are written to the ubiquitous /var/log/messages file. However, since the buddy assignment mechanism is a separate component with its own user-space daemon, its notifications and errors are sent to a dedicated ‘isi_sjm_budassign_d’ log. This logfile is located at:

/var/log/isi_sjm_budassign_d.log
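To follow buddy assignment activity in real time on a node, this log can simply be tailed. For example:

# tail -f /var/log/isi_sjm_budassign_d.log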

OneFS and Software Journal Mirroring – Architecture and Operation

In this next article in the OneFS software journal mirroring series, we will dig into SJM’s underpinnings and operation in a bit more depth.

With its debut in OneFS 9.11, the current focus of SJM is the all-flash F-series nodes containing either 61TB or 122TB QLC SSDs. In these cases, SJM dramatically improves the reliability of these dense drive platforms with journal fault tolerance. Specifically, it maintains a consistent copy of the primary node’s journal on a separate node. By automatically recovering the journal from this mirror, SJM is able to substantially reduce the node failure rate without the need for increased FEC protection overhead.

SJM is enabled by default for the applicable platforms on new clusters. So for clusters including F710 or F910 nodes with large QLC drives that ship with 9.11 installed, SJM will be automatically activated.

SJM adds a mirroring scheme, which provides the redundancy for the journal’s contents. This is where /ifs updates are sent to a node’s local, or primary, journal as usual. But they’re also synchronously replicated, or mirrored, to another node’s journal, too – referred to as the ‘buddy’.

Architecturally, SJM’s main components and associated lexicon are as follows:

Item Description
Primary Node with a journal that is co-located with the data drives that the journal will flush to.
Buddy Node with a journal that stores sufficient information about transactions on the primary to restore the contents of a primary node’s journal in the event of its failure.
Caller Calling function that executes a transaction. Analogous to the initiator in the 2PC protocol.
Userspace journal library Saves the backup, restores the backup, and dumps journal (primary and buddy).
Buddy reconfiguration system Enables buddy reconfiguration and stores the mapping in buddy map via buddy updater.
Buddy mapping updater Provides interfaces and protocol for updating buddy map.
Buddy map Stores buddy map (primary <-> buddy).
Journal recovery subsystem Facilitates journal recovery from buddy on primary journal loss.
Buddy map interface Kernel interface for buddy map.
Mirroring subsystem Mirrors global and local transactions.
JGN Journal Generation Number, to identify versions and verify if two copies of a primary journal are consistent.
JGN interface Journal Generation Number interface to update/read JGN.
NSB Node state block, which stores JGN.
SB Journal Superblock.
SyncForward Mechanism to sync an out-of-date buddy journal with missed primary journal content additions & deletions.
SyncBack Mechanism to reconstitute a blown primary journal from the mirrored information stored in the buddy journal.

These components are organized into the following hierarchy and flow, split across kernel and user space:

A node’s primary journal is co-located with the data drives that it will flush to. In contrast, the buddy journal lives on a remote node and stores sufficient information about transactions on the primary, to allow it to restore the contents of a primary node’s journal in the event of its failure.

SyncForward is the mechanism by which an out-of-date Buddy journal is caught up with any Primary journal transactions that it might have missed, while SyncBack, or restore, allows a blown Primary journal to be reconstituted from the mirroring information stored in its Buddy journal.

SJM needs to be able to rapidly detect a number of failure scenarios and decide which recovery workflow to initiate. For example, on a blown primary journal, SJM must quickly determine whether the Buddy’s contents are complete enough to allow a SyncBack to fully reconstruct a valid Primary journal, or whether to resort to a more costly node rebuild instead. Or, if the Buddy node disconnects briefly, SJM must determine which of a Primary journal’s changes should be replicated during a SyncForward in order to bring the Buddy efficiently back into alignment.

SJM tags the transactions logged into the Primary journal, and their corresponding mirrors in the Buddy, with a monotonically increasing Journal Generation Number, or JGN.

The JGN represents the most recent & consistent copy of a primary node’s journal, and it’s incremented whenever the write status of the Buddy journal changes, which is tracked by the Primary via OneFS GMP group change updates.

In order to determine whether the Buddy journal’s contents are complete, the JGN needs to be available to the primary node when its primary journal is blown. So the JGN is stored in a Node State Block, or NSB, and saved on a quorum of the node’s data-drives. Therefore, upon loss of a Primary journal, the JGN in the node state block can be compared against the JGN in the Buddy to confirm its transaction mirroring is complete, before the SyncBack workflow is initiated.

A primary transaction exists on the node where data storage is being modified, and the corresponding buddy transaction is a hot, redundant duplicate of the primary information on a separate node. The SDPM journal storage on the F-series platforms is fast, and the pipe between nodes across the backend network is optimized for low-latency bulk data flow. This allows the standard POSIX file model to operate transparently over the front-end protocols, which remain blissfully unaware of any journal jockeying that’s occurring behind the scenes.

The journal mirroring activity is continuous, and if the Primary loses contact with its Buddy, it will urgently seek out another Buddy and repeat the mirroring for each active transaction, in order to regain a fully mirrored journal configuration. If the reverse happens, and the Primary briefly vanishes due to an adverse event such as a local power loss or an unexpected reboot, upon return it reattaches to its designated Buddy and ensures that its own journal is consistent with the transactions that the Buddy has kept safely mirrored. This means that the buddy must reside on a different node than the primary. As such, it’s normal and expected for each primary node to also be operating as the buddy for a different node.

The prerequisite platform requirements for SJM support in 9.11, referred to as ‘SJM-capable’ nodes, are as follows:

Essentially, any F710 or F910 nodes with 61TB or 122TB SSDs that shipped with OneFS 9.10 or later installed are considered SJM-capable.

Note that there are a small number of F710 and F910 nodes with 61TB drives in the field that shipped with OneFS 9.9 or earlier installed. These nodes must be re-imaged before they can use SJM: they first need to be SmartFailed out, then USB-reimaged to OneFS 9.10 or later. This allows the node’s SDPM journal device to be reformatted to include a second partition for the 16 GiB buddy journal allocation. However, this 16 GiB of space reserved for the buddy journal will not be used when SJM is disabled. The following table shows the maximum SDPM usage per journal type based on SJM enablement:

Journal State Primary journal Buddy journal
SJM enabled 16 GiB 16 GiB
SJM disabled 16 GiB 0 GiB

But to reiterate, the SJM-capable platforms which will ship with OneFS 9.11 installed, or those that shipped with OneFS 9.10, are ready to run SJM, and will form node pools of equivalent type.

While SJM is available upon upgrade commit to OneFS 9.11, it is not automatically activated. So for any F710 or F910 nodes with large QLC drives that were originally shipped with OneFS 9.10 installed, the cluster admin will need to manually enable SJM on any capable pools after their upgrade to 9.11.

Plus, if SJM is not activated, a CELOG alert will be raised, encouraging the customer to enable it, in order for the cluster to meet the reliability requirements. This CELOG alert will contain information about the administrative actions required to enable SJM.

Additionally, a pre-upgrade check is included in OneFS 9.11 to prevent any existing cluster with nodes containing 61TB drives that shipped with OneFS 9.9 or older installed from upgrading directly to 9.11 – until these nodes have been USB-reimaged and their journals reformatted.

OneFS and Software Journal Mirroring

OneFS 9.11 sees the addition of a Software journal mirroring capability, which adds critical file system support to meet the reliability requirements for platforms with high capacity drives.

But first, a quick journal refresher… OneFS uses journaling to ensure consistency across both disks locally within a node and disks across nodes. As such, the journal is among the most critical components of a PowerScale node. When OneFS writes to a drive, the data goes straight to the journal, allowing for a fast reply.

Block writes go to the journal first, and a transaction must be marked as ‘committed’ in the journal before a ‘success’ status is returned to the file system operation.

Once a transaction is committed the change is guaranteed to be stable. If the node crashes or loses power, changes can still be applied from the journal at mount time via a ‘replay’ process. The journal uses a battery-backed persistent storage medium in order to be available after a catastrophic node event, and must also be:

Journal Performance Characteristic Description
High throughput All blocks (and therefore all data) pass through the journal, so it must never become a bottleneck.
Low latency Transaction state changes are often in the latency path multiple times for a single operation, particularly for distributed transactions.

The OneFS journal mostly operates at the physical level, storing changes to physical blocks on the local node. This is necessary because all initiators in OneFS have a physical view of the file system – and therefore issue physical read and write requests to remote nodes. The OneFS journal supports both 512-byte and 8KiB block sizes, for storing written inodes and blocks respectively.

By design, the contents of a node’s journal are only needed in a catastrophe, such as when memory state is lost. For fast access during normal operation, the journal is mirrored in RAM. Thus, any reads come from RAM and the physical journal itself is write-only in normal operation. The journal contents are read at mount time for replay. In addition to providing fast stable writes, the journal also improves performance by serving as a write-back cache for disks. When a transaction is committed, the blocks are not immediately written to disk. Instead, it is delayed until the space is needed. This allows the I/O scheduler to perform write optimizations such as reordering and clustering blocks. This also allows some writes to be elided when another write to the same block occurs quickly, or the write is otherwise unnecessary, such as when the block is freed.

So the OneFS journal provides the initial stable storage for all writes and does not release a block until it is guaranteed to be stable on a drive. This process involves multiple steps and spans both the file system and operating system. The high-level flow is as follows:

Step Operation Description
1 Transaction prep A block is written on a transaction, for example a write_block message is received by a node. An asynchronous write is started to the journal. The transaction preparation step will wait until all writes on the transaction complete.
2 Journal delayed write The transaction is committed. Now the journal issues a delayed write. This simply marks the buffer as dirty.
3 Buffer monitoring A daemon monitors the number of dirty buffers and issues the write to the drive upon reaching its threshold.
4 Write completion notification The journal receives an upcall indicating that the write is complete.
5 Threshold reached Once journal space runs low or an idle timeout expires, the journal issues a cache flush to the drive to ensure the write is stable.
6 Flush to disk When cache flush completes, all writes completed before the cache flush are known stable. The journal frees the space.

The PowerScale F-series platforms use Dell’s VOSS M.2 SSD as the non-volatile device for their software-defined persistent memory (SDPM) journal vault. The SDPM itself comprises two main elements:

Component Description
BBU The BBU pack (battery backup unit) supplies temporary power to the CPUs and memory allowing them to perform a backup in the event of a power loss.
Vault A 32GB M.2 NVMe to which the system memory is vaulted.

While the BBU is self-contained, the M.2 NVMe vault is housed within a VOSS module, and both components are easily replaced if necessary.

The current focus of software journal mirroring (SJM) is the all-flash F710 and F910 nodes that contain either the 61TB QLC SSDs or the soon-to-be-available 122TB drives. In these cases, SJM dramatically improves the reliability of these dense drive platforms. But first, some context regarding journal failure and its relation to node rebuild times, durability, and protection overhead.

Typically, a node needs to be rebuilt when its journal fails, for example if it loses its data, or if the journal device develops a fault and needs to be replaced. To accomplish this, the OneFS SmartFail operation has historically been the tool of choice to restripe the data away from the node. But the time to completion for this operation depends on the restripe rate and the amount of storage. The gist is that the denser the drives, the more storage is on the node, and the more work SmartFail has to perform.

And if restriping takes longer, the window during which the data is under-protected also increases. This directly affects reliability, by reducing the mean time to data loss, or MTTDL. PowerScale has an MTTDL target of 5,000 years for any given cluster size. The 61TB QLC SSDs represent an inflection point for OneFS restriping, where, due to their lengthy rebuild times, reliability – and specifically MTTDL – becomes significantly impacted.

So the options in a nutshell for these dense drive nodes, are either to:

  1. Increase the protection overhead, or:
  2. Improve a node’s resilience and, by virtue, reduce its failure rate.

Increasing the protection level is clearly undesirable, because the additional overhead reduces usable capacity and hence the storage efficiency – thereby increasing the per-terabyte cost, as well as reducing rack density and energy efficiency.

Which leaves option 2: Reducing the node failure rate itself, which the new SJM functionality in 9.11 achieves by adding journal redundancy.

So, by keeping a synchronized and consistent copy of the journal on another node, and automatically recovering the journal from it upon failure, enabling SJM can reduce the node failure rate by around three orders of magnitude – while removing the need for a punitively high protection level on platforms with large-capacity drives.

SJM is enabled by default for the applicable platforms on new clusters. So for clusters including F710 or F910 nodes with large QLC drives that ship with 9.11 installed, SJM will be automatically activated.

SJM adds a mirroring scheme, which provides the redundancy for the journal’s contents. This is where /ifs updates are sent to a node’s local, or primary, journal as usual. But they’re also synchronously replicated, or mirrored, to another node’s journal, too – referred to as the ‘buddy’.

This is somewhat analogous to how the PowerScale H and A-series chassis-based node pairing operates, albeit implemented in software and over the backend network, and with no fixed buddy assignment – rather than over a dedicated PCIe non-transparent bridge link to a dedicated partner node, as in the case of the chassis-based platforms.

Every node in an SJM-enabled pool is dynamically assigned a buddy node. And similarly, if a new SJM-capable node is added to the cluster, it’s automatically paired up with a buddy. These buddies are unique for every node in the cluster.

SJM’s automatic recovery scheme can use a buddy journal’s contents to re-form the primary node’s journal. And this recovery mechanism can also be applied manually if a journal device needs to be physically replaced.

A node’s primary journal lives within that node, next to its storage drives. In contrast, the buddy journal lives on a remote node and stores sufficient information about transactions on the primary, to allow it to restore the contents of a primary node’s journal in the event of its failure.

SyncForward is the process that enables a stale Buddy journal to reconcile with the Primary and any transactions that it might have missed. Whereas SyncBack, or restore, allows a blown Primary journal to be reconstructed from the mirroring information stored in its Buddy journal.

The next blog article in this series will dig into SJM’s architecture and management in a bit more depth.

PowerScale InsightIQ 6.0

It’s been an active April for PowerScale already. Close on the tail of the OneFS 9.11 launch comes the unveiling of the new, innovative PowerScale InsightIQ 6.0 release.

InsightIQ provides powerful performance monitoring and reporting functionality, helping to maximize PowerScale cluster performance and efficiency. This includes advanced analytics to optimize applications, correlate cluster events, and accurately forecast future storage needs.

So what new treats does this InsightIQ 6.0 release bring to the table?

Added functionality includes:

  • Greater scale
  • Expanded ecosystem support
  • Enhanced reporting efficiency
  • Streamlined upgrade and migration

InsightIQ 6.0 continues to offer the same two deployment models as its 5.x predecessors:

Deployment Model Description
InsightIQ Scale Resides on bare-metal Linux hardware or virtual machine.
InsightIQ Simple Deploys on a VMware hypervisor.

The InsightIQ Scale version resides on bare-metal Linux hardware or virtual machine, whereas InsightIQ Simple deploys via OVA on a VMware hypervisor.

In v6.0, InsightIQ Scale enjoys a substantial boost in its breadth-of-monitoring scope and can now encompass up to 20 clusters or 504 nodes.

Additionally, with this new 6.0 version, InsightIQ Scale can now be deployed on a single Linux host. This is in stark contrast to InsightIQ 5’s requirement for a minimum of three Linux nodes as its installation platform.

Deployment:

The deployment options and hardware requirements for installing and running InsightIQ 6.0 are as follows:

Attribute InsightIQ 6.0 Simple InsightIQ 6.0 Scale
Scalability Up to 10 clusters or 252 nodes Up to 20 clusters or 504 nodes
Deployment On VMware, using OVA template RHEL or SLES with deployment script
Hardware requirements VMware v15 or higher: 8 vCPU, 16GB memory, 1.5TB storage (thin provisioned) or 500GB on an NFS server datastore Up to 10 clusters and 252 nodes: 8 vCPU or cores, 16GB memory, 500GB storage; up to 20 clusters and 504 nodes: 12 vCPU or cores, 32GB memory, 1TB storage
Networking requirements 1 static IP on the PowerScale cluster’s subnet 1 static IP on the PowerScale cluster’s subnet

Ecosystem support:

The InsightIQ ecosystem itself is also expanded in version 6.0 to include SUSE Linux Enterprise Server 15 SP4, in addition to Red Hat Enterprise Linux (RHEL) versions 8.10 and 9.4, and RHOSP 17. This allows customers who have standardized on SUSE to run an InsightIQ 6.0 Scale deployment on an SLES 15 host to monitor the latest OneFS versions.

Qualified on InsightIQ 5.2 InsightIQ 6.0
OS (IIQ Scale Deployment) RHEL 8.10 and RHEL 9.4 RHEL 8.10, RHEL 9.4, RHOSP 17, and SLES 15 SP4
PowerScale OneFS 9.3 to 9.10 OneFS 9.4 to 9.11
VMware ESXi ESXi v7.0U3 and ESXi v8.0U3 ESXi v7.0U3 and ESXi v8.0U3
VMware Workstation Workstation 17 Free Workstation 17 Free Version

Similarly, in addition to deployment on VMware ESXi 7 & 8, the InsightIQ Simple version can also be installed for free on VMware Workstation 17, providing the ability to stand up InsightIQ in a non-production or lab environment for trial or demo purposes, without incurring a VMware charge. Plus, the InsightIQ 6.0 OVA template has now been reduced in size to under 5GB, with an installation time of less than 12 minutes.

Online Upgrade

The prerequisites for upgrading to InsightIQ 6.0 are either a Simple or Scale deployment with InsightIQ v5.1.x or v5.2.x installed and running. Additionally, the free disk space must exceed 50% of the allocated capacity.

The upgrade in 6.0 is a five-step process:

First, the installer checks the current InsightIQ version, verifies there’s sufficient free disk space, and confirms that setup is ready. Next, IIQ is halted and its dependencies installed, followed by the installation of the new 6.0 infrastructure and the migration of the legacy InsightIQ 5.x configuration and historical report data to the new platform. Finally, the cleanup phase removes the old configuration files, and InsightIQ 6.0 is ready to go.

Phase Description
Pre-check Check IIQ version; verify free disk space; confirm setup is ready
Pre-upgrade Stop IIQ and install dependencies
Install and Migrate Install IIQ 6.0 infrastructure and migrate IIQ Data
Post-upgrade Migrate historical report data
Cleanup Remove old configuration files

During the upgrade of an InsightIQ Scale deployment to v6.0, the 3-node setup will be converted to a 1-node configuration. After a successful upgrade, InsightIQ will be accessible via the primary node’s IP address.

Offline Migration

The offline migration functionality in this new release facilitates the transfer of data and configuration context from InsightIQ version 4.4.1 to version 6.0. This includes support for both InsightIQ Simple and InsightIQ Scale deployments.

Additionally, the process has been streamlined in InsightIQ 6.0 so that only a single ‘iiq_data_migrations.sh’ script needs to be run to complete the migration. For example:
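Assuming the script is run from its installed directory with no additional arguments (the exact invocation may vary by environment), it would look something like:

# sh iiq_data_migrations.sh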

This is in contrast to prior IIQ releases, where separate import and export utilities were required for the migration process. Detailed migration logs are also provided in InsightIQ 6.0, located at /usr/share/storagemonitoring/logs/offline_migration/insightiq_offline_migration.log.

Durable Data Collection

Decoupled data collection and processing in IIQ 6.0 delivers gains in both performance and fault tolerance. Under the hood, InsightIQ 6.0 sees an updated architecture with the introduction of the following new components:

Component Role
Data Processor Responsible for processing and storing the data in TimescaleDB for display by Reporting service.
Temporary Datastore Stores historical statistics fetched from PowerScale cluster, in-between collection and processing.
Message Broker Facilitates inter-service communication. With the separation of data collection and data processing, this allows both services to signal to each other when their respective roles come up.
Timescale DB New database storage for the time-series data. Designed for optimized handling of historical statistics.

Telemetry Down-sampling

InsightIQ 6.0’s new TimescaleDB database now permits the storage of long-term historical data via an enhanced retention strategy:

Unlike prior InsightIQ releases, which used two data formats, v6.0 telemetry summary data is now stored in the following cascading levels, each with a different data retention period:

Level Sample Length Data Retention Period
Raw table Varies by metric type. Raw data sample lengths range from 30s to 5m. 24 hours
5m summary 5 minutes 7 days
15m summary 15 minutes 4 weeks
3h summary 3 hours Infinite

Note that the actual raw sample length may vary by graph/data type – from 30 seconds for CPU % Usage data up to 5 minutes for cluster capacity metrics.

Meanwhile, the new InsightIQ v6.0 code is available for download on the Dell Support site, allowing both the installation of and upgrade to this new release.

OneFS and Dell Technologies Connectivity Services Management and Troubleshooting

In this final article in the Dell Technologies Connectivity Services (DTCS) for OneFS series, we turn our attention to management and troubleshooting.

Once the provisioning process above is complete, the ‘isi connectivity settings view’ CLI command reports the status and health of DTCS operations on the cluster.

# isi connectivity settings view

        Service enabled: Yes

       Connection State: enabled

      OneFS Software ID: xxxxxxxxxx

          Network Pools: subnet0:pool0

        Connection mode: direct

           Gateway host: -

           Gateway port: -

    Backup Gateway host: -

    Backup Gateway port: -

  Enable Remote Support: Yes

Automatic Case Creation: Yes

       Download enabled: Yes

This can also be obtained from the WebUI by navigating to Cluster management > General settings > Connectivity services:

There are some caveats and considerations to keep in mind when upgrading to OneFS 9.10 or later and enabling DTCS, including:

  • DTCS is disabled when STIG hardening is applied to the cluster
  • Using DTCS on a hardened cluster is not supported
  • Clusters with the OneFS network firewall enabled (‘isi network firewall settings’) may need to allow outbound traffic on port 9443.
  • DTCS is supported on a cluster that’s running in Compliance mode
  • Secure keys are held in Key manager under the RICE domain

Also, note that ESRS can no longer be used after DTCS has been provisioned on a cluster.

DTCS has a variety of components that gather and transmit various pieces of OneFS data and telemetry to Dell Support and backend services through the Embedded Service Enabler (ESE). These workflows include CELOG events; In-product activation (IPA) information; CloudIQ telemetry data; Isi-Gather-info (IGI) logsets; and provisioning, configuration, and authentication data to ESE and the various backend services.

Activity Information
Events and alerts DTCS can be configured to send CELOG events.
Diagnostics The OneFS isi diagnostics gather and isi_gather_info logfile collation and transmission commands have a DTCS option.
Healthchecks HealthCheck definitions are updated using DTCS.
License Activation The isi license activation start command uses DTCS to connect.
Remote Support Remote Support uses DTCS and the Connectivity Hub to assist customers with their clusters.
Telemetry CloudIQ telemetry data is sent using DTCS.

CELOG

Once DTCS is up and running, it can be configured to send CELOG events and attachments via ESE to CLM. This can be managed by the ‘isi event channels’ CLI command syntax. For example:

# isi event channels list

ID   Name                                    Type         Enabled

------------------------------------------------------------------

2    Heartbeat Self-Test                     heartbeat    Yes

3    Dell Technologies connectivity services connectivity No

------------------------------------------------------------------

Total: 2

# isi event channels view "Dell Technologies connectivity services"

     ID: 3

   Name: Dell Technologies connectivity services

   Type: connectivity

Enabled: No
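To enable the DTCS channel from the CLI, syntax along the following lines can be used (the ‘--enabled’ flag is assumed here, so verify against the CLI help for the release in use):

# isi event channels modify "Dell Technologies connectivity services" --enabled true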

Or from the WebUI:

CloudIQ Telemetry

DTCS provides an option to send telemetry data to CloudIQ. This can be enabled from the CLI as follows:

# isi connectivity telemetry modify --telemetry-enabled 1 --telemetry-persist 0

# isi connectivity telemetry view

        Telemetry Enabled: Yes

        Telemetry Persist: No

        Telemetry Threads: 8

Offline Collection Period: 7200

Or via the DTCS WebUI:

Diagnostics Gather

Also, the ‘isi diagnostics gather’ and isi_gather_info CLI commands both now include a ‘--connectivity’ upload option for log gathers, which also allows them to continue to function when the cluster is unhealthy via a new ‘Emergency mode’. For example, to start a gather from the CLI that will be uploaded via DTCS:

# isi diagnostics gather start --connectivity 1

Similarly, for ISI gather info:

# isi_gather_info --connectivity

Or to explicitly avoid using DTCS for ISI gather info log gather upload:

# isi_gather_info --noconnectivity

This can also be configured from the WebUI via Cluster management > General configuration > Diagnostics > Gather:

License Activation through DTCS

PowerScale License Activation (previously known as In-Product Activation) facilitates the management of the cluster’s entitlements and licenses by communicating directly with Software Licensing Central via DTCS. Licenses can either be activated automatically or manually.

The procedure for automatic activation includes:

Step 1: Connect to Dell Technologies Connectivity Services

Step 2: Get a License Activation Code

Step 3: Select modules and activate

Similarly, for manual activation:

Step 1: Download the Activation file

Step 2: Get Signed License from Dell Software Licensing Central

Step 3: Upload Signed License

To activate OneFS product licenses through the DTCS WebUI, navigate to Cluster management > Licensing. For example, on a new cluster without any signed licenses:

Click the button Update & Refresh in the License Activation section. In the ‘Activation File Wizard’, select the desired software modules.

Next select ‘Review changes’, review, click ‘Proceed’, and finally ‘Activate’.

Note that it can take up to 24 hours for the activation to occur.

Alternatively, cluster License activation codes (LAC) can also be added manually.

Troubleshooting

When it comes to troubleshooting DTCS, the basic process flow is as follows:

The OneFS components and services above are:

Component Info
ESE Embedded Service Enabler.
isi_rice_d Remote Information Connectivity Engine (RICE).
isi_crispies_d Coordinator for RICE Incidental Service Peripherals including ESE Start.
Gconfig OneFS centralized configuration infrastructure.
MCP Master Control Program – starts, monitors, and restarts OneFS services.
Tardis Configuration service and database.
Transaction journal Task manager for RICE.

Of these, ESE, isi_crispies_d, isi_rice_d, and the Transaction Journal are exclusive to DTCS and its predecessor, SupportAssist. In contrast, Gconfig, MCP, and Tardis are all legacy services that are used by multiple other OneFS components.

For its connectivity, DTCS elects a single leader node within the subnet pool, and NANON nodes are automatically avoided. Ports 443 and 8443 are required to be open for bi-directional communication between the cluster and Connectivity Hub, and port 9443 is used for communicating with a gateway. The DTCS ESE component communicates with a number of Dell backend services:

  • SRS
  • Connectivity Hub
  • CLM
  • ELMS/Licensing
  • SDR
  • Lightning
  • Log Processor
  • CloudIQ
  • ESE

Debugging backend issues may involve one or more services, and Dell Support can assist with this process.

The main log files for investigating and troubleshooting DTCS issues and idiosyncrasies are isi_rice_d.log and isi_crispies_d.log. There is also an ESE log, which can be useful too. These can be found at:

Component Logfile Location Info
Rice /var/log/isi_rice_d.log Per node
Crispies /var/log/isi_crispies_d.log Per node
ESE /ifs/.ifsvar/ese/var/log/ESE.log Cluster-wide for the single-instance ESE
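For live troubleshooting on a given node, the Rice log can be followed directly. For example:

# tail -f /var/log/isi_rice_d.log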

Debug level logging can be configured from the CLI as follows:

# isi_for_array isi_ilog -a isi_crispies_d --level=debug+

# isi_for_array isi_ilog -a isi_rice_d --level=debug+

Note that the OneFS log gathers (such as the output from the isi_gather_info utility) will capture all the above log files, plus the pertinent DTCS Gconfig contexts and Tardis namespaces, for later analysis.

If needed, the Rice and ESE configurations can also be viewed as follows:

# isi_gconfig -t ese

[root] {version:1}

ese.mode (char*) = direct

ese.connection_state (char*) = disabled

ese.enable_remote_support (bool) = true

ese.automatic_case_creation (bool) = true

ese.event_muted (bool) = false

ese.primary_contact.first_name (char*) =

ese.primary_contact.last_name (char*) =

ese.primary_contact.email (char*) =

ese.primary_contact.phone (char*) =

ese.primary_contact.language (char*) =

ese.secondary_contact.first_name (char*) =

ese.secondary_contact.last_name (char*) =

ese.secondary_contact.email (char*) =

ese.secondary_contact.phone (char*) =

ese.secondary_contact.language (char*) =

(empty dir ese.gateway_endpoints)

ese.defaultBackendType (char*) = srs

ese.ipAddress (char*) = 127.0.0.1

ese.useSSL (bool) = true

ese.srsPrefix (char*) = /esrs/{version}/devices

ese.directEndpointsUseProxy (bool) = false

ese.enableDataItemApi (bool) = true

ese.usingBuiltinConfig (bool) = false

ese.productFrontendPrefix (char*) = platform/16/connectivity

ese.productFrontendType (char*) = webrest

ese.contractVersion (char*) = 1.0

ese.systemMode (char*) = normal

ese.srsTransferType (char*) = ISILON-GW

ese.targetEnvironment (char*) = PROD

And for ‘rice’:

# isi_gconfig -t rice

[root] {version:1}

rice.enabled (bool) = false

rice.ese_provisioned (bool) = false

rice.hardware_key_present (bool) = false

rice.connectivity_dismissed (bool) = false

rice.eligible_lnns (char*) = []

rice.instance_swid (char*) =

rice.task_prune_interval (int) = 86400

rice.last_task_prune_time (uint) = 0

rice.event_prune_max_items (int) = 100

rice.event_prune_days_to_keep (int) = 30

rice.jnl_tasks_prune_max_items (int) = 100

rice.jnl_tasks_prune_days_to_keep (int) = 30

rice.config_reserved_workers (int) = 1

rice.event_reserved_workers (int) = 1

rice.telemetry_reserved_workers (int) = 1

rice.license_reserved_workers (int) = 1

rice.log_reserved_workers (int) = 1

rice.download_reserved_workers (int) = 1

rice.misc_task_workers (int) = 3

rice.accepted_terms (bool) = false

(empty dir rice.network_pools)

rice.telemetry_enabled (bool) = true

rice.telemetry_persist (bool) = false

rice.telemetry_threads (uint) = 8

rice.enable_download (bool) = true

rice.init_performed (bool) = false

rice.ese_disconnect_alert_timeout (int) = 14400

rice.offline_collection_period (uint) = 7200

The ‘-q’ flag can also be used in conjunction with the isi_gconfig command to identify any values that are not at their default settings. For example, the stock (default) Rice gconfig context will not report any configuration entries:

# isi_gconfig -q -t rice

[root] {version:1}

OneFS and Provisioning Dell Technologies Connectivity Services – Part 2

In the previous article in this Dell Technologies Connectivity Services (DTCS) for OneFS Support series, we reviewed the off-cluster prerequisites for enabling DTCS on a PowerScale cluster:

  1. Upgrading the cluster to OneFS 9.10 or later.
  2. Obtaining the secure access key and PIN.
  3. Selecting either direct connectivity or gateway connectivity.
  4. If using gateway connectivity, installing Secure Connect Gateway v5.x.

In this article, we turn our attention to step 5 – provisioning Dell Technologies Connectivity Services (DTCS) on the cluster.

Note that, as part of this process, we’ll be using the access key and PIN credentials previously obtained from the Dell Support portal in step 2 above.

Provisioning DTCS on a cluster

DTCS can be configured from the OneFS 9.10 WebUI by navigating to ‘Cluster management > General settings > DTCS’.

When unconfigured, the Connectivity Services WebUI page also displays verbiage recommending the adoption of DTCS:

  1. Accepting the telemetry notice.

Selecting the ‘Connect Now’ button initiates the following setup wizard. The first step requires checking and accepting the Infrastructure Telemetry Notice:

  2. Support Contract.

For the next step, enter the details for the primary support contact, as prompted:

Or from the CLI using the ‘isi connectivity contacts’ command set. For example:

# isi connectivity contacts modify --primary-first-name=Nick --primary-last-name=Trimbee --primary-email=trimbn@isilon.com
  3. Establish Connections.

Next, complete the ‘Establish Connections’ page.

This involves the following steps:

  • Selecting the network pool(s).
  • Adding the secure access key and PIN.
  • Configuring either direct or gateway access.
  • Selecting whether to allow remote support, CloudIQ telemetry, and auto case creation.

a. Select network pool(s).

At least one statically-allocated IPv4 or IPv6 network subnet and pool is required for provisioning DTCS.

Select one or more network pools or subnets from the options displayed. For example, in this case ‘subnet0:pool0’:

Or from the CLI:

Select one or more static subnet/pools for outbound communication. This can be performed via the following CLI syntax:

# isi connectivity settings modify --network-pools="subnet0.pool0"
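To confirm that the pool assignment took effect:

# isi connectivity settings view | grep -i "network pools"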

Additionally, if the cluster has the OneFS network firewall enabled (‘isi network firewall settings’), ensure that outbound traffic is allowed on port 9443.

b.  Add secure access key and PIN.

In this next step, add the secure access key and PIN. These should have been obtained in an earlier step of the provisioning procedure from the following Dell Support site: https://www.dell.com/support/connectivity/product/isilon-onefs:

Alternatively, if configuring DTCS via the OneFS CLI, add the key and pin via the following syntax:

# isi connectivity provision start --access-key <key> --pin <pin>

c.  Configure access.

i. Direct access.

Or from the CLI. For example, to configure direct access (the default), ensure the following parameter is set:

# isi connectivity settings modify --connection-mode direct

# isi connectivity settings view | grep -i "connection mode"

Connection mode: direct

ii.  Gateway access.

Alternatively, to connect via a gateway, check the ‘Connect via Secure Connect Gateway’ button:

Complete the ‘gateway host’ and ‘gateway port’ fields as appropriate for the environment.

Alternatively, to set up a gateway configuration from the CLI, use the ‘isi connectivity settings modify’ syntax. For example, to configure using the gateway FQDN ‘secure-connect-gateway.yourdomain.com’ and the default port ‘9443’:

# isi connectivity settings modify --connection-mode gateway

# isi connectivity settings view | grep -i "connection mode"

Connection mode: gateway

# isi connectivity settings modify --gateway-host secure-connect-gateway.yourdomain.com --gateway-port 9443
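If a backup gateway is also deployed, it can likely be specified in similar fashion. Note that the following flag names are assumed from the corresponding ‘Backup Gateway host’ and ‘Backup Gateway port’ fields in the ‘isi connectivity settings view’ output, so verify them against the CLI help for the release in use:

# isi connectivity settings modify --backup-gateway-host secure-connect-gateway2.yourdomain.com --backup-gateway-port 9443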

When setting up the gateway connectivity option, Secure Connect Gateway v5.0 or later must be deployed within the data center. Note that DTCS is incompatible with either ESRS gateway v3.52 or SAE gateway v4. However, Secure Connect Gateway v5.x is backwards compatible with PowerScale OneFS ESRS and SupportAssist, which allows the gateway to be provisioned and configured ahead of a cluster upgrade to DTCS/OneFS 9.10.

d.  Configure support options.

Finally, configure the desired support options:

When complete, the WebUI will confirm that DTCS is successfully configured and enabled, as follows:

Or from the CLI:

# isi connectivity settings view

Service enabled: Yes

Connection State: enabled

OneFS Software ID: ELMISL0223BJJC

Network Pools: subnet0.pool0, subnet0.testpool1, subnet0.testpool2, subnet0.testpool3, subnet0.testpool4

Connection mode: gateway

Gateway host: eng-sea-scgv5stg3.west.isilon.com

Gateway port: 9443

Backup Gateway host: eng-sea-scgv5stg.west.isilon.com

Backup Gateway port: 9443

Enable Remote Support: Yes

Automatic Case Creation: Yes

Download enabled: Yes

Having worked through getting DTCS configured, up and running, in the next article in this series we’ll turn our attention to the management and troubleshooting of DTCS.

PowerScale OneFS 9.11

In the runup to next month’s Dell Technologies World 2025, PowerScale is bringing spring with the launch of the innovative OneFS 9.11 release, which shipped today (8th April 2025). This all-encompassing new 9.11 version offers PowerScale innovations in capacity, durability, replication, protocols, serviceability, and ease of use.

OneFS 9.11 delivers the latest version of PowerScale’s software platform for on-prem and cloud environments and workloads. This deployment flexibility can make it a solid fit for traditional file shares and home directories, vertical workloads like financial services, M&E, healthcare, life sciences, and next-gen AI, ML and analytics applications.

PowerScale’s scale-out architecture can be deployed on-site, in co-location facilities, or as customer-managed Amazon AWS and Microsoft Azure deployments, providing core to edge to cloud flexibility, plus the scale and performance needed to run a variety of unstructured workflows on-prem or in the public cloud.

With data security, detection, and monitoring being top of mind in this era of unprecedented cyber threats, OneFS 9.11 brings an array of new features and functionality to keep your unstructured data and workloads more available, manageable, and durable than ever.

Hardware Innovation

On the platform hardware front, OneFS 9.11 also unlocks dramatic capacity enhancements for the all-flash F710 and F910, which see the introduction of support for 122TB QLC SSDs.

Additionally, support is added in OneFS 9.11 for future H and A-series chassis-based hybrid platforms.

Software Journal Mirroring

In OneFS 9.11 a new software journal mirroring capability (SJM) is added for the PowerScale all-flash F710 and F910 platforms with 61 TB or larger QLC SSDs. For these dense drive nodes, software journal mirroring negates the need for higher FEC protection levels and their associated overhead.

With SJM, file system writes are sent to a node’s local journal as well as synchronously replicated, or mirrored, to a buddy node’s journal. In the event of a failure, SJM’s automatic recovery scheme can use a Buddy journal’s mirrored contents to re-form the Primary node’s journal, avoiding the need to SmartFail the node.

Protocols

The S3 object protocol enjoys conditional write and cluster status enhancements in OneFS 9.11. With conditional write support, the addition of an ‘if-none-match’ HTTP header for ‘PutObject’ or ‘CompleteMultipartUpload’ requests guards against overwriting of existing objects with identical key names.
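By way of illustration, a conditional write issued from a recent version of the AWS CLI (which exposes this header via an ‘--if-none-match’ option on ‘put-object’) might look like the following, where the bucket, key, and endpoint are hypothetical and the S3 port will depend on the cluster’s S3 configuration:

# aws s3api put-object --bucket mybucket --key data/file1.dat --body ./file1.dat --if-none-match '*' --endpoint-url https://cluster.example.com:9021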

For cluster reporting, capacity, health, and network status are exposed via new S3 endpoints. Status monitoring is predicated on a virtual bucket and object, and is reported via GETs on the virtual object to read the cluster status data. All other S3 calls to the virtual bucket and object are blocked, with a 405 error code returned.

Replication

In OneFS 9.11, SmartSync sees the addition of backup-to-object functionality. This includes a full-fidelity file system baseline plus fast incremental replication to ECS/ObjectScale, AWS S3, and AWS Glacier IR object stores. Support is provided for the full range of OneFS path lengths, encodings, and file sizes up to 16TB – plus special files and alternate data streams (ADS), symlinks and hardlinks, sparse regions, and POSIX and SMB attributes.

OneFS 9.11 also introduces the default enablement of temporary directory hashing on new SyncIQ replication policies, thereby improving target-side directory delete performance.

Support and Monitoring

For customers that are still using Dell’s legacy ESRS connectivity service, OneFS 9.11 also includes a seamless migration path to its replacement, Dell Technologies Connectivity Services (DTCS). To ensure all goes smoothly, a pre-check phase runs a migration checklist, which must pass in order for the operation to progress. Once underway, the prior ESRS and cluster identity settings are preserved and migrated, and finally a provisioning phase completes the transition to DTCS.

In summary, OneFS 9.11 brings the following new features and functionality to the Dell PowerScale ecosystem:

Feature Description
Networking Dynamic IP pools added to SmartConnect Basic
Platform Support for F-series nodes with 122TB QLC SSD drives
Protocol S3 cluster status API
Replication SmartSync File-to-Object
Support Seamless ESRS to DTCS migration
Reliability Software Journal Mirroring for high-capacity QLC SSD nodes

We’ll be taking a deeper look at the new OneFS 9.11 features and functionality in blog articles over the course of the next few weeks.

Meanwhile, the new OneFS 9.11 code is available on the Dell Support site, as both an upgrade and reimage file, allowing both installation and upgrade of this new release.

For existing clusters running a prior OneFS release, the recommendation is to open a Service Request with Dell Support to schedule an upgrade. To provide a consistent and positive upgrade experience, Dell Technologies is offering assisted upgrades to OneFS 9.11 at no cost to customers with a valid support contract. Please refer to this Knowledge Base article for additional information on how to initiate the upgrade process.

OneFS and Provisioning Dell Technologies Connectivity Services – Part 1

In OneFS 9.10, several OneFS components leverage Dell Technologies Connectivity Services (DTCS) as their secure off-cluster data retrieval and communication channel. These include:

Component Details
Events and Alerts DTCS can send CELOG events and attachments via ESE to CLM.
Diagnostics Logfile gathers can be uploaded to Dell via DTCS.
License activation License activation uses DTCS for the ‘isi license activation start’ CLI command.
Telemetry Telemetry is sent through DTCS to CloudIQ for analytics.
Health check Health check definition downloads leverage DTCS.
Remote Support Remote Support uses DTCS along with Connectivity Hub.

For existing clusters, DTCS supports the same basic workflows as its predecessors, ESRS and SupportAssist, so the transition from old to new is generally pretty seamless.

As such, the overall process for enabling DTCS in OneFS is as follows:

  1. Upgrade the cluster to OneFS 9.10.
  2. Obtain the secure access key and PIN.
  3. Select either direct connectivity or gateway connectivity.
  4. If using gateway connectivity, install Secure Connect Gateway (v5.0 or later).
  5. Provision DTCS on the cluster.

We’ll go through each of the configuration steps above in order:

  1. Install or upgrade to OneFS 9.10.

First, the cluster must be running OneFS 9.10 in order to configure DTCS.

There are some additional considerations and caveats to bear in mind when upgrading to OneFS 9.10 and planning on enabling DTCS. These include the following:

  • DTCS is disabled when STIG hardening is applied to the cluster
  • Using DTCS on a hardened cluster is not supported
  • Clusters with the OneFS network firewall enabled (‘isi network firewall settings’) may need to allow outbound traffic on ports 443 and 8443, plus 9443 if gateway (SCG) connectivity is configured.
  • DTCS is supported on a cluster that’s running in Compliance mode
  • If the cluster already has SupportAssist configured and running, the conversion to DTCS will occur automatically.
  • If upgrading from an earlier release on a cluster not running SupportAssist, the OneFS 9.10 upgrade must be committed before DTCS can be provisioned.

Also, ensure that the user account that will be used to enable DTCS belongs to a role with the ‘ISI_PRIV_REMOTE_SUPPORT’ read and write privilege. For example:

# isi auth privileges | grep REMOTE

ISI_PRIV_REMOTE_SUPPORT                        Configure remote support

For example, the ‘ese’ user account below:

# isi auth roles view ConnectivityServicesRole

Name: ConnectivityServicesRole

Description: -

Members: ese

Privileges

ID: ISI_PRIV_LOGIN_PAPI

Permission: r


ID: ISI_PRIV_REMOTE_SUPPORT

Permission: w
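If a suitable role does not already exist, one can be created and granted the privilege along the following lines. This is just a sketch: the ‘--add-user’ and ‘--add-priv’ flags are assumptions, and the exact privilege-assignment syntax may vary between OneFS releases, so check the ‘isi auth roles modify’ help text first.

# isi auth roles create ConnectivityServicesRole

# isi auth roles modify ConnectivityServicesRole --add-user ese --add-priv ISI_PRIV_REMOTE_SUPPORT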
  2. Obtaining secure access key and PIN.

An access key and pin are required in order to provision DTCS, and these secure keys are held in Key manager under the RICE domain. This access key and pin can be obtained from the Dell Support site:

In the Quick link navigation bar, select the ‘Generate Access key’ link:

On the following page, select the appropriate button:

The credentials required to obtain an access key and pin vary depending on prior cluster configuration. Sites that have previously provisioned ESRS will need their OneFS Software ID (SWID) to obtain their access key and pin.

The ‘isi license list’ CLI command can be used to determine a cluster’s SWID. For example:

# isi license list | grep "OneFS Software ID"

OneFS Software ID: ELMISL999CKKD

However, customers with new clusters, and/or those who have not previously provisioned ESRS or SupportAssist, will require their Site ID in order to obtain the access key and PIN.

Note that any new cluster hardware shipping after January 2023 will already have a built-in key, so this key can be used in place of the Site ID above.

For example, if this is the first time registering this cluster and it does not have a built-in key, select ‘Yes, let’s register’:

Enter the Site ID, site name, and location information for the cluster:

Choose a 4-digit PIN and save it for future reference. After that click the ‘Create My Access Key’ button:

Next, the access key is generated.

An automated email is sent from the Dell Services Connectivity Team containing the pertinent key info, including:

  • Access Key
  • Product ID
  • Site ID/UCID
  • Expiration Date

For example:

Note that this access key is valid for one week, after which it automatically expires.

Next, in the cluster’s WebUI, navigate back to Cluster management > General settings > Connectivity Services and complete the EULA:

Next, enter the access key and PIN information in the appropriate fields. Finally, click the ‘Finish Setup’ button to complete the DTCS provisioning process:

  3. Direct or gateway topology decision.

A topology decision will need to be made between implementing either direct connectivity or gateway connectivity, depending on the needs of the environment:

  • Direct Connect:

  • Gateway Connect:

DTCS uses ports 443 and 8443 by default for bi-directional communication between the cluster and Connectivity Hub. As such, these ports will need to be open across any firewalls or packet filters between the cluster and the corporate network edge to allow connectivity to Dell Support.

Additionally, port 9443 is used for communicating with a gateway (SCG).

# grep -i esrs /etc/services

isi_esrs_d      9443/tcp  #EMC Secure Remote Support outbound alerts
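Before provisioning, it can be worth confirming from a node’s shell that the relevant endpoints are actually reachable on these ports. For example, a quick, illustrative spot-check using the stock netcat utility, where <gateway-host> is a placeholder for the SCG address (for direct connectivity, substitute the appropriate Dell backend endpoint):

# nc -zw5 <gateway-host> 443

# nc -zw5 <gateway-host> 9443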
  4. Optional Secure Connect Gateway installation.

This step is only required when deploying a Secure Connect gateway. If a direct connect topology is desired, go directly to step 5 below.

When configuring DTCS with the gateway connectivity option, Secure Connect Gateway v5.0 or later must be deployed within the data center.

Dell Secure Connect Gateway (SCG) is available for Linux, Windows, Hyper-V, and VMware environments, and as of writing, the latest version is 5.28.00. The installation binaries can be downloaded from: https://www.dell.com/support/home/en-us/product-support/product/secure-connect-gateway/drivers

The procedure to download SCG is as follows:

  1. Sign in to https://www.dell.com/support/product-details/en-us/product/secure-connect-gateway-app-edition. The Secure Connect Gateway – Application Edition page is displayed. If you have issues signing in with your business account, or are unable to access the page even after signing in, contact Dell Administrative Support.
  2. In the Quick links section, click Generate Access key.
  3. On the Generate Access Key page, perform the following steps:
  4. Select a site ID, site name, or site location.
  5. Enter a four-digit PIN and click Generate key. An access key is generated and sent to your email address. NOTE: The access key and PIN must be used within seven days and cannot be used to register multiple instances of secure connect gateway.
  6. Click Done.
  7. On the Secure Connect Gateway – Application Edition page, click the Drivers & Downloads tab.
  8. Search and select the required version.
  9. In the ACTION column, click Download.

The following steps are required in order to set up SCG:

Pertinent resources for configuring and running SCG include:

  • SCG Deployment Guide

  • SCG User Guide

  • SCG Support Matrix, for supported devices, protocols, firmware versions, and operating systems.

Another useful source of SCG installation, configuration, and troubleshooting information is the Dell support forum: https://www.dell.com/community/Secure-Connect-Gateway/bd-p/SCG

  5. Provisioning DTCS on the cluster.

At this point, the off-cluster pre-staging work should be complete.

In the next article in this series, we turn our attention to the DTCS provisioning process on the cluster itself (step 5).

OneFS Dell Technologies Connectivity Services Architecture and Operation

In this article in the Dell Technologies Connectivity Services series, we’ll dig a little deeper and look at the OneFS DTCS architecture and operation.

OneFS Dell Technologies Connectivity Services relies on the following infrastructure and services:

Service | Name
ESE | Embedded Service Enabler.
isi_rice_d | Remote Information Connectivity Engine (RICE).
isi_crispies_d | Coordinator for RICE Incidental Service Peripherals including ESE Start.
Gconfig | OneFS centralized configuration infrastructure.
MCP | Master Control Program – starts, monitors, and restarts OneFS services.
Tardis | Configuration service and database.
Transaction journal | Task manager for RICE.

Of these, ESE, isi_crispies_d, isi_rice_d, and the Transaction Journal were introduced back in OneFS 9.5 and are exclusive to Dell Connectivity Services. In contrast, Gconfig, MCP, and Tardis are all legacy services that are employed by multiple other OneFS components.

The Remote Information Connectivity Engine (RICE) represents the Dell Connectivity Services ecosystem that allows OneFS to connect to the Dell support backend. At a high level, its architecture operates as follows:

Under the hood, the Embedded Service Enabler (ESE) is at the core of the connectivity platform and acts as a unified communications broker between the PowerScale cluster and Dell Support. ESE runs as a OneFS service and, upon startup, looks for an on-premises gateway server, such as Dell Connectivity Services Enterprise. If none is found, it connects back to the connectivity pipe (SRS). The collector service then interacts with ESE to send telemetry, obtain upgrade packages, transmit events and alerts, etc.

Depending on the available resources, ESE provides a base functionality with optional capabilities to enhance serviceability as appropriate. ESE is multithreaded, and each payload type is handled by different threads. For example, events are handled by event threads, and binary and structured payloads are handled by web threads. Within OneFS, ESE gets installed to /usr/local/ese and runs as the ese user in the ese group.

Networking-wise, Dell Connectivity Services provides full support for both IPv4 and IPv6. The responsibilities of isi_rice_d include listening for network changes, getting eligible nodes elected for communication, monitoring notifications from CRISPIES, and engaging the Task Manager when ESE is ready to go.

The Task Manager is a core component of the RICE engine. Its responsibility is to watch the incoming tasks that are placed into the journal and assign workers to step through each task’s state machine until completion. It controls resource utilization (Python threads) and distributes waiting tasks on a priority basis.

The ‘isi_crispies_d’ service exists to ensure that ESE is only running on the RICE active node, and nowhere else. It acts, in effect, like a specialized MCP just for ESE and RICE-associated services, such as IPA. This entails starting ESE on the RICE active node, re-starting it if it crashes on the RICE active node, and stopping it and restarting it on the appropriate node if the RICE active instance moves to another node. We are using ‘isi_crispies_d’ for this, and not MCP, because MCP does not support a service running on only one node at a time.

The core responsibilities of ‘isi_crispies_d’ include:

  • Starting and stopping ESE on the RICE active node
  • Monitoring ESE and restarting it if necessary. If ESE crashes on the RICE active node, ‘isi_crispies_d’ will restart it, retrying a couple of times and then notifying RICE if it is unable to start ESE.
  • Listening for gconfig changes and updating ESE. Stopping ESE if unable to make a change and notifying RICE.
  • Monitoring other related services.

The state of ESE, and of other RICE service peripherals, is stored in the OneFS tardis configuration database so that it can be checked by RICE. Similarly, ‘isi_crispies_d’ monitors the OneFS Tardis configuration database to see which node is designated as the RICE ‘active’ node.

The ‘isi_telemetry_d’ daemon is started by MCP and runs when Dell Connectivity Services is enabled. It does not have to be running on the same node as the active RICE and ESE instance. Only one instance of ‘isi_telemetry_d’ will be active at any time, while the other nodes wait for the lock.
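Since ESE only runs on the RICE active node, and does so under the ‘ese’ user account described above, a quick way to see where it is currently running is to sweep the cluster for processes owned by that account. For example, an illustrative check rather than a formal procedure:

# isi_for_array -s 'pgrep -lu ese'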

The current status and setup of Dell Connectivity Services on a PowerScale cluster can be queried via the ‘isi connectivity settings view’ CLI command. For example:

# isi connectivity settings view

        Service enabled: Yes

       Connection State: enabled

      OneFS Software ID: ELMISL08224764

          Network Pools: subnet0:pool0

        Connection mode: direct

           Gateway host: -

           Gateway port: -

    Backup Gateway host: -

    Backup Gateway port: -

  Enable Remote Support: Yes

Automatic Case Creation: Yes

       Download enabled: Yes

This can also be viewed from the WebUI under Cluster management > General settings > Connectivity Services:

Note that once a cluster is provisioned with Dell Connectivity Services, the legacy ESRS can no longer be used.
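Connectivity settings can also be adjusted via the corresponding ‘modify’ command in the same CLI namespace. For instance, switching a cluster from direct to gateway connectivity would look something like the following – a sketch only, where ‘gw.example.com’ is a placeholder and the flag names are assumed to mirror the fields in the ‘view’ output above, so verify them on-cluster before use:

# isi connectivity settings modify --connection-mode=gateway --gateway-host=gw.example.com --gateway-port=9443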

OneFS and Dell Technologies Connectivity Services

Within the plethora of new functionality in the OneFS 9.10 release payload lies support for Dell Technologies Connectivity Services, or DTCS – Dell’s remote connectivity system.

DTCS assists with quickly identifying, triaging, and fixing cluster issues, and boosts productivity by replacing manual routines with automated support. Its predictive issue detection and proactive remediation helps accelerate resolution – or avoid issues completely. Best of all, DTCS is included with all Dell PowerScale support plans, although the specific features available may vary based on the level of service contract.

Within OneFS, Dell Technologies Connectivity Services is intended for transmitting events, logs, and telemetry from PowerScale to Dell support. As such, it provides a full replacement for the legacy ESRS, as well as a rebranding of the former SupportAssist services.

Delivering a consistent remote support experience across the storage portfolio, DTCS is intended for all sites that can send telemetry off-cluster to Dell over the internet. Dell Connectivity Services integrates the Dell Embedded Service Enabler (ESE) into PowerScale OneFS along with a suite of daemons to allow its use on a distributed system.

Dell Technologies Connectivity Services (formerly SupportAssist) | ESRS
Dell’s next generation remote connectivity solution. | Being phased out of service.
Can either connect directly, or via supporting gateways. | Can only use gateways for remote connectivity.
Uses Connectivity Hub to coordinate support. | Uses ServiceLink to coordinate support.
Requires access key and PIN, or hardware key, to enable. | Uses customer username and password to enable.

Dell Technologies Connectivity Services uses Connectivity Hub and can either interact directly, or through a Secure Connect gateway.

DTCS comprises a variety of components that gather and transmit various pieces of OneFS data and telemetry to Dell Support, via the Embedded Service Enabler (ESE).  These workflows include CELOG events, In-product activation (IPA) information, CloudIQ telemetry data, Isi-Gather-info (IGI) log sets, and provisioning, configuration and authentication data to ESE and the various backend services.

Workflow | Details
CELOG | DTCS can be configured to send CELOG events and attachments via ESE to CLM. CELOG has a ‘Dell Connectivity Services’ channel that, when active, will create an EVENT task for DTCS to propagate.
License Activation | The ‘isi license activation start’ command uses DTCS to connect. Several pieces of PowerScale and OneFS functionality require licenses, which must be registered with the Dell backend services in order to activate them on the cluster. In OneFS 9.10, DTCS is the preferred mechanism for sending those license activations via the Embedded Service Enabler (ESE) to the Dell backend. License information can be generated via the ‘isi license generate’ CLI command, and then activated via the ‘isi license activation start’ syntax.
Provisioning | DTCS must register with the backend services in a process known as provisioning. This process must be executed before the Embedded Service Enabler (ESE) will respond on any of its other available API endpoints. Provisioning can only successfully occur once per installation, and subsequent provisioning tasks will fail. DTCS must be configured via the CLI or WebUI before provisioning. The provisioning process uses authentication information that was stored in the key manager upon the first boot.
Diagnostics | The OneFS ‘isi diagnostics gather’ and ‘isi_gather_info’ logfile collation and transmission commands have a ‘--connectivity’ option (see the example following this table).
Healthchecks | HealthCheck definitions are updated using DTCS.
Telemetry | CloudIQ telemetry data is sent using DTCS.
Remote Support | Remote Support uses DTCS and the Connectivity Hub to assist customers with their clusters.
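As a simple illustration of a couple of these workflows from the CLI – the ‘--connectivity’ flag is the option referenced in the Diagnostics row above, and the license commands are those named in the License Activation row, shown here in their basic form (additional arguments may be required depending on the activation method):

# isi license list

# isi license activation start

# isi_gather_info --connectivity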

DTCS requires an access key and PIN, or hardware key, in order to be enabled, with most customers likely using the access key and PIN method. These secure keys are held in Key manager under the RICE domain.

In addition to the transmission of data from the cluster to Dell, Connectivity Hub also allows inbound remote support sessions to be established for remote cluster troubleshooting.

In the next article in this series, we’ll take a deeper look at the Dell Technologies Connectivity Services architecture and operation.