PowerScale Platforms

In this article, we’ll take a quick peek at the new PowerScale F200 and F600 hardware platforms. For reference, here’s where these new nodes sit in the current hardware hierarchy:

The PowerScale F200 is an entry-level all-flash node that utilizes affordable SAS SSDs and a single-CPU 1U PowerEdge platform. Its performance and capacity profile makes it ideally suited for use cases such as remote office/branch office environments, factory floors, IoT, retail, smaller organizations, etc. The key advantages of the F200 are its low entry capacities and price points, plus the flexibility to add nodes individually, as opposed to the chassis/2 node minimum of the legacy Gen6 platforms.

The F200 contains four 3.5” drive bays populated with a choice of 960GB, 1.92TB, or 3.84TB enterprise SAS SSDs.

Inline data reduction, which incorporates compression, dedupe, and single instancing, is included as standard and requires no additional licensing.

Under the hood, the F200 node is based on the PowerEdge R640 server platform. Each node contains a single-socket Intel CPU, plus 10/25 GbE Front-End and Back-End networking.

Configurable memory options of 48GB or 96GB per node are available.

In contrast, the PowerScale F600 is a mid-level all-flash platform that utilizes NVMe SSDs and a dual-CPU 1U PowerEdge platform.  The ideal use cases for the F600 include performant workflows, such as M&E, EDA, HPC, and others, with some cost sensitivity and less demand for capacity.

The F600 contains eight 2.5” drive bays populated with a choice of 1.92TB, 3.84TB, or 7.68TB enterprise NVMe SSDs. Inline data reduction, which incorporates compression, dedupe, and single instancing, is also included as standard.

The F600 is also based on the 1U R640 PowerEdge server platform but, unlike the F200, with dual-socket Intel CPUs. Front-End networking options include 10/25 GbE or 40/100 GbE, with 100 GbE for the Back-End network.

Configurable memory options include 128GB, 192GB, or 384GB per node.

For Ethernet networking, the 10/40GbE environment uses SFP+ and QSFP+ cables and modules, whereas the 25/100GbE environment uses SFP28 and QSFP28 cables and modules. These cables are mechanically identical, and the 25/100GbE NICs and switches will automatically read cable types and adjust accordingly. However, be aware that the 10/40GbE NICs and switches will not recognize SFP28 cables.

The 40GbE and 100GbE connections are actually four lanes of 10GbE and 25GbE respectively, allowing switches to ‘breakout’ a QSFP port into four SFP ports. While this is automatic on the Dell back-end switches, some front-end switches may need to be configured manually.

The F200 has a single NIC configuration, with 10/25GbE for both the front-end and back-end. By comparison, the F600 nodes are available in two configurations, each with a 100GbE back-end and either a 25GbE or 100GbE front-end.

Here’s what the back-end NIC/Switch Support Matrix looks like for the PowerScale F200 and F600:

Drive subsystem-wise, the PowerEdge R640 platform’s bay numbering scheme starts with 0 instead of 1. On the F200, there are four SAS SSDs, numbered from 0 to 3.

The F600 has ten total bays, of which numbers 0 and 1 on the far left are unused. The eight NVMe SSDs therefore reside in bays 2 to 9.

Support for NVMe has been added to OneFS 9.0, alongside the legacy SCSI and ATA interfaces. Note that NVMe drives are currently supported only on the F600 nodes, and these drives use the NVMe and NVD drivers. NVD is a block device driver that exposes an NVMe namespace like a drive and is what most OneFS operations act upon, and each NVMe drive has a /dev/nvmeX, /dev/nvmeXnsX and /dev/nvdX device entry. From a drive management standpoint, the CLI and WebUI are largely unchanged. While NVMe has been added as a new drive type, the ‘isi devices’ CLI syntax stays the same and the locations remain as ‘bays’. Similarly, the CLI drive utilities such as ‘isi_radish’ and ‘isi_drivenum’ also operate the same, where applicable.
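On an F600 node, these device entries can be seen directly from the shell. The following listing is purely illustrative, showing a hypothetical first drive only (actual device numbering will vary):

# ls /dev/nvd* /dev/nvme*

/dev/nvd0    /dev/nvme0    /dev/nvme0ns1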

The F600 and F200 nodes’ front panel has limited functionality compared to older platform generations and will simply allow the user to join a node to a cluster and display the node name after the node has successfully joined the cluster.

Similar to legacy Gen6 platforms, a PowerScale node’s serial number can be found either by viewing /etc/isilon_serial_number or running the ‘isi_hw_status | grep SerNo’ CLI command syntax. The serial number reported by OneFS will match that of the service tag attached to the physical hardware and the /etc/isilon_system_config file will report the appropriate node type. For example:

# cat /etc/isilon_system_config

PowerScale F600
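Similarly, querying the node serial number from the CLI might look like the following (the serial number value shown here is purely illustrative):

# isi_hw_status | grep SerNo

SerNo: 7CXXXXX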

Introducing Dell EMC PowerScale…

Today we’re thrilled to launch Dell EMC PowerScale – a new unstructured data storage family centered  around OneFS 9.0.

This release represents a series of firsts for us:

  • Hardware-wise, we’ve delivered our first NVMe offering, the first nodes delivered in a compact 1RU form factor, the first of our platforms designed and built entirely on Dell Power-series hardware, and the first PowerScale branded products.

  • Software-wise, OneFS 9.0 introduces support for the AWS S3 API as a top-tier protocol – extending our data lake to natively include object, and enabling hybrid and cloud-native workloads that use S3-compatible backend storage, such as cloud backup & archive software, modern apps, analytics flows, IoT workloads, etc. These can all run on-prem, alongside and coexisting with traditional file-based workflows.

  • DataIQ’s tight integration with OneFS 9.0 enables seamless data discovery, understanding, and movement, delivering intelligent insights and holistic management.

  • CloudIQ harnesses the power of machine learning and AI to proactively mitigate issues before they become problems.

Over the course of the next few blog articles, we’ll explore the new platforms, features and functionality of the new PowerScale family in more depth…

OneFS Caching Hierarchy

Caching occurs in OneFS at multiple levels, and for a variety of types of data. For this discussion we’ll concentrate on the caching of file system structures in main memory and on SSD.

OneFS’ caching infrastructure design is based on aggregating each individual node’s cache into one cluster wide, globally accessible pool of memory. This is done by using an efficient messaging system, which allows all the nodes’ memory caches to be available to each and every node in the cluster.

For remote memory access, OneFS utilizes the Sockets Direct Protocol (SDP) over an Ethernet or Infiniband (IB) backend interconnect on the cluster. SDP provides an efficient, socket-like interface between nodes which, by using a switched star topology, ensures that remote memory addresses are only ever one hop away. While not as fast as local memory, remote memory access is still very fast due to the low latency of the backend network.

OneFS uses up to three levels of read cache, plus an NVRAM-backed write cache, or write coalescer. The first two types of read cache, level 1 (L1) and level 2 (L2), are memory (RAM) based, and analogous to the cache used in CPUs. These two cache layers are present in all Isilon storage nodes.  An optional third tier of read cache, called SmartFlash, or Level 3 cache (L3), is also configurable on nodes that contain solid state drives (SSDs). L3 cache is an eviction cache that is populated by L2 cache blocks as they are aged out from memory.

The OneFS caching subsystem is coherent across the cluster. This means that if the same content exists in the private caches of multiple nodes, this cached data is consistent across all instances. For example, consider the following scenario:

  1. Node 2 and Node 4 each have a copy of data located at an address in shared cache.
  2. Node 4, in response to a write request, invalidates node 2’s copy.
  3. Node 4 then updates the value.
  4. Node 2 must re-read the data from shared cache to get the updated value.

OneFS utilizes the MESI Protocol to maintain cache coherency, implementing an “invalidate-on-write” policy to ensure that all data is consistent across the entire shared cache. The various states that in-cache data can take are:

M – Modified: The data exists only in local cache, and has been changed from the value in shared cache. Modified data is referred to as ‘dirty’.

E – Exclusive: The data exists only in local cache, but matches what is in shared cache. This data is referred to as ‘clean’.

S – Shared: The data in local cache may also be in other local caches in the cluster.

I – Invalid: A lock (exclusive or shared) has been lost on the data.

L1 cache, or front-end cache, is memory that is nearest to the protocol layers (e.g. NFS, SMB, etc) used by clients, or initiators, connected to that node. The main task of L1 is to prefetch data from remote nodes. Data is pre-fetched per file, and this is optimized in order to reduce the latency associated with the nodes’ IB back-end network. Since the IB interconnect latency is relatively small, the size of L1 cache, and the typical amount of data stored per request, is less than L2 cache.

L1 is also known as remote cache because it contains data retrieved from other nodes in the cluster. It is coherent across the cluster, but is used only by the node on which it resides, and is not accessible by other nodes. Data in L1 cache on storage nodes is aggressively discarded after it is used. L1 cache uses file-based addressing, in which data is accessed via an offset into a file object. The L1 cache refers to memory on the same node as the initiator. It is only accessible to the local node, and typically the cache is not the primary copy of the data. This is analogous to the L1 cache on a CPU core, which may be invalidated as other cores write to main memory. L1 cache coherency is managed via a MESI-like protocol using distributed locks, as described above.

L2, or back-end cache, refers to local memory on the node on which a particular block of data is stored. L2 reduces the latency of a read operation by not requiring a seek directly from the disk drives. As such, the amount of data prefetched into L2 cache for use by remote nodes is much greater than that in L1 cache.

L2 is also known as local cache because it contains data retrieved from disk drives located on that node and then made available for requests from remote nodes. Data in L2 cache is evicted according to a Least Recently Used (LRU) algorithm. Data in L2 cache is addressed by the local node using an offset into a disk drive which is local to that node. Since the node knows where the data requested by the remote nodes is located on disk, this is a very fast way of retrieving data destined for remote nodes. A remote node accesses L2 cache by doing a lookup of the block address for a particular file object. As described above, there is no MESI invalidation necessary here and the cache is updated automatically during writes and kept coherent by the transaction system and NVRAM.

L3 cache is a subsystem which caches evicted L2 blocks on a node. Unlike L1 and L2, not all nodes or clusters have an L3 cache, since it requires solid state drives (SSDs) to be present and exclusively reserved and configured for caching use. L3 serves as a large, cost-effective way of extending a node’s read cache from gigabytes to terabytes. This allows clients to retain a larger working set of data in cache, before being forced to retrieve data from higher latency spinning disk. The L3 cache is populated with “interesting” L2 blocks dropped from memory by L2’s least recently used cache eviction algorithm. Unlike RAM based caches, since L3 is based on persistent flash storage, once the cache is populated, or warmed, it’s highly durable and persists across node reboots, etc. L3 uses a custom log-based filesystem with an index of cached blocks. The SSDs provide very good random read access characteristics, such that a hit in L3 cache is not that much slower than a hit in L2.

To utilize multiple SSDs for cache effectively and automatically, L3 uses a consistent hashing approach to associate an L2 block address with one L3 SSD. In the event of an L3 drive failure, a portion of the cache will obviously disappear, but the remaining cache entries on other drives will still be valid. Before a new L3 drive may be added to the hash, some cache entries must be invalidated.

OneFS also uses a dedicated inode cache in which recently requested inodes are kept. The inode cache frequently has a large impact on performance, because clients often cache data, and many network I/O activities are primarily requests for file attributes and metadata, which can be quickly returned from the cached inode.

OneFS provides tools to accurately assess the performance of the various levels of cache at a point in time. These cache statistics can be viewed from the OneFS CLI using the isi_cache_stats command. Statistics for L1, L2 and L3 cache are displayed for both data and metadata. For example:

# isi_cache_stats
Totals 

l1_data: a 224G 100% r 226G 100% p 318M 77%, l1_encoded: a 0.0B 0% r 0.0B 0% p 0.0B 0%, l1_meta: r 4.5T 99% p 152K 48%, 

l2_data: r 1.2G 95% p 115M 79%, l2_meta: r 27G 72% p 28M 3%, 

l3_data: r 0.0B 0% p 0.0B 0%, l3_meta: r 8G 99% p 0.0B 0%

For more detailed and formatted output, a verbose option of the command is available using the ‘isi_cache_stats -v’ option.

It’s worth noting that for L3 cache, the prefetch statistics will always read zero, since it’s a pure eviction cache and does not utilize data or metadata prefetch.

Due to balanced data distribution, automatic rebalancing, and distributed processing, OneFS is able to leverage additional CPUs, network ports, and memory as the system grows. This also allows the caching subsystem (and, by extension, throughput and IOPS) to scale linearly with the cluster size.

FAQ: Ansible Module for Dell EMC Isilon

To which Ansible module for Dell EMC Isilon version does this FAQ apply?

This FAQ applies to version 1.1 of the module.

 

Where can I get this Ansible module for Dell EMC Isilon?

We have a community in GitHub: https://github.com/dell/ansible-isilon

 

What are the software prerequisites?

  • Isilon OneFS 8 or higher
  • Ansible 2.7 or higher
  • Python 2.7.12 or higher
  • Red Hat Enterprise Linux 7.6

 

What are the supported features for this Ansible module for Dell EMC Isilon?

The Ansible Modules for Dell EMC Isilon includes:

  • File System Module
  • Access Zone Module
  • Users Module
  • Groups Module
  • Snapshot Module
  • Snapshot Schedule Module
  • NFS Module
  • SMB Module
  • Gather Facts Module

Each module includes View, Create, Delete and Modify operations. For the details, refer to the table below:

Operation   user   group   filesystem   Access zone   NFS export   SMB share   snapshot   Snapshot schedule
Create      y      y       y            n             y            y           y          y
Modify      y      y       y            y             y            y           y          y
Delete      y      y       y            n             y            y           y          y
View        y      y       y            y             y            y           y          y

What is a ‘filesystem’, given that we don’t see this concept in Isilon?

A filesystem in this Ansible module represents a directory in a given access zone, with owner, ACL and even quotas specified.

 

How do I install it?

I’ve listed high-level steps below. For the details, refer to the product guide at

https://github.com/dell/ansible-isilon/blob/dellemc_ansible/docs/Ansible%20for%20Dell%20EMC%20Isilon%20v1.1%20Product%20Guide.pdf

The following example uses CentOS 8 + Python 3.6 + Ansible 2.9.5 + Isilon SDK 8.1.1 + OneFS 8.2.2. The overall steps are as follows:

  1. Install Ansible 2.9.5

# dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

# dnf install ansible

  2. Check the Python version used by Ansible with the following command

# ansible --version

In my case it’s python 3.6.8

[root@c8 ~]# ansible --version

ansible 2.9.5

  config file = /etc/ansible/ansible.cfg

  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']

  ansible python module location = /usr/lib/python3.6/site-packages/ansible

  executable location = /usr/bin/ansible

  python version = 3.6.8 (default, Nov 21 2019, 19:31:34) [GCC 8.3.1 20190507 (Red Hat 8.3.1-4)]

  3. Install Isilon SDK 8.1.1

# pip3 install isi_sdk_8_1_1
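As a quick sanity check (not part of the product guide), you can confirm the SDK is importable by the same Python interpreter that Ansible uses:

# python3 -c 'import isi_sdk_8_1_1; print("isi_sdk_8_1_1 imported OK")'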

  4. Install the Isilon Ansible module (make sure the path is aligned with the Python version):

Copy utils/dellemc_ansible_utils.py to  /usr/lib/python3.6/site-packages/ansible/module_utils/

Copy all module Python files from ‘isilon/library’ folder to  /usr/lib/python3.6/site-packages/ansible/modules/storage/emc
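As a concrete illustration of the two copy operations above, assuming the repository was cloned into the current working directory and Python 3.6 is in use (adjust the paths to match your environment):

# cp utils/dellemc_ansible_utils.py /usr/lib/python3.6/site-packages/ansible/module_utils/

# cp dellemc_ansible/isilon/library/*.py /usr/lib/python3.6/site-packages/ansible/modules/storage/emc/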

  5. Install the playbook

Copy dellemc_ansible/isilon/playbooks to any place you want.

  6. Test the installation

Update playbooks/flo_test.yml. Mine is as below:

- name: Collect set of facts in Isilon
  hosts: localhost
  connection: local
  vars:
    onefs_host: '192.168.116.88'
    verify_ssl: False
    api_user: 'root'
    api_password: 'a'
    access_zone: 'System'
  tasks:
  - name: Get nodes of the Isilon cluster
    dellemc_isilon_gatherfacts:
      onefs_host: "{{onefs_host}}"
      verify_ssl: "{{verify_ssl}}"
      api_user: "{{api_user}}"
      api_password: "{{api_password}}"
      gather_subset:
        - nodes
    register: subset_result
  - debug:
      var: subset_result

Run the playbook:

ansible-playbook  <path to playbooks/flo_test.yml>

If everything is good, you should see the info for your Isilon cluster returned:

…………

                        "release": "v9.0.0.BETA.0",
                        "uptime": 24533,
                        "version": "Isilon OneFS v8.2.2(RELEASE): 0x900003000000001:Tue Feb 25 09:19:10 PST 2020    root@se********-build11-114:/b/mnt/obj/b/mnt/src/********md64.********md64/sys/IQ.********md64.rele********se   FreeBSD cl********ng version 5.0.0 (t********gs/RELEASE_500/fin********l 312559) (b********sed on LLVM 5.0.0svn)"
                    }
                }
            ],
            "total": 1
        },
        "Providers": [],
        "Users": [],
        "changed": false,
        "failed": false
    }

}

 

PLAY RECAP *********************************************************************

localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0  

 

Does this module support quotas?

The current version only supports directory (file system) quotas, but not user or group quotas.

 

Where can I find the examples?

Check examples from each module’s file in /ansible-isilon/dellemc_ansible/isilon/library/
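Since the modules are installed into Ansible's module path, the standard ansible-doc utility should also be able to display their embedded documentation and examples, for instance:

# ansible-doc dellemc_isilon_nfs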

I’ve also created a short video on how to use this module to create and mount an NFS export from Isilon.

 

What are the limitations of this module?

Gatherfacts

Getting the list of users and groups with very long names may fail.

Users and Groups

Only local users and groups can be created.

Operations on users and groups with very long names may fail.

Access Zone

Creation and deletion of access zones is not supported.

Filesystems

ACLs can only be modified from POSIX to POSIX mode.

Only directory quotas are supported but not user or group quotas.

Modification of include_snap_data flag is not supported.

NFS Export

If there are multiple exports present with the same path in an access zone, operations on such exports fail.

Advanced Isilon features

No support for advanced Isilon features like SyncIQ, tiering, WORM and so on.

How do I uninstall the module?

  1. pip3 uninstall isi_sdk_8_1_1
  2. Remove dellemc_ansible_utils.py from  /usr/lib/python3.6/site-packages/ansible/module_utils/
  3. Remove the following files from  /usr/lib/python3.6/site-packages/ansible/modules/storage/emc

dellemc_isilon_accesszone.py

dellemc_isilon_filesystem.py

dellemc_isilon_gatherfacts.py

dellemc_isilon_group.py

dellemc_isilon_nfs.py

dellemc_isilon_smb.py

dellemc_isilon_snapshot.py

dellemc_isilon_snapshotschedule.py

dellemc_isilon_user.py

  4. Remove all the playbooks

 

Where do I submit an issue against the module?

The Ansible module for Dell EMC Isilon is officially supported by Dell EMC. Therefore, you can open a ticket directly on the support website: https://www.dell.com/support/ or start a discussion in the forum: https://www.dell.com/community/Containers/bd-p/Containers

 

Can I run this module in a production environment?

Yes, the module is production-grade. Please make sure your environment follows the pre-requisites and Ansible best practices.

Quick overview of Ansible module for Dell EMC Isilon

The Ansible module for Dell EMC Isilon was released two months ago, and I’ve seen that many people are interested in it. Here is a quick demo of how to use it to create and mount an NFS export. For further details, please go to our GitHub community: https://github.com/dell/ansible-isilon

OneFS Endurant Cache

The earlier blog post on multi-threaded I/O generated several questions on synchronous writes in OneFS. So this seemed like a useful topic to explore in a bit more detail.

OneFS natively provides a caching mechanism for synchronous writes – or writes that require a stable write acknowledgement to be returned to a client. This functionality is known as the Endurant Cache, or EC.

The EC operates in conjunction with the OneFS write cache, or coalescer, to ingest, protect and aggregate small, synchronous NFS writes. The incoming write blocks are staged to NVRAM, ensuring the integrity of the write, even during the unlikely event of a node’s power loss.  Furthermore, EC also creates multiple mirrored copies of the data, further guaranteeing protection from single node and, if desired, multiple node failures.

EC improves the latency associated with synchronous writes by reducing the time to acknowledgement back to the client. This process removes the Read-Modify-Write (R-M-W) operations from the acknowledgement latency path, while also leveraging the coalescer to optimize writes to disk. EC is also tightly coupled with OneFS’ multi-threaded I/O (Multi-writer) process, to support concurrent writes from multiple client writer threads to the same file. And the design of EC ensures that the cached writes do not impact snapshot performance.

The endurant cache uses write logging to combine and protect small writes at random offsets into 8KB linear writes. To achieve this, the writes go to special mirrored files, or ‘Logstores’. The response to a stable write request can be sent once the data is committed to the logstore. Logstores can be written to by several threads from the same node, and are highly optimized to enable low-latency concurrent writes.

Note that if a write uses the EC, the coalescer must also be used. If the coalescer is disabled on a file, but EC is enabled, the coalescer will still be active with all data backed by the EC.

So what exactly does an endurant cache write sequence look like?

Say an NFS client wishes to write a file to an Isilon cluster over NFS with the O_SYNC flag set, requiring a confirmed or synchronous write acknowledgement. Here is the sequence of events that occur to facilitate a stable write.

1)  A client, connected to node 3, begins the write process sending protocol level blocks. 4K is the optimal block size for the endurant cache.

2)  The NFS client’s writes are temporarily stored in the write coalescer portion of node 3’s RAM. The write coalescer aggregates uncommitted blocks so that OneFS can, ideally, write out full protection groups where possible, reducing latency over protocols that allow “unstable” writes. Writing to RAM has far less latency than writing directly to disk.

3)  Once in the write coalescer, the endurant cache log-writer process writes mirrored copies of the data blocks in parallel to the EC Log Files.

The protection level of the mirrored EC log files is the same as that of the data being written by the NFS client.

4)  When the data copies are received into the EC Log Files, a stable write exists and a write acknowledgement (ACK) is returned to the NFS client confirming the stable write has occurred. The client assumes the write is completed and can close the write session.

5)  The write coalescer then processes the file just like a non-EC write at this point. The write coalescer fills and is routinely flushed as required, performing asynchronous writes via the block allocation manager (BAM) and the BAM safe write (BSW) path processes.

6)  The file is split into 128K data stripe units (DSUs), parity protection (FEC) is calculated and FEC stripe units (FSUs) are created.

7)  The layout and write plan is then determined, and the stripe units are written to their corresponding nodes’ L2 Cache and NVRAM. The EC logfiles are cleared from NVRAM at this point. OneFS uses a Fast Invalid Path process to de-allocate the EC Log Files from NVRAM.

8)  Stripe Units are then flushed to physical disk.

9)  Once written to physical disk, the data stripe Unit (DSU) and FEC stripe unit (FSU) copies created during the write are cleared from NVRAM but remain in L2 cache until flushed to make room for more recently accessed data.

As far as protection goes, the number of logfile mirrors created by EC is always one more than the on-disk protection level of the file. For example:

File Protection Level   Number of EC Mirrored Copies
+1n                     2
2x                      3
+2n                     3
+2d:1n                  3
+3n                     4
+3d:1n                  4
+4n                     5

The EC mirrors are only used if the initiator node is lost. In the unlikely event that this occurs, the participant nodes replay their EC journals and complete the writes.

If the write is an EC candidate, the data remains in the coalescer, an EC write is constructed, and the appropriate coalescer region is marked as EC. The EC write is a write into a logstore (hidden mirrored file) and the data is placed into the journal.

Assuming the journal is sufficiently empty, the write is held there (cached) and only flushed to disk when the journal is full, thereby saving additional disk activity.

An optimal workload for EC involves small-block synchronous, sequential writes – something like an audit or redo log, for example. In that case, the coalescer will accumulate a full protection group’s worth of data and be able to perform an efficient FEC write.

The happy medium is a synchronous small block type load, particularly where the I/O rate is low and the client is latency-sensitive. In this case, the latency will be reduced and, if the I/O rate is low enough, it won’t create serious pressure.

The undesirable scenario is when the cluster is already spindle-bound and the workload is such that it generates a lot of journal pressure. In this case, EC is just going to aggravate things.

So how exactly do you configure the endurant cache?

The Endurant Cache is on by default, but if it has been disabled, setting the efs.bam.ec.mode sysctl to ‘1’ will re-enable it:

# isi_sysctl_cluster efs.bam.ec.mode=1
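To check the current setting on every node, a plain sysctl read can be used, for example:

# isi_for_array -s sysctl efs.bam.ec.mode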

EC can also be enabled & disabled per directory:

# isi set -c [on|off|endurant_all|coal_only] <directory_name>

To enable the coalescer but switch off EC, run:

# isi set -c coal_only <directory_name>
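For example, applied to a hypothetical log directory:

# isi set -c coal_only /ifs/data/logs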

And to disable the endurant cache completely:

# isi_for_array -s isi_sysctl_cluster efs.bam.ec.mode=0

A return value of zero on each node from the following command will verify that EC is disabled across the cluster:

# isi_for_array -s sysctl efs.bam.ec.stats.write_blocks

efs.bam.ec.stats.write_blocks: 0

If the output to this command is incrementing, EC is delivering stable writes.

Be aware that if the Endurant Cache is disabled on a cluster the sysctl efs.bam.ec.stats.write_blocks output on each node will not return to zero, since this sysctl is a counter, not a rate. These counters won’t reset until the node is rebooted.

As mentioned previously, EC applies to stable writes. Namely:

  • Writes with O_SYNC and/or O_DIRECT flags set
  • Files on synchronous NFS mounts
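One simple way to generate stable 4KB writes from a Linux NFS client for testing purposes is GNU dd’s oflag=sync option (the mount point and file name below are purely illustrative):

# dd if=/dev/zero of=/mnt/nfs_export/ec_test bs=4k count=1000 oflag=sync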

When it comes to analyzing any performance issues involving EC workloads, consider the following:

  • What changed with the workload?
  • If upgrading OneFS, did the prior version also have EC enabled?
  • Has the workload moved to new cluster hardware?
  • Does the performance issue occur during periods of high CPU utilization?
  • Which part of the workload is creating a deluge of stable writes?
  • Was there a large change in spindle or node count?
  • Has the OneFS protection level changed?
  • Is the SSD strategy the same?

Disabling EC is typically done cluster-wide and this can adversely impact certain workflow elements. If the EC load is localized to a subset of the files being written, an alternative way to reduce the EC heat might be to disable the coalescer buffers for some particular target directories, which would be a more targeted adjustment. This can be configured via the isi set -c off command.

One of the more likely causes of performance degradation is from applications aggressively flushing over-writes and, as a result, generating a flurry of ‘commit’ operations. This can generate heavy read/modify/write (r-m-w) cycles, inflating the average disk queue depth, and resulting in significantly slower random reads. The isi statistics protocol CLI command output will indicate whether the ‘commit’ rate is high.

It’s worth noting that synchronous writes do not require using the NFS ‘sync’ mount option. Any programmer who is concerned with write persistence can simply specify an O_FSYNC or O_DIRECT flag on the open() operation to force synchronous write semantics for that file handle. With Linux, writes using O_DIRECT will be separately accounted for in the Linux ‘mountstats’ output. Although it’s almost exclusively associated with NFS, the EC code is actually protocol-agnostic. If writes are synchronous (write-through) and are either misaligned or smaller than 8K, they have the potential to trigger EC, regardless of the protocol.

The endurant cache can provide a significant latency benefit for small (eg. 4K), random synchronous writes – albeit at a cost of some additional work for the system.

However, it’s worth bearing the following caveats in mind:

  • EC is not intended for more general purpose I/O.
  • There is a finite amount of EC available. As load increases, EC can potentially ‘fall behind’ and end up being a bottleneck.
  • Endurant Cache does not improve read performance, since it’s strictly part of the write process.
  • EC will not increase performance of asynchronous writes – only synchronous writes.

OneFS Writes

OneFS runs equally across all the nodes in a cluster, such that no one node controls the cluster and all nodes are true peers. Looking from a high level at the components within each node, the I/O stack is split into a top layer, or initiator, and a bottom layer, or participant. This division is used as a logical model for the analysis of OneFS’ read and write paths.

At a physical-level, CPUs and memory cache in the nodes are simultaneously handling initiator and participant tasks for I/O taking place throughout the cluster. There are caches and a distributed lock manager that are excluded from the diagram below for simplicity’s sake.

When a client connects to a node to write a file, it is connecting to the top half or initiator of that node. Files are broken into smaller logical chunks called stripes before being written to the bottom half or participant of a node (disk). Failure-safe buffering using a write coalescer is used to ensure that writes are efficient and read-modify-write operations are avoided. The size of each file chunk is referred to as the stripe unit size. OneFS stripes data across all nodes and protects the files, directories and associated metadata via software erasure-code or mirroring.

OneFS determines the appropriate data layout to optimize for storage efficiency and performance. When a client connects to a node, that node’s initiator acts as the ‘captain’ for the write data layout of that file. Data, erasure code (FEC) protection, metadata and inodes are all distributed on multiple nodes, and spread across multiple drives within nodes. The following illustration shows a file write occurring across all nodes in a three node cluster.

OneFS uses a cluster’s Ethernet or Infiniband back-end network to allocate and automatically stripe data across all nodes. As data is written, it’s also protected at the specified level.

When writes take place, OneFS divides data out into atomic units called protection groups. Redundancy is built into protection groups, such that if every protection group is safe, then the entire file is safe. For files protected by FEC, a protection group consists of a series of data blocks as well as a set of parity blocks for reconstruction of the data blocks in the event of drive or node failure. For mirrored files, a protection group consists of all of the mirrors of a set of blocks.

OneFS is capable of switching the type of protection group used in a file dynamically, as it is writing. This allows the cluster to continue without blocking in situations when temporary node failure prevents the desired level of parity protection from being applied. In this case, mirroring can be used temporarily to allow writes to continue. When nodes are restored to the cluster, these mirrored protection groups are automatically converted back to FEC protection.

During a write, data is broken into stripe units and these are spread across multiple nodes as a protection group. As data is being laid out across the cluster, erasure codes or mirrors, as required, are distributed within each protection group to ensure that files are protected at all times.

One of the key functions of the OneFS AutoBalance job is to reallocate and balance data and, where possible, make storage space more usable and efficient. In most cases, the stripe width of larger files can be increased to take advantage of new free space, as nodes are added, and to make the on-disk layout more efficient.

The initiator top half of the ‘captain’ node uses a modified two-phase commit (2PC) transaction to safely distribute writes across the cluster, as follows:

Every node that owns blocks in a particular write operation is involved in a two-phase commit mechanism, which relies on NVRAM for journaling all the transactions that are occurring across every node in the storage cluster. Using multiple nodes’ NVRAM in parallel allows for high-throughput writes, while maintaining data safety against all manner of failure conditions, including power failures. In the event that a node should fail mid-transaction, the transaction is restarted instantly without that node involved. When the node returns, it simply replays its journal from NVRAM.

In a write operation, the initiator also orchestrates the layout of data and metadata, the creation of erasure codes, and lock management and permissions control. OneFS can also optimize layout decisions to better suit the workflow. These access patterns, which can be configured at a per-file or directory level, include:

  • Concurrency: Optimizes for current load on the cluster, featuring many simultaneous clients.
  • Streaming:  Optimizes for high-speed streaming of a single file, for example to enable very fast reading with a single client.
  • Random:  Optimizes for unpredictable access to the file, by adjusting striping and disabling the use of prefetch.
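As mentioned in the list above, these access patterns can typically be applied from the CLI via the ‘isi set’ command’s -a option, for example (the directory path here is purely illustrative):

# isi set -a streaming /ifs/data/media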

OneFS Multi-writer

In the previous blog article we took a look at write locking and shared access in OneFS. Next, we’ll delve another layer deeper into OneFS concurrent file access.

The OneFS locking hierarchy also provides a mechanism called Multi-writer, which allows a cluster to support concurrent writes from multiple client writer threads to the same file. This granular write locking is achieved by sub-dividing the file into separate regions and granting exclusive data write locks to these individual ranges, as opposed to the entire file. This process allows multiple clients, or write threads, attached to a node to simultaneously write to different regions of the same file.

Concurrent writes to a single file need more than just supporting data locks for ranges. Each writer also needs to update a file’s metadata attributes such as timestamps, block count, etc. A mechanism for managing inode consistency is also needed, since OneFS is based on the concept of a single inode lock per file.

In addition to the standard shared read and exclusive write locks, OneFS also provides the following locking primitives, via journal deltas, to allow multiple threads to simultaneously read or write a file’s metadata attributes:

OneFS Lock Types include:

Exclusive: A thread can read or modify any field in the inode. When the transaction is committed, the entire inode block is written to disk, along with any extended attribute blocks.

Shared: A thread can read, but not modify, any inode field.

DeltaWrite: A thread can modify any inode fields which support delta-writes. These operations are sent to the journal as a set of deltas when the transaction is committed.

DeltaRead: A thread can read any field which cannot be modified by inode deltas.

These locks allow separate threads to have a Shared lock on the same LIN, or for different threads to have a DeltaWrite lock on the same LIN. However, it is not possible for one thread to have a Shared lock and another to have a DeltaWrite. This is because the Shared thread cannot perform a coherent read of a field which is in the process of being modified by the DeltaWrite thread.

The DeltaRead lock is compatible with both the Shared and DeltaWrite lock. Typically the file system will attempt to take a DeltaRead lock for a read operation, and a DeltaWrite lock for a write, since this allows maximum concurrency, as all these locks are compatible.

Here’s what the write lock incompatibilities look like:

OneFS protects data by writing file blocks (restriping) across multiple drives on different nodes. The Job Engine defines a ‘restripe set’ comprising jobs which involve file system management, protection and on-disk layout. The restripe set contains the following jobs:

  • AutoBalance & AutoBalanceLin
  • FlexProtect & FlexProtectLin
  • MediaScan
  • MultiScan
  • SetProtectPlus
  • SmartPools
  • Upgrade

Multi-writer for restripe allows multiple restripe worker threads to operate on a single file concurrently. This in turn improves read/write performance during file re-protection operations, plus helps reduce the window of risk (MTTDL) during drive Smartfails, etc. This is particularly true for workflows consisting of large files, while one of the above restripe jobs is running. Typically, the larger the files on the cluster, the more benefit multi-writer for restripe will offer.

Note that OneFS multi-writer ranges are not a fixed size; instead, they are tied to layout/protection groups, so they are typically in the megabyte size range. The number of threads that can write to the same file concurrently, from the filesystem perspective, is only limited by file size. However, NFS file handle affinity (FHA) comes into play on the protocol side, so the default is typically eight threads per node.

The clients themselves do not apply for granular write range locks in OneFS, since multi-writer operation is completely invisible to the protocol. Multi-writer uses proprietary locking which OneFS performs to coordinate filesystem operations. As such, multi-writer is distinct from byte-range locking that application code would call, or even oplocks/leases which the client protocol stack would call.

Depending on the workload, multi-writer can improve performance by allowing for more concurrency, and (while typically on by default in recent releases) can be enabled on a cluster from the CLI as follows:

# isi_sysctl_cluster efs.bam.coalescer.multiwriter=1

Similarly, to disable multi-writer:

# isi_sysctl_cluster efs.bam.coalescer.multiwriter=0
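The current value can be verified on each node with a standard sysctl read, for example:

# isi_for_array -s sysctl efs.bam.coalescer.multiwriter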

Note that, as a general rule, unnecessary contention should be avoided. For example:

  • Avoid placing unrelated data in the same directory: Use multiple directories instead. Even if it is related, split it up if there are many entries.
  • Use multiple files: Even if the data is ultimately related, from a performance/scalability perspective, having each client use its own file and then combining them as a final stage is the correct way to architect for performance.

With multi-writer for restripe, an exclusive lock is no longer required on the LIN during the actual restripe of data. Instead, OneFS tries to use a delta write lock to update the cursors used to track which parts of the file need restriping. This means that a client application or program should be able to continue to write to the file while the restripe operation is underway.

An exclusive lock is only required for a very short period of time while a file is set up to be restriped. A file will have fixed widths for each restripe lock, and the number of range locks will depend on the quantity of threads and nodes which are actively restriping a single file.

OneFS File Locking and Concurrent Access

There has been a bevy of recent questions around how OneFS allows various clients attached to different nodes of a cluster to simultaneously read from and write to the same file. So it seemed like a good time for a quick refresher on some of the concepts and mechanics behind OneFS concurrency and distributed locking.

File locking is the mechanism that allows multiple users or processes to access data concurrently and safely. For reading data, this is a fairly straightforward process involving shared locks. With writes, however, things become more complex and require exclusive locking, since data must be kept consistent.

OneFS has a fully distributed lock manager that marshals locks on data across all the nodes in a storage cluster. This locking manager is highly extensible and allows for multiple lock types to support both file system locks as well as cluster-coherent protocol-level locks, such as SMB share mode locks or NFS advisory-mode locks. OneFS also has support for delegated locks such as SMB oplocks and NFSv4 delegations.

Every node in a cluster is able to act as coordinator for locking resources, and a coordinator is assigned to lockable resources based upon a hashing algorithm. This selection algorithm is designed so that the coordinator almost always ends up on a different node than the initiator of the request. When a lock is requested for a file, it can either be a shared lock or an exclusive lock. A shared lock is primarily used for reads and allows multiple users to share the lock simultaneously. An exclusive lock, on the other hand, allows only one user access to the resource at any given moment, and is typically used for writes. Exclusive lock types include:

Mark Lock:  An exclusive lock resource used to synchronize the marking and sweeping processes for the Collect job engine job.

Snapshot Lock:  As the name suggests, the exclusive snapshot lock which synchronizes the process of creating and deleting snapshots.

Write Lock:  An exclusive lock that’s used to quiesce writes for particular operations, including snapshot creates, non-empty directory renames, and marks.

The OneFS locking infrastructure has its own terminology, and includes the following definitions:

Domain: Refers to the specific lock attributes (recursion, deadlock detection, memory use limits, etc) and context for a particular lock application. There is one definition of owner, resource, and lock types, and only locks within a particular domain may conflict.

Lock Type: Determines the contention among lockers. A shared or read lock does not contend with other types of shared or read locks, while an exclusive or write lock contends with all other types. Lock types include:
• Advisory
• Anti-virus
• Data
• Delete
• LIN
• Mark
• Oplocks
• Quota
• Read
• Share Mode
• SMB byte-range
• Snapshot
• Write

Locker: Identifies the entity which acquires a lock.

Owner: A locker which has successfully acquired a particular lock. A locker may own multiple locks of the same or different type as a result of recursive locking.

Resource: Identifies a particular lock. Lock acquisition only contends on the same resource. The resource ID is typically a LIN to associate locks with files.

Waiter: Has requested a lock, but has not yet been granted or acquired it.

Here’s an example of how threads from different nodes could request a lock from the coordinator:

1. Node 2 is selected as the lock coordinator of these resources.
2. Thread 1 from Node 4 and thread 2 from Node 3 request a shared lock on a file from Node 2 at the same time.
3. Node 2 checks if an exclusive lock exists for the requested file.
4. If no exclusive locks exist, Node 2 grants thread 1 from Node 4 and thread 2 from Node 3 shared locks on the requested file.
5. Node 3 and Node 4 are now performing a read on the requested file.
6. Thread 3 from Node 1 requests an exclusive lock for the same file as being read by Node 3 and Node 4.
7. Node 2 checks with Node 3 and Node 4 if the shared locks can be reclaimed.
8. Node 3 and Node 4 are still reading so Node 2 asks thread 3 from Node 1 to wait for a brief instant.
9. Thread 3 from Node 1 blocks until the exclusive lock is granted by Node 2 and then completes the write operation.

OneFS Drive Performance Statistics

The previous post examined some of the general cluster performance metrics. In this article we’ll focus in on the disk subsystem and take a quick look at some of the drive statistics counters. As we’ll see, OneFS offers several tools to inspect and report on both drive health and performance.

Let’s start with some drive failure and wear reporting tools….

The following cluster-wide command will indicate any drives that are marked as smartfail, empty, stalled, or down:

# isi_for_array -sX 'isi devices list | egrep -vi "healthy|L3"'

Usually, any node that requires a drive replacement will have an amber warning light on the front display panel. Also, the drive that needs swapping out will typically be marked by a red LED.

Alternatively, isi_drivenum will also show the drive bay location of each drive, plus a variety of other disk related info, etc.

# isi_for_array -sX 'isi_drivenum -A'

This next command provides drive wear information for each node’s flash (SSD) boot drives:

# isi_for_array -sSX "isi_radish -a /dev/da* | grep -e FW: -e 'Percent Life' | grep -v Used"

However, the output is in hex. This can be converted to a decimal percent value using the following shell command, where <value> is the raw hex output:

# echo "ibase=16; <value>" | bc

Alternatively, the following perl script will also translate the isi_radish command output from hex into comprehensible ‘life remaining’ percentages:

#!/usr/bin/perl

use strict;

use warnings;

my @drives = ('ada0', 'ada1');

foreach my $drive (@drives)

{

print "$drive:\n";

open CMD,'-|',"isi_for_array -s isi_radish -vt /dev/$drive" or

die "Failed to open pipe!\n";

while (<CMD>)

{

if (m/^(\S+).*(Life Remaining|Lifetime Left).*\(raw\s+([^)]+)/i)

{

print "$1 ".hex($3)."%\n";

}

}

}

The following drive statistics can be useful for both performance analysis and troubleshooting purposes.

General disk activity stats are available via the isi statistics command.

For example:

# isi statistics system --nodes=all --oprates --nohumanize

This output will give you the per-node OPS over protocol, network and disk. On the disk side, the sum of DiskIn (writes) and DiskOut (reads) gives the total IOPS for all the drives per node.

For the next level of granularity, the following drive statistics command provides individual drive info (SATA drives in this example). The sum of OpsIn and OpsOut is the total IOPS per drive in the cluster.

# isi statistics drive -nall --long --type=sata --sort=busy | head -20

And the same info for SSDs:

# isi statistics drive -nall --long --type=ssd --sort=busy | head -20

The primary counters of interest in the drive stats data are often TimeInQ, Queued, OpsIn, OpsOut, IO, and the Busy percentage of each disk. If most or all the drives have high busy percentages, this indicates a uniform resource constraint, and there is a strong likelihood that the cluster is spindle bound. If, say, the top five drives are much busier than the rest, this suggests a workflow hot-spot.

# isi statistics pstat

The read and write mix, plus metadata operations, for a particular protocol can be gleaned from the output of the isi statistics pstat command. In addition to disk statistics, CPU and network stats are also provided. The --protocol parameter is used to specify the core NAS protocols such as NFSv3, NFSv4, SMB1, SMB2, HDFS, etc. Additionally, OneFS specific protocol stats, including job engine (jobd), platform API (papi), IRP, etc, are also available.

For example, the following will show NFSv3 stats in a ‘top’ format, refreshed every 6 seconds by default:

# isi statistics pstat --protocol nfs3 --format top

The uptime command provides the system load average over 1, 5, and 15 minute intervals, and comprises both CPU queue and disk queue stats.

# isi_for_array -s 'uptime'

It’s worth noting that this command’s output does not take CPU quantity into account. As such, a load average of 1 on a single-CPU node means the node is pegged, whereas the same load average of 1 on a dual-CPU system means the CPUs are 50% idle.

The following command will give the CPU count:

# isi statistics query current --nodes all --degraded --stats node.cpu.count

The sum of disk ops across a cluster per node is available via the following syntax:

# isi statistics query current --nodes=all --stats=node.disk.xfers.rate.sum

There are a whole slew of more detailed drive metrics that OneFS makes available for query.

Disk time in queue provides an indication as to how long an operation is queued on a drive. This indicator is key if a cluster is disk-bound. A time in queue value of 10 to 50 milliseconds is concerning, whereas a value of 50 to 100 milliseconds indicates a potential problem.

The following CLI syntax can be used to obtain the maximum, minimum, and average values for disk time in queue for SATA drives in this case:

# isi statistics drive --nodes=all --degraded --no-header --no-footer | awk 'BEGIN {min=1000} /SATA/ {sum+=$8; n++; if ($8>max) max=$8; if ($8<min) min=$8} END {print "Min = ",min; print "Max = ",max; print "Average = ",sum/n}'

The following command displays the time in queue for 30 drives sorted highest-to-lowest:

# isi statistics drive list -n all --sort=timeinq | head -n 30

Queue depth indicates how many operations are queued on drives. A queue depth of 5 to 10 is considered heavy queuing.

The following CLI command can be used to obtain the maximum, minimum, and average values for disk queue depth of SATA drives. If there’s a big delta between the maximum number and average number in the queue, it’s worth investigating further to determine whether an individual drive is working excessively.

# isi statistics drive --nodes=all --degraded --no-header --no-footer | awk 'BEGIN {min=1000} /SATA/ {sum+=$9; n++; if ($9>max) max=$9; if ($9<min) min=$9} END {print "Min = ",min; print "Max = ",max; print "Average = ",sum/n}'

For information on SAS or SSD drives, you can substitute SAS or SSD for SATA in the above syntax.

To display queue depth for twenty drives sorted highest-to-lowest, run the following command:

# isi statistics drive list -n all --sort=queued | head -n 20

Note that the TimeAvg metric, as reported by the isi statistics drive command, represents all the latency at the disk that doesn’t include the scheduler wait time (TimeInQ). So this is a measure of disk access time (i.e. send the op, wait, receive the response). The total time at the disk is the sum of the access time (TimeAvg) and the scheduler time (TimeInQ).

The disk percent busy metric can be useful to determine if a drive is getting pegged. However, it does not indicate how much extra work may be in the queue. To obtain the maximum, minimum, and average disk busy values for SATA drives, run the following command. For information on SAS or SSD drives, you can include SAS or SSD respectively, instead of SATA.

# isi statistics drive --nodes=all --degraded --no-header --no-footer | awk 'BEGIN {min=1000} /SATA/ {sum+=$10; n++; if ($10>max) max=$10; if ($10<min) min=$10} END {print "Min = ",min; print "Max = ",max; print "Average = ",sum/n}'

To display disk percent busy for twenty drives sorted highest-to-lowest, run the following command.

# isi statistics drive -nall --output=busy | head -n 20