PowerScale A310 and A3100 Platforms

In this article, we’ll examine the new PowerScale A310 and A3100 hardware platforms that were released a couple of weeks back.

The A310 and A3100 comprise the latest generation of PowerScale A-series ‘archive’ platforms:

The PowerScale A-series systems are designed for cooler, infrequently accessed data use cases. These include active archive workflows for the A310, such as regulatory compliance data, medical imaging archives, financial records, and legal documents. And deep archive/cold storage for the A3100 platform, including surveillance video archives, backup, and DR repositories.

Representing the archive tier, the A310 and A3100 both utilize a single-socket Xeon processor with 96GB of memory and fifteen (A310) or twenty (A3100) hard drives per node, plus SSDs for metadata/caching – and with four nodes residing within a 4RU chassis. From an initial 4 node (1 chassis) starting point, A310 and A3100 clusters can be easily and non-disruptively scaled two nodes at a time up to a maximum of 252 nodes (63 chassis) per cluster.

The A31x modular platform is based on Dell’s ‘Infinity’ chassis. Each node’s compute module contains a single 8-core Intel Sapphire Rapids CPU running at 1.8 GHz with 22.5MB of cache, plus 96GB of DDR5 DRAM. Front-end networking options include 10/25 GbE, with either Ethernet or Infiniband selectable for the back-end network.

As such, the new A31x core hardware specifications are as follows:

Hardware Class PowerScale A-Series (Archive)
Model A310 A3100
OS version Requires OneFS 9.11 or above, and NFP 13.1 or greater. BIOS based on Dell’s PowerBIOS. Requires OneFS 9.11 or above, and NFP 13.1 or greater. BIOS based on Dell’s PowerBIOS.
Platform Four nodes per 4RU chassis; upgradeable per pair; node-compatible with prior gens. Four nodes per 4RU chassis; upgradeable per pair; node-compatible with prior gens.
CPU 8 Cores @ 1.8GHz, 22.5MB Cache 8 Cores @ 1.8GHz, 22.5MB Cache
Memory 96GB DDR5 DRAM 96GB DDR5 DRAM
Journal M.2: 480GB NVMe with 3-cell battery backup (BBU) M.2: 480GB NVMe with 3-cell battery backup (BBU)
Depth Standard 36.7 inch chassis Deep 42.2 inch chassis
Cluster size Max of 63 chassis (252 nodes) per cluster. Max of 63 chassis (252 nodes) per cluster.
Storage Drives 60 per chassis (15 per node) 80 per chassis (20 per node)
HDD capacities 2TB, 4TB, 8TB, 12TB, 16TB, 20TB, 24TB 12TB, 16TB, 20TB, 24TB
SSD (cache) capacities 0.8TB, 1.6TB, 3.2TB, 7.68TB 0.8TB, 1.6TB, 3.2TB, 7.68TB
Max raw capacity 1.4PB per chassis 1.9PB per chassis
Front-end network 10/25 Gb Ethernet 10/25 Gb Ethernet
Back-end network Ethernet or Infiniband Ethernet or Infiniband

These node hardware attributes can be easily viewed from the OneFS CLI via the ‘isi_hw_status’ command. For example, from an A3100:

# isi_hw_status
  SerNo: CF2BC243400025
 Config: H6R28
ChsSerN:
ChsSlot: 1
FamCode: A
ChsCode: 4U
GenCode: 10
PrfCode: 3
   Tier: 3
  Class: storage
 Series: n/a
Product: A3100-4U-Single-96GB-1x1GE-2x25GE SFP+-240TB-6554GB SSD
  HWGen: PSI
Chassis: INFINITY (Infinity Chassis)
    CPU: GenuineIntel (1.80GHz, stepping 0x000806f8)
   PROC: Single-proc, Octa-core
    RAM: 103079215104 Bytes
   Mobo: INFINITYPIFANO (Custom EMC Motherboard)
  NVRam: INFINITY (Infinity Memory Journal) (4096MB card) (size 4294967296B)
 DskCtl: LSI3808 (LSI 3808 SAS Controller) (8 ports)
 DskExp: LSISAS35X36I (LSI SAS35x36 SAS Expander - Infinity)
PwrSupl: Slot1-PS0 (type=ACBEL POLYTECH, fw=03.01)
PwrSupl: Slot2-PS1 (type=ACBEL POLYTECH, fw=03.01)
  NetIF: bge0,lagg0,mce0,mce1,mce2,mce3
 BEType: 25GigE
 FEType: 25GigE
 LCDver: IsiVFD2 (Isilon VFD V2)
 Midpln: NONE (No Midplane Support)
Power Supplies OK
Power Supply Slot1-PS0 good
Power Supply Slot2-PS1 good
CPU Operation (raw 0x882C0800)  = Normal
CPU Speed Limit                 = 100.00%
Fan0_Speed                      = 12360.000
Fan1_Speed                      = 12000.000
Slot1-PS0_In_Voltage            = 212.000
Slot2-PS1_In_Voltage            = 209.000
SP_CMD_Vin                      = 12.100
CMOS_Voltage                    = 3.120
Slot1-PS0_Input_Power           = 290.000
Slot2-PS1_Input_Power           = 290.000
Pwr_Consumption                 = 590.000
SLIC0_Temp                      = na
SLIC1_Temp                      = na
DIMM_Bank0                      = 42.000
DIMM_Bank1                      = 40.000
CPU0_Temp                       = -43.000
SP_Temp0                        = 40.000
MP_Temp0                        = na
MP_Temp1                        = 29.000
Embed_IO_Temp0                  = 51.000
Hottest_SAS_Drv                 = -45.000
Ambient_Temp                    = 29.000
Slot1-PS0_Temp0                 = 47.000
Slot1-PS0_Temp1                 = 40.000
Slot2-PS1_Temp0                 = 47.000
Slot2-PS1_Temp1                 = 40.000
Battery0_Temp                   = 38.000
Drive_IO0_Temp                  = 43.000

Also note that the A310 and A3100 are only available in a 96GB memory configuration.
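
As a quick sanity check, the ‘RAM’ value reported by isi_hw_status above (103079215104 bytes) corresponds exactly to that 96GB figure – a trivial Python one-liner confirms the conversion:

ram_bytes = 103079215104
print(ram_bytes / 2**30)   # 96.0 GiB, matching the 96GB memory configuration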

On the front of each chassis is an LCD front panel control with back-lit buttons and 4 LED light bar segments – 1 per node. These LEDs typically display blue for normal operation or yellow to indicate a node fault. The LCD display articulates, allowing it to be swung clear of the drive sleds for non-disruptive HDD replacement, etc.

The rear of the chassis houses the compute modules for each node, which contain CPU, memory, networking, cache SSDs, and power supplies. Specifically, an individual compute module contains a multi-core Sapphire Rapids CPU, memory, an M.2 flash journal, up to two SSDs for L3 cache, six DIMM channels, front-end 10/25 Gb Ethernet, back-end 40/100 or 10/25 Gb Ethernet or Infiniband, an Ethernet management interface, plus power supply and cooling fans:

As shown above, the field replaceable components are indicated via colored ‘touchpoints’. Two touchpoint colors, orange and blue, indicate respectively which components are hot swappable versus replaceable via a node shutdown.

Touchpoint Detail
Blue Cold (offline) field serviceable component
Orange Hot (Online) field serviceable component

The serviceable components within a PowerScale A310 or A3100 chassis are as follows:

Component Hot Swap CRU FRU
Drive sled Yes Yes Yes
· Hard drives (HDDs) Yes Yes Yes
Compute node No Yes Yes
· Compute module No No No
  o M.2 journal flash No No Yes
  o CPU complex No No No
  o DIMMs No No Yes
  o Node fans No No Yes
  o NICs/HBAs No No Yes
  o HBA riser No No Yes
  o Battery backup unit (BBU) No No Yes
  o DIB No No No
· Flash drives (SSDs) Yes Yes Yes
· Power supply with fan Yes Yes Yes
Front panel Yes No Yes
Chassis No No Yes
Rail kits No No Yes
Mid-plane Replace entire chassis

Nodes are paired for resilience and durability, with each pair sharing a mirrored journal and two power supplies:

Storage-wise, each of the four nodes within a PowerScale A310 or A3100 chassis has five associated drive containers, or sleds. These sleds occupy bays in the front of each chassis, with a node’s drive sleds stacked vertically. For example:

Nodes are numbered 1 through 4, left to right looking at the front of the chassis, while the drive sleds are labeled A  through E, with A at the top.

The drive sled is the tray which slides into the front of the chassis. Within each sled, the 3.5” SAS hard drives it contains are numbered sequentially starting from drive zero, which is the HDD adjacent the airdam.

Each drive bay in a sled has an associated yellow ‘drive fault’ LED:

Even when a sled is removed from its chassis and its power source, these fault LEDs will remain active for 10+ minutes. LED viewing holes are also provided so the sled’s top cover does not need to be removed.

The A3100’s 42.2 inch chassis accommodates four HDDs per sled, compared to three drives for the standard (36.7 inch) depth A310 shown above. As such, the A3100 requires a deep rack, such as the Dell Titan cabinet, whereas the A310 can reside in a regular 17” data center cabinet.

The A310 and A3100 platforms support a range of HDD capacities, currently including 2TB, 4TB, 8TB, 12TB, 16TB, 20TB, and 24TB, in both regular ISE (instant secure erase) and self-encrypting drive (SED) formats.

A node’s drive details can be queried with OneFS CLI utilities such as ‘isi_radish’ and ‘isi_drivenum’. For example, the command output from an A3100 node:

# isi_drivenum

Bay  1   Unit 6      Lnum 20    Active      SN:GXNG0X800253     /dev/da1
Bay  2   Unit 7      Lnum 21    Active      SN:GXNG0X800263     /dev/da2
Bay  A0   Unit 19     Lnum 16    Active      SN:ZRT1A5JR         /dev/da6
Bay  A1   Unit 18     Lnum 17    Active      SN:ZRT1A4SE         /dev/da5
Bay  A2   Unit 17     Lnum 18    Active      SN:ZRT1A42D         /dev/da4
Bay  A3   Unit 16     Lnum 19    Active      SN:ZRT19494         /dev/da3
Bay  B0   Unit 25     Lnum 12    Active      SN:ZRT18NEY         /dev/da10
Bay  B1   Unit 24     Lnum 13    Active      SN:ZRT1FJCJ         /dev/da9
Bay  B2   Unit 23     Lnum 14    Active      SN:ZRT18N7F         /dev/da8
Bay  B3   Unit 22     Lnum 15    Active      SN:ZRT1FDJL         /dev/da7
Bay  C0   Unit 31     Lnum 8     Active      SN:ZRT1FJ0T         /dev/da14
Bay  C1   Unit 30     Lnum 9     Active      SN:ZRT1F6BF         /dev/da13
Bay  C2   Unit 29     Lnum 10    Active      SN:ZRT1FJMS         /dev/da12
Bay  C3   Unit 28     Lnum 11    Active      SN:ZRT18NE6         /dev/da11
Bay  D0   Unit 37     Lnum 4     Active      SN:ZRT18N9P         /dev/da18
Bay  D1   Unit 36     Lnum 5     Active      SN:ZRT18N8V         /dev/da17
Bay  D2   Unit 35     Lnum 6     Active      SN:ZRT18NBE         /dev/da16
Bay  D3   Unit 34     Lnum 7     Active      SN:ZRT1FR62         /dev/da15
Bay  E0   Unit 43     Lnum 0     Active      SN:ZRT1FDJ4         /dev/da22
Bay  E1   Unit 42     Lnum 1     Active      SN:ZRT1FR86         /dev/da21
Bay  E2   Unit 41     Lnum 2     Active      SN:ZRT1EJ4H         /dev/da20
Bay  E3   Unit 40     Lnum 3     Active      SN:ZRT1E9MS         /dev/da19

The first two lines of output above (bays 1 & 2) reference the cache SSD drives, contained within the compute module. The remaining ‘bay’ locations indicate both the sled (A to E) and drive (0 to 3). The presence above of four HDDs per sled (i.e. bay numbers 0 to 3) indicates this is an A3100 node, rather than an A310 with only three HDDs per sled.
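
For scripting purposes, the per-sled drive population can also be derived from this output. The following is a minimal Python sketch (a hypothetical helper, not an OneFS utility) that reads ‘isi_drivenum’ output from stdin, counts HDDs per sled, and distinguishes an A3100 (four HDDs per sled) from an A310 (three per sled):

import sys
from collections import Counter

# Compute-module SSD bays appear as plain numbers (e.g. '1', '2'); HDD bays
# are a sled letter plus a slot number (e.g. 'A0' through 'E3').
sleds = Counter()
for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 2 and fields[0] == "Bay" and fields[1][0].isalpha():
        sleds[fields[1][0]] += 1

for sled, count in sorted(sleds.items()):
    print(f"Sled {sled}: {count} HDDs")

print("A3100-style layout" if max(sleds.values(), default=0) == 4 else "A310-style layout")

For example, piping the output above into this hypothetical ‘sled_count.py’ script would report four HDDs in each of sleds A through E.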

With regard to the nodes’ internal drives, the boot disk reservation size has increased to 18GB on these new platforms from 8GB on the previous generation. Plus partition sizes have also been expanded on these new platforms in OneFS 9.11, as follows:

Partition A310 / A3100 A300 / A3000
hw 1GB 500MB
journal backup 8197MB 8GB
kerneldump 5GB 2GB
keystore 64MB 64MB
root 4GB 2GB
var 4GB 2GB
var-crash 7GB 3GB

The PowerScale A310 and A3100 platforms are available in the following networking configurations, with a 10/25Gb Ethernet front-end and either Ethernet or Infiniband back-end:

Model A310 A3100
Front-end network 10/25 GigE 10/25 GigE
Back-end network 10/25 GigE, Infiniband 10/25 GigE, Infiniband

These NICs and their PCI bus addresses can be determined via the ’pciconf’ CLI command, as follows:

# pciconf -l | grep mlx

mlx5_core0@pci0:16:0:0: class=0x020000 card=0x002015b3 chip=0x101f15b3 rev=0x00 hdr=0x00

mlx5_core1@pci0:16:0:1: class=0x020000 card=0x002015b3 chip=0x101f15b3 rev=0x00 hdr=0x00

mlx5_core2@pci0:65:0:0: class=0x020000 card=0x002015b3 chip=0x101f15b3 rev=0x00 hdr=0x00

mlx5_core3@pci0:65:0:1: class=0x020000 card=0x002015b3 chip=0x101f15b3 rev=0x00 hdr=0x00

Similarly, the NIC hardware details and firmware versions can be viewed as follows:

# mlxfwmanager

Querying Mellanox devices firmware ...

Device #1:
----------
  Device Type:      ConnectX6LX
  Part Number:      06XJXK_0R5WK9_Ax
  Description:      NVIDIA ConnectX-6 LX Dual Port 25 GbE SFP Network Adapter
  PSID:             DEL0000000031
  PCI Device Name:  pci0:16:0:0
  Base GUID:        58a2e10300e22a24
  Base MAC:         58a2e1e22a24
  Versions:         Current        Available
     FW             26.36.1010     N/A
     PXE            3.6.0901       N/A
     UEFI           14.29.0014     N/A

  Status:           No matching image found

Device #2:
----------
  Device Type:      ConnectX6LX
  Part Number:      06XJXK_0R5WK9_Ax
  Description:      NVIDIA ConnectX-6 LX Dual Port 25 GbE SFP Network Adapter
  PSID:             DEL0000000031
  PCI Device Name:  pci0:65:0:0
  Base GUID:        58a2e10300e22bf4
  Base MAC:         58a2e1e22bf4
  Versions:         Current        Available
     FW             26.36.1010     N/A
     PXE            3.6.0901       N/A
     UEFI           14.29.0014     N/A

  Status:           No matching image found

Compared to their A30x predecessors, the A310 and A3100 see a number of generational hardware upgrades. These include a shift to DDR5 memory, a Sapphire Rapids CPU, and an up-spec’d power supply.

In terms of performance, the new A31x nodes provide a significant increase over the prior generation, as shown in the following streaming read and write comparison chart for the A3100 and A3000:

OneFS node compatibility provides the ability to have similar node types and generations within the same node pool. In OneFS 9.11 and later, compatibility between the A310 and A3100 nodes and the previous generation platform is supported. Specifically, this node pool compatibility includes:

OneFS Node Pool Compatibility Gen6 MLK New
A200 A300/L A310/L
A2000 A3000/L A3100/L
H400 A300 A310

Node pool compatibility checking includes drive capacities, for both data HDDs and SSD cache. This pool compatibility permits the addition of A310 node pairs to an existing node pool comprising four or more A300s if desired, rather than creating a new A310 node pool. Plus, there’s a similar compatibility for A3100/A3000 nodes.

Note that, while the A31x is node pool compatible with the A30x, the A31x nodes are effectively throttled to match the performance envelope of the A30x nodes. Regarding storage efficiency, support for OneFS inline data reduction on mixed A-series diskpools is as follows:

Gen6 MLK New Data Reduction Enabled
A200 A300/L A310/L False
A2000 A3000/L A3100/L False
H400 A300 A310 False
A200 A310 False
A300 A310 True
H400 A310 False
A2000 A3100 False
A3000 A3100 True

To summarize, in combination with OneFS 9.11, these new PowerScale archive A31x platforms deliver a compelling value proposition in terms of efficiency, density, flexibility, scalability, and affordability.

PowerScale H710 and H7100 Platforms

In this article, we’ll take a more in-depth look at the new PowerScale H710 and H7100 hardware platforms that were released last week. Here’s where these new systems sit in the current hardware hierarchy:

As such, the PowerScale H710 and H7100 are the workhorses of the PowerScale portfolio. Built for general-purpose workloads, the H71x platforms offer flexibility and scalability for a broad range of applications including home directories, file shares, generative AI, editing and post-production media workflows, and medical PACS and genomic data with efficient tiering.

Representing the mid-tier, the H710 and H7100 both utilize a single-socket Xeon processor with 384GB of memory and fifteen (H710) or twenty (H7100) hard drives per node, plus SSDs for metadata/caching – and with four nodes residing within a 4RU chassis. From an initial 4 node (1 chassis) starting point, H710 and H7100 clusters can be easily and non-disruptively scaled two nodes at a time up to a maximum of 252 nodes (63 chassis) per cluster.

The H71x modular platform is based on Dell’s ‘Infinity’ chassis. Each node’s compute module contains a single 16-core Intel Sapphire Rapids CPU running at 2.0 GHz with 30MB of cache, plus 384GB of DDR5 DRAM. Front-end networking options include 10/25/40/100 GbE, with either 100Gb Ethernet or Infiniband selectable for the back-end network.

As such, the new H71x core hardware specifications are as follows:

Hardware Class PowerScale H-Series (Hybrid)
Model H710 H7100
OS version Requires OneFS 9.11 or above, and NFP 13.1 or greater. BIOS based on Dell’s PowerBIOS. Requires OneFS 9.11 or above, and NFP 13.1 or greater. BIOS based on Dell’s PowerBIOS.
Platform Four nodes per 4RU chassis; upgradeable per pair; node-compatible with prior gens. Four nodes per 4RU chassis; upgradeable per pair; node-compatible with prior gens.
CPU 16 Cores @ 2.0GHz, 30MB Cache 16 Cores @ 2.0GHz, 30MB Cache
Memory 384GB DDR5 DRAM 384GB DDR5 DRAM
Journal M.2: 480GB NVMe with 3-cell battery backup (BBU) M.2: 480GB NVMe with 3-cell battery backup (BBU)
Depth Standard 36.7 inch chassis Deep 42.2 inch chassis
Cluster size Max of 63 chassis (252 nodes) per cluster. Max of 63 chassis (252 nodes) per cluster.
Storage Drives 60 per chassis (15 per node) 80 per chassis (20 per node)
HDD capacities 2TB, 4TB, 8TB, 12TB, 16TB, 20TB, 24TB 12TB, 16TB, 20TB, 24TB
SSD (cache) capacities 0.8TB, 1.6TB, 3.2TB, 7.68TB 0.8TB, 1.6TB, 3.2TB, 7.68TB
Max raw capacity 1.4PB per chassis 1.9PB per chassis
Front-end network 10/25/40/100 GigE 10/25/40/100 GigE
Back-end network 100 GigE or Infiniband 100 GigE or Infiniband

These node hardware attributes, plus a variety of additional info and environmentals, can be easily viewed from the OneFS CLI via the ‘isi_hw_status’ command. For example, from an H710:

# isi_hw_status
  SerNo: CF25J243000005
 Config: 1WVXW
ChsSerN:
ChsSlot: 2
FamCode: H
ChsCode: 4U
GenCode: 10
PrfCode: 7
   Tier: 3
  Class: storage
 Series: n/a
Product: H710-4U-Single-192GB-1x1GE-2x100GE QSFP28-240TB-3277GB SSD-SED
  HWGen: PSI
Chassis: INFINITY (Infinity Chassis)
    CPU: GenuineIntel (2.00GHz, stepping 0x000806f8)
   PROC: Single-proc, 16-HT-core
    RAM: 206152138752 Bytes
   Mobo: INFINITYPIFANO (Custom EMC Motherboard)
  NVRam: INFINITY (Infinity Memory Journal) (8192MB card) (size 8589934592B)
 DskCtl: LSI3808 (LSI 3808 SAS Controller) (8 ports)
 DskExp: LSISAS35X36I (LSI SAS35x36 SAS Expander - Infinity)
PwrSupl: Slot1-PS0 (type=ARTESYN, fw=02.30)
PwrSupl: Slot2-PS1 (type=ARTESYN, fw=02.30)
  NetIF: bge0,lagg0,mce0,mce1,mce2,mce3
 BEType: 100GigE
 FEType: 100GigE
 LCDver: IsiVFD2 (Isilon VFD V2)
 Midpln: NONE (No Midplane Support)
Power Supplies OK
Power Supply Slot1-PS0 good
Power Supply Slot2-PS1 good
CPU Operation (raw 0x882D0800)  = Normal
CPU Speed Limit                 = 100.00%
Fan0_Speed                      = 12000.000
Fan1_Speed                      = 11880.000
Slot1-PS0_In_Voltage            = 208.000
Slot2-PS1_In_Voltage            = 207.000
SP_CMD_Vin                      = 12.100
CMOS_Voltage                    = 3.080
Slot1-PS0_Input_Power           = 280.000
Slot2-PS1_Input_Power           = 270.000
Pwr_Consumption                 = 560.000
SLIC0_Temp                      = na
SLIC1_Temp                      = na
DIMM_Bank0                      = 40.000
DIMM_Bank1                      = 41.000
CPU0_Temp                       = -43.000
SP_Temp0                        = 37.000
MP_Temp0                        = na
MP_Temp1                        = 29.000
Embed_IO_Temp0                  = 48.000
Hottest_SAS_Drv                 = -26.000
Ambient_Temp                    = 29.000
Slot1-PS0_Temp0                 = 58.000
Slot1-PS0_Temp1                 = 38.000
Slot2-PS1_Temp0                 = 55.000
Slot2-PS1_Temp1                 = 35.000
Battery0_Temp                   = 36.000
Drive_IO0_Temp                  = 42.000

Note that the H710 and H7100 are only available in a 384GB memory configuration.

Starting at the business end of the chassis, the articulating front panel display allows the user to join the nodes to a cluster, etc:

The chassis front panel includes an LCD display with 9 cap-touch back-lit buttons. Four LED Light bar segments, 1 per node, illuminate blue to indicate normal operation or yellow to alert of a node fault. The front panel display is hinge mounted so it can be moved clear of the drive sleds, with a ribbon cable running down the length of the chassis to connect the display to the midplane.

As with all PowerScale nodes, the front panel display provides some useful information for the four nodes, such as the ‘outstanding alerts’ status shown above, etc.

For storage, each of the four nodes within a PowerScale H710 or H7100 chassis has five associated drive containers, or sleds. These sleds occupy bays in the front of each chassis, with a node’s drive sleds stacked vertically:

Nodes are numbered 1 through 4, left to right looking at the front of the chassis, while the drive sleds are labeled A through E, with sleds A occupying the top row of the chassis.

The drive sled is the tray which slides into the front of the chassis. Within each sled, the 3.5” SAS hard drives it contains are numbered sequentially starting from drive zero, which is the HDD adjacent the airdam.

The H7100 uses a longer 42.2 inch chassis, allowing it to accommodate four HDDs per sled compared to three drives for the H710, which is 36.7 inches deep. This also means that the H710 can reside in a regular 17” data center rack or cabinet, whereas the H7100 requires a deep rack, such as the Dell Titan cabinet.

The H710 and H7100 platforms support a range of HDD capacities, currently including 2TB, 4TB, 8TB, 12TB, 16TB, 20TB, and 24TB, in both regular ISE (instant secure erase) and self-encrypting drive (SED) formats.

Each drive sled has a white ‘not safe to remove’ LED on its front top left, as well as a blue power/activity LED, and an amber fault LED.

The compute modules for each node are housed in the rear of the chassis, and contain CPU, memory, networking, and SSDs, as well as power supplies. Nodes 1 & 2 are a node pair, as are nodes 3 & 4. Each node-pair shares a mirrored journal and two power supplies:

Here’s the detail of an individual compute module, which contains a multi-core Sapphire Rapids CPU, memory, an M.2 flash journal, up to two SSDs for L3 cache, six DIMM channels, front-end 40/100 or 10/25 Gb Ethernet, back-end 40/100 or 10/25 Gb Ethernet or Infiniband, an Ethernet management interface, plus power supply and cooling fans:

Of particular note is the ‘journal active’ LED, which is displayed as a white ‘hand icon’. When this is illuminated, it indicates that the mirrored journal is actively vaulting.

Note that a node’s compute module should not be removed from the chassis while this LED is lit!

On the front of each chassis is an LCD front panel control with back-lit buttons and 4 LED Light Bar Segments – 1 per Node. These LEDs typically display blue for normal operation or yellow to indicate a node fault. This LCD display is hinged so it can be swung clear of the drive sleds for non-disruptive HDD replacement, etc.

Details can be queried with OneFS CLI drive utilities such as ‘isi_radish’ and ‘isi_drivenum’. For example, the command output from an H710 node:

tme-1# isi_drivenum

Bay  1   Unit 6      Lnum 15    Active      SN:7E30A02K0F43     /dev/da1
Bay  2   Unit N/A    Lnum N/A   N/A         SN:N/A              N/A
Bay  A0   Unit 1      Lnum 12    Active      SN:ZRS1HP4G         /dev/da4
Bay  A1   Unit 17     Lnum 13    Active      SN:ZR7105GY         /dev/da3
Bay  A2   Unit 16     Lnum 14    Active      SN:ZRS1HNZG         /dev/da2
Bay  B0   Unit 24     Lnum 9     Active      SN:ZRS1PHFG         /dev/da7
Bay  B1   Unit 23     Lnum 10    Active      SN:ZRS1HEA1         /dev/da6
Bay  B2   Unit 22     Lnum 11    Active      SN:ZRS1PHFX         /dev/da5
Bay  C0   Unit 30     Lnum 6     Active      SN:ZR5EFV0D         /dev/da10
Bay  C1   Unit 29     Lnum 7     Active      SN:ZR5FE3Z8         /dev/da9
Bay  C2   Unit 28     Lnum 8     Active      SN:ZR5FE311         /dev/da8
Bay  D0   Unit 36     Lnum 3     Active      SN:ZR5FE3DA         /dev/da13
Bay  D1   Unit 35     Lnum 4     Active      SN:ZRS1PHEF         /dev/da12
Bay  D2   Unit 34     Lnum 5     Active      SN:ZRS1HP6T         /dev/da11
Bay  E0   Unit 42     Lnum 0     Active      SN:ZRS1PHEM         /dev/da16
Bay  E1   Unit 41     Lnum 1     Active      SN:ZRS1PHDV         /dev/da15
Bay  E2   Unit 40     Lnum 2     Active      SN:ZRS1HPAT         /dev/da14

The ‘bay’ locations indicate the drive location in the chassis. ‘Bay 1’ references the cache/metadata SSD, located within the node’s compute module, whereas the HDDs are referenced by their respective sled (A to E) and drive slot (0 to 2) – for example, drive ‘E1’ below:

The H710 and H7100 platforms are available in the following networking configurations, with a 10/25/40/100Gb ethernet front-end and 10/25/40/100Gb ethernet or 100Gb Infiniband back-end:

Model H710 H7100
Front-end network 10/25/40/100 GigE 10/25/40/100 GigE
Back-end network 10/25/40/100 GigE, Infiniband 10/25/40/100 GigE, Infiniband

These NICs and their PCI bus addresses can be determined via the ’pciconf’ CLI command, as follows:

# pciconf -l | grep mlx

mlx4_core0@pci0:59:0:0: class=0x020000 card=0x028815b3 chip=0x100315b3 rev=0x00 hdr=0x00

mlx5_core0@pci0:216:0:0:        class=0x020000 card=0x001615b3 chip=0x101515b3 rev=0x00 hdr=0x00

mlx5_core1@pci0:216:0:1:        class=0x020000 card=0x001615b3 chip=0x101515b3 rev=0x00 hdr=0x00

Similarly, the NIC hardware details and firmware versions can be viewed as follows:

# mlxfwmanager
Querying Mellanox devices firmware ...

Device #1:
----------
  Device Type:      ConnectX3
  Part Number:      105-001-013-00_Ax
  Description:      Mellanox 40GbE/56G FDR VPI card
  PSID:             EMC0000000004
  PCI Device Name:  pci0:59:0:0
  Port1 MAC:        1c34dae19e31
  Port2 MAC:        1c34dae19e32
  Versions:         Current        Available
     FW             2.42.5000      N/A
     PXE            3.4.0752       N/A
  Status:           No matching image found

Device #2:
----------
  Device Type:      ConnectX4LX
  Part Number:      020NJD_0MRT0D_Ax
  Description:      Mellanox 25GBE 2P ConnectX-4 Lx Adapter
  PSID:             DEL2420110034
  PCI Device Name:  pci0:216:0:0
  Base MAC:         1c34da4492e8
  Versions:         Current        Available
     FW             14.32.2004     N/A
     PXE            3.6.0502       N/A
     UEFI           14.25.0018     N/A
  Status:           No matching image found

Compared with their H70x predecessors, the H710 and H7100 see a number of hardware performance upgrades. These include a move to DDR5 memory, Sapphire Rapids CPU, and an upgraded power supply.

In terms of performance, the new H71x nodes provide a solid improvement over the prior generation. For example, streaming read and writes on both the H7100 and H7000:

OneFS node compatibility provides the ability to have similar node types and generations within the same node pool. In OneFS 9.11 and later, compatibility between the H710 and H7100 nodes and the previous generation platform is supported. Specifically, this node pool compatibility includes:

PowerScale H-series Node Pool Compatibility Gen6 MLK New
H500 H700 H710
H5600 H7000 H7100
H600

Node pool compatibility checking includes drive capacities for both data HDDs and SSD cache. This pool compatibility permits the addition of H710 node pairs to an existing node pool comprising four or more H700s, if desired, rather than creating an entirely new 4-node H710 node pool. Plus, there’s a similar compatibility between the H7100 and H7000 nodes.

Note that, while the H71x is node pool compatible with the H70x, it does require a performance compromise, since the H71x nodes are effectively throttled to match the performance envelope of the H70x nodes.

Apropos storage efficiency, OneFS inline data reduction support on mixed H-series diskpools is as follows:

Gen6 MLK New Data Reduction Enabled
H500 H700 H710 False
H500 H710 False
H700 H710 True
H5600 H7000 H7100 True
H5600 H7100 True
H7000 H7100 True

In the next article in this series, we’ll turn our attention to the PowerScale A310 and A3100 platforms.

PowerScale H710, H7100, A310, and A3100 Platform Nodes

Hot on the heels of the recent OneFS 9.11 release comes the launch of four new PowerScale hybrid and archive series hardware offerings. Between them, these new H710, H7100, A310, and A3100 spinning-disk-based nodes add significant blended capacity to the PowerScale stable.

Built atop the latest generation of Dell’s PowerScale chassis-based architecture, these new H-series and A-series platforms each boast a range of HDD capacities, paired with SSD for cache, a Sapphire Rapids CPU, a generous helping of DDR5 memory, and ample network connectivity – with four paired nodes all housed within a modular, power-efficient 4RU form factor chassis.

Here’s where these new platforms sit in the current PowerScale hardware hierarchy:

These new platforms will replace the PowerScale H700, H7000, A300, and A3000 systems, and further extend PowerScale’s price-density envelope.

The PowerScale H710, H7100, A310, and A3100 nodes offer an evolution from previous generations, while also focusing on environmental sustainability, reducing power consumption and carbon footprint. Housed in a 4RU chassis with balanced airflow and enhanced cooling, these new platforms offer significantly greater density than their predecessors – plus are ready to support Seagate’s 32TB HAMR HDDs when those drives become available later this year.

These new nodes all require OneFS 9.11 (or later) and also include in-line compression and deduplication by default – further increasing their capacity headroom, effective density, and power efficiency. Plus, incorporating Intel’s 4th gen Xeon Sapphire Rapids CPUs and the latest DDR5 DRAM offers greater processing horsepower plus an improved performance per watt.

Scalability-wise, both platforms require a minimum of four nodes (1 chassis) to form a cluster (or node pool). From here, they can be simply and non-disruptively scaled two nodes at a time up to a maximum of 252 nodes (63 chassis) per cluster. The basic specs for these new platforms are as follows:

Hardware Class PowerScale H-Series (Hybrid) PowerScale A-series (Archive)
Model H710 H7100 A310 A3100
OneFS version Requires OneFS 9.11 or above. Requires OneFS 9.11 or above.
CPU 16 Cores @ 2.0GHz, 30MB Cache 8 Cores @ 1.8GHz, 22.5MB Cache
Memory 384GB DDR5 DRAM 96GB DDR5 DRAM
Platform Four nodes per 4RU chassis; upgradeable per pair; node-compatible with prior generations. Four nodes per 4RU chassis; upgradeable per pair; node-compatible with prior generations.
Depth Standard 36.7 inch chassis Deep 42.2 inch chassis Standard 36.7 inch chassis Deep 42.2 inch chassis
Max cluster size Maximum of 63 chassis (252 nodes) per cluster. Maximum of 63 chassis (252 nodes) per cluster.
Storage Drives 60 per chassis (15 per node) 80 per chassis (20 per node) 60 per chassis (15 per node) 80 per chassis (20 per node)
HDD capacities 2TB, 4TB, 8TB, 12TB, 16TB, 20TB, 24TB 2TB, 4TB, 8TB, 12TB, 16TB, 20TB, 24TB
SSD (cache) capacities 0.8TB, 1.6TB, 3.2TB, 7.68TB 0.8TB, 1.6TB, 3.2TB, 7.68TB
Max raw capacity 1.4PB per chassis 1.9PB per chassis 1.4PB per chassis 1.9PB per chassis
Front-end network 10/25/40/100 GigE 10/25 GigE
Back-end network 10/25/40/100 GigE, Infiniband 10/25 GigE, Infiniband

In concert with the generational CPU and DRAM upgrades in the new PowerScale chassis platforms, OneFS 9.11 software advancements also help deliver a nice performance bump for the H71x and A31x hybrid platforms – particularly for sequential reads and writes.

The PowerScale H-series platforms are designed for general-purpose workloads, offering flexibility and scalability for a wide range of applications including file shares and home directories, editing and post-production media workflows, generative AI, and PACS and genomic data with efficient tiering.

In contrast, the A-series platforms are designed for cooler, infrequently accessed data use cases. These include active archive workflows for the A310, such as regulatory compliance data, medical imaging archives, financial records, and legal documents. And deep archive/cold storage for the A3100 platform, including surveillance video archives, backup, and DR repositories.

Over the next couple of articles, we’ll dig into the technical details of each of the new platforms. But, in summary, when combined with OneFS 9.11, the new PowerScale hybrid H71x and A31x platforms quite simply deliver on efficiency, flexibility, performance, scalability, and affordability!

OneFS SyncIQ Temporary Directory Hashing

SyncIQ receives an update in OneFS 9.11 with the default enablement of its Temporary Directory Hashing feature, which can help improve replication directory delete performance on target clusters.

But first, some background. For several years now, OneFS has included functionality, commonly referred to as temporary directory hashing, which addresses some of the challenges that SyncIQ can potentially encounter with large incremental replication tasks. Specifically, if a cluster contains an extra-wide directory, with many different replication threads trying to write to it simultaneously, OneFS file system performance can be impacted due to contention over lock requests on that very wide directory.

When SyncIQ performs an incremental transfer, it frequently uses a temporary working directory in cases such as where a file has been created but its parent doesn’t exist yet due to LIN-order processing – or files being removed and their parents not being available, etc. SyncIQ uses this temporary working directory as a place to stash these files until it can put them in the correct location. In some incremental replication workflows, this could result in an extra-wide temporary working directory, potentially containing millions or billions of directory entries. When there are hundreds of SyncIQ workers all trying to link and unlink files from that same directory, performance can start to become impacted.

To address this, temporary directory hashing introduces support for subdirectories within a large temp working directory, based on a directory cookie. This allows SyncIQ to split up that monolithic directory into a number of smaller ones, so workers don’t contend with all of the other workers when they’re trying to link and unlink files within their temporary directories. They only contend with the other workers in their particular subdirectory, which can provide a significant performance boost in workflows with lots of concurrent access.
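
Conceptually, the approach is similar to the following simplified Python sketch, which spreads entries across a fixed number of hashed subdirectories so that concurrent workers contend on different directory locks. This is purely illustrative – the bucket count, path, and hash choice are assumptions, not SyncIQ internals:

import os
import hashlib

NUM_BUCKETS = 64                       # illustrative bucket count
TMP_ROOT = "/ifs/.tmp-working-dir"     # hypothetical temp working directory

def hashed_tmp_path(name: str) -> str:
    # Map an entry name to one of NUM_BUCKETS subdirectories under TMP_ROOT
    digest = hashlib.sha1(name.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return os.path.join(TMP_ROOT, f"bucket-{bucket:02d}", name)

# Workers linking and unlinking entries now take locks on their own bucket
# directory, rather than all contending on one monolithic temp directory.
print(hashed_tmp_path("file-000123"))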

In OneFS 9.11, temporary directory hashing functionality now becomes the default configuration and behavior.

SyncIQ’s temporary directory hashing functionality has actually existed within OneFS since 8.2, but prior to OneFS 9.11 it had to be manually enabled on a per-policy basis for any desired replication workflows.

When installing or upgrading a cluster to OneFS 9.11 or later, temporary directory hashing becomes the default configuration, so any new SyncIQ policies will automatically have temporary directory hashing enabled. However, this default change will not be applied retroactively to any legacy policies that were configured prior to the OneFS 9.11 upgrade.

That said, any pre-existing policies can be easily configured to use temporary directory hashing from the SyncIQ source cluster with the following CLI syntax:

# echo '{"enable_hash_tmpdir": true}' | isi_papi_tool PUT /7/sync/policies/<policy name>

For example:

# echo '{"enable_hash_tmpdir": true}' | isi_papi_tool PUT /7/sync/policies/remote_zone1

Type request body, press enter, and CTRL-D:

204

Content-type: text/plain

Allow: GET, PUT, DELETE, HEAD

The configuration can be verified as ‘enabled’ with the following command:

# isi_papi_tool GET /7/sync/policies/remote_zone1 | grep "enable_hash_tmpdir"

"enable_hash_tmpdir" : true,

Under the hood, temporary directory hashing places any directories within a SyncIQ policy which need to be deleted into subdirectories under the ./tmp-working-dir/ directory, instead of at the root of tmp-working-dir. This lowers contention on the root tmp-working-dir by moving exclusive locking requests to those subdirectories.

Performance-wise, the benefit and efficiency of SyncIQ temporary directory hashing will vary by cluster composition, environment, and workflow. However, environments with thousands of directory deletions per policy run have seen delete performance improvements of between 2x and 20x. To determine whether this feature is proving beneficial for a specific policy, view the SyncIQ job reports and compare the ‘STF_PHASE_CT_DIR_DELS’ job phase start and end times. This will indicate how much time those jobs have spent in this temporary directory delete phase, and can be accomplished from the replication source cluster with the following CLI syntax:

# isi sync reports view <policy_name> <job_id> | grep -C 3 "CT_DIR_DELS"

For example:

# isi sync reports view remote_zone1 31 | grep -C 3 "CT_DIR_DELS"

                            Phase: STF_PHASE_CT_DIR_DELS

                       Start Time: 2025-06-06T16:12:39

                         End Time: 2025-06-06T16:10:47
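
The elapsed time for the phase can then be calculated by hand, or with a couple of lines of Python – a trivial sketch using hypothetical start and end timestamps:

from datetime import datetime

start = datetime.fromisoformat("2025-06-06T16:10:47")
end = datetime.fromisoformat("2025-06-06T16:12:39")
print(end - start)   # 0:01:52 spent in the STF_PHASE_CT_DIR_DELS phase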

Note that for some SyncIQ policies which routinely move wide and shallow directories from one directory to another, temporary directory hashing may actually adversely impact those moves. In such instances, the feature can be disabled for each individual SyncIQ replication policy as follows:

# echo '{"enable_hash_tmpdir": false}' | isi_papi_tool PUT /7/sync/policies/<policy name>

Note that the above command should be run on the replication source cluster, using the root user authenticating to the PAPI service, replacing <policy name> with the appropriate value.

For example:

# echo '{"enable_hash_tmpdir": false}' | isi_papi_tool PUT /7/sync/policies/remote_zone1

Type request body, press enter, and CTRL-D:

204

Content-type: text/plain

Allow: GET, PUT, DELETE, HEAD

OneFS will then have this feature disabled for all subsequent runs of that policy.

Similarly, the configuration can be verified as follows:

# isi_papi_tool GET /7/sync/policies/remote_zone1 | grep "enable_hash_tmpdir"

"enable_hash_tmpdir" : false,

For clusters running OneFS 8.2 through OneFS 9.10, where SyncIQ temporary directory deletion is disabled by default, it can be activated on a per-policy basis as follows:
# echo '{"enable_hash_tmpdir": true}' | isi_papi_tool PUT /7/sync/policies/<policy name>

As such, the next time SyncIQ runs the specified policy, temporary directory hashing will be enabled for this and future job runs.

So, in summary, SyncIQ temporary directory hashing can improve directory deletion performance for many policies with wide directories. While in OneFS 9.10 and earlier it was manually configurable on an individual per-policy basis, now, in OneFS 9.11 and later, temporary directory hashing is enabled by default on all new SyncIQ policies.

OneFS S3 Conditional Writes and Cluster Status Reporting API

In addition to the core file protocols, PowerScale clusters also support the ubiquitous AWS S3 protocol. As such, applications have multiple access options to the same underlying dataset, with consistent semantics across both file and object.

Also, since OneFS objects and buckets are essentially files and directories within the /ifs filesystem, the same OneFS data services, such as Snapshots, SyncIQ, WORM, etc, are all seamlessly integrated. This makes it possible to run hybrid and cloud-native workloads, which use S3-compatible backend storage – for example cloud backup & archive software, modern apps, analytics flows, IoT workloads, etc. – and to run these on-prem, alongside and coexisting with traditional file-based workflows.

The recent OneFS 9.11 release further enhances the PowerScale S3 protocol implementation with two new features: The addition of conditional write support and API-based cluster status reporting.

First, the new S3 conditional write support prevents the overwriting of existing S3 objects with identical key names. It does this via pre-condition arguments to the S3 ‘PutObject’ and ‘CompleteMultipartUpload’ requests with the addition of an ‘If-None-Match’ HTTP header. As such, if the condition is not met, the S3 operation(s) fail. Note, however, that OneFS does not currently support the ‘If-Match’ HTTP header, which checks the Etag value. More information about S3 conditional write is provided in the following AWS documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-requests.html
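
By way of illustration, here’s a rough boto3 sketch of such a conditional write against a OneFS S3 bucket. It assumes a recent boto3/botocore release that exposes the ‘IfNoneMatch’ argument, and the endpoint address, credentials, bucket, and key are all placeholder values:

import boto3
from botocore.exceptions import ClientError

# Placeholder endpoint and credentials for a OneFS S3 access zone
s3 = boto3.client(
    "s3",
    endpoint_url="http://10.10.20.30:9020",
    aws_access_key_id="<access_key>",
    aws_secret_access_key="<secret_key>",
)

try:
    # If-None-Match: * -- only succeeds if the key does not already exist
    s3.put_object(
        Bucket="mybucket",
        Key="reports/2025-06.csv",
        Body=b"col1,col2\n",
        IfNoneMatch="*",
    )
    print("Object written")
except ClientError as err:
    # The request fails when the precondition is not met
    print("Conditional write failed:", err.response["Error"]["Code"])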

The second new piece of S3 functionality in OneFS 9.11 is API-based cluster status reporting. Increasingly, next-gen applications need a reliable method to decide where to store their backups and large data blobs across a variety of storage technologies. As such, a consistent API format, including cluster health status reporting, is needed to answer general questions about any S3 endpoint that may be under consideration as a potential target – particularly for applications without access to the management network. Providing the cluster status API facilitates intelligent decision making, such as how best to balance load and capacity across multiple PowerScale clusters. Additionally, the cluster status data can also help with performance analysis, as well as diagnosing hardware issues. For example, if an endpoint has had zero successful objects delivered to it in the last hour, this status object will be the first thing that gets queried to see if there is a visible issue, or if applications are ‘routing around’ it by intentionally using other resources.

The API uses an S3 endpoint with the following URL format:

s3://cluster-status/s3_cluster_status_v1

This mimics the GET object operation in the S3 service and is predicated on a virtual bucket and object. As such, HEAD requests on this virtual bucket and object are valid, as is a GET request on the virtual object to read the cluster status data. All other S3 calls to this virtual bucket and object are prohibited, and a 405 HTTP error code is returned.

Applications and users can use the S3 SDK, or another S3-conversant utility such as ‘s5cmd’, to retrieve the cluster status object, which involves the three valid S3 requests listed below:

  • HEAD bucket
  • HEAD object
  • GET object

Where the ‘GET object’ request returns the cluster status details. For example, using the ‘s5cmd’ utility from a Windows client:

C:\s5cmd_2.3.0> .\s5cmd.exe --endpoint-url=http://10.10.20.30:9020 cat s3://cluster-status/s3_cluster_status_v1

{
   "15min_avg_read_bw_mbs" : "0.12",
   "15min_avg_write_bw_mbs" : "0.04",
   "capacity_status_age_date" : "2025/06/04T07:43:02",
   "health" : "all_nodes_operational",
   "health_percentage" : "100",
   "health_status_age_date" : "2025/06/04T07:43:02",
   "mgmt_name" : "10.10.20.30:8080",
   "net_state" : "full",
   "net_state_age_date" : "2025/06/04T07:43:02",
   "net_state_calculation" : {
      "available_percentage" : "99",
      "down_bw_mbs" : "0",
      "total_bw_mbs" : "3576",
      "used_bw_mbs" : "0.01"
   },
   "total_capacity_tb" : "0.06",
   "total_capacity_tib" : "0.05",
   "total_free_space_tb" : "0.06",
   "total_free_space_tib" : "0.05"
}
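
The status object can also be retrieved programmatically. Below is a hedged boto3 sketch (endpoint address and credentials are placeholders) that fetches and parses the virtual object – for instance, to compare free space across several clusters before choosing a backup target:

import json
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://10.10.20.30:9020",     # placeholder OneFS S3 endpoint
    aws_access_key_id="<access_key>",
    aws_secret_access_key="<secret_key>",
)

# GET the virtual cluster status object
resp = s3.get_object(Bucket="cluster-status", Key="s3_cluster_status_v1")
status = json.loads(resp["Body"].read())

print("Health:          ", status["health"])
print("Net state:       ", status["net_state"])
print("Free space (TB): ", status["total_free_space_tb"])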

The response format is JSON, and authenticated S3 users can access these APIs and download the cluster status object. The table below includes details of each response field:

Requested Field Description
mgmt_name Management interface name of this cluster.
total_capacity_tb Cluster’s total “current” capacity in base 10 terabytes.
total_capacity_tib Cluster’s total “current” capacity in base 2 terabytes (tebibytes).
total_free_space_tb Cluster’s total “current” free space in base 10 terabytes.
total_free_space_tib Cluster’s total “current” free space in base 2 terabytes (tebibytes).
capacity_status_age_date Number of seconds between the time of issuance and the proper calculation of capacity status.
health Calculated status based on per-node health status: either all_nodes_operational, some_nodes_nonoperational, or non_operational.
health_percentage Vendor-specific number from 0-100%, where the vendor’s judgement should be used as to what level of the system’s normal load it can take.
health_status_age_date Number of seconds between the time of issuance and the proper calculation of health status.
15min_avg_read_bw_mbs Read bandwidth in use, measured in megabytes per second, averaged over a 15-minute period.
15min_avg_write_bw_mbs Write bandwidth in use, measured in megabytes per second, averaged over a 15-minute period.
net_state Networking status of the cluster. Divided into “Full”, “Half”, “Critical”, and “Unknown”.
net_state_age_date Number of seconds between the time of issuance and the proper calculation of network status.

These fields can be grouped into the following core categories:

Category Description
Capacity Reports the total capacity and available capacity in both terabytes and tebibytes.
Health Includes the cluster health, node health and network health.
Management ‘Management name’ references the out-of-band management interface that admins can use to configure the cluster.
Networking Network status takes both the interfaces up/down status and the read write bandwidth on each interface into consideration.
Performance Includes the read and write bandwidth.

Under the hood, the high-level cluster reporting API operational workflow can be categorized as follows:

When an S3 client sends a get cluster status request, the OneFS S3 service retrieves the data from the isi_status_d and Flexnet services. As part of this transaction, the calculations are performed and the result is returned to the S3 client in JSON format. To speed up retrieval, a memory cache retains the data with a configured expiry time.
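
The caching behavior is conceptually similar to the following simple time-to-live cache sketch – illustrative only, not OneFS source – where the expiry corresponds to the ‘S3ClusterStatusCacheExpirationInSec’ setting described below:

import time

class TTLCache:
    # Illustrative time-based cache, akin to the default 300 second status cache
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.value = None
        self.fetched_at = 0.0

    def get(self, fetch_fn):
        # Re-fetch only when the cached copy has expired
        if self.value is None or (time.time() - self.fetched_at) > self.ttl:
            self.value = fetch_fn()
            self.fetched_at = time.time()
        return self.value

# Usage: cache.get(query_cluster_status)  # query_cluster_status is a hypothetical fetch function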

Configuration-wise, the addition of the cluster status API in OneFS 9.11 introduces the following new gconfig parameters:

Name Default Value Description
S3ClusterStatusBucketName “cluster-status” Name of the bucket used to access cluster status.
S3ClusterStatusCacheExpirationInSec 300 Expiration time in seconds for the in-memory cluster status cache. Once reached, the next request for cluster status will result in a fresh fetch of the data.
S3ClusterStatusEnabled 0 Boolean parameter controlling whether the feature is enabled or not (0 = disabled; 1 = enabled).
S3ClusterStatusObjectName “s3_cluster_status_v1” Name of the object used to access cluster status.

These parameter values can be viewed or configured using the ‘isi_gconfig’ CLI utility. For example:

# isi_gconfig | grep S3Cluster

registry.Services.lwio.Parameters.Drivers.s3.S3ClusterStatusBucketName (char*) = cluster-status

registry.Services.lwio.Parameters.Drivers.s3.S3ClusterStatusCacheExpirationInSec (uint32) = 300

registry.Services.lwio.Parameters.Drivers.s3.S3ClusterStatusEnabled (uint32) = 0

registry.Services.lwio.Parameters.Drivers.s3.S3ClusterStatusObjectName (char*) = s3_cluster_status_v1

The following gconfig CLI command syntax can be used to activate this feature, which is disabled by default:

# isi_gconfig registry.Services.lwio.Parameters.Drivers.s3.S3ClusterStatusEnabled=1

# isi_gconfig | grep S3Cluster | grep -i enabled

registry.Services.lwio.Parameters.Drivers.s3.S3ClusterStatusEnabled (uint32) = 1

Two new operations are added to the S3 service, namely ‘head S3 cluster status’ and ‘get S3 cluster status’. The HEAD bucket request will always return 200. For the HEAD cluster status object request, the following three fields are required:

  • ‘content-length’, which is the length of the cluster status object
  • ‘last modified date’ maps the date for getting the cluster status object
  • an empty ‘etag’

Note that OneFS uses the MD5 hash of an empty string for the empty value.
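
For reference, that empty-string MD5 value is easily reproduced, for example in Python:

import hashlib

# MD5 of a zero-length input -- the value used for the empty ETag
print(hashlib.md5(b"").hexdigest())   # d41d8cd98f00b204e9800998ecf8427e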

The S3 cluster status API is available once OneFS 9.11 has been successfully installed and committed, and the S3 service is enabled. During an upgrade to OneFS 9.11, a ‘404 Not Found’ will be returned if the API endpoints are queried.

There are a couple of common cluster status API issues to be aware of. These include:

Issue Troubleshooting step(s)
The get cluster status API fails to get the cluster status and returns 404 Check if the configuration: S3ClusterStatusEnabled has been set to 1, or if the S3ClusterStatusBucketName and S3ClusterStatusObjectName match the bucket name and object name requested in the API.
The get cluster status API fails to get the cluster status and returns 403 Check if the access key is input correctly and if the user is an authenticated user
The get cluster status API frequently returns an “unknown” value Verify that the dependent services (e.g. isi_status_d) are running

Helpful log files for further investigating API issues such as the above include the S3 protocol log, Stats daemon log, and Flexnet service log. These can be found at the following locations on each node:

Logfile Location
S3 protocol log /var/log/s3.log
Flexnet daemon log /var/log/isi_flexnet_d.log
Stats daemon log /var/log/isi_stats_d.log

Additionally, the following CLI utilities can also be useful troubleshooting tools:

# isi_gconfig

# isi services s3

OneFS and the PowerScale PA110 Performance Accelerator

In addition to a variety of software features, OneFS 9.11 also introduces support for a new PowerScale performance accelerator node, based upon the venerable 1RU Dell PE R660 platform.

The diskless PA110 accelerator can simply, and cost effectively, augment the CPU, RAM, and bandwidth of a network or compute-bound cluster without significantly increasing its capacity or footprint.

Since the accelerator node contains no storage but has a sizable RAM footprint, it provides a substantial L1 cache, with all data being fetched from other storage nodes. Cache aging is based on a least recently used (LRU) eviction policy, and the PA110 is available in a single memory configuration, with 512GB of DDR5 DRAM per node. The PA110 also supports both inline compression and deduplication.
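
Least recently used eviction simply means that, once the cache is full, the entry that has gone longest without being read is discarded first. A minimal illustration (not the OneFS implementation) looks like this:

from collections import OrderedDict

class LRUCache:
    # Minimal LRU cache: evicts the least recently used entry when full
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used entry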

In particular, the PA110 accelerator can provide significant benefit to serialized, read-heavy, streaming workloads by virtue of its substantial, low-churn L1 cache, helping to increase throughput and reduce latency. For example, a typical scenario for PA110 addition could be a small all-flash cluster supporting a video editing workflow that is looking for a performance and/or front-end connectivity enhancement, but no additional capacity.

Other than a low capacity M.2 SSD boot card, the PA110 node contains no local storage or journal. This new accelerator is fully compatible with clusters containing the current and previous generation PowerScale nodes. Also, unlike storage nodes which require the addition of a 3 or 4 node pool of similar nodes, a single PA110 can be added to a cluster. The PA110 can be added to a cluster containing all-flash, hybrid, and archive nodes.

Under the top cover, the one rack-unit PA110 enclosure contains dual Sapphire Rapids 6442Y CPUs, each with 24 cores/48 threads and 60MB of L3 cache, running at 2.6GHz. This is complemented by 512GB of DDR5 memory and dual 960GB M.2 mirrored boot media.

Networking comprises the venerable Mellanox CX6 series NICs, with options including CX6-LX Dual port 25G, CX6-DX Dual port 100G, or MLX CX6 VPI 200G Ethernet.

The PA110 also includes a LOM (Lan-On-Motherboard) port for management and a RIO/DB9 for the serial port. This is all powered by dual 1100W Titanium hot swappable power supplies.

The PowerScale PA110 also uses a new boot-optimized storage solution (BOSS) for its boot media. This comprises a BOSS module and associated card carrier. The module is housed in the chassis as shown:

The card carrier holds two M.2 NVMe SSD cards, which can be removed from the rear of the node as follows:

Note that, unlike PowerScale storage nodes, since the accelerator does not provide any /ifs filesystem storage capacity, the PowerScale PA110 node does not require OneFS feature licenses for any of the various data services running in a cluster.

The PowerScale PA110 can also be configured to order in ‘backup mode’, too. In this configuration, the accelerator also includes a pair of fibre channel ports, provided by an Emulex LPE35002 32Gb FC HBA. This enables direct, or two-way, NDMP backup from a cluster to a tape library or VTL, either directly attached or across a fibre channel fabric.

With a fibre channel card installed in slot 2, the PA110 backup accelerator integrates seamlessly with current DR infrastructure, as well as with leading data backup and recovery software technologies to satisfy the availability and recovery SLA requirements of a wide variety of workloads.

As a backup accelerator, the PA110 aids overall cluster performance by offloading NDMP backup traffic directly to the fibre channel ports and reducing CPU and memory consumption on storage nodes – thereby minimizing impact on front end workloads. This can be of particular benefit to clusters that have been using chassis-based nodes populated with fibre channel cards. In these cases, a simple, non-disruptive addition of PA110 backup accelerator node(s) frees up compute resources on the storage nodes, boosting client workload performance and shrinking NDMP backup windows.

The following table includes the hardware specs for the new PowerScale PA110 performance accelerator, as compared to its predecessors (the P100 and B100):

Component (per node) PA110 (New) P100 (Prior gen) B100 (Prior gen)
OneFS release OneFS 9.11 or later OneFS 9.3 or later OneFS 9.3 or later
Chassis PowerEdge R660 PowerEdge R640 PowerEdge R640
CPU 24 cores (dual socket Intel 6442Y @ 2.6Ghz) 20 cores (dual socket Intel 4210R @ 2.4Ghz) 20 cores (dual socket Intel 4210R @ 2.4Ghz)
Memory 512GB DDR5 384GB or 768GB DDR4 384GB DDR4
Front-end I/O 2 x 10/25 Gb Ethernet, or 2 x 40/100Gb Ethernet, or 2 x HDR Infiniband (200Gb)   2 x 10/25 Gb Ethernet or 2 x 40/100Gb Ethernet   2 x 10/25 Gb Ethernet or 2 x 40/100Gb Ethernet
Back-end I/O 2 x 10/25 Gb Ethernet, or 2 x 40/100Gb Ethernet, or 2 x HDR Infiniband (200Gb); optional 2 x FC for NDMP   2 x 10/25 Gb Ethernet, or 2 x 40/100Gb Ethernet, or 2 x QDR Infiniband   2 x 10/25 Gb Ethernet, or 2 x 40/100Gb Ethernet, or 2 x QDR Infiniband
Mgmt Port LAN on motherboard 4 x 1GbE (rNDC) 4 x 1GbE (rNDC)
Journal N/A N/A N/A
Boot media BOSS module 960GB 2x 960GB SAS SSD drives 2x 960GB SAS SSD drives
IDSDM 1 x 32GB microSD (Receipt and recovery boot image) 1x 32GB microSD (Receipt and recovery boot image) 1x 32GB microSD (Receipt and recovery boot image)
Power Supply Dual redundant 1100W, 100-240V, 50/60Hz Dual redundant 750W, 100-240V, 50/60Hz Dual redundant 750W, 100-240V, 50/60Hz
Rack footprint 1RU 1RU 1RU
Cluster addition Minimum one node, and single node increments Minimum one node, and single node increments Minimum one node, and single node increments

These node hardware attributes can be easily viewed from the OneFS CLI via the ‘isi_hw_status’ command.

OneFS Migration from ESRS to Dell Connectivity Services

Tucked amongst the payload of the recent OneFS 9.11 release is new functionality that enables a seamless migration from EMC Secure Remote Services (ESRS) to Dell Technologies Connectivity Services (DTCS). DTCS, as you may recall from previous blog articles on the topic, is the rebranded SupportAssist solution for cluster phone-home connectivity.

First, why migrate from ESRS to DTCS? Well, two years ago, an end of service life date of January 2024 was announced for the Secure Remote Services version 3 gateway, which is used by the older ESRS, ConnectEMC, and Dial-home connectivity methods. Given this, the solution for clusters still using the SRSv3 gateway is to either:

  1. Upgrade Secure Remote Services v3 to Secure Connect Gateway v5.
  2. Upgrade to OneFS 9.5 or later and use the SupportAssist/DTCS ‘direct connect’ option.

The objective of this new OneFS 9.11 feature is to help customers migrate to DTCS so they can achieve their desired connectivity state with as little disruption as possible.

Scenario After upgrade to OneFS 9.11
Clusters with ESRS + SCGv5 Seamless migration capable
New cluster DTCS is the only connectivity option. “isi esrs” and “Remote Support” in the WebUI will either be unavailable or hidden.
Clusters without ESRS/SupportAssist/DTCS configured Same as above
Clusters with ESRS + SRSv3 Retain ESRS + SRSv3. HealthCheck warning triggered, and a WebUI banner shows that the migration did not happen. Resolution is to upgrade to SCGv5 or use a direct connection. The retry command is “isi connectivity provision start --retry-migration”.

So when a cluster that has been provisioned with ESRS using a secure connect gateway is upgraded to OneFS 9.11, this feature automatically attempts to migrate to DTCS. Upon successful completion, any references to ESRS will no longer be visible.

Similarly, on new clusters running 9.11, the ability to provision with ESRS is removed, and messaging is displayed to encourage DTCS provisioning and enablement.

Under the hood, the automatic migration architecture comprises the following core functional components:

Component Description
Upgrade commit hook Starts the migration job.
Healthcheck 'connectivity_migration' checklist A group of checks used to determine whether automatic migration can proceed.
Provision state machine Updated to use the ESE API /upgradekey for provisioning in the migration scenario.
Job Engine job Precheck: runs the connectivity_migration checklist, which must pass. Migrate settings: configures DTCS using the ESRS and cluster identity settings. Provision: enables DTCS and starts a provision task using the state machine.

There’s a new healthcheck checklist called ‘connectivity_migration’, which contains a group of checks to determine whether it’s safe for an automatic migration to proceed.

There’s also been an update to the provision state machine: it now uses the upgrade key from the ESE API, so that provisioning can occur in the migration scenario.

And the final piece is the migration job. Executed and managed by the Job Engine, this migration job has 3 phases.

The first, or pre-check, phase runs the connectivity_migration checklist. All of the checklist elements must pass in order for the job to continue.

If the checklist fails, the results of those checks can be used to determine what remedial actions are needed to get the cluster to its desired connectivity state. When it passes, the job progresses to the migrate settings phase. Here, the required configuration data is extracted from ESRS and the cluster settings in order to configure DTCS. This includes items such as the gateway host, customer contact info, telemetry settings, and so on. Once the DTCS configuration data is in place, the job continues to its final phase, which spawns the actual provision task.

After enabling DTCS, the provisioning state machine takes the ESRS API key that was paired with the configured gateway and passes it to the ESE API upgrade key endpoint, associating the key with the new ESE back end. Once that is in place, DTCS provisioning completes via the upgrade hook background process.

A new CELOG alert has been added that will be triggered if DTCS provisioning fails during a seamless migration. This alert automatically opens a service request with a sev3 priority, and recommends contacting Dell Support for assistance.
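To confirm whether this, or any other, CELOG alert has fired after the upgrade, recent events can be listed from the CLI and filtered for connectivity-related entries. The grep pattern below is purely illustrative, since the exact event text may vary:

# isi event events list | grep -i -e 'connectivity' -e 'esrs'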

The connectivity CLI changes are minimal in OneFS 9.11, and essentially comprise messaging based on the state of the underlying system. The following example is from a freshly installed OneFS 9.11 cluster, where any ‘isi esrs’ CLI commands now display the following ‘no longer supported’ message:

# isi esrs view

Secure Remote Services (SRS) is no longer supported. Use Dell Technologies connectivity services instead via 'isi connectivity'.

A cluster that’s been upgraded to OneFS 9.11, but fails to automatically migrate to DTCS will display a message stating that SRS is at the end of its service life.

# isi esrs view

Warning: Secure Remote Service is at end of service life. Upgrade connectivity to Dell Technologies connectivity services now using ‘isi connectivity’ to prevent disruptions.  See https://www.dell.com/support/kbdoc/en-us/0000152189/powerscale-onefs-info-hubs cluster administration guides for more information.

There’s also a new ‘--retry-migration’ option for the ‘isi connectivity provision start’ command:

# isi connectivity provision start --retry-migration

SRS to Dell Technologies connectivity services migration started.

This can be used to rerun the migration process once any issues have been corrected, based on the results of the connectivity migration checklist.

Finally, upon successful migration, a message will inform that ESRS has been migrated to DTCS and that ESRS is no longer supported:

# isi esrs view

Secure Remote Services (SRS) connectivity has migrated to Dell Technologies connectivity services. Use ‘isi connectivity’ to manage connectivity as SRS is no longer supported.
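Assuming the OneFS 9.11 ‘isi connectivity’ command set follows the same layout as the ‘isi supportassist’ namespace it replaces, the post-migration DTCS configuration and connection state can then be reviewed with something like:

# isi connectivity settings view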

Similarly, the WebUI updates will reflect the state of the underlying system. For example, on a freshly installed OneFS 9.11 cluster, the WebUI dashboard will remind the administrator that Dell Technologies Connectivity Services needs to be configured:

On the general settings page, the tab for ‘remote support’ has been removed in OneFS 9.11:

In the diagnostics gather configuration, the checkbox option for ESRS uploads has been removed and replaced with a DTCS upload option:

And on a fresh OneFS 9.11 cluster, the remote support channel is no longer listed as an option for alerts:

If a migration does not complete successfully, a warning is displayed on the remote support tab on the general settings page informing that the migration has failed. This warning also provides information on how to proceed:

The WebUI messaging prompts the cluster admin to resolve the failed migration by examining the results of that checklist, and provides a path forward.

The alert is also displayed on the licensing tab, since connectivity needs to be re-established after a failed migration:

The WebUI messaging provides steps to help resolve any migration issues. Plus, if a migration has failed, the ESRS upload will remain present and active until DTCS is successfully provisioned:

Once successfully migrated, the WebUI dashboard will confirm this status:

The dashboard will also confirm that DTCS is now enabled and connected via the SCG:

Additionally, the ‘remote support’ tab and page are no longer visible under general settings, and the former ESRS option is replaced by the DTCS option on the gather menu:

When investigating and troubleshooting connectivity migration issues, if something goes wrong with the migration job, examining the /var/log/isi_job_d.log file and searching for ‘EsrsToDtcsMigration’ can be a useful starting point. For additional detail, increasing the verbosity to ‘debug’ logging for the isi_job_d service and retrying the migration can also be helpful.
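For example, a quick way to pull any migration job entries from the log on a node:

# grep 'EsrsToDtcsMigration' /var/log/isi_job_d.log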

Additionally, the ‘isi healthcheck evaluations’ command line options can be used to query the status of the connectivity_migration checklist, to help determine which of the checks has failed and needs attention:

# isi healthcheck evaluations list

# isi healthcheck evaluations view <name of latest>
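If there are many evaluations on the cluster, the list can also be filtered for the connectivity migration checklist by name, for example:

# isi healthcheck evaluations list | grep -i 'connectivity_migration'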

Similarly, from the WebUI, navigating to Cluster management > Job operations displays the job status and errors, while the Cluster management > Healthcheck > Evaluations tab allows the connectivity_migration checklist details to be examined.

Note that ESRS to DTCS auto migration is only for clusters running ESRS that have been provisioned with, and are using, the Secure Connect Gateway (SCG) option. Post successful migration, the customer can always switch to using a direct connection rather than SCG, if desired.

OneFS and Software Journal Mirroring – Management and Troubleshooting

Software journal mirroring (SJM) in OneFS 9.11 delivers critical file system support to meet the reliability requirements for PowerScale platforms with high capacity flash drives. By keeping a synchronized and consistent copy of the journal on another node, and automatically recovering the journal from it upon failure, enabling SJM can reduce the node failure rate by around three orders of magnitude – while also boosting storage efficiency by negating the need for a higher level of on-disk FEC protection.

SJM is enabled by default for the applicable platforms on new clusters. So for clusters including F710 or F910 nodes with large QLC drives that ship with 9.11 installed, SJM will be automatically activated.

SJM adds a mirroring scheme, which provides the redundancy for the journal’s contents.

This is where /ifs updates are sent to a node’s local, or primary, journal as usual. But they’re also synchronously replicated, or mirrored, to another node’s journal, too – referred to as the ‘buddy’.

Every node in an SJM-enabled pool is dynamically assigned a buddy node, and if a new SJM-capable node is added to the cluster, it’s automatically paired up with a buddy. These buddies are unique for every node in the cluster.

SJM’s automatic recovery scheme can use a buddy journal’s contents to re-form the primary node’s journal. And this recovery mechanism can also be applied manually if a journal device needs to be physically replaced.

The introduction of SJM changes the node recovery options slightly in OneFS 9.11. These options now include an additional method for restoring the journal:

This means that if a node within an SJM-enabled pool ends up at the ‘stop_boot’ prompt, before falling back to SmartFail, the available options in order of desirability are:

Order Option Description
1 Automatic journal recovery OneFS will first try to automatically recover from the local copy.
2 Automatic journal mirror recovery Attempts a SyncBack from the buddy node’s journal.
3 Manual SJM recovery Dell Support can attempt a manual SJM recovery, particularly in scenarios where a bug or issue in the software journal mirroring feature itself is inhibiting automatic recovery.
4 SmartFail OneFS quarantines the node, places it into a read-only state, and re-protects by distributing the data to other devices.

While SJM is available upon upgrade commit to OneFS 9.11, it is not automatically activated. So any F710 or F910 SJM-capable node pools that were originally shipped with OneFS 9.10 installed will require SJM to be manually enabled after their upgrade to 9.11.

If SJM is not activated on a cluster with capable node pools running OneFS 9.11, a CELOG alert will be raised, encouraging the customer to enable it. This CELOG alert contains information about the administrative actions required to enable SJM. Additionally, a pre-upgrade check is included in OneFS 9.11 to prevent any existing cluster with nodes containing 61TB drives that shipped with OneFS 9.9 or older from upgrading directly to 9.11 until the affected nodes have been USB-reimaged and their journals reformatted.

For SJM-capable clusters which do not have journal mirroring enabled, the CLI command (and platform API endpoint) to activate SJM operates at the nodepool level. Each SJM-capable pool will need to be enabled separately via the ‘isi storagepool nodepools modify’ CLI command, plus the pool name and the new ‘--sjm-enabled’ argument:

# isi storagepool nodepools modify <name> --sjm-enabled true

Note that this new syntax is applicable only to nodepools containing SJM-capable nodes.

Similarly, to query the SJM status on a cluster’s nodepools:

# isi storagepool nodepools list -v | grep -e 'SJM' -e 'Name:'

And to check a cluster’s nodes for SJM capabilities:

# isi storagepool nodetypes list -v | grep -e 'Product' -e 'Capable'
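Putting these commands together, a minimal enablement workflow for a single pool might look like the following: first check which node types are SJM-capable, then enable SJM on the capable pool, and finally verify the per-pool SJM status. The pool name ‘f710_61tb_pool’ is purely hypothetical and should be replaced with the actual pool name reported on the cluster:

# isi storagepool nodetypes list -v | grep -e 'Product' -e 'Capable'

# isi storagepool nodepools modify f710_61tb_pool --sjm-enabled true

# isi storagepool nodepools list -v | grep -e 'SJM' -e 'Name:'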

There are a couple of considerations with SJM that should be borne in mind. As mentioned previously, any SJM-capable nodes that are upgraded from OneFS 9.10 will not have SJM enabled by default. So if, after upgrade to 9.11, a capable pool remains in an SJM-disabled state, a CELOG warning will be raised informing that the data may be under-protected, and hence its reliability lessened. The CELOG event will include the recommended corrective action: administrative intervention is required to either enable SJM on the node pool (ideally), or alternatively increase the protection level to meet the same reliability goal.

So how impactful is SJM to protection overhead on an SJM-capable node pool/cluster? The following table shows the protection layout, both with and without SJM, for the F710 and F910 nodes containing 61TB drives:

Node type Drive Size Journal Mirroring +2d:1n +3d:1n1d +2n +3n
F710 61TB SDPM 3 4-6 7-34 35-252
F710 61TB SDPM SJM 4-16 17-252
F910 61TB SDPM 3 5-19 20-252
F910 61TB SDPM SJM 3-16 17-252

Taking the F710 with 61TB drives example above, without SJM the +3n protection level is required at 35 nodes and above. In contrast, with SJM enabled, the +3d:1n1d protection level suffices all the way up to the current maximum cluster size of 252 nodes.
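To see which protection policy is currently requested on each pool, the nodepool listing can be filtered in a similar fashion to the earlier examples. The exact field labels may vary slightly by release:

# isi storagepool nodepools list -v | grep -e 'Name:' -e 'Protection'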

Generally, beyond enabling it on any capable pools after upgrading to 9.11, SJM just does its thing and does not require active administration or management. However, with a corresponding buddy journal for every primary node, there may be times when a primary and its buddy become unsynchronized. Clearly, this would mean that mirroring is not functioning correctly and a SyncBack recovery attempt would be unsuccessful. OneFS closely monitors for this scenario, and will fire one of two CELOG event types to alert the cluster admin in the event that journal syncing and/or mirroring are not working properly:

Possible causes include the buddy remaining disconnected, or in a read-only state, for a protracted period of time, or a software bug or issue that is preventing successful mirroring. This results in a CELOG warning being raised for the buddy of the specific node, with the suggested administrative action included in the event contents.

Also, be aware that SJM-capable and non-SJM-capable nodes can be placed in the same nodepool if needed, but only if SJM is disabled on that pool – and the protection increased correspondingly.

The following chart illustrates the overall operational flow of SJM:

SJM is a core file system feature, so the bulk of its errors and status changes are written to the ubiquitous /var/log/messages file. However, since the buddy assignment mechanism is a separate component with its own user-space daemon, its notifications and errors are sent to a dedicated ‘isi_sjm_budassign_d’ log. This logfile is located at:

/var/log/isi_sjm_budassign_d.log
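When chasing a buddy assignment issue, a simple way to watch this log in real time, or to scan it for recent problems, is shown below. The ‘error’ pattern is illustrative, since the exact log message format isn’t documented here:

# tail -f /var/log/isi_sjm_budassign_d.log

# grep -i 'error' /var/log/isi_sjm_budassign_d.log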

OneFS and Software Journal Mirroring – Architecture and Operation

In this next article in the OneFS software journal mirroring series, we will dig into SJM’s underpinnings and operation in a bit more depth.

With its debut in OneFS 9.11, the current focus of SJM is the all-flash F-series nodes containing either 61TB or 122TB QLC SSDs. In these cases, SJM dramatically improves the reliability of these dense drive platforms with journal fault tolerance. Specifically, it maintains a consistent copy of the primary node’s journal on a separate node. By automatically recovering the journal from this mirror, SJM is able to substantially reduce the node failure rate without the need for increased FEC protection overhead.

SJM is enabled by default for the applicable platforms on new clusters. So for clusters including F710 or F910 nodes with large QLC drives that ship with 9.11 installed, SJM will be automatically activated.

SJM adds a mirroring scheme, which provides the redundancy for the journal’s contents. This is where /ifs updates are sent to a node’s local, or primary, journal as usual. But they’re also synchronously replicated, or mirrored, to another node’s journal, too – referred to as the ‘buddy’.

Architecturally, SJM’s main components and associated lexicon are as follows:

Item Description
Primary Node with a journal that is co-located with the data drives that the journal will flush to.
Buddy Node with a journal that stores sufficient information about transactions on the primary to restore the contents of a primary node’s journal in the event of its failure.
Caller Calling function that executes a transaction. Analogous to the initiator in the 2PC protocol.
Userspace journal library Saves the backup, restores the backup, and dumps journal (primary and buddy).
Buddy reconfiguration system Enables buddy reconfiguration and stores the mapping in buddy map via buddy updater.
Buddy mapping updater Provides interfaces and protocol for updating buddy map.
Buddy map Stores buddy map (primary <-> buddy).
Journal recovery subsystem Facilitates journal recovery from buddy on primary journal loss.
Buddy map interface Kernel interface for buddy map.
Mirroring subsystem Mirrors global and local transactions.
JGN Journal Generation Number, to identify versions and verify if two copies of a primary journal are consistent.
JGN interface Journal Generation Number interface to update/read JGN.
NSB Node state block, which stores JGN.
SB Journal Superblock.
SyncForward Mechanism to sync an out-of-date buddy journal with missed primary journal content additions & deletions.
SyncBack Mechanism to reconstitute a blown primary journal from the mirrored information stored in the buddy journal.

These components are organized into the following hierarchy and flow, split across kernel and user space:

A node’s primary journal is co-located with the data drives that it will flush to. In contrast, the buddy journal lives on a remote node and stores sufficient information about transactions on the primary, to allow it to restore the contents of a primary node’s journal in the event of its failure.

SyncForward is the mechanism by which an out of date Buddy journal is caught up with any Primary journal transactions that it might have missed. While SyncBack, or restore, allows a blown Primary journal to be reconstituted from the mirroring information stored in its Buddy journal.

SJM needs to be able to rapidly detect a number of failure scenarios and decide which recovery workflow is appropriate to initiate. For example, with a blown primary journal, SJM must quickly determine whether the Buddy’s contents are complete enough to allow a SyncBack to fully reconstruct a valid Primary journal, or whether to resort to a more costly node rebuild instead. Or, if the Buddy node disconnects briefly, which of a Primary journal’s changes should be replicated during a SyncForward in order to bring the Buddy efficiently back into alignment.

SJM tags the transactions logged into the Primary journal, and their corresponding mirrors in the Buddy, with a monotonically increasing Journal Generation Number, or JGN.

The JGN represents the most recent & consistent copy of a primary node’s journal, and it’s incremented whenever the write status of the Buddy journal changes, which is tracked by the Primary via OneFS GMP group change updates.

In order to determine whether the Buddy journal’s contents are complete, the JGN needs to be available to the primary node when its primary journal is blown. So the JGN is stored in a Node State Block, or NSB, and saved on a quorum of the node’s data-drives. Therefore, upon loss of a Primary journal, the JGN in the node state block can be compared against the JGN in the Buddy to confirm its transaction mirroring is complete, before the SyncBack workflow is initiated.

A primary transaction exists on the node where data storage is being modified, and the corresponding buddy transaction is a hot, redundant duplicate of the primary information on a separate node. The SDPM journal storage on the F-series platforms is fast, and the pipe between nodes across the backend network is optimized for low-latency bulk data flow. This allows the standard POSIX file model to operate transparently on the front-end protocols, which remain blissfully unaware of any journal jockeying that’s occurring behind the scenes.

The journal mirroring activity is continuous, and if the Primary loses contact with its Buddy, it will urgently seek out another Buddy and repeat the mirroring for each active transaction, to regain a fully mirrored journal config. If the reverse happens, and the Primary vanishes due to an adverse event like a local power loss or an unexpected reboot, the primary can reattach to its designated buddy and ensure that its own journal is consistent with the transactions that the Buddy has kept safely mirrored. This means that the buddy must reside on a different node than the primary. As such, it’s normal and expected for each primary node to also be operating as the buddy for a different node.

The prerequisite platform requirements for SJM support in 9.11, referred to as ‘SJM-capable’ nodes, are as follows:

Essentially, any F710 or F910 nodes with 61TB or 122TB SSDs that shipped with OneFS 9.10 or later are considered SJM-capable.

Note that there are a small number of F710 and F910 nodes with 61TB drives out there which shipped with OneFS 9.9 or earlier installed. These nodes must be re-imaged before they can use SJM: they first need to be SmartFailed out, then USB-reimaged to OneFS 9.10 or later. This allows the node’s SDPM journal device to be reformatted to include a second partition for the 16 GiB buddy journal allocation. However, this 16 GiB of space reserved for the buddy journal will not be used while SJM is disabled. The following table shows the maximum SDPM usage per journal type based on SJM enablement:

Journal State Primary journal Buddy journal
SJM enabled 16 GiB 16 GiB
SJM disabled 16 GiB 0 GiB

But to reiterate, the SJM-capable platforms which will ship with OneFS 9.11 installed, or those that shipped with OneFS 9.10, are ready to run SJM, and will form node pools of equivalent type.

While SJM is available upon upgrade commit to OneFS 9.11, it is not automatically activated. So for any F710 or F910 nodes with large QLC drives that were originally shipped with OneFS 9.10 installed, the cluster admin will need to manually enable SJM on any capable pools after their upgrade to 9.11.

Plus, if SJM is not activated, a CELOG alert will be raised, encouraging the customer to enable it, in order for the cluster to meet the reliability requirements. This CELOG alert will contain information about the administrative actions required to enable SJM.

Additionally, a pre-upgrade check is also included in OneFS 9.11 to prevent any existing cluster with nodes containing 61TB drives that were shipped with OneFS 9.9 or older installed, from upgrading directly to 9.11 – until these nodes have been USB-reimaged and their journals reformatted.

OneFS and Software Journal Mirroring

OneFS 9.11 sees the addition of a software journal mirroring capability, which adds critical file system support to meet the reliability requirements for platforms with high-capacity drives.

But first, a quick journal refresher… OneFS uses journaling to ensure consistency across both disks locally within a node and disks across nodes. As such, the journal is among the most critical components of a PowerScale node. When OneFS writes to a drive, the data goes straight to the journal, allowing for a fast reply.

Block writes go to the journal first, and a transaction must be marked as ‘committed’ in the journal before a ‘success’ status is returned to the file system operation.

Once a transaction is committed the change is guaranteed to be stable. If the node crashes or loses power, changes can still be applied from the journal at mount time via a ‘replay’ process. The journal uses a battery-backed persistent storage medium in order to be available after a catastrophic node event, and must also be:

Journal Performance Characteristic Description
High throughput All blocks (and therefore all data) pass through the journal, so it must never become a bottleneck.
Low latency Transaction state changes are often in the latency path multiple times for a single operation, particularly for distributed transactions.

The OneFS journal mostly operates at the physical level, storing changes to physical blocks on the local node. This is necessary because all initiators in OneFS have a physical view of the file system – and therefore issue physical read and write requests to remote nodes. The OneFS journal supports block sizes of both 512 bytes and 8KiB, for storing written inodes and blocks respectively.

By design, the contents of a node’s journal are only needed in a catastrophe, such as when memory state is lost. For fast access during normal operation, the journal is mirrored in RAM. Thus, any reads come from RAM and the physical journal itself is write-only in normal operation. The journal contents are read at mount time for replay. In addition to providing fast stable writes, the journal also improves performance by serving as a write-back cache for disks. When a transaction is committed, the blocks are not immediately written to disk. Instead, it is delayed until the space is needed. This allows the I/O scheduler to perform write optimizations such as reordering and clustering blocks. This also allows some writes to be elided when another write to the same block occurs quickly, or the write is otherwise unnecessary, such as when the block is freed.

So the OneFS journal provides the initial stable storage for all writes and does not release a block until it is guaranteed to be stable on a drive. This process involves multiple steps and spans both the file system and operating system. The high-level flow is as follows:

Step Operation Description
1 Transaction prep A block is written on a transaction, for example a write_block message is received by a node. An asynchronous write is started to the journal. The transaction preparation step will wait until all writes on the transaction complete.
2 Journal delayed write The transaction is committed. Now the journal issues a delayed write. This simply marks the buffer as dirty.
3 Buffer monitoring A daemon monitors the number of dirty buffers and issues the write to the drive upon reaching its threshold.
4 Write completion notification The journal receives an upcall indicating that the write is complete.
5 Threshold reached Once journal space runs low or an idle timeout expires, the journal issues a cache flush to the drive to ensure the write is stable.
6 Flush to disk When the cache flush completes, all writes completed before the cache flush are known to be stable. The journal frees the space.

The PowerScale F-series platforms use Dell’s VOSS M.2 SSD drive as the non-volatile device for their software-defined persistent memory (SDPM) journal vault.  The SDPM itself comprises two main elements:

Component Description
BBU The BBU pack (battery backup unit) supplies temporary power to the CPUs and memory allowing them to perform a backup in the event of a power loss.
Vault A 32GB M.2 NVMe to which the system memory is vaulted.

While the BBU is self-contained, the M.2 NVMe vault is housed within a VOSS module, and both components are easily replaced if necessary.

The current focus of software journal mirroring (SJM) is the all-flash F710 and F910 nodes that contain either the 61TB QLC SSDs or the soon-to-be-available 122TB drives. In these cases, SJM dramatically improves the reliability of these dense drive platforms. But first, some context regarding journal failure and its relation to node rebuild times, durability, and protection overhead.

Typically, a node needs to be rebuilt when its journal fails – for example, if it loses its data, or if the journal device develops a fault and needs to be replaced. To accomplish this, the OneFS SmartFail operation has historically been the tool of choice, restriping the data away from the node. But the time to completion for this operation depends on the restripe rate and the amount of storage, and the gist is that the denser the drives, the more storage is on the node, and the more work SmartFail has to perform.

And if restriping takes longer, the window during which the data is under-protected also increases. This directly affects reliability, by reducing the mean time to data loss, or MTTDL. PowerScale has an MTTDL target of 5,000 years for any given size of a cluster. The 61TB QLC SSDs represent an inflection point for OneFS restriping, where, due to their lengthy rebuild times, reliability, and specifically MTTDL, become significantly impacted.

So the options in a nutshell for these dense drive nodes, are either to:

  1. Increase the protection overhead, or:
  2. Improve a node’s resilience and, by virtue, reduce its failure rate.

Increasing the protection level is clearly undesirable, because the additional overhead reduces usable capacity and hence the storage efficiency – thereby increasing the per-terabyte cost, as well as reducing rack density and energy efficiency.

Which leaves option 2: Reducing the node failure rate itself, which the new SJM functionality in 9.11 achieves by adding journal redundancy.

So, by keeping a synchronized and consistent copy of the journal on another node, and automatically recovering the journal from it upon failure, enabling SJM can reduce the node failure rate by around three orders of magnitude – while removing the need for a punitively high protection level on platforms with large-capacity drives.

SJM is enabled by default for the applicable platforms on new clusters. So for clusters including F710 or F910 nodes with large QLC drives that ship with 9.11 installed, SJM will be automatically activated.

SJM adds a mirroring scheme, which provides the redundancy for the journal’s contents. This is where /ifs updates are sent to a node’s local, or primary, journal as usual. But they’re also synchronously replicated, or mirrored, to another node’s journal, too – referred to as the ‘buddy’.

This is somewhat analogous to how node pairing operates on the PowerScale H and A-series chassis-based platforms, albeit implemented in software and over the backend network this time, and with no fixed buddy assignment, rather than over a dedicated PCIe non-transparent bridge link to a dedicated partner node.

Every node in an SJM-enabled pool is dynamically assigned a buddy node. And similarly, if a new SJM-capable node is added to the cluster, it’s automatically paired up with a buddy. These buddies are unique for every node in the cluster.

SJM’s automatic recovery scheme can use a buddy journal’s contents to re-form the primary node’s journal. And this recovery mechanism can also be applied manually if a journal device needs to be physically replaced.

A node’s primary journal lives within that node, next to its storage drives. In contrast, the buddy journal lives on a remote node and stores sufficient information about transactions on the primary, to allow it to restore the contents of a primary node’s journal in the event of its failure.

SyncForward is the process that enables a stale Buddy journal to reconcile with the Primary and any transactions that it might have missed. Whereas SyncBack, or restore, allows a blown Primary journal to be reconstructed from the mirroring information stored in its Buddy journal.

The next blog article in this series will dig into SJM’s architecture and management in a bit more depth.