Unstructured Data Quick Tips

OneFS Hardware Network Considerations

As we’ve seen in prior articles in this series, OneFS and the PowerScale platforms support a variety of Ethernet speeds, cable and connector styles, and network interface counts, depending on the node type selected. However, unlike the back-end network, Dell does not specify particular front-end switch models, allowing PowerScale clusters to seamlessly integrate into the data link layer (layer 2) of an organization’s existing Ethernet IP network infrastructure. For example:

A layer 2 looped topology as above extends VLANs between the distribution/aggregation switches, with spanning tree protocol (STP) preventing network loops by shutting down redundant paths. The access layer uplinks may be used to load balance VLANs. This distributed architecture allows the cluster’s external network to connect to multiple access switches, affording each node similar levels of availability, performance, and management properties.

Link aggregation can be used to combine multiple Ethernet interfaces into a single link-layer interface, and is implemented between a single switch and PowerScale node, where transparent failover or switch port redundancy is required. Link aggregation assumes all links are full duplex, point to point, and at the same data rate, providing graceful recovery from link failures. If a link fails, traffic is automatically sent to the next available link without disruption.

Quality of service (QoS) can be implemented through differentiated services code point (DSCP), by specifying a value in the packet header that maps to an ‘effort level’ for traffic. Since OneFS does not provide an option for tagging packets with a specified DSCP marking, the recommended practice is to configure the first hop ports to insert DSCP values on the access switches connected to the PowerScale nodes. OneFS does retain headers for packets that already have a specified DSCP value, however.

When designing a cluster, the recommendation is that each node have at least one front-end interface configured, preferably in at least one static SmartConnect zone. Although a cluster can be run in a ‘not all nodes on the network’ (NANON) configuration, where feasible, the recommendation is to connect all nodes to the front-end network(s). Additionally, cluster services such as SNMP, ESRS, ICAP, and auth providers (AD, LDAP, NIS, etc) prefer each node to have an address that can reach the external servers.

In contrast with scale-up NAS platforms that use separate network interfaces for out-of-band management and configuration, OneFS traditionally performs all cluster network management in-band. However, PowerScale nodes typically contain a dedicated 1Gb Ethernet port that can be configured for use as a management network via IPMI or iDRAC, simplifying administration of a large cluster. OneFS also supports using a node’s serial port as an RS232 out-of-band management interface, and this practice is highly recommended for large clusters. Serial connectivity can provide reliable BIOS-level command line access for on-site or remote service staff to perform maintenance, troubleshooting and installation operations.

SmartConnect provides a configurable allocation method for each IP address pool:

Allocation Method	Attributes
Static	One IP address per interface is assigned, so fewer IP addresses are likely needed to meet minimum requirements. No failover of IP addresses to other interfaces.
Dynamic	Multiple IP addresses per interface are assigned, so more IP addresses are needed to meet minimum requirements. IP addresses fail over to other interfaces, so failback policies are needed.

The default ‘static’ allocation assigns a single persistent IP address to each interface selected in the pool, leaving additional pool IP addresses unassigned if the number of addresses exceeds the total interfaces.

The lowest IP address of the pool is assigned to the lowest Logical Node Number (LNN) from the selected interfaces, the second-lowest IP address to the second-lowest LNN, and so on. If a node or interface becomes unavailable, its IP address does not move to another node or interface. Instead, the node or interface is removed from the SmartConnect zone, and new connections are not assigned to it. Once the node is available again, SmartConnect automatically adds it back into the zone and resumes assigning new connections.
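
The static assignment order described above can be sketched in a few lines of Python. This is only an illustration of the ordering rule, not OneFS code: the function name `static_allocation`, the `(LNN, interface)` tuples, and the sample addresses are all hypothetical.

```python
import ipaddress

def static_allocation(pool_ips, interfaces):
    """Pair the lowest IPs with the lowest LNNs, one IP per interface.

    pool_ips:   IP address strings in the pool (any order)
    interfaces: (lnn, interface_name) tuples selected for the pool
    Returns (assignments, unassigned_ips).
    """
    ips = sorted(ipaddress.ip_address(ip) for ip in pool_ips)
    ifaces = sorted(interfaces)  # sorts by LNN first, then interface name
    assignments = {iface: str(ip) for iface, ip in zip(ifaces, ips)}
    # Any addresses beyond the interface count stay unassigned.
    unassigned = [str(ip) for ip in ips[len(ifaces):]]
    return assignments, unassigned

pool = ["10.0.0.13", "10.0.0.10", "10.0.0.12", "10.0.0.11", "10.0.0.14"]
nodes = [(3, "ext-1"), (1, "ext-1"), (2, "ext-1")]
assigned, spare = static_allocation(pool, nodes)
print(assigned)
print(spare)
```

With five addresses and three interfaces, the two highest addresses are left over, matching the behavior described for pools that hold more addresses than interfaces.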

By contrast, ‘dynamic’ allocation divides all available IP addresses in the pool across all selected interfaces, and OneFS attempts to assign the IP addresses as evenly as possible. However, if the interface-to-IP address ratio is not an integer value, a single interface might have more IP addresses than another. As such, wherever possible, ensure that all the interfaces have the same number of IP addresses.
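
To see why a non-integer ratio leaves some interfaces with an extra address, consider the following minimal sketch. It assumes addresses are handed out in contiguous slices, which is an illustrative simplification rather than the documented OneFS algorithm, and the name `dynamic_allocation` is hypothetical.

```python
def dynamic_allocation(pool_ips, interfaces):
    """Spread every pool IP across the interfaces as evenly as possible."""
    base, extra = divmod(len(pool_ips), len(interfaces))
    layout, start = {}, 0
    for index, iface in enumerate(interfaces):
        # The first `extra` interfaces each take one additional address.
        count = base + (1 if index < extra else 0)
        layout[iface] = pool_ips[start:start + count]
        start += count
    return layout

ips = [f"10.0.0.{n}" for n in range(10, 17)]       # seven addresses
layout = dynamic_allocation(ips, ["1:ext-1", "2:ext-1", "3:ext-1"])
for iface, owned in layout.items():
    print(iface, owned)
```

Seven addresses over three interfaces cannot divide evenly, so one interface carries three addresses while the others carry two. Sizing the pool as a multiple of the interface count avoids the imbalance.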

In concert with dynamic allocation, dynamic failover provides high availability by transparently migrating IP addresses to another node when an interface is not available. If a node becomes unavailable, all the IP addresses it was hosting are reallocated across the new set of available nodes in accordance with the configured failover load-balancing policy. The default IP address failover policy is round robin, which evenly distributes IP addresses from the unavailable node across available nodes. Because the IP address remains consistent, irrespective of which node it resides on, failover to the client is transparent, so high availability is seamless.
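
A round robin failover can be pictured as dealing the failed node's addresses out to the surviving nodes in turn. The sketch below is a simplified model under that assumption; `fail_over_round_robin` and the sample layout are hypothetical and do not represent OneFS internals.

```python
from itertools import cycle

def fail_over_round_robin(layout, failed_node):
    """Return a new layout with the failed node's IPs dealt out in turn."""
    survivors = {node: list(ips) for node, ips in layout.items()
                 if node != failed_node}
    if not survivors:
        raise ValueError("no surviving nodes to receive IP addresses")
    targets = cycle(sorted(survivors))
    for ip in layout[failed_node]:
        survivors[next(targets)].append(ip)
    return survivors

pool = {
    1: ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    2: ["10.0.0.4", "10.0.0.5", "10.0.0.6"],
    3: ["10.0.0.7", "10.0.0.8", "10.0.0.9"],
    4: ["10.0.0.10", "10.0.0.11", "10.0.0.12"],
}
after = fail_over_round_robin(pool, failed_node=1)
for node, ips in after.items():
    print(node, ips)
```

Because each address keeps its value and only its host changes, clients continue to use the same IP address after the move.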

The other available IP address failover policies are the same as the initial client connection balancing policies, that is, connection count, throughput, or CPU usage. In most scenarios, round robin is not only the best option but also the most common. However, the other failover policies are available for specific workflows.

The decision on whether to implement dynamic failover is highly dependent on the protocol(s) being used, general workflow attributes, and any high-availability design requirements:

Protocol	State	Suggested Allocation Strategy

NFSv3	Stateless	Dynamic
NFSv4	Stateful	Dynamic or Static, depending on mount daemon, OneFS version, and Kerberos.
SMB	Stateful	Dynamic or Static
SMB Multi-channel	Stateful	Dynamic or Static
S3	Stateless	Dynamic or Static
HDFS	Stateful	Dynamic or Static. HDFS uses separate name-node and data-node connections. Allocation strategy depends on the need for data locality and/or multi-protocol access. For example, HDFS + NFSv3: dynamic pool; HDFS + SMB: static pool.
HTTP	Stateless	Static
FTP	Stateful	Static
SyncIQ	Stateful	Static required

Assigning each workload or data store to a unique IP address enables OneFS SmartConnect to move each workload to one of the other interfaces, minimizing the additional work that a remaining node in the SmartConnect pool must absorb and ensuring that the workload is evenly distributed across all the other nodes in the pool.

Static IP pools require one IP address for each logical interface within the pool. Since each node provides two interfaces for external networking, if link aggregation is not configured, this would require 2*N IP addresses for a static pool.

The number of IP addresses required within a dynamic allocation pool varies depending on the workflow, node count, and the estimated number of clients that would be in a failover event. While dynamic pools need, at a minimum, the number of IP addresses to match a pool’s node count, the ‘N * (N – 1)’ formula can often prove useful for calculating the required number of IP addresses for smaller pools. In this equation, N is the number of nodes that will participate in the pool.

For example, a SmartConnect pool with four node interfaces, using the ‘N * (N – 1)’ model, requires 12 IP addresses, with three unique IP addresses allocated to each node. A failure on one node interface will cause each of that interface’s three IP addresses to fail over to a different node in the pool. This ensures that each of the three active interfaces remaining in the pool receives one IP address from the failed node interface. If client connections to that node are evenly balanced across its three IP addresses, SmartConnect will evenly distribute the workloads to the remaining pool members. For larger clusters, this formula may not be feasible due to the sheer number of IP addresses required.
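
The two sizing rules can be compared directly. The helper names below are hypothetical, and the static figure assumes two external interfaces per node with no link aggregation, as described above.

```python
def static_pool_ips(nodes, interfaces_per_node=2):
    """One IP address per logical interface: 2*N without link aggregation."""
    return nodes * interfaces_per_node

def dynamic_pool_ips(nodes):
    """N * (N - 1): each node holds one IP address per potential failover peer."""
    if nodes < 2:
        raise ValueError("the N * (N - 1) model needs at least two nodes")
    return nodes * (nodes - 1)

for n in (4, 8, 16, 40):
    print(f"{n:>3} nodes: static={static_pool_ips(n):>3}  "
          f"dynamic={dynamic_pool_ips(n):>5}")
```

The quadratic growth of the dynamic figure is what makes the formula impractical for larger pools.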

Enabling jumbo frames (Maximum Transmission Unit set to 9000 bytes) typically yields improved throughput with slightly lower CPU usage than standard frames, where the MTU is set to 1500 bytes. For example, with 40 Gb Ethernet connections, jumbo frames provide about five percent better throughput and about one percent less CPU usage.
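
Part of the throughput difference comes from per-frame overhead. As a rough, theoretical illustration (not a measurement), the sketch below compares how much of each frame on the wire is TCP payload. It assumes untagged Ethernet framing and IPv4 and TCP headers without options; the function name is hypothetical.

```python
# Bytes on the wire around each frame's MTU-sized packet:
# preamble/SFD (8) + Ethernet header (14) + FCS (4) + inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
# Assumed IPv4 (20) and TCP (20) headers without options.
IP_TCP_HEADERS = 20 + 20

def wire_efficiency(mtu):
    """Fraction of on-the-wire bytes that carry TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)

standard = wire_efficiency(1500)
jumbo = wire_efficiency(9000)
print(f"MTU 1500: {standard:.1%}")
print(f"MTU 9000: {jumbo:.1%}")
print(f"relative gain: {jumbo / standard - 1:.1%}")
```

Real-world gains also depend on NIC offloads, protocol, and workload, so treat this as a ceiling on the framing benefit only.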

OneFS provides the ability to optimize storage performance by designating zones to support specific workloads or subsets of clients. Different network traffic types can be segregated on separate subnets using SmartConnect pools.

For large clusters, partitioning the cluster’s networking resources and allocating bandwidth to each workload can help minimize the likelihood that heavy traffic from one workload will affect network throughput for another. This is particularly true for SyncIQ replication and NDMP backup traffic, which can frequently benefit from its own set of interfaces, separate from user and client IO load.

The ‘groupnet’ networking object is part of OneFS’ support for multi-tenancy. Groupnets sit above subnets and pools and allow separate multi-tenant access zones to contain distinct DNS settings.

The management and data network(s) can then be incorporated into different multi-tenant access zones, each with their own DNS, directory access services, and routing, as appropriate.

OneFS Hardware Platform Considerations

A key decision for performance, particularly in a large cluster environment, is the type and quantity of nodes deployed. Heterogeneous clusters can be architected with a wide variety of node styles and capacities, in order to meet the needs of a varied data set and wide spectrum of workloads. These node styles encompass several hardware generations, and fall loosely into three main categories or tiers. While heterogeneous clusters can easily include many hardware classes and configurations, the best practice of simplicity for building clusters holds true here too.

Consider the physical cluster layout and environmental factors, particularly when designing and planning a large cluster installation. These factors include:

Redundant power supply
Airflow and cooling
Rack space requirements
Floor tile weight constraints
Networking requirements
Cabling distance limitations

The following table details the physical dimensions, weight, power draw, and thermal properties for the range of PowerScale F-series all-flash nodes:

Model	Tier	Height	Width	Depth	RU	Weight	MaxWatts	Watts	Max BTU	Normal BTU
F900	All-flash NVMe performance	2U (2×1.75IN)	17.8 IN / 45 cm	31.8 IN / 85.9 cm	2RU	73 lbs	1297	859	4425	2931
F600	All-flash NVMe Performance	1U (1.75IN)	17.8 IN / 45 cm	31.8 IN / 85.9 cm	1RU	43 lbs	718	467	2450	1594
F200	All-flash Performance	1U (1.75IN)	17.8 IN / 45 cm	31.8 IN / 85.9 cm	1RU	47 lbs	395	239	1346	816

Note that the table above represents individual nodes. A minimum of three of each node style are required for a node pool.
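
The BTU columns follow from the wattage columns at roughly 3.412 BTU per hour per watt. A short sketch, using the F900 row as the worked example and hypothetical helper names, shows the conversion and how it scales to a three-node pool:

```python
BTU_PER_HOUR_PER_WATT = 3.412  # approximate conversion factor

def watts_to_btu_per_hour(watts):
    """Convert electrical power draw to heat output."""
    return watts * BTU_PER_HOUR_PER_WATT

def pool_heat_load(node_watts, node_count=3):
    """Heat output for a node pool; three nodes is the stated minimum."""
    return watts_to_btu_per_hour(node_watts) * node_count

print(round(watts_to_btu_per_hour(1297)))  # F900 maximum watts
print(round(watts_to_btu_per_hour(859)))   # F900 typical watts
print(round(pool_heat_load(1297)))         # three F900 nodes at maximum draw
```

The same conversion can be applied per rack when sizing cooling capacity.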

Similarly, the following table details the physical dimensions, weight, power draw, and thermal properties for the range of PowerScale chassis-based platforms:

Model	Tier	Height	Width	Depth	RU	Weight	MaxWatts	Watts	Max BTU	Normal BTU
F800/810	All-flash performance	4U (4×1.75IN)	17.6 IN / 45 cm	35 IN / 88.9 cm	4RU	169 lbs (77 kg)	1764	1300	6019	4436
H700	Hybrid/Utility	4U (4×1.75IN)	17.6 IN / 45 cm	35 IN / 88.9 cm	4RU	261lbs (100 kg)	1920	1528	6551	5214
H7000	Hybrid/Utility	4U (4×1.75IN)	17.6 IN / 45 cm	39 IN / 99.06 cm	4RU	312 lbs (129 kg)	2080	1688	7087	5760
H600	Hybrid/Utility	4U (4×1.75IN)	17.6 IN / 45 cm	35 IN / 88.9 cm	4RU	213 lbs (97 kg)	1990	1704	6790	5816
H5600	Hybrid/Utility	4U (4×1.75IN)	17.6 IN / 45 cm	39 IN / 99.06 cm	4RU	285 lbs (129 kg)	1906	1312	6504	4476
H500	Hybrid/Utility	4U (4×1.75IN)	17.6 IN / 45 cm	35 IN / 88.9 cm	4RU	248 lbs (112 kg)	1906	1312	6504	4476
H400	Hybrid/Utility	4U (4×1.75IN)	17.6 IN / 45 cm	35 IN / 88.9 cm	4RU	242 lbs (110 kg)	1558	1112	5316	3788
A300	Archive	4U (4×1.75IN)	17.6 IN / 45 cm	35 IN / 88.9 cm	4RU	252 lbs (100 kg)	1460	1070	4982	3651
A3000	Archive	4U (4×1.75IN)	17.6 IN / 45 cm	39 IN / 99.06 cm	4RU	303 lbs (129 kg)	1620	1230	5528	4197
A200	Archive	4U (4×1.75IN)	17.6 IN / 45 cm	35 IN / 88.9 cm	4RU	219 lbs (100 kg)	1460	1052	4982	3584
A2000	Archive	4U (4×1.75IN)	17.6 IN / 45 cm	39 IN / 99.06 cm	4RU	285 lbs (129 kg)	1520	1110	5186	3788

Note that the table above represents 4RU chassis, each of which contains four PowerScale platform nodes (the minimum node pool size).

Below are the locations of both the front end (ext-1 & ext-2) and back-end (int-1 & int-2) network interfaces on the PowerScale stand-alone F-series and chassis-based nodes:

A PowerScale cluster’s backend network is analogous to a distributed systems bus. Each node has two backend interfaces for redundancy that run in an active/passive configuration (int-1 and int-2 above). The primary interface is connected to the primary switch, and the secondary interface to a separate switch.

For nodes using 40/100 Gb or 25/10 Gb Ethernet or Infiniband connected with multimode fiber, the maximum cable length is 150 meters. This allows a cluster to span multiple rack rows, floors, and even buildings, if necessary. While this can solve floor space challenges, in order to perform any physical administration activity on nodes you must know where the equipment is located.

The table below shows the various PowerScale node types and their respective backend network support. While Ethernet is the preferred medium, particularly for large PowerScale clusters, Infiniband is also supported for compatibility with legacy Isilon clusters.

Node Models	Details
F200, F600, F900	F200: nodes support a 10 GbE or 25 GbE connection to the access switch using the same NIC. A breakout cable can connect up to four nodes to a single switch port. F600: nodes support a 40 GbE or 100 GbE connection to the access switch using the same NIC. F900: nodes support a 40 GbE or 100 GbE connection to the access switch using the same NIC.
H700, H7000, A300, A3000	Support a 40 GbE or 100 GbE connection to the access switch using the same NIC, or a 25 GbE or 10 GbE connection to the leaf switch using the same NIC. A breakout cable can connect a 40 GbE switch port to four 10 GbE nodes or a 100 GbE switch port to four 25 GbE nodes.
F810, F800, H600, H500, H5600	Performance nodes support a 40 GbE connection to the access switch.
A200, A2000, H400	Archive nodes support a 10GbE connection to the access switch using a breakout cable. A breakout cable can connect a 40 GbE switch port to four 10 GbE nodes or a 100 GbE switch port to four 10 GbE nodes.

Currently only Dell approved switches are supported for backend Ethernet and IB cluster interconnection. These include:

Switch Model	Port Count	Port Speed	Height (Rack Units)	Role	Notes
Dell S4112	24	10GbE	½	ToR	10 GbE only.
Dell S4148	48	10GbE	1	ToR	10 GbE only.
Dell S5232	32	100GbE	1	Leaf or Spine	Supports 4x10GbE or 4x25GbE breakout cables. Total of 124 10GbE or 25GbE nodes as top-of-rack backend switch. Port 32 does not support breakout.
Dell Z9100	32	100GbE	1	Leaf or Spine	Supports 4x10GbE or 4x25GbE breakout cables. Total of 128 10GbE or 25GbE nodes as top-of-rack backend switch.
Dell Z9264	64	100GbE	2	Leaf or Spine	Supports 4x10GbE or 4x25GbE breakout cables. Total of 128 10GbE or 25GbE nodes as top-of-rack backend switch.
Arista 7304	128	40GbE	8	Enterprise core	40GbE or 10GbE line cards.
Arista 7308	256	40GbE	13	Enterprise/ large cluster	40GbE or 10GbE line cards.
Mellanox Neptune MSX6790	36	QDR	1	IB fabric	32Gb/s quad data rate Infiniband.
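
The node totals quoted for the S5232 and Z9100 follow from the 4x breakout fan-out. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def breakout_node_capacity(ports, fanout=4, non_breakout_ports=0):
    """Nodes reachable through breakout cables on a top-of-rack switch."""
    return (ports - non_breakout_ports) * fanout

# S5232: 32 ports, but port 32 does not support breakout.
print(breakout_node_capacity(32, non_breakout_ports=1))
# Z9100: all 32 ports support breakout.
print(breakout_node_capacity(32))
```

Other switch models may impose their own per-port restrictions, so check the switch documentation before applying the same formula.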

Be aware that the use of patch panels is not supported for PowerScale cluster backend connections, regardless of overall cable lengths. All connections must be a single link, single cable directly between the node and backend switch. Also, Ethernet and Infiniband switches must not be reconfigured or used for any traffic beyond a single cluster.

Support for leaf-spine backend Ethernet network topologies was first introduced in OneFS 8.2. In a leaf-spine network switch architecture, the PowerScale nodes connect to leaf switches at the access, or leaf, layer of the network. At the next level, the aggregation and core network layers are condensed into a single spine layer. Every leaf switch connects to every spine switch to ensure that all leaf switches are no more than one hop away from one another. For example:

Leaf-to-spine switch connections require even distribution, to ensure the same number of spine connections from each leaf switch. This helps minimize latency and reduces the likelihood of bottlenecks in the back-end network. By design, a leaf-spine network architecture is both highly scalable and redundant.
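
These two rules (every leaf reaches every spine, and every leaf has the same number of spine uplinks) are easy to check against a cabling plan. The following sketch assumes the plan is held as a simple dictionary; the function name, switch names, and data layout are hypothetical.

```python
def check_leaf_spine(uplinks, spines):
    """Validate a plan of the form {leaf: {spine: number_of_links}}."""
    problems = []
    for leaf, links in uplinks.items():
        missing = [s for s in spines if links.get(s, 0) == 0]
        if missing:
            problems.append(f"{leaf}: no uplink to {', '.join(missing)}")
    totals = {leaf: sum(links.values()) for leaf, links in uplinks.items()}
    if len(set(totals.values())) > 1:
        problems.append(f"uneven spine uplink counts: {totals}")
    return problems

spines = ["spine-1", "spine-2"]
balanced = {
    "leaf-1": {"spine-1": 2, "spine-2": 2},
    "leaf-2": {"spine-1": 2, "spine-2": 2},
}
uneven = {
    "leaf-1": {"spine-1": 2, "spine-2": 2},
    "leaf-2": {"spine-1": 2},
}
print(check_leaf_spine(balanced, spines))
print(check_leaf_spine(uneven, spines))
```

An empty list means the plan satisfies both rules; otherwise each entry names a leaf or imbalance to correct.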

Leaf-spine network deployments can have a minimum of two leaf switches and one spine switch. For small to medium clusters in a single rack, the back-end network typically uses two redundant top-of-rack (ToR) switches, rather than implementing a more complex leaf-spine topology.

OneFS Hardware Installation Considerations

When it comes to physically installing PowerScale nodes, most utilize a 35 inch depth chassis and will fit in a standard depth data center cabinet. Nodes can be secured to standard storage racks with their sliding rail kits, included in all node packaging and compatible with racks using either 3/8 inch square holes, 9/32 inch round holes, or 10-32 / 12-24 / M5X.8 / M6X1 pre-threaded holes. These supplied rail kit mounting brackets are adjustable in length from 24 inches to 36 inches to accommodate different rack depths. When selecting an enclosure for PowerScale nodes, ensure that the rack supports the minimum and maximum rail kit sizes.

Rack Component	Description
a	Distance between front surface of the rack and the front NEMA rail
b	Distance between NEMA rails, minimum=24in (609.6mm), max=34in (863.6mm)
c	Distance between the rear of the chassis to the rear of the rack, min=2.3in (58.42mm)
d	Distance between inner front of the front door and the NEMA rail, min=2.5in (63.5mm)
e	Distance between the inside of the rear post and the rear vertical edge of the chassis and rails, min=2.5in (63.5mm)
f	Width of the rear rack post
g	19in (482.6mm)+(2e), min=24in (609.6mm)
h	19in (482.6mm) NEMA+(2e)+(2f). Note: width of the PDU+0.5in (13mm) <= e+f. If j=i+c+PDU depth+3in (76.2mm), then h=min 23.6in (600mm), assuming the PDU is mounted beyond i+c.
i	Chassis depth: normal chassis=35.80in (909mm); deep chassis=40.40in (1026mm). Switch depth (measured from the front NEMA rail): note that the inner rail is fixed at 36.25in (921mm). Allow up to 6in (155mm) for cable bend radius when routing up to 32 cables to one side of the rack. Use the greater depth of the installed equipment.
j	Minimum rack depth=i+c
k	Front
l	Rear
m	Front door
n	Rear door
p	Rack post
q	PDU
r	NEMA
s	NEMA 19 inch
t	Rack top view
u	Distance from front NEMA to chassis face: Dell PowerScale deep and normal chassis = 0in
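
Rows c, i, and j of the table combine into a simple minimum rack depth check. The sketch below uses the chassis depths and minimum rear clearance from the table and assumes the chassis is the deepest installed equipment. The function name and the sample PDU depth are hypothetical.

```python
CHASSIS_DEPTH_IN = {"normal": 35.80, "deep": 40.40}   # row i
MIN_REAR_CLEARANCE_IN = 2.3                            # row c

def min_rack_depth_in(chassis, pdu_depth_in=None):
    """Minimum rack depth j = i + c, in inches.

    If a PDU is mounted behind the chassis, add its depth plus 3 in,
    following the j = i + c + PDU depth + 3in case in row h.
    """
    depth = CHASSIS_DEPTH_IN[chassis] + MIN_REAR_CLEARANCE_IN
    if pdu_depth_in is not None:
        depth += pdu_depth_in + 3.0
    return depth

print(f"{min_rack_depth_in('normal'):.1f} in")
print(f"{min_rack_depth_in('deep'):.1f} in")
print(f"{min_rack_depth_in('deep', pdu_depth_in=2.0):.1f} in")
```

If a switch or other device extends further back than the chassis, substitute its depth for row i before applying the formula.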

The high-capacity models such as the H7000, H5600, A3000 and A2000, however, have a 40 inch depth chassis and require extended depth cabinets such as the APC 3350 or Dell Titan-HD rack.

Additional room must be provided for opening the FRU service trays at the rear of the nodes and, in the chassis-based 4RU platforms, the disk sleds at the front of the chassis. With the exception of the 2RU F900, the stand-alone PowerScale all-flash nodes are 1RU in height (including the 1RU diskless P100 accelerator and B100 backup accelerator nodes).

Power-wise, each cabinet typically requires between two and six independent single or three-phase power sources. To determine the specific requirements, use the published technical specifications and device rating labels for the devices to calculate the total current draw for each rack.
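
As a rough illustration of that calculation, total current is the sum of the device wattages divided by the supply voltage. The example below assumes a hypothetical rack of seven chassis rated at 1920 W maximum each, a 208 V supply, and an 80 percent continuous-load derating on each breaker. All three are assumptions to replace with site-specific values, and the helper names are hypothetical.

```python
import math

def rack_current_amps(device_watts, voltage):
    """Total current draw (A) for a rack: sum of device watts / voltage."""
    return sum(device_watts) / voltage

def drops_needed(amps, breaker_amps, derating=0.8):
    """Circuits required if each breaker is loaded to `derating` of its rating."""
    return math.ceil(amps / (breaker_amps * derating))

rack = [1920] * 7                      # assumed: seven chassis at max watts
amps = rack_current_amps(rack, voltage=208)
print(f"{amps:.1f} A")
print(drops_needed(amps, breaker_amps=30))
```

This covers the load only. Redundant power zones require that capacity to be available independently in each zone.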

Specification	North American 3 wire connection (2 L and 1 G)	International 3 wire connection (1 L, 1 N, and 1 G)
Input nominal voltage	200–240 V ac +/- 10% L – L nom	220–240 V ac +/- 10% L – N nom
Frequency	50–60 Hz	50–60 Hz
Circuit breakers	30 A	32 A
Power zones	Two	Two
Power requirements at site (minimum to maximum)	Single-phase: six 30A drops, two per zone; three-phase Delta: two 50A drops, one per zone; three-phase Wye: two 32A drops, one per zone	Single-phase: six 30A drops, two per zone; three-phase Delta: two 50A drops, one per zone; three-phase Wye: two 32A drops, one per zone

Additionally, the recommended environmental conditions to support optimal PowerScale cluster operation are as follows:

Attribute	Details
Temperature	Operate >=90 percent of the time between 10 and 35 degrees Celsius, and <=10 percent of the time between 5 and 40 degrees Celsius.
Humidity	40 to 55 percent relative humidity
Weight	A fully configured cabinet must sit on at least two floor tiles, and can weigh approximately 1588 kilograms (3500 pounds).
Altitude	0 meters to 2439 meters (0 to 8,000 ft) above sea level operating altitude.

Weight is a critical factor to keep in mind, particularly with the chassis-based nodes. Individual 4RU chassis can weigh up to around 300lbs each, and the maximum floor tile capacity for each individual cabinet or rack must be kept in mind. For the deep node styles (H7000, H5600, A3000 and A2000), the considerable node weight may prevent racks from being fully populated with PowerScale equipment. If the cluster uses a variety of node types, installing the larger, heavier nodes at the bottom of each rack and the lighter chassis at the top can help distribute weight evenly across the cluster racks’ floor tiles.

Note that there are no lift handles on the PowerScale 4RU chassis. However, the drive sleds can be removed to provide handling points if no lift is available. With all the drive sleds removed, but leaving the rear compute modules inserted, the chassis weight drops to a more manageable 115lbs or so. It is strongly recommended to use a lift for installation of 4RU chassis.

Cluster backend switches ship with the appropriate rails (or tray) for proper installation of the switch in the rack. These rail kits are adjustable to fit NEMA front rail to rear rail spacing ranging from 22 in to 34 in.

Note that some manufacturers’ Ethernet switch rails are designed to overhang the rear NEMA rails, helping to align the switch with the PowerScale chassis at the rear of the rack. These require a minimum clearance of 36 in from the front NEMA rail to the rear of the rack, in order to ensure that the rack door can be closed.

Consider the following large cluster topology, for example:

This contiguous rack architecture is designed to scale up to the current maximum PowerScale cluster size of 252 nodes, in 63 4RU chassis, across nine racks as the environment grows – while still keeping cable management relatively simple. Note that this configuration assumes 1RU per node. If using F900 nodes, which are 2RU in size, additional rack capacity should be budgeted for.
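
The figures above can be reproduced with some simple arithmetic. The sketch below assumes the chassis are spread evenly across all of the racks, which may not match a real layout where one rack is reserved largely for switches. The function name is hypothetical.

```python
import math

def chassis_layout(nodes, racks, nodes_per_chassis=4, ru_per_chassis=4):
    """Return (total chassis, chassis per rack, rack units per rack)."""
    chassis = math.ceil(nodes / nodes_per_chassis)
    per_rack = math.ceil(chassis / racks)
    return chassis, per_rack, per_rack * ru_per_chassis

total, per_rack, ru = chassis_layout(nodes=252, racks=9)
print(total, per_rack, ru)
```

Under these assumptions, each rack holds seven chassis in 28 rack units, leaving the remaining space for switches, PDUs, and cable management.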

Successful large cluster infrastructures depend heavily on the proficiency of the installer and their optimizations for maintenance and future expansion. Some good data center design practices include:

Pre-allocating and reserving adjacent racks in the same aisle to fully accommodate the anticipated future cluster expansion.
Reserving an empty ‘mailbox’ slot in the top half of each rack for any pass-through cable management needs.
Dedicating one of the racks in the group for the back-end and front-end distribution/spine switches (in this case, rack R3).

For Hadoop workloads, PowerScale clusters are compatible with the rack awareness feature of HDFS to provide balancing in the placement of data. Rack locality keeps the data flow internal to the rack.

Excess cabling can be neatly stored in 12” service coils on a cable tray above the rack, if available, or at the side of the rack as illustrated below.

The use of intelligent power distribution units (PDUs) within each rack can facilitate the remote power cycling of nodes, if desired.

For deep nodes such as the H7000 and A3000 hardware, where chassis depth can be a limiting factor, horizontally mounted PDUs within the rack can be used in place of vertical PDUs, if necessary. If front-mounted, partial depth Ethernet switches are deployed, horizontal PDUs can be installed in the rear of the rack directly behind the switches to maximize available rack capacity.

With copper cables (SFP+, QSFP, CX4, etc) the maximum cable length is typically limited to 10 meters or less. After allowing for cable dressing to maintain some level of organization and proximity within the racks and cable trays, all the racks with PowerScale nodes need to be in close physical proximity to each other, either in the same rack row or close by in an adjacent row. The alternative is to adopt a leaf-spine topology, with leaf switches in each rack.

If greater physical distance between nodes is required, support for multimode fiber (QSFP+, MPO, LC, etc) extends the cable length limitation to 150 meters. This allows nodes to be housed on separate floors or on the far side of a floor in a datacenter if necessary. While solving the floor space problem, this does have the potential to introduce new administrative and management challenges.

The various cable types, form factors, and supported lengths available for PowerScale nodes are as follows:

Cable Form Factor	Medium	Speed (Gb/s)	Max Length (m)
QSFP28	Optical	100	30
MPO	Optical	100/40	150
QSFP28	Copper	100	5
QSFP+	Optical	40	10
LC	Optical	25/10	150
QSFP+	Copper	40	5
SFP28	Copper	25	5
SFP+	Copper	10	7
CX4	Copper	IB QDR/DDR	10

The connector types for the cables above can be identified as follows:

As for the nodes themselves, the following rear views indicate the locations of the various network interfaces:

Note that int-a and int-b indicate the primary and secondary back-end networks, whereas ext-1 and ext-2 are the front-end client network interfaces.

Be aware that damage to the InfiniBand or Ethernet cables (copper or optical fiber) can negatively affect cluster performance. Never bend cables beyond the recommended bend radius, which is typically 10–12 times the diameter of the cable. For example, if a cable is 1.6 inches in diameter, round up to 2 inches and multiply by 10 for an acceptable bend radius of 20 inches.

Cables differ, so follow the explicit recommendations of the cable manufacturer.
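
Where no manufacturer figure is to hand, the rule of thumb above can be expressed as a small helper. Rounding the diameter up to the next whole inch generalizes the single worked example and is an assumption, as is the function name.

```python
import math

def min_bend_radius_in(diameter_in, multiplier=10):
    """Rule-of-thumb bend radius: round the diameter up, then multiply.

    The multiplier is typically 10 to 12; a manufacturer's explicit
    figure should always take precedence over this estimate.
    """
    return math.ceil(diameter_in) * multiplier

print(min_bend_radius_in(1.6))                 # the worked example above
print(min_bend_radius_in(1.6, multiplier=12))  # more conservative multiplier
```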

The most important design attribute for bend radius consideration is the minimum mated cable clearance (Mmcc). Mmcc is the distance from the bulkhead of the chassis through the mated connectors/strain relief including the depth of the associated 90 degree bend. Multimode fiber has many modes of light (fiber optic) traveling through the core. As each of these modes moves closer to the edge of the core, light and the signal are more likely to be reduced, especially if the cable is bent. In a traditional multimode cable, as the bend radius is decreased, the amount of light that leaks out of the core increases, and the signal decreases. Best practices for data cabling include:

Keep cables away from sharp edges or metal corners.
Avoid bundling network cables with power cables. If network and power cables are not bundled separately, electromagnetic interference (EMI) can affect the data stream.
When bundling cables, do not pinch or constrict the cables.
Avoid using zip ties to bundle cables. Instead, use Velcro hook-and-loop ties, which do not have hard edges and can be removed without cutting. Fastening cables with Velcro ties also reduces the impact of gravity on the bend radius.

Note that the effects of gravity can also decrease the bend radius and result in degradation of signal power and quality.

Cables, particularly when bundled, can also obstruct the movement of conditioned air around the cluster, and cables should be secured away from fans, etc. Flooring seals and grommets can be useful to keep conditioned air from escaping through cable holes. Also ensure that smaller Ethernet switches are drawing cool air from the front of the rack, not from inside the cabinet. This can be achieved either with switch placement or by using rack shelving.

OneFS Hardware Environmental and Logistical Considerations

In this article, we turn our attention to some of the environmental and logistical aspects of cluster design, installation and management.

In addition to available rack space and physical proximity of nodes, provision needs to be made for adequate power and cooling as the cluster expands. New generations of drives and nodes typically deliver increased storage density, which often magnifies the power draw and cooling requirements per rack unit.

The recommendation is for a large cluster’s power supply to be fully redundant and backed up with a battery UPS and/or power generator. In the worst instance, if a cluster does lose power, the nodes are protected internally by filesystem journals, which preserve any in-flight uncommitted writes. However, the time to restore power and bring up a large cluster from an unclean shutdown can be considerable.

Like most data center equipment, the cooling fans in PowerScale nodes and switches pull air from the front to back of the chassis. To complement this, data centers often employ a hot aisle/cold aisle rack configuration, where cool, low humidity air is supplied in the aisle at the front of each rack or cabinet either at the floor or ceiling level, and warm exhaust air is returned at ceiling level in the aisle to the rear of each rack.

Given the significant power draw, heat density, and weight of cluster hardware, some datacenters are limited in the number of nodes each rack can support. For partially filled racks, the use of blank panels to cover the front and rear of any unfilled rack units can help to efficiently direct airflow through the equipment.

The table below shows the various front and back-end network speeds and connector form factors across the PowerScale storage node portfolio.

Speed (Gb/s)	Form Factor	Front-end/ Back-end	Supported Nodes
100/40	QSFP28	Back-end	F900, F600, H700, H7000, A300, A3000, P100, B100
40 QDR	QSFP+	Back-end	F800, F810, H600, H5600, H500, H400, A200, A2000
25/10	SFP28	Back-end	F900, F600, F200, H700, H7000, A300, A3000, P100, B100
10 QDR	QSFP+	Back-end	H400, A200, A2000
100/40	QSFP28	Front-end	F900, F600, H700, H7000, A300, A3000, P100, B100
40 QDR	QSFP+	Front-end	F800, F810, H600, H5600, H500, H400, A200, A2000
25/10	SFP28	Front-end	F900, F600, F200, H700, H7000, A300, A3000, P100, B100
25/10	SFP+	Front-end	F800, F810, H600, H5600, H500, H400, A200, A2000
10 QDR	SFP+	Front-end	F800, F810, H600, H5600, H500, H400, A200, A2000

With large clusters, especially when the nodes may not be racked in a contiguous manner, having all the nodes and switches connected to serial console concentrators and remote power controllers is highly advised. However, to perform any physical administration or break/fix activity on nodes you must know where the equipment is located and have administrative resources available to access and service all locations.

As such, the following best practices are recommended:

Develop and update thorough physical architectural documentation.
Implement an intuitive cable coloring standard.
Be fastidious and consistent about cable labeling.
Use the appropriate length of cable for the run and create a neat 12” loop of any excess cable, secured with Velcro.
Observe appropriate cable bend ratios, particularly with fiber cables.
Dress cables and maintain a disciplined cable management ethos.
Keep a detailed cluster hardware maintenance log.
Where appropriate, maintain a ‘mailbox’ space for cable management.

Disciplined cable management and labeling for ease of identification is particularly important in larger PowerScale clusters, where density of cabling is high. Each chassis can require up to twenty-eight cables, as shown in the table below:

Cabling Component	Medium	Cable Quantity per Chassis
Back end network	Ethernet or Infiniband	8
Front end network	Ethernet	8
Management Interface	1Gb Ethernet	4
Serial Console	DB9 RS 232	4
Power cord	110V or 220V AC power	4
Total		28
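
As a rough planning aid, the per-chassis figures in the table above can be multiplied out for a whole cluster. The following Python sketch is illustrative only: the function and dictionary names are hypothetical, and the counts are simply those from the table.

```python
# Per-chassis cable counts, taken from the table above.
CABLES_PER_CHASSIS = {
    "back_end_network": 8,
    "front_end_network": 8,
    "management_interface": 4,
    "serial_console": 4,
    "power_cord": 4,
}


def cables_for_cluster(chassis_count):
    """Return the cable count per component, plus a total, for a number of chassis."""
    if chassis_count < 0:
        raise ValueError("chassis_count must be non-negative")
    counts = {name: qty * chassis_count for name, qty in CABLES_PER_CHASSIS.items()}
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    # Example: a hypothetical cluster of 10 chassis (40 nodes).
    for component, quantity in cables_for_cluster(10).items():
        print(f"{component}: {quantity}")
```

A tally like this can help when ordering cables and labels ahead of an installation or expansion.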

The recommendation for cabling a PowerScale chassis is as follows:

Split cabling in the middle of the chassis, between nodes 2 and 3.
Route Ethernet and Infiniband cables towards the lower side of the chassis.
Connect power cords for nodes 1 and 3 to PDU A and power cords for nodes 2 and 4 to PDU B.
Bundle network cables with the AC power cords for ease of management.
Leave enough cable slack for servicing each individual node’s FRUs.

The stand-alone F-series all-flash nodes, in particular the 1RU F600 and F200 nodes, have a similar density of cabling per rack unit:

Cabling Component	Medium	Cable Quantity per F-series node
Back end network	10 or 40 Gb Ethernet or QDR Infiniband	2
Front end network	10 or 40Gb Ethernet	2
Management Interface	1Gb Ethernet	1
Serial Console	DB9 RS 232	1
Power cord	110V or 220V AC power	2
Total		8

Consistent and meticulous cable labeling and management is particularly important in large clusters. PowerScale chassis that employ both front and back end Ethernet networks can include up to twenty Ethernet connections per 4RU chassis.

In each node’s compute module, there are two PCI slots for the Ethernet cards (NICs). Viewed from the rear of the chassis, in each node the right hand slot (HBA Slot 0) houses the NIC for the front end network, and the left hand slot (HBA Slot 1) houses the NIC for the back end network. In addition, there is a separate built-in 1Gb Ethernet port on each node for cluster management traffic.

While there is no requirement that node 1 aligns with port 1 on each of the backend switches, it can certainly make cluster and switch management and troubleshooting considerably simpler. Even if exact port alignment is not possible, with large clusters, ensure that the cables are clearly labeled and connected to similar port regions on the backend switches.

PowerScale nodes and the drives they contain have identifying LED lights to indicate when a component has failed and to allow proactive identification of resources. The ‘isi led’ CLI command can be used to proactively illuminate specific node and drive indicator lights to aid in identification.

Drive repair times depend on a variety of factors:

OneFS release (determines Job Engine version and how efficiently it operates)
System hardware (determines drive types, amount of CPU and RAM, etc)
Filesystem: Amount of data, data composition (lots of small vs large files), protection, tunables, etc.
Load on the cluster during the drive failure

A useful method to estimate future FlexProtect runtime is to use old repair runtimes as a guide, if available.

The drives in the PowerScale chassis-based platforms use a bay-grid nomenclature, where A-E indicates the sled and 0-5 indicates the drive position within the sled. The drive closest to the front is 0, whereas the drive closest to the back is 2, 3, or 5, depending on the drive sled type.
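
To make the bay-grid naming concrete, here is a minimal Python sketch that splits a bay label into its sled letter and drive position. The function name and the validation pattern are hypothetical and are not part of any OneFS tooling.

```python
import re

# Bay labels on chassis-based nodes: sled letter A-E followed by a drive position digit.
_BAY_PATTERN = re.compile(r"^([A-E])(\d)$")


def parse_bay(label):
    """Split a bay label such as 'C1' into (sled, position).

    Position 0 is the drive closest to the front of the sled.
    """
    match = _BAY_PATTERN.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a sled-based bay label: {label!r}")
    return match.group(1), int(match.group(2))
```

For instance, parse_bay("E2") identifies sled E, position 2, which on a three-drive sled is the drive closest to the back.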

When it comes to updating and refreshing hardware in a large cluster, swapping nodes can be a lengthy process of somewhat unpredictable duration. Data has to be evacuated from each old node during the Smartfail process prior to its removal, and restriped and balanced across the new hardware’s drives. During this time there will also be potentially impactful group changes as new nodes are added and the old ones removed.

However, if replacing an entire node-pool as part of a tech refresh, a SmartPools filepool policy can be crafted to migrate the data to another nodepool across the back-end network. When complete, the nodes can then be Smartfailed out, which should progress swiftly since they are now empty.

If multiple nodes are Smartfailed simultaneously, the node removals at the final stage of the process are serialized, with a pause of around 60 seconds between each. The Smartfail job places the selected nodes in read-only mode while it copies the protection stripes to the cluster’s free space. Using SmartPools to evacuate data from a node or set of nodes in preparation for removing them is generally a good idea, and is usually a relatively fast process.

OneFS Multi-tenancy and Zone-aware Access Control

Multi-tenancy in OneFS is predicated upon the following four areas:

Area	Feature	Description
Security	Multi-tenant Access Zones	Share- and export-level access control for departmental segregation
Data	SmartPools	Nodepools and Tiers for data segregation
Networking	SmartConnect	Groupnets and Zones for network segmentation
Administration	RBAC	Data access and administration separation

For authentication and access control, OneFS multi-tenant access zones provide a way to associate the cluster with multiple sets of auth providers to provide varied access to cluster resources. Each zone contains the necessary configuration to support authentication and identity management services for client access to OneFS.

A combination of SmartConnect zones, node pools, and multi-tenant access zones enables the separation of authentication providers into different groups and provides a mechanism to limit data access to specific node groups, network interfaces, directory hierarchies and file system areas. The following image shows three ‘tenants’ (in this case business units in an enterprise), each with their own subset of cluster resources: network pool, node pool, and authentication & identity management infrastructure.

Within OneFS, role-based access control (RBAC) provides the ability to grant cluster administrators the necessary privileges to perform various tasks through the Platform API, such as creating, modifying, and viewing NFS exports, SMB shares, authentication providers, and various cluster settings. For example, data center operations staff can be assigned read-only rights to the entire cluster, allowing full monitoring access but no configuration changes to be made. OneFS provides a collection of built-in roles, including audit, system, and security administrator, plus the ability to create custom roles, either per multi-tenant access zone or across the cluster. Role-based administration is integrated with the OneFS command line interface, WebUI and Platform API.

OneFS RBAC enables roles and a subset of zone-aware privileges to be assigned on a per-tenant access zone basis. This allows administrative tasks covered by the zone-aware privileges to be permitted inside a specific multi-tenant access zone, by defining a ‘local administrator’ for that zone. A user in the System zone still has the ability to administer all other multi-tenant access zones, and so remains a ‘global administrator’. However, a user in a non-System multi-tenant access zone can also be given a set of privileges to act as a ‘local administrator’ for that particular zone, as well as being able to view (but not modify) global settings related to those privileges.

The following privileges are available in non-System multi-tenant access zones:

Privilege	Description
ISI_PRIV_LOGIN_PAPI	Log in to Platform API and WebUI
ISI_PRIV_AUTH	Configure identities, roles and authentication providers
ISI_PRIV_AUTH_GROUPS	User groups from authentication provider
ISI_PRIV_AUTH_PROVIDERS	Configure Auth providers
ISI_PRIV_AUTH_RULES	User mapping rules.
ISI_PRIV_AUTH_SETTINGS_ACLS	Configure ACL policy settings
ISI_PRIV_AUTH_SETTINGS_GLOBAL	Configure global authentication settings
ISI_PRIV_AUTH_USERS	Users from authentication providers
ISI_PRIV_AUTH_ZONES	Configure multi-tenant access zones
ISI_PRIV_RESTRICTED_AUTH	Configure identities with the same or lesser privilege
ISI_PRIV_RESTRICTED_AUTH_GROUPS	Configure identities with the same or lesser privilege
ISI_PRIV_RESTRICTED_AUTH_USERS	Configure identities with the same or lesser privilege
ISI_PRIV_ROLE	Create new roles and assign privileges
ISI_PRIV_AUDIT	Configure audit capabilities
ISI_PRIV_FILE_FILTER	Configure File Filtering based on file types
ISI_PRIV_FILE_FILTER_SETTINGS	File Filtering service and filter settings
ISI_PRIV_HDFS	Setup HDFS Filesystem, service, users and settings
ISI_PRIV_HDFS_FSIMAGE_JOB_SETTINGS	HDFS FSImage job settings
ISI_PRIV_HDFS_FSIMAGE_SETTINGS	HDFS FSImage service settings
ISI_PRIV_HDFS_INOTIFY_SETTINGS	HDFS Inotify service settings
ISI_PRIV_HDFS_PROXYUSERS	Proxy users and members
ISI_PRIV_HDFS_RACKS	HDFS virtual rack settings
ISI_PRIV_HDFS_RANGERPLUGIN_SETTINGS	Settings for the HDFS ranger plugin
ISI_PRIV_HDFS_SETTINGS	HDFS Service, protocol and ambari server settings
ISI_PRIV_NFS	Setup NFS Service, exports and configure settings
ISI_PRIV_NFS_ALIASES	Aliases for export directory names
ISI_PRIV_NFS_EXPORTS	NFS Exports and permissions
ISI_PRIV_NFS_SETTINGS	NFS export and other settings
ISI_PRIV_NFS_SETTINGS_EXPORT	NFS export and user mapping settings
ISI_PRIV_NFS_SETTINGS_GLOBAL	NFS global and service settings
ISI_PRIV_NFS_SETTINGS_ZONE	NFS zone related settings
ISI_PRIV_PAPI_CONFIG	Configure the Platform API and WebUI
ISI_PRIV_S3	Setup S3 Buckets and configure settings
ISI_PRIV_S3_BUCKETS	S3 buckets and ACL
ISI_PRIV_S3_MYKEYS	S3 key management
ISI_PRIV_S3_SETTINGS	S3 global and zone settings
ISI_PRIV_S3_SETTINGS_GLOBAL	S3 global and service settings
ISI_PRIV_S3_SETTINGS_ZONE	S3 zone related settings
ISI_PRIV_SMB	Setup SMB Service, shares and configure settings
ISI_PRIV_SMB_SESSIONS	Active SMB sessions
ISI_PRIV_SMB_SETTINGS	View and manage SMB settings
ISI_PRIV_SMB_SETTINGS_GLOBAL	SMB global and service settings
ISI_PRIV_SMB_SETTINGS_SHARE	SMB filter and share Settings
ISI_PRIV_SMB_SHARES	SMB shares and permissions
ISI_PRIV_SWIFT	Configure Swift
ISI_PRIV_VCENTER	Configure VMware vCenter
ISI_PRIV_IFS_BACKUP	Backup files from /ifs
ISI_PRIV_IFS_RESTORE	Restore files to /ifs
ISI_PRIV_NS_TRAVERSE	Traverse and view directory metadata
ISI_PRIV_NS_IFS_ACCESS	Access /ifs via RESTful Access to Namespace service

Additionally, two built-in roles are provided by default in a non-System multi-tenant access zone:

Role	Description	Privileges
ZoneAdmin	Allows administration of aspects of configuration related to the current multi-tenant access zone	ISI_PRIV_LOGIN_PAPI, ISI_PRIV_AUDIT, ISI_PRIV_FILE_FILTER, ISI_PRIV_HDFS, ISI_PRIV_NFS, ISI_PRIV_SMB, ISI_PRIV_SWIFT, ISI_PRIV_VCENTER, ISI_PRIV_NS_TRAVERSE, ISI_PRIV_NS_IFS_ACCESS
ZoneSecurityAdmin	Allows administration of aspects of security configuration related to the current multi-tenant access zone	ISI_PRIV_LOGIN_PAPI, ISI_PRIV_AUTH, ISI_PRIV_ROLE

Note that neither of these roles has any default users automatically assigned.

With RBAC, an authentication provider created from the System multi-tenant access zone can be viewed and used by all other multi-tenant access zones. However, it can only be modified or deleted from the System multi-tenant access zone.

Be aware that a Kerberos provider can only be created from the System multi-tenant access zone.

With RBAC, an authentication provider created from a non-System multi-tenant access zone can only be used by that specific multi-tenant access zone. However, it can be administered (i.e. viewed, modified, or deleted) from either its own multi-tenant access zone or the System multi-tenant access zone.

A local provider from a non-System multi-tenant access zone can only be used by that specific non-System multi-tenant access zone. It cannot be used by other multi-tenant access zones, including the System multi-tenant access zone, and can only be viewed or modified from that specific non-System multi-tenant access zone or from the System multi-tenant access zone.

To support zone-aware RBAC, the OneFS WebUI includes a ‘current access zone’ field to select the desired multi-tenant access zone. This is located under Access > Membership and Roles:

An ‘instance’ configuration field is used to identify specific AD authentication providers. Since multiple AD authentication providers can refer to the same domain, each must specify an instance name and a machine account name.

Each multi-tenant access zone’s administrator(s) can create their own AD authentication provider to connect to the same domain. This can be configured from either the WebUI or the CLI. However, each multi-tenant access zone can still only include a single AD authentication provider.

In the WebUI, the ‘instance name’ field appears under the ‘Add an Active Directory provider’ configuration screen, accessed by navigating to Access > Authentication Providers > Active Directory > Join a Domain:

Similarly, the ‘isi auth ads create’ CLI command sees the addition of ‘--instance’ and ‘--machine-account’ arguments. For example:

# isi auth ads create --name=ad.isilon.com --user=Administrator --instance=ad1 --machine-account=my-isilon123

The ‘instance’ name can then be used to reference the AD provider:

# isi auth ads view ad1

To illustrate, take the following roles and privileges example. The smb2 role is created in zone2 and a user, smbuser2, added to the smb2 role:

# isi auth roles create --zone=zone2 smb2

# isi auth users create --zone=zone2 smbuser2

# isi auth roles modify smb2 --zone=zone2 --add-priv=ISI_PRIV_LOGIN_PAPI --add-priv=ISI_PRIV_SMB --add-user=smbuser2

# isi auth roles modify smb2 --zone=zone2 --add-priv=ISI_PRIV_NETWORK

Privilege ISI_PRIV_NETWORK cannot be added to role in non-System access zone

# isi smb shares create --zone=zone2 share2 /ifs/data/zone2

# isi_run -z zone2 -l smbuser2 isi smb shares list

Share Name   Path

---------------------------

share2       /ifs/data/zone2

---------------------------

Total: 1

# isi_run -z zone2 -l smbuser2 isi nfs exports list

Privilege check failed. The following read privilege is required: NFS (ISI_PRIV_NFS)

As shown above, smbuser2 can log into the WebUI via zone2 and can create, modify, and view SMB shares and settings in zone2. This user can also view, but not modify, global SMB configuration settings. However, smbuser2 is not able to view or modify shares in other multi-tenant access zones.
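
The behavior in the roles and privileges example above can be approximated with a small Python model. This is only a conceptual sketch: the class, its methods, and the reduced privilege set are hypothetical, it covers only the ‘local administrator’ case, and it is not how OneFS implements RBAC internally.

```python
# A small subset of the zone-aware privileges listed earlier (illustrative, not exhaustive).
ZONE_AWARE_PRIVILEGES = {
    "ISI_PRIV_LOGIN_PAPI",
    "ISI_PRIV_AUTH",
    "ISI_PRIV_ROLE",
    "ISI_PRIV_SMB",
    "ISI_PRIV_NFS",
}


class Role:
    """Toy model of a role that lives in a single access zone."""

    def __init__(self, name, zone):
        self.name = name
        self.zone = zone
        self.privileges = set()
        self.members = set()

    def add_privilege(self, privilege):
        # Roles outside the System zone may only hold zone-aware privileges.
        if self.zone != "System" and privilege not in ZONE_AWARE_PRIVILEGES:
            raise PermissionError(
                f"Privilege {privilege} cannot be added to role in non-System access zone"
            )
        self.privileges.add(privilege)

    def allows(self, user, zone, privilege):
        """True if the user is a member and the role grants the privilege in that zone."""
        return user in self.members and zone == self.zone and privilege in self.privileges


# Mirror the example: role smb2 in zone2, with smbuser2 as its only member.
smb2 = Role("smb2", zone="zone2")
smb2.add_privilege("ISI_PRIV_LOGIN_PAPI")
smb2.add_privilege("ISI_PRIV_SMB")
smb2.members.add("smbuser2")
```

In this model, smb2.allows("smbuser2", "zone2", "ISI_PRIV_SMB") is true, while the same check for ISI_PRIV_NFS, or for any other zone, is false, matching the CLI session above.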

When connecting to a cluster via SSH to perform CLI configuration, be aware that SSH access is still only available from the System multi-tenant access zone. As such, administrators from non-System multi-tenant access zones will only be able to use the WebUI or Platform API to perform cluster configuration and management.

The ‘isi auth roles’ CLI command offers a ‘--zone’ argument to report on specific multi-tenant access zones:

# isi auth roles list --zone=zone2

With no ‘--zone’ option specified, this command returns a list of roles in the current multi-tenant access zone.

Note that multiple instances connected to the same AD provider can be configured so long as each has a unique machine account name.

To help with troubleshooting permissions issues, the ‘isi_run’ CLI utility can be used to run OneFS CLI commands as if they were issued from a non-System multi-tenant access zone. For example:

# isi_run -z zone2 isi auth roles list

Note that a multi-tenant access zone name can be used as well as the zone ID as an argument for the -z (zone) flag option, as above.

In Kerberos environments, in order to work with AD, the appropriate service principal name (SPN) must be created in the appropriate Active Directory machine accounts, and duplicate SPNs must be avoided.

OneFS provides the ‘isi auth ads spn’ CLI command set to verify and manage SPN configuration. Command options include:

Show which SPNs are missing or extra as compared to expected SPNs:

# isi auth ads spn check <AD-instance-name>

Add and remove missing/extra SPNs:

# isi auth ads spn fix <AD-instance-name> --user=<Administrator>

Add, but avoid removing extra SPNs:

# isi auth ads spn fix <AD-instance-name> --noremove

Add an SPN, and add to ‘expected SPN’ list:

# isi auth ads spn create <AD-instance-name> <SPN>

Delete an SPN and remove from ‘expected SPN’ list:

# isi auth ads spn delete <AD-instance-name> <SPN>
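
Conceptually, the ‘missing’ and ‘extra’ SPNs reported by a check are the two set differences between the expected SPN list and what is registered on the machine account. The Python sketch below illustrates the idea with exact string matching. The function name and sample SPNs are hypothetical, and it does not reflect how ‘isi auth ads spn check’ is implemented.

```python
def compare_spns(expected, registered):
    """Return (missing, extra) SPN lists using exact string comparison."""
    expected_set, registered_set = set(expected), set(registered)
    missing = sorted(expected_set - registered_set)   # expected but not registered
    extra = sorted(registered_set - expected_set)     # registered but not expected
    return missing, extra


if __name__ == "__main__":
    # Hypothetical SPNs for illustration only.
    expected = ["HOST/cluster.example.com", "nfs/cluster.example.com"]
    registered = ["HOST/cluster.example.com", "HTTP/old.example.com"]
    missing, extra = compare_spns(expected, registered)
    print("missing:", missing)
    print("extra:", extra)
```

In these terms, a ‘fix’ adds the missing entries and removes the extra ones, while ‘--noremove’ only adds.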

OneFS Environmentals Reporting

The Energy Star for storage initiative is a set of SNIA-defined criteria, developed in conjunction with the EPA and DoE, for evaluating the energy efficiency of a storage system. While earlier OneFS versions adhered to Energy Star 1.1 from 2014, OneFS 9.2 and later support the latest Energy Star 2.0 version. This allows a compliance engineer or administrator to easily query inlet temperature and power consumption via a simple, convenient interface. It also enables third-party datacenter management software to access environmental data via a standard network connection and take appropriate actions if an anomaly is reported. Energy Star information is available via the CLI on clusters running in both enterprise and compliance mode.

Under the hood, the power and temperature data is retrieved through the IPMI interface. Command support is at the node level, and values are cached for up to five seconds after retrieval.

The CLI syntax involves the ‘isi_estar’ command, which can be used to obtain energy star data from the following supported PowerScale platforms:

F-series: F200, F600, F900, F800, F810
H-series: H400, H500, H5600, H600, H700, H7000
A-series: A200, A2000, A300, A3000
Accelerators: P100, B100

For example, the output from an F600 node reports:

# isi_for_array -s isi_estar

f600prime-1: Input power:             374.000

f600prime-1: Inlet air temperature:   19.000

Note that all previous generation platforms are unsupported and will return the following error:

 “ERROR – Error reading psi sensor name, platform not supported.”

However, for nodes that don’t support the ‘isi_estar’ CLI command, similar output can be obtained using the ‘isi_hw_status -wt’ CLI command syntax. For example:

# isi_hw_status -wt | grep -e Ambient_Temp -e Pwr_Consumption

Pwr_Consumption                 = 424.000

Ambient_Temp                    = 26.000

Since the ‘isi_estar’ command is node-local, in order to report on all the nodes in the cluster it can be run via the isi_for_array utility:

# isi_for_array -s isi_estar

f600prime-1: Input power:             374.000

f600prime-1: Inlet air temperature:   19.000

f600prime-2: Input power:             374.000

f600prime-2: Inlet air temperature:   20.000

f600prime-3: Input power:             374.000

f600prime-3: Inlet air temperature:   21.000

The ‘input power’ metric displays the combined power consumption in watts for both of a node’s power supplies.
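
For reporting purposes, output in this form is straightforward to aggregate. The following Python sketch parses text shaped like the ‘isi_for_array -s isi_estar’ output shown above and derives a cluster-wide power total and the warmest inlet reading. The function names are hypothetical and the parser assumes the exact ‘node: metric: value’ layout shown.

```python
import re

SAMPLE_OUTPUT = """\
f600prime-1: Input power:             374.000
f600prime-1: Inlet air temperature:   19.000
f600prime-2: Input power:             374.000
f600prime-2: Inlet air temperature:   20.000
f600prime-3: Input power:             374.000
f600prime-3: Inlet air temperature:   21.000
"""

# node name, metric label, numeric value
_LINE = re.compile(r"^(?P<node>\S+):\s+(?P<metric>[^:]+):\s+(?P<value>-?\d+(?:\.\d+)?)\s*$")


def parse_estar(text):
    """Return {node: {metric: value}}; blank or unrecognized lines are skipped."""
    readings = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            readings.setdefault(match["node"], {})[match["metric"]] = float(match["value"])
    return readings


def summarize(readings):
    """Return (total input power in watts, highest inlet air temperature)."""
    total_power = sum(r["Input power"] for r in readings.values())
    max_inlet = max(r["Inlet air temperature"] for r in readings.values())
    return total_power, max_inlet


if __name__ == "__main__":
    power, inlet = summarize(parse_estar(SAMPLE_OUTPUT))
    print(f"Cluster input power: {power:.0f} W, warmest inlet reading: {inlet:.0f}")
```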

To continuously sample isi_estar output at 30 second intervals, use the following syntax:

# while true; do date; isi_estar; sleep 30; done

Another useful utility for easily verifying the status of the nodes’ journal batteries in a cluster is the ‘isi batterystatus’ CLI command set. For example:

# isi batterystatus list

Lnn  Status1                           Status2  Result1  Result2

-----------------------------------------------------------------

1    Ready, enabled, and fully charged N/A      passed   N/A

2    Ready, enabled, and fully charged N/A      passed   N/A

3    Ready, enabled, and fully charged N/A      passed   N/A

4    Ready, enabled, and fully charged N/A      passed   N/A

Detailed status for a particular node is also available:

# isi batterystatus view

            Lnn: 1

        Status1: Ready and enabled

        Status2: N/A

        Result1: passed

        Result2: N/A

Last Test Time1: Mon Sep 12 23:59:50 2022

Next Test Time1: Fri Sep 23 09:59:50 2022

Last Test Time2: N/A

Next Test Time2: N/A

      Supported: Yes

        Present: Yes

For more detailed hardware and environmental data, the ‘isi_hw_status’ CLI command can also be useful with its plethora of information:

# isi_hw_status

  SerNo: JACNT194340156

 Config: 110-385-400B-04

ChsSerN: JACNN194420707

ChsSlot: 4

FamCode: H

ChsCode: 4U

GenCode: 0

PrfCode: 5

   Tier: 3

  Class: storage

 Series: n/a

Product: H500-4U-Single-128GB-1x1GE-2x10GE SFP+-30TB-1638GB SSD

  HWGen: PSI

Chassis: INFINITY (Infinity Chassis)

    CPU: GenuineIntel (2.20GHz, stepping 0x000406f1)

   PROC: Single-proc, 10-HT-core

    RAM: 137226121216 Bytes

   Mobo: INFINITY (Custom EMC Motherboard)

  NVRam: INFINITY (Infinity Memory Journal) (4096MB card) (size 4294967296B)

 DskCtl: PMC8074 (PMC 8074) (8 ports)

 DskExp: PMC8056I (PMC-Sierra PM8056 - Infinity)

PwrSupl: Slot3-PS0 (type=ARTESYN, fw=02.14)

PwrSupl: Slot4-PS1 (type=ARTESYN, fw=02.14)

  NetIF: bge0,mlxen0,mlxen1,mlxen2,mlxen3

 IBType: Unknown (None)

 BEType: 40GigE

 FEType: 10GigE

 LCDver: IsiVFD2 (Isilon VFD V2)

 Midpln: NONE (No Midplane Support)

Power Supplies OK

Power Supply Slot3-PS0 good

Power Supply Slot4-PS1 good

CPU Operation (raw 0x88290000)  = Normal

CPU Speed Limit                 = 100.00%

Fan0_Speed                      = 6600.000

Fan1_Speed                      = 6600.000

Slot3-PS0_In_Voltage            = 209.000

Slot4-PS1_In_Voltage            = 209.000

SP_CMD_Vin                      = 12.200

CMOS_Voltage                    = 3.080

Slot3-PS0_Input_Power           = 176.000

Slot4-PS1_Input_Power           = 216.000

Pwr_Consumption                 = 400.000

DIMM_Bank0                      = 46.000

DIMM_Bank1                      = 47.000

CPU0_Temp                       = -36.000

SP_Temp0                        = 40.000

MP_Temp0                        = 27.000

Hottest_HDD_Temp                = 20.000

Ambient_Temp                    = 27.000

Slot3-PS0_Temp0                 = 64.000

Slot3-PS0_Temp1                 = 37.000

Slot4-PS1_Temp0                 = 66.000

Slot4-PS1_Temp1                 = 40.000

Battery0_Temp                   = 41.000

Altitude                        = 120.000

Note that, unlike the ‘isi_estar’ command, ‘isi_hw_status’ will run on all PowerScale and earlier generation Isilon nodes.
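
Because the sensor section of this output follows a simple ‘name = value’ layout, it lends itself to scripted checks. The Python sketch below parses the numeric sensor lines into a dictionary and filters them against a limit. The function names and the 60.0 threshold are arbitrary, hypothetical choices for illustration, not OneFS alert thresholds.

```python
import re

SAMPLE = """\
Power Supplies OK
CPU Speed Limit                 = 100.00%
Slot3-PS0_Input_Power           = 176.000
Slot4-PS1_Input_Power           = 216.000
Pwr_Consumption                 = 400.000
CPU0_Temp                       = -36.000
Ambient_Temp                    = 27.000
Slot4-PS1_Temp0                 = 66.000
"""

# A single-token sensor name, '=', then a plain number.
_SENSOR = re.compile(r"^(?P<name>\S+)\s+=\s+(?P<value>-?\d+(?:\.\d+)?)$")


def parse_sensors(text):
    """Return {sensor_name: float}; status lines and non-numeric values are skipped."""
    sensors = {}
    for line in text.splitlines():
        match = _SENSOR.match(line.strip())
        if match:
            sensors[match["name"]] = float(match["value"])
    return sensors


def over_limit(sensors, name_contains, limit):
    """Return the sensors whose name contains a substring and whose value exceeds limit."""
    return {name: value for name, value in sensors.items()
            if name_contains in name and value > limit}


if __name__ == "__main__":
    print(over_limit(parse_sensors(SAMPLE), "Temp", 60.0))
```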

Options for the ‘isi_hw_status’ CLI command include:

Flag	Description
-S pct	set CPU speed throttling to given percentage
-C	clear CPU overtemp indicator
-L	use list-format
-T	use table-format
-H	suppress headers in table-format mode
-HH	shows _only_ the headers in table-format mode
-c	show system components
-d	show debug-level system components (-dd for more)
-F	show system component flags
-i	show system identification
-s	show system state
-f	show fan speeds
-v	show voltages
-a	show currents
-w	show watts
-t	Show temperatures
-m	Show meters
-I	Show miscellaneous inputs
-b	Show NVRAM status
-V	Turns on verbose output
-g	Turns on debug output
-q	Turns on quiet mode (suppresses extra verbose/debug output)
-A	Show all output
-r	Disable IPMI command cache reads on IPMI-based platforms
-P	Probe and update hardware state info in PSI

For example, the ‘isi_hw_status’ command run with the ‘-sfvt’ flags can be useful to show a node’s environmental data:

# isi_hw_status -sfvt

Power Supplies OK

Power Supply Slot3-PS0 good

Power Supply Slot4-PS1 good

CPU Operation (raw 0x88290000)  = Normal

CPU Speed Limit                 = 100.00%

Fan0_Speed                      = 6600.000

Fan1_Speed                      = 6600.000

Slot3-PS0_In_Voltage            = 212.000

Slot4-PS1_In_Voltage            = 213.000

SP_CMD_Vin                      = 12.200

CMOS_Voltage                    = 3.080

DIMM_Bank0                      = 43.000

DIMM_Bank1                      = 44.000

CPU0_Temp                       = -36.000

SP_Temp0                        = 38.000

MP_Temp0                        = 26.000

Hottest_HDD_Temp                = 20.000

Ambient_Temp                    = 26.000

Slot3-PS0_Temp0                 = 58.000

Slot3-PS0_Temp1                 = 33.000

Slot4-PS1_Temp0                 = 62.000

Slot4-PS1_Temp1                 = 37.000

Battery0_Temp                   = 38.000

Note that the ‘isi_hw_status’ command output can also be viewed in table format with the ‘-T’ flag, in order to aid readability.

Additionally, the OneFS stats engine records virtually all of the sensor data, a list of which can be obtained by running:

# isi statistics list keys list | grep sensor.temp

For example, the temperature sensors on an H700 include:

node.sensor.temp.celsius.Ambient_Temp

node.sensor.temp.celsius.Battery0_Temp

node.sensor.temp.celsius.CPU0_Temp

node.sensor.temp.celsius.DIMM_Bank0

node.sensor.temp.celsius.DIMM_Bank1

node.sensor.temp.celsius.Drive_IO0_Temp

node.sensor.temp.celsius.Embed_IO_Temp0

node.sensor.temp.celsius.Hottest_SAS_Drv

node.sensor.temp.celsius.MP_Temp0

node.sensor.temp.celsius.MP_Temp1

node.sensor.temp.celsius.SLIC0_Temp

node.sensor.temp.celsius.SLIC1_Temp

node.sensor.temp.celsius.SP_Temp0

node.sensor.temp.celsius.Slot1-PS0_Temp0

node.sensor.temp.celsius.Slot1-PS0_Temp1

node.sensor.temp.celsius.Slot2-PS1_Temp0

node.sensor.temp.celsius.Slot2-PS1_Temp1

OneFS keeps a recent in-memory history of all of the stats that the engine collects.

For a node’s HDD and SSD drives, the ‘isi devices drive’ CLI command syntax can be used to view status:

# isi devices drive list

Lnn  Location  Device    Lnum  State   Serial       Sled

---------------------------------------------------------

8    Bay  1    /dev/da1  15    L3      9VNX0JA00844 N/A

8    Bay  2    -         N/A   EMPTY                N/A

8    Bay  A0   /dev/da9  7     HEALTHY ZC23CH5P     A

8    Bay  A1   /dev/da2  14    HEALTHY ZC23CGX9     A

8    Bay  A2   /dev/da10 6     HEALTHY ZC23CGRE     A

8    Bay  B0   /dev/da3  13    HEALTHY ZC23C7WL     B

8    Bay  B1   /dev/da11 5     HEALTHY ZC23CH5Q     B

8    Bay  B2   /dev/da4  12    HEALTHY ZC23C8LQ     B

8    Bay  C0   /dev/da12 4     HEALTHY ZC23C8P0     C

8    Bay  C1   /dev/da5  11    HEALTHY ZC23C8C3     C

8    Bay  C2   /dev/da13 3     HEALTHY ZC23C8L1     C

8    Bay  D0   /dev/da6  10    HEALTHY ZC23C8FD     D

8    Bay  D1   /dev/da14 2     HEALTHY ZC23C7Z3     D

8    Bay  D2   /dev/da7  9     HEALTHY ZC23C874     D

8    Bay  E0   /dev/da15 1     HEALTHY ZC23C84D     E

8    Bay  E1   /dev/da8  8     HEALTHY ZC23C8LG     E

8    Bay  E2   /dev/da16 0     HEALTHY ZC23BC3X     E

---------------------------------------------------------

In this case, since it’s a chassis-based node, the drive location includes both bay and sled.
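
When checking many nodes, it can help to summarize this listing rather than read it row by row. The Python sketch below tallies drive states from text shaped like the ‘isi devices drive list’ output above. The function names are hypothetical, and the parser assumes the column layout shown, where the state is the sixth whitespace-separated field and the sled is the last.

```python
from collections import Counter

SAMPLE = """\
Lnn  Location  Device    Lnum  State   Serial       Sled
---------------------------------------------------------
8    Bay  1    /dev/da1  15    L3      9VNX0JA00844 N/A
8    Bay  2    -         N/A   EMPTY                N/A
8    Bay  A0   /dev/da9  7     HEALTHY ZC23CH5P     A
8    Bay  A1   /dev/da2  14    HEALTHY ZC23CGX9     A
8    Bay  B0   /dev/da3  13    HEALTHY ZC23C7WL     B
---------------------------------------------------------
"""


def parse_drives(text):
    """Yield (lnn, bay, state, sled) tuples from the data rows of the listing."""
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric LNN followed by the literal word 'Bay'.
        if len(fields) >= 7 and fields[0].isdigit() and fields[1] == "Bay":
            yield int(fields[0]), fields[2], fields[5], fields[-1]


def state_counts(text):
    """Return a Counter of drive states, e.g. HEALTHY, EMPTY, L3."""
    return Counter(state for _, _, state, _ in parse_drives(text))


if __name__ == "__main__":
    for state, count in state_counts(SAMPLE).items():
        print(f"{state}: {count}")
```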

For more extensive drive information, the ‘isi_radish’ CLI command is also useful for querying a variety of drive health and performance metrics, including drive and airflow temperatures. The ‘-T’ flag can be used to specifically report drive threshold violations:

# isi_radish -T

As mentioned previously, OneFS 9.0 and later releases also support the Intelligent Platform Management Interface protocol (IPMI), and, amongst other things, use this to gather the ‘isi_estar’ environmental data.

IPMI allows out-of-band console access and remote power control across a dedicated Ethernet interface via Serial over LAN (SoL). As such, IPMI provides true lights-out management for PowerScale and Isilon Gen6 nodes without the need for additional RS-232 serial port concentrators or PDU rack power controllers.

For example, IPMI enables individual nodes, or the entire cluster, to be powered on after maintenance, or gracefully powered down after a power outage if the cluster is operating on limited backup power.

Similarly, IPMI facilitates performing a Hard/Cold Reboot/Power Cycle, for example, if a node is unresponsive to OneFS.

IPMI is enabled, configured, and operated from the CLI via the ‘isi ipmi’ command set and a cluster’s console can easily be accessed using the IPMItool utility. IPMItool is available as part of most Linux distributions, or accessible through other proprietary tools.

For the PowerScale F900, F600, and F200 stand-alone nodes, the Dell iDRAC remote console option can also be accessed via an HTTPS web browser session to the default port 443 at a node’s IPMI address.

Note that in order to run the OneFS IPMI commands, the administrative account being used must have the ‘ISI_PRIV_IPMI’ RBAC privilege.

The following CLI syntax can be used to enable IPMI for DHCP:

# isi ipmi settings modify --enabled=True --allocation-type=dhcp

Similarly, to enable IPMI for a static IP address:

# isi ipmi settings modify --enabled=True --allocation-type=static

To enable IPMI for a range of IP addresses use:

# isi ipmi network modify --gateway=[gateway IP] --prefixlen=[prefix length] --ranges=[IP Range]

The power control and Serial over LAN features can be configured and viewed using the following CLI command syntax. For example:

# isi ipmi features list

ID            Feature Description           Enabled

----------------------------------------------------

Power-Control Remote power control commands Yes

SOL           Serial over Lan functionality Yes

----------------------------------------------------

To enable the power control feature:

# isi ipmi features modify Power-Control --enabled=True

To enable the Serial over LAN (SoL) feature:

# isi ipmi features modify SOL --enabled=True

The following CLI commands can be used to configure a single username and password to perform IPMI tasks across all nodes in a cluster. Note that usernames can be up to 16 characters in length, while the associated passwords must be 17-20 characters in length.

To configure the username and password, run the CLI command:

# isi ipmi user modify --username [Username] --set-password

To confirm the username configuration, use:

# isi ipmi user view

Username: admin
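
Since the length limits above are easy to trip over, a quick pre-check before setting the credentials can save a failed attempt. The Python sketch below encodes the limits stated above. The function name is hypothetical, and the minimum username length of one character is an assumption, since only the maximum is given.

```python
def validate_ipmi_credentials(username, password):
    """Return a list of problems; an empty list means the lengths are acceptable."""
    problems = []
    # Usernames: up to 16 characters (a minimum of 1 is assumed here).
    if not 1 <= len(username) <= 16:
        problems.append("username must be 1-16 characters")
    # Passwords: 17-20 characters.
    if not 17 <= len(password) <= 20:
        problems.append("password must be 17-20 characters")
    return problems


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(validate_ipmi_credentials("admin", "too-short"))
```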

On the client side, the ‘ipmitool’ command utility is ubiquitous in the Linux and UNIX world, and is included natively as part of most distributions. If not, it can easily be installed using the appropriate package manager, such as ‘yum’.

The ipmitool usage syntax is as follows:

[Linux Host:~]$ ipmitool -I lanplus -H [Node IP] -U [Username] -L OPERATOR -P [password]

For example, to execute power control commands:

ipmitool -I lanplus -H [Node IP] -U [Username] -L OPERATOR -P [password] power [command]

The ‘power’ command options above include status, on, off, cycle, and reset.
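
When scripting power operations across many nodes, building the argument list programmatically helps avoid typos and unsupported sub-commands. The Python sketch below assembles the ipmitool invocation shown above. The function name, address, and credentials are hypothetical placeholders, and the script only prints the command rather than running it.

```python
import shlex

# The power sub-commands listed above.
POWER_COMMANDS = ("status", "on", "off", "cycle", "reset")


def ipmitool_power_argv(node_ip, username, password, command):
    """Build the ipmitool argument list for a power command, mirroring the syntax above."""
    if command not in POWER_COMMANDS:
        raise ValueError(f"unsupported power command: {command!r}")
    return ["ipmitool", "-I", "lanplus", "-H", node_ip, "-U", username,
            "-L", "OPERATOR", "-P", password, "power", command]


if __name__ == "__main__":
    # Placeholder address and credentials; nothing is executed here.
    argv = ipmitool_power_argv("192.0.2.10", "admin", "example-password-123", "status")
    print(shlex.join(argv))
```

The resulting list could be handed to subprocess.run() on a management host. Note that, as in the syntax above, the ‘-P’ option places the password on the command line.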

OneFS Diagnostics

In addition to the /usr/bin/isi_gather_info tool, OneFS also provides both a GUI and common ‘isi’ CLI version of the tool – albeit with slightly reduced functionality. This means that a OneFS log gather can be initiated either from the WebUI, or via the ‘isi diagnostics’ CLI command set with the following syntax:

# isi diagnostics gather start

The diagnostics gather status can also be queried as follows:

# isi diagnostics gather status

Gather is running.

Once the command has completed, the gather tarfile can be found under /ifs/data/Isilon_Support.
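
For automation, the status command can be polled until the gather completes. The Python sketch below is one possible approach, under the assumption that any output other than the ‘Gather is running.’ message shown above means the gather is no longer active; the wording of other status messages is not covered here. The function names, polling interval, and retry limit are hypothetical.

```python
import subprocess
import time


def gather_is_running(run=subprocess.run):
    """Return True if 'isi diagnostics gather status' reports a running gather."""
    result = run(["isi", "diagnostics", "gather", "status"],
                 capture_output=True, text=True, check=False)
    # Assumption: only the 'Gather is running' message indicates an active gather.
    return "Gather is running" in result.stdout


def wait_for_gather(poll_seconds=30, max_polls=120, run=subprocess.run, sleep=time.sleep):
    """Poll until the gather is no longer running; return True if it finished in time."""
    for _ in range(max_polls):
        if not gather_is_running(run):
            return True
        sleep(poll_seconds)
    return False
```

The run and sleep parameters exist so the logic can be exercised off-cluster with stand-ins; on a cluster node, calling wait_for_gather() with the defaults would invoke the real CLI.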

The ‘isi diagnostics’ configuration can also be viewed and modified as follows:

# isi diagnostics gather settings view

                Upload: Yes

                  ESRS: Yes

         Supportassist: Yes

           Gather Mode: full

  HTTP Insecure Upload: No

      HTTP Upload Host:

      HTTP Upload Path:

     HTTP Upload Proxy:

HTTP Upload Proxy Port: -

            Ftp Upload: Yes

       Ftp Upload Host: ftp.isilon.com

       Ftp Upload Path: /incoming

      Ftp Upload Proxy:

 Ftp Upload Proxy Port: -

       Ftp Upload User: anonymous

   Ftp Upload Ssl Cert:

   Ftp Upload Insecure: No

Configuration options for the ‘isi diagnostics gather’ CLI command include:

Option	Description
--upload <boolean>	Enable gather upload.
--esrs <boolean>	Use ESRS for gather upload.
--gather-mode (incremental | full)	Type of gather: incremental, or full.
--http-insecure-upload <boolean>	Enable insecure HTTP upload on completed gather.
--http-upload-host <string>	HTTP Host to use for HTTP upload.
--http-upload-path <string>	Path on HTTP server to use for HTTP upload.
--http-upload-proxy <string>	Proxy server to use for HTTP upload.
--http-upload-proxy-port <integer>	Proxy server port to use for HTTP upload.
--clear-http-upload-proxy-port	Clear proxy server port to use for HTTP upload.
--ftp-upload <boolean>	Enable FTP upload on completed gather.
--ftp-upload-host <string>	FTP host to use for FTP upload.
--ftp-upload-path <string>	Path on FTP server to use for FTP upload.
--ftp-upload-proxy <string>	Proxy server to use for FTP upload.
--ftp-upload-proxy-port <integer>	Proxy server port to use for FTP upload.
--clear-ftp-upload-proxy-port	Clear proxy server port to use for FTP upload.
--ftp-upload-user <string>	FTP user to use for FTP upload.
--ftp-upload-ssl-cert <string>	Specifies the SSL certificate to use in FTPS connection.
--ftp-upload-insecure <boolean>	Whether to attempt a plain text FTP upload.
--ftp-upload-pass <string>	Password to use for FTP upload.
--set-ftp-upload-pass	Specify the FTP upload password interactively.

As mentioned above, ‘isi diagnostics gather’ does not present quite as broad an array of features as the isi_gather_info utility. This is primarily for security purposes, since ‘isi diagnostics’ does not require root privileges to run. Instead, a user account with the ‘ISI_PRIV_SYS_SUPPORT’ RBAC privilege is needed in order to run a gather from either the WebUI or ‘isi diagnostics gather’ CLI interface.

Once a gather is running, a second instance cannot be started from any other node until that instance finishes. Typically, a warning along the lines of the following will be displayed:

"It appears that another instance of gather is running on the cluster somewhere. If you would like to force gather to run anyways, use the --force-multiple-igi flag. If you believe this message is in error, you may delete the lock file here: /ifs/.ifsvar/run/gather.node."

This lock can be removed as follows:

# rm -f /ifs/.ifsvar/run/gather.node
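For readers unfamiliar with lock files, the behavior described above can be illustrated with a generic sketch. This is not the isi_gather_info implementation, just the common pattern of atomically creating a file to claim exclusive ownership; the function names and the temporary demo path are hypothetical.

```python
import os
import tempfile


def acquire_lock(path):
    """Try to create the lock file atomically; return False if it already exists."""
    try:
        # O_EXCL makes creation fail if another process created the file first.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for later inspection
    return True


def release_lock(path):
    """Remove the lock file, ignoring the case where it is already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


# Demo against a throwaway directory (hypothetical path, not the OneFS lock file).
lock_path = os.path.join(tempfile.mkdtemp(), "gather.lock")
print(acquire_lock(lock_path))  # first caller gets the lock
print(acquire_lock(lock_path))  # second caller is refused
release_lock(lock_path)
```

A stale lock left behind by a killed process has the same effect as a live one, which is why manual removal of the file is sometimes needed.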

A log gather can also be initiated from the OneFS WebUI by navigating to Cluster management > Diagnostics > Gather:

The WebUI also uses the ‘isi diagnostics’ platform API handler and so, like the CLI command, also offers a subset of the full isi_gather_info functionality.

A limited menu of configuration options is also available via the WebUI under Cluster management > Diagnostics > Gather settings:

Also contained within the OneFS diagnostics command set is the ‘isi diagnostics netlogger’ utility. Netlogger captures IP traffic over a period of time for network and protocol analysis.

Under the hood, netlogger is a Python wrapper around the ubiquitous tcpdump utility, and can be run either from the OneFS command line or WebUI.

For example, from the WebUI, browse to Cluster management > Diagnostics > Netlogger:

Alternatively, from the OneFS CLI, the ‘isi diagnostics netlogger’ command captures traffic on the specified interfaces (‘--interfaces’), rotates the capture file after a set period (‘--duration’), and retains a specified number of capture files (‘--count’).

Here’s the basic syntax of the CLI utility:

 # isi diagnostics netlogger start

        [--interfaces <str>]

        [--count <integer>]

        [--duration <duration>]

        [--snaplength <integer>]

        [--nodelist <str>]

        [--clients <str>]

        [--ports <str>]

        [--protocols (ip | ip6 | arp | tcp | udp)]

        [{--help | -h}]

Note that using the ‘-b’ bpf buffer size option will temporarily change the default buffer size while netlogger is running.

The command options include:

Netlogger Option	Description
--interfaces <str>	Limit packet collection to specified network interfaces.
--count <integer>	The number of packet capture files to keep after they reach the duration limit. Defaults to the latest 3 files. 0 is infinite.
--duration <duration>	How long to run the capture before rotating the capture file. Default is 10 minutes.
--snaplength <integer>	The maximum amount of data for each packet that is captured. Default is 320 bytes. Valid range is 64 to 9100 bytes.
--nodelist <str>	List of nodes specified by LNN on which to run the capture.
--clients <str>	Limit packet collection to specified client hostnames / IP addresses.
--ports <str>	Limit packet collection to specified TCP or UDP ports.
--protocols (ip | ip6 | arp | tcp | udp)	Limit packet collection to specified protocols.
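As a rough illustration of how these options combine, the following sketch assembles an ‘isi diagnostics netlogger start’ command line from keyword arguments. The helper name `build_netlogger_command` is hypothetical and not part of OneFS; the flag names and the snaplength limits are taken from the table above, and zero is also accepted because the article mentions it as the whole-packet setting.

```python
import shlex

SNAPLENGTH_RANGE = (64, 9100)  # valid range quoted in the options table


def build_netlogger_command(interfaces=None, count=None, duration=None,
                            snaplength=None, nodelist=None, clients=None,
                            ports=None, protocols=None):
    """Return the argv list for an 'isi diagnostics netlogger start' call."""
    if snaplength is not None and snaplength != 0:
        low, high = SNAPLENGTH_RANGE
        if not low <= snaplength <= high:
            raise ValueError(f"snaplength must be 0 or between {low} and {high}")
    argv = ["isi", "diagnostics", "netlogger", "start"]
    # Options are emitted in the same order as the syntax listing.
    options = [
        ("--interfaces", interfaces),
        ("--count", count),
        ("--duration", duration),
        ("--snaplength", snaplength),
        ("--nodelist", nodelist),
        ("--clients", clients),
        ("--ports", ports),
        ("--protocols", protocols),
    ]
    for flag, value in options:
        if value is not None:
            argv += [flag, str(value)]
    return argv


cmd = build_netlogger_command(nodelist="2,3", snaplength=256, ports="2049")
print(shlex.join(cmd))
```

Building the argument list rather than a single string avoids shell quoting problems if the result is later handed to a process launcher.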

Netlogger’s log files are stored by default under /ifs/netlog/<node_name>.

The WebUI can also be used to configure the netlogger parameters under Cluster management > Diagnostics > Netlogger settings:

Be aware that ‘isi diagnostics netlogger’ can consume significant cluster resources. When running the tool on a production cluster, be cognizant of the effect on the system.

When the command has completed, the capture file(s) are stored under:

/ifs/netlog/[nodename]

The following command can also be used to incorporate netlogger output files into a gather_info bundle:

# isi_gather_info -n [node#] -f /ifs/netlog

To capture on multiple nodes of the cluster, specify their logical node numbers with the ‘--nodelist’ option:

# isi diagnostics netlogger start --nodelist 2,3 --duration 5 --snaplength 256

The command syntax above will create five-minute capture files on nodes 2 and 3, using a snaplength of 256 bytes, which will capture the first 256 bytes of each packet. These five-minute logs will be kept for about three days and the naming convention is of the form netlog-<node_name>.<date>_<time>.pcap. For example:

# ls /ifs/netlog/tme_h700-1

netlog-tme_h700-1.2022-09-02_10.31.28.pcap
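When post-processing a directory of captures, it can help to split each filename back into its node name and start time. The sketch below is based solely on the single example filename shown above, so the pattern is an assumption rather than a documented format, and `parse_netlog_name` is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern inferred from the example 'netlog-tme_h700-1.2022-09-02_10.31.28.pcap'.
NETLOG_NAME = re.compile(
    r"^netlog-(?P<node>.+)\.(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}\.\d{2}\.\d{2})\.pcap$"
)


def parse_netlog_name(filename):
    """Return (node_name, capture_start) for a netlogger capture filename."""
    match = NETLOG_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized netlogger filename: {filename}")
    started = datetime.strptime(match.group("stamp"), "%Y-%m-%d_%H.%M.%S")
    return match.group("node"), started


node, started = parse_netlog_name("netlog-tme_h700-1.2022-09-02_10.31.28.pcap")
print(node, started.isoformat())
```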

When using netlogger, the ‘--snaplength’ option needs to be set appropriately for the protocol being captured, so that the right amount of detail in the packet headers and/or payload is retained. Or, if you want the entire contents of every packet, a value of zero (‘--snaplength 0’) can be used.

The default snaplength for netlogger is to capture 320 bytes per packet, which is typically sufficient for most protocols.

However, for SMB, a snaplength of 512 is sometimes required. Note that, depending on a node’s traffic quantity, a snaplength of 0 (ie. capture the whole packet) can potentially overwhelm the network interface driver.

All the output gets written to files under the /ifs/netlog directory, and the default capture time is ten minutes (‘--duration 10’).

Filters can be appended to the command to constrain the capture to traffic to/from certain hosts or protocols. For example, to limit output to traffic to and from client 10.10.10.1:

# isi diagnostics netlogger start --duration 5 --snaplength 256 --clients 10.10.10.1

Or to capture only NFS traffic, filter on port 2049:

# isi diagnostics netlogger start --ports 2049

OneFS Logfile Collection with isi-gather-info

The previous blog article outlining the investigation and troubleshooting of OneFS deadlocks and hang-dumps generated several questions about OneFS logfile gathering. So it seemed like a germane topic to explore in an article.

The OneFS ‘isi_gather_info’ utility has long been a cluster staple for collecting and collating context and configuration that primarily aids support in the identification and resolution of bugs and issues. As such, it is arguably OneFS’ primary support tool and, in terms of actual functionality, it performs the following roles:

Executes many commands, scripts, and utilities on the cluster, and saves their results.
Gathers all these files into a single ‘gzipped’ package.
Transmits the gather package back to Dell via several optional transport methods.

By default, a log gather tarfile is written to the /ifs/data/Isilon_Support/pkg/ directory. It can also be uploaded to Dell via the following means:

Transport Mechanism	Description	TCP Port
ESRS	Uses Dell EMC Secure Remote Support (ESRS) for gather upload.	443/8443
FTP	Use FTP to upload completed gather.	21
HTTP	Use HTTP to upload gather.	80/443

More specifically, the ‘isi_gather_info’ CLI command syntax includes the following options:

Option	Description
--upload <boolean>	Enable gather upload.
--esrs <boolean>	Use ESRS for gather upload.
--gather-mode (incremental | full)	Type of gather: incremental, or full.
--http-insecure-upload <boolean>	Enable insecure HTTP upload on completed gather.
--http-upload-host <string>	HTTP host to use for HTTP upload.
--http-upload-path <string>	Path on HTTP server to use for HTTP upload.
--http-upload-proxy <string>	Proxy server to use for HTTP upload.
--http-upload-proxy-port <integer>	Proxy server port to use for HTTP upload.
--clear-http-upload-proxy-port	Clear proxy server port to use for HTTP upload.
--ftp-upload <boolean>	Enable FTP upload on completed gather.
--ftp-upload-host <string>	FTP host to use for FTP upload.
--ftp-upload-path <string>	Path on FTP server to use for FTP upload.
--ftp-upload-proxy <string>	Proxy server to use for FTP upload.
--ftp-upload-proxy-port <integer>	Proxy server port to use for FTP upload.
--clear-ftp-upload-proxy-port	Clear proxy server port to use for FTP upload.
--ftp-upload-user <string>	FTP user to use for FTP upload.
--ftp-upload-ssl-cert <string>	Specifies the SSL certificate to use in FTPS connection.
--ftp-upload-insecure <boolean>	Whether to attempt a plain text FTP upload.
--ftp-upload-pass <string>	Password for the FTP upload user.
--set-ftp-upload-pass	Specify the FTP upload password interactively.

Once the gather arrives at Dell, it is automatically unpacked by a support process and analyzed using the ‘logviewer’ tool.

Under the hood, there are two principal components responsible for running a gather. These are:

Component	Description
Overlord	The manager process, triggered by the user, which oversees all the isi_gather_info tasks which are executed on a single node.
Minion	The worker process, which runs a series of commands (specified by the overlord) on a specific node.

The ‘isi_gather_info’ utility is primarily written in Python, with its configuration under the purview of MCP, and RPC services provided by the isi_rpc_d daemon.

For example:

# isi_gather_info&

# ps -auxw | grep -i gather

root   91620    4.4  0.1 125024  79028  1  I+   16:23        0:02.12 python /usr/bin/isi_gather_info (python3.8)

root   91629    3.2  0.0  91020  39728  -  S    16:23        0:01.89 isi_rpc_d: isi.gather.minion.minion.GatherManager (isi_rpc_d)

root   93231    0.0  0.0  11148   2692  0  D+   16:23        0:00.01 grep -i gather

The overlord uses isi_rdo (the OneFS remote command execution daemon) to start up the minion processes and informs them of the commands to be executed via an ephemeral XML file, typically stored at /ifs/.ifsvar/run/<uuid>-gather_commands.xml. The minion then spins up an executor and a command for each entry in the XML file.

The parallel process executor (the default one to use) acts as a pool, triggering commands to run in parallel until a specified number are running in parallel. The commands themselves take care of the running and processing of results, checking frequently to ensure the timeout threshold has not been passed.

The executor also keeps track of which commands are currently running, and how many are complete, and writes them to a file so that the overlord process can display useful information. Once complete, the executor returns the runtime information to the minion, which records the benchmark file. The executor will also safely shut itself down if the isi_gather_info lock file disappears, such as if the isi_gather_info process is killed.
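The pooling behavior described above can be sketched with the Python standard library. This is an illustration of the general technique (a bounded worker pool with a per-command timeout), not the actual isi_gather_info executor; the names `run_command` and `run_all`, and the default parallelism and timeout values, are assumptions.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(name, argv, timeout):
    """Run one command; return (name, exit code or None on timeout, stdout)."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return name, None, ""
    return name, done.returncode, done.stdout


def run_all(commands, parallelism=4, timeout=30.0):
    """Run every command, with at most 'parallelism' running at any one time."""
    results = {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(run_command, name, argv, timeout)
                   for name, argv in commands.items()]
        for future in futures:
            name, code, output = future.result()
            results[name] = (code, output)
    return results


# Demo using the Python interpreter itself as a stand-in for a cluster command.
demo = {"hello": [sys.executable, "-c", "print('ok')"]}
print(run_all(demo, parallelism=2, timeout=10.0))
```

Capping the worker count keeps a gather from saturating a node that may already be under stress, while the timeout prevents one hung command from stalling the whole run.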

During a gather the minion returns nothing to the overlord process, since the output of its work is written to disk.

Architecturally, the ‘gather’ process comprises an eight phase workflow:

The details of each phase are as follows:

Phase	Description
1. Setup	Reads the arguments passed in, as well as any config files on disk, and sets up the config dictionary, which will be used throughout the rest of the codebase. Most of the code for this step is contained in isilon/lib/python/gather/igi_config/configuration.py. This is also the step where the program is most likely to exit, if any config arguments turn out to be invalid.
2. Run local	Executes all the cluster commands, which are run on the same node that is starting the gather. All these commands run in parallel (up to the current parallelism value). This is typically the second longest running phase.
3. Run nodes	Executes the node commands across all of the cluster’s nodes. This runs on each node, and while these commands run in parallel (up to the current parallelism value), they do not run in parallel with the local step.
4. Collect	Ensures all of the results end up on the overlord node (the node that started gather). If gather is using /ifs, it is very fast, but if it’s not, it needs to SCP all the node results to a single node.
5. Generate Extra Files	Generates nodes_info and package_info.xml. These two files are present in every gather, and provide important metadata about the cluster.
6. Packing	Packs (tars and gzips) all the results. This is typically the longest running phase, often by an order of magnitude.
7. Upload	Transports the tarfile package to its specified destination. Depending on the geographic location, this phase might also take a long time.
8. Cleanup	Cleans up any intermediary files that were created on the cluster. This phase will run even if gather fails, or is interrupted.
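The key property of this workflow, that cleanup runs regardless of how the earlier phases end, maps naturally onto a try/finally construct. The following minimal sketch shows the idea; the phase labels and the `run_gather` function are hypothetical stand-ins, not the real isi_gather_info code.

```python
# Phases 1 to 7 from the table above; phase 8 (cleanup) is handled separately.
PHASES = [
    "setup",
    "run_local",
    "run_nodes",
    "collect",
    "generate_extra_files",
    "packing",
    "upload",
]


def run_gather(handlers, cleanup):
    """Run each phase in order; always call cleanup, even if a phase raises."""
    completed = []
    try:
        for phase in PHASES:
            handlers[phase]()
            completed.append(phase)
    finally:
        cleanup()  # runs on success, on failure, and on interruption
    return completed


# Demo with handlers that only record that they were called.
log = []
demo_handlers = {phase: (lambda phase=phase: log.append(phase)) for phase in PHASES}
run_gather(demo_handlers, lambda: log.append("cleanup"))
print(log)
```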

Since the isi_gather_info tool is primarily intended for troubleshooting clusters with issues, it runs as root (or compadmin in compliance mode), as it needs to be able to execute under degraded conditions (eg. without GMP, during upgrade, and under cluster splits, etc). Given these atypical requirements, isi_gather_info is built as a stand-alone utility, rather than using the platform API for data collection.

The time it takes to complete a gather is typically determined by cluster configuration, rather than size. For example, a gather on a small cluster with a large number of NFS shares will take significantly longer than on a large cluster with a similar NFS configuration. Incremental gathers are not recommended, since the base that’s required to check against in the log store may be deleted. By default, gathers only persist for two weeks in the log processor.

On completion of a gather, a tar’d and zipped logset is generated and placed under the cluster’s /ifs/data/Isilon_Support/pkg directory by default. A standard gather tarfile unpacks to the following top-level structure:

# du -sh *

536M    IsilonLogs-powerscale-f900-cl1-20220816-172533-3983fba9-3fdc-446c-8d4b-21392d2c425d.tgz

320K    benchmark

 24K    celog_events.xml

 24K    command_line

128K    complete

449M    local

 24K    local.log

 24K    nodes_info

 24K    overlord.log

 83M    powerscale-f900-cl1-1

 24K    powerscale-f900-cl1-1.log

119M    powerscale-f900-cl1-2

 24K    powerscale-f900-cl1-2.log

134M    powerscale-f900-cl1-3

 24K    powerscale-f900-cl1-3.log

In this case, for a three node F900 cluster, the compressed tarfile is 536 MB in size. The bulk of the data, which is primarily CLI command output, logs and sysctl output, is contained in the ‘local’ and individual node directories (powerscale-f900-cl1-*). Each node directory contains a tarfile, varlog.tar, containing all the pertinent logfiles for that node.

The root directory of the tarfile includes the following:

Item	Description
benchmark	Runtimes for all commands executed by the gather.
celog_events.xml	Info about the customer, including name, phone, email, etc. Also contains significant detail about the cluster and individual nodes, including cluster/node names, node serial numbers, configuration ID, OneFS version info, and events.
command_line	Syntax of gather commands run.
complete	Lists of complete commands run across the cluster and on individual nodes.
local	See below.
nodes_info	Contains general information about the nodes, including the node ID, the IP address, the node name, and the logical node number.
overlord.log	Gather execution and issue log.
package_info.xml	Cluster version details, GUID, S/N, and customer info (name, phone, email, etc).

Notable contents of the ‘local’ directory (all the cluster-wide commands that are executed on the node running the gather) include:

Local Contents Item	Description
isi_alerts_history	Appears to contain a list of all alerts that have occurred on the cluster, including: the event ID (the number of the initiating node plus the event number); the times the alert was issued and resolved; severity; the logical node number of the node(s) to which the alert applies; and the message contained in the alert.
isi_job_list	Contains information about job engine processes, including job names, enabled status, priority policy, and descriptions.
isi_job_schedule	A schedule of when job engine processes run, including the job name, the schedule for a job, and the next time that a run of the job will occur.
isi_license	The current license status of all of the modules.
isi_network_interfaces	State and configuration of all the cluster’s network interfaces.
isi_nfs_exports	Configuration detail for all the cluster’s NFS exports.
isi_services	Listing of all the OneFS services and whether they are enabled or disabled. More detailed configuration for each service is contained in separate files. For example, for SnapshotIQ: snapshot_list, snapshot_schedule, snapshot_settings, snapshot_usage, and writable_snapshot_list.
isi_smb	Detailed configuration info for all the cluster’s SMB shares.
isi_stat	Overall status of the cluster, including networks, drives, etc.
isi_statistics	CPU, protocol, and disk IO stats.

Contents of each individual node directory include:

Node Contents Item	Description
df	Output of the df command.
du	Output of the du command. Unfortunately it runs ‘du -h’, which reports capacity in ‘human readable’ form but makes the output more complex to sort.
isi_alerts	Contains a list of outstanding alerts on the node.
ps and ps_full	Lists of all running processes at the time that isi_gather_info was executed.

As the isi_gather_info command runs, status is provided via the interactive CLI session:

# isi_gather_info

Configuring

    COMPLETE

running local commands

    IN PROGRESS \

Progress of local

[########################################################  ]

147/152 files written  \

Some active commands are: ifsvar_modules_jobengine_cp, isi_statistics_heat, ifsvar_modules

Once the gather has completed, the location of the tarfile on the cluster itself is reported as follows:

# isi_gather_info

Configuring

    COMPLETE

running local commands

    COMPLETE

running node commands

    COMPLETE

collecting files

    COMPLETE

generating package_info.xml

    COMPLETE

tarring gather

    COMPLETE

uploading gather

    COMPLETE

Path to the tar-ed gather is:

/ifs/data/Isilon_Support/pkg/IsilonLogs-h5001-20220830-122839-23af1154-779c-41e9-b0bd-d10a026c9214.tgz

If the gather upload services are unavailable, errors will be displayed on the console, per below:

…

uploading gather

    FAILED

        ESRS failed - ESRS has not been provisioned

        FTP failed - pycurl error: (28, 'Failed to connect to ftp.isilon.com port 21 after 81630 ms: Operation timed out')

OneFS Deadlocks and Hang-dumps – Part 3

As we’ve seen previously in this series, very occasionally a cluster can become deadlocked and remain in an unstable state until the affected node(s), or sometimes the entire cluster, is rebooted or panicked. However, in addition to the data gathering discussed in the prior article, there are additional troubleshooting steps that can be explored by the more adventurous cluster admin – particularly with regard to investigating a LIN lock.

Lock Domain	Resource	Description
LIN	LIN	Every object in the OneFS filesystem (file, directory, internal special LINs) is indexed by a logical inode number (LIN). A LIN provides an extra level of indirection, providing pointers to the mirrored copies of the on-disk inode. This domain is used to provide mutual exclusion around classic BSD vnode operations. Operations that require a stable view of data take a read lock which allows other readers to operate simultaneously but prevents modification. Operations that change data take a write lock that prevents others from accessing that directory while the change is taking place.

The approach outlined can be useful to assist in identifying the problematic thread(s) and/or node(s) and helping to diagnose and resolve a cluster wide deadlock.

As a quick refresher, the various OneFS locking components include:

Locking Component	Description
Coordinator	A coordinator node arbitrates locking within the cluster for a particular subset of resources. The coordinator only maintains the lock types held and wanted by the initiator nodes.
Domain	Refers to the specific lock attributes (recursion, deadlock detection, memory use limits, etc) and context for a particular lock application. There is one definition of owner, resource, and lock types, and only locks within a particular domain may conflict.
Initiator	The node requesting a lock on behalf of a thread is called an initiator. The initiator must contact the coordinator of a resource in order to acquire the lock. The initiator may grant a lock locally for types which are subordinate to the type held by the node. For example, with shared-exclusive locking, an initiator which holds an exclusive lock may grant either a shared or exclusive lock locally.
Lock Type	Determines the contention among lockers. A shared or read lock does not contend with other types of shared or read locks, while an exclusive or write lock contends with all other types. Lock types include: Advisory, Anti-virus, Data, Delete, LIN, Mark, Oplocks, Quota, Share Mode, SMB byte-range, Snapshot, and Write.
Locker	Identifies the entity which acquires a lock.
Owner	A locker which has successfully acquired a particular lock. A locker may own multiple locks of the same or different type as a result of recursive locking.
Resource	Identifies a particular lock. Lock acquisition only contends on the same resource. The resource ID is typically a LIN to associate locks with files.
Waiter	Has requested a lock but has not yet been granted or acquired it.

So the basic data that will be required for a LIN lock investigation is as follows:

Data	Description
<Waiter-LNN>	Node number with the largest ‘started’ value.
<Waiting-Address>	Address of <Waiter-LNN> node above.
<LIN>	LIN from the ‘resource =’ field of <Waiter-LNN>
<Block-Address>	Block address from the ‘resource =’ field of <Waiter-LNN>.
<Locker-Node>	Node that owns the lock for the <LIN>. Has a non-zero value for ‘owner_count’.
<Locker-Addr>	Address of the thread holding the lock on <Locker-Node> (the ‘locker =’ field).

As such, the following process can be used to help investigate a LIN lock:

The details for each step above are as follows:

First, execute the following CLI syntax from any node in the cluster to view the LIN lock ‘oldest_waiter’ info:

# isi_for_array -X 'sysctl efs.lin.lock.initiator.oldest_waiter | grep -E "address|started"' | grep -v "exited with status 1"

Querying the ‘efs.lin.lock.initiator.oldest_waiter’ sysctl returns a deluge of information, for example:

# sysctl efs.lin.lock.initiator.oldest_waiter

efs.lin.lock.initiator.oldest_waiter: resource = 1:02ab:002c

waiter = {

    address = 0xfffffe8ff7674080

    locker = 0xfffffe99a52b4000

    type = shared

    range = [all]

    wait_type = wait ok

    refcount_type = stacking

    probe_id = 818112902

    waiter_id = 818112902

    probe_state = done

    started = 773086.923126 (29.933031 seconds ago)

    queue_in = 0xfffff80502ff0f08

    lk_completion_callback = kernel:lk_lock_callback+0

    waiter_type = sync

    created by:

      Stack: --------------------------------------------------

      kernel:lin_lock_get_locker+0xfe

      kernel:lin_lock_get_locker+0xfe

      kernel:bam_vget_stream_invalid+0xe5

      kernel:bam_vget_stream_valid_pref_hint+0x51

      kernel:bam_vget_valid+0x21

      kernel:bam_event_oprestart+0x7ef

      kernel:ifs_opwait+0x12c

      kernel:amd64_syscall+0x3a6

      --------------------------------------------------

The pertinent areas of interest for this exercise are the ‘address’ and ‘started’ (wait time) fields.

If the ‘started’ value is short (ie. less than 90 seconds), or there is no output returned, then this is potentially an MDS lock issue (which can be investigated via the ‘efs.mds.block_lock.initiator.oldest_waiter’ sysctl).

From the above output, examine the results with ‘started’ lines and find the one with the largest value for ‘(###.### seconds ago)’. The node number (<Waiter-LNN>) of this entry is the one of interest.
Next, examine the ‘address =’ entries and find the one with that same node number (<Waiting-Address>).

Note that if there are multiple entries per node, this could indicate a multiple shared lock with another exclusive lock waiting.
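Picking out the longest waiter by eye gets tedious on larger clusters, so the selection described above can be scripted. The sketch below assumes that ‘isi_for_array -X’ prefixes each output line with the node’s LNN and a colon (as in other examples in this series), and that each node’s ‘address’ line precedes its ‘started’ line, as in the sysctl output shown earlier. The function name and the second sample address are illustrative only.

```python
import re

# Matches lines such as '3:     started = 772900.1 (216.75 seconds ago)'.
LINE = re.compile(r"^\s*(?P<lnn>\d+):\s*(?P<key>address|started)\s*=\s*(?P<value>.+?)\s*$")
AGO = re.compile(r"\((?P<secs>[\d.]+) seconds ago\)")


def find_oldest_waiter(output):
    """Return (lnn, address, seconds_waited) for the longest waiter, or None."""
    best = None
    last_address = {}  # most recent 'address' value seen for each node
    for raw in output.splitlines():
        match = LINE.match(raw)
        if match is None:
            continue
        lnn = int(match.group("lnn"))
        if match.group("key") == "address":
            last_address[lnn] = match.group("value")
            continue
        ago = AGO.search(match.group("value"))
        if ago is None:
            continue
        waited = float(ago.group("secs"))
        if best is None or waited > best[2]:
            best = (lnn, last_address.get(lnn), waited)
    return best


sample = """\
1:     address = 0xfffffe8ff7674080
1:     started = 773086.923126 (29.933031 seconds ago)
3:     address = 0xfffffe8ff1234080
3:     started = 772900.100000 (216.756157 seconds ago)
"""
print(find_oldest_waiter(sample))
```

The first two elements of the result correspond to <Waiter-LNN> and <Waiting-Address> in the table of required data.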

Query the LIN for the waiting address on the correct node using the following CLI syntax:

# isi_for_array -n<Waiter-LNN> 'sysctl efs.lin.lock.initiator.active_entries | egrep "resource|address|owner_count" | grep -B5 <Waiting-Address>'

The LIN for this issue is shown in the ‘resource =’ field from the above output. Use the following command to find which node owns the lock on that LIN:

# isi_for_array -X 'sysctl efs.lin.lock.initiator.active_entries |egrep "resource|owner_count"' | grep -A1 <LIN>

Parse the output from this command to find the entry that has a non-zero value for ‘owner_count’. This is the node that owns the lock for this LIN (<Locker-Node>).
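The parsing step above can also be scripted. The following sketch assumes the filtered output consists of LNN-prefixed ‘resource = ...’ lines, each followed by an ‘owner_count = N’ line, with ‘--’ separators inserted by ‘grep -A1’. The exact layout of the active_entries sysctl is not shown in this article, so treat the line format, the sample data, and the `find_locker_nodes` helper as assumptions to be adjusted against real output.

```python
import re

ENTRY = re.compile(r"^\s*(?P<lnn>\d+):\s*(?P<key>resource|owner_count)\s*=\s*(?P<value>\S+)")


def find_locker_nodes(output, lin):
    """Return the LNNs whose entry for 'lin' has a non-zero owner_count."""
    owners = []
    pending = None  # LNN of the most recent matching 'resource' line
    for raw in output.splitlines():
        match = ENTRY.match(raw)
        if match is None:
            pending = None  # separator or unrelated line ends the current entry
            continue
        lnn = int(match.group("lnn"))
        if match.group("key") == "resource":
            pending = lnn if match.group("value") == lin else None
        elif pending == lnn:
            if int(match.group("value")) > 0:
                owners.append(lnn)
            pending = None
    return owners


sample = """\
1: resource = 1:02ab:002c
1: owner_count = 0
--
2: resource = 1:02ab:002c
2: owner_count = 1
"""
print(find_locker_nodes(sample, "1:02ab:002c"))
```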

Run the following command to find which thread owns the lock on the LIN:

# isi_for_array -n<Locker-Node> 'sysctl efs.lin.lock.initiator.active_entries | grep -A10 <LIN>'

The ‘locker =’ field will provide the thread address (<Locker-Addr>) for the thread holding the lock on the LIN. The following CLI syntax can be used to find the associated process and stack details for this thread:
```
# isi_for_array -n<Locker-Node> 'sysctl kern.proc.all_stacks |grep -B1 -A20 <Locker-Addr>'
```
The output will provide the stack and process details. Depending on the process and stack information available from the previous command output, you may be able to terminate (ie. kill -9) the offending process in order to clear the deadlock issue.

Usually within a couple of minutes of killing the offending process, the cluster will become responsive again. The ‘isi get -L’ CLI command can be used to help determine which file was causing the issue, possibly giving some insight as to the root cause.

Please note that if you are unable to identify an individual culprit process, or are unsure of your findings, contact Dell Support for assistance.

OneFS Deadlocks and Hang-dumps – Part 2

As mentioned in the previous article in this series, hang-dumps can occur under the following circumstances.

Type	Description
Transient	The time to obtain the lock was long enough to trigger a hang-dump, but the lock is eventually granted. This is the less serious situation. The symptoms are typically general performance degradation, but the cluster is still responsive and able to progress.
Persistent	The issue typically requires significant remedial action, such as node reboots. This is usually indicative of a bug in OneFS, although it could also be caused by hardware issues, where hardware becomes unresponsive, and OneFS waits indefinitely for it to recover.

Certain normal OneFS operations, such as those involving very large files, have the potential to trigger a hang-dump with no long-term ill effects. However, in some situations the thread or process waiting for the lock to be freed, or ‘waiter’, is never actually granted the lock on the file. In such cases, users may be impacted.

If a hang-dump is generated as a result of a LIN lock timeout (the most likely scenario), this indicates that at least one thread in the system has been waiting for a LIN lock for over 90 seconds. The system hang can involve a single thread, or sometimes multiple threads, for example blocking a batch job. The system hang could be affecting interactive session(s), in which case the affected cluster users will likely experience performance impacts.

Specifically, in the case of a LIN lock timeout, if the LIN number is available, it can be easily mapped back to its associated filename using the ‘isi get’ CLI command.

# isi get -L <lin #>

However, a LIN which is still locked may necessitate waiting until the lock is freed before getting the name of the file.

By default, OneFS hang-dump files are written to the /var/crash directory as compressed text files. During a hang-dump investigation, Dell support typically utilizes internal tools to analyze the logs from all of the nodes and generate a graph to show the lock interactions between the lock holders (the thread or process that is holding the file) and lock waiters. The analytics are per-node and include a full dump of the lock state as seen by the local node, the stack of each thread in the system, plus a variety of other diagnostics including memory usage, etc. Since OneFS source-code access is generally required in order to interpret the stack traces, Dell Support can help investigate the hang-dump log file data, which can then be used to drive further troubleshooting.

A deadlocked cluster may exhibit one or more of the following symptoms:

Clients are unable to communicate with the cluster via SMB, NFS, SSH, etc.
The WebUI is unavailable and/or commands executed from the CLI fail to start or complete.
Processes cannot be terminated, even with SIGKILL (kill -9).
Degraded cluster performance is experienced, with low or no CPU/network/disk usage.
Inability to access files or folders under /ifs.

In order to recover from a deadlock, Dell support’s remediation will sometimes require panicking or rebooting a cluster. In such instances, thorough diagnostic information gathering should be performed prior to this drastic step. Without this diagnostic data, it will often be impossible to determine the root cause of the deadlock. If the underlying cause of the deadlock is not corrected, rebooting the cluster and restarting the service may not resolve the issue.

The following steps can be run in order to gather data that will be helpful in determining the cause of a deadlock:

1. First, verify that there are no indeterminate journal transactions. If indeterminate journal transactions are found, rebooting or panicking nodes will not resolve the issue.

# isi_for_array -X 'sysctl efs.journal.indeterminate_txns'

1: efs.journal.indeterminate_txns: 0
2: efs.journal.indeterminate_txns: 0
3: efs.journal.indeterminate_txns: 0

For each node, if the output of the above command returns zero, this indicates its journal is intact and all transactions are complete. Note that if the output is anything other than zero, the cluster contains indeterminate transactions, and Dell support should be engaged before any further troubleshooting is performed.

2. Next, check the /var/crash directory for any recently created hang-dump files:

# isi_for_array -s 'ls -l /var/crash | grep -i hang'

Also scan the /var/log/messages file for any recent references to ‘LOCK TIMEOUT’:

# isi_for_array -s 'egrep -i "lock timeout|hang" /var/log/messages | grep $(date +%Y-%m-%d)'

3. Collect the output from the ‘fstat’ CLI command, which identifies active files:

# isi_for_array -s 'fstat -m > /ifs/data/Isilon_Support/deadlock-data/fstat_$(hostname).txt'&

4. Record the Group Management Protocol (GMP) merge lock state:

# isi_for_array -s 'sysctl efs.gmp.merge_lock_state > /ifs/data/Isilon_Support/deadlock-data/merge_lock_state_$(hostname).txt'

5. Finally, run an ‘isi diagnostics gather’ logset gather to capture relevant cluster data and send the resulting zipped tarfile to Dell Support (via ESRS, FTP, etc).

# isi diagnostics gather start
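Since the indeterminate journal transaction check at the start of this procedure gates everything that follows, it is a good candidate for a scripted sanity test. The sketch below parses output in the format shown for that check (‘<LNN>: efs.journal.indeterminate_txns: <count>’) and reports any node with a non-zero count; the function name is hypothetical.

```python
SYSCTL_NAME = "efs.journal.indeterminate_txns"


def nodes_with_indeterminate_txns(output):
    """Return {node: count} for every node reporting a non-zero count."""
    flagged = {}
    for raw in output.splitlines():
        parts = [part.strip() for part in raw.split(":")]
        # Expect exactly three fields: node, sysctl name, value.
        if len(parts) != 3 or parts[1] != SYSCTL_NAME:
            continue
        count = int(parts[2])
        if count != 0:
            flagged[parts[0]] = count
    return flagged


sample = """\
1: efs.journal.indeterminate_txns: 0
2: efs.journal.indeterminate_txns: 0
3: efs.journal.indeterminate_txns: 0
"""
flagged = nodes_with_indeterminate_txns(sample)
print("journals intact" if not flagged else f"engage Dell Support: {flagged}")
```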

A cluster reboot can be accomplished via an SSH connection as root to any node in the cluster, as follows:

# isi config

Welcome to the PowerScale configuration console.

Copyright (c) 2001-2022 Dell Inc. All Rights Reserved.

Enter 'help' to see list of available commands.

Enter 'help <command>' to see help for a specific command.

Enter 'quit' at any prompt to discard changes and exit.

        Node build: Isilon OneFS 9.4.0.0 B_MAIN_2978(RELEASE)

        Node serial number: JACNT194540666

TME1 >>> reboot all
 

!! You are about to reboot the entire cluster

Are you sure you wish to continue? [no] yes

Alternatively, the following CLI syntax can be used to reboot a single node from an SSH connection to it:

# kldload reboot_me

Or to reboot the cluster:

# isi_for_array -x$(isi_nodes -L %{lnn}) 'kldload reboot_me'

Note that simply shutting down or rebooting the affected node(s), or the entire cluster, while typically the quickest path to get up and running again, will not generate the core files required for debugging. If a root cause analysis is desired, these node(s) will need to be panicked in order to generate a dump of all active threads.

Only perform a node panic under the direct supervision of Dell Support! Be aware that panics bypass a number of important node shutdown functions, including unmounting /ifs, etc. However, a panic will generate additional kernel core information which is typically required by Dell Support in order to perform a thorough diagnosis. In situations where the entire cluster needs to be panicked, the recommendation is to start with the highest numbered node and work down to lowest. For each node that’s panicked, the debug information is written to the /var/crash directory, and can be identified by the ‘vmcore’ prefix.

If instructed by Dell Support to do so, the ‘isi_rbm_panic’ CLI command can be used to panic a node, with the argument being the logical node number (LNN) of the desired node to target. For example, to panic a node with LNN=2:

# isi_rbm_panic 2

If in any doubt, the following CLI syntax will return the corresponding node ID and node LNN for each node in the cluster:

# isi_nodes %{id} , %{lnn}