OneFS and NFS over RDMA Support

Over the last couple of decades, the ubiquitous network file system (NFS) protocol has become near synonymous with network attached storage. Since its debut in 1984, the technology has matured to such an extent that NFS is now deployed by organizations large and small across a broad range of critical production workloads. Currently, NFS is the OneFS file protocol with the most stringent performance requirements, serving key workloads such as EDA, artificial intelligence, 8K media editing and playback, financial services, and other branches of commercial HPC.

At its core, NFS over Remote Direct Memory Access (RDMA), as spec’d in RFC8267, enables data to be transferred between storage and clients with better performance and lower resource utilization than the standard TCP protocol. Network adapters with RDMA support, known as RNICs, allow direct data transfer with minimal CPU involvement, yielding increased throughput and reduced latency. For applications accessing large datasets on remote NFS, the benefits of RDMA include:

Benefit                 Detail
Low CPU utilization     Leaves more CPU cycles for other applications during data transfer.
Increased throughput    Utilizes high-speed networks to transfer large data amounts at line speed.
Low latency             Provides fast responses, making remote file storage feel more like directly attached storage.
Emerging technologies   Provides support for technologies such as NVIDIA’s GPUDirect, which offloads I/O directly to the client’s GPU.

Network file system over remote direct memory access, or NFSoRDMA, provides remote data transfer directly to and from memory, without CPU intervention. PowerScale clusters have offered NFSv3 over RDMA support, and its associated performance benefits, since its introduction in OneFS 9.2. As such, enabling this functionality under OneFS allows the cluster to perform memory-to-memory transfer of data over high speed networks, bypassing the CPU for data movement and helping both reduce latency and improve throughput.

Because OneFS already had support for NFSv3 over RDMA, extending this to NFSv4.x in OneFS 9.8 focused on two primary areas:

  • Providing support for NFSv4 compound operations.
  • Enabling native handling of the RDMA headers which NFSv4.1 uses.

So with OneFS 9.8 and later, clients can connect to PowerScale clusters using any of the current transport protocols and NFS versions – from v3 to v4.2:

Protocol   RDMA   TCP   UDP
NFS v3     x      x     x
NFS v4.0   x      x
NFS v4.1   x      x
NFS v4.2   x      x

The NFS over RDMA global configuration options in both the WebUI and CLI have also been simplified and genericized, negating the need to specify a particular NFS version:

And from the CLI:

# isi nfs settings global modify --nfs-rdma-enabled=true

A PowerScale cluster and client must meet certain prerequisite criteria in order to use NFS over RDMA.

Specifically, from the cluster side:

Requirement          Details
Node type            F210, F200, F600, F710, F900, F910, F800, F810, H700, H7000, A300, A3000
Network card (NIC)   NVIDIA Mellanox ConnectX-3 Pro, ConnectX-4, ConnectX-5, ConnectX-6 network adapters which support 25/40/100 GigE connectivity.
OneFS version        OneFS 9.2 or later for NFSv3 over RDMA, and OneFS 9.8 or later for NFSv4.x over RDMA.

Similarly, the OneFS NFSoRDMA implementation requires any NFS clients using RDMA to support RoCEv2. These may be either client VMs on a hypervisor with RDMA network interfaces, or bare-metal clients with RDMA NICs. OS-wise, any Linux kernel supporting NFSv4.x and RDMA (kernel version 5.3 or later) can be used, but RDMA-related packages such as ‘rdma-core’ and ‘libibverbs-utils’ will also need to be installed. Package installation is handled via the Linux distribution’s native package manager.

Linux Distribution   Package Manager   Package Utility
OpenSUSE             RPM               Zypper
RHEL                 RPM               Yum
Ubuntu               Deb               Apt-get / Dpkg

For example, on an OpenSUSE client:

# zypper install rdma-core libibverbs-utils

Additional client configuration is also required, and this procedure is covered in detail below.
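As a rough illustration of the kernel prerequisite mentioned above, the following sketch compares a kernel release string against the 5.3 minimum. The function name and the sample release strings are hypothetical, and the check covers only the version number, not whether the RDMA packages are installed.

```python
import re

# Minimum kernel version quoted in the text for NFSv4.x over RDMA.
MIN_KERNEL = (5, 3)


def kernel_meets_minimum(release: str, minimum=MIN_KERNEL) -> bool:
    """Return True if a release string such as '5.14.0-362.el9' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by major version first, then minor.
    return (major, minor) >= minimum
```

On a Linux client, the string returned by Python's `platform.release()` could be passed straight to `kernel_meets_minimum()`.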

In addition to a new ‘nfs-rdma-enabled’ NFS global config setting (deprecating the prior ‘nfsv3-rdma-enabled’ setting), OneFS 9.8 also adds a new ‘nfs-rroce-only’ network pool setting. This allows the creation of an RDMA-only network pool that can only contain RDMA-capable interfaces. For example:

# isi network pools modify <pool_id> --nfs-rroce-only true

This is ideal for NFS failover purposes because it can ensure that a dynamic pool will only fail over to an RDMA-capable interface.

OneFS 9.8 also introduces a new NFS over RDMA CELOG event:

# isi event types list | grep -i rdma

400140003  SW_NFS_CLUSTER_NOT_RDMA_CAPABLE             400000000  To use the NFS-over-RDMA feature, the cluster must have an RDMA-capable front-end Network Interface Card.

This event will fire if the cluster transitions from being able to support RDMA to not, or if attempting to enable RDMA on a non-capable cluster. The previous ‘SW_NFSV3_CLUSTER_NOT_RDMA_CAPABLE’ event from OneFS 9.7 and earlier is also deprecated.

When it comes to TCP/UDP port requirements for NFS over RDMA, any environments with firewalls and/or packet filtering deployed should ensure the following ports are open between PowerScale cluster and NFS client(s):

Port   Description
4791   RoCEv2 (UDP) for RDMA payload encapsulation.
300    Used by NFSv3 mount service.
302    Used by NFSv3 network status monitor (NSM).
304    Used by NFSv3 network lock manager (NLM).
111    RPC portmapper for locating services like NFS and mountd.

NFSv4 over RDMA does not add any new ports or outside cluster interfaces or interactions to OneFS, and RDMA should not be assumed to be more or less secure than any other transport type. For maximum security, NFSv4 over RDMA can be configured to use a central identity manager such as Kerberos.

Telemetry-wise, the ‘isi statistics’ configuration in OneFS 9.8 includes a new ‘nfs4rdma’ switch for v4, in addition to the legacy ‘nfsrdma’ switch. The ‘nfs4rdma’ switch covers all of the 4.0, 4.1, and 4.2 statistics.

The new NFSv4 over RDMA CLI statistics options in OneFS 9.8 include:

Command Syntax                                                 Description
isi statistics client list --protocols=nfs4rdma                Display NFSv4oRDMA cluster usage statistics organized according to cluster hosts and users.
isi statistics protocol list --protocols=nfs4rdma              Display cluster usage statistics for NFSv4oRDMA.
isi statistics pstat list --protocol=nfs4rdma                  Generate detailed NFSv4oRDMA statistics along with CPU, OneFS, network and disk statistics.
isi statistics workload list --dataset= --protocols=nfs4rdma   Display NFSv4oRDMA workload statistics for specified dataset(s).

For example:

# isi statistics client list --protocols nfs4rdma

  Ops     In    Out  TimeAvg  Node    Proto           Class  UserName   LocalName   RemoteName
----------------------------------------------------------------------------------------------
629.8  13.4k  62.1k    711.0     8 nfs4rdma  namespace_read   user_55  10.2.50.65  10.2.50.165
605.6  16.9k  59.5k    594.9     4 nfs4rdma  namespace_read  user_254  10.2.50.66  10.2.50.166
451.0   3.7M  41.5k   1948.5     1 nfs4rdma           write   user_74  10.2.50.72  10.2.50.172
240.7 662.8k  18.1k    279.4     8 nfs4rdma          create   user_55  10.2.50.65  10.2.50.165
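For scripted monitoring, output of this shape can be parsed into records. The sketch below is a minimal illustration rather than a supported interface: the column names are our own labels, the sample rows are copied from the example above, and the assumption that the ‘k’ and ‘M’ suffixes are decimal multipliers is ours.

```python
# Rows copied from the example output above (header and rule line omitted).
SAMPLE = """\
629.8  13.4k  62.1k    711.0     8 nfs4rdma  namespace_read   user_55  10.2.50.65  10.2.50.165
605.6  16.9k  59.5k    594.9     4 nfs4rdma  namespace_read  user_254  10.2.50.66  10.2.50.166
451.0   3.7M  41.5k   1948.5     1 nfs4rdma           write   user_74  10.2.50.72  10.2.50.172
240.7 662.8k  18.1k    279.4     8 nfs4rdma          create   user_55  10.2.50.65  10.2.50.165
"""

# Hypothetical field labels, one per whitespace-separated column.
COLUMNS = ["ops", "in", "out", "time_avg", "node",
           "proto", "class", "user", "local", "remote"]

# Assumption: suffixes are decimal multipliers (k = 1e3, M = 1e6, G = 1e9).
SUFFIX = {"k": 1e3, "M": 1e6, "G": 1e9}


def to_number(text):
    """Convert '13.4k' or '711.0' to a float."""
    if text[-1] in SUFFIX:
        return float(text[:-1]) * SUFFIX[text[-1]]
    return float(text)


def parse_rows(text):
    """Return one dict per data row, skipping headers and separator lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        row = dict(zip(COLUMNS, fields))
        try:
            for key in ("ops", "in", "out", "time_avg"):
                row[key] = to_number(row[key])
            row["node"] = int(row["node"])
        except ValueError:
            continue  # not a data row (for example, the column header)
        rows.append(row)
    return rows


def ops_by_class(rows):
    """Sum the Ops column per operation class."""
    totals = {}
    for row in rows:
        totals[row["class"]] = totals.get(row["class"], 0.0) + row["ops"]
    return totals
```

Calling `ops_by_class(parse_rows(SAMPLE))` groups the four sample rows into three classes: namespace_read, write, and create.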

Additionally, session-level visibility and additional metrics can be gleaned from the ‘isi_nfs4mgmt’ utility, which provides insight into a client’s cache state. The command output shows which clients are connected to the server via RDMA or TCP, in addition to their NFS version. For example:

# isi_nfs4mgmt

ID                  Vers   Conn     SessionId   Client Address      Port  O-Owners Opens    Handles  L-Owners

1196363351478045825  4.0   tcp        -         10.1.100.110      856   1        7        10       0

1196363351478045826  4.0   tcp        -         10.1.100.112      872   0        0        0        0

2940493934191674019  4.2   rdma   3         10.2.50.227      40908 0       0         0         0

2940493934191674022  4.1   rdma    5        10.2.50.224      60152 0       0          0        0

The output above indicates two NFSv4.0 TCP sessions, plus one NFSv4.2 RDMA session and one NFSv4.1 RDMA session.
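The same tally can be produced programmatically. The following sketch counts sessions by NFS version and transport from text shaped like the ‘isi_nfs4mgmt’ output above. It assumes the first three whitespace-separated fields are the client ID, version, and connection type, as in the example; the function name is hypothetical.

```python
from collections import Counter

# Sample rows copied from the example output above.
SAMPLE = """\
ID                  Vers   Conn     SessionId   Client Address      Port  O-Owners Opens    Handles  L-Owners
1196363351478045825  4.0   tcp        -         10.1.100.110      856   1        7        10       0
1196363351478045826  4.0   tcp        -         10.1.100.112      872   0        0        0        0
2940493934191674019  4.2   rdma   3         10.2.50.227      40908 0       0         0         0
2940493934191674022  4.1   rdma    5        10.2.50.224      60152 0       0          0        0
"""


def count_sessions(text):
    """Count sessions per (version, transport) pair."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric client ID; this skips the header.
        if len(fields) >= 3 and fields[0].isdigit():
            counts[(fields[1], fields[2])] += 1
    return counts
```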

Used with the ‘--dump’ flag and a client ID, isi_nfs4mgmt will provide detailed information for a particular session:

# isi_nfs4mgmt --dump 2940493934191674019

Dump of client 2940493934191674019

Open Owners (0):

Session ID: 3

Forward Channel Connections: Remote: 10.2.50.227.40908 Local: 10.2.50.98.20049
....

Note that the ‘isi_nfs4mgmt’ tool is specific to stateful NFSv4 sessions and will not list any stateless NFSv3 activity.

In the next article in this series, we’ll explore the procedure for enabling NFS over RDMA on a PowerScale cluster.

OneFS Routing and SBR – Part 2

As we saw in the previous article in this series, the primary effect of OneFS source-based routing (SBR) is helping to ensure that the cluster replies on the same interface as the ingress packet came in on. This happens automatically in conjunction with FlexNet’s NIC affinity.

Each of a cluster’s front-end subnets contains one or more pools of IP addresses which can be bound to external interfaces of nodes in the cluster. Pools also bind to a groupnet and associated access zone for multi-tenant authentication management, etc.

A cluster’s network pools each include a range of addresses, a list of interfaces, an aggregation mode, and a list of static routes. Static Routes can be configured on a per-pool basis. Unlike SBR, static routes provide a mechanism to force all traffic for a specific destination to use a specific address and a specific gateway. This means static routes, unlike SBR, can support client services without making those services zone-aware.

OneFS SBR will often simply do the right thing with little or no additional configuration required. It is therefore generally the preferred option, and indeed is the default for new clusters running OneFS 9.8 or later. That said, in order for SBR to create its ipfw rule for a gateway, a session must first have been initiated from the source subnet. If no traffic has been originated or received from a network that’s unreachable via the default gateway, OneFS will transmit traffic it originates through the default gateway. Static routes are an option in this case.

Static routes are also an alternative when SBR cannot do what is required – for example, if different subnets must be treated differently, or a customer actually requires the route from A to B to be different from the route from B to A.

Static routes can be easily added from the CLI with the following syntax:

# isi network pools modify <pool> --static-routes <subnet_ip_address>/<CIDR_netmask>-<gateway_ip_address>

Where the first address and the integer form the destination subnet in CIDR notation, and the second address is the gateway. Static routes are configured on a per-pool basis. For example:

# isi network pools modify groupnet0.subnet0.pool0 --static-routes 10.30.1.0/22-10.30.1.1

Similarly, an individual static route can be removed as follows:

# isi network pools modify groupnet0.subnet0.pool0 --remove-static-routes 10.30.1.0/22-10.30.1.1
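To make the route format concrete, the sketch below splits a `<subnet>/<prefix>-<gateway>` string into its destination network and gateway using Python's standard ‘ipaddress’ module. The helper name is hypothetical and this is not how OneFS itself parses the value.

```python
import ipaddress


def parse_static_route(spec: str):
    """Split '<subnet>/<prefix>-<gateway>' into (network, gateway) objects."""
    destination, separator, gateway = spec.partition("-")
    if not separator:
        raise ValueError(f"expected '<subnet>/<prefix>-<gateway>', got {spec!r}")
    # strict=False accepts a destination with host bits set and
    # normalizes it to the enclosing network address.
    network = ipaddress.ip_network(destination, strict=False)
    return network, ipaddress.ip_address(gateway)
```

With the example route above, the destination normalizes to 10.30.0.0/22, because a /22 that contains 10.30.1.0 begins at 10.30.0.0, and the gateway 10.30.1.1 falls inside that network.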

Static routes are not mutually exclusive to SBR, but they do operate slightly differently when SBR is enabled. SBR just changes the way an egress packet is handled. Instead of matching a packet to a route based off the destination IP in the packet header, SBR uses the source IP of the packet instead.

Before changing the current SBR state on a cluster, the following CLI syntax can be used to confirm whether there are static routes configured in any IP address pools:

# isi network pools list -v | grep -i routes

If needed, all of a pool’s static routes can be easily removed as follows:

# isi network pools modify <pool_id> --clear-static-routes

When SBR parses the ipfw rule list in order to set the route, static routes take priority and are evaluated first. Under SBR, the narrowest route is preferred. This means that a CIDR /30 route that matches will be selected before a matching /28, etc. If no match is found, SBR then tries the subnet routes.
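The ‘narrowest route wins’ behavior is ordinary longest-prefix matching. The sketch below illustrates only the selection rule, using hypothetical routes and gateways. It is not the ipfw implementation.

```python
import ipaddress


def select_route(destination, routes):
    """Pick the most specific matching route.

    routes is a list of (cidr, gateway) pairs. Returns the gateway of the
    matching route with the longest prefix, or None if nothing matches.
    """
    dest = ipaddress.ip_address(destination)
    candidates = []
    for cidr, gateway in routes:
        network = ipaddress.ip_network(cidr, strict=False)
        if dest in network:
            candidates.append((network.prefixlen, gateway))
    if not candidates:
        return None
    # The largest prefix length is the narrowest (most specific) route.
    return max(candidates)[1]


# Hypothetical routes: a /30 nested inside a /28, plus a broad /16.
ROUTES = [
    ("10.2.1.48/28", "gw-28"),
    ("10.2.1.48/30", "gw-30"),
    ("10.2.0.0/16", "gw-16"),
]
```

Here 10.2.1.50 matches all three routes and resolves to the /30 gateway, while 10.2.1.60 lies outside the /30 and falls back to the /28.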

Clearly, static routes do have some notable limitations. By definition, they would need to include every destination address range in order to properly direct traffic – and this may be a large and changing set of information. Additionally, static routes can only direct traffic based on remote IP address. If multiple workflows use the same remote IP addresses, static routing cannot treat them differently.

Take the following multi-subnet topology example, where a client is three hops from a PowerScale cluster:

If the source IP address and the destination IP address both reside within the same subnet (i.e., within the same ‘layer 2’ broadcast domain), the packet goes directly to the destination IP address. Conversely, if the destination IP address is in a different subnet from the source, the packet is sent to the next-hop gateway.

In the above example, the client initiates a connection to the cluster at IP address 10.30.1.200:

First, the client determines that the destination IP address is not on its local network, and that it does not have a static route defined for that address. It then sends the packet to its default gateway (Gateway C) for more processing.

Next, the router at Gateway C receives the client’s packet, examines the destination IP in the packet header, and determines that it has a route to the destination through Router A at 10.30.1.1.

Since Router A has a direct connection on the 100GbE subnet to the destination IP address, it sends the packet directly to the cluster’s 10.30.1.200 interface.

At this point, OneFS must send a response packet to the client.

1. If SBR is disabled, the node determines that the client (10.2.1.50) is not on the same subnet and does not have a static route defined for that address. OneFS determines which gateway it must send the response packet to based upon the routing table.

Gateways with higher priority (lower value) have precedence over those with lower priority (higher value). For example, 1 has a higher priority than 10. The PowerScale node has one default gateway, which is the highest priority subnet the node is configured in. Since there is no static route configured on the node, OneFS chooses the default gateway 10.10.1.1 (router B) via the 10GbE interface.

The reply packet sent by the 10GbE interface has a source IP header of 10.30.1.200. Note that some firewalls or packet filters may interpret this as packet ‘spoofing’ and automatically block this type of behavior. Additionally, perceived performance asymmetry may also be an issue, since the connection may be bandwidth constrained because of the 10GbE link. In this case, a user may anticipate 100GbE performance but in actuality will be limited to only a 10GbE connection.

2. Conversely, if SBR is enabled, the cluster node’s routing decisions are no longer predicated on the client’s destination IP address. Instead, OneFS parses the egress packet source IP header, then sends the packet via the gateway associated with the source IP subnet.

The node’s reply packet has a source IP address of 10.30.1.200 and, as such, the SBR routing rules indicate the preferred gateway is A (10.30.1.1) on the 100GbE subnet. When the response reaches gateway A, it is routed back to Gateway C across the core network to the client.
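The two cases above can be modeled in a few lines. This is a simplified sketch of the decision only: it ignores static routes and gateway reachability, and the 10GbE subnet's CIDR (10.10.1.0/24) and both priority values are assumptions chosen to match the walkthrough, since the text does not state them.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    cidr: str
    gateway: str
    priority: int  # lower value means higher priority

    @property
    def network(self):
        return ipaddress.ip_network(self.cidr, strict=False)


def reply_next_hop(source_ip, dest_ip, subnets, sbr_enabled):
    """Return the next hop for a reply leaving the node from source_ip."""
    src = ipaddress.ip_address(source_ip)
    dst = ipaddress.ip_address(dest_ip)
    # A destination on a directly attached subnet needs no gateway.
    for subnet in subnets:
        if dst in subnet.network:
            return dest_ip
    if sbr_enabled:
        # SBR: use the gateway of the subnet that owns the source address.
        for subnet in subnets:
            if src in subnet.network:
                return subnet.gateway
    # Otherwise fall back to the highest-priority (lowest value) gateway.
    return min(subnets, key=lambda s: s.priority).gateway


SUBNETS = [
    Subnet("10.30.1.0/22", "10.30.1.1", 10),  # 100GbE subnet, Router A
    Subnet("10.10.1.0/24", "10.10.1.1", 1),   # 10GbE subnet, Router B (assumed CIDR)
]
```

For a reply from 10.30.1.200 to the client at 10.2.1.50, the function returns 10.10.1.1 with SBR disabled and 10.30.1.1 with SBR enabled, mirroring cases 1 and 2.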

Note, however, that SBR will not override any statically configured routes. If SBR is enabled and a static route is created, a new ipfw rule is added. Since SBR only acts upon ‘reply’ packets, any traffic initiated by the node is unaffected. For example, when a node contacts a DNS or AD server, traditional routing rules (as though SBR is disabled) apply. Also, be aware that enabling SBR is a global (cluster-wide) action, and OneFS does not currently allow for SBR configuration at a groupnet, subnet, or pool-level granularity.

In addition to SBR, a number of other FlexNet networking configurations are either cluster-wide, or effectively amount to it:

Networking Component   Description
DNS                    DNS, including the DNS caching daemon, operates cluster-wide.
Default gateway        While the default gateway as a routing mechanism may appear to be subnet-specific in the UI, it behaves globally.
NTP                    The network time protocol (NTP) is configured and runs globally to maintain time synchronization between the cluster’s nodes, domain controllers, and other servers and networked devices.
SBR                    While SBR is a cluster-wide setting, the routing changes will be specific to the routing tables on each individual node (each node’s routing table will be specific to the network pools it is a part of).

Note that, being global configuration, enabling SBR can affect other workflows as well.

Within FlexNet, each subnet has its own address space, which is specified by a base address and netmask, along with a gateway, VLAN tag, SmartConnect service address, and aggregation options (DSR return addresses).

Although SBR is a cluster-wide setting, the routing changes are specific to the rules and routing tables on each individual node, since each node’s routing table reflects the network pools it is a part of.

One quirk of subnet configuration is that, while each subnet can have a different default gateway configured, normally OneFS only uses the highest priority gateway configured in any of its subnets – falling back to lower-priority only if it is unreachable.

SBR aims to mitigate this idiosyncrasy, enforcing subnet settings by automatically creating IPFW rules based on subnet settings. This allows connections associated with a given subnet to use the specified gateway for that subnet, but only for connections bound to a specific local address within that subnet. This means they work only for incoming connections or outgoing connections that are made in a tenant-aware way; the common practice of clients binding to INADDR_ANY and letting the network stack choose which local address to use prevents SBR from working. Most client services running under OneFS (e.g. integration with authentication servers like LDAP and AD) therefore cannot currently use SBR.
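The difference between binding to a specific local address and binding to INADDR_ANY can be seen with ordinary sockets. The sketch below uses the loopback address purely for illustration and is not OneFS code. It simply shows that a wildcard-bound socket has no fixed local address until the network stack chooses one, which is why a source-address rule has nothing to match at bind time.

```python
import socket


def local_address_of(bind_host: str) -> str:
    """Bind a TCP socket to bind_host on an ephemeral port and report
    the local address the socket is tied to."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((bind_host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[0]
```

Binding to "0.0.0.0" (INADDR_ANY) reports the wildcard address, whereas binding to "127.0.0.1" pins the socket to that one local address.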

OneFS Routing and SBR

The previous article on this topic generated several questions, which suggested that a more thorough exploration of OneFS source-based routing (SBR) is likely warranted. So here goes…

At its essence, network routing is the process of selecting a path for data traffic, either within a network or traversing multiple networks. The aim is to ensure efficient data flow across subnets, while maintaining bandwidth and minimizing congestion. Routers, layer 3 switches, multi-homed systems, etc., make routing decisions based on packet header addresses and routing tables, which record the paths packets should take to reach their destinations.

IP packet headers have the following form, with the source and destination addresses located towards the end of the header section, before the packet’s payload.

Routing is typically either static, using manually entered routing statements and rules, or dynamic, via routing protocols such as RIP, OSPF, etc.

While the nomenclature might suggest that OneFS source-based routing would route traffic based on a source IP address, instead SBR actually operates by dynamically creating per-subnet default routes. The gateway is derived from the subnet configuration, and, as such, gateways need to be defined for each subnet.

New cluster deployments running OneFS 9.8 or later automatically have SBR enabled, whereas legacy clusters upgrading to 9.8 preserve their existing SBR configuration, whether on or off. While SBR is disabled by default in OneFS 9.7 and earlier releases, it can, if desired, be easily enabled from either the CLI or WebUI.

SBR is configured globally and, as such, is either on or off across the entire cluster and its network pools and subnets. OneFS 9.7 and earlier supports only the IPv4 protocol, whereas OneFS 9.8 and later also accommodate IPv6 subnets.

SBR can be instantly enabled on a PowerScale cluster by running the following CLI command:

# isi network external modify --sbr 1

# isi network external view | grep -i source

Source Based Routing: True

Or from the WebUI under Cluster management > Network configuration > Settings:

Similarly, SBR can be disabled as follows:

# isi network external modify --sbr 0

# isi network external view | grep -i source

Source Based Routing: False

Under the hood, SBR uses the FreeBSD ‘ipfw’ utility (as does the OneFS firewall) to record and manage its routing rules.

For example, with SBR disabled, querying ipfw on a cluster shows a single ‘any to any’ rule:

# isi network external view | grep -i source

Source Based Routing: False

# ipfw show

65535 11839927994 7560033188891 allow ip from any to any

By way of contrast, when SBR is enabled, a number of new, higher priority ‘allow’ rules for each NIC and gateway ‘fwd’ rules are added above the ‘any to any’ rule:

# isi network external view | grep -i source

Source Based Routing: True

# ipfw show

60000          16         33391 allow ip from any to any via lo0 out

60001           0             0 allow ip from any to ff02::1:ff00:0/104 out

60002      116082     112914089 allow ip from any to any via mce0 out

60003      217150     138771611 allow ip from any to any via mce1 out

60004           0             0 allow ip from any to any via ue0 out

60005           0             0 allow ip from any to fe80::/10 out

60006           0             0 allow ip from any to ff02::1 out

62000           0             0 fwd 2620:0:170:7c0f::1 ip from 2620:0:170:7c0f::/64 to not 2620:0:170:7c0f::/64 out

62001         121         94788 fwd 10.30.1.1 ip from 10.30.1.0/22 to not 10.30.1.0/22 out

65535 11842048952 7561181109905 allow ip from any to any

In this example node’s case, on a cluster running OneFS 9.8, there is one IPv4 subnet and one IPv6 subnet:

# isi network subnets list

ID                Subnet    Gateway|Priority      Pools     SC Service Addrs     Firewall Policy

------------------------------------------------------------------------------------------------

groupnet0.subnet0 10.30.1.0/22   10.30.1.1|10     pool0     10.30.1.100-10.30.1.110               default_subnets_policy

groupnet0.subnet1 2620:0:170:7c0f::/64 2620:0:170:7c0f::1|20 ipv6pool  2620:0:170:7c0f::4-2620:0:170:7c0f::4 default_subnets_policy

------------------------------------------------------------------------------------------------

Total: 2

So enabling SBR on this cluster results in the creation of a ‘fwd’ rule for each subnet:

# ipfw show | grep fwd

62000          33          2640 fwd 2620:0:170:7c0f::1 ip from 2620:0:170:7c0f::/64 to not 2620:0:170:7c0f::/64 out

62001      145794     140490002 fwd 10.30.1.1 ip from 10.30.1.0/22 to not 10.30.1.0/22 out

Please note that the ‘ipfw’ command should not be used to modify the OneFS routing rules (or firewall table) directly!
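Reading the rules is safe, however. As a read-only illustration, the sketch below extracts the ‘fwd’ entries from text shaped like the ‘ipfw show’ output above. The regular expression is written against the example lines only and may not cover every rule form that ipfw can print.

```python
import re

# Sample lines copied from the example output above.
SAMPLE = """\
60002      116082     112914089 allow ip from any to any via mce0 out
62000          33          2640 fwd 2620:0:170:7c0f::1 ip from 2620:0:170:7c0f::/64 to not 2620:0:170:7c0f::/64 out
62001      145794     140490002 fwd 10.30.1.1 ip from 10.30.1.0/22 to not 10.30.1.0/22 out
65535 11842048952 7561181109905 allow ip from any to any
"""

# Rule number, packet count, byte count, then the fwd clause.
FWD_RULE = re.compile(
    r"^(?P<number>\d+)\s+(?P<packets>\d+)\s+(?P<bytes>\d+)\s+"
    r"fwd\s+(?P<gateway>\S+)\s+ip\s+from\s+(?P<subnet>\S+)"
    r"\s+to\s+not\s+\S+\s+out$"
)


def parse_fwd_rules(ipfw_output):
    """Return a list of dicts describing each fwd rule found in the text."""
    rules = []
    for line in ipfw_output.splitlines():
        match = FWD_RULE.match(line.strip())
        if match:
            rules.append({
                "number": int(match.group("number")),
                "packets": int(match.group("packets")),
                "bytes": int(match.group("bytes")),
                "gateway": match.group("gateway"),
                "subnet": match.group("subnet"),
            })
    return rules
```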

By way of a OneFS packet routing example, take the following network topology where three clients, each on separate subnets, are connecting to a PowerScale cluster:

The default gateway is the path for all traffic intended for clients not on the local subnet and not covered by a routing table entry. Utilizing SBR does not negate the need for a default gateway, since SBR effectively overrides the default gateway (but not static routes).

Note that SBR is not simple packet reflection. Instead, it’s the dynamic creation of per-subnet default routes. The router used as the gateway is derived from the FlexNet subnet definitions within the subnet configuration. As such, a gateway needs to be specified for each subnet.

 In addition to a gateway address, each subnet also has a defined priority. For example:

Or via the CLI:

# isi network subnets modify groupnet0.subnet1 --gateway 10.30.1.1 --gateway-priority 10

With SBR disabled, the highest priority gateway (i.e., the reachable gateway with the lowest priority value) is used as the default route.

Once SBR is enabled, OneFS examines the FlexNet config for each subnet, and then creates ipfw rules that look at the source IP address from the cluster side and force the next-hop to be the gateway IP defined for the subnet which contains that IP address.
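The relationship between a subnet definition and its rule can be sketched as a simple string template, modeled on the ‘fwd’ lines in the earlier ‘ipfw show’ output. This only renders the rule text for illustration (rule numbers and counters are omitted). It is not the OneFS rule generator, and the output should never be fed back into ipfw.

```python
def fwd_rule(subnet_cidr: str, gateway: str) -> str:
    """Render the body of a per-subnet fwd rule: traffic sourced from the
    subnet and destined outside it is forwarded to the subnet's gateway."""
    return f"fwd {gateway} ip from {subnet_cidr} to not {subnet_cidr} out"


# Subnet and gateway pairs taken from the 'isi network subnets list' example.
SUBNETS = [
    ("2620:0:170:7c0f::/64", "2620:0:170:7c0f::1"),
    ("10.30.1.0/22", "10.30.1.1"),
]

RULES = [fwd_rule(cidr, gateway) for cidr, gateway in SUBNETS]
```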

In the previous example with three clients on separate subnets connecting to a cluster, when traffic arrives from a subnet that is unreachable via the default gateway, the following routing rules will be added via ipfw:

The mechanism for adding ipfw rules is stateless, and SBR relies on the source IP address that transmits traffic to the cluster.

A session must be initiated from the source subnet for a corresponding ipfw rule to be created. Also, unless the cluster has received traffic that originated from a subnet that is not reachable via the default gateway, OneFS transmits traffic it originates through the default gateway.

In the next article in this series, we’ll take a look at SBR and its interrelationship with static routes and other OneFS networking components.

OneFS Source Based Routing for IPv6 Networks

Tucked amongst the OneFS 9.8 feature payload were a couple of significant enhancements to source-based routing (SBR). Specifically, the introduction of:

  • IPv6 network support.
  • SBR enabled by default for fresh OneFS 9.8 installs.

Source-based routing was first introduced into OneFS back in 9.2. At its core, OneFS source-based routing is essentially ‘per-subnet default routes’. OneFS parses the FlexNet configuration for each subnet, and then creates routing rules corresponding to the IP address from the cluster side, forcing the next-hop to be the router IP defined for the subnet which contains that IP address.

Until OneFS 9.8, SBR was disabled by default, and so required manual configuration in order to run it. Additionally, in OneFS 9.7 and earlier, SBR only supported IPv4 networks. With the release of OneFS 9.8, both IPv4 and IPv6 networks are now fully supported. Plus, for new clusters and fresh installs, SBR is now enabled by default. However, existing clusters that are upgraded to OneFS 9.8 will retain their existing configuration. So SBR will remain disabled unless it had already been configured to run.

When SBR is disabled and a client request is routed to a node in the cluster, the return traffic will typically traverse the cluster’s default route.

With a large number of clients connected, there is a possibility of overloading the default route with a deluge of traffic. However, when SBR is enabled, each subnet has a defined priority gateway and return traffic is sent over the path that the request came from rather than the default route.

If traffic arrives from a subnet that isn’t reachable through the default gateway, routing rules are added for it. These rules are stateless and depend entirely on the source IP that sends traffic to the cluster.

So, with a well-balanced client network topology, client connections will follow their source routes and load will be automatically distributed more evenly and bi-directionally over the source paths, rather than returning across the cluster’s default route. This has the potential benefit of network performance improvements in addition to a more even distribution.

From the OneFS CLI, the ‘isi network external view’ command can be used to check the state of the external network configuration, as well as configure SBR.

For example:

# isi network external view

    Client TCP Ports: 2049, 445, 20, 21, 80

    Default Groupnet: groupnet0

  SC Rebalance Delay: 0

Source Based Routing: False

       SC Server TTL: 900


IPv6 Settings:

                   IPv6 Enabled: True

IPv6 Auto Configuration Enabled: False

       IPv6 Generate Link Local: False

          IPv6 Accept Redirects: False

                       IPv6 DAD: Disabled

          IPv6 SSIP Perform DAD: False

In the example above, SBR is disabled, but can be easily enabled as follows:

# isi network external modify --sbr=true

# isi network external view | grep -i source

Source Based Routing: True

Similarly the following syntax will disable SBR:

# isi network external modify --sbr=false

# isi network external view | grep -i source

Source Based Routing: False
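When checking the SBR state from a script rather than by eye, the ‘Source Based Routing’ field can be read out of the command's text output. The sketch below parses a captured copy of that output; the function name is hypothetical, and the sample is abbreviated from the example above.

```python
# Abbreviated copy of the 'isi network external view' output shown earlier.
SAMPLE = """\
    Client TCP Ports: 2049, 445, 20, 21, 80
    Default Groupnet: groupnet0
  SC Rebalance Delay: 0
Source Based Routing: False
       SC Server TTL: 900
"""


def sbr_enabled(view_output: str) -> bool:
    """Return the Source Based Routing state from the view output."""
    for line in view_output.splitlines():
        # Split on the first colon only, so values containing commas
        # or further colons are left intact.
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "source based routing":
            return value.strip().lower() == "true"
    raise ValueError("Source Based Routing field not found")
```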

SBR can also be configured from the OneFS WebUI by navigating to Cluster management > Network configuration > Settings:

Under the hood, SBR uses ipfw to create and manage its routing rules. For example, the following CLI output shows the two corresponding ipfw rules (62000 and 62001) that are created for IPv6 and IPv4 respectively when SBR is enabled: