OneFS NFS Netgroups

A OneFS network group, or netgroup, defines a network-wide group of hosts and users. As such, they can be used to restrict access to shared NFS filesystems, etc. Network groups are stored in a network information services, such as LDAP, NIS, or NIS+, rather than in a local file. Netgroups help to simplify the identification and management of people and machines for access control.

The isi_netgroup_d service provides netgroup lookups and caching for consumers of the ‘isi_nfs’ library.  Only mountd and the ‘isi nfs’ command-line interface use this service.  The isi_netgroup_d daemon maintains a fast, persistent cluster-coherent cache containing netgroups and netgroup members.  isi_netgroup_d enforces netgroup TTLs and netgroup retries.  A persistent cache database (SQLite) exists to store and recover cache data across reboots.  Communication with isi_netgroup_d is via RPC and it will register its service and port with the local rpcbind.

Within OneFS, the netgroup cache possesses the following gconfig configuration parameters:

# isi_gconfig -t nfs-config | grep cache

shared_config.bypass_netgroup_cache_daemon (bool) = false

netcache_config.nc_ng_expiration (uint32) = 3600000

netcache_config.nc_ng_lifetime (uint32) = 604800

netcache_config.nc_ng_retry_wait (uint32) = 30000

netcache_config.ncdb_busy_timeout (uint32) = 900000

netcache_config.ncdb_write (uint32) = 43200000

netcache_config.nc_max_hosts (uint32) = 200000

Similarly, the following files are used by the isi_netgroup_d daemon:

File Purpose
     /var/run/isi_netgroup_d.pid The pid of the currently running isi_netgroup_d
     /ifs/.ifs/modules/nfs/nfs_config.gc Server configuration file
     /ifs/.ifs/modules/nfs/netcache.db Persistent cache database
     /var/log/isi_netgroup_d.log Log output file

In general, using IP addresses works better than hostnames for netgroups. This is because hostnames require a DNS lookup and resolution from FQDN to IP address. Using IP addresses directly saves this overhead.

Resolving a large set of hosts in the allow/deny list is significantly faster when using netgroups. Entering a large host list in the NFS export means OneFS has to look up the hosts for each individual NFS export. In Netgroups, once looked up, it is cached by netgroups, so it doesn’t have to be looked up again if there are overlap between exports. It is also better to use an LDAP (or NIS) server when using Netgroups instead of the flat file. If you have a large list of hosts in the netgroups file, it can take a while to resolve as it is single threaded and sequential. LDAP/NIS provider based netgroups lookup is parallelized.

The OneFS netgroup cache has a default limit in gconfig of 200,000 host entries.

# isi_gconfig -t nfs-config | grep max

netcache_config.nc_max_hosts (uint32) = 200000

So what is the waiting period between when /etc/netgroup is updated to when the NFS export realizes the change? OneFS uses a netgroup cache and both its expiration and lifetime are both tunable. The netgroup expiration and lifetime can be configured with this following CLI command:

# isi nfs netgroup modify

--expiration or -e <duration>

Set the netgroup expiration time.

--lifetime or -l <duration>

Set the netgroup lifetime.

OneFS also provides the ‘isi nfs netgroups flush’ CLI command, which can be used to force a reload of the file.

# isi nfs netgroup flush

        [--host <string>]

        [{--verbose | -v}]

        [{--help | -h}]


Options:

    --host <string>

        IP address of the node to flush. Defaults is all nodes.


  Display Options:

    --verbose | -v

        Display more detailed information.

    --help | -h

        Display help for this command.

However, it is not recommended to flush the cache as a part of normal cluster operation. Refresh will walk the file and update the cache as needed.

Another area of caution is applying a netgroup with unresolved hostname(s). This will also slow down resolution of the hosts in the file when a refresh or startup of node happens. The best practice is to ensure that each host in the netgroups file is resolvable in DNS, or to just use IP addresses rather than names in the netgroup.

When it come to switching to a netgroup for clients already on an export, a netgroup can be added and clients removed in one step (#1 –add-client netgroup –remove-clients 1,2,3,etc). The cluster allows a mix of netgroup and host entries, so duplicates are tolerated. However, it’s worth noting that if there are unresolvable hosts in both areas, the startup resolution time will take that much longer.

OneFS & Files Per Directory

Had several recent enquiries from the field recently asking about the low impact methods to count the number of files in large directories containing hundreds of thousands to millions of files).

Unfortunately, there’s no ‘silver bullet’ command or data source available that will provide that count instantaneously: Something will have to perform a treewalk to gather these stats.  That said, there are a couple of approaches to this, each with its pros and cons:

  • If the customer has a SmartQuotas license, they can configure an advisory directory quota on the directories they want to check. As mentioned, the first job run will require working the directory tree, but they can get fast, low impact reports moving forward.
  • Another approach is using traditional UNIX commands, either from the OneFS CLI or, less desirably, from a UNIX client. The two following commands will both take time to run: “
# ls -f /path/to/directory | wc –l
# find /path/to/directory -type f | wc -l

It’s worth noting that when counting files with ls, you’ll probably get faster results by omitting the ‘-l’ flag and using ‘-f’ flag instead. This is because ‘-l’ resolves UID & GIDs to display users/groups, which creates more work thereby slowing the listing. In contrast,  ‘-f’ allows the ‘ls’ command to avoid sorting the output. This should be faster, and reduce memory consumption when listing extremely large numbers of files.

Ultimately, there really is no quick way to walk a file system and count the files – especially since both ls and find are single threaded commands.  Running either of these in the background with output redirected to a file is probably the best approach.

Depending on your arguments for the ls or find command, you can gather a comprehensive set of context info and metadata on a single pass.

# find /path/to/scan -ls > output.file

It will take quite a while for the command to complete, but once you have the output stashed in a file you can pull all sorts of useful data from it.

Assuming a latency of 10ms per file it would take 33 minutes for 200,000 files. While this estimate may be conservative, there are typically multiple protocol ops that need to be done to each file, and they do add up. Plus, as mentioned before, ‘ls’ is a single threaded command.

  • If possible, ensure the directories of interest are stored on a file pool that has at least one of the metadata mirrors on SSD (metadata-read).
  • Windows Explorer can also enumerate the files in a directory tree surprisingly quickly. All you get is a file count, but it can work pretty well.
  • If the directory you wish to know the file count for just happens to be /ifs, you can run the LinCount job, which will tell you how many LINs there are in the file system.

Lincount (relatively) quickly scans the filesystem and returns the total count of LINs (logical inodes). The LIN count is essentially equivalent to the total file and directory count on a cluster. The job itself runs by default at the LOW priority, and is the fastest method of determining object count on OneFS, assuming no other job has run to completion.

The following syntax can be used to kick off the Lincount job from the OneFS CLI:

# isi job start lincount

The output from this will be along the lines of “Added job [52]”.

Note: The number in square brackets is the job ID.

To view results, run the following command from the CLI:

# isi job reports view [job ID]

For example:
# isi job reports view 52

LinCount[52] phase 1 (2021-07-06T09:33:33)

------------------------------------------

Elapsed time   1 seconds

Errors         0

Job mode       LinCount

LINs traversed 1722

SINs traversed 0

The "LINs traversed" metric indicates that 1722 files and directories were found.

Note: The Lincount job will also include snapshot revisions of LINs in its count.

Alternatively, if another treewalk job has run against the directory you wish to know the count for, you might be in luck.

At any rate, hundreds of thousands of files is a large number to store in one directory. To reduce the directory enumeration time, where possible divide the files up into multiple subdirectories.

When it comes to NFS, the behavior is going to partially depend on whether the client is doing READDIRPLUS operations vs READDIR. READDIRPLUS is useful if the client is going to need the metadata. However, ff all you’re trying to do is list the filenames, it actually makes that operation much slower.

If you only read the filenames in the directory, and you don’t attempt to stat any associated metadata, then this requires a relatively small amount of I/O to pull the names from the meta-tree, and should be fairly fast.

If this has already been done recently, some or all of the blocks are likely to already be in L2 cache. As such, a subsequent operation won’t need to read from hard disk and will be substantially faster.

NFS is more complicated regarding what it will and won’t cache on the client side, particularly with the attribute cache and the timeouts that are associated with it.

Here are some options from fastest to slowest:

  • If NFS is using READDIR, as opposed to READDIRPLUS, and the ‘ls’ command is invoked with the appropriate arguments to prevent it polling metadata or sorting the output, execution will be relatively swift.
  • If ‘ls’ polls the metadata (or if NFS uses READDIRPLUS) but doesn’t sort the results, output will be fairly immediately, but will take longer to complete overall.
  • If ‘ls’ sorts the output, nothing will be displayed until ls has read everything and sorted it, then you’ll get the output in a deluge at the end.

OneFS MCP

Affectionately named after TRON’s  ‘Master Control Program’, MCP is OneFS’ main utility for distributed service control across a cluster. MCP is responsible for starting, monitoring, and restarting failed services on a cluster. It also monitors configuration files and acts upon configuration changes, propagating local file changes to the rest of the cluster. As such, it performs a similar function to the Windows ‘service control manager’ (SCM) or MacOS ‘launchd’.

MCP is actually comprised of three different processes, one for each of its modes:

  • Master
  • Failsafe
  • Forker

These can be seen when viewing the running processes on a healthy node:

# ps -auxw | grep -i mcp | grep -v grep

root    5400    0.4  0.0  60760  19928  -  Ss   11Jun21    170:08.18 isi_mcp: master (isi_mcp)

root    5179    0.0  0.0  32760  13632  -  Is   11Jun21      0:00.01 isi_mcp: failsafe (isi_mcp)

root    5181    0.0  0.0  31476  12572  -  Is   11Jun21      0:00.36 isi_mcp: forker (isi_mcp)

The ‘Master’ is the central MCP process and does the bulk of the work. It monitors files and services, including the failsafe process, and delegates actions to the forker process.

The role of the ‘Forker’ is to receive command-line actions from the master, execute them, and return the resulting exit codes. It receives actions from the master process over a UNIX domain socket. If the forker is inadvertently or intentionally killed, it’s automatically restarted by the master process. If necessary, MCP will continue trying to restart the forker at an increasing interval. If, after around ten minutes of unsuccessfully attempting to restart the forker, MCP will fire off a CELOG alert, and continue trying. A second alert would then be sent after thirty minutes.

The ‘Failsafe’ process is responsible for starting, monitoring, restarting, and stopping both the Master and Forker. It’s a single threaded process that, if killed, will shut down all three MCP services. If this occurs, the three services will stay down until they are restarted with the ‘isi_mcp’ CLI command. If the master fails and can’t be restarted, MCP will continue attempting to restart it and fire alerts in the same manner as described above for the forker service.

MCP monitors the following files:

File Type Function
/etc/mcp/sys/files/* Configuration files monitored by MCP.
/etc/mcp/sys/services/* Services that MCP starts and monitors.
/etc/ifs/array.xml Cluster configuration file.
/etc/mcp/override/* All files in override directory propagated to all nodes and entered in global mlist.
/etc/mcp/mlist.xml Local mlist (mlists are used to manage and track the above files)
/ifs/.ifsvar/etc/mcp/mlist.xml Master mlist

The following command will list the open files that MCP is currently monitoring on a node:

# for i in `sysctl efs.bam.busy_vnodes | grep -i mcp | awk '{print $4}' | sed -E 's/)//'`; do isi get -L $i | awk '{print $8}'; done

MCP monitors the configuration files in /etc/mcp/sys/files. While monitoring the configuration files MCP does two things:

  • Performs the file change action
  • Propagates config file changes to other nodes

Consider the XML configuration file for the ndmpd service, for example:

# cat /etc/mcp/sys/services/ndmpd

<?xml version="1.0"?>

<service name="ndmpd" enable="0" display="1" options="require-quorum,kill-on-sigquorum,require-post-ifs">

      <isi-meta-tag id="ndmp_service">

        <mod-attribs>enable</mod-attribs>

      </isi-meta-tag>

      <description>Network Data Management Protocol Daemon</description>

      <process name="isi_ndmp_d" pidfile="/var/run/isi_ndmp_d.pid"

               startaction="start" stopaction="stop"/>

      <actionlist name="start">

        <action>/usr/bin/isi_ndmp_d</action>

      </actionlist>

      <actionlist name="stop">

        <action>/usr/bin/killall isi_ndmp_d</action>

      </actionlist>

</service>

Much of what MCP does in response to an event notification is defined by the ‘actionlist’ in a config file. This is simply a list of commands to be executed, with action lists for starting and stopping services, and also for specific configuration files changes (for example, importing a product license).

Many of the local configuration files need to be uniform across the cluster so, unless the ‘notify =0’ flag is set, the master process also copies changed files to /ifs for MCP on other nodes to use.

MCP starts and watches already running services in accordance with their service description files, stored under /etc/mcp/sys/services. These are XML files which describe how a service is to be started when enabled or stopped when disabled.

The XML file also lists under ‘options’ the conditions of the node and/or cluster that must be met before the service can be started (for example above, ‘require-quorum’ or ‘require-post-ifs’, etc).

When a service is monitored, MCP ensures the correct state of the service on a node. If a service is marked ‘enable’, MCP will run the start action until the PID confirms it as running. When a service is marked ‘disable’, MCP will kill the service via its PID. The full list of services and their current state can be viewed with the following CLI command:

# isi services -a

MCP monitors services by observing their PID files (under /var/run), plus the process table itself, to determine if a process is already running or not. It compares this state against the ‘enabled/disabled’ state for the service and determines whether any start or stop actions are required. Services may also be configured to terminate if the cluster loses quorum with the option ‘kill-on-sigquorum’ in their XML file.

Another type of configuration file that MCP monitors is also known as a service override file, which live under /etc/mcp/override. These override files are used to store any current settings for options which have been changed from the defaults. The override files are always shared across the cluster via MCP’s configuration propagation mechanism.

The Master MCP process creates merged lists, or mlists, that are used to track and coordinate the process of managing the XML config and service description files. There are two types of mlist: Local and Master. The master process will automatically create the local mlist at startup if it doesn’t already exist. However, the master mlist is created later since MCP starts and begins operations before /ifs is mounted.

Here’s the mlist entry for the cluster’s NTP service, for example:

    <file>

      <path>/etc/mcp/templates/ntp.conf</path>

      <md5>7772b5d50494c85043933321c21dbb8d</md5>

      <timestamp>1623466667</timestamp>

      <revision>1</revision>

      <array_id>1</array_id>

    </file>

The local mlist has an entry for every file identified in the MCP file configuration files (/etc/mcp/sys/files), an entry for every configuration file (/etc/mcp/sys/files & procs), an entry for an override file for each service (may or may not exist), an entry for /etc/ifs/array.xml. It also contains an entry for the master mlist (/ifs/.ifsvar/etc/mcp/mlist.xml).

# grep mlist.xml mlist.xml

      <path>/ifs/.ifsvar/etc/mcp/mlist.xml</path>

The mlist has an entry for every local file that’s shared across the cluster and the override service files. A coordinator lock file prevents different nodes from making changes to /ifs at the same time.

If MCP attempts to start a service and fails, as long as the service is enabled, it will wait for an interval before attempting to start the service again. This interval doubles in size each time, until it reaches 256 seconds then remains at this frequency.

OneFS SmartFail

OneFS protects data stored on failing nodes or drives in a cluster through a process called Smartfail. During the process, OneFS places a device into quarantine and, depending on the severity of the issue, the data on it into a read-only state. While a device is quarantined, OneFS reprotects the data on the device by distributing the data to other devices.

After all data eviction or reconstruction is complete, OneFS logically removes the device from the cluster and the node or drive can be physically replaced. OneFS only automatically Smartfails devices as a last resort. Nodes and/or drives can also be manually Smartfailed. However, it is strongly recommended to first consult Isilon Support.

Occasionally a device might fail before OneFS detects a problem. If a drive fails without being Smartfailed, OneFS automatically starts rebuilding the data to available free space on the cluster. However, because a node might recover from a transient issue, if a node fails, OneFS does not start rebuilding data unless it is logically removed from the cluster.

A node that is unavailable and reported by isi status as ‘D’, or down, can be Smartfailed. If the node is hard down, likely with a significant hardware issue, the Smartfail process will take longer because data has to be recalculated from the FEC protection parity blocks. That said, it’s well worth attempting to bring the node up if at all possible – especially if the cluster and/or node pools is at the default +2D:1N protection.  The concern here is that, with a node down, there is a risk of data loss if a drive or other component goes bad during the Smartfail process.

If possible, and assuming the disk content is still intact, it can often be quicker to have the node hardware repaired. In this case, the entire node’s chassis (or compute module in the case of Gen 6 hardware) could be replaced and the old disks inserted with original content on them. This should only be performed at the recommendation and under the supervision of Isilon support. If the node down as a result of a journal inconsistency, it will have to be Smartfailed out. In this case, Isilon Support should be engaged to determine an appropriate action plan.

The recommended procedure for Smartfailing a node is as follows. In this example, we’ll assume that node 4 is down:

  1. From the CLI of any node except node 4, run the following command to Smartfail out the node:
# isi devices node smartfail --node-lnn 4
  1. Verify that node is removed from the cluster.
# isi status –q

(An ‘—S-’ will appear in node 4’s ‘Health’ column to indicate it has been Smartfailed).

  1. Monitor the successful completion of the job engine’s MultiScan, FlexProtect/FlexProtectLIN jobs:
# isi job status
  1. Un-cable and remove the node from the rack for disposal

As mentioned above, there are two primary Job Engine jobs that run as a result of a Smartfail:

  • MultiScan
  • FlexProtect or FlexProtectLIN

MultiScan performs the work of the both AutoBalance and Collect jobs simultaneously, and it is triggered after every group change. The reason is that new file layouts and file deletions that happen during a disruption to the cluster may be imperfectly balanced or, in the case of deletions, simply lost.

The Collect job reclaims free space from previously unavailable nodes or drives. A mark and sweep garbage collector, it identifies everything valid on the filesystem in the first phase, then in the second phase scans the drives freeing anything that isn’t marked valid.

AutoBalance ensures that, when node and drive usage across the cluster are out of balance. This job scans through all the drives looking for files to re-layout to make use of the less filled devices.

The purpose of the FlexProtect job is to scan the file system after a device failure to ensure that all files remain protected. Incomplete protection levels are fixed, in addition to missing data or parity blocks caused by drive or node failures. This job is started automatically after Smartfailing a drive or node. If a Smartfailed device was the reason the job started, the device is marked gone (completely removed from the cluster) at the end of the job.

Although a new node can be added to a cluster at any time, it’s best to avoid major group changes during a Smartfail operation. This helps avoid any unnecessary interruptions of a critical job engine data reprotection job. However, since a node is down, there is a window of risk while the cluster is rebuilding the data from that cluster.  Under pressing circumstances the Smartfail operation can be paused, the node added, and then Smartfail resumed once the new node has happily joined the cluster.

Be aware that, if the node you are adding is the same node that was Smartfailed, the cluster maintains a record of that node and may prevent the re-introduction of that node until the Smartfail is complete.  To mitigate risk, Isilon support should definitely be involved to ensure data integrity.

The time for a Smartfail to complete is hard to predict with any accuracy, and is dependent on:

Attribute Description
OneFS release Determines OneFS job engine version and how efficiently it operates.
System hardware Drive types, CPU, RAM, etc.
File system Quantity and type of data (ie. small vs large files), protection, tunables, etc.
Cluster load Processor and CPU utilization, capacity utilization, etc.

Typical Smartfail runtimes range from minutes for fairly empty, idle nodes with SSD and SAS drives to days for nodes with large SATA drives and a high capacity utilization. The FlexProtect job already runs at the highest job engine priority (value=1) and medium impact by default. As such, there isn’t much that can be done to speed up this job, beyond reducing cluster load.

Smartfail is also a valuable tool for proactive cluster node replacement, for example during a hardware refresh.  Provided cluster quorum is not broken, a Smartfail can be initiated on multiple nodes concurrently – but never more than n/2 – 1 nodes (rounded up)!

If replacing an entire node-pool as part of a tech refresh, a SmartPools filepool policy can be crafted to migrate the data to another nodepool across the back-end network. When complete, the nodes can then be Smartfailed out, which should progress swiftly since they are now empty.

If multiple nodes are Smartfailed simultaneously, at the final stage of the process the node remove is serialized with around 60 seconds pause between each. The Smartfail job places the selected nodes in read-only mode while it copies the protection stripes to the cluster’s free space. Using SmartPools to evacuate data from a node or set of nodes in preparation to remove them is generally a good idea, and is usually a relatively fast process.

SmartPools’ Virtual Hot Spare (VHS) functionality helps ensure that node pools maintain enough free space to successfully re-protect data in the event of a Smartfail. Though configured globally, VHS actually operates at the node pool level so that nodes with different size drives reserve the appropriate VHS space. This helps ensures that, while data may move from one disk pool to another during repair, it remains on the same class of storage. VHS reservations are cluster wide and configurable as either a percentage of total storage (0-20%) or as a number of virtual drives (1-4), with the default being 10%.

Note, a Smartfail is not guaranteed to remove all data on a node. Any pool in a cluster that’s flagged with the ‘System’ flag can store /ifs/.ifsvar data. A filepool policy to move the regular data won’t address this data. Also, since SmartPools ‘spillover’ may have occurred at some point, there are no guarantees that an ‘empty’ node is completely devoid of data. For this reason, OneFS still has to search the tree for files that may have blocks residing on the node.

Nodes can be easily Smartfailed via the OneFS WebUI by navigating to Cluster Management > Hardware Configuration and selecting ‘Actions > More > Smartfail Node’ for the desired node(s):

Similarly, the following CLI commands initiate and halt the node Smartfail process respectively. Firstly, the ‘isi devices node smartfail’ command kicks off the Smartfail process on a node and removes it from the cluster.

# isi devices node smartfail -h

Syntax

# isi devices node smartfail

[--node-lnn <integer>]

[--force | -f]

[--verbose | -v]

If necessary, the ‘isi devices node stopfail’ command can be used to discontinue the Smartfail process on a node.

# isi devices node stopfail -h

Syntax

isi devices node stopfail

[--node-lnn <integer>]

[--force | -f]

[--verbose | -v]

Similarly, individual drives within a node can be Smartfailed with the ‘isi devices drive smartfail’ CLI command.

# isi devices drive smartfail { <bay> | --lnum <integer> | --sled <string> }

        [--node-lnn <integer>]

        [{--force | -f}]

        [{--verbose | -v}]

        [{--help | -h}]

OneFS Drain-based Upgrade

In the previous blog article, we looked at the mechanism by which OneFS enables non-disruptive upgrades. For NFS users, rolling node upgrades is typically a pretty seamless event since client connections can be dynamically moved and rebalanced across the other nodes. However, for SMB clients, rolling upgrades can be more impactful.

During an upgrade workflow, nodes will reboot, and the SMB protocol service must be stopped temporarily. This results in a brief disruption for Windows clients  connected to the rebooting node. To solve this, OneFS 9.2 introduces a drain-based upgrade feature, which provides a mechanism by which nodes are prevented from rebooting or restarting protocol services until all SMB clients have disconnected from the node. A single SMB client that does not disconnect can cause the upgrade to be delayed indefinitely, so the cluster administrator is provided with options to reboot the node despite persisting clients.

The drain-based upgrade supports both OneFS, firmware and combined upgrades, and can be configured and managed via the OneFS WebUI, CLI, and RESTful platform API.

  • SMB protocol
  • OneFS upgrades
  • Firmware upgrades
  • Cluster reboots
  • Combined upgrades (OneFS and Firmware)

Drain-based upgrade is predicated upon the parallel upgrade workflow, covered in a previous article, and which offers accelerated upgrades for large clusters by working across OneFS neighborhoods, or fault domains. By concurrently upgrading a node per neighborhood, the more node neighborhoods there are within a cluster the more parallel activity can occur.

For simplicities sake, assume a PowerScale cluster with two neighborhoods with Node 1 to node 3 belonging to neighborhood 1 and Node 4 to node 6 belonging to neighborhood 2.

The following CLI command can be used to identify the correlation between the cluster’s nodes and OneFS neighborhoods, or failure domains:

# sysctl efs.lin.lock.initiator.coordinator_weights

Once the drain-based upgrade is started, a maximum of one node from each neighborhood will get the reservation which allows the nodes to upgrade simultaneously. OneFS will not reboot these nodes until the number of SMB clients is “0”. In this example Node 3 and Node 4 get the reservation for upgrading at the same time. However, there is one SMB connection to Node 3 and two SMB connections to Node 4. Neither of these nodes will be able to reboot until their SMB connection count gets to “0”. At this point, there are three options available:

Drain Action Description
Wait Wait until the SMB connection count reaches “0” or it hits the drain timeout value. The drain timeout value isa configurable parameter for each upgrade process. It is the maximum waiting period. If drain timeout is set to “0”, it means wait forever.
Delay drain Add the node into the delay list to delay client draining. The upgrade process will continue on another node in this neighborhood. After all the non-delayed nodes are upgraded, OneFS will rewind to the node in the delay list.
Skip drain Stop waiting for clients to migrate away from the draining node and reboot immediately.

 

The following CLI command can be used to confirm whether there are any active SMB connections. In this case, node 1 has one connected Windows client:

# isi statistics query current --keys=node.clientstats.connected.smb

 Node  node.clientstats.connected.smb

-------------------------------------

    1                               1

-------------------------------------

The ‘isi upgrade’ CLI command syntax can be used to perform the drain-based upgrade, and now includes flags for configuring drain-timeout and alert-timeout. In this example setting each to value 60 minutes and 45 minutes respectively. As such, if there is still an SMB connection after 45 minutes, a CELOG alert will be triggered to notify the cluster administrator. And after an hour, any remaining SMB connections will be dropped and the node upgrade reboot will continue.

# isi upgrade start --parallel --skip-optional --install-image-path=/ifs /data/<installation-file-name> --drain-timeout=60m --alert-timeout=45m

From the OneFS WebUI, the same can be achieved by navigating to Upgrade under Cluster management. In the example below, the WebUI indicates that node 1 is waiting for a draining SMB client. The response can be either to Skip or Delay.

If ‘Delay’ is selected, node one pauses to allow the remaining active client connection to drain:

After ‘Skip’ is chosen, the active client count is reduced to 0 and upgrade continues:

Here, the WebUI reports that node 2 has completed upgraded and is in the process of rebooting:

Once all nodes have completed, the upgrade can be committed by running the following command:

# isi upgrade cluster commit

Or from the WebUI:

Finally, confirm that the current version of OneFS is correct by running the following command:

# isi version

OneFS Non-disruptive Upgrades

When it comes to updating the OneFS version on a cluster, there are three primary options:

Of these, the simultaneous reboot is fast but disruptive, in that all the cluster’s nodes are upgraded and restarted in unison.

The other two options, rolling and parallel, are non-disruptive upgrades (NDUs), which allow a storage admin to upgrade a cluster while their end users continue to access data.

During the rolling upgrade process, one node at a time is updated to the new code, and the active clients attached to it are automatically migrated to other nodes in the cluster. Partial upgrade is also permitted, whereby a subset of cluster nodes can be upgraded, and the subset of nodes may also be grown during the upgrade. OneFS also allows an upgrade to be paused and resumed enabling customers to span upgrades over multiple smaller Maintenance Windows.

However, for larger clusters, OneFS also offers a parallel upgrade option. Parallel upgrade provides upgrade efficiency within node pools on clusters with multiple neighborhoods (availability zones), allowing the simultaneous upgrading of a node per neighborhood until the pool is complete . By doing this, the upgrade duration is dramatically reduced, while ensuring that end-users still continue to have full access to their data.

The parallel upgrade option avoids rebooting nodes unless a Diskpools DB reservation can be taken on that node. Each node runs the pre-upgrade optional and mandatory steps in lockstep. Nodes will not proceed to the MarkUpgrading state until the pre-upgrade checks have run successfully on all nodes. Once a node has reached the MarkUpgrading state, it will proceed through the upgrade hooks without regard for the completion state of hook on other nodes (ie not in lockstep).

Given that OneFS’ parallel upgrade option can dramatically improve the OneFS upgrade efficiency without impacting the data availability, the following formula can be used to estimate the duration of the parallel upgrade:

𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 = (𝑝𝑒𝑟 𝑛𝑜𝑑𝑒 𝑢𝑝𝑔𝑟𝑎𝑑𝑒 𝑑𝑢𝑟𝑎𝑡𝑖𝑜𝑛) × (ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑛𝑜𝑑𝑒𝑠 𝑝𝑒𝑟 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟ℎ𝑜𝑜𝑑)

In the above formula:

  • The first parameter – per node upgrade duration – is around 20 minutes on average.
  • The second parameter – the highest number of nodes per neighborhood – can be obtained by running the following CLI command:
# sysctl efs.lin.lock.initiator.coordinator_weights

For example, consider a 150 node OneFS cluster. In an ideal layout, there would be 15 neighborhoods, each containing ten nodes. Neighborhood 1 would comprise nodes 1 to 10, neighborhood 2, nodes 11 to 20, and so on and so forth.

During the parallel upgrade, the upgrade framework will pick at most one node from each neighborhood, to run the upgrading job simultaneously. So in this case, node 1 from neighborhood 1st, node 11 from neighborhood 2nd, node 21 from neighborhood 3rd and etc will be upgraded at the same time. Considering, they are all in different neighborhoods or failure domain, it will not impact the current running workload.  After the first pass completes, it will go to the 2nd pass and then 3rd and etc.

So, in the 150 node example above, the estimated duration of the parallel upgrade is 200 minutes:

𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 = (𝑝𝑒𝑟 𝑛𝑜𝑑𝑒 𝑢𝑝𝑔𝑟𝑎𝑑𝑒 𝑑𝑢𝑟𝑎𝑡𝑖𝑜𝑛) × (ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑛𝑜𝑑𝑒𝑠 𝑝𝑒𝑟 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟ℎ𝑜𝑜𝑑) = 20 × 10 = 200 𝑚𝑖𝑛𝑢𝑡𝑒𝑠

Under the hood, the OneFS non-disruptive upgrade system consists an UpgradeAgent and UpgradeSupervisor components.

The UpgradeAgent is a daemon that runs on every node. The UpgradeAgent’s role is to continually attempt to advance the upgrade process through to completion. It accomplishes this doing two things:

  1. Ensuring that an UpgradeSupervisoris running somewhere on the cluster by (a) checking to see if an upgrade is in progress and (b) waiting for its time slot, grabbing a lock file and then attempting to launch a supervisor.
  2. Receiving messages from any actively running UpgradeSupervisorand taking action on those messages.

The UpgradeSupervisor is a short-lived process which assesses the current state of the cluster and then takes action to advance the progress of the upgrade. The UpgradeSupervisor is stateless. It collects the persistent state of each node from that node’s UpgradeAgent using a status message. It also collects any information persistent on a cluster-wide basis. After reconstructing the current state of the upgrade process, it will then take action to affect the progress of the upgrade by dispatching an action message to the appropriate UpgradeAgent.

Since isi upgrade is an asynchronous process, the nodes in the cluster take turns running the controlling process. As such, the process that starts the upgrade does not run the upgrade but only sets it up. So when an ‘isi upgrade’ CLI command is run it will return fairly quickly. This also means that can’t stop the upgrade by stopping one process. Instead, a stop and restart option is provided using the ‘isi upgrade pause’ and ‘isi upgrade resume’ CLI commands.

Parallel upgrades are easily configured from the OneFS CLI by navigating to Cluster Management > Upgrade, and selecting ‘Parallel upgrade’ from the Upgrade type drop-down menu:

This can also be kicked-off from the OneFS command line using the following CLI syntax:

# isi upgrade start --parallel <upgrade_image>

Similarly, to start a rolling upgrade, which is the default, run:

# isi upgrade cluster start <upgrade_image>

The following CLI syntax will initiate a simultaneous upgrade:

# isi upgrade cluster start --simultaneous <upgrade_image>

Note that the upgrade framework always defaults to a rolling upgrade. Caution is advised when using the CLI to perform a simultaneous upgrade and the scheduling ‘type’ must be specified, i.e., –rolling, –simultaneous or –parallel

For example:

# isi upgrade cluster start /ifs/install.tar isi upgrade cluster start <code_path>

Since OneFS supports the ability to roll back to the previous version, in-order to complete an upgrade it must be committed.

isi upgrade cluster commit

Up until the time an upgrade is committed, an upgrade can be rolled back to the prior version as follows.

isi upgrade cluster rollback

The isi upgrade view CLI command can be used to monitor how the upgrade is progressing:

# isi upgrade viewisi upgrade view -i/--interactive

The following command will provide more detailed/verbose output:

# isi_upgrade_status

A faster, simpler version of isi_upgrade_status is also available:

# isi_upgrade_node_state-a (aggregate the latest hook update for each node)-devid=<X,Y,E-F>  (filter and display by devid)-lnn=<X-Y,A,C> (filter and display by LNN)-ts (time sort entries)

If the end of a maintenance window is reached but the cluster is not fully upgraded, the upgrade process can be quiesced and then restarted using the following CLI commands:

# isi upgrade pause
# isi upgrade resume

For example:

# isi upgrade pause

You are about to pause the running process, are you sure?  (yes/[no]):

yes

The process will be paused once the current step completes.

The current operation can be resumed with the command:

# isi upgrade resume

Note that pausing is not immediate: The upgrade will remain in a “Pausing” state until the currently
upgrading node is completed. Additional nodes will not be upgraded until the upgrade process is resumed.

The ‘pausing’ state can be viewed with the following commands: ‘isi upgrade view’ and ‘isi_upgrade_status’. Note that a rollback can be initiated either during ‘Pausing’ or ‘Paused’ states. Also, be aware that the ‘isi upgrade pause’ command has no effect when performing a simultaneous OneFS upgrade.

A rolling reboot can be initiated from the CLI on a subset of cluster nodes using the ‘isi upgrade rolling-reboot’ syntax and the ‘–nodes’ flag specifying the desired LNNs for upgrade:

# isi upgrade rolling-reboot --help

Description:

    Perform a Rolling Reboot of cluster.

Required Privileges:

    ISI_PRIV_SYS_UPGRADE

Usage:

    isi upgrade cluster rolling-reboot

        [--nodes <integer_range_list>]

        [--force]

        [{--help | -h}]

Options:

    --nodes <integer_range_list>

        List of comma (1,3,7) or dash (1-7) specified node LNNs to select. "all"

        can also be used to select all the cluster nodes at any given time.

  Display Options:

    --force

        Do not ask confirmation.

    --help | -h

        Display help for this command.

This ‘isi upgrade view’ syntax provides better visibility, status and progress of the rolling reboot process. For example:

# isi upgrade view

Upgrade Status:

Current Upgrade Activity: RollingReboot

   Cluster Upgrade State: committed

   Upgrade Process State: Not started

      Current OS Version: 9.2.0.0

      Upgrade OS Version: N/A

        Percent Complete: 0%

Nodes Progress:

     Total Cluster Nodes: 3

       Nodes On Older OS: 3

          Nodes Upgraded: 0

Nodes Transitioning/Down: 0

LNN  Progress  Version  Status

---------------------------------

1    100%        9.2.0.0  committed

2    rebooting   Unknown  non-responsive

3    0%          9.2.0.0  committed

Due to the duration of OneFS upgrades on larger clusters, it can sometimes be unclear if an OS upgrade is actually progressing or has stalled. To address this, if an upgrade is not making progress after fifteen minutes, the upgrade framework automatically sends a SW_UPGRADE_NODE_NON_RESPONSIVE alert via CELOG. For example:

# isi event events list

ID        Occurred       Sev    Lnn   Eventgroup ID  Message                                                                     

---------------------------------------------------------------------------------------------------------------

2.1805  06/14 04:33  C       2      1087               Excessive Time executing a Hook on Node: 3
# isi status

...

Critical Events:

Time                 LNN  Event                          

---------------     ----  ------------------------------ 

06/14 05:16:30  2    Excessive Time executing a ... 

...

The isi_upgrade_logs command also provides detailed upgrade tracking and debugging data.

Usage: isi_upgrade_logs [-a|--assessment][--lnn][--process {process name}][--level {start level,end level][--time {start time,end time][--guid {guid} | --devid {devid}]

 + No parameter this utility will pull error logs for the current upgrade process

 + -a or --assessment - will interrogate the last upgrade assessment run and display the results

The following arguments enable filtering to help extract the desired upgrade information:

Filter CMD Flag Description
–guid dump the logs for the node with the supplied guid
–devid dump the logs for the node/s with the supplied devid/s
–lnn dump the logs for the node/s with the supplied lnn/s
–process dump the logs for the node with the supplied process name
–level dump the logs for the supplied level range
–time dump the logs for the supplied time range
–metadata dump the logs matching the supplied regex

For example, to display all of the logs generated by isi_upgrade_agent_d on the node with LNN1:

# isi_upgrade_logs --lnn 1 --process /usr/sbin/isi_upgrade_agent_d  
…
1  2021-06-14T18:06:15  /usr/sbin/isi_upgrade_agent_d  Debug  Starting /usr/share/upgrade/event-actions/pre-upgrade-optional/read_only_node_check.py

1  2021-06-14T23:59:59  /usr/sbin/isi_upgrade_agent_d  Debug  Starting /usr/share/upgrade/event-actions/pre-upgrade-optional/isi_upgrade_checker

1  2021-06-14T18:06:15  /usr/sbin/isi_upgrade_agent_d  Debug  Starting /usr/share/upgrade/event-actions/pre-upgrade-optional/volcopy_check

1  2021-06-14T18:06:15  /usr/sbin/isi_upgrade_agent_d  Debug  Starting /usr/share/upgrade/event-actions/pre-upgrade-optional/empty

1  2021-06-14T18:06:15  /usr/sbin/isi_upgrade_agent_d  Debug  Starting Hook [/usr/share/upgrade/event-actions/pre-upgrade-optional/read_only_node_check.py]
 …

Note that the ‘–process’ flag requires the full name including path to be specified, as it is displayed in the logs.

For example, the following CLI syntax displays a list all of the Upgrade-related process names that have logged to LNN 1:

# isi_upgrade_logs --lnn 1 | awk ‘{print $3}’ | sort | uniq

These process names can then be added to the ‘–process’ argument.

OneFS CELOG Alerts and Events WebUI

The OneFS 9.2 release introduced a number of OneFS usability enhancements for managing cluster events and alerts. This new functionality makes it considerably simpler to filter events chronologically, categorize by their status, filter by the severity, search the event history, resolve, suppress or ignore bulk events, and manage scheduled maintenance windows.

For example, you can easily categorize, identify, and filter events by using the following criteria:

Action Detail
Show ·         Show events for:

–    Today

–    This week

–    This month

–    Custom range/

–    All

Categorize ·         Categorize events by their status:

–    Active

–    Ignored

–    Resolved

–    All

Filter ·         Filter events by severity:

–    Emergency

–    Critical

–    Warning

–    Information

Search ·         Search for specific event(s) in the event history
Resolve ·         Resolve bulk events
Ignore ·         Ignore bulk events

The new WebUI page for event group history can be accessed by navigating to Cluster management > Events and Alerts. For example:

With OneFS 9.2, CELOG maintenance mode can also easily be manually enabled and disabled. During a maintenance window, the system will continue to log events but not generate alerts. However, all events that occurred during the maintenance window can then be reviewed upon disabling maintenance mode. Active event groups will automatically resume generating alerts when the scheduled maintenance period ends. For example, to enable CELOG maintenance mode, in the OneFS WebUI select Cluster Management > Events and Alerts > Alert management tab and click on the ‘Enable CELOG maintenance mode’ button. In the prompt window, select ‘Enable CELOG maintenance mode’ as follows:

Create an Alert channel. Either SMTP or SNMP can be configured for the alert channel communication, and can be created by selecting ‘Create channel’:

To create an alert rule, click the tab Alerting rule and then click the button Create alert rule. In the prompt window, fill the Rule name, set the Rule condition to NEW, apply it to all the Alert categories, and attached it to the channel you have just created.

Events can be created using CLI syntax similar to the following:

# /usr/bin/isi_celog/celog_send_events.py -o 940100002

Heap looptimes: [-1]

running -1 [940100002]

1612871342 :: Sending eventids [940100002] with specifier None

940100002 message is OneFS {version} is currently running on unsupported nodes (devid(s) {devids}). {msg}.

1.195 (70368744177859) corresponds to eventid 940100002

Out of events to run. Exiting.
# /usr/bin/isi_celog/celog_send_events.py -o 940100001

Heap looptimes: [-1]

running -1 [940100001]

1612872343 :: Sending eventids [940100001] with specifier None

940100001 message is OneFS {version} is currently running and is not supported on this hardware: {msg}.

1.196 (70368744177860) corresponds to eventid 940100001

Out of events to run. Exiting.

During maintenance mode, OneFS will still show the event but there will be no associated alert. In this example, there is no SMTP alert email triggered.

The following CLI syntax can also be used to filter all the events which happened while the cluster was in CELOG maintenance mode:

# isi event groups list --maintenance-mode=true

ID   Started     Ended       Causes Short                           Lnn  Events  Severity

------------------------------------------------------------------------------------------

16   02/09 11:49 --          HW_CLUSTER_ONEFS_VERSION_NOT_SUPPORTED 1    1       critical

17   02/09 12:05 02/09 12:19 HW_ONEFS_VERSION_NOT_SUPPORTED         1    1       critical

------------------------------------------------------------------------------------------

Click the “Disable CELOG maintenance mode’ button. and select one of the following from the display window:

    1. View event details
    2. Ignore event
    3. Resolve event

In this example, the event HW_ONEFS_VERSION_NOT_SUPPORTED is marked resolved by clicking Action and Resolve event.

After the CELOG maintenance mode is disabled, you will get the email notification only for HW_CLUSTER_ONEFS_VERSION_NOT_SUPPORTED. The event which has been marked resolved will not trigger any notification.

When an event type is suppressed, it prevents an event from alerting on all configured CELOG channels. However, the event will still be displayed in the event group history.

To suppress an event type, click the button Suppress for a specific event under Event type ID tab. In this example, both 930100006 and 930100005 have been suppressed.

Create several events, for example by using CLI commands such as the following:

# /usr/bin/isi_celog/celog_send_events.py -o 930100006

Heap looptimes: [-1]

running -1 [930100006]

1612873812 :: Sending eventids [930100006] with specifier None

930100006 message is {sensor} out of spec in chassis {chassis} slot {slot}.

1.200 (70368744177864) corresponds to eventid 930100006

Out of events to run. Exiting.
# /usr/bin/isi_celog/celog_send_events.py -o 930100005

Heap looptimes: [-1]

running -1 [930100005]

1612873817 :: Sending eventids [930100005] with specifier None

930100005 message is {sensor} out of spec in chassis {chassis} slot {slot}.

1.201 (70368744177865) corresponds to eventid 930100005

Out of events to run. Exiting.

To list all the events in the suppressed list, use the following CLI syntax:

# isi event suppress list

ID        Name

—————————-

930100005 HWMON_ANY_DISCRETE

930100006 HWMON_ANY_METERS

—————————-

These suppressed events will only show in the event history and will not trigger any notification in any channels.

The desired event types can be un-suppressed by clicking pertinent the Un-suppress button(s).

 

OneFS External Key Management for Data Encryption

Data-at-rest data is inactive content that is physically housed on a cluster, or other storage medium. Protecting this data with cryptography ensures that it’s guarded against theft, in the event that drives or nodes are removed from a PowerScale cluster. Data-at-rest encryption DARE) is a requirement for federal and industry regulations ensuring data is encrypted when it is stored. OneFS has provided DARE solutions for many years through self-encrypting drives (SEDs) and, until now, an internal key management system.

The new OneFS 9.2 release introduces External Key Management (EKM) support for encrypted clusters, through the key management interoperability protocol, or KMIP. This enables offloading of the Master Key from a node to an External Key Manager, such as SKLM, SafeNet or Vormetric. Enhanced security is inherently provided through the separation of key manager from cluster, since node(s) cannot be rebooted without the keys. It also supports the secure transport of nodes. Additionally, centralized key management is also made available for multiple SED clusters, and provides the option to migrate existing keys from a cluster’s internal key store.

EKM provides enhanced security through the separation of the key manager from the cluster, enabling the secure transport of nodes, and helping organizations to meet regulatory compliance and corporate data at rest security requirements.

In order to use the OneFS 9.2 External Key Manager feature, clearly a cluster with self-encrypting drives is needed. Additionally, for the SED drives to be unlocked and user data made available, each node in the cluster must first contact the KMIP server to obtain the master encryption key from the server. Nodes in the cluster cannot boot without contacting the KMIP server first. Note that clusters without all their nodes connected to a front-end network (NANON) do not support External Key Management.

For the server, a KMIP Compliant Server supporting KMIP version 1.2 or greater is needed. Also:

  • KMIP Storage Array with SEDS Profile Version 1.0
  • KMIP server host/port information
  • 509 PKI for TLS mutual authentication
    • Certificate authority bundle
    • Client certificate and private key

And the following key manager servers have been verified with OneFS 9.2:

Vendor Key Manager & Version
Dell EMC CloudLink Center 6.0
Gemalto Gemalto KeySecure 8.7 k150v

Gemalto KeySecure 8.7 k170v

IBM Secure Key Lifecycle Manager (SKLM) 2.6.0.2

Secure Key Lifecycle Manager (SKLM) 2.7.0.0

Secure Key Lifecycle Manager (SKLM) 3.0.0

Thales E-Security KeyAuthority 4.0

External Key Management can be configured on OneFS 9.2 as follows:

  1. Obtain the KMIPs Server and Client Certificates. Copy both certificates to the cluster and make a note of the file names and location.
  2. From the OneFS web interface, select Access > Key Management

Alternatively, this can also be accomplished via the OneFS CLI:

# isi keymanager kmip servers create
  1. From the WebUI Key Management page, enter the KMIP server information and specify the filename with the server/client certificates’ location. If the KMIP has a client certificate password specified, enter that here. Check the “Enable Key Management” box and click Submit.

  1. Next, OneFS contacts the KMIP and confirms the connection or displays any errors.
  2. Once the KMIP server is added, the keys can now be migrated. Click the Keys tab to display all current Master Keys on the cluster. Click on Migrate all to migrate the keys to the KMIP server. From the “Migrate all” pop-up, click Migrate to start the migration.

  1. The key migration process may take several minutes or more to complete depending on the cluster and network utilization. During this time, a “Migration in process” message is displayed.

  1. Once the process is complete, and “Migration Successful” message is displayed:

The following OneFS key management CLI commands are also available:

To configure external KMIT servers:

# isi keymanager kmip servers -h     

Description:

    Configure external KMIP servers.

Required Privileges:

    ISI_PRIV_KEY_MANAGER

Usage:

    isi keymanager kmip servers <action>

        [--timeout <integer>]

        [{--help | -h}]

Actions:

    create    Configure a new external KMIP server.

    delete    Delete a external KMIP server.

    modify    Modify a external KMIP server.

    list      View a list of configured external KMIP servers.

    view      View a single external KMIP server.

Options:

  Display Options:

    --timeout <integer>

        Number of seconds for a command timeout (specified as 'isi --timeout NNN <command>').

    --help | -h

        Display help for this command.

See 'isi keymanager kmip servers <subcommand> --help' for more information on a specific subcommand.

To manage SED keystore settings:

# isi keymanager sed settings -h

Description:

    Manage self-encrypting drive keystore settings.

Required Privileges:

    ISI_PRIV_KEY_MANAGER

Usage:

    isi keymanager sed settings <action>

        [{--help | -h}]

Actions:

    modify    Modify SED settings

    view      View current SED settings.

Options:

  Display Options:

    --help | -h

        Display help for this command.

See 'isi keymanager sed settings <subcommand> --help' for more information on a specific subcommand.

And to report keymanager SED status:

# isi keymanager sed status    

 Node Status  Location  Remote Key ID  Error Info(if any)
----------------------------------------------------------
  1   REMOTE  Server    F84B50640CABD44B9D5F75427C2B5E

  2   REMOTE  Server    24285969BD8804A9A61EE39D99573B

  3   REMOTE   Server    7D561B1CA89B72B891B21BF834097F  
-----------------------------------------------------------
Total: 3

Bigdata File Formats Support on Dell EMC ECS 3.6

This article describes the Dell EMC ECS’s support for Apache Hadoop file formats in terms of disk space utilization. To determine this, we will use Apache Hive service to create and store different file format tables and analyze the disk space utilization by each table on the ECS storage.

Apache Hive supports several familiar file formats used in Apache Hadoop. Hive can load and query different data files created by other Hadoop components such as PIG, Spark, MapReduce, etc. In this article, we will check Apache Hive file formats such as TextFile, SequenceFIle, RCFile, AVRO, ORC and Parquet formats. Cloudera Impala also supports these file formats.

To begin with, let us understand a bit about these Bigdata File formats. Different file formats and compression codes work better for different data sets in Hadoop, the main objective of this article is to determine their supportability on DellEMC ECS storage which is a S3 compatible object store for Hadoop cluster.

Following are the Hadoop file formats

Test File: This is a default storage format. You can use the text format to interchange the data with another client application. The text file format is very common for most of the applications. Data is stored in lines, with each line being a record. Each line is terminated by a newline character(\n).

The test format is a simple plane file format. You can use the compression (BZIP2) on the text file to reduce the storage spaces.

Sequence File: These are Hadoop flat files that store values in binary key-value pairs. The sequence files are in binary format and these files can split. The main advantage of using the sequence file is to merge two or more files into one file.

RC File: This is a row columnar file format mainly used in Hive Datawarehouse, offers high row-level compression rates. If you have a requirement to perform multiple rows at a time, then you can use the RCFile format. The RCFile is very much like the sequence file format. This file format also stores the data as key-value pairs.

AVRO File: AVRO is an open-source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and a program written in any programming language. Avro is one of the popular file formats in Big Data Hadoop based applications.

ORC File: The ORC file stands for Optimized Row Columnar file format. The ORC file format provides a highly efficient way to store data in the Hive table. This file system was designed to overcome limitations of the other Hive file formats. The Use of ORC files improves performance when Hive is reading, writing, and processing data from large tables.

More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

Parquet File: Parquet is a column-oriented binary file format. The parquet is highly efficient for the types of large-scale queries. Parquet is especially good for queries scanning particular columns within a particular table. The Parquet table uses compression Snappy, gzip; currently Snappy by default.

More information on the Parquet file format: https://parquet.apache.org/documentation/latest/

Please note for below testing Cloudera CDP Private Cloud Base 7.1.6 Hadoop cluster is used.

Disk Space Utilization on Dell EMC ECS

What is the space on the disk that is used for these formats in Hadoop on Dell EMC ECS? Saving on disk space is always a good thing, but it can be hard to calculate exactly how much space you will be used with compression. Every file and data set is different, and the data inside will always be a determining factor for what type of compression you’ll get. The text will compress better than binary data. Repeating values and strings will compress better than pure random data, and so forth.

As a simple test, we took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html. The compressed bz2 download measures at 108.5 Mb, and uncompressed at 657.5 Mb. We then uploaded the data to Dell EMC ECS through s3a protocol, and created an external table on top of the uncompressed data set:

Copy the original dataset to Hadoop cluster
[root@hop-kiran-n65 ~]# ll
total 111128
-rwxr-xr-x 1 root root 113753229 May 28 02:25 2008.csv.bz2
-rw-------. 1 root root 1273 Oct 31 2020 anaconda-ks.cfg
-rw-r--r--. 1 root root 36392 Dec 15 07:48 docu99139
[root@hop-kiran-n65 ~]# hadoop fs -put ./2008.csv.bz2 s3a://hive.ecs.bucket/diff_file_format_db/bz2/
[root@hop-kiran-n65 ~]# hadoop fs -ls s3a://hive.ecs.bucket/diff_file_format_db/bz2/
Found 1 items
-rw-rw-rw- 1 root root 113753229 2021-05-28 02:00 s3a://hive.ecs.bucket/diff_file_format_db/bz2/2008.csv.bz2
[root@hop-kiran-n65 ~]#
From Hadoop Compute Node, create a database with data location on ECS bucket and create an external table for the flights data uploaded to ECS bucket location.
DROP DATABASE IF EXISTS diff_file_format_db CASCADE;

CREATE database diff_file_format_db COMMENT 'Holds all the tables data on ECS bucket' LOCATION 's3a://hive.ecs.bucket/diff_file_format_db' ;
USE diff_file_format_db;

Create external table flight_arrivals_txt_bz2 (
year int,
month int,
DayofMonth int,
DayOfWeek int,
DepTime int,
CRSDepTime int,
ArrTime int,
CRSArrTime int,
UniqueCarrier string,
FlightNum int,
TailNum string,
ActualElapsedTime int,
CRSElapsedTime int,
AirTime int,
ArrDelay int,
DepDelay int,
Origin string,
Dest string,
Distance int,
TaxiIn int,
TaxiOut int,
Cancelled int,
CancellationCode int,
Diverted int,
CarrierDelay string,
WeatherDelay string,
NASDelay string,
SecurityDelay string,
LateAircraftDelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location 's3a://hive.ecs.bucket/diff_file_format_db/bz2/';
The total number of records in this master table is
select count(*) from flight_arrivals_txt_bz2 ;

+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+
Similarly, create different file format tables using the master table

To create different file formats files by simply specifying ‘STORED AS FileFormatName’ option at the end of a CREATE TABLE Command.

Create external table flight_arrivals_external_orc stored as ORC as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_parquet stored as Parquet as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_textfile stored as textfile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_sequencefile stored as sequencefile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_rcfile stored as rcfile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_avro stored as avro as select * from flight_arrivals_txt_bz2;
Disk space utilization of the tables

Now, let us compare the disk usage on ECS of all the files from Hadoop compute nodes.

[root@hop-kiran-n65 ~]# hadoop fs -du -h s3a://hive.ecs.bucket/diff_file_format_db/ | grep flight_arrivals
597.7 M 597.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_avro
93.5 M 93.5 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_orc
146.6 M 146.6 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_parquet
403.1 M 403.1 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_rcfile
751.1 M 751.1 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_sequencefile
670.7 M 670.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_textfile
[root@hop-kiran-n65 ~]#

Summary

From the below table we can conclude that Dell EMC ECS as S3 storage supports all the Hadoop file formats and provides the same disk utilization as with the traditional HDFS storage.

Compressed Percentage lower is batter and Compression ratio higher is better.

Format
Size
Compressed%
Compressed Ratio
CSV (Text) 670.7 M
BZ2 108.5 M 16.18% 83.82%
ORC 93.5 M 13.94% 86.06%
Parquet 146.6 M 21.85% 78.15%
RC FIle 403.1 M 60.10% 39.90%
AVRO 597.7 M 89.12% 10.88%
Sequence 751.1 M 111.97% -11.87%

Here the default settings and values were used to create all the different file format tables, there were no other optimizations done for this testing. Each file format ships with many options and optimizations to compress the data, only the defaults that ship CDP pvt cloud base 7.1.6 were used.

 

 

 

 

 

 

 

OneFS Cluster Configuration Export & Import

OneFS 9.2 introduces the ability to export a cluster’s configuration, which can then be used to perform a configuration restore to the original cluster, or to an alternate cluster that also supports this feature. A configuration export and import can be performed via either the OneFS CLI or platform API, and encompasses the following OneFS components for configuration backup and restore:

  • NFS
  • SMB
  • S3
  • NDMP
  • HTTP
  • Quotas
  • Snapshots

The underlying architecture comprises four layers , and the process flow is as follows:

Each layer of the architecture is a follows:

 Component Description
User Interface Allows users to submit operations with multiple choices, such as REST, CLI, or WebUI.
pAPI Handler Performs different actions according to the requests flowing in
Config Manager Core layer executing different jobs which are called by PAPI handlers.
Database Lightweight database manage asynchronous jobs, tracing state and receiving task data.

By default, configuration backup and restore files reside at:

File Location
Backup JSON file: /ifs/data/Isilon_Support/config_mgr/backup/<JobID>/<component>_<JobID>.json
Restore JSON file: /ifs/data/Isilon_Support/config_mgr/restore/<JobID>/<component>_<JobID>.json

The log file for configuration manager is located at /var/log/config_mgr.log and can be useful to monitor the progress of a config backup and restore.

So let’s take a look at this cluster configuration management process:

The following procedure steps through the export and import of a cluster’s NFS and SMB configuration – within the same cluster:

  1. Open an SSH connection to any node in the cluster and log in using the root account.
  2. Create several SMB shares and NFS exports using the following CLI command
# isi smb shares create --create-path --name=test --path=/ifs/test

# isi smb shares create --create-path --name=test2 --path=/ifs/test2

# isi nfs exports create --paths=/ifs/test

# isi nfs exports create --paths=/ifs/test2
  1. Export the NFS and SMB configuration using the following CLI command
# isi cluster config exports create --components=nfs,smb --verbose

As indicated in the output below, the job ID for this export task is ‘ PScale-20210524105345’

Are you sure you want to export cluster configuration? (yes/[no]): yes

This may take a few seconds, please wait a moment

Created export task ' PScale-20210524105345'
  1. To view the results of the export operation, use the following CLI command:
# isi cluster config exports view PScale-20210524105345

As displayed in the output below, the backup JSON files are located at /ifs/data/Isilon_Support/config_mgr/backup/PScale-20210524105345

     ID: PScale-20210524105345

 Status: Successful

   Done: ['nfs', 'smb']

 Failed: []

Pending: []

Message:

   Path: /ifs/data/Isilon_Support/config_mgr/backup/PScale-20210524105345
  1. The JSON files can be viewed under /ifs/data/Isilon_Support/config_mgr/backup/ PScale-20210524105345. OneFS will generate a separate configuration backup JSON file for each component (ie. SMB and NFS in this example):
# ls /ifs/data/Isilon_Support/config_mgr/backup/PScale-20210524105345 backup_readme.json              nfs_PScale-20210524105345.json  smb_PScale-20210524105345.json
  1. Delete all the SMB shares and NFS exports using the following commands:
# isi smb shares delete test

# isi smb shares delete test2

# isi nfs exports delete 9

# isi nfs exports delete 10
  1. Use the following CLI command to restore the SMB and NFS configuration:
# isi cluster config imports create PScale-20210524105345 --components=smb,nfs
  1. From the output below, the import job ID is ‘ PScale-20210524105345’
Are you sure you want to import cluster configuration? (yes/[no]): yes

This may take a few seconds, please wait a moment

Created import task ' PScale-20210524105345'
  1. To view the restore results, use the following command:
# isi cluster config imports view PScale-20210524105345

       ID: PScale-20210524110659

Export ID: PScale-20210524105345

   Status: Successful

     Done: ['nfs', 'smb']

   Failed: []

  Pending: []

  Message:

     Path: /ifs/data/Isilon_Support/config_mgr/restore/ PScale-20210524110659
  1. Verify that the SMB shares and NFS exports are restored:
# isi smb shares list

Share Name  Path

----------------------

test        /ifs/test

test2       /ifs/test2

----------------------

Total: 2
# isi nfs exports list

ID   Zone   Paths      Description

-----------------------------------

11   System /ifs/test

12   System /ifs/test2

-----------------------------------

Total: 2

A WebUI management component for this feature will be included in a future release, as will the ability to run a diff, or comparison, between two exported configurations .