OneFS HDFS ACLs Support

The Hadoop Distributed File System (HDFS) permissions model for files and directories has much in common with the ubiquitous POSIX model. HDFS access control lists, or ACLs, are Apache’s implementation of POSIX.1e ACLs, and each file and directory is associated with an owner and a group. The file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users.

However, in addition to the traditional POSIX permissions model, HDFS also supports POSIX ACLs. ACLs are useful for implementing permission requirements that differ from the natural organizational hierarchy of users and groups. An ACL provides a way to set different permissions for specific named users or named groups, not only the file’s owner and the file’s group. HDFS ACLs also support extended ACL entries, which allow different permissions to be configured for multiple users and groups on the same HDFS directories and files.

Compared to regular OneFS ACLs, HDFS ACLs differ in both structure and access check algorithm. The most significant difference is that the OneFS ACL algorithm is cumulative: if a user requests read-write-execute access, it can be granted by three different OneFS ACEs (access control entries). In contrast, HDFS ACLs have a strict, predefined ordering of ACEs for the evaluation of permissions, and there is no accumulation of permission bits. As such, OneFS uses a translation algorithm to bridge this gap. For instance, here are some example mappings between HDFS ACEs and OneFS ACEs:

HDFS ACE          OneFS ACEs
user::rw-         allow <owner-sid> std-read std-write
                  deny <owner-sid> std-execute
user:sheila:rw-   allow <sheilas-sid> std-read std-write
                  deny <sheilas-sid> std-execute
group::r--        allow <group-sid> std-read
                  deny <group-sid> std-write std-execute
mask::rw-         posix-mask <everyone-sid> std-read std-write
                  deny <owner-sid> std-execute
other::--x        allow <everyone-sid> std-execute
                  deny <everyone-sid> std-read std-write

The HDFS Mask ACEs (derived from POSIX.1e) are special access control entries which apply to every named user and group, and represent the maximum permissions that a named user or any group can have for the file. They were introduced essentially to extend traditional read-write-execute (RWX) mode bits to support the ACLs model. OneFS translates these Mask ACEs to a ‘posix-mask’ ACE, and the access checking algorithm in the kernel applies the mask permissions to all appropriate trustees.
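To make the mask semantics concrete, here is a minimal Python sketch. This is an illustration of POSIX.1e mask behavior as described above, not OneFS source code: the mask caps the effective permissions of named users, named groups, and the owning group, while the owner and 'other' entries are unaffected.

```python
# Hypothetical sketch of POSIX.1e mask semantics (not OneFS kernel code).

R, W, X = 4, 2, 1  # read, write, execute permission bits

def effective_perms(entry_type, perms, mask):
    """Return the permissions actually granted after the mask is applied."""
    if entry_type in ("named_user", "named_group", "owning_group"):
        return perms & mask          # the mask caps these entries
    return perms                     # owner and 'other' bypass the mask

# mask::rw- caps a named user's rwx down to rw-
assert effective_perms("named_user", R | W | X, R | W) == R | W
# the file owner keeps rwx regardless of the mask
assert effective_perms("owner", R | W | X, R | W) == R | W | X
```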

On translation of an HDFS ACL -> OneFS ACL -> HDFS ACL, OneFS is guaranteed to return the same ACL. However, translating a OneFS ACL -> HDFS ACL -> OneFS ACL can be unpredictable: OneFS ACLs are richer than HDFS ACLs, so information can be lost when translating to HDFS ACLs with multiple named groups, where a trustee is a member of several of those groups. For example, if a user has RWX in one group, RW in another, and R in a third, the result is as expected and RWX access is granted. However, if a user has W in one group, X in another, and R in a third, in these rare cases the ACL translation algorithm prioritizes security and produces a more restrictive read-only ACL.
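The difference between the two access-check algorithms can be sketched in a few lines of Python. These are assumed semantics for illustration only; the actual OneFS translation algorithm is more involved.

```python
# Illustrative sketch (not product code): OneFS accumulates permission bits
# across every matching ACE, while an HDFS-style check requires a single
# matching entry to grant everything requested, with no accumulation.

R, W, X = 4, 2, 1

def cumulative_check(requested, group_perms):
    """OneFS-style: union the bits granted by every group the user is in."""
    granted = 0
    for perms in group_perms:
        granted |= perms
    return requested & ~granted == 0

def strict_check(requested, group_perms):
    """HDFS-style: one entry alone must grant all of the requested bits."""
    return any(requested & ~perms == 0 for perms in group_perms)

# A member of groups granting rwx, rw-, and r-- gets rwx either way:
assert cumulative_check(R | W | X, [R | W | X, R | W, R])
assert strict_check(R | W | X, [R | W | X, R | W, R])

# But w--, --x, r-- only adds up to rwx when bits accumulate, which is why
# the translation algorithm falls back to a more restrictive read-only ACL:
assert cumulative_check(R | W | X, [W, X, R])
assert not strict_check(R | W | X, [W, X, R])
```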

Here’s the full set of ACE mappings between HDFS and OneFS internals:

HDFS ACE permission  Apply to   OneFS internal ACE permissions
rwx                  Directory  allow dir_gen_read, dir_gen_write, dir_gen_execute, delete_child
                                deny (none)
                     File       allow file_gen_read, file_gen_write, file_gen_execute
                                deny (none)
rw-                  Directory  allow dir_gen_read, dir_gen_write, delete_child
                                deny traverse
                     File       allow file_gen_read, file_gen_write
                                deny execute
r-x                  Directory  allow dir_gen_read, dir_gen_execute
                                deny add_file, add_subdir, dir_write_ext_attr, delete_child, dir_write_attr
                     File       allow file_gen_read, file_gen_execute
                                deny file_write, append, file_write_ext_attr, file_write_attr
r--                  Directory  allow dir_gen_read
                                deny add_file, add_subdir, dir_write_ext_attr, traverse, delete_child, dir_write_attr
                     File       allow file_gen_read
                                deny file_write, append, file_write_ext_attr, execute, file_write_attr
-wx                  Directory  allow dir_gen_write, dir_gen_execute, delete_child, dir_read_attr
                                deny list, dir_read_ext_attr
                     File       allow file_gen_write, file_gen_execute, file_read_attr
                                deny file_read, file_read_ext_attr
-w-                  Directory  allow dir_gen_write, delete_child, dir_read_attr
                                deny list, dir_read_ext_attr, traverse
                     File       allow file_gen_write, file_read_attr
                                deny file_read, file_read_ext_attr, execute
--x                  Directory  allow dir_gen_execute, dir_read_attr
                                deny list, add_file, add_subdir, dir_read_ext_attr, dir_write_ext_attr, delete_child
                     File       allow file_gen_execute, file_read_attr
                                deny file_read, file_write, append, file_read_ext_attr, file_write_ext_attr, file_write_attr
---                  Directory  allow std_read_dac, std_synchronize, dir_read_attr
                                deny list, add_file, add_subdir, dir_read_ext_attr, dir_write_ext_attr, traverse, delete_child, dir_write_attr
                     File       allow std_read_dac, std_synchronize, file_read_attr
                                deny file_read, file_write, append, file_read_ext_attr, file_write_ext_attr, execute, file_write_attr
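For scripting or quick reference, a few of the file rows above can be encoded as a simple lookup table. This is just an illustrative sketch; the table above remains the authoritative mapping.

```python
# Hedged sketch: a lookup table for a handful of the file-permission rows
# above (permission names are taken directly from the mapping table).

FILE_ACE_MAP = {
    "rwx": (["file_gen_read", "file_gen_write", "file_gen_execute"], []),
    "rw-": (["file_gen_read", "file_gen_write"], ["execute"]),
    "r--": (["file_gen_read"],
            ["file_write", "append", "file_write_ext_attr",
             "execute", "file_write_attr"]),
}

def translate_file_perm(bits):
    """Return the (allow, deny) OneFS ACE permission lists for an HDFS triplet."""
    allow, deny = FILE_ACE_MAP[bits]
    return allow, deny

allow, deny = translate_file_perm("rw-")
assert allow == ["file_gen_read", "file_gen_write"]
assert deny == ["execute"]
```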

Enabling HDFS ACLs is typically performed from the OneFS CLI, and can be configured at a per access zone granularity. For example:

# isi hdfs settings modify --zone=System --hdfs-acl-enabled=true

# isi hdfs settings --zone=System

                  Service: Yes

       Default Block Size: 128M

    Default Checksum Type: none

      Authentication Mode: simple_only

           Root Directory: /ifs/data

          WebHDFS Enabled: Yes

            Ambari Server:

          Ambari Namenode:

              ODP Version:

     Data Transfer Cipher: none

 Ambari Metrics Collector:

         HDFS ACL Enabled: Yes

Hadoop Version 3 Or Later: Yes

Note that the ‘hadoop-version-3-or-later’ parameter should be set to ‘false’ if the HDFS clients are running Hadoop 2 or earlier. For example:

# isi hdfs settings modify --zone=System --hadoop-version-3-or-later=false

# isi hdfs settings --zone=System

                  Service: Yes

       Default Block Size: 128M

    Default Checksum Type: none

      Authentication Mode: simple_only

           Root Directory: /ifs/data

          WebHDFS Enabled: Yes

            Ambari Server:

          Ambari Namenode:

              ODP Version:

     Data Transfer Cipher: none

 Ambari Metrics Collector:

         HDFS ACL Enabled: Yes

Hadoop Version 3 Or Later: No

Other useful ACL configuration options include:

# isi auth settings acls modify --calcmode-group=group_only

Specifies how to approximate group mode bits. Options include:

Option Description
group_access Approximates group mode bits using all possible group ACEs. This causes the group permissions to appear more permissive than the actual permissions on the file.
group_only Approximates group mode bits using only the ACE with the owner ID. This causes the group permissions to appear more accurate, in that you see only the permissions for a particular group and not the more permissive set. Be aware that this setting may cause access-denied problems for NFS clients, however.

 

# isi auth settings acls modify --calcmode-traverse=require

Specifies whether or not traverse rights are required in order to traverse directories with existing ACLs.

# isi auth settings acls modify --group-owner-inheritance=parent

Specifies how to handle inheritance of group ownership and permissions. If you enable a setting that causes the group owner to be inherited from the creator’s primary group, you can override it on a per-folder basis by running the chmod command to set the set-gid bit. This inheritance applies only when the file is created. The following options are available:

Option Description
native Specifies that if an ACL exists on a file, the group owner will be inherited from the file creator’s primary group. If there is no ACL, the group owner is inherited from the parent folder.
parent Specifies that the group owner be inherited from the file’s parent folder.
creator Specifies that the group owner be inherited from the file creator’s primary group.

Once configured, any of these settings can then be verified with the following CLI syntax:

# isi auth settings acls view

Standard Settings

      Create Over SMB: allow

                Chmod: merge

    Chmod Inheritable: no

                Chown: owner_group_and_acl

               Access: windows

Advanced Settings

                        Rwx: retain

    Group Owner Inheritance: parent

                  Chmod 007: default

             Calcmode Owner: owner_aces

             Calcmode Group: group_only

           Synthetic Denies: remove

                     Utimes: only_owner

                   DOS Attr: deny_smb

                   Calcmode: approx

          Calcmode Traverse: require

When it comes to troubleshooting HDFS ACLs, the /var/log/hdfs.log file can be invaluable. Setting the HDFS log level to ‘debug’ will generate log entries detailing ACL and ACE parsing and configuration. This can easily be accomplished with the following CLI command:

# isi hdfs log-level modify --set debug


A file’s detailed security descriptor configuration can be viewed from OneFS using the ‘ls -led’ command. Specifically, the ‘-e’ argument in this command will print all the ACLs and associated ACEs. For example:

# ls -led /ifs/hdfs/file1

-rwxrwxrwx +     1 yarn  hadoop  0 Jan 26 22:38 file1

OWNER: user:yarn

GROUP: group:hadoop

0: everyone posix_mask file_gen_read,file_gen_write,file_gen_execute

1: user:yarn allow file_gen_read,file_gen_write,std_write_dac

2: group:hadoop allow std_read_dac,std_synchronize,file_read_attr

3: group:hadoop deny file_read,file_write,append,file_read_ext_attr, file_write_ext_attr,execute,file_write_attr

The access rights to a file for a specific user can also be viewed using the ‘isi auth access’ CLI command as follows:

# isi auth access --user=admin /ifs/hdfs/file1

               User

                 Name: admin

                  UID: 10

                  SID: SID:S-1-22-1-10


               File

                  Owner

                 Name: yarn

                   ID: UID:5001

                  Group

                 Name: hadoop

                   ID: GID:5000

       Effective Path: /ifs/hdfs/file1

     File Permissions: file_gen_read

        Relevant Aces: group:admin allow file_gen_read

                              group:admin deny file_write, append,file_write_ext_attr,execute,file_write_attr

                              Everyone allow file_gen_read,file_gen_write,file_gen_execute

        Snapshot Path: No

          Delete Child: The parent directory allows delete_child for this user, so the user may delete the file.

When using SyncIQ or NDMP with HDFS ACLs, be aware that replicating or restoring data from a OneFS 9.3 cluster with HDFS ACLs enabled to a target cluster running an earlier version of OneFS can result in loss of ACL detail. Specifically, the new ‘mask’ ACE type cannot be replicated or restored on target clusters running prior releases, since it was only introduced in OneFS 9.3. Instead, OneFS will generate two versions of the ACL, with and without the ‘mask’ ACE, which maintain the same level of security.

OneFS NFS Performance Resource Monitoring

Another feature of the recent OneFS 9.3 release is enhanced performance resource monitoring for the NFS protocol, adding path tracking for NFS and bringing it up to parity with the SMB protocol resource monitoring.

But first, a quick refresher. The performance resource monitoring framework enables OneFS to track and report the use of transient system resources (i.e. resources that only exist at a given instant), providing insight into who is consuming what resources, and how much of them. Examples include CPU time, network bandwidth, IOPS, disk accesses, and cache hits (but not currently disk space or memory usage). OneFS performance resource monitoring initially debuted in OneFS 8.0.1 and is an ongoing project, which will ultimately provide both insights and control. This will allow prioritization of work flowing through the system, prioritization and protection of mission-critical workflows, and the ability to detect if a cluster is at capacity.

Since identification of work is highly subjective, OneFS performance resource monitoring provides significant configuration flexibility, allowing cluster admins to define exactly how they wish to track workloads. For example, an administrator might want to partition their work based on criteria such as which user is accessing the cluster, the export/share they are using, and which IP address they’re coming from, often a combination of all three.

So why not just track it all, you may ask? It would simply generate too much data (potentially requiring a cluster just to monitor your cluster!).

OneFS has always provided client and protocol statistics; however, they were typically front-end only. Similarly, OneFS provides CPU, cache, and disk statistics, but they did not show who was consuming them. Partitioned performance unites these two realms, tracking the usage of CPU, drives, and caches, and spanning the initiator/participant barrier.

Under the hood, OneFS collects the resources consumed, grouped into distinct workloads, and the aggregation of these workloads comprise a performance dataset.

Workload: a set of identification metrics and the resources used. For example, {username:nick, zone_name:System} consumed {cpu:1.5s, bytes_in:100K, bytes_out:50M, …}.

Performance Dataset: the set of identification metrics to aggregate workloads by, together with the list of collected workloads matching that specification. For example, {usernames, zone_names}.

Filter: a method for including only workloads that match specific identification metrics. For example, a filter of {zone_name:System} matches the workloads {username:nick, zone_name:System} and {username:jane, zone_name:System}, but not {username:nick, zone_name:Perf}.
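These concepts can be modeled in a few lines of Python. This is a conceptual sketch using hypothetical sample data, not the OneFS implementation:

```python
# Conceptual model (assumptions, not OneFS internals): a dataset aggregates
# raw samples into workloads keyed by its identification metrics, and a
# filter keeps only workloads whose metrics match.
from collections import defaultdict

def aggregate(samples, metrics):
    """Group samples by the dataset's identification metrics, summing CPU."""
    workloads = defaultdict(float)
    for s in samples:
        key = tuple((m, s[m]) for m in metrics)
        workloads[key] += s["cpu"]
    return dict(workloads)

def apply_filter(workloads, flt):
    """Keep workloads matching every metric:value pair in the filter."""
    return {k: v for k, v in workloads.items()
            if set(flt.items()) <= set(dict(k).items())}

samples = [
    {"username": "nick", "zone_name": "System", "cpu": 1.0},
    {"username": "nick", "zone_name": "System", "cpu": 0.5},
    {"username": "jane", "zone_name": "System", "cpu": 2.0},
    {"username": "nick", "zone_name": "Perf",   "cpu": 3.0},
]
ds = aggregate(samples, ["username", "zone_name"])
assert ds[(("username", "nick"), ("zone_name", "System"))] == 1.5
# The filter {zone_name:System} keeps both System workloads, drops Perf:
assert len(apply_filter(ds, {"zone_name": "System"})) == 2
```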

The following metrics are tracked:

Identification Metrics:
  Username / UID / SID
  Primary Groupname / GID / GSID
  Secondary Groupname / GID / GSID
  Zone Name
  Local/Remote IP Address/Range
  Path*
  Share / Export ID
  Protocol
  System Name*
  Job Type

Transient Resources:
  CPU Usage
  Bytes In/Out
  IOPS
  Disk Reads/Writes
  L2/L3 Cache Hits

Performance Statistics:
  Read/Write/Other Latency

Supported Protocols:
  NFS
  SMB
  S3
  Jobs
  Background Services

With the exception of the system dataset, performance datasets must be configured before statistics are collected. This is typically performed via the ‘isi performance’ CLI command set, but can also be done via the platform API:

 https://<node_ip>:8080/platform/performance

Once a performance dataset has been configured, it will continue to collect resource usage statistics every 30 seconds until it is deleted. These statistics can be viewed via the ‘isi statistics’ CLI interface.

This is as simple as, first, creating a dataset specifying which identification metrics you wish to partition work by:

# isi performance dataset create --name ds_test1 username protocol export_id share_name

Next, waiting 30 seconds for data to collect.

Finally, viewing the performance dataset:

# isi statistics workload --dataset ds_test1

    CPU  BytesIn  BytesOut   Ops  Reads  Writes   L2   L3  ReadLatency  WriteLatency  OtherLatency  UserName  Protocol  ExportId  ShareName  WorkloadType

---------------------------------------------------------------------------------------------------------------------------------------------------------

 11.0ms     2.8M     887.4   5.5    0.0   393.7  0.3  0.0      503.0us       638.8us         7.4ms       nick      nfs3         1          -             -

  1.2ms    10.0K     20.0M  56.0   40.0     0.0  9.0  0.0        0.0us         0.0us         0.0us      jane      nfs3         3          -             -

  1.0ms    18.3M      17.0  10.0    0.0    47.0  0.0  0.0        0.0us       100.2us         0.0us      jane      smb2         -       home             -

 31.4us     15.1      11.7   0.1    0.0     0.0  0.0  0.0      349.3us         0.0us         0.0us       nick      nfs4         4          -             -

166.3ms      0.0       0.0   0.0    0.0     0.1  0.0  0.0        0.0us         0.0us         0.0us         -         -         -          -      Excluded

 31.6ms      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -          -        System

 70.2us      0.0       0.0   0.0    0.0     3.3  0.1  0.0        0.0us         0.0us         0.0us         -         -         -          -       Unknown

  0.0us      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -          -    Additional

  0.0us      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -          - Overaccounted

---------------------------------------------------------------------------------------------------------------------------------------------------------

Total: 8

Be aware that OneFS can only harvest a limited number of workloads per dataset. As such, it keeps track of the highest resource consumers, which it considers top workloads, and outputs as many of those as possible, up to a limit of either 1024 top workloads or 1.5MB memory per sample (whichever is reached first).
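Conceptually, top-workload selection is just a bounded "heaviest consumers" cut. A simplified Python sketch (ignoring the 1.5MB memory cap) of that idea:

```python
# Sketch of the "top workloads" idea: keep only the heaviest resource
# consumers, up to a fixed count.
import heapq

def top_workloads(workloads, limit):
    """Return the `limit` workloads with the highest CPU usage."""
    return heapq.nlargest(limit, workloads, key=lambda w: w["cpu"])

workloads = [{"user": u, "cpu": c} for u, c in
             [("nick", 11.0), ("jane", 1.2), ("jim", 0.03), ("sam", 31.6)]]
top = top_workloads(workloads, 2)
assert [w["user"] for w in top] == ["sam", "nick"]
```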

When you’re done with a performance dataset, it can easily be removed with the ‘isi performance dataset delete <dataset_name>’ syntax. For example:

# isi performance dataset delete ds_test1

Are you sure you want to delete the performance dataset.? (yes/[no]): yes

Deleted performance dataset 'ds_test1' with ID number 1.

Performance resource monitoring includes the five aggregate workloads below for special cases, and OneFS can output up to 1024 additional pinned workloads. A cluster administrator can also configure up to four custom datasets, which can aggregate special-case workloads.

Additional: aggregate of workloads not in the ‘top’ workloads.

Excluded: work that does not match the dataset definition (e.g. missing a required metric such as export_id), or does not match any applied filter (filters are described below).

Overaccounted: total of work that has appeared in multiple workloads within the dataset. This can happen for datasets using path and/or group metrics.

System: background system/kernel work.

Unknown: work whose origin OneFS could not determine.

The ‘system’ dataset is created by default and cannot be deleted or renamed. It is also the only dataset that includes the OneFS services’ and job engine’s resource metrics:

System services: any process started by isi_mcp / isi_daemon, plus any protocol service.

Jobs: includes job ID, job type, and phase.

Non-protocol work only includes CPU, cache, and drive statistics, with no bandwidth usage, op counts, or latencies. Protocols (e.g. S3) that haven’t yet been fully integrated into the resource monitoring framework also get only these basic statistics, and only in the system dataset.

If no dataset is specified in an ‘isi statistics workload’ command, the ‘system’ dataset statistics are displayed by default:

# isi statistics workload

Performance workloads can also be ‘pinned’, allowing a workload to be tracked even when it’s not a ‘top’ workload, and regardless of how many resources it consumes. This can be configured with the following CLI syntax:

# isi performance workloads pin <dataset_name/id> <metric>:<value>

Note that all metrics for the dataset must be specified. For example:

# isi performance workloads pin ds_test1 username:jane protocol:nfs3 export_id:3

# isi statistics workload --dataset ds_test1

    CPU  BytesIn  BytesOut   Ops  Reads  Writes   L2   L3  ReadLatency  WriteLatency  OtherLatency  UserName  Protocol  ExportId   WorkloadType

-----------------------------------------------------------------------------------------------------------------------------------------------

 11.0ms     2.8M     887.4   5.5    0.0   393.7  0.3  0.0      503.0us       638.8us         7.4ms       nick      nfs3         1              -

  1.2ms    10.0K     20.0M  56.0   40.0     0.0  0.0  0.0        0.0us         0.0us         0.0us      jane      nfs3         3         Pinned <- Always Visible

 31.4us     15.1      11.7   0.1    0.0     0.0  0.0  0.0      349.3us         0.0us         0.0us       jim      nfs4         4              -    Workload

166.3ms      0.0       0.0   0.0    0.0     0.1  0.0  0.0        0.0us         0.0us         0.0us         -         -         -       Excluded

 31.6ms      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -         System

 70.2us      0.0       0.0   0.0    0.0     3.3  0.1  0.0        0.0us         0.0us         0.0us         -         -         -        Unknown

  0.0us      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -     Additional <- Unpinned workloads

  0.0us      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -  Overaccounted    that didn’t make it

-----------------------------------------------------------------------------------------------------------------------------------------------

Total: 8

Workload filters can also be configured to restrict output to just those workloads which match specified criteria. This allows for more finely tuned tracking, which can be invaluable when dealing with a large number of workloads. Configuration is greedy, since a workload will be included if it matches any applied filter. Filtering can be implemented with the following CLI syntax:

# isi performance dataset create <all_metrics> --filters <filtered_metrics>

# isi performance filters apply <dataset_name/id> <metric>:<value>

For example:

# isi performance dataset create --name ds_test1 username protocol export_id --filters username,protocol

# isi performance filters apply ds_test1 username:nick protocol:nfs3

# isi statistics workload --dataset ds_test1

    CPU  BytesIn  BytesOut   Ops  Reads  Writes   L2   L3  ReadLatency  WriteLatency  OtherLatency  UserName  Protocol  ExportId   WorkloadType

-----------------------------------------------------------------------------------------------------------------------------------------------

 11.0ms     2.8M     887.4   5.5    0.0   393.7  0.3  0.0      503.0us       638.8us         7.4ms       nick      nfs3         1              - <- Matches Filter

 13.0ms     1.4M     600.4   2.5    0.0   200.7  0.0  0.0      405.0us       638.8us         8.2ms       nick      nfs3         7              - <- Matches Filter

167.5ms    10.0K     20.0M  56.1   40.0     0.1  0.0  0.0      349.3us         0.0us         0.0us         -         -         -       Excluded <- Aggregate of not

 31.6ms      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -         System    matching filter

 70.2us      0.0       0.0   0.0    0.0     3.3  0.1  0.0        0.0us         0.0us         0.0us         -         -         -        Unknown

  0.0us      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -     Additional

  0.0us      0.0       0.0   0.0    0.0     0.0  0.0  0.0        0.0us         0.0us         0.0us         -         -         -  Overaccounted

-----------------------------------------------------------------------------------------------------------------------------------------------

Total: 8

Be aware that the metric to filter on must be specified when creating the dataset. A dataset with a filtered metric, but no filters applied, will output an empty dataset.

As mentioned previously, NFS path tracking is the principal enhancement to OneFS 9.3 performance resource monitoring, and it can easily be enabled as follows (for both the NFS and SMB protocols):

# isi performance dataset create --name ds_test1 username path --filters=username,path

The statistics can then be viewed with the following CLI syntax (in this case for dataset ‘ds_test1’ with ID 1):

# isi statistics workload list --dataset 1

When it comes to path tracking, there are a couple of caveats and restrictions to consider. Firstly, in OneFS 9.3 it is available for the NFS and SMB protocols only. Also, path tracking is expensive, and OneFS cannot track every single path. The desired paths must be listed first, either by pinning a workload or by applying a path filter.

If the resource overhead of path tracking is deemed too costly, consider a possible equivalent alternative such as tracking by NFS Export ID or SMB Share, as appropriate.

Users can have thousands of secondary groups, often simply too many to track. Only the primary group is tracked by default, so any secondary groups to be tracked must be specified explicitly, either by applying a group filter or by pinning a workload.

Note that some metrics only apply to certain protocols. For example, ‘export_id’ only applies to NFS, and ‘share_name’ only applies to SMB. Also note that there is no equivalent metric (or path tracking) implemented for the S3 object protocol yet; this will be addressed in a future release.

These metrics can be used individually in a dataset or combined. For example, a dataset configured with both the ‘export_id’ and ‘share_name’ metrics will list both NFS and SMB workloads with either ‘export_id’ or ‘share_name’. Likewise, a dataset with only the ‘share_name’ metric will only list SMB workloads, and a dataset with just the ‘export_id’ metric will only list NFS workloads, excluding any SMB workloads. Any excluded workload is aggregated into the special ‘Excluded’ workload.

When viewing collected metrics, the ‘isi statistics workload’ CLI command displays only the last sample period in table format with the statistics normalized to a per-second granularity:

# isi statistics workload [--dataset <dataset_name/id>]

Names are resolved wherever possible, such as UID to username, IP address to hostname, etc., and lookup failures are reported via an extra ‘error’ column. Alternatively, the ‘--numeric’ flag can be included to prevent any lookups:

# isi statistics workload [--dataset <dataset_name/id>] --numeric

Aggregated cluster statistics are reported by default, but adding the ‘--nodes’ argument will provide per-node statistics as viewed from each initiator. More specifically, the ‘--nodes=0’ flag will display the local node only, whereas ‘--nodes=all’ will report all nodes.

The ‘isi statistics workload’ command also includes standard statistics output formatting flags, such as ‘--sort’, ‘--totalby’, etc.:

    --sort (CPU | (BytesIn|bytes_in) | (BytesOut|bytes_out) | Ops | Reads |

      Writes | L2 | L3 | (ReadLatency|latency_read) |

      (WriteLatency|latency_write) | (OtherLatency|latency_other) | Node |

      (UserId|user_id) | (UserSId|user_sid) | UserName | (Proto|protocol) |

      (ShareName|share_name) | (JobType|job_type) | (RemoteAddr|remote_address)

      | (RemoteName|remote_name) | (GroupId|group_id) | (GroupSId|group_sid) |

      GroupName | (DomainId|domain_id) | Path | (ZoneId|zone_id) |

      (ZoneName|zone_name) | (ExportId|export_id) | (SystemName|system_name) |

      (LocalAddr|local_address) | (LocalName|local_name) |

      (WorkloadType|workload_type) | Error)

        Sort data by the specified comma-separated field(s). Prepend 'asc:' or

        'desc:' to a field to change the sort order.







    --totalby (Node | (UserId|user_id) | (UserSId|user_sid) | UserName |

      (Proto|protocol) | (ShareName|share_name) | (JobType|job_type) |

      (RemoteAddr|remote_address) | (RemoteName|remote_name) |

      (GroupId|group_id) | (GroupSId|group_sid) | GroupName |

      (DomainId|domain_id) | Path | (ZoneId|zone_id) | (ZoneName|zone_name) |

      (ExportId|export_id) | (SystemName|system_name) |

      (LocalAddr|local_address) | (LocalName|local_name))

        Aggregate per specified fields(s).

In addition to the CLI, the same output can be accessed via the platform API:

https://<node_ip>:8080/platform/statistics/summary/workload

Raw metrics can also be viewed using the ‘isi statistics query’ command:

# isi statistics query <current|history> --format=json --keys=<node|cluster>.performance.dataset.<dataset_id>

Options are either ‘current’, which provides the most recent sample, or ‘history’, which reports samples collected over the last 5 minutes. Names are not looked up, and statistics are not normalized; the sample period is included in the output. The command syntax must also include the ‘--format=json’ flag, since no other output formats are currently supported. Additionally, there are two distinct types of keys:

Key Type Description
cluster.performance.dataset.<dataset_id> Cluster-wide aggregated statistics
node.performance.dataset.<dataset_id> Per node statistics from the initiator perspective
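Since ‘history’ samples are raw counters with timestamps, normalizing them to per-second rates is straightforward arithmetic. A sketch of that calculation (field names here are hypothetical, not the actual JSON schema):

```python
# Assumed arithmetic for normalizing raw samples: a per-second rate is the
# counter delta divided by the elapsed time between consecutive samples.

def per_second(prev, curr):
    """Normalize raw counters from two consecutive samples to per-second rates."""
    dt = curr["time"] - prev["time"]
    return {k: (curr[k] - prev[k]) / dt for k in curr if k != "time"}

prev = {"time": 0,  "bytes_in": 0,         "ops": 0}
curr = {"time": 30, "bytes_in": 3_000_000, "ops": 150}
rates = per_second(prev, curr)
assert rates["bytes_in"] == 100_000.0  # 100 KB/s over a 30s sample period
assert rates["ops"] == 5.0
```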

Similarly, these raw metrics can also be obtained via the platform API as follows:

https://<node_ip>:8080/platform/statistics/<current|history>?keys=node.performance.dataset.0

OneFS Writable Snapshots Coexistence and Caveats

In the final article in this series, we’ll take a look at how writable snapshots co-exist in OneFS, and their integration and compatibility with the various OneFS data services.

Starting with OneFS itself, support for writable snaps is introduced in OneFS 9.3, and the functionality is enabled after committing an upgrade to OneFS 9.3. Non-disruptive upgrade to OneFS 9.3 and to later releases is fully supported. However, as we’ve seen over this series of articles, writable snaps in 9.3 do have several idiosyncrasies, caveats, and recommended practices. These include observing the default OneFS limit of 30 active writable snapshots per cluster (or at least not attempting to delete more than 30 writable snapshots at any one time if the max_active_wsnaps limit has been increased for some reason).

There are also certain restrictions governing where a writable snapshot’s mount point can reside in the file system: it cannot be placed at an existing directory, below a source snapshot path, or under a SmartLock or SyncIQ domain. Also, while the contents of a writable snapshot will retain the permissions they had in the source, ensure the parent directory tree has appropriate access permissions for the users of the writable snapshot.

The OneFS job engine and restriping jobs also support writable snaps and, in general, most jobs can be run from inside a writable snapshot’s path. However, be aware that jobs involving tree-walks will not perform copy-on-read for LINs under writable snapshots.

The PermissionsRepair job is unable to fix files under a writable snapshot which have yet to be copied on read. To prevent this, prior to starting a PermissionsRepair job, the `find` CLI command (which searches for files in a directory hierarchy) can be run on the writable snapshot’s root directory in order to populate the writable snapshot’s namespace.

The TreeDelete job works for subdirectories under a writable snapshot. However, TreeDelete, when run on or above a writable snapshot, will not remove the root, or head, directory of the writable snapshot (unless scheduled through the writable snapshot library).

The ChangeList, FileSystemAnalyze, and IndexUpdate jobs are unable to see files in a writable snapshot. As such, the FilePolicy job, which relies on IndexUpdate, cannot manage files in a writable snapshot.

Writable snapshots also work as expected with OneFS access zones. For example, a writable snapshot can be created in a different access zone than its source snapshot:

# isi zone zones list

Name     Path

------------------------

System   /ifs

zone1    /ifs/data/zone1

zone2    /ifs/data/zone2

------------------------

Total: 2

# isi snapshot snapshots list

ID     Name     Path

---------------------------------

118224 s118224  /ifs/data/zone1

---------------------------------

Total: 1

# isi snapshot writable create s118224 /ifs/data/zone2/wsnap1

# isi snapshot writable list

Path                   Src Path          Src Snapshot

------------------------------------------------------

/ifs/data/zone2/wsnap1 /ifs/data/zone1   s118224

------------------------------------------------------

Total: 1

Writable snaps are supported on any cluster architecture that’s running OneFS 9.3, including clusters using data encryption with SED drives, which are fully compatible with writable snaps. Similarly, InsightIQ and DataIQ both support and accurately report on writable snapshots.

Writable snaps are also compatible with SmartQuotas and use directory quota capacity reporting to track both physical and logical space utilization. This can be viewed using the `isi quota quotas list/view` CLI commands, in addition to the ‘isi snapshot writable view’ command.

Regarding data tiering, writable snaps co-exist with SmartPools, and configuring SmartPools above writable snapshots is supported. However, in OneFS 9.3, SmartPools filepool tiering policies will not apply to a writable snapshot path; instead, the writable snapshot data follows the tiering policies which apply to the source of the writable snapshot. Since SmartPools is frequently used to house snapshots on a lower-performance, capacity-optimized tier of storage, the performance of a writable snap whose source snapshot is housed on a slower pool will likely be negatively impacted. Finally, be aware that CloudPools is incompatible with writable snaps in OneFS 9.3, and CloudPools on a writable snapshot destination is currently not supported.

On the data immutability front, a SmartLock WORM domain cannot be created at or above a writable snapshot under OneFS 9.3. Attempts will fail with the following message:

# isi snapshot writable list

Path                  Src Path          Src Snapshot

-----------------------------------------------------

/ifs/test/rw-head     /ifs/test/head1   s159776

-----------------------------------------------------

Total: 1

# isi worm domain create -d forever /ifs/test/rw-head/worm

Are you sure? (yes/[no]): yes

Failed to enable SmartLock: Operation not supported


Creating a writable snapshot inside a directory with a WORM domain is also not permitted.

# isi worm domains list

ID      Path           Type

---------------------------------

2228992 /ifs/test/worm enterprise

---------------------------------

Total: 1

# isi snapshot writable create s32106 /ifs/test/worm/wsnap

Writable Snapshot cannot be nested under WORM domain 22.0300: Operation not supported

Regarding writable snaps and data reduction and storage efficiency, the story in OneFS 9.3 is as follows. OneFS in-line compression works with writable snapshot data, but in-line deduplication is not supported, and existing files under writable snapshots will be ignored by in-line dedupe. However, in-line dedupe can occur on any new files created fresh on the writable snapshot.

Post-process deduplication of writable snapshot data is not supported and the SmartDedupe job will ignore the files under writable snapshots.

Similarly, at the per-file level, attempts to clone data within a writable snapshot (cp -c) are also not permitted and will fail with the following error:

# isi snapshot writable list

Path                  Src Path          Src Snapshot

-----------------------------------------------------

/ifs/wsnap1           /ifs/test1        s32106

-----------------------------------------------------

Total: 1

# cp -c /ifs/wsnap1/file1 /ifs/wsnap1/file1.clone

cp: file1.clone: cannot clone from 1:83e1:002b::HEAD to 2:705c:0053: Invalid argument

Additionally, in a small file packing archive workload the files under a writable snapshot will be ignored by the OneFS small file storage efficiency (SFSE) process, and there is also currently no support for data inode inlining within a writable snapshot domain.

Turning attention to data availability and protection, there are also some writable snapshot caveats in OneFS 9.3 to bear in mind. Regarding SnapshotIQ:

Writable snaps cannot be created from a source snapshot of the /ifs root directory. They also cannot currently be locked or changed to read-only. However, the read-only source snapshot will be locked for the entire life cycle of a writable snapshot.

Writable snaps cannot be refreshed from a newer read-only source snapshot. However, a new writable snapshot can be created from a more current source snapshot in order to include subsequent updates to the replicated production dataset. Taking a read-only snapshot of a writable snap is also not permitted and will fail with the following error message:

# isi snapshot snapshots create /ifs/wsnap2

snapshot create failed: Operation not supported

Writable snapshots cannot be nested in the namespace under other writable snapshots, and such operations will return ENOTSUP.

Only IFS domain-based snapshots are permitted as the source of a writable snapshot. This means that snapshots taken on a cluster prior to OneFS 8.2 cannot be used as the source for a writable snapshot.

Snapshot aliases cannot be used as the source of a writable snapshot, even if using the alias target ID instead of the alias target name. The full name of the snapshot must be specified.

# isi snapshot snapshots view snapalias1

               ID: 134340

             Name: snapalias1

             Path: /ifs/test/rwsnap2

        Has Locks: Yes

         Schedule: -

  Alias Target ID: 106976

Alias Target Name: s106976

          Created: 2021-08-16T22:18:40

          Expires: -

             Size: 90.00k

     Shadow Bytes: 0.00

        % Reserve: 0.00%

     % Filesystem: 0.00%

            State: active

# isi snapshot writable create 134340 /ifs/testwsnap1

Source SnapID(134340) is an alias: Operation not supported

The creation of SnapRevert domain is not permitted at or above a writable snapshot. Similarly, the creation of a writable snapshot inside a directory with a SnapRevert domain is not supported. Such operations will return ENOTSUP.

Finally, the SnapshotDelete job has no interaction with writable snaps; the TreeDelete job handles writable snapshot deletion instead.

Regarding NDMP backups, since NDMP uses read-only snapshots for checkpointing, it is unable to back up writable snapshot data in OneFS 9.3.

Moving on to replication, SyncIQ is unable to copy or replicate the data within a writable snapshot in OneFS 9.3. More specifically:

Replication Condition Description
Writable snapshot as SyncIQ source Replication fails because snapshot creation on the source writable snapshot is not permitted.
Writable snapshot as SyncIQ target Replication job fails as snapshot creation on the target writable snapshot is not supported.
Writable snapshot one or more levels below in SyncIQ source Data under a writable snapshot will not get replicated to the target cluster; however, the rest of the source will get replicated as expected.
Writable snapshot one or more levels below in SyncIQ target If the state of a writable snapshot is ACTIVE, the writable snapshot root directory will not get deleted from the target, so replication will fail.

Attempts to replicate the files within a writable snapshot will fail with the following SyncIQ job error:

“SyncIQ failed to take a snapshot on source cluster. Snapshot initialization error: snapshot create failed. Operation not supported.”

Since SyncIQ does not allow its snapshots to be locked, OneFS cannot create writable snapshots based on SyncIQ-generated snapshots. This includes all read-only snapshots with an ‘SIQ-’ naming prefix. Any attempts to use snapshots with an ‘SIQ-’ prefix will fail with the following error:

# isi snapshot writable create SIQ-4b9c0e85e99e4bcfbcf2cf30a3381117-latest /ifs/rwsnap

Source SnapID(62356) is a SyncIQ related snapshot: Invalid argument

A common use case for writable snapshots is in disaster recovery testing. For DR purposes, an enterprise typically has two PowerScale clusters configured in a source/target SyncIQ replication relationship. Many organizations have a requirement to conduct periodic DR tests to verify the functionality of their processes and tools in the event of a business continuity interruption or disaster recovery event.

Given the writable snapshots compatibility with SyncIQ caveats described above, a writable snapshot of a production dataset replicated to a target DR cluster can be created as follows:

  1. On the source cluster, create a SyncIQ policy to replicate the source directory (/ifs/prod/head) to the target cluster:
# isi sync policies create --name=ro-head sync --source-rootpath=/ifs/prod/head --target-host=10.224.127.5 --targetpath=/ifs/test/ro-head

# isi sync policies list

Name  Path              Action  Enabled  Target

---------------------------------------------------

ro-head /ifs/prod/head  sync    Yes      10.224.127.5

---------------------------------------------------

Total: 1
  2. Run the SyncIQ policy to replicate the source directory to /ifs/test/ro-head on the target cluster:
# isi sync jobs start ro-head --source-snapshot s14


# isi sync jobs list

Policy Name  ID   State   Action  Duration

-------------------------------------------

ro-head        1    running run     22s

-------------------------------------------

Total: 1


# isi sync jobs view ro-head

Policy Name: ro-head

         ID: 1

      State: running

     Action: run

   Duration: 47s

 Start Time: 2021-06-22T20:30:53


Target:
  3. Take a read-only snapshot of the replicated dataset on the target cluster:
# isi snapshot snapshots create /ifs/test/ro-head

# isi snapshot snapshots list

ID   Name                                        Path

-----------------------------------------------------------------

2    SIQ_HAL_ro-head_2021-07-22_20-23-initial /ifs/test/ro-head

3    SIQ_HAL_ro-head                          /ifs/test/ro-head

5    SIQ_HAL_ro-head_2021-07-22_20-25         /ifs/test/ro-head

8    SIQ-Failover-ro-head-2021-07-22_20-26-17 /ifs/test/ro-head

9    s106976                                   /ifs/test/ro-head

-----------------------------------------------------------------
  4. Using the (non SIQ_*) snapshot of the replicated dataset above as the source, create a writable snapshot on the target cluster at /ifs/test/head:
# isi snapshot writable create s106976 /ifs/test/head
  5. Confirm the writable snapshot has been created on the target cluster:
# isi snapshot writable list

Path              Src Path         Src Snapshot

----------------------------------------------------------------------

/ifs/test/head  /ifs/test/ro-head  s106976

----------------------------------------------------------------------

Total: 1




# du -sh /ifs/test/ro-head

 21M    /ifs/test/ro-head
  6. Export and/or share the writable snapshot data under /ifs/test/head on the target cluster using the protocol(s) of choice. Mount the export or share on the client systems and perform DR testing and verification as appropriate.
  7. When DR testing is complete, delete the writable snapshot on the target cluster:
# isi snapshot writable delete /ifs/test/head

Note that writable snapshots cannot be refreshed from a newer read-only source snapshot. A new writable snapshot would need to be created using the newer snapshot source in order to reflect any subsequent updates to the production dataset on the target cluster.
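As a sketch of that refresh workflow on the target cluster, using the paths from the DR testing steps above (the snapshot name s107000 below is hypothetical):

```shell
# Delete the stale writable snapshot (cleanup runs asynchronously via TreeDelete).
isi snapshot writable delete /ifs/test/head

# Take a fresh read-only snapshot of the newly replicated dataset
# (s107000 is a hypothetical snapshot name).
isi snapshot snapshots create s107000 /ifs/test/ro-head

# Create a new writable snapshot from the newer source snapshot.
isi snapshot writable create s107000 /ifs/test/head
```

The new writable snapshot then reflects all changes replicated up to the point the newer read-only snapshot was taken.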

So there you have it: The introduction of writable snaps v1 in OneFS 9.3 delivers the much anticipated ability to create fast, simple, efficient copies of datasets by enabling a writable view of a regular snapshot, presented at a target directory, and accessible by clients across the full range of supported NAS protocols.

OneFS Writable Snapshots Management, Monitoring and Performance

When it comes to the monitoring and management of OneFS writable snaps, the ‘isi snapshot writable’ CLI syntax looks and feels similar to the regular, read-only snapshot utilities. The currently available writable snapshots on a cluster can be easily viewed from the CLI with the ‘isi snapshot writable list’ command. For example:

# isi snapshot writable list

Path              Src Path        Src Snapshot

----------------------------------------------

/ifs/test/wsnap1  /ifs/test/prod  prod1

/ifs/test/wsnap2  /ifs/test/snap2 s73736

----------------------------------------------

The properties of a particular writable snap, including both its logical and physical size, can be viewed using the ‘isi snapshot writable view’ CLI command:

# isi snapshot writable view /ifs/test/wsnap1

         Path: /ifs/test/wsnap1

     Src Path: /ifs/test/prod

 Src Snapshot: s73735

      Created: 2021-06-11T19:10:25

 Logical Size: 100.00

Physical Size: 32.00k

        State: active

The capacity resource accounting layer for writable snapshots is provided by OneFS SmartQuotas. Physical, logical, and application logical space usage is retrieved from a directory quota on the writable snapshot’s root and displayed via the CLI as follows:

# isi quota quotas list

Type      AppliesTo  Path             Snap  Hard  Soft  Adv  Used  Reduction  Efficiency

-----------------------------------------------------------------------------------------

directory DEFAULT    /ifs/test/wsnap1 No    -     -     -    76.00 -          0.00 : 1

-----------------------------------------------------------------------------------------

Or from the OneFS WebUI by navigating to File system > SmartQuotas > Quotas and usage:

For more detail, the ‘isi quota quotas view’ CLI command provides a thorough appraisal of a writable snapshot’s directory quota domain, including physical, logical, and storage efficiency metrics plus a file count. For example:

# isi quota quotas view /ifs/test/wsnap1 directory

                        Path: /ifs/test/wsnap1

                        Type: directory

                   Snapshots: No

                    Enforced: Yes

                   Container: No

                      Linked: No

                       Usage

                           Files: 10

         Physical(With Overhead): 32.00k

        FSPhysical(Deduplicated): 32.00k

         FSLogical(W/O Overhead): 76.00

        AppLogical(ApparentSize): 0.00

                   ShadowLogical: -

                    PhysicalData: 0.00

                      Protection: 0.00

     Reduction(Logical/Data): None : 1

Efficiency(Logical/Physical): 0.00 : 1

                        Over: -

               Thresholds On: fslogicalsize

              ReadyToEnforce: Yes

                  Thresholds

                   Hard Threshold: -

                    Hard Exceeded: No

               Hard Last Exceeded: -

                         Advisory: -

    Advisory Threshold Percentage: -

                Advisory Exceeded: No

           Advisory Last Exceeded: -

                   Soft Threshold: -

        Soft Threshold Percentage: -

                    Soft Exceeded: No

               Soft Last Exceeded: -

                       Soft Grace: -

This information is also available from the OneFS WebUI by navigating to File system > SmartQuotas > Generated reports archive > View report details:

Additionally, the ‘isi get’ CLI command can be used to inspect the efficiency of individual writable snapshot files. First, run the following command on the chosen file in the source snapshot path (in this case /ifs/test/prod).

In the example below, the source file, /ifs/test/prod/testfile1, is reported as 147 MB in size and occupying 18019 physical blocks:

# isi get -D /ifs/test/prod/testfile1.zip

POLICY   W   LEVEL PERFORMANCE COAL  ENCODING      FILE              IADDRS

default      16+2/2 concurrency on    UTF-8         testfile1.txt        <1,9,92028928:8192>

*************************************************

* IFS inode: [ 1,9,1054720:512 ]

*************************************************

*

*  Inode Version:      8

*  Dir Version:        2

*  Inode Revision:     145

*  Inode Mirror Count: 1

*  Recovered Flag:     0

*  Restripe State:     0

*  Link Count:         1

*  Size:               147451414

*  Mode:               0100700

*  Flags:              0x110000e0

*  SmartLinked:        False

*  Physical Blocks:    18019

However, when running the ‘isi get’ CLI command on the same file within the writable snapshot tree (/ifs/test/wsnap1/testfile1), the writable, space-efficient copy now only consumes 5 physical blocks, as compared with 18019 blocks in the original file:

# isi get -D /ifs/test/wsnap1/testfile1.zip

POLICY   W   LEVEL PERFORMANCE COAL  ENCODING      FILE              IADDRS

default      16+2/2 concurrency on    UTF-8         testfile1.txt        <1,9,92028928:8192>

*************************************************

* IFS inode: [ 1,9,1054720:512 ]

*************************************************

*

*  Inode Version:      8

*  Dir Version:        2

*  Inode Revision:     145

*  Inode Mirror Count: 1

*  Recovered Flag:     0

*  Restripe State:     0

*  Link Count:         1

*  Size:               147451414

*  Mode:               0100700

*  Flags:              0x110000e0

*  SmartLinked:        False

*  Physical Blocks:    5

 Writable snaps use the OneFS policy domain manager, or PDM, for domain membership checking and verification. For each writable snap, a ‘WSnap’ domain is created on the target directory. The ‘isi_pdm’ CLI utility can be used to report on the writable snapshot domain for a particular directory.

# isi_pdm -v domains list --patron Wsnap /ifs/test/wsnap1

Domain          Patron          Path

b.0700          WSnap           /ifs/test/wsnap1

Additional details of the backing domain can also be displayed with the following CLI syntax:

# isi_pdm -v domains read b.0700

('b.0700',):

{ version=1 state=ACTIVE ro store=(type=RO SNAPSHOT, ros_snapid=650, ros_root=5:23ec:0011, ros_lockid=1) }

Domain association does have some ramifications for writable snapshots in OneFS 9.3, and there are a couple of notable caveats. For example, files within the writable snapshot domain cannot be renamed to a location outside the writable snapshot, since this allows the file system to track files in a simple manner.

# mv /ifs/test/wsnap1/file1 /ifs/test

mv: rename file1 to /ifs/test/file1: Operation not permitted

 Also, the nesting of writable snaps is not permitted in OneFS 9.3, and an attempt to create a writable snapshot on a subdirectory under an existing writable snapshot will fail with the following CLI command warning output:

# isi snapshot writable create prod1 /ifs/test/wsnap1/wsnap1-2

Writable snapshot:/ifs/test/wsnap1 nested below another writable snapshot: Operation not supported

When a writable snap is created, any existing hard links and symbolic links (symlinks) that reference files within the snapshot’s namespace will continue to work as expected. However, existing hard links to a file outside the snapshot’s domain will disappear from the writable snap, and the link count will be adjusted accordingly.

Link Type Supported Details
Existing external hard link No Old external hard links will fail.
Existing internal hard link Yes Existing hard links within the snapshot domain will work as expected.
External hard link No New external hard links will fail.
New internal hard link Yes New hard links created within the snapshot domain will work as expected.
External symbolic link Yes External symbolic links will work as expected.
Internal symbolic link Yes Internal symbolic links will work as expected.

Be aware that any attempt to create a hard link to another file outside of the writable snapshot boundary will fail.

# ln /ifs/test/file1 /ifs/test/wsnap1/file1

ln: /ifs/test/wsnap1/file1: Operation not permitted

However, symbolic links will work as expected. OneFS hard link and symlink actions and expectations with writable snaps are summarized in the table above.

Writable snaps do not have a specific license, and their use is governed by the OneFS SnapshotIQ data service. As such, in addition to a general OneFS license, SnapshotIQ must be licensed across all the nodes in order to use writable snaps on a OneFS 9.3 PowerScale cluster. Additionally, the ‘ISI_PRIV_SNAPSHOT’ role-based administration privilege is required on any cluster administration account that will create and manage writable snapshots. For example:

# isi auth roles view SystemAdmin | grep -i snap

             ID: ISI_PRIV_SNAPSHOT

In general, writable snapshot file access is marginally less performant than access to the source, or head, files, since an additional level of indirection is required to reach the data blocks. This is particularly true for older source snapshots, where a lengthy read-chain can require considerable ‘ditto’ block resolution. This occurs when parts of a file no longer reside in the source snapshot, and the block tree of the inode on the snapshot does not point to a real data block; instead, it has a flag marking it as a ‘ditto’ block. A ditto block indicates that the data is the same as the next newer version of the file, so OneFS automatically looks ahead to find the more recent version of the block. If there are large numbers (hundreds or thousands) of snapshots of the same unchanged file, reading from the oldest snapshot can have a considerable impact on latency.

Performance Attribute Details
Large Directories Since a writable snap performs a copy-on-read to populate file metadata on first access, the initial access of a large directory (containing millions of files, for example) that tries to enumerate its contents will be relatively slow because the writable snapshot has to iteratively populate the metadata. This is applicable to namespace discovery operations such as ‘find’ and ‘ls’, unlinks and renames, plus other operations working on large directories. However, any subsequent access of the directory or its contents will be fast since file metadata will already be present and there will be no copy-on-read overhead.

The unlink_copy_batch and readdir_copy_batch parameters under the ‘efs.wsnap’ sysctl tree control the size of batched metadata copy operations. These parameters can be helpful for tuning the number of iterative metadata copy-on-reads for datasets containing large directories. However, these sysctls should only be modified under the direct supervision of Dell technical support.
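For reference, the current values can be inspected (read-only) from the cluster shell with the standard sysctl utility; the values returned are environment-specific and not asserted here:

```shell
# View (do not modify) the batch copy-on-read tunables.
sysctl efs.wsnap.unlink_copy_batch
sysctl efs.wsnap.readdir_copy_batch
```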

Writable snapshot metadata read/write Initial read and write operations will perform a copy-on-read and will therefore be marginally slower compared to the head. However, once the copy-on-read has been performed for the LINs, the performance of read/write operations will be nearly equivalent to head.
Writable snapshot data read/write In general, writable snapshot data reads and writes will be slightly slower compared to head.
Multiple writable snapshots of single source The performance of each subsequent writable snap created from the same source read-only snapshot will be the same as that of the first, up to the OneFS 9.3 default recommended limit of 30 writable snapshots in total. This is governed by the ‘max_active_wsnaps’ sysctl.

# sysctl efs.wsnap.max_active_wsnaps

efs.wsnap.max_active_wsnaps: 30

While the ‘max_active_wsnaps’ sysctl can be configured up to a maximum of 2048 writable snapshots per cluster, changing this sysctl from its default value of 30 is strongly discouraged in OneFS 9.3.

 

Writable snapshots and SmartPools tiering Since unmodified file data in a writable snap is read directly from the source snapshot, if the source is stored on a lower performance tier than the writable snapshot’s directory structure this will negatively impact the writable snapshot’s latency.
Storage Impact The storage capacity consumption of a writable snapshot is proportional to the number of write, truncate, or similar operations it receives, since only the blocks changed relative to its source snapshot are stored. The metadata overhead grows linearly as a result of copy-on-reads with each new writable snapshot that is created and accessed.
Snapshot Deletes Writable snapshot deletes are de-staged and performed out of band by the TreeDelete job. As such, the performance impact should be minimal, although the actual delete of the data is not instantaneous. Additionally, the TreeDelete job has a path to avoid copy-on-reading any files within a writable snap that have yet to be enumerated.

Be aware that, since writable snaps are highly space efficient, the savings are strictly in terms of file data. This means that metadata will be consumed in full for each file and directory in a snapshot. So, for large sizes and quantities of writable snapshots, inode consumption should be considered, especially for metadata read and metadata write SSD strategies.
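As a back-of-the-envelope sketch of that inode overhead (illustrative numbers only: thirty writable snapshots of a one-million-file dataset, 512-byte inodes as seen in the ‘isi get’ output earlier, and an assumed 2x inode mirroring):

```shell
# 30 writable snapshots x 1,000,000 files x 512-byte inodes x 2 mirrors,
# once every file has been copied-on-read: roughly 30 GB of metadata capacity.
echo $(( 30 * 1000000 * 512 * 2 ))
```

With a metadata-on-SSD strategy, that capacity lands on the SSD tier, which is why inode consumption deserves planning for large writable snapshot estates.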

In the next and final article in this series, we’ll examine writable snapshots in the context of the other OneFS data services.

OneFS Writable Snapshots

OneFS 9.3 introduces writable snapshots to the PowerScale data services portfolio, enabling the creation and management of space and time efficient, modifiable copies of regular OneFS snapshots, at a directory path within the /ifs namespace, and which can be accessed and edited via any of the cluster’s file and object protocols, including NFS, SMB, and S3.

The primary focus of writable snaps in 9.3 is disaster recovery testing, where they can be used to quickly clone production datasets, allowing DR procedures to be routinely tested on identical, thin copies of production data. This is of considerable benefit to the growing number of enterprises that are using Isolate & Test, or Bubble networks, where DR testing is conducted inside a replicated environment that closely mimics production.

Other writable snapshot use cases can include parallel processing workloads that span a server fleet, which can be configured to use multiple writable snaps of a single production data set, to accelerate time to outcomes and results. And writable snaps can also be used to build and deploy templates for near-identical environments, enabling highly predictable and scalable dev and test pipelines.

The OneFS writable snapshot architecture provides an overlay to a read-only source snapshot, allowing a cluster administrator to create a lightweight copy of a production dataset using a simple CLI command, and present and use it as a separate writable namespace.

In this scenario, a SnapshotIQ snapshot (snap_prod_1) is taken of the /ifs/prod directory. The read-only ‘snap_prod_1’ snapshot is then used as the backing for a writable snapshot created at /ifs/wsnap. This writable snapshot contains the same subdirectory and file structure as the original ‘prod’ directory, just without the added data capacity footprint.

Internally, OneFS 9.3 introduces a new protection group data structure, ‘PG_WSNAP’, which provides an overlay that allows unmodified file data to be read directly from the source snapshot, while storing only the changes in the writable snapshot tree.

In this example, a file (Head) comprises four data blocks, A through D. A read-only snapshot is taken of the directory containing the Head file. This file is then modified through a copy-on-write operation. As a result, the new Head data, B1, is written to block 102, and the original data block ‘B’ is copied to a new physical block (110). The snapshot pointer now references block 110 and the new location for the original data ‘B’, so the snapshot has its own copy of that block.

Next, a writable snapshot is created using the read-only snapshot as its source. This writable snapshot is then modified, so its updated version of block C is stored in its own protection group (PG_WSNAP). A client then issues a read request for the writable snapshot version of the file. This read request is directed, through the read-only snapshot, to the Head versions of blocks A and D, the read-only snapshot version for block B and the writable snapshot file’s own version of block C (C1 in block 120).

OneFS directory quotas provide the writable snapshots accounting and reporting infrastructure, allowing users to easily view the space utilization of a writable snapshot. Additionally, IFS domains are also used to bound and manage writable snapshot membership. In OneFS, a domain defines a set of behaviors for a collection of files under a specified directory tree. If a directory has a protection domain applied to it, that domain will also affect all of the files and subdirectories under that top-level directory.

When files within a newly created writable snapshot are first accessed, data is read from the source snapshot, populating the files’ metadata, in a process known as copy-on-read (CoR). Unmodified data is read from the source snapshot and any changes are stored in the writable snapshot’s namespace data structure (PG_WSNAP).

Since a new writable snapshot is not copy-on-read up front, its creation is extremely rapid. As files are subsequently accessed, they are enumerated and begin to consume metadata space.

On accessing a writable snapshot file for the first time, a read is triggered from the source snapshot and the file’s data is accessed directly from the read-only snapshot. At this point, the MD5 checksums for both the source file and writable snapshot file are identical. If, for example, the first block of file is overwritten, just that single block is written to the writable snapshot, and the remaining unmodified blocks are still read from the source snapshot. At this point, the source and writable snapshot files are now different, so their MD5 checksums will also differ.
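This can be illustrated from the cluster shell (hypothetical file paths; the `md5` utility is assumed available, as OneFS is FreeBSD-based):

```shell
# On first access, checksums of the source and writable snapshot copies match,
# since all data blocks are still read from the read-only source snapshot.
md5 /ifs/test/prod/file1 /ifs/test/wsnap1/file1

# Overwrite just the first 8 KiB block of the writable snapshot copy...
dd if=/dev/random of=/ifs/test/wsnap1/file1 bs=8192 count=1 conv=notrunc

# ...after which the checksums differ: only the changed block is stored
# in the writable snapshot's PG_WSNAP structure.
md5 /ifs/test/prod/file1 /ifs/test/wsnap1/file1
```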

Before writable snapshots can be created and managed on a cluster, the following prerequisites must be met:

  • The cluster is running OneFS 9.3 or later with the upgrade committed.
  • SnapshotIQ is licensed across the cluster.

Note that for replication environments using writable snapshots and SyncIQ, all target clusters must be running OneFS 9.3 or later, have SnapshotIQ licensed, and provide sufficient capacity for the full replicated dataset.

By default, up to thirty active writable snapshots can be created and managed on a cluster from either the OneFS command-line interface (CLI) or RESTful platform API.

On creation of a new writable snap, all files contained in the snapshot source, or HEAD, directory tree are instantly available for both reading and writing in the target namespace.

Once no longer required, a writable snapshot can be easily deleted via the CLI. Be aware that WebUI configuration of writable snapshots is not available as of OneFS 9.3. That said, a writable snapshot can easily be created from the CLI as follows:

The general syntax is ‘isi snapshot writable create <src-snap> <dst-path>’, where the source snapshot (src-snap) is an existing read-only snapshot (prod1 in the example below), and the destination path (dst-path) is a new directory within the /ifs namespace (/ifs/test/wsnap1). A read-only source snapshot can be generated as follows:

# isi snapshot snapshots create prod1 /ifs/test/prod

# isi snapshot snapshots list

ID     Name                             Path

-------------------------------------------------------

7142   prod1                     /ifs/test/prod

Next, the following command creates a writable snapshot in an ‘active’ state:

# isi snapshot writable create prod1 /ifs/test/wsnap1

Creating a writable snapshot places a lock on its backing read-only snapshot, which therefore cannot be deleted while the writable snapshot exists:

# isi snapshot snapshots delete -f prod1
Snapshot "prod1" can't be deleted because it is locked

While the OneFS CLI does not explicitly prevent the removal of a writable snapshot’s lock on its backing snapshot, it does provide a clear warning:

# isi snap lock view prod1 1
     ID: 1
Comment: Locked/Unlocked by Writable Snapshot(s), do not force delete lock.
Expires: 2106-02-07T06:28:15
  Count: 1

# isi snap lock delete prod1 1
Are you sure you want to delete snapshot lock 1 from s13590? (yes/[no]):
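The ‘Count’ field above suggests the lock behaves like a reference count that blocks deletion of the backing snapshot while any writable snapshot depends on it. The following is a simplified illustrative model of that behavior in Python (class and method names are hypothetical; this is not how OneFS implements snapshot locks):

```python
class SnapshotLockError(Exception):
    pass

class Snapshot:
    """Toy model of a read-only snapshot whose lock count mirrors the
    'Count' field shown above. Illustrative only, not OneFS internals."""

    def __init__(self, name):
        self.name = name
        self.lock_count = 0

    def add_writable(self):
        # Creating a writable snapshot takes a lock on its backing snapshot.
        self.lock_count += 1

    def remove_writable(self):
        # Deleting the writable snapshot releases the lock again.
        self.lock_count -= 1

    def delete(self):
        if self.lock_count > 0:
            raise SnapshotLockError(
                'Snapshot "%s" can\'t be deleted because it is locked' % self.name)
        return True

prod1 = Snapshot("prod1")
prod1.add_writable()            # writable snapshot created: lock taken

try:
    prod1.delete()              # blocked, mirroring the CLI error above
except SnapshotLockError as err:
    print(err)

prod1.remove_writable()         # writable snapshot deleted: lock released
assert prod1.delete()           # backing snapshot can now be deleted
```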

Be aware that a writable snapshot cannot be created on an existing directory. A new directory path must be specified in the CLI syntax, otherwise the command will fail with the following error:

# isi snapshot writable create prod1 /ifs/test/wsnap1
mkdir /ifs/test/wsnap1 failed: File exists

Similarly, if an unsupported path is specified, the following error will be returned:

# isi snapshot writable create prod1 /isf/test/wsnap2
Error in field(s): dst_path
Field: dst_path has error: The value: /isf/test/wsnap2 does not match the regular expression: ^/ifs$|^/ifs/ Input validation failed.
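The regular expression quoted in that error message can be exercised directly. A quick Python sketch (the helper name is invented for illustration) confirms it accepts only /ifs itself or paths under /ifs/:

```python
import re

# Pattern quoted in the dst_path validation error above.
DST_PATH_RE = re.compile(r"^/ifs$|^/ifs/")

def valid_dst_path(path):
    """True if the destination path lies within the /ifs namespace."""
    return DST_PATH_RE.match(path) is not None

assert valid_dst_path("/ifs")                    # the /ifs root itself
assert valid_dst_path("/ifs/test/wsnap2")        # a path under /ifs
assert not valid_dst_path("/isf/test/wsnap2")    # transposed 'isf' is rejected
```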

A writable snapshot also cannot be created from a source snapshot of the /ifs root directory, and will fail with the following error:

# isi snapshot writable create s1476 /ifs/test/ifs-wsnap
Cannot create writable snapshot from a /ifs snapshot: Operation not supported

Be aware that OneFS 9.3 does not currently provide support for scheduled or automated writable snapshot creation.

When it comes to deleting a writable snapshot, OneFS uses the job engine’s TreeDelete job under the hood to unlink all the contents. As such, running the ‘isi snapshot writable delete’ CLI command automatically queues a TreeDelete instance, which the job engine executes asynchronously in order to remove and clean up a writable snapshot’s namespace and contents. However, be aware that the TreeDelete job execution, and hence the data deletion, is not instantaneous. Instead, the writable snapshot’s directories and files are first moved under a temporary ‘*.deleted’ directory. For example:

# isi snapshot writable create prod1 /ifs/test/wsnap2

# isi snap writable delete /ifs/test/wsnap2
Are you sure? (yes/[no]): yes

# ls /ifs/test
prod                            wsnap2.51dc245eb.deleted
wsnap1

Next, this temporary directory is removed asynchronously. If the TreeDelete job fails for any reason, the writable snapshot can be deleted using its renamed path. For example:

# isi snap writable delete /ifs/test/wsnap2.51dc245eb.deleted
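This rename-then-reap pattern can be sketched in Python. The helper name, the use of a background thread, and the suffix format are illustrative assumptions standing in for the job engine’s TreeDelete job; the point is that the namespace entry disappears instantly while the actual data removal happens asynchronously:

```python
import os
import shutil
import tempfile
import threading
import uuid

def delete_writable_snapshot(path):
    """Sketch of the pattern described above: instantly rename the directory
    out of the way, then reap its contents asynchronously (OneFS uses the
    job engine's TreeDelete job for the reap step)."""
    tmp = "%s.%s.deleted" % (path, uuid.uuid4().hex[:9])
    os.rename(path, tmp)  # namespace entry vanishes immediately
    reaper = threading.Thread(target=shutil.rmtree, args=(tmp,))
    reaper.start()        # data removal proceeds in the background
    return tmp, reaper

# Demo on a scratch directory standing in for /ifs/test/wsnap2.
root = tempfile.mkdtemp()
wsnap = os.path.join(root, "wsnap2")
os.makedirs(os.path.join(wsnap, "data"))

tmp, reaper = delete_writable_snapshot(wsnap)
assert not os.path.exists(wsnap)  # gone from the namespace at once
reaper.join()                     # wait for the asynchronous reap
assert not os.path.exists(tmp)    # contents fully removed

shutil.rmtree(root, ignore_errors=True)  # clean up the scratch area
```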

Deleting a writable snapshot removes the lock on its backing read-only snapshot, which can then also be deleted if required, provided no other active writable snapshots are based on that read-only snapshot.

The deletion of writable snapshots in OneFS 9.3 is also a strictly manual process. There is currently no provision for automated, policy-driven control such as the ability to set a writable snapshot expiry date, or a bulk snapshot deletion mechanism.

The recommended practice is to quiesce any client sessions to a writable snapshot before deleting it. Since the backing snapshot can no longer be trusted once its lock is removed during the deletion process, any in-flight I/O may experience errors while the writable snapshot is being removed.

In the next article in this series, we’ll take a look at writable snapshots monitoring and performance.