OneFS Caching – L3 Performance and Sizing

In the final article in this caching series we’ll take a look at some of the L3 cache’s performance benefits and attributes – plus how to size the cache and other considerations and good practices.

One of the goals of L3 is to deliver solid benefits right out of the box for a wide variety of workloads. However, L3 cache usually provides more benefit for random and aggregated workloads than for sequential and optimized workflows – typically delivering IOPS for user data reads comparable to the SmartPools metadata-read strategy.

Although the benefit of L3 caching is highly workflow dependent, the following general rules can be assumed:

  • During data prefetch operations, streaming requests are intentionally sent directly to the spinning disks (HDDs), while utilizing the L3 cache SSDs for random IO.
  • SmartPools metadata-write strategy may be the better choice for metadata write and/or overwrite heavy workloads, for example EDA and certain HPC workloads.
  • L3 cache can deliver considerable latency improvements for repeated random read workflows over both non-L3 nodepools and SmartPools metadata-read configured nodepools.
  • L3 can also provide improvements for parallel workflows, by reducing the impact on streaming throughput from random reads (streaming metadata).
  • The performance of OneFS Job Engine jobs can also be increased by L3 cache.

L3 cache is enabled by default for Isilon A200, A2000 and the older Gen5 NL and HD nodes that contain SSDs, and cannot be disabled. On these platforms, L3 cache runs in a metadata-only mode. By storing just metadata blocks, L3 cache optimizes the performance of operations such as system protection and maintenance jobs, in addition to metadata-intensive workloads.

Figuring out the size of the active data, or working set, for your environment is the first step in an L3 cache SSD sizing exercise.

L3 cache utilizes all available SSD space over time. As a rule, the more SSD space available, the greater the benefit L3 cache can provide. However, sometimes losing spindle count hurts a workflow more than adding cache helps it. If possible, add a larger capacity SSD rather than multiple smaller SSDs.

L3 cache sizing involves calculating the correct amount of SSD space to fit the working data set. This can be done by using the isi_cache_stats command to periodically capture L2 cache statistics on an existing cluster.

Run the following commands at the start and end of the workload's activity cycle. Initially, run isi_cache_stats -c in order to reset, or zero out, the counters. Then run isi_cache_stats -v at workload activity completion and save the output. Looking at the L2 cache miss rates for both data and metadata on a single node will provide an accurate indication of the size of the working data set.

These cache miss counters are displayed as 8KB blocks. So an L2_data_read.miss value of 1024 blocks represents 8 MB of actual missed data.

The formula for calculating the working set size is:

working_set_size (in 8KB blocks) = L2_data_read.miss + L2_meta_read.miss

Once the working set size has been calculated, a good rule of thumb is to size L3 SSD capacity per node according to the following formula:

L2 capacity + L3 capacity >= 150% of working set size.
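
As a rough illustration of the arithmetic, the following Python sketch plugs in the L2 miss counters from the sample isi_cache_stats -v output shown later in this article; the 32 GiB of effective L2 capacity assumed here is purely hypothetical:

BLOCK_SIZE = 8 * 1024  # isi_cache_stats counters are reported in 8KB blocks

# L2 miss counters captured at the end of a workload cycle, after zeroing the
# counters with 'isi_cache_stats -c' at the start (values taken from the sample output below).
l2_data_read_miss = 4289278   # 8KB data blocks missed in L2
l2_meta_read_miss = 3643663   # 8KB metadata blocks missed in L2

working_set_bytes = (l2_data_read_miss + l2_meta_read_miss) * BLOCK_SIZE

# Rule of thumb: L2 capacity + L3 capacity >= 150% of the working set size.
l2_capacity_bytes = 32 * 1024**3          # hypothetical effective L2 (RAM) cache per node
required_l3_bytes = 1.5 * working_set_bytes - l2_capacity_bytes

print(f"Working set: {working_set_bytes / 1024**3:.1f} GiB")
print(f"Suggested minimum L3 SSD capacity: {max(required_l3_bytes, 0) / 1024**3:.1f} GiB")

With these particular numbers, the working set comes out at roughly 60 GiB, suggesting a little under 60 GiB of L3 SSD capacity per node on top of the assumed L2.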

There are diminishing returns for L3 cache after a certain point. With too high an SSD capacity to working set size ratio, the additional cache hits diminish and fail to add further benefit. On the other hand, compared to SmartPools SSD strategies, another benefit of using SSDs for L3 cache is that performance degrades much more gracefully if metadata does happen to exceed the available SSD capacity.

Repeated random read workloads will typically benefit most from L3 cache via latency improvements. When sizing L3 SSD capacity, the recommendation is to use a small number (ideally no more than two) of large capacity SSDs rather than multiple small SSDs to achieve the appropriate capacity of SSD(s) that will fit your working data set.

When it comes to replacing failed L3 cache SSDs, the same procedure should be employed as for replacing other storage drives. However, L3 cache SSDs do not require FlexProtect or AutoBalance to run post replacement, so it’s typically a much faster process.

For a legacy node pool using a SmartPools metadata-write strategy, the conventional wisdom is to avoid converting it to L3 cache unless:

  1. The SSDs are seriously underutilized.
  2. The overall I/O mix has changed and represents a significant drop in metadata write percentage.
  3. The SSDs in the pool are oversubscribed and spilling over to hard disk.
  4. Your primary concern is SSD longevity.

OneFS Caching – The Key to L3

Unlike L1 and L2 cache, which are always present and operational in storage nodes, L3 cache is enabled per node pool via a simple on or off configuration setting. Other than this, there are no additional visible configuration settings to change. When enabled, L3 consumes all the SSDs in the node pool. Also, L3 cannot coexist with other SSD strategies, with the exception of Global Namespace Acceleration (GNA). However, since they're exclusively reserved for caching, L3 cache node pool SSDs cannot participate in GNA.

Note that L3 cache is typically enabled by default on any new node pool containing SSDs.

Once the SSDs have been reformatted and are under the control of L3 cache, the WebUI removes them from usable storage.

There is also a global setting which governs whether to enable L3 cache by default for new node pools.

When converting the SSDs in a particular nodepool to use L3 cache rather than SmartPools, progress can be estimated by periodically tracking SSD space (used capacity) usage over the course of the conversion process. Additionally, the job impact policy of the Flexprotect_Plus or SmartPools job responsible for the L3 conversion can be reprioritized to run faster or slower. This has the effect of correspondingly increasing or decreasing the impact of the conversion process on cluster resources.

OneFS provides tools to accurately assess the performance of the various levels of cache at a point in time. These cache statistics can be viewed from the OneFS CLI using the isi_cache_stats command. Statistics for L1, L2 and L3 cache are displayed for both data and metadata.

# isi_cache_stats

Totals            

l1_data: a 446G 100% r 579G  85% p 134G  89%, l1_encoded: a 0.0B   0% r 0.0B   0% p 0.0B   0%, l1_meta: r  82T 100% p 219M  92%,

l2_data: r 376G  78% p 331G  81%, l2_meta: r 604G  96% p 1.7G   4%,

l3_data: r   6G  19% p 0.0B   0%, l3_meta: r  24G  99% p 0.0B   0%

For more detailed and formatted output, a verbose option of the command is available using the ‘isi_cache_stats -v’ option:

# isi_cache_stats -v

------------------------- Totals -------------------------

l1_data:

        async read (8K blocks):
                aread.start:              58665103 / 100.0%
                aread.hit:                58433375 /  99.6%
                aread.miss:                 231378 /   0.4%
                aread.wait:                    350 /   0.0%


        read (8K blocks):
                read.start:               89234355 / 100.0%
                read.hit:                 58342417 /  65.4%
                read.miss:                13082048 /  14.7%
                read.wait:                  246797 /   0.3%
                prefetch.hit:             17563093 /  19.7%


        prefetch (8K blocks):
                prefetch.start:           19836713 / 100.0%
                prefetch.hit:             17563093 /  88.5%


l1_encoded:
        async read (8K blocks):
                aread.start:                     0 /   0.0%
                aread.hit:                       0 /   0.0%
                aread.miss:                      0 /   0.0%
                aread.wait:                      0 /   0.0%


        read (8K blocks):
                read.start:                      0 /   0.0%
                read.hit:                        0 /   0.0%
                read.miss:                       0 /   0.0%
                read.wait:                       0 /   0.0%
                prefetch.hit:                    0 /   0.0%


        prefetch (8K blocks):
                prefetch.start:                  0 /   0.0%
                prefetch.hit:                    0 /   0.0%


l1_meta:
        read (8K blocks):
                read.start:              11030213475 / 100.0%
                read.hit:                11019567231 /  99.9%
                read.miss:                 8070087 /   0.1%
                read.wait:                 2548102 /   0.0%
                prefetch.hit:                28055 /   0.0%


        prefetch (8K blocks):
                prefetch.start:              30483 / 100.0%
                prefetch.hit:                28055 /  92.0%


l2_data:
        read (8K blocks):
                read.start:               63393624 / 100.0%
                read.hit:                  5916114 /   9.3%
                read.miss:                 4289278 /   6.8%
                read.wait:                 9815412 /  15.5%
                prefetch.hit:             43372820 /  68.4%


        prefetch (8K blocks):
                prefetch.start:           53327065 / 100.0%
                prefetch.hit:             43372820 /  81.3%


l2_meta:
        read (8K blocks):
                read.start:               82823463 / 100.0%
                read.hit:                 78959108 /  95.3%
                read.miss:                 3643663 /   4.4%
                read.wait:                    1758 /   0.0%
                prefetch.hit:               218934 /   0.3%


        prefetch (8K blocks):
                prefetch.start:            5517237 / 100.0%
                prefetch.hit:               218934 /   4.0%


l3_data:
        read (8K blocks):
                read.start:                4418424 / 100.0%
                read.hit:                   817632 /  18.5%
                read.miss:                 3600792 /  81.5%
                read.wait:                       0 /   0.0%
                prefetch.hit:                    0 /   0.0%


        prefetch (8K blocks):
                prefetch.start:                  0 /   0.0%
                prefetch.hit:                    0 /   0.0%


l3_meta:
        read (8K blocks):
                read.start:                3104472 / 100.0%
                read.hit:                  3087217 /  99.4%
                read.miss:                   17255 /   0.6%
                read.wait:                       0 /   0.0%
                prefetch.hit:                    0 /   0.0%


        prefetch (8K blocks):
                prefetch.start:                  0 /   0.0%
                prefetch.hit:                    0 /   0.0%


l1_all:
        prefetch.start:                   19867196 / 100.0%
        prefetch.misses:                         0 /   0.0%

l2_all:
        prefetch.start:                   58844302 / 100.0%
        prefetch.misses:                     48537 /   0.1%

It’s worth noting that for L3 cache, the prefetch statistics will always read zero, since it’s a pure eviction cache and does not utilize data or metadata prefetch.

Due to balanced data distribution, automatic rebalancing, and distributed processing, OneFS is able to leverage additional CPUs, network ports, and memory as the system grows. This also allows the caching subsystem (and, by extension, throughput and IOPS) to scale linearly with the cluster size.

OneFS Caching – Workings and Mechanics

In this article we’ll dig into the workings and mechanics of OneFS read caching a bit deeper…

L1 cache interacts with the L2 cache on any node it requires data from, and the L2 cache interacts with both the storage subsystem and L3 cache. L3 cache can be enabled or disabled at a nodepool level. L3 cached blocks are stored on one or more SSDs within the node, and each node in the same nodepool must have L3 cache enabled.

Here are the relative latencies of OneFS cache hits and misses:

Cache       Hit latency   Miss (next tier consulted)
L1          10us          L2
L2          100us         L3 (or hard disk)
L3          200us         Hard disk
Hard disk   1-10ms        -

Note: These latency numbers may vary in an active cluster.

L2 is typically more beneficial than L1 because a hit avoids a higher latency operation. An L1 cache hit avoids a back-end round-trip to fetch the data, whereas an L2 cache hit avoids, in the worst case, a SATA disk seek. This is a dramatic difference in both relative and absolute terms. For SATA drives, an L2 miss costs around two orders of magnitude more than a hit, compared to roughly one order of magnitude for L1, and a single back-end round-trip is typically a small portion of a full front-end operation.

L2 is preferable because it is accessible to all nodes. Assuming a workflow with any overlap among nodes, it is preferable to have the cluster’s DRAM holding L2 data rather than L1. In L2, a given data block is only cached once and invalidated much less frequently. This is why storage nodes are configured with a drop-behind policy on file data. Nodes without disks will not drop behind since there is no L2 data to cache.

When a read request arrives from a client, OneFS determines whether the requested data is in local cache. Any data resident in local cache is read immediately. If data requested is not in local cache, it is read from disk. For data not on the local node, a request is made from the remote nodes on which it resides. On each of the other nodes, another cache lookup is performed. Any data in the cache is returned immediately, and any data not in the cache is retrieved from disk. When the data has been retrieved from local and remote cache (and possibly disk), it is returned back to the client.

Each level of OneFS’ cache hierarchy utilizes a different strategy for cache eviction, to meet the particular needs of that cache type. For L1 cache in storage nodes, cache aging is based on a drop-behind algorithm. L2 cache utilizes a Least Recently Used algorithm, or LRU, since it is relatively simple to implement, low-overhead, and performs well in general. By contrast, the L3 cache employs a first-in, first-out eviction policy (or FIFO) since it’s writing to what is effectively a specialized linear filesystem on SSD.

For OneFS, a drawback of LRU is that it is not scan resistant. For example, a OneFS Job Engine job or backup process that scans a large amount of data can cause the L2 cache to be flushed. This is mitigated to a large degree by the L3 cache. Other eviction policies have the ability to promote frequently accessed entries such that they are not evicted by scanning entries, which are accessed only once.
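
To make the scan-resistance point concrete, here's a minimal, purely illustrative Python sketch (not OneFS code) showing how a single large scan flushes a hot working set out of a plain LRU cache:

from collections import OrderedDict

class LRUCache:
    # Toy LRU cache, used only to illustrate scan behavior.
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, block):
        if block in self.entries:
            self.entries.move_to_end(block)       # hit: promote to most recently used
        else:
            self.entries[block] = True            # miss: insert
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=1000)
hot_set = range(100)                              # frequently re-read blocks
for block in hot_set:
    cache.access(block)

for block in range(10_000, 20_000):               # a large one-off scan (e.g. a backup treewalk)
    cache.access(block)

print(sum(1 for b in hot_set if b in cache.entries))   # 0 - the hot set has been evicted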

OneFS uses two primary sources of information for predicting a file’s access pattern and pre-populating the cache with data and metadata blocks before they’re requested:

  1. OneFS attributes that can be set on files and directories to provide hints to the filesystem.
  2. The actual read activity occurring on the file.

This technique is known as ‘prefetching’, whereby the latency of an operation is mitigated by predictively copying data into a cache before it has been requested. Data prefetching is employed frequently and is a significant benefactor of the OneFS flexible file allocation strategy.

Flexible allocation involves determining the best layout for a file based on several factors, including cluster size (number of nodes), file size, and protection level (e.g. +2 or +3). The performance effect of flexible allocation is to place a file on the largest number of drives possible, given the above constraints.

The most straightforward application of prefetch is file data, where linear access is common for unstructured data, such as media files. Reading and writing of such files generally starts at the beginning and continues unimpeded to the end of the file. After a few requests, it becomes highly likely that a file is being streamed to the end.

OneFS data prefetch strategies can be configured either from the command line or via SmartPools. File data prefetch behavior can be controlled down to a per-file granularity using the ‘isi set/get’ command’s access pattern setting. The available selectable file access patterns include concurrency (the default), streaming, and random.

# isi get tstfile1

POLICY    LEVEL PERFORMANCE COAL  FILE

default   6+2/2 streaming on    tstfile1

# isi set -l random tstfile1

# isi get tstfile1

POLICY    LEVEL PERFORMANCE COAL  FILE

default   6+2/2 random      on    tstfile1

Metadata prefetch occurs for the same reason as file data. Metadata scanning operations, such as finds and treewalks, can benefit. However, the use of metadata prefetch is less common because most accesses are random and unpredictable.

OneFS also provides a mechanism for prefetching files based on their nomenclature. In film and TV production, “streaming” often takes a different form from streaming an audio file. Each frame in a movie will often be contained in an individual file. As such, streaming involves reading a set of image files, and prefetching across files is important. The files are often a subset of a directory, so directory entry prefetch does not apply. Ideally, this would be controlled by a client application; however, in practice this rarely occurs.

To address this, OneFS has a file name prefetch facility. While file name prefetch is disabled by default, as with file data prefetch, it can be enabled with file access settings. When enabled, file name prefetch guesses the next sequence of files to be read by matching against several generic naming patterns.
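
As a simple illustration of the idea, and not the actual OneFS pattern matcher, the sketch below guesses the next few file names in a numbered image sequence:

import re

def guess_next_files(filename, count=3):
    # Guess subsequent names in a numbered sequence, e.g. frame0001.dpx -> frame0002.dpx.
    # Illustrative only; OneFS's matcher uses its own set of generic naming patterns.
    match = re.search(r"(\d+)(\.\w+)$", filename)
    if not match:
        return []
    digits, ext = match.groups()
    prefix = filename[:match.start(1)]
    return [f"{prefix}{int(digits) + i:0{len(digits)}d}{ext}" for i in range(1, count + 1)]

print(guess_next_files("frame0001.dpx"))
# ['frame0002.dpx', 'frame0003.dpx', 'frame0004.dpx']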

Flexible file handle affinity (FHA) is a read-side algorithm designed to better utilize the internal threads used to read files. Using system configuration options and read access profiling, the number of operations per thread can be tuned to improve the efficiency of reads. FHA maps file handles to worker threads according to a combination of system settings, locality of the read requests (in terms of how close the requested addresses are), and the latency of the thread(s) serving requests to a particular client.

Note that prefetch does not apply to the L3 cache, since L3 is populated with ‘interesting’ L2 blocks dropped from memory by L2’s least recently used cache eviction algorithm.

Blocks evicted from L2 are candidates for inclusion in L3, and a filter is employed to reduce the quantity and increase the value of incoming blocks. Because L3 is a first in, first out (FIFO) cache, filtering is performed ahead of time. By selecting blocks that are more likely to be read again, L3 can both limit SSD churn and enhance the quality of the L3 cache contents.

The L3 filter uses several heuristics to evaluate which candidate blocks will likely be most valuable and should go to L3 cache. In general, L3 prefers metadata/inode blocks to data blocks. And the guiding principle for data blocks is that the per-block cost of re-reading a sequential cluster of blocks from disk is much lower than performing random reads from disk. For example, if a block is a “random” read (i.e. there are no neighboring blocks on this disk in L2), then it is always included in L3. Conversely, if the block is part of a sequential cluster of 16 or more blocks (128KB), it is not evicted to L3. As such, the L3 cache is most effective, per unit of capacity, at addressing random reads.
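
A hedged sketch of that admission logic might look like the following; the 16-block (128KB) threshold comes from the rule above, while everything else, including how intermediate cluster sizes are handled, is a simplification rather than the real OneFS filter:

SEQUENTIAL_CLUSTER_THRESHOLD = 16   # 16 x 8KB blocks = 128KB, per the rule described above

def count_contiguous_neighbors(addr, l2_blocks):
    # Count cached blocks contiguous with 'addr' on the same disk (toy helper).
    n, a = 0, addr + 1
    while a in l2_blocks:
        n, a = n + 1, a + 1
    a = addr - 1
    while a in l2_blocks:
        n, a = n + 1, a - 1
    return n

def admit_to_l3(addr, l2_blocks, is_metadata):
    # Decide whether an L2-evicted block should be written into L3.
    if is_metadata:
        return True                               # metadata/inode blocks are preferred
    neighbors = count_contiguous_neighbors(addr, l2_blocks)
    if neighbors == 0:
        return True                               # a 'random' read: always admit
    if neighbors + 1 >= SEQUENTIAL_CLUSTER_THRESHOLD:
        return False                              # part of a large sequential cluster: skip
    return True                                   # smaller clusters are admitted in this sketch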

The most frequently accessed data and metadata on a node should just remain in L2 cache and not get evicted to L3. For the next tier of cached data that’s accessed frequently enough to live in L3, but not frequently enough to always live in RAM, there’s a mechanism in place to keep these semi-frequently accessed blocks in L3.

To maintain this L3 cache persistence, when the kernel goes to read a metadata or data block, the following steps are performed:

1) First, L1 cache is checked. Then, if no hit, L2 cache is consulted.

2) If a hit is found in memory, it’s done.

3) If not in memory, L3 is then checked.

4) If there’s an L3 hit, and that item is near the end of the L3 FIFO (last 10%), a flag is set on the block which causes it to be evicted into L3 again when it is evicted out of L2.
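
In rough pseudologic (an illustrative Python sketch, not the actual kernel code path), the lookup order and the re-eviction hint look something like this:

class Block:
    def __init__(self, addr, data=None):
        self.addr = addr
        self.data = data
        self.reinsert_on_l2_eviction = False

def read_block(addr, l1, l2, l3_fifo, disk):
    # l1, l2 and disk are dicts of addr -> Block; l3_fifo is a list ordered oldest first.
    if addr in l1:                                 # 1) check L1 first
        return l1[addr]
    if addr in l2:                                 # 2) then L2; an in-memory hit is done
        return l2[addr]
    for pos, block in enumerate(l3_fifo):          # 3) not in memory: consult L3
        if block.addr == addr:
            # 4) a hit close to falling off the FIFO (the oldest ~10% in this sketch) is
            #    flagged so the block is written back into L3 when it later ages out of L2.
            if pos < 0.1 * len(l3_fifo):
                block.reinsert_on_l2_eviction = True
            l2[addr] = block
            return block
    block = disk[addr]                             # miss at every level: read from spinning disk
    l2[addr] = block
    return block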

Additionally, any un-cached job engine metadata requests will always come from disk and bypass L3 cache, so they do not displace user-cached blocks from L3 cache. As new versions are written, the journal notifies L3, which invalidates and removes the dirty block(s) from its cache.

OneFS Caching Architecture

There have been a number of recent enquiries from the field around how caching is performed in OneFS. So it seemed like an ideal time to review this topic over the next couple of articles.

Caching occurs at multiple different levels, and for a variety of types of data. In this article we’ll concentrate on the caching of file system structures in main memory and on SSD.

OneFS’ caching infrastructure design is predicated on aggregating each individual node’s cache into one cluster wide, globally accessible pool of memory. This is achieved by using an efficient messaging system that allows all the nodes’ memory caches to be available to each and every node in the cluster.

For remote memory access, OneFS utilizes the Sockets Direct Protocol (SDP) over an Ethernet or Infiniband backend interconnect on the cluster. SDP provides an efficient, socket-like interface between nodes which, by using a switched star topology, ensures that remote memory addresses are only ever one hop away. While not as fast as local memory, remote memory access is still very fast due to the low latency of the dedicated backend interconnect.

OneFS uses up to three levels of read cache, plus an NVRAM-backed write cache, or write coalescer. The first two types of read cache, level 1 (L1) and level 2 (L2), are memory (RAM) based, and analogous to the cache used in CPUs. A third tier of read cache, called SmartFlash, or Level 3 cache (L3), is also configurable on nodes that contain solid state drives (SSDs). L3 cache is an eviction cache that is populated by L2 cache blocks as they are aged out from memory.

The OneFS caching subsystem is coherent across the cluster. This means that if the same content exists in the private caches of multiple nodes, this cached data is consistent across all instances. For example, consider the following scenario:

  1. Node 2 and Node 4 each have a copy of data located at an address in shared cache.
  2. Node 4, in response to a write request, invalidates node 2’s copy.
  3. Node 4 then updates the value.
  4. Node 2 must re-read the data from shared cache to get the updated value.

OneFS utilizes the MESI Protocol to maintain cache coherency, implementing an “invalidate-on-write” policy to ensure that all data is consistent across the entire shared cache. The various states that in-cache data can take are:

M – Modified: The data exists only in local cache, and has been changed from the value in shared cache. Modified data is referred to as ‘dirty’.

E – Exclusive: The data exists only in local cache, but matches what is in shared cache. This data referred to as ‘clean’.

S – Shared: The data in local cache may also be in other local caches in the cluster.

I – Invalid: A lock (exclusive or shared) has been lost on the data.
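
Here's a rough sketch of the invalidate-on-write idea; it's a deliberate simplification, since OneFS actually enforces coherency with distributed locks rather than the toy dictionary below:

from enum import Enum

class MesiState(Enum):
    MODIFIED = "M"    # changed locally; 'dirty' relative to shared cache
    EXCLUSIVE = "E"   # cached here only, but matches shared cache; 'clean'
    SHARED = "S"      # may also be held in other nodes' local caches
    INVALID = "I"     # the lock protecting this data has been lost

class CacheLine:
    def __init__(self, state=MesiState.INVALID, value=None):
        self.state = state
        self.value = value

def write(node_caches, writer, addr, value):
    # Invalidate-on-write: every other node's copy becomes INVALID before the write lands.
    for name, cache in node_caches.items():
        if name != writer and addr in cache:
            cache[addr].state = MesiState.INVALID
    line = node_caches[writer].setdefault(addr, CacheLine())
    line.value, line.state = value, MesiState.MODIFIED

# Node 2 and node 4 both hold a copy; node 4 writes, so node 2 must re-read later.
caches = {"node2": {0x10: CacheLine(MesiState.SHARED)},
          "node4": {0x10: CacheLine(MesiState.SHARED)}}
write(caches, "node4", 0x10, b"new data")
print(caches["node2"][0x10].state)    # MesiState.INVALID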

L1 cache, or front-end cache, is memory that is nearest to the protocol layers (e.g. NFS, SMB, etc) used by clients, or initiators, connected to that node. The main task of L1 is to prefetch data from remote nodes. Data is pre-fetched per file, and this is optimized in order to reduce the latency associated with the nodes’ IB back-end network. Since the IB interconnect latency is relatively small, the size of L1 cache, and the typical amount of data stored per request, is less than L2 cache.

L1 is also known as remote cache because it contains data retrieved from other nodes in the cluster. It is coherent across the cluster, but is used only by the node on which it resides, and is not accessible by other nodes. Data in L1 cache on storage nodes is aggressively discarded after it is used. L1 cache uses file-based addressing, in which data is accessed via an offset into a file object. The L1 cache refers to memory on the same node as the initiator. It is only accessible to the local node, and typically the cache is not the primary copy of the data. This is analogous to the L1 cache on a CPU core, which may be invalidated as other cores write to main memory. L1 cache coherency is managed via a MESI-like protocol using distributed locks, as described above.

It’s worth noting that L1 cache is utilized differently in accelerator nodes, which don’t contain any disk drives. Instead, the entire read cache is L1 cache, since all the data is fetched from other storage nodes. Also, cache aging is based on a least recently used (LRU) eviction policy, as opposed to the drop-behind algorithm typically used in a storage node’s L1 cache. Because an accelerator’s L1 cache is large, and the data in it is much more likely to be requested again, data blocks are not immediately removed from cache upon use. However, metadata and update-heavy workloads don’t benefit as much, and an accelerator’s cache is only beneficial to clients directly connected to the node.

L2, or back-end cache, refers to local memory on the node on which a particular block of data is stored. L2 reduces the latency of a read operation by not requiring a seek directly from the disk drives. As such, the amount of data prefetched into L2 cache for use by remote nodes is much greater than that in L1 cache.

L2 is also known as local cache because it contains data retrieved from disk drives located on that node and then made available for requests from remote nodes. Data in L2 cache is evicted according to a Least Recently Used (LRU) algorithm. Data in L2 cache is addressed by the local node using an offset into a disk drive which is local to that node. Since the node knows where the data requested by the remote nodes is located on disk, this is a very fast way of retrieving data destined for remote nodes. A remote node accesses L2 cache by doing a lookup of the block address for a particular file object. As described above, there is no MESI invalidation necessary here and the cache is updated automatically during writes and kept coherent by the transaction system and NVRAM.

L3 cache is a subsystem which caches evicted L2 blocks on a node. Unlike L1 and L2, not all nodes or clusters have an L3 cache, since it requires solid state drives (SSDs) to be present and exclusively reserved and configured for caching use. L3 serves as a large, cost-effective way of extending a node’s read cache from gigabytes to terabytes. This allows clients to retain a larger working set of data in cache, before being forced to retrieve data from higher latency spinning disk. The L3 cache is populated with “interesting” L2 blocks dropped from memory by L2’s least recently used cache eviction algorithm.

Unlike RAM based caches, since L3 is based on persistent flash storage, once the cache is populated, or warmed, it’s highly durable and persists across node reboots, etc. L3 uses a custom log-based filesystem with an index of cached blocks. The SSDs provide very good random read access characteristics, such that a hit in L3 cache is not that much slower than a hit in L2.

To utilize multiple SSDs for cache effectively and automatically, L3 uses a consistent hashing approach to associate an L2 block address with one L3 SSD. In the event of an L3 drive failure, a portion of the cache will obviously disappear, but the remaining cache entries on other drives will still be valid. Before a new L3 drive may be added to the hash, some cache entries must be invalidated.
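
A minimal sketch of that consistent hashing idea is shown below; the hash function and ring layout here are illustrative only, not the actual OneFS scheme:

import bisect, hashlib

class L3DriveRing:
    # Toy consistent-hash ring mapping L2 block addresses onto L3 SSDs.
    def __init__(self, drives, points_per_drive=64):
        self.ring = sorted((self._hash(f"{d}:{i}"), d)
                           for d in drives for i in range(points_per_drive))
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

    def drive_for(self, block_addr):
        idx = bisect.bisect(self.keys, self._hash(block_addr)) % len(self.ring)
        return self.ring[idx][1]

ring = L3DriveRing(["ssd1", "ssd2", "ssd3"])
print(ring.drive_for(0xABCDEF))
# If ssd2 fails, only the entries that hashed to ssd2 are lost; the rest remain valid,
# and adding a replacement drive only invalidates the entries that now map to it.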

OneFS also uses a dedicated inode cache in which recently requested inodes are kept. The inode cache frequently has a large impact on performance, because clients often cache data, and many network I/O activities are primarily requests for file attributes and metadata, which can be quickly returned from the cached inode.

OneFS provides tools to accurately assess the performance of the various levels of cache at a point in time. These cache statistics can be viewed from the OneFS CLI using the isi_cache_stats command. Statistics for L1, L2 and L3 cache are displayed for both data and metadata.

# isi_cache_stats
Totals

l1_data: a 409G 100% r 542G  84% p 134G  89%, l1_encoded: a 0.0B   0% r 0.0B   0%,

l2_meta: r 597G  96% p 1.7G   4%,

l3_data: r   6G  18% p 0.0B   0%, l3_meta: r  22G  99%

For more detailed and formatted output, a verbose option of the command is available using the following syntax:

# isi_cache_stats -v

It’s worth noting that for L3 cache, the prefetch statistics will always read zero, since it’s a pure eviction cache and does not utilize data or metadata prefetch.

Due to balanced data distribution, automatic rebalancing, and distributed processing, OneFS is able to leverage additional CPUs, network ports, and memory as the system grows. This also allows the caching subsystem (and, by extension, throughput and IOPS) to scale linearly with the cluster size.

OneFS Quota Domains

In the previous article, we looked at the use of protection domains in OneFS, focusing on SyncIQ replication, SmartLock immutable archiving, and Snapshots and SnapRevert.

Under the hood, SmartQuotas is also based on the concept of domains – the linchpins of quota accounting. Since OneFS is a single file system, it relies on accounting domains for defining the scope of a quota in place of the typical volume boundaries found in most storage systems. As such, a domain defines which files belong to a quota, accounts for each resource type in that set and defines the top-level directory configuration point.

For SmartQuotas, the three main resource types are:

Resource Type   Description
Directory       A specific directory and all its subdirectories
User            A specific user
Group           All members of a specific group

A domain defined as “name@folder” would be the set of files under “folder”, owned by “name”, which could be either a user or a group. The files accounted for include all files reachable from the given path, without traversing any soft links. The owner “name” can be ALL, and “/ifs”, the OneFS root directory, is also an effective ALL for “folder”.

With SmartQuotas it’s easy to create traditional domain types quickly by using “ALL”. The following are examples of domain types:

  • All files belonging to user Jane: user:Jane@/ifs
  • All files under /ifs/home, belonging to any user: ALL@/ifs/home.
  • All files under /ifs/home that belong to user Jane: user:Jane@/ifs/home

Domains cannot be created on anything but directories. More specifically, domains are associated with the actual directories themselves, not directory paths. For example, if the domain is ALL@/ifs/home/data, but /ifs/home/data gets renamed to /ifs/home/files, the domain stays with the directory.

Domains can also be nested and may overlap. For example, say a hard quota is set on /ifs/data/marketing for 5TB. 1TB soft quotas are then placed on individual users in the marketing department. This ensures that the marketing directory as a whole never exceeds 5TB, while limiting the users in the marketing department to 1TB each.
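
A hedged sketch of how overlapping domains are both consulted on an allocation might look like this (purely illustrative, treating both limits as hard for simplicity; not SmartQuotas internals):

TB, GB = 1024**4, 1024**3

class QuotaDomain:
    def __init__(self, name, limit_bytes, used_bytes=0):
        self.name, self.limit, self.used = name, limit_bytes, used_bytes

    def allows(self, nbytes):
        return self.used + nbytes <= self.limit

def check_allocation(domains, nbytes):
    # Every domain governing the path must allow the allocation.
    return all(d.allows(nbytes) for d in domains)

# Nested domains: a 5TB quota on the marketing directory, plus a 1TB quota on one user's files.
marketing = QuotaDomain("ALL@/ifs/data/marketing", 5 * TB, used_bytes=4 * TB)
jane = QuotaDomain("user:Jane@/ifs/data/marketing", 1 * TB, used_bytes=900 * GB)

print(check_allocation([marketing, jane], 50 * GB))    # True: both domains still have room
print(check_allocation([marketing, jane], 200 * GB))   # False: Jane would exceed her 1TB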

A default quota domain is one that does not account for any specific set of files but instead specifies a policy for new domains that match a specific trigger. In other words, default domains are configuration templates for actual domains. SmartQuotas uses the identity notation ‘default-user’, ‘default-group’, and ‘default-directory’ to describe domains with default policies. For example, the domain default-user@/ifs/home becomes specific-user@/ifs/home for each specific-user that is not otherwise defined. All enforcements on default-user are copied to specific-user when specific-user allocates within the domain, and the new inherited domain quota is termed a Linked Quota. There may be overlapping defaults (i.e. default-user@/ifs and default-user@/ifs/home may both be defined).

Default quota domains help drastically simplify quota management for large environments by providing a mechanism to define top-level template configurations from which many actual quotas are cloned, or linked. When a default quota domain is configured on a directory, any subdirectories created directly underneath it will automatically inherit the quota limits specified in the parent domain. This streamlines the provisioning and management of quotas for large enterprise environments. Furthermore, default directory quotas can co-exist with user and/or group quotas and legacy default quotas.

Default directory quotas have been available since OneFS 8.2, in addition to the default user and group quotas available in earlier releases. For example:

  • Create default-directory quota
# isi quota create --path=/ifs/parent-dir --type=default-directory --hard-threshold=10M
  • Modify Default directory quota
# isi quota modify --path=/ifs/parent-dir --type=default-directory --advisory-threshold=6M --soft-threshold=7M --soft-grace=1D
  • List default-directory quota
# isi quota list                 

  Type              AppliesTo  Path            Snap  Hard   Soft  Adv  Used

  --------------------------------------------------------------------------

  default-directory DEFAULT    /ifs/parent-dir No    10.00M -    6.00M 0.00

  --------------------------------------------------------------------------

  Total: 1
  • Delete Default directory quota
# isi quota delete --path=/ifs/parent-dir --type=default-directory

If the enforcements on a default domain change, SmartQuotas will automatically propagate the changes to the Linked Quota domains. If a default quota domain is deleted, SmartQuotas will delete all children marked as inherited. An administrator may also choose to delete the default without deleting the children, but this will break inheritance on all inherited children.

For example, the creation and deletion of a sub-directory under the default directory causes an inherited directory quota to be created and removed.

A domain may be in one of three accounting states, as follows:

  • Ready: A domain in the ready state is fully accounted. SmartQuotas displays ‘ready’ domains in all interfaces, and all enforcements apply to such domains.
  • Accounting: A domain is placed in the accounting state when it is waiting on accounting updates.
  • Deleting: After a request to delete a domain, SmartQuotas places the domain in the deleting state until tear-down is complete. Domain removal may be a lengthy process.

SmartQuotas displays accounting domains in all interfaces, including usage data, but indicates that they are in the process of being ‘accounted’. SmartQuotas applies all enforcements to accounting domains, even though it might reject an allocation that would have been permitted once the QuotaScan had completed.

Domains in the deleting state are hidden from all interfaces and the top-level directory of a domain may be deleted while the domain is still in the deleting state (assuming there are no domains in “Ready” or “Accounting” state defined on the directory). No enforcements are applied for domains in “Deleting” state.

A quota scan is performed when the domain is in an accounting state. This can occur during quota creation, to account the new domain, and during quota deletion, to un-account the domain. A QuotaScan is required when creating a quota on a non-empty directory. If quotas are created up-front on an empty directory, no QuotaScan is necessary.

In addition, a QuotaScan job may be started from the WebUI or command line interface using the ‘isi job’ command. Any path specified on the command line is treated as the root of a tree that should be processed. This is provided primarily as a means to re-scan a directory for maintenance reasons.

There are three main processes, or daemons, associated with SmartQuotas:

  • isi_quota_notify_d
  • isi_quota_sweeper_d
  • isi_quota_report_d

The job of the notification daemon, isi_quota_notify_d, is to listen for ‘limit exceeded’ and ‘link denied’ events and generate notifications for each. It also responds to configuration change events and instructs the QDB to generate ‘expired’ and ‘violated’ over-threshold notifications.

A quota sweeper daemon, isi_quota_sweeper_d, is responsible for a number of quota housekeeping tasks such as propagating default changes, domain and notification rule garbage collection and kicking off QuotaScan jobs when necessary.

Finally, the reporting daemon, isi_quota_report_d, is responsible for generating quota reports. Since the QDB only maintains real-time resource usage, reports are necessary for providing point-in-time views of a quota domain’s usage. These historical reports are useful for trend analysis of quota resource usage.

OneFS 8.2 and subsequent releases use the rpc.quotad service to facilitate client-side quota reporting on UNIX and Linux clients via native ‘quota’ tools. The service, which runs on TCP/UDP port 762, is enabled by default, and is controlled under the NFS global settings.

Additionally, in OneFS 8.2 and later, users can now see the available capacity set by their soft and/or hard user and group quotas, rather than the entire cluster capacity or parent directory quotas. This avoids the ‘illusion’ of seeing available space that may not be associated with their quotas.

OneFS Protection Domains

In OneFS, a domain defines a set of behaviors for a collection of files under a specified directory tree. More specifically, a protection domain is a marker which prevents a configured subset of files and directories from being deleted or modified.

If a directory has a protection domain applied to it, that domain will also affect all of the files and subdirectories under that top-level directory. As we’ll see, in some instances, OneFS creates protection domains automatically, but they can also be configured manually.

With the recent introduction of domain-based snapshots, OneFS now supports four types of protection domain:

  • SnapRevert domains
  • SmartLock domains
  • SyncIQ domains
  • Snapshot domains

The process of restoring a snapshot in full to its top level directory can easily be accomplished by the SnapRevert job. This enables cluster administrators to quickly revert to a previous, known-good recovery point – for example, in the event of a virus or malware outbreak. The SnapRevert job can be run from the Job Engine WebUI or CLI, and simply requires adding the desired snapshot ID.

SnapRevert domains are assigned to directories that are contained in snapshots to prevent files and directories from being modified while a snapshot is being reverted. OneFS does not automatically create SnapRevert domains. The SnapRevert domain is described as a ‘restricted writer’ domain, in OneFS jargon. Essentially, this is a piece of extra filesystem metadata and associated locking that prevents a domain’s files being written to while restoring a last known good snapshot.

Because the SnapRevert domain is essentially just a metadata attribute, or marker, placed onto a file or directory, a preferred practice is to create the domain before there is data. This avoids having to wait for DomainMark or DomainTag (the aptly named Job Engine jobs that mark a domain’s files) to walk the entire tree, setting that attribute on every file and directory within it.

There are two main components to SnapRevert:

  • The file system domain that the objects are put into.
  • The job that reverts everything back to what’s in a snapshot.

The SnapRevert job itself actually uses a local SyncIQ policy to copy data out of the snapshot, discarding any changes to the original directory. When the SnapRevert job completes, the original data is left in the directory tree. In other words, after the job completes, the file system (HEAD) is exactly as it was at the point in time that the snapshot was taken. The LINs for the files/directories don’t change, because what’s there is not a copy.

The SnapRevert job can either be scheduled or manually run from the OneFS WebUI by navigating to Cluster Management > Job Operations > Job Types > SnapRevert and clicking the ‘Start Job’ button.

A snapshot can’t be reverted until a SnapRevert domain has been created on its top level directory. If necessary, SnapRevert domains can also be nested. For example, domains could be successfully created on both /ifs/snap1 and /ifs/snap1/snap2. Also, a SnapRevert domain can easily be deleted if you no longer need to restore snapshots of that directory.

It’s worth noting that CloudPools also supports SnapRevert for SmartLink (stub) files. For example, if CloudPools archived “/ifs/cold_data”, the files in this directory would be replaced with stubs and the data moved off to the cloud provider of choice. If you then created a domain for the directory and ran the SnapRevert job, the original files would be restored to the directory, and CloudPools would remove any cloud data that was created as part of the original archive process.

SmartLock domains are assigned to WORM (write once, read many) immutable archive directories to prevent committed files from being modified or deleted. OneFS automatically sets up a SmartLock domain when a SmartLock directory is created. Note that a SmartLock domain cannot be manually deleted. However, if you remove a SmartLock directory, OneFS automatically deletes the associated SmartLock domain.

Once a file is SmartLocked (WORM committed) it cannot ever be modified or moved. It cannot be deleted until its ‘committed until’ or ‘expiry’ date has passed. Even when the expiry date has passed (i.e. the file is in an ‘expired’ state) it cannot be modified or moved. All you can do with an expired file is either delete it or extend its ‘committed until’ date into the future.

SyncIQ domains can be assigned to both the source and target directories of replication policies. OneFS automatically creates a SyncIQ domain for the target directory of a replication policy the first time that the policy is run. OneFS also automatically creates a SyncIQ domain for the source directory of a replication policy during the failback process.

A SyncIQ domain can be manually created for a source directory before initiating the failback process, by configuring the policy for accelerated failback. However, a SyncIQ domain that marks the target directory of a replication policy cannot be deleted.

SnapshotIQ also uses a domain-based model for governance of scheduled snapshots in OneFS 8.2 and later releases. By utilizing the OneFS IFS domains infrastructure, recurring snapshot efficiency and performance is increased by limiting the scope of governance to a smaller, well defined domain boundary.

IFS Domains provide a Mark Job that proactively marks all the files in the domain. Creating a new snapshot on a fully marked domain will not cause further “painting” operations, thereby avoiding a significant portion of the resource overhead caused by taking a new snapshot.

The new snapshot ID is simply added to the domain data section, so the creation of a new snapshot no longer triggers a system-wide painting event. Domains are re-used whenever possible.

Creating two domains of the same type on the same directory will cause the second domain to become an alias of the first domain. Aliases don’t require marking since they share the already existing marks. This benefits both snapshots and snapshot schedules taken on the same directory. For all these reasons, the number of I/O and locking operations needed to resolve snapshot governance is greatly reduced. And because the SnapIDs are stored in a single location (as opposed to being stored on individual inodes), snapshot ID garbage collection is greatly simplified whenever a snapshot is deleted.

The illustration above shows an example of domain-based snapshots. In this case, a snapshot was taken on the ‘projects’ directory, and then on the directory named ‘video’. File v1.mp4 is tagged with the domain IDs, making it more efficient to determine snapshot governance.

A snapshot of file v1.mp4 creates a snap_ID in the domain’s SBT (system b-tree), providing a single place to store snapshot metadata. In previous OneFS versions, snapIDs were stored in the inode, which resulted in duplication of snap_IDs and increased metadata usage.

Note that only snapshots taken after upgrade to OneFS 8.2 will use IFS domains backing. Any snapshots created prior to upgrade will not be converted and will remain in their original form.

Additionally, the new domain-based snapshot functionality in OneFS 8.2 brings other benefits including:

  • Improved management of SnapIDs
  • Reduced number of operations needed to resolve snapshot governance.
  • More efficient use of metadata
  • The automatic exclusion of the cluster’s /ifs/.ifsvar subtree from all root (/ifs) snapshots – although this behavior is configurable.
  • The write cache, or coalescer, is enhanced to better support parallel snapshot creates.
  • The snapshot create path is improved to reduce contention on the STF during copy-on-write.

Sync and snap domains can be easily created to enable snapshot revert and replication failover operations. SmartLock domains, however, cannot be manually created, since OneFS automatically creates a domain upon creation of a SmartLock directory.

For example, the following CLI syntax will create a SnapRevert domain for /ifs/snap1:

# isi job jobs start domainmark --root /ifs/snap1 --dm-type SnapRevert

The same operation can also be performed from the WebUI.

You can delete a replication or snapshot revert domain if you want to move directories out of the domain. However, SmartLock domains cannot be manually removed, but will be automatically removed upon deletion of a SmartLock directory.

The following CLI command will delete a SnapRevert domain on /ifs/snap1:

# isi job jobs start domainmark --root /ifs/snap1 --dm-type SnapRevert --delete

Similarly, this can also be done via the WebUI.

Protection domains can (and usually should) be manually created before they are required by OneFS to perform certain actions. However, manually creating protection domains can limit the ability to interact with the data marked by the domain.

OneFS 8.2 and later releases provide an ‘isi_pdm’ CLI utility for managing protection domains, with the following syntax:

# isi_pdm -h

usage: isi_pdm [-h] [-v]

               {base,domains,exclusions,operations,ifsvar-sysdom} ...




positional arguments:

  {base,domains,exclusions,operations,ifsvar-sysdom}

    base                Read base domains.

    domains             Read or manipulate domain instances.

    exclusions          Add or list domain exclusions.

    operations          Read pending pdm operations.

    ifsvar-sysdom       Manage .ifsvar system domain.




optional arguments:

  -h, --help            show this help message and exit

  -v, --verbose

For example:

# isi_pdm domains list /ifs/data All

[ 2.0100, 315.0100 ]

# isi_pdm exclusions list 2.0100

{

    DomID = 16.8100

    Owner LIN = 1:0000:0001

}

Domain membership can also be viewed via the ‘isi get’ command.

Here are some OneFS domain recommendations, constraints, and considerations:

  • Copying a large number of files into a protection domain can be a lengthy process, since each file must be marked individually as belonging to the protection domain.
  • The best practice is to create protection domains for directories while the directories are empty, and then add files to the directory.
  • The ‘isi sync policies create’ command contains an ‘--accelerated-failback true’ option, which automatically marks the domain. This can save considerable time during failback.
  • If you use SyncIQ to create a replication policy for a SmartLock compliance directory, the SyncIQ and SmartLock compliance domains must be configured at the same root directory level. A SmartLock compliance domain cannot be nested inside a SyncIQ domain.
  • If a domain is currently preventing the modification or deletion of a file, you cannot create a protection domain for a directory that contains that file. For example, if /ifs/data/smartlock/file.txt is set to a WORM state by a SmartLock domain, you cannot create a SnapRevert domain for /ifs/data/.
  • Directories cannot be moved in or out of protection domains. However, you can move a directory to another location within the same protection domain.

OneFS MultiScan, AutoBalance, & Collect

As we’ve seen throughout the recent file system maintenance job articles, OneFS utilizes file system scans to perform such tasks as detecting and repairing drive errors, reclaiming freed blocks, etc. Since these scans typically involve complex sequences of operations, they are implemented via syscalls and coordinated by the Job Engine. These jobs are generally intended to run as minimally disruptive background tasks in the cluster, using spare or reserved capacity.

FS Maintenance Job   Description
AutoBalance          Restores node and drive free space balance
Collect              Reclaims leaked blocks
FlexProtect          Replaces the traditional RAID rebuild process
MediaScan            Scrubs disks for media-level errors
MultiScan            Runs the AutoBalance and Collect jobs concurrently

In this final article of the series, we’ll turn our attention to MultiScan. This job is a combination of the AutoBalance job, which rebalances data across drives, and the Collect job, which recovers leaked blocks from the filesystem. In addition to reclaiming unused capacity as a result of drive replacements, snapshot and data deletes, etc, MultiScan also helps expose and remediate any filesystem inconsistencies.

The OneFS job engine defines two exclusion sets that govern which jobs can execute concurrently on a cluster. MultiScan straddles both of the job engine’s exclusion sets, with AutoBalance (and AutoBalanceLin) in the restripe set, and Collect in the mark set.

The restriping exclusion set is per-phase instead of per job, which helps to more efficiently parallelize restripe jobs when they don’t need to lock down resources. However, with the marking exclusion set, OneFS can only accommodate a single marking job at any point in time.

MultiScan is an unscheduled job that runs by default at ‘LOW’ impact and executes AutoBalance and Collect simultaneously. It is triggered by cluster group change events, which include node boot, shutdown, reboot, drive replacement, etc. While AutoBalance will execute each time the MultiScan job is triggered, Collect typically won’t be run more often than once every two weeks. AutoBalance and/or Collect are typically only run manually if MultiScan has been disabled.

When a new node or drive is added to the cluster, its blocks are almost entirely free, whereas the rest of the cluster is usually considerably more full, capacity-wise. AutoBalance restores the balance of free blocks in the cluster. As such, AutoBalance runs if a cluster’s nodes have a greater than 5% imbalance in capacity utilization. In addition, AutoBalance also fixes recovered writes that occurred due to transient unavailability and also addresses fragmentation.
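
Expressed as a rough calculation (illustrative only; the exact test lives inside the Job Engine), the 5% trigger looks like this:

def needs_autobalance(node_used_pct, threshold_pct=5.0):
    # True if the spread in capacity utilization across nodes exceeds the threshold.
    return max(node_used_pct) - min(node_used_pct) > threshold_pct

# A freshly added node is nearly empty while the rest of the cluster is ~70% full:
print(needs_autobalance([71.2, 70.8, 69.9, 2.1]))    # True  -> AutoBalance would run
print(needs_autobalance([70.1, 70.9, 69.8, 68.5]))   # False -> within the 5% band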

If the cluster’s nodes contain SSDs, AutoBalanceLin (as opposed to the regular AutoBalance job) runs most efficiently by performing a LIN scan using a flash-backed metadata mirror. When a cluster is unbalanced, there is not an obvious subset of files to filter, since the files to be restriped are the ones which are not using the node or drive with less free space. In the case of an added node or drive, no files will be using it. As a result, almost any file scanned is enumerated for restripe.

As mentioned, the Collect job reclaims leaked blocks using a mark and sweep process. In traditional UNIX systems this function is typically performed by the ‘fsck’ utility. With OneFS, however, the other traditional functions of fsck are not required, since the transaction system keeps the file system consistent. Leaks only affect free space.

Collect’s ‘mark and sweep’ gets its name from the in-memory garbage collection algorithm. First, the in-use blocks and any new allocations are marked with the current generation in the Mark phase. When this is complete, the drives are swept of any blocks which don’t have the current generation in the Sweep phase.
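
Conceptually, generation-based mark and sweep works along these lines (a toy sketch; the real Collect job operates on on-disk block allocation metadata, not a Python dictionary):

def collect(allocated_blocks, in_use_blocks, current_generation):
    # allocated_blocks: dict of block -> generation tag; in_use_blocks: blocks the filesystem references.
    # Mark phase: stamp every in-use block (and any new allocation) with the current generation.
    for block in in_use_blocks:
        allocated_blocks[block] = current_generation

    # Sweep phase: any allocated block not carrying the current generation was leaked; free it.
    leaked = [b for b, gen in allocated_blocks.items() if gen != current_generation]
    for block in leaked:
        del allocated_blocks[block]
    return leaked

blocks = {"b1": 7, "b2": 7, "b3": 7}     # b3 was leaked, e.g. by an interrupted operation
print(collect(blocks, in_use_blocks={"b1", "b2"}, current_generation=8))   # ['b3']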

In addition to automatic job execution following a group change event, MultiScan can also be initiated on demand. The following CLI syntax will kick off a manual job run:

# isi job start multiscan

Started job [209]

# isi job list

ID   Type      State   Impact  Pri  Phase  Running Time

--------------------------------------------------------

209  MultiScan Running Low     4    1/4    1s

--------------------------------------------------------

Total: 1

The MultiScan job’s progress can be tracked via the CLI as follows:

# isi job jobs view 209

               ID: 209

             Type: MultiScan

            State: Running

           Impact: Low

           Policy: LOW

              Pri: 4

            Phase: 1/4

       Start Time: 2021-01-03T20:15:16

     Running Time: 34s

     Participants: 1, 2, 3

         Progress: Collect: 225 LINs, 0 errors

                   AutoBalance: 225 LINs, 0 errors

                   LIN Estimate based on LIN count of 2793 done on Jan 04 20:02:57 2021

                   LIN Based Estimate:  3m 2s Remaining (8% Complete)

                   Block Based Estimate: 5m 48s Remaining (4% Complete)

                   0 errors total

Waiting on job ID: -

      Description: Collect, AutoBalance

The LIN (logical inode) statistics above include both files and directories.

Be aware that the estimated LIN percentage can occasionally be misleading/anomalous. If concerned, verify that the stated total LIN count is roughly in line with the file count for the cluster’s dataset. Even if the LIN count is in doubt, the estimated block progress metric should always be accurate and meaningful.

If the job is in its early stages and no estimation can be given (yet), isi job will instead report its progress as “Started”. Note that all progress is reported per phase, with MultiScan phase 1 being the one where the lion’s share of the work is done. By comparison, phases 2-4 of the job are comparatively short.

A job’s resource usage can be traced from the CLI as follows:

# isi job statistics view

     Job ID: 209

      Phase: 1

   CPU Avg.: 11.46%

Memory Avg.

        Virtual: 301.06M

       Physical: 28.71M

        I/O

            Ops: 3513425

          Bytes: 26.760G

Finally, upon completion, the MultiScan job report, detailing all four stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 209

MultiScan[209] phase 1 (2021-01-03T20:02:57)

--------------------------------------------

Elapsed time          307 seconds (5m7s)

Working time          307 seconds (5m7s)

Errors                0

Rebalance/LINs        2793

Rebalance/Files       2416

Rebalance/Directories 377

Rebalance/Errors      0

Rebalance/Bytes       372607773184 bytes (347.018G)

Collect/LINs          2788

Collect/Files         2411

Collect/Directories   377

Collect/Errors        0

Collect/Bytes         130187742208 bytes (121.247G)




MultiScan[209] phase 2 (2021-01-03T20:02:57)

--------------------------------------------

Elapsed time     0 seconds

Working time     0 seconds

Errors           0

LINs traversed   0

LINs processed   0

SINs traversed   0

SINs processed   0

Files seen       0

Directories seen 0

Total bytes      0 bytes




MultiScan[209] phase 3 (2021-01-03T20:02:58)

--------------------------------------------

Elapsed time          1 seconds

Working time          1 seconds

Errors                0

Rebalance/SINs        0

Rebalance/Files       0

Rebalance/Directories 0

Rebalance/Errors      0

Rebalance/Bytes       0 bytes

Collect/SINs          0

Collect/Files         0

Collect/Directories   0

Collect/Errors        0

Collect/Bytes         0 bytes

Unbalanced diskpools  Pool_Name = h600_18tb_3.2tb-ssd_256gb:2, free_blocks = 8693136159, total_blocks = 8715355092

Pool_Name = h600_18tb_3.2tb-ssd_256gb:3, free_blocks = 7259260440, total_blocks = 7262795910







MultiScan[209] phase 4 (2021-01-03T20:03:17)

--------------------------------------------

Elapsed time 19 seconds

Working time 19 seconds

Errors       0

Drives swept 33

LINs freed   0

Inodes freed 128359

Bytes freed  80022503424 bytes (74.527G)

Keys freed   0

Inodes lost  0

Using Dell EMC Isilon with Microsoft’s SQL Server Big Data Clusters

By Boni Bruno, Chief Solutions Architect | Dell EMC

Dell EMC Isilon

Dell EMC Isilon solves the hard scaling problems our customers have with consolidating and storing large amounts of unstructured data. Isilon’s scale-out design and multi-protocol support provide efficient deployment of data lakes as well as support for big data platforms such as Hadoop, Spark, and Kafka, to name a few.

In fact, the embedded HDFS implementation that comes with Isilon OneFS has been CERTIFIED by Cloudera for both HDP and CDH Hadoop distributions.  Dell EMC has also been recognized by Gartner as a Leader in the Gartner Magic Quadrant for Distributed File Systems and Object Storage four years in a row.  To that end, Dell EMC is delighted to announce that Isilon is a validated HDFS tiering solution for Microsoft’s SQL Server Big Data Clusters.

SQL Server Big Data Clusters & HDFS Tiering with Dell EMC Isilon

SQL Server Big Data Clusters allow you to deploy clusters of SQL Server, Spark, and HDFS containers on Kubernetes. With these components, you can combine and analyze MS SQL relational data with high-volume unstructured data on Dell EMC Isilon. This means that Dell EMC customers who have data on their Isilon clusters can now make their data available to their SQL Server Big Data Clusters for analytics using the embedded HDFS interface that comes with Isilon OneFS.

Note: The HDFS tiering feature of SQL Server 2019 Big Data Clusters currently does not support Cloudera Hadoop; however, Isilon provides immediate access to HDFS data with or without a Hadoop distribution deployed in the customer’s environment. This is a unique value proposition of the Dell EMC Isilon storage solution for SQL Server Big Data Clusters. Unstructured data stored on Isilon is directly accessed over HDFS and will transparently appear as local data to the SQL Server Big Data Cluster platform.

The Figure below depicts the overall architecture between SQL Server Big Data Cluster platform and Dell EMC Isilon or ECS storage solutions.

Dell EMC provides two storage solutions that can integrate with SQL Server Big Data Clusters. Dell EMC Isilon provides a high-performance scale-out HDFS solution, and Dell EMC ECS provides a high-capacity scale-out S3A solution; both are on-premises storage solutions.

We are currently working with Microsoft’s Azure team to make these storage solutions available to customers in the cloud as well. The remainder of this article provides details on how Dell EMC Isilon integrates with SQL Server Big Data Clusters over HDFS.

Setting up HDFS on Dell EMC Isilon

Enabling HDFS on Isilon is as simple as clicking a button in the OneFS GUI. Customers can create multiple access zones if needed; access zones provide a logical separation of data and users, with support for independent role-based access controls. For the purposes of this article, a “msbdc” access zone will be used for reference. By default, HDFS is disabled on a given access zone, as shown below:

To activate HDFS, simply click the Activate HDFS button. Note: HDFS licenses are free with the purchase of Isilon and can be installed under Cluster Management\Licenses.
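For admins who prefer the CLI, a rough equivalent check is sketched below; the msbdc zone name follows the example in this article, and the exact flags can vary by OneFS release, so treat this as illustrative rather than definitive:

# isi license list | grep -i hdfs
# isi services -a | grep hdfs
# isi hdfs settings view --zone=msbdc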

Once an HDFS license is installed and HDFS is activated on a given access zone, the HDFS settings can be viewed as shown below:

The GUI allows you to easily change the HDFS block size, set the authentication type, enable the Ranger security plugin, and so on. Isilon OneFS also supports various authentication providers and additional protocols as shown below:

Simply pick the authentication provider of your choice and specify the provider details to enable remote authentication services on Isilon. Note: Isilon OneFS has a robust security architecture and authentication, identity management, and authorization stack; you can find more details here.

The multi-protocol support included with Isilon allows customers to land data on Isilon over SMB, NFS, FTP, or HTTP and make all or part of the data available to SQL Server Big Data Clusters over HDFS without having a Hadoop cluster installed – Beautiful!

A key performance aspect of Dell EMC Isilon is the scale-out design of both the hardware and the integrated OneFS storage operating system.  Isilon OneFS provides a unique SmartConnect feature that provides HDFS namenode and datanode load balancing and redundancy.

To use SmartConnect, simply delegate a sub-domain of your choice on your internal DNS server to Isilon and OneFS will automatically load balance all the associated HDFS connections from SQL Server Big Data Clusters transparently across all the physical nodes on the Isilon storage cluster.

The SmartConnect zone name is configured under Cluster Management\Network Configuration\ in the OneFS GUI as shown below:

 

In the example screenshot above, the SmartConnect zone name is msbdc.dellemc.com. This means the delegated subdomain on the internal DNS server should be msbdc, and a nameserver (NS) record for the msbdc subdomain needs to point to the defined SmartConnect service IP.

The Service IP information is in the subnet details in the OneFS GUI as shown below:

In the above example, the service IP address is 10.10.10.10. So, creating a DNS record for 10.10.10.10 (e.g. isilon.dellemc.com) and an NS record for msbdc.dellemc.com that is served by isilon.dellemc.com (10.10.10.10) is all that would be needed in the internal DNS server configuration to take advantage of the built-in load balancing capabilities of Isilon.
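As an illustrative sketch only (record syntax depends on your DNS server; the names and IP come from the example above), the internal DNS configuration boils down to two BIND-style records:

isilon.dellemc.com.    IN  A   10.10.10.10
msbdc.dellemc.com.     IN  NS  isilon.dellemc.com.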

Use “ping” to validate the SmartConnect/DNS configuration. Multiple ping tests to msbdc.dellemc.com should return different IP addresses from Isilon; the range of IP addresses returned is defined by the IP pool range in the Isilon GUI.
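For example, using the zone name above, a few repeated lookups should rotate through the pool:

# ping -c 1 msbdc.dellemc.com
# ping -c 1 msbdc.dellemc.com

Each run should resolve to a different IP address from the configured pool range.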

A SQL Server Big Data Cluster would simply have a single mount configuration pointing to the defined SmartConnect zone name on Isilon. Details on how to set up the HDFS mount to Isilon from a SQL Server Big Data Cluster are presented in the next section.

SmartConnect makes storage administration easy. If more storage capacity is required, simply add more Isilon nodes to the cluster; storage capacity and I/O performance instantly increase without having to make a single configuration change to the SQL Server Big Data Clusters – BRILLIANT!

With HDFS enabled, the access zone defined, and the network/DNS configuration complete, the Isilon storage system can now be mounted by SQL Server Big Data Clusters.

Mounting Dell EMC Isilon from SQL Server Big Data Cluster

Assuming you have a SQL Server Big Data Cluster running, begin by opening a terminal session to connect to your SQL Server Big Data Cluster. You can obtain the IP address of the endpoint of the controller-svc-external service of your cluster with the following command:
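The original post showed this command as a screenshot; a typical way to find the endpoint (assuming kubectl access and a big data cluster namespace named mssql-cluster, which is an assumption here) is:

kubectl get svc controller-svc-external -n mssql-cluster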

Using the IP of the controller end point obtained from the above command, log into your big data cluster:
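The login command was also shown as a screenshot. As a hedged sketch, the SQL Server 2019 tooling of the time used azdata (earlier releases used mssqlctl); the endpoint address and port below are assumptions:

azdata login --endpoint https://<controller-ip>:30080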

Mount Isilon using HDFS on your SQL Server Big Data Cluster with the following command:
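The mount command itself was shown as a screenshot in the original. A hedged sketch with the azdata CLI, using the SmartConnect zone name, source directory, and mount point described in the note below (the HDFS port 8020 is an assumption), is:

azdata bdc hdfs mount create --remote-uri hdfs://msbdc.dellemc.com:8020/data --mount-path /mount1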

Note: hdfs://msbdc.dellemc.com is shown as an example; the HDFS URI must match the SmartConnect zone name defined in the Isilon configuration. The data directory specified is also an example; any directory that exists within the Isilon access zone can be used. Likewise, the mount point /mount1 shown above is just an example; any name can be used for the mount point.

An example of a successful response of the above mount command is shown below:

Create mount /mount1 submitted successfully.  Check mount status for progress.

Check the mount status with the following command:
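A hedged sketch of the status check with the same CLI:

azdata bdc hdfs mount status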

sample output:

Run an hdfs shell and list the contents on Isilon:
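The original showed this as a screenshot. One hedged way to do it is to exec into one of the cluster's Hadoop pods (the pod and namespace names here are assumptions) and use the standard hdfs client:

kubectl exec -it nmnode-0-0 -n mssql-cluster -- hdfs dfs -ls /mount1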

sample output:

In addition to using hdfs shell commands, you can use tools like Azure Data Studio to access and browse files over the HDFS service on Dell EMC Isilon. The example below uses Spark to read the data over HDFS:

To learn more about Dell EMC Isilon, please visit us at DellEMC.com.

 

Isilon OneFS and Hadoop Known Issues

The following are known issues that exist with OneFS and Hadoop HDFS integrations:

Oozie sharedlib deployment fails with Isilon

The deployment of the oozie shared libraries fails on Ambari 2.7/HDP 3.x against Isilon.

Oozie makes an RPC check for erasure coding when deploying the shared libraries. OneFS does not support HDFS erasure coding because it natively uses its own erasure coding for data protection, so the call fails and is handled poorly on the Oozie side of the code; this causes the deployment of the shared lib to fail.

[root@centos-01 ~]# /usr/hdp/current/oozie-server/bin/oozie-setup.sh sharelib create -fs hdfs://hdp-27.foo.com:8020 -locallib /usr/hdp/3.0.1.0-187/oozie/libserver

  setting OOZIE_CONFIG=${OOZIE_CONFIG:-/usr/hdp/current/oozie-server/conf}

  setting CATALINA_BASE=${CATALINA_BASE:-/usr/hdp/current/oozie-server/oozie-server}

  setting CATALINA_TMPDIR=${CATALINA_TMPDIR:-/var/tmp/oozie}

  setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat

  setting JAVA_HOME=/usr/jdk64/jdk1.8.0_112

  setting JRE_HOME=${JAVA_HOME}

  setting CATALINA_OPTS="$CATALINA_OPTS -Xmx2048m"

  setting OOZIE_LOG=/var/log/oozie

  setting CATALINA_PID=/var/run/oozie/oozie.pid

  setting OOZIE_DATA=/hadoop/oozie/data

  setting OOZIE_HTTP_PORT=11000

  setting OOZIE_ADMIN_PORT=11001

  setting JAVA_LIBRARY_PATH=/usr/hdp/3.0.1.0-187/hadoop/lib/native/Linux-amd64-64

  setting OOZIE_CLIENT_OPTS="${OOZIE_CLIENT_OPTS} -Doozie.connection.retry.count=5 "

  setting OOZIE_CONFIG=${OOZIE_CONFIG:-/usr/hdp/current/oozie-server/conf}

  setting CATALINA_BASE=${CATALINA_BASE:-/usr/hdp/current/oozie-server/oozie-server}

  setting CATALINA_TMPDIR=${CATALINA_TMPDIR:-/var/tmp/oozie}

  setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat

  setting JAVA_HOME=/usr/jdk64/jdk1.8.0_112

  setting JRE_HOME=${JAVA_HOME}

  setting CATALINA_OPTS="$CATALINA_OPTS -Xmx2048m"

  setting OOZIE_LOG=/var/log/oozie

  setting CATALINA_PID=/var/run/oozie/oozie.pid

  setting OOZIE_DATA=/hadoop/oozie/data

  setting OOZIE_HTTP_PORT=11000

  setting OOZIE_ADMIN_PORT=11001

  setting JAVA_LIBRARY_PATH=/usr/hdp/3.0.1.0-187/hadoop/lib/native/Linux-amd64-64

  setting OOZIE_CLIENT_OPTS="${OOZIE_CLIENT_OPTS} -Doozie.connection.retry.count=5 "

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/3.0.1.0-187/oozie/lib/slf4j-simple-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/3.0.1.0-187/oozie/libserver/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/3.0.1.0-187/oozie/libserver/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]

3138 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

4193 [main] INFO org.apache.hadoop.security.UserGroupInformation - Login successful for user oozie/centos-01.foo.com@FOO.COM using keytab file /etc/security/keytabs/oozie.service.keytab

4436 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir

4490 [main] INFO org.apache.hadoop.security.SecurityUtil - Updating Configuration

log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Found Hadoop that supports Erasure Coding. Trying to disable Erasure Coding for path: /user/root/share/lib

Error invoking method with reflection





Error: java.lang.reflect.InvocationTargetException

Stack trace for the error was (for debug purposes):

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException

        at org.apache.oozie.tools.ECPolicyDisabler.invokeMethod(ECPolicyDisabler.java:111)

        at org.apache.oozie.tools.ECPolicyDisabler.tryDisableECPolicyForPath(ECPolicyDisabler.java:47)

        at org.apache.oozie.tools.OozieSharelibCLI.run(OozieSharelibCLI.java:171)

        at org.apache.oozie.tools.OozieSharelibCLI.main(OozieSharelibCLI.java:67)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:498)

        at org.apache.oozie.tools.ECPolicyDisabler.invokeMethod(ECPolicyDisabler.java:108)

        ... 3 more

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException): Unknown RPC: getErasureCodingPolicy

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)

        at org.apache.hadoop.ipc.Client.call(Client.java:1443)

        at org.apache.hadoop.ipc.Client.call(Client.java:1353)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)

        at com.sun.proxy.$Proxy9.getErasureCodingPolicy(Unknown Source)

        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getErasureCodingPolicy(ClientNamenodeProtocolTranslatorPB.java:1892)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:498)

        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)

        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)

        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)

        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)

        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)

        at com.sun.proxy.$Proxy10.getErasureCodingPolicy(Unknown Source)

        at org.apache.hadoop.hdfs.DFSClient.getErasureCodingPolicy(DFSClient.java:3082)

        at org.apache.hadoop.hdfs.DistributedFileSystem$66.doCall(DistributedFileSystem.java:2884)

        at org.apache.hadoop.hdfs.DistributedFileSystem$66.doCall(DistributedFileSystem.java:2881)

        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)

        at org.apache.hadoop.hdfs.DistributedFileSystem.getErasureCodingPolicy(DistributedFileSystem.java:2898)

        ... 8 more
A workaround is to manually copy and unpack the oozie-sharelib.tar.gz to /user/oozie/share/lib, as sketched below.
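A minimal sketch of that workaround, assuming the sharelib tarball lives in the usual HDP oozie-server directory and the commands are run as the oozie user (both are assumptions; adjust paths for your install):

# su - oozie
$ tar -xzf /usr/hdp/current/oozie-server/oozie-sharelib.tar.gz -C /tmp
$ hdfs dfs -mkdir -p /user/oozie/share/lib
$ hdfs dfs -put /tmp/share/lib/* /user/oozie/share/lib/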

Cloudera BDR integration with Cloudera Manager Based Isilon Integration

Cloudera CDH with BDR is no longer supported with Isilon; CDH fails to integrate BDR completely with a Cloudera Manager based Isilon cluster.

Upgrading Ambari 2.6.5 to 2.7 – setfacl issue with Hive

Per the following procedure: http://www.unstructureddatatips.com/upgrade-hortonworks-hdp2-6-5-to-hdp3-on-dellemc-isilon-onefs-8-1-2-and-later/

When upgrading from Ambari 2.6.5 to 2.7 with the Hive service installed, the following must be completed prior to the upgrade; otherwise the upgrade process will stall with an Unknown RPC issue as seen below.

 

The Isilon OneFS HDFS service does not support HDFS ACLs, and the resulting setfacl call will cause the upgrade to stall.

Add the property dfs.namenode.acls.enabled=false to the custom hdfs-site prior to upgrading; this prevents the upgrade from attempting to use setfacl.
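As a sketch, the entry added under Custom hdfs-site is just the key/value pair below; the hdfs getconf verification step is an addition here, not part of the original procedure:

dfs.namenode.acls.enabled=false

# verify from a Hadoop client node after restarting the affected services
$ hdfs getconf -confKey dfs.namenode.acls.enabled
false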

Restart any services that need restarting.

Execute the upgrade per the procedure and the Hive setfacl issue will not be encountered.

Additional upgrade issues you may see:

– Error mapping uname ‘yarn-ats’ to uid (resolved by creating the yarn-ats user: isi auth users create yarn-ats --zone=<hdfs zone>)

– MySQL dependency error (resolved by executing: ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar)

– Ambari Metrics restart issue. Reference: http://www.ryanchapin.com/fv-b-4-818/-SOLVED--Unable-to-Connect-to-ambari-metrics-collector-Issues.html

 

OneFS 8.2 Local Service Accounts need to be ENABLED

With the release of OneFS 8.2, a number of changes were made in the identity management stack. One modification required on 8.2 is that local accounts need to be in the enabled state to be used for identity; in prior versions, local account IDs could be used with the local account disabled.

In 8.2, all local accounts must be ENABLED to be used for ID management by OneFS. This is a change in behavior:

  • In 8.1.2 and prior, local accounts were functional even when disabled
  • On upgrade to 8.2, all such accounts should be set to the enabled state
  • Where possible, enable all accounts prior to upgrade

The latest version of the create_users script in the isilon_hadoop_tools GitHub repository will now create enabled users by default.

Enabling an account does not make it interactive-logon aware; these accounts are still just IDs used by Isilon for HDFS ID management.
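As a hedged example (the zone name here is just an assumption; verify the exact flags on your OneFS release), the state of a local account can be checked and enabled from the CLI along these lines:

# isi auth users view hdfs --zone=zone-hdp27
# isi auth users modify hdfs --zone=zone-hdp27 --enabled=yes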

 

Support for HDP 3.1 with the Isilon Management Pack 1.0.1.0

With the release of the Isilon Management Pack 1.0.1.0, support for HDP 3.1 is included. The procedure to upgrade the mpack is listed here if mpack 1.0.0.1 was installed with HDP 3.0.1.

Before upgrading the mpack, the following KB should be consulted to assess the status of the Kerberized Spark2 services and whether modifications were made to 3.0.1 installs in Ambari: Isilon: Spark2 fails to start after Kerberization with HDP 3 and OneFS due to missing configurations

Upgrade the Isilon Ambari Management Pack

  1. Download the Isilon Ambari Management Pack
  2. Install the management pack by running the following commands on the
    Ambari server:
    
    ambari-server upgrade-mpack --mpack=<path-to-new-mpack.tar.gz> --verbose
    
    ambari-server restart

     

How to determine the Isilon Ambari Management Pack version

On the Ambari server host run the following command:

ls /var/lib/ambari-server/resources/mpacks | grep "onefs-ambari-mpack-"

The output will appear similar to this, where x.x.x.x indicates which version of the IAMP is currently installed:

onefs-ambari-mpack-x.x.x.x

How to find the README in Isilon Ambari Management Pack 1.0.1.0

Download the Isilon Ambari Management Pack

  1. Run the following command to extract the contents:
    • tar -zxvf isilon-onefs-mpack-1.0.1.0.tar.gz
  2. The README is located under isilon-onefs-mpack-1.0.1.0/addon-services/ONEFS/1.0.0/support/README
  3. Please review the README for release information.

 

The release of OneFS 8.2 brings changes to Hadoop Cluster Deployment and Setup

Prior to 8.2, the following two configurations were required to support a Hadoop cluster:

  1. Modification to the Access Control List Policy setting for OneFS is no longer needed

We used to run ‘isi auth settings acls modify --group-owner-inheritance=parent’ to make the OneFS file system act like an HDFS file system. This was a global setting that affected the whole cluster and other workflows. In 8.2 this is no longer needed: HDFS operations act this way natively, so the setting is not required. Do not run this command when setting up HDFS on new 8.2 clusters; if it was previously set on 8.1.2 or earlier, it is suggested to leave the setting as is, because modifying it can affect other workflows.

  2. hdfs-to-root mappings are not needed – replaced by RBAC

Prior to 8.2, hdfs => root mappings were required to facilitate the behavior of the hdfs account. In 8.2 this root mapping has been replaced with an RBAC privilege: no root mapping is needed, and instead the following RBAC role with the specified privileges should be created, adding any account that needs this access.

isi auth roles create --name=hdfs_access --description="Bypass FS permissions" --zone=System
isi auth roles modify hdfs_access --add-priv=ISI_PRIV_IFS_RESTORE --zone=System
isi auth roles modify hdfs_access --add-priv=ISI_PRIV_IFS_BACKUP --zone=System
isi auth roles modify hdfs_access --add-user=hdfs --zone=System
isi auth roles view hdfs_access --zone=System
isi_for_array "isi auth mapping flush --all"
isi_for_array "isi auth cache flush --all"

 

The installation guides will reflect these changes shortly.

Summary:

8.1.2 and Earlier:

hdfs=>root mapping

ACL Policy Change Needed

8.2 and Later

RBAC role for hdfs

No ACL Policy Change

 

When using Ambari 2.7 and the Isilon Management Pack, the following is seen in the Isilon hdfs.log:

isilon-3: 2019-05-14T14:34:06-04:00 <30.4> isilon-3 hdfs[95183]: [hdfs] Ambari: Agent for zone 12 got a bad exit code from its Ambari server. The agent will attempt to recover.

isilon-3: 2019-05-14T14:34:06-04:00 <30.6> isilon-3 hdfs[95183]: [hdfs] Ambari: The Ambari server for zone 12 is running a version unsupported by OneFS: 2.7.1.0. Agent will reset and retry until a supported Ambari server version is installed or Ambari is no longer enabled for this zone

isilon-3: 2019-05-14T14:35:12-04:00 <30.4> isilon-3 hdfs[95183]: [hdfs] Ambari: Agent for zone 12 got a bad exit code from its Ambari server. The agent will attempt to recover.

isilon-3: 2019-05-14T14:35:12-04:00 <30.6> isilon-3 hdfs[95183]: [hdfs] Ambari: The Ambari server for zone 12 is running a version unsupported by OneFS: 2.7.1.0. Agent will reset and retry until a supported Ambari server version is installed or Ambari is no longer enabled for this zone

isilon-3: 2019-05-14T14:36:17-04:00 <30.4> isilon-3 hdfs[95183]: [hdfs] Ambari: Agent for zone 12 got a bad exit code from its Ambari server. The agent will attempt to recover.

isilon-3: 2019-05-14T14:36:17-04:00 <30.6> isilon-3 hdfs[95183]: [hdfs] Ambari: The Ambari server for zone 12 is running a version unsupported by OneFS: 2.7.1.0. Agent will reset and retry until a supported Ambari server version is installed or Ambari is no longer enabled for this zone

When using Ambari with the Isilon Management Pack, Isilon should not be configured with an Ambari Server or ODP version as they are no longer needed since the Management Pack is in use.

If they have been added, remove them from the Isilon HDFS configuration for the zone in question. This only applies to Ambari 2.7 with the Isilon Management Pack; Ambari 2.6 and earlier still require these settings.
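For example, to clear both values for the zone shown below, you can use the same isi hdfs settings flags that appear later in the HDP upgrade section:

isilon-1# isi hdfs settings modify --zone=zone-hdp27 --ambari-server=
isilon-1# isi hdfs settings modify --zone=zone-hdp27 --odp-version=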

isilon-1# isi hdfs settings view --zone=zone-hdp27
                 Service: Yes
      Default Block Size: 128M
   Default Checksum Type: none
     Authentication Mode: kerberos_only
          Root Directory: /ifs/zone/hdp27/hadoop-root
         WebHDFS Enabled: Yes
           Ambari Server: -
         Ambari Namenode: hdp-27.foo.com
             Odp Version: -
    Data Transfer Cipher: none
Ambari Metrics Collector: centos-01.foo.com

 

Ambari sees LDAPS issue connecting to AD during Kerberization

05 Apr 2018 20:05:14,081 ERROR [ambari-client-thread-38] KerberosHelperImpl:2379 - Cannot validate credentials: org.apache.ambari.server.serveraction.kerberos.KerberosInvalidConfigurationException: Failed to connect to KDC - Failed to communicate with the Active Directory at ldaps://rduvnode217745.west.isilon.com/DC=AMB3,DC=COM: simple bind failed: rduvnode217745.west.isilon.com:636

Make sure the server’s SSL certificate or CA certificates have been imported into Ambari’s truststore.
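As an illustrative sketch only (the certificate path, alias, and truststore location are assumptions; many Ambari installations simply trust the JDK's default cacerts store), importing the AD CA certificate could look like:

# keytool -importcert -alias ad-ca -file /tmp/ad-ca.crt -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit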

 

Review the following KB from Hortonworks on resolving this Ambari Issue:

https://community.hortonworks.com/content/supportkb/148572/failed-to-connect-to-kdc-make-sure-the-servers-ssl.html

 

HDFS rollup patch for 8.1.2 – Patch-240163:

Patch for OneFS 8.1.2.0. This patch addresses issues with the Hadoop Distributed File System (HDFS).

********************************************************************************

This patch can be installed on clusters running the following OneFS version:

8.1.2.0

This patch deprecates the following patch:

Patch-236288

 

This patch conflicts with the following patches:

Patch-237113

Patch-237483

 

If any conflicting or deprecated patches are installed on the cluster, you must remove them before installing this patch.

********************************************************************************

RESOLVED ISSUES

 

* Bug ID 240177

The Hadoop Distributed File System (HDFS) rename command did not manage file handles correctly and might have caused data unavailability with a STATUS_TOO_MANY_OPEN_FILES error.

 

* Bug ID 236286

If a OneFS cluster had the Hadoop Distributed File System (HDFS) configured for Kerberos authentication, WebHDFS requests over curl might have failed to follow a redirect request.

 

 

WebHDFS issue with Kerberized 8.1.2 – curl requests fail to follow redirects; Service Checks and Ambari Views will fail

 

Isilon HDFS error: STATUS_TOO_MANY_OPENED_FILES causes jobs to fail

 

OneFS 8.0.0.X and Cloudera Impala 5.12.X: Impala queries fail with `WARNINGS: TableLoadingException: Failed to load metadata for table: <tablename> , CAUSED BY: IllegalStateException: null`

 

Ambari agent fails to send heartbeats to Ambari server if agent is running on a NANON node

NameNode gives out any IP addresses in an access zone, even across pools and subnets; client connection may fail as a result

Other Known Issues

  1. Host registration fails on RHEL 7 hosts with openssl issues

Modify the ambari-agent.ini file:

/etc/ambari-agent/conf/ambari-agent.ini

[security]

force_https_protocol=PROTOCOL_TLSv1_2

 

Restart the ambari-server and all ambari-agents.
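For example:

# ambari-server restart
# ambari-agent restart    (run on each agent host)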

https://community.hortonworks.com/questions/145/openssl-error-upon-host-registration.html

 

In OneFS 9.0.0, the services are now disabled by default

Check the service status using isi services -a

hop-ps-a-3# isi services -a
Available Services:
apache2              Apache2 Web Server                       Enabled
auth                 Authentication Service                   Enabled
celog                Cluster Event Log                        Enabled
connectemc           ConnectEMC Service                       Disabled
cron                 System cron Daemon                       Enabled
dell_dcism           Dell iDRAC Service Module                Enabled
dell_powertools      Dell PowerTools Agent Daemon             Enabled
dmilog               DMI log monitor                          Enabled
gmond                Ganglia node monitor                     Disabled
hdfs                 HDFS Server                              Disabled

Enable the hdfs service manually to get going with Hadoop cluster access from an HDFS client, for example:
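A hedged example follows; the exact syntax can vary by release, and on some versions the -a flag is needed for services that only appear with isi services -a:

hop-ps-a-3# isi services hdfs enable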

Upgrade Hortonworks HDP 2.6.5 to HDP 3.* on DellEMC Isilon OneFS 8.1.2 and later

Introduction

This blog post walks you through the process of upgrading Hortonworks Data Platform (HDP) 2.6.5 to HDP 3.0.1 or HDP 3.1.0 on DellEMC Isilon OneFS 8.1.2/OneFS 8.2. This is intended for systems administrators, IT program managers, IT architects, and IT managers who are upgrading Hortonworks Data Platform installed on OneFS 8.1.2.0 or later versions.

There are two official ways to upgrade to HDP 3.* as follows:

    1. Deploy a fresh HDP 3.* cluster and migrate existing data using Data Lifecycle Manager or distributed copy (distcp).
    2. Perform an in-place upgrade of an existing HDP 2.6.x cluster.

This post will demonstrate in-place upgrades. Make sure your cluster is ready and meets all the success criteria as mentioned here and in the official Hortonworks Upgrade documentation.

The installation or upgrade process of the new HDP 3.0.1 and later versions on Isilon OneFS 8.1.2 and later versions has changed as follows:

OneFS is no longer presented as a host to the HDP cluster; instead, OneFS is managed as a dedicated service in place of HDFS by installing a management pack called the Ambari Management Pack for Isilon OneFS. This is a software component that can be installed on the Ambari Server to define OneFS as a service in a Hadoop cluster. The management pack allows an Ambari administrator to start, stop, and configure OneFS as an HDFS storage service, providing native NameNode and DataNode capabilities similar to traditional HDFS.

This management pack is OneFS release-independent and can be updated in between releases if needed.

Prerequisites

    1. Hadoop cluster running HDP 2.6.5 and Ambari Server 2.6.2.2.
    2. DellEMC Isilon OneFS updated to 8.1.2 and patch 240163 installed.
    3. Ambari Management Pack for Isilon OneFS, downloaded from here.
    4. HDFS to OneFS service converter script, downloaded from here.

We will perform the upgrade in two parts: first we will make the changes on the OneFS host, followed by updates on the HDP cluster.

OneFS Host Preparation

The step-by-step process to prepare the OneFS host for the HDP upgrade is as follows:

    1. First make sure the Isilon OneFS cluster is running 8.1.2, installed with the latest patch available. Check DellEMC support or Current Isilon OneFS Patches.

  2. HDP 3.0.1 comes with YARN Timeline Service 2.0, which relies on the yarn-ats user and dedicated HBase storage in the back end for collecting YARN app and job framework metrics. For this, we create two new users, yarn-ats and yarn-ats-hbase, on the OneFS host.

Log in to the Isilon OneFS terminal node using root credentials, and run the following commands:

isi auth group create yarn-ats
isi auth users create yarn-ats --primary-group yarn-ats --home-directory=/ifs/home/yarn-ats
isi auth group create yarn-ats-hbase
isi auth users create yarn-ats-hbase --primary-group yarn-ats-hbase --home-directory=/ifs/home/yarn-ats-hbase
  3. Once the new users are created, you need to map yarn-ats-hbase to yarn-ats on the OneFS host. This step is required only if you are going to secure the HDP cluster with Kerberization.
isi zone modify --add-user-mapping-rules="yarn-ats-hbase=>yarn-ats" --zone=ZONE_NAME

This user mapping depends on the mode of the Timeline Service 2.0 installation. Read those instructions carefully and choose the deployment mode accordingly to avoid an ats-hbase service failure.

You can skip the yarn-ats-hbase to yarn-ats user mapping in the following two cases:

    • If Timeline Service 2.0 is deployed in Embedded or System Service mode, you can instead rename the yarn-ats-hbase principals to yarn-ats during Kerberization.
    • There is no need to set the user mapping if Timeline Service 2.0 is configured on an external HBase.

HDP Cluster preparation and upgrade

Follow the steps as documented. You must meet all of the prerequisites in the Hortonworks upgrade document.

  1. Before starting the process, make sure the HDP 2.6.5 cluster is healthy by doing a service check, and address all of the alerts, if any are displayed.

  2. Now stop the HDFS service and all other components running on the OneFS host.

  3. Delete the DataNode/NameNode/Secondary NameNode using the following curl commands:

Note that before DN/NN and SNN are deleted, you’ll see something like the following:

Use the following curl commands to delete the DN, NN and SNN:

export AMBARI_SERVER=<Ambar server IP/FQDN>
export CLUSTER=<HDP2.6.5 cluster name>
export HOST=<OneFS host FQDN>
curl -u admin:admin -H "X-Requested-By: Ambari" -X DELETE http://$AMBARI_SERVER:8080/api/v1/clusters/$CLUSTER/hosts/$HOST/host_components/DATANODE
curl -u admin:admin -H "X-Requested-By: Ambari" -X DELETE http://$AMBARI_SERVER:8080/api/v1/clusters/$CLUSTER/hosts/$HOST/host_components/NAMENODE
curl -u admin:admin -H "X-Requested-By: Ambari" -X DELETE http://$AMBARI_SERVER:8080/api/v1/clusters/$CLUSTER/hosts/$HOST/host_components/SECONDARY_NAMENODE

After deleting the DN/NN and SNN, you’ll see something similar to the following:

  4. Manually delete the OneFS host from the Ambari Server UI.

The following steps, five through nine, are critical and relate to the Hortonworks HDP upgrade process. Refer to the Hortonworks upgrade documentation or consult Hortonworks support if necessary.

Note: Steps five to nine in the HDP upgrade process below relate to the services running on our POC cluster. You will have to perform the backups, migrations, and upgrades of the necessary services as described in the Hortonworks documentation before going to step 10.

———-

  5. Upgrade the Ambari Server and agents to 2.7.1 by following the Hortonworks Ambari Server upgrade document.

  6. Register and install HDP 3.0.1 by following the steps in this Hortonworks HDP register and install target version guide.
  7. Upgrade Ambari Metrics by following the steps in this upgrade Ambari Metrics guide.
  8. Note: This next step is critical: perform a service check on all the services and make sure to address all alerts, if any.
  9. Click upgrade and complete the upgrade process. Address any issues encountered before proceeding to avoid service failures after finalizing the upgrade.

A screen similar to the following displays:

———–

After the successful upgrade to HDP 3.0.1, continue by installing the Ambari Management Pack for Isilon OneFS on the upgraded Ambari Server.
  10. For the Ambari Server Management Pack installation, log in to the Ambari Server terminal, download the management pack, install it, and then restart the Ambari server.

a. Download the Ambari Management Pack for Isilon OneFS from here

b. Install the management pack as shown below. Once it is installed, the following displays: Ambari Server ‘install-mpack’ completed successfully.

root@RDUVNODE334518:~ # ambari-server install-mpack --mpack=isilon-onefs-mpack-0.1.0.0.tar.gz --verbose
Using python /usr/bin/python
Installing management pack
INFO: Loading properties from /etc/ambari-server/conf/ambari.properties
INFO: Installing management pack isilon-onefs-mpack-0.1.0.0-SNAPSHOT.tar.gz
INFO: Loading properties from /etc/ambari-server/conf/ambari.properties
INFO: Download management pack to temp location /var/lib/ambari-server/data/tmp/isilon-onefs-mpack-0.1.0.0-SNAPSHOT.tar.gz
INFO: Loading properties from /etc/ambari-server/conf/ambari.properties
INFO: Expand management pack at temp location /var/lib/ambari-server/data/tmp/isilon-onefs-mpack-0.1.0.0-SNAPSHOT/
2018-11-07 06:36:39,137 - Execute[('tar', '-xf', '/var/lib/ambari-server/data/tmp/isilon-onefs-mpack-0.1.0.0-SNAPSHOT.tar.gz', '-C', '/var/lib/ambari-server/data/tmp/')] {'tries': 3, 'sudo': True, 'try_sleep': 1}
INFO: Loading properties from /etc/ambari-server/conf/ambari.properties
INFO: Loading properties from /etc/ambari-server/conf/ambari.properties
INFO: Stage management pack onefs-ambari-mpack-0.1 to staging location /var/lib/ambari-server/resources/mpacks/onefs-ambari-mpack-0.1
INFO: Processing artifact ONEFS-addon-services of type stack-addon-service-definitions in /var/lib/ambari-server/resources/mpacks/onefs-ambari-mpack-0.1/addon-services
INFO: Loading properties from /etc/ambari-server/conf/ambari.properties
INFO: Loading properties from /etc/ambari-server/conf/ambari.properties
INFO: Adjusting file permissions and ownerships
INFO: about to run command: chmod -R 0755 /var/lib/ambari-server/resources/stacks
INFO:
process_pid=28352
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/stacks
INFO:
process_pid=28353
INFO: about to run command: chmod -R 0755 /var/lib/ambari-server/resources/extensions
INFO:
process_pid=28354
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/extensions
INFO:
process_pid=28355
INFO: about to run command: chmod -R 0755 /var/lib/ambari-server/resources/common-services
INFO:
process_pid=28356
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/common-services
INFO:
process_pid=28357
INFO: about to run command: chmod -R 0755 /var/lib/ambari-server/resources/mpacks
INFO:
process_pid=28358
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/mpacks
INFO:
process_pid=28359
INFO: about to run command: chmod -R 0755 /var/lib/ambari-server/resources/mpacks/cache
INFO:
process_pid=28360
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/mpacks/cache
INFO:
process_pid=28361
INFO: about to run command: chmod -R 0755 /var/lib/ambari-server/resources/dashboards
INFO:
process_pid=28362
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/dashboards
INFO:
process_pid=28363
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/stacks
INFO:
process_pid=28364
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/extensions
INFO:
process_pid=28365
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/common-services
INFO:
process_pid=28366
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/mpacks
INFO:
process_pid=28367
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/mpacks/cache
INFO:
process_pid=28368
INFO: about to run command: chown -R -L root /var/lib/ambari-server/resources/dashboards
INFO:
process_pid=28369
INFO: Management pack onefs-ambari-mpack-0.1 successfully installed! Please restart ambari-server.
INFO: Loading properties from /etc/ambari-server/conf/ambari.properties
Ambari Server 'install-mpack' completed successfully.

c. Restart the Ambari Server.

root@RDUVNODE334518:~ # ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start................
Server started listening on 8080

DB configs consistency check: no errors and warnings were found.

 

  11. Replace the HDFS service with the OneFS service; the management pack installed contains the OneFS service-related settings.

For this step, delete the HDFS service, add the OneFS service installed from the Ambari Management Pack above, and copy the HDFS service configuration into the OneFS service.

a. To delete HDFS, add the OneFS service, and copy the configuration, an automation tool, hdfs_to_onefs_convertor.py, is provided.

Log in to the Ambari Server terminal and download the script from here.

wget --no-check-certificate https://raw.githubusercontent.com/apache/ambari/trunk/contrib/management-packs/isilon-onefs-mpack/src/main/tools/hdfs_to_onefs_convert.py

b. Now run the script, passing the Ambari server and cluster name as parameters. Once it completes, you will see all the services up and running.

root@RDUVNODE334518:~ # python hdfs_to_onefs_convertor.py -o 'RDUVNODE334518.west.isilon.com' -c 'hdpupgd'
This script will replace the HDFS service to ONEFS
The following prerequisites are required:
* ONEFS management package must be installed
* Ambari must be upgraded to >=v2.7.0
* Stack must be upgraded to HDP-3.0
* Is highly recommended to backup ambari database before you proceed.
Checking Cluster: hdpupgd (http://RDUVNODE334518.west.isilon.com:8080/api/v1/clusters/hdpupgd)
Found stack HDP-3.0
Please, confirm you have made backup of the Ambari db [y/n] (n)? y
Collecting hosts with HDFS_CLIENT
Found hosts [u'rduvnode334518.west.isilon.com']
Stopping all services..
Downloading core-site..
Downloading hdfs-site..
Downloading hadoop-env..
Deleting HDFS..
Adding ONEFS..
Adding ONEFS config..
Adding core-site
Adding hdfs-site
Adding hadoop-env-site
Adding ONEFS_CLIENT to hosts: [u'rduvnode334518.west.isilon.com']
Starting all services..
root@RDUVNODE334518:~ #


  12. At this point, you have successfully upgraded to HDP 3.0.1 and replaced the HDFS service with the OneFS service. From now on, Isilon OneFS acts only as an HDFS storage layer, so you can remove the Ambari Server and ODP Version settings from Isilon’s HDFS settings as follows:
kbhusan-y93o5ew-1# isi hdfs settings modify --zone=System --odp-version=
kbhusan-y93o5ew-1# isi hdfs settings modify --zone=System --ambari-server=
kbhusan-y93o5ew-1# isi hdfs settings view
Service: Yes
Default Block Size: 128M
Default Checksum Type: none
Authentication Mode: all
Root Directory: /ifs/hdfs-root
WebHDFS Enabled: Yes
Ambari Server: -
Ambari Namenode: kb-hdp-1.west.isilon.com
Odp Version: -
Data Transfer Cipher: none
Ambari Metrics Collector: kb-hdp-1.west.isilon.com
kbhusan-y93o5ew-1#


13. Log in to the Ambari Web UI and check the OneFS service and its configuration. Perform the service check.

A screen similar to the following displays:

Review the results:

Summary

In this blog, we demonstrated how you can successfully upgrade the Apache Ambari Server and agents to 2.7.1 and Hortonworks HDP 2.6.5 to HDP 3.0.1 on DellEMC Isilon OneFS 8.1.2 installed with the latest patch available. The same steps apply to upgrading to versions later than HDP 3.0.1.

We installed the Ambari Management Pack for DellEMC Isilon OneFS, which replaced the HDFS service with the OneFS service. This enables an Ambari administrator to start, stop, and configure OneFS as an HDFS storage service, and it also provides DellEMC Isilon OneFS with native NameNode and DataNode capabilities like traditional HDFS.