OneFS Deadlocks and Hang-dumps – Part 2

As mentioned in the previous article in this series, hang-dumps can occur under the following circumstances.

Type Description
Transient The time to obtain the lock was long enough to trigger a hang-dump, but the lock is eventually granted. This is the less serious situation. The symptoms are typically general performance degradation, but the cluster is still responsive and able to progress.
Persistent The issue typically requires significant remedial action, such as node reboots. This is usually  indicative of a bug in OneFS, although it could also be caused by hardware issues, where hardware becomes unresponsive, and OneFS waits indefinitely for it to recover.

Certain normal OneFS operations, such as those involving very large files, have the potential to trigger a hang-dump with no long-term ill effects. However, in some situations the thread or process waiting for the lock to be freed, or ‘waiter’, is never actually granted the lock on the file. In such cases, users may be impacted.

If a hang-dump is generated as a result of a LIN lock timeout  (the most likely scenario), this indicates that at least one thread in the system has been waiting for a LIN lock for over 90 seconds. The system hang can involve a single thread, or sometimes multiple threads, for example blocking a batch job. The system hang could be affecting interactive session(s), in which case the affected cluster users will likely experience performance impacts.

Specifically, in the case of a LIN lock timeout, if the LIN number is available, it can be easily mapped back to its associated filename using the ‘isi get’ CLI command.

# isi get -L <lin #>

However, a LIN which is still locked may necessitate waiting until the lock is freed before getting the name of the file.

By default, OneFS hang-dump files are written to the /var/crash directory as compressed text files. During a hang-dump investigation, Dell support typically utilizes internal tools to analyze the logs from all of the nodes and generate a graph to show the lock interactions between the lock holders (the thread or process that is holding the file) and lock waiters. The analytics are per-node and include a full dump of the lock state as seen by the local node, the stack of each thread in the system, plus a variety of other diagnostics including memory usage, etc. Since OneFS source-code access is generally required in order to interpret the stack traces, Dell Support can help investigate the hang-dump log file data, which can then be used to drive further troubleshooting.

A deadlocked cluster may exhibit one or more of the following symptoms:

  • Clients are unable to communicate with the cluster via SMB, NFS, SSH, etc.
  • The WebUI is unavailable and/or commands executed from the CLI fail to start or complete.
  • Processes cannot be terminated, even with SIGKILL (kill -9).
  • Degraded cluster performance is experienced, with low or no CPU/network/disk usage.
  • Inability to access files or folders under /ifs.

In order to recover from a deadlock, Dell support’s remediation will sometimes require panicking or rebooting a cluster. In such instances, thorough diagnostic information gathering should be performed prior to this drastic step. Without this diagnostic data, it will be often be impossible to determine the root cause of the deadlock. If the underlying cause of the deadlock is not corrected, rebooting the cluster and restarting the service may not resolve the issue.

The following steps can be run in order to gather data that will be helpful in determining the cause of a deadlock:

  1. First, verify that there are no indeterminate journal transactions. If there are indeterminate journal transactions found, rebooting or panicking nodes will not resolve the issue.
# isi_for_array -X 'sysctl efs.journal.indeterminate_txns'

1: efs.journal.indeterminate_txns: 0
2: efs.journal.indeterminate_txns: 0
3: efs.journal.indeterminate_txns: 0

For each node, if the output of the above command returns zero, this indicates its journal is intact and all transactions are complete. Note that if the output is anything other than zero, the cluster contains indeterminate transactions, and Dell support should be engaged before any further troubleshooting is performed.

2. Next, check the /var/crash/directory for any recently created hang-dump files:

# isi_for_array -s 'ls -l /var/crash | grep -i hang'

Scan the /var/log/messages/ file for any recent references to ‘LOCK TIMEOUT’.

# isi_for_array -s 'egrep -i "lock timeout|hang" /var/log/messages | grep $(date +%Y-%m-%d)'

3.Collect the output from the ‘fstat’ CLI command, which identifies active files:

# isi_for_array -s 'fstat -m > /ifs/data/Isilon_Support/deadlock-data/fstat_$(hostname).txt'&

4. Record the Group Management Protocol (GMP) merge lock state:

# isi_for_array -s 'sysctl efs.gmp.merge_lock_state > /ifs/data/Isilon_Support/deadlock-data/merge_lock_state_$(hostname).txt'

5. Finally, run an ‘isi diagnostics gather’ logset gather to capture relevant cluster data and send the resulting zipped tarfile to Dell Support (via ESRS, FTP, etc).

# isi diagnostics gather start

A cluster reboot can be accomplished via an SSH connection as root to any node in the cluster, as follows:

# isi config

Welcome to the PowerScale configuration console.

Copyright (c) 2001-2022 Dell Inc. All Rights Reserved.

Enter 'help' to see list of available commands.

Enter 'help <command>' to see help for a specific command.

Enter 'quit' at any prompt to discard changes and exit.

        Node build: Isilon OneFS B_MAIN_2978(RELEASE)

        Node serial number: JACNT194540666

TME1 >>> reboot all

!! You are about to reboot the entire cluster

Are you sure you wish to continue? [no] yes

Alternatively, the following CLI syntax can be used to reboot a single node from an SSH connection to it:

# kldload reboot_me

Or to reboot the cluster:

# isi_for_array -x$(isi_nodes -L %{lnn}) 'kldload reboot_me'

Note that simply shutting down or rebooting the affected node(s), or the entire cluster), while typically the quickest path to get up and running again, will not generate the core files required for debugging. If a root cause analysis is desired, these node(s) will need to be panicked in order to generate a dump of all active threads.

Only perform a node panic under the direct supervision of Dell Support! Be aware that panics bypass a number of important node shutdown functions, including unmounting /ifs, etc. However, a panic will generate additional kernel core information which is typically required by Dell Support in order to perform a thorough diagnosis. In situations where the entire cluster needs to be panicked, the recommendation is to start with the highest numbered node and work down to lowest. For each node that’s panicked, the debug information is written to the /var/crash directory, and can be identified by the ‘vmcore’ prefix.

If instructed by Dell Support to do so, the ‘isi_rbm_panic’ CLI command can be used to panic a node, with the argument being the logical node number (LNN) of the desired node to target. For example, to panic a node with LNN=2:

# isi_rbm_panic 2

If in any doubt, the following CLI syntax will return the corresponding node ID and node LNN for each node in the cluster:

# isi_nodes %{id} , %{lnn}

Leave a Reply

Your email address will not be published. Required fields are marked *