Cluster Composition and GMP – Part 2

As we saw in the first of these blog articles, in OneFS parlance a group is a list of nodes, drives and protocols which are currently participating in the cluster. Under normal operating conditions, every node and its requisite disks are part of the current group, and the group’s status can be viewed by running sysctl efs.gmp.group on any node of the cluster.

For example, on a three node cluster:

# sysctl efs.gmp.group

efs.gmp.group: <2,288>: { 1-3:0-11, smb: 1-3, nfs: 1-3, hdfs: 1-3, all_enabled_protocols: 1-3, isi_cbind_d: 1-3, lsass: 1-3, s3: 1-3 }

So, OneFS group info comprises three main parts:

  • Sequence number: Provides identification for the group (i.e. '<2,288>')
  • Membership list: Describes the group (i.e. '1-3:0-11')
  • Protocol list: Shows which nodes are supporting which protocol services (i.e. '{ smb: 1-3, nfs: 1-3, hdfs: 1-3, all_enabled_protocols: 1-3, isi_cbind_d: 1-3, lsass: 1-3, s3: 1-3 }')

Note that, for ease of reading, the protocol information has been removed from the group strings in the following examples.

If more detail is desired, the sysctl efs.gmp.current_info command will report extensive current GMP information.

The membership list {1-3:0-11, … } represents our three node cluster, with nodes 1 through 3, each containing 12 drives, numbered zero through 11. The numbers before the colon in the group membership string represent the participating Array IDs, and the numbers after the colon are the Drive IDs.
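To make the notation concrete, here is a minimal Python sketch. It is not an OneFS utility; the parsing rules are simply inferred from the group strings shown in this article. It expands a single membership entry such as '1-3:0-11' into a per-node list of Drive IDs:

def expand_ranges(spec):
    """Expand a range list like '0-1,3-12' into a sorted list of ints."""
    out = []
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            out.extend(range(int(lo), int(hi) + 1))
        else:
            out.append(int(part))
    return sorted(out)

def expand_membership(entry):
    """Expand a membership entry like '1-3:0-11' into {array_id: [drive_ids]}."""
    nodes, drives = entry.split(':')
    return {n: expand_ranges(drives) for n in expand_ranges(nodes)}

print(expand_membership('1-3:0-11'))
# -> nodes 1, 2 and 3, each mapped to drives 0 through 11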

Each node's info is maintained in the /etc/ifs/array.xml file. For example, the entry for node 2 of the cluster above reads:

<device>

<port>5019</port>

<array_id>2</array_id>

<array_lnn>2</array_lnn>

<guid>0007430857d489899a57f2042f0b8b409a0c</guid>

<onefs_version>0x800005000100083</onefs_version>

<ondisk_onefs_version>0x800005000100083</ondisk_onefs_version>

<ipaddress name="int-a">192.168.76.77</ipaddress>

<status>ok</status>

<soft_fail>0</soft_fail>

<read_only>0x0</read_only>

<type>storage</type>

</device>
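Purely as an illustration, and assuming only that array.xml is well-formed XML containing one <device> element per node with the child tags shown above, a few lines of Python can pull out the interesting fields:

import xml.etree.ElementTree as ET

# Sketch only: assumes one <device> element per node, as in the excerpt above.
tree = ET.parse('/etc/ifs/array.xml')
for dev in tree.getroot().iter('device'):
    print("Array ID {}  LNN {}  status={}  version={}".format(
        dev.findtext('array_id'),
        dev.findtext('array_lnn'),
        dev.findtext('status'),
        dev.findtext('onefs_version')))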

It's worth noting that Array IDs (or Node IDs, as they're also often known) differ from a cluster's Logical Node Numbers (LNNs). LNNs are the numbers that appear in node names, as displayed by isi stat, for example.

Fortunately, the isi_nodes command provides a useful cross-reference of both LNNs and Array IDs:

# isi_nodes "%{name}: LNN %{lnn}, Array ID %{id}"

node-1: LNN 1, Array ID 1

node-2: LNN 2, Array ID 2

node-3: LNN 3, Array ID 3

As a general rule, LNNs can be re-used within a cluster, whereas Array IDs are never recycled. In this case, node 1 was removed from the cluster and a new node was added in its place:

node-1: LNN 1, Array ID 4

The LNN of node 1 remains the same, but its Array ID has changed to ‘4’. Regardless of how many nodes are replaced, Array IDs will never be re-used.

A node’s LNN, on the other hand, is based on the relative position of its primary backend IP address, within the allotted subnet range.

The numerals following the colon in the group membership string represent drive IDs that, like Array IDs, are also not recycled. If a drive is failed, the node will identify the replacement drive with the next unused number in sequence.

Unlike Array IDs though, Drive IDs (or Lnums, as they’re sometimes known) begin at 0 rather than at 1 and do not typically have a corresponding ‘logical’ drive number.

For example:

node-3# isi devices drive list

Lnn  Location  Device    Lnum  State   Serial

-----------------------------------------------------

3    Bay  1    /dev/da1  12    HEALTHY PN1234P9H6GPEX

3    Bay  2    /dev/da2  10    HEALTHY PN1234P9H6GL8X

3    Bay  3    /dev/da3  9     HEALTHY PN1234P9H676HX

3    Bay  4    /dev/da4  8     HEALTHY PN1234P9H66P4X

3    Bay  5    /dev/da5  7     HEALTHY PN1234P9H6GPRX

3    Bay  6    /dev/da6  6     HEALTHY PN1234P9H6DHPX

3    Bay  7    /dev/da7  5     HEALTHY PN1234P9H6DJAX

3    Bay  8    /dev/da8  4     HEALTHY PN1234P9H64MSX

3    Bay  9    /dev/da9  3     HEALTHY PN1234P9H66PEX

3    Bay 10    /dev/da10 2     HEALTHY PN1234P9H5VMPX

3    Bay 11    /dev/da11 1     HEALTHY PN1234P9H64LHX

3    Bay 12    /dev/da12 0     HEALTHY PN1234P9H66P2X

-----------------------------------------------------

Total: 12

Note that the drive in Bay 5 has an Lnum, or Drive ID, of 7, the number by which it will be represented in a group statement.

Drive bays and device names may refer to different drives at different points in time, and either could be considered a “logical” drive ID. While the best practice is definitely not to switch drives between bays of a node, if this does happen OneFS will correctly identify the relocated drives by Drive ID and thereby prevent data loss.

Depending on device availability, device names (/dev/da*) may change when a node comes up, so they cannot be relied upon to refer to the same device across reboots. However, Drive IDs and drive bay numbers do provide consistent drive identification.

Status info for the drives is kept in a node's /etc/ifs/drives.xml file. Here's the entry for drive Lnum 0 on node LNN 3, for example:

<logicaldrive number="0" seqno="0" active="1" soft-fail="0" ssd="0" purpose="0">66b60c9f1cd8ce1e57ad0ede0004f446</logicaldrive>
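As with array.xml, a short illustrative sketch can list the drive entries; the only assumption is that drives.xml wraps one <logicaldrive> element per drive in the form shown above:

import xml.etree.ElementTree as ET

# Sketch only: assumes one <logicaldrive> element per drive.
tree = ET.parse('/etc/ifs/drives.xml')
for ld in tree.getroot().iter('logicaldrive'):
    a = ld.attrib
    print("Lnum {}: active={} soft-fail={} ssd={} guid={}".format(
        a['number'], a['active'], a['soft-fail'], a['ssd'], ld.text))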

For efficiency and ease of reading, group messages condense these lists into ranges of numbers separated by dashes, for example '1-3:0-11'.

If, however, drive Lnum 2 in node 2 is failed and replaced, the replacement disk receives the next unused Drive ID (Lnum 12) and the list becomes:

{ 1:0-11, 2:0-1,3-12, 3:0-11 }.

Unfortunately, changes like these can make cluster groups trickier to read.

For example: { 1:0-23, 2:0-5,7-10,12-25, 3:0-23, 4:0-7,9-36, 5:0-35, 6:0-9,11-36 }

This describes a cluster with two node pools. Nodes 1 through 3 contain 24 drives each, and nodes 4 through 6 have 36 drives each. Nodes 1, 3, and 5 contain all their original drives, whereas node 2 has lost drives 6 and 11, node 4 has lost drive 8, and node 6 is missing drive 10.
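Building on the expand_membership() sketch from earlier in this article, a hypothetical helper can derive exactly this kind of reading, reporting the Drive ID gaps for each node in a membership entry:

def missing_drives(entry):
    """Report failed-and-replaced Drive IDs (the gaps) for one membership entry,
    e.g. '2:0-5,7-10,12-25' -> {2: [6, 11]}.
    Uses expand_membership() from the earlier sketch."""
    gaps = {}
    for node, drives in expand_membership(entry).items():
        gaps[node] = sorted(set(range(max(drives) + 1)) - set(drives))
    return gaps

print(missing_drives('2:0-5,7-10,12-25'))   # {2: [6, 11]}
print(missing_drives('4:0-7,9-36'))         # {4: [8]}
print(missing_drives('6:0-9,11-36'))        # {6: [10]}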

Accelerator nodes are listed differently in group messages since they contain no disks to be part of the group. They’re listed twice, once as a node with no disks, and again explicitly as a ‘diskless’ node.

For example, nodes 11 and 12 in the following:

{ 1:0-23, 2,4:0-10,12-24, 5:0-10,12-16,18-25, 6:0-17,19-24, 7:0-10,12-24, 9-10:0-23, 11-12, diskless: 11-12 …}

Nodes in the process of SmartFailing are also listed both separately and in the regular group. For example, node 2 in the following:

{ 1-3:0-23, soft_failed: 2 …}

However, when the FlexProtect completes, the node will be removed from the group.

A SmartFailed node that’s also unavailable will be noted as both down and soft_failed. For example:

{ 1-3:0-23, 5:0-17,19-24, down: 4, soft_failed: 4 …}

Similarly, when a node is offline, the other nodes in the cluster will show that node as down:

{ 1-2:0-23, 4:0-23, down: 3 …}

Note that no disks for that node are listed, and that it doesn’t show up in the group.

If the node is split from the cluster—that is, if it is online but not able to contact other nodes on its back-end network—that node will see the rest of the cluster as down. Its group might look something like {6:0-11, down: 3-5,8-9,12 …} instead.

When calculating whether a cluster is below its protection level, SmartFailed devices should be considered 'in the group' unless they are also down: a cluster at +2:1 protection with three nodes SmartFailed but still up does not pose an exceptional risk to data availability.
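To make the rule explicit, here is a hedged sketch of the bookkeeping. It is purely illustrative and is not how OneFS itself evaluates protection; the down and SmartFailed sets are assumed to have been parsed from the group string already:

def below_protection(down, soft_failed, node_failures_tolerated):
    """SmartFailed devices are considered 'in the group' unless they are also
    down, so only the down set counts against the protection level."""
    counted = set(down)  # soft_failed-but-up devices contribute nothing
    return len(counted) > node_failures_tolerated

# +2:1 protection tolerates the loss of two nodes. Three nodes that are
# SmartFailed but still up are not counted...
print(below_protection(down=set(), soft_failed={2, 4, 6}, node_failures_tolerated=2))      # False
# ...whereas three nodes that are SmartFailed and also down would exceed the level.
print(below_protection(down={2, 4, 6}, soft_failed={2, 4, 6}, node_failures_tolerated=2))  # True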

Like nodes, drives may be SmartFailed and down, or SmartFailed but available. The group statement looks similar to that for a SmartFailed or down node, except that the drive Lnum is also included. For example:

{ 1-4:0-23, 5:0-6,8-23, 6:0-17,19-24, down: 5:7, soft_failed: 5:7 }

indicates that node id 5 drive Lnum 7 is both SmartFailed and unavailable.

If the drive was SmartFailed but still available, the group would read:

{ 1-4:0-23, 5:0-6,8-23, 6:0-17,19-24, soft_failed: 5:7 }

When multiple devices are down, consolidated group statements can be tricky to read. For example, if node 1 and drive 4 of node 3 were both SmartFailed and down, the group statement would read:

{ 2:0-11, 3:0-3,5-11, 4-5:0-11, down: 1, 3:4, soft_failed: 1, 3:4 }

As mentioned in the previous GMP blog article, OneFS has a read-only mode. Nodes in a read-only state are clearly marked as such in the group:

{ 1-6:0-8, soft_failed: 2, read_only: 3 }

Node 3 is listed both as a regular group member and called out separately at the end, because it's still active. It's worth noting that 'read-only' indicates that OneFS will not write to the disks in that node. However, incoming connections to that node are still able to write to other nodes in the cluster.

Non-responsive, or dead, nodes appear in a group when they have been permanently removed from the cluster without being SmartFailed first. For example, node 11 in the following:

{ 1-5:0-11, 6:0-7,9-12, 7-10,12-14:0-11, 15:0-10,12, 16-17:0-11, dead: 11 }

Drives in a dead state include a drive number as well as a node number. For example:

{ 1:0-11, 2:0-9,11, 3:0-11, 4:0-11, 5:0-11, 6:0-11, dead: 2:10 }

In the event of a dead disk or node, the recommended course of action is to immediately start a FlexProtect and contact Isilon Support.

SmartFailed disks appear in a similar manner to other drive-specific states, and therefore include both an array ID and a drive ID. For example:

{ 1:0-11, 2:0-3,5-12, 3-4:0-11, 5:0-1,3-11, 6:0-11, soft_failed: 5:2 }

This shows drive 2 in node 5 to be SmartFailed, but still available. If the drive was physically unavailable or down, the group would show as:

{ 1:0-11, 2:0-3,5-12, 3-4:0-11, 5:0-1,3-11, 6:0-11, down: 5:2, soft_failed: 5:2 }

Stalled drives (drives that don’t respond) are marked as such, for example:

{ 1:0-2,4-11, 2-4:0-11, stalled: 1:3 }

When a drive becomes un-stalled, it simply rejoins the group. In this case, the group would return to:

{ 1-4:0-11 }

A group's sequence number is displayed between angle brackets. For example, in <3,6>: { 1-3:0-11 }, the sequence number is <3,6>.

The first number within the sequence, in this case 3, identifies the node that initiated the most recent group change.

In the case of a node leaving the group, the lowest-numbered node remaining in the cluster will initiate the group change and thus appear as the first number within the angle brackets. In the case of a node joining the group, the newly-joined node will initiate the change and thus will be the listed node. If the group change involved a single drive joining or leaving the group, the node containing that drive will initiate the change and thus will be the listed node.

The second piece of the group sequence number increases sequentially. The previous group would have had a 5 in this place; the next group should have a 7.

We rarely need to review sequence numbers, so long as they increase sequentially and are initiated by either the lowest-numbered node, a newly added node, or a node that removed a drive. The group membership contains the information that we most frequently require.
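As an illustration of both points, the following sketch parses the sequence number out of a group string and checks that the counter increments by one across a hypothetical, made-up history of group changes:

import re

def parse_sequence(group_line):
    """Extract the <initiator, counter> pair from a group string
    such as '<3,6>: { 1-3:0-11 }'."""
    m = re.search(r'<(\d+),(\d+)>', group_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

history = ['<3,5>: { 1-3:0-11 }',
           '<3,6>: { 1-3:0-11 }',
           '<1,7>: { 1-2:0-11, down: 3 }']   # node 3 left; lowest node (1) initiates
seqs = [parse_sequence(g) for g in history]
# The counter should simply increase by one per group change.
print(all(b[1] == a[1] + 1 for a, b in zip(seqs, seqs[1:])))   # True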

A group change occurs when an event changes the set of devices participating in a cluster. These events may be caused by drive removals or replacements, node additions, node removals, node reboots or shutdowns, backend (internal) network events, and the transition of a node into read-only mode. For debugging purposes, group change messages can be reviewed to determine whether any devices are currently in a failure state. We will explore this further in the next GMP blog article.

 

When a group change occurs, a cluster-wide process writes a message describing the new group membership to /var/log/messages on every node. Similarly, if a cluster “splits,” the newly-formed clusters behave in the same way: each node records its group membership to /var/log/messages. When a cluster splits, it breaks into multiple clusters (multiple groups). This is rarely, if ever, a desirable event. Notice that cluster and group are synonymous: a cluster is defined by its group members. Group members which lose sight of other group members no longer belong to the same group and thus no longer belong to the same cluster.

To view group changes from one node’s perspective, you can grep for the expression ‘new group’ to extract the group change statements from the log. For example:

tme-1# grep -i 'new group' /var/log/messages | tail -n 10

Jul 8 08:07:50 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpdrive-upda") new group: : { 1:0-4, down: 1:5-11, 2-3 }

Jul 8 08:07:50 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpdrive-upda") new group: : { 1:0-5, down: 1:6-11, 2-3 }

Jul 8 08:07:50 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpdrive-upda") new group: : { 1:0-6, down: 1:7-11, 2-3 }

Jul 8 08:07:50 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpdrive-upda") new group: : { 1:0-7, down: 1:8-11, 2-3 }

Jul 8 08:07:50 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpdrive-upda") new group: : { 1:0-8, down: 1:9-11, 2-3 }

Jul 8 08:07:50 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpdrive-upda") new group: : { 1:0-9, down: 1:10-11, 2-3 }

Jul 8 08:07:50 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpdrive-upda") new group: : { 1:0-10, down: 1:11, 2-3 }

Jul 8 08:07:50 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpdrive-upda") new group: : { 1:0-11, down: 2-3 }

Jul 8 08:07:51 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpmerge") new group: : { 1:0-11, 3:0-7,9-12, down: 2 }

Jul 8 08:07:52 (id1) /boot/kernel/kernel: [gmp_info.c:530](pid 1814="kt: gmpmerge") new group: : {

In this case, the tail -n 10 command has been used to limit the returned group changes to the last ten reported in the file. All of these occur within two seconds, so in an actual investigation we would want to go further back, to before whatever incident was under investigation.
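The same extraction is easy to script. The sketch below is illustrative only; it assumes the syslog line format shown in the excerpts above and pulls out just the timestamp and the group string from each 'new group' line:

import re

# Matches lines such as:
# Jul 8 08:07:50 (id1) ... new group: : { 1:0-11, down: 2-3 }
NEW_GROUP = re.compile(r'^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}).*new group\s*:\s*:\s*(\{.*\})')

def group_changes(path='/var/log/messages'):
    """Yield (timestamp, group string) for each group change in a messages file."""
    with open(path) as f:
        for line in f:
            m = NEW_GROUP.match(line)
            if m:
                yield m.group(1), m.group(2)

for ts, grp in group_changes():
    print(ts, grp)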

Interpreting group changes

Even in the example above, however, we can be sure of several things:

  • Most importantly, at last report all nodes of the cluster are operational and joined into the cluster. No nodes or drives report as down or split. (At some point in the past, drive ID 8 on node 3 failed, and a replacement disk has been added successfully as Lnum 12.)
  • Next most important is that node 1 rebooted: for the first eight of the ten lines, each group change is adding back a drive on node 1 into the group, and nodes 2 and 3 are inaccessible. This occurs on node reboot prior to any attempt to join an active group, and is correct and healthy behavior.
  • Note also that node 3 joins in with node 1 before node 2 does. This might be coincidental, given that the two nodes join within a second of each other. On the other hand, perhaps node 2 also rebooted while node 3 remained up. A review of group changes from these other nodes could confirm either of those behaviors.

Logging onto node 3, we can see the following:

tme-3# grep -i 'new group' /var/log/messages | tail -n 10

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpdrive-upda") new group: : { 3:0-4, down: 1-2, 3:5-7,9-12 }

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpdrive-upda") new group: : { 3:0-5, down: 1-2, 3:6-7,9-12 }

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpdrive-upda") new group: : { 3:0-6, down: 1-2, 3:7,9-12 }

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpdrive-upda") new group: : { 3:0-7, down: 1-2, 3:9-12 }

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpdrive-upda") new group: : { 3:0-7,9, down: 1-2, 3:10-12 }

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpdrive-upda") new group: : { 3:0-7,9-10, down: 1-2, 3:11-12 }

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpdrive-upda") new group: : { 3:0-7,9-11, down: 1-2, 3:12 }

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpdrive-upda") new group: : { 3:0-7,9-12, down: 1-2 }

Jul 8 08:07:50 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpmerge") new group: : { 1:0-11, 3:0-7,9-12, down: 2 }

Jul 8 08:07:52 (id3) /boot/kernel/kernel: [gmp_info.c:530](pid 1828="kt: gmpmerge") new group: : { 1-2:0-11, 3:0-7,9-12 }

In this instance, it's apparent that node 3 rebooted at the same time as node 1. It's worth checking node 2's logs to see whether it also rebooted.

Given that all three nodes rebooted simultaneously, it's highly likely that this was a cluster-wide event rather than a single-node issue, especially since watchdog timeouts that cause cluster-wide reboots typically produce staggered rather than simultaneous reboots. The Softwatch process (also known as the software watchdog, or swatchdog) monitors the kernel and dumps a stack trace and/or reboots the node when the node is not responding. This tool protects the cluster from the impact of heavy CPU starvation and aids the issue discovery and resolution process.

Constructing a timeline

If a cluster experiences multiple, staggered group changes, it can be extremely helpful to craft a timeline of the order and duration in which nodes were up or down. This info can then be cross-referenced with panic stack traces and other system logs to help diagnose the root cause of an event.

For example, in the following, a 15-node cluster experiences six different node reboots over a twenty-minute period. These are the group change messages from node 14, which stayed up for the whole duration:

tme-14# grep 'new group' tme-14-messages

Jul 8 16:44:38 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1060="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 9}

Jul 8 16:44:58 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 13-14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 15}

Jul 8 16:45:20 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 13, 15}

Jul 8 16:47:09 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-merge") new group: : { 1:0-11, 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 13, 15}

Jul 8 16:47:27 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 1-2, 13, 15}

Jul 8 16:48:11 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 2102="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1-2, 13, 15, 18}

Jul 8 16:50:55 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 2102="kt: gmp-merge") new group: : { 6-8, 9,13-14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1-2, 15, 18}

Jul 8 16:51:26 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 85396="kt: gmp-merge") new group: : { 2:0-11, 6-8, 9,13-14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1, 15, 18}

Jul 8 16:51:53 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 85396="kt: gmp-merge") new group: : { 2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1, 18}

Jul 8 16:54:06 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 85396="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17:0-11, 19-21, down: 18}

Jul 8 16:56:10 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 2102="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21}

Jul 8 16:59:54 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 85396="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 9,13-15,17-18:0-11, 19-21, down: 16}

Jul 8 17:05:23 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21}

First, run the isi_nodes "%{name}: LNN %{lnn}, Array ID %{id}" command to map the cluster's node names to their respective Array IDs.

Before the cluster node outage event at 16:44 on Jul 8, we can see that there was an earlier group change that day, at 14:54:

Jul 8 14:54:00 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1060="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, diskless: 6-8, 19-21 }

At that point, all nodes were online and the cluster could be considered healthy. The cluster contains six accelerators and nine data nodes with twelve drives apiece. Since the Array IDs extend to 21, and Array IDs 3 through 5 and 10 through 12 are missing, six nodes have at some point been removed from the cluster.

So, the first event occurs at 16:44:38 on 8 July:

Jul 8 16:44:38 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1060="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 9, diskless: 6-8, 19-21 }

Node 14 identifies Array ID 9 (LNN 6) as having left the group.
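Since several Array IDs have been retired over this cluster's lifetime, translating between Array IDs and LNNs is a recurring chore when reading these logs. The authoritative mapping comes from the isi_nodes command shown earlier, but as a hedged shortcut (it assumes, as happens to be true here, that LNNs were assigned in ascending Array ID order) the mapping can be reconstructed from the group string itself:

# Array IDs present in the pre-event group. The authoritative LNN mapping comes
# from 'isi_nodes "%{name}: LNN %{lnn}, Array ID %{id}"', but for this cluster
# LNNs simply run 1-15 in ascending Array ID order.
array_ids = [1, 2, 6, 7, 8, 9, 13, 14, 15, 16, 17, 18, 19, 20, 21]
id_to_lnn = {aid: lnn for lnn, aid in enumerate(array_ids, start=1)}
print(id_to_lnn[9])    # 6  -> Array ID 9 is LNN 6
print(id_to_lnn[20])   # 14 -> Array ID 20 is LNN 14 (matches 'tme-14(id20)')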

Next, twenty seconds later, two more nodes (Array IDs 2 and 15, which are LNNs 2 and 9) show as offline:

Jul 8 16:44:58 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 13-14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 15, diskless: 6-8, 19-21 }

Twenty-two seconds later, another node (Array ID 13, LNN 7) goes offline:

Jul 8 16:45:20 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 13, 15, diskless: 6-8, 19-21 }

At this point, four nodes (LNNs 2, 6, 7, and 9) are marked as offline.

Nearly two minutes later, the first node to go down (LNN 6, Array ID 9) rejoins:

Jul 8 16:47:09 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-merge") new group: : { 1:0-11, 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 13, 15, diskless: 6-8, 19-21 }

Eighteen seconds after node 6 comes back, however, node 1 goes offline:

Jul 8 16:47:27 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 1-2, 13, 15, diskless: 6-8, 19-21 }

Finally, at 17:05:23, the group returns to its original membership:

Jul 8 17:05:23 tme-14(id20) /boot/kernel/kernel: [gmp_info.c:510](pid 1066="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, diskless: 6-8, 19-21 }

As such, a timeline of this cluster event (expressed in LNNs) could read:

Jul 8 16:44:38 6 down

Jul 8 16:44:58 2, 9 down (6 still down)

Jul 8 16:45:20 7 down (2, 6, 9 still down)

Jul 8 16:47:09 6 up (2, 7, 9 still down)

Jul 8 16:47:27 1 down (2, 7, 9 still down)

Jul 8 16:48:11 12 down (1, 2, 7, 9 still down)

Jul 8 16:50:55 7 up (1, 2, 9, 12 still down)

Jul 8 16:51:26 2 up (1, 9, 12 still down)

Jul 8 16:51:53 9 up (1, 12 still down)

Jul 8 16:54:06 1 up (12 still down)

Jul 8 16:56:10 12 up (none down)

Jul 8 16:59:54 10 down

Jul 8 17:05:23 10 up (none down)
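Timeline construction like this can also be scripted. The following hedged sketch reuses group_changes(), expand_ranges() and id_to_lnn from the earlier examples, and simply diffs the 'down' set between successive group changes; it handles node-level down entries only, not drive-qualified ones such as '5:7':

import re

def down_nodes(group):
    """Array IDs listed after 'down:' in a group string (node-level only).
    Drive-qualified entries like 'down: 1:5-11' are not handled by this sketch."""
    m = re.search(r'down:\s*([\d,\- ]+)', group)
    if not m:
        return set()
    spec = m.group(1).replace(' ', '').strip(',-')
    return set(expand_ranges(spec)) if spec else set()

def print_timeline(changes, id_to_lnn):
    """Report which LNNs went down or came back up at each group change."""
    prev = set()
    for ts, grp in changes:
        now = {id_to_lnn[i] for i in down_nodes(grp)}
        for lnn in sorted(now - prev):
            print(f"{ts}  {lnn} down (down now: {sorted(now)})")
        for lnn in sorted(prev - now):
            print(f"{ts}  {lnn} up   (down now: {sorted(now) or 'none'})")
        prev = now

print_timeline(group_changes('tme-14-messages'), id_to_lnn)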

Before triangulating log events across multiple nodes, it's important to ensure that the nodes' clocks are all synchronized. To check this, run the isi_for_array -q date command and confirm that the output matches across all nodes. If not, apply the time offset for a particular node to the timestamps of its logfiles.
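Applying such an offset is straightforward once it has been measured. The sketch below is illustrative only, with made-up offset values and an assumed year (syslog timestamps omit it):

from datetime import datetime, timedelta

# Hypothetical offsets (node clock minus node 14's clock), measured by comparing
# the 'isi_for_array -q date' output. The values here are made up for illustration.
offsets = {'tme-6': timedelta(seconds=-4), 'tme-14': timedelta(seconds=0)}

def normalize(node, ts, year=2019):
    """Shift a syslog timestamp such as 'Jul 8 16:44:31' onto node 14's clock.
    Syslog omits the year, so one is assumed here."""
    t = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return t - offsets[node]

print(normalize('tme-6', 'Jul 8 16:44:31'))   # 2019-07-08 16:44:35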

So what caused node 6 to go offline at 16:44:38? The messages file for that node shows that nothing of note occurred between noon on Jul 8 and 16:44:31. After this, a slew of messages were logged:

Jul 8 16:44:31 tme-6(id9) /boot/kernel/kernel: [rbm_device.c:749](pid 132="swi5: clock sio") ping failure (1)

Jul 8 16:44:31 tme-6(id9) /boot/kernel/kernel: last 3 messages out: GMP_NODE_INFO_UPDATE, GMP_NODE_INFO_UPDATE, LOCK_REQ

Jul 8 16:44:31 tme-6(id9) /boot/kernel/kernel: last 3 messages in : LOCK_RESP, TXN_COMMITTED, TXN_PREPARED

These three messages are repeated several times and then node 6 splits:

Jul 8 16:44:31 tme-6(id9) /boot/kernel/kernel: [rbm_device.c:749](pid 132="swi5: clock sio") ping failure (21)

Jul 8 16:44:31 tme-6(id9) /boot/kernel/kernel: last 3 messages out: GMP_NODE_INFO_UPDATE, GMP_NODE_INFO_UPDATE, LOCK_RESP

Jul 8 16:44:31 tme-6(id9) /boot/kernel/kernel: last 3 messages in : LOCK_REQ, LOCK_RESP, LOCK_RESP

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 48538="kt: disco-cbs") disconnected from node 1

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 49215="kt: disco-cbs") disconnected from node 2

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 50864="kt: disco-cbs") disconnected from node 6

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 49114="kt: disco-cbs") disconnected from node 7

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 30433="kt: disco-cbs") disconnected from node 8

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 49218="kt: disco-cbs") disconnected from node 13

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 50903="kt: disco-cbs") disconnected from node 14

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 24705="kt: disco-cbs") disconnected from node 15

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 48574="kt: disco-cbs") disconnected from node 16

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 49508="kt: disco-cbs") disconnected from node 17

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 52977="kt: disco-cbs") disconnected from node 18

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 52975="kt: disco-cbs") disconnected from node 19

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 50902="kt: disco-cbs") disconnected from node 20

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:650](pid 48513="kt: disco-cbs") disconnected from node 21

Jul 8 16:44:34 tme-6(id9) /boot/kernel/kernel: [gmp_rtxn.c:194](pid 48513="kt: gmp-split") forcing disconnects from { 1, 2, 6, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21 }

Jul 8 16:44:50 tme-6(id9) /boot/kernel/kernel: [gmp_info.c:510](pid 48513="kt: gmp-split") new group: : { 9:0-11, down: 1-2, 6-8, 13-21, diskless: 6-8, 19-21 }

Node 6 splits from the rest of the cluster, then rejoins it without a reboot.

Reviewing the messages logs from the other nodes

After grabbing the pertinent node state and event info from the /var/log/messages logs for all fifteen nodes, a final timeline could read:

Jul 8 16:44:38 6 down

6: split, not rebooted. Network issue? No engine stalls (see note below) at that time.

Jul 8 16:44:58 2, 9 down (6 still down)

2: Softwatch timed out

9: Softwatch timed out

Jul 8 16:45:20 7 down (2, 6, 9 still down)

7: Indeterminate transactions

Jul 8 16:47:09 6 up (2, 7, 9 still down)

Jul 8 16:47:27 1 down (2, 7, 9 still down)

1: Softwatch timed out

Jul 8 16:48:11 12 down (1, 2, 7, 9 still down)

12: Softwatch timed out

Jul 8 16:50:55 7 up (1, 2, 9, 12 still down)

Jul 8 16:51:26 2 up (1, 9, 12 still down)

Jul 8 16:51:53 9 up (1, 12 still down)

Jul 8 16:54:06 1 up (12 still down)

Jul 8 16:56:10 12 up (none down)

Jul 8 16:59:54 10 down

10: Indeterminate transactions

Jul 8 17:05:23 10 up (none down)

Note: The BAM (block allocation manager) is responsible for building and executing a ‘write plan’ of which blocks should be written to which drives on which nodes for each transaction. OneFS logs an engine stall if this write plan encounters an unexpected delay.

Because group changes document the cluster’s actual configuration from OneFS’ perspective, they’re a vital tool in understanding at any point in time which devices the cluster considers available, and which devices the cluster considers as having failed. This info, when combined with other data from cluster logs, can provide a succinct but detailed cluster history, simplifying both debugging and failure analysis.
