[LU-7938] Recovery on secondary OSS node stalled

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Blocker
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
    • Environment: lola
      build: 2.8 GA + patches
    • Severity: 3

    Description

      The error happened during soak testing of build '20160324' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160324). DNE is enabled. The MDTs were formatted with ldiskfs, the OSTs with zfs. The MDS and OSS nodes are configured in an HA active-active failover configuration, with 1 MDT per MDS and 4 OSTs per OSS.
      Nodes lola-4 and lola-5 form a HA cluster.

      Event history

      • 2016-03-29 07:48:04,307:fsmgmt.fsmgmt:INFO triggering fault oss_failover of node lola-4
        power-cycle node
      • 2016-03-29 07:52:27,876:fsmgmt.fsmgmt:INFO lola-4 is up
      • 2016-03-29 07:53:09,557:fsmgmt.fsmgmt:INFO zpool import and mount of lola-4's OSTs complete
      • Recovery did not complete even after recovery_time reached zero (see the monitoring sketch after this list)
      • 2016-03-29 08:03 (approximately) Aborted recovery manually (lctl --device ... abort_recovery)
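
      For reference, a minimal sketch of this failover sequence and of how recovery can be monitored and aborted on the surviving OSS. The pool, dataset, and target names are illustrative, borrowed from the format output quoted in a later comment; they are not the exact names used on lola-4:

        # import the failed-over pool and mount the OST on the surviving node
        zpool import soaked-ost18
        mount -t lustre soaked-ost18/ost18 /mnt/soaked-ost18

        # watch recovery progress; status should move from RECOVERING to COMPLETE
        lctl get_param obdfilter.soaked-OST0012.recovery_status

        # if the status stays RECOVERING after recovery_time reaches zero,
        # recovery can be aborted manually
        lctl --device soaked-OST0012 abort_recovery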

      Attached files:
      messages, console and debug logs from before (lustre-log-20160329-0759-recovery-stalled) and after recovery was aborted (lustre-log-20160329-0803-recovery-aborted)


        Activity


          heckes Frank Heckes (Inactive) added a comment -
          Failover works after the reformat of the Lustre FS. The ticket can be closed.


          heckes Frank Heckes (Inactive) added a comment -
          failover.node is specified for the mkfs.lustre command via the --servicenode option.

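
          For illustration, a sketch of how such a format command might look for OST0012. The options are real mkfs.lustre options, but the exact command used on lola is an assumption reconstructed from the property dump quoted in a later comment; passing both NIDs via --servicenode marks either node as a valid server for the target:

            # format OST index 18 (OST0012) on an existing zpool, with both
            # lola-2 and lola-3 registered as service nodes
            mkfs.lustre --ost --backfstype=zfs \
                --fsname=soaked --index=18 \
                --mgsnode=192.168.1.108@o2ib10 --mgsnode=192.168.1.109@o2ib10 \
                --servicenode=192.168.1.102@o2ib10 \
                --servicenode=192.168.1.103@o2ib10 \
                soaked-ost18/ost18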

          heckes Frank Heckes (Inactive) added a comment -
          Reformat is ongoing. I'm going to start soak again once it has completed.


          heckes Frank Heckes (Inactive) added a comment -
          I set the parameter 'lustre:failover.node' on all OSTs to the appropriate values for the failover nodes (see attached file ost-failover-setttings-20160726_0702).
          However, this didn't solve the problem: recovery still stalls on the failover node.

          I also reformatted the soak FS. The parameters are populated (e.g.) with the following content:

          192.168.1.102: mkfs_cmd = zfs create -o canmount=off -o xattr=sa soaked-ost18/ost18
          192.168.1.102: Writing soaked-ost18/ost18 properties
          192.168.1.102:   lustre:version=1
          192.168.1.102:   lustre:flags=4194
          192.168.1.102:   lustre:index=18
          192.168.1.102:   lustre:fsname=soaked
          192.168.1.102:   lustre:svname=soaked:OST0012
          192.168.1.102:   lustre:mgsnode=192.168.1.108@o2ib10:192.168.1.109@o2ib10
          192.168.1.102:   lustre:failover.node=192.168.1.102@o2ib10:192.168.1.103@o2ib10
          

          Additionally I used the option '--force-nohostid' for the mkfs.lustre command. I think the patch enabling hostid support hasn't landed yet, and SPL fails during reformat if /etc/hostid isn't available.

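          As an aside, a sketch of the two possible workarounds for a missing /etc/hostid, assuming a RHEL-style system where the stock genhostid utility is available:

            # either generate /etc/hostid so SPL no longer complains ...
            genhostid

            # ... or tell mkfs.lustre to skip the hostid check entirely
            mkfs.lustre --ost --backfstype=zfs --force-nohostid ... soaked-ost18/ost18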

          heckes Frank Heckes (Inactive) added a comment -
          Indeed it's a configuration error. It looks like the NID of the secondary node (192.168.1.103@o2ib) wasn't written during the FS format.
          I set it explicitly again on all targets using zfs set lustre:failover.node, and will check whether this resolves the issue.

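
          A minimal sketch of that repair, using the dataset name and NIDs from the property dump quoted above; the property value is the colon-separated NID list shown in that dump:

            # write both failover NIDs into the target's ZFS user property, then verify
            zfs set lustre:failover.node="192.168.1.102@o2ib10:192.168.1.103@o2ib10" soaked-ost18/ost18
            zfs get lustre:failover.node soaked-ost18/ost18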

          yong.fan nasf (Inactive) added a comment -
          Thanks Frank for the detailed logs. According to the client-side log, the client always tried to connect to lola-2 for the failed OSTs (OST0012 OST0000 OST0006 OST000c):

          00000100:00000001:30.0:1469462043.991030:0:125723:0:(import.c:508:import_select_connection()) Process entered
          00000100:00080000:30.0:1469462043.991033:0:125723:0:(import.c:523:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: connect to NID 192.168.1.102@o2ib10 last attempt 5189761512
          00000100:00080000:30.0:1469462043.991036:0:125723:0:(import.c:567:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: tried all connections, increasing latency to 50s
          00000100:00080000:30.0:1469462043.991099:0:125723:0:(import.c:601:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: import ffff880ff6706000 using connection 192.168.1.102@o2ib10/192.168.1.102@o2ib10
          00000100:00000001:30.0:1469462043.991104:0:125723:0:(import.c:605:import_select_connection()) Process leaving (rc=0 : 0 : 0)
          ...

          That means only one candidate can be selected for connecting to OST0000, but according to your description there should be another candidate, 192.168.1.103@o2ib, for the connection. So I am wondering: how did you configure lola-2 and lola-3 as a failover pair?

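          For what it's worth, the candidate NID list the client chooses from can be inspected directly on a client; a sketch, with the OSC name taken from the log excerpt above:

            # the "connection" section of the import lists every known failover NID;
            # with a correct configuration both 192.168.1.102@o2ib10 and
            # 192.168.1.103@o2ib10 should appear here
            lctl get_param osc.soaked-OST0000-osc-*.import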

          heckes Frank Heckes (Inactive) added a comment -
          Debug files have been uploaded. Roles:
          lola-8: MGS
          lola-3: secondary OSS
          lola-30: Lustre client

          heckes Frank Heckes (Inactive) added a comment - edited

          Resources of lola-2 (OST0012 OST0000 OST0006 OST000c) failed over to lola-3:

          • 2016-07-25 08:49:12,572:fsmgmt.fsmgmt:INFO Wait for recovery to complete
          • 2016-07-25 08:55 created debug logs
          • 2016-07-25 08:57:51 aborted recovery
          • 2016-07-25 08:59 created debug logs

          Upload of debug logs has been started, but it might take some time.

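          A sketch of how such debug logs are typically captured on each node; the output path is illustrative:

            # raise debug verbosity, reproduce the problem, then dump the
            # kernel debug buffer to a file for upload
            lctl set_param debug=-1
            lctl dk /tmp/lustre-log-$(date +%Y%m%d-%H%M)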

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 6
