Details
Type: Bug
Resolution: Not a Bug
Priority: Blocker
Affects Versions: Lustre 2.8.0, Lustre 2.9.0
Environment: lola
build: 2.8 GA + patches
Severity: 3
Description
The error happens during soak testing of build '20160324' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160324). DNE is enabled. MDTs had been formatted with ldiskfs, OSTs with zfs. MDS and OSS nodes are configured in an HA active-active failover configuration. MDS nodes operate with 1 MDT per MDS, while OSSes run 4 OSTs per node.
Nodes lola-4 and lola-5 form a HA cluster.
Event history
- 2016-03-29 07:48:04,307:fsmgmt.fsmgmt:INFO triggering fault oss_failover of node lola-4
- powercycle node
- 2016-03-29 07:52:27,876:fsmgmt.fsmgmt:INFO lola-4 is up
- 2016-03-29 07:53:09,557:fsmgmt.fsmgmt:INFO zpool import and mount of lola-4's OSTs complete
- Recovery does not complete even after recovery_time has reached zero
- 2016-03-29 08:03 (approximately) Aborted recovery manually (lctl --device ... abort_recovery)
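For reference, the abort is issued per target with a command of the form below; the target name here is only an example:
    lctl --device soaked-OST0008 abort_recovery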
Attached files:
messages, console and debug log before (lustre-log-20160329-0759-recovery-stalled) and after recovery was aborted (lustre-log-20160329-0803-recovery-aborted)
Attachments
Activity
I set the parameter 'lustre:failover.node' on all OSTs to the appropriate values for the failover nodes (see attached file ost-failover-setttings-20160726_0702).
Actually, this didn't solve the problem. Recovery still stalls on the failover node.
I also reformatted the soak file system. The parameters are now written (e.g.) with the following content:
192.168.1.102: mkfs_cmd = zfs create -o canmount=off -o xattr=sa soaked-ost18/ost18
192.168.1.102: Writing soaked-ost18/ost18 properties
192.168.1.102:   lustre:version=1
192.168.1.102:   lustre:flags=4194
192.168.1.102:   lustre:index=18
192.168.1.102:   lustre:fsname=soaked
192.168.1.102:   lustre:svname=soaked:OST0012
192.168.1.102:   lustre:mgsnode=192.168.1.108@o2ib10:192.168.1.109@o2ib10
192.168.1.102:   lustre:failover.node=192.168.1.102@o2ib10:192.168.1.103@o2ib10
Additionally, I used the option '--force-nohostid' for the mkfs.lustre command. I think the patch enabling hostid hasn't landed yet, and SPL fails during reformat if /etc/hostid isn't available.
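For reference, the format command is roughly of the following form; this is only a sketch assuming '--servicenode' is used to record both NIDs (the soak provisioning scripts may pass different options):
    mkfs.lustre --ost --backfstype=zfs --index=18 --fsname=soaked \
        --mgsnode=192.168.1.108@o2ib10:192.168.1.109@o2ib10 \
        --servicenode=192.168.1.102@o2ib10 --servicenode=192.168.1.103@o2ib10 \
        --force-nohostid soaked-ost18/ost18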
Indeed, it's a configuration error. It looks like the NID of the secondary node (192.168.1.103@o2ib) wasn't written during FS format.
I set it explicitly again using 'zfs set lustre:failover.node' and will check whether this resolves the issue.
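For reference, the explicit per-OST setting is of the form below; the dataset name is taken from the mkfs output above and has to be adjusted for each OST:
    zfs set lustre:failover.node=192.168.1.102@o2ib10:192.168.1.103@o2ib10 soaked-ost18/ost18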
Thanks Frank for the detailed logs. According to the client-side log, the client always tried to connect to lola-2 for the failed-over OSTs (OST0012, OST0000, OST0006, OST000c):
00000100:00000001:30.0:1469462043.991030:0:125723:0:(import.c:508:import_select_connection()) Process entered
00000100:00080000:30.0:1469462043.991033:0:125723:0:(import.c:523:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: connect to NID 192.168.1.102@o2ib10 last attempt 5189761512
00000100:00080000:30.0:1469462043.991036:0:125723:0:(import.c:567:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: tried all connections, increasing latency to 50s
00000100:00080000:30.0:1469462043.991099:0:125723:0:(import.c:601:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: import ffff880ff6706000 using connection 192.168.1.102@o2ib10/192.168.1.102@o2ib10
00000100:00000001:30.0:1469462043.991104:0:125723:0:(import.c:605:import_select_connection()) Process leaving (rc=0 : 0 : 0)
...
That means there is only one candidate that can be selected to connect to OST0000, but according to your description, there should be another candidate, 192.168.1.103@o2ib, for the connection. So I am wondering how you configured lola-2 and lola-3 as a failover pair?
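For reference, what the client actually knows about can be cross-checked with something like:
    lctl get_param osc.soaked-OST0012-osc-*.import
(the 'failover_nids' list in the import shows every NID the client will try), and the value stored on disk can be checked on the OSS with:
    zfs get lustre:failover.node soaked-ost18/ost18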
Debug files have been uploaded. Roles:
lola-8: MGS
lola-3: secondary OSS
lola-30: Lustre client
Resources of lola-2 (OST0012 OST0000 OST0006 OST000c) failed over to lola-3
- 2016-07-25 08:49:12,572:fsmgmt.fsmgmt:INFO Wait for recovery to complete
- 2016-07-25 08:55 created debug logs
- 2016-07-25 08:57:51 aborted recovery
- 2016-07-25 08:59 created debug logs
Upload of debug logs has been started, but it might take some time.
According to the recovery status, no clients connected to the newly mounted OSTs on lola-5, and there are no client connection records in the lola-5 debug logs (I am sure that this is NOT because the logs were overwritten). So it seems the clients were not aware of the OST load switching from lola-4 to lola-5; with no connections to lola-5 there can be no recovery. We need the full debug logs from the MGS and from a client (one active client is enough). Thanks!
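For reference, a full debug dump can be produced on a node with something like the following (the output path is only an example), optionally after widening the debug mask with 'lctl set_param debug=-1' if more detail is needed:
    lctl dk > /tmp/lustre-debug.log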
The comments from Cliff above seem to be related to LU-8089.
The error for OSTs reported in the description above also occurred with build https://build.hpdd.intel.com/job/lustre-master/3406 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713)
and is reproducible for any failover that is initiated.
Test environment configuration:
4 MDSes with 1 MDT each, formatted with ldiskfs, in active-active failover configuration
6 OSSes with 4 OSTs each, formatted with zfs, in active-active failover configuration
16 clients
The stalled recovery process can be interrupted (aborted) at any time.
I attached new messages, console and debug log files (taken during recovery and after the recovery abort) of secondary node lola-5.
Debug files of all other nodes, taken at the same time, are available and can be uploaded on demand.
Sequence of events:
- 2016-07-22 04:21:54,665:fsmgmt.fsmgmt:INFO triggering fault oss_failover lola-4
- 2016-07-22 04:26:37,971:fsmgmt.fsmgmt:INFO lola-4 is up
- 2016-07-22 04:27:01,278:fsmgmt.fsmgmt:INFO ... soaked-OST0008 mounted successfully on lola-5 (NOTE: decimal id for OST)
- 2016-07-22 04:27:12,348:fsmgmt.fsmgmt:INFO ... soaked-OST0020 mounted successfully on lola-5
- 2016-07-22 04:27:23,515:fsmgmt.fsmgmt:INFO ... soaked-OST0002 mounted successfully on lola-5
- 2016-07-22 04:27:36,047:fsmgmt.fsmgmt:INFO ... soaked-OST0014 mounted successfully on lola-5
- 2016-07-22 04:27:36 Recovery started
- 2016-07-22 04:37 Recovery aborted
The recovery status on OSS node lola-5, including client connection status, was:
[root@lola-5 ~]# lctl get_param *.*.recovery_status
obdfilter.soaked-OST0002.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719725397
obdfilter.soaked-OST0003.recovery_status=
status: COMPLETE
recovery_start: 1468949759
recovery_duration: 30
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368384861
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST0008.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719723367
obdfilter.soaked-OST0009.recovery_status=
status: COMPLETE
recovery_start: 1468949751
recovery_duration: 38
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368183131
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST000e.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719751551
obdfilter.soaked-OST000f.recovery_status=
status: COMPLETE
recovery_start: 1468949763
recovery_duration: 26
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368269722
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST0014.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719756594
obdfilter.soaked-OST0015.recovery_status=
status: COMPLETE
recovery_start: 1468949751
recovery_duration: 38
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368229218
VBR: DISABLED
IR: ENABLED
The recovery had stalled in a previous session that ran overnight for several hours without completing. That's why I aborted the recovery after ten minutes this time; I wasn't sure the debug logs would contain relevant information, but I was confident that the recovery would never end.
It does not appear possible to abort_recovery; the attempt results in timeouts.
INFO: task lctl:41183 blocked for more than 120 seconds.
      Tainted: P           --  ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000019     0 41183  41083 0x00000080
 ffff8806bdda7b68 0000000000000086 0000000000000000 ffff88082fe86ab0
 ffff88082fe86ae8 ffff880439371150 0000ece58069441d 0000000000000000
 ffff8806bdd8eae8 000000010f83d3bd ffff8806bdd8f068 ffff8806bdda7fd8
Call Trace:
 [<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
 [<ffffffff8153a693>] wait_for_common+0x123/0x180
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
 [<ffffffffa0992e80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
 [<ffffffffa112ac02>] mdt_iocontrol+0x6b2/0x920 [mdt]
 [<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa0783edc>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
 [<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
 [<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a03f6>] ? final_putname+0x26/0x50
 [<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
 [<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Log errors
Lustre: Skipped 299 previous similar messages
Lustre: soaked-MDT0002: Recovery already passed deadline 137:09. It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
The error is occurring again on the tip of master plus the two patches.
# cat /proc/fs/lustre/mdt/soaked-MDT0002/recovery_status
status: RECOVERING
recovery_start: 1465232959
time_remaining: 0
connected_clients: 20/20
req_replay_clients: 0
lock_repay_clients: 1
completed_clients: 19
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 442381632838
Reformat is ongoing. I'm going to start soak again once it has completed.