Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.1
- Environment: Soak stress cluster, Lustre version=2.10.0_61_g6aabd4a
- Severity: 3
Description
We are not entirely certain what is going on here; however, it is very repeatable, and we have verified that the array is quite healthy.
Steps:
1. soak-11 is power-cycled for a failover test.
2. soak-10 attempts to mount the failed MDT (MDT0003).
3. soak-10 reports a few odd errors and dumps a Lustre log:
Aug 31 20:01:27 soak-10 kernel: LNet: Service thread pid 3916 was inactive for 200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 31 20:01:27 soak-10 kernel: Pid: 3916, comm: mdt01_027
Aug 31 20:01:27 soak-10 kernel: Call Trace:
Aug 31 20:01:27 soak-10 kernel: [<ffffffff8168c969>] schedule+0x29/0x70
Aug 31 20:01:27 soak-10 kernel: [<ffffffffa1165b82>] top_trans_wait_result+0xa3/0x14e [ptlrpc]
Aug 31 20:01:27 soak-10 kernel: [<ffffffff810c54e0>] ? default_wake_function+0x0/0x20
Aug 31 20:01:27 soak-10 kernel: [<ffffffffa114814b>] top_trans_stop+0x46b/0x970 [ptlrpc]
Aug 31 20:01:27 soak-10 kernel: [<ffffffffa15eb569>] lod_trans_stop+0x259/0x340 [lod]
Aug 31 20:01:27 soak-10 kernel: [<ffffffffa166f4d5>] ? mdd_changelog_ns_store+0x2e5/0x650 [mdd]
Aug 31 20:01:27 soak-10 kernel: [<ffffffffa1687b94>] mdd_trans_stop+0x24/0x40 [mdd]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa1672753>] mdd_create+0x10b3/0x1330 [mdd]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa1538d86>] mdt_create+0x846/0xbb0 [mdt]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa153925b>] mdt_reint_create+0x16b/0x350 [mdt]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa153a760>] mdt_reint_rec+0x80/0x210 [mdt]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa151c2fb>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa1527e37>] mdt_reint+0x67/0x140 [mdt]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa11348f5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa10dd2c6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa0d6db97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa10e12a0>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
Aug 31 20:01:28 soak-10 kernel: [<ffffffffa10e0800>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
Aug 31 20:01:28 soak-10 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Aug 31 20:01:28 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
Aug 31 20:01:28 soak-10 kernel: [<ffffffff816978d8>] ret_from_fork+0x58/0x90
Aug 31 20:01:28 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
Aug 31 20:01:28 soak-10 kernel: LustreError: dumping log to /tmp/lustre-log.1504209688.3916
The mount then hangs with -114 (EALREADY, "Operation already in progress"):
Aug 31 20:01:39 soak-10 kernel: LustreError: 11348:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on /dev/mapper/360080e50001fedb80000015752012949: -114
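For reference, the negative return codes in these logs are Linux errno values. The -114 above and the -110 in the later forced-cleanup message can be decoded with a quick sketch using Python's standard errno module (errno numbering is Linux-specific; these names do not hold on BSD-family systems):

```python
import errno
import os

# Lustre logs report errors as negative Linux errno values.
# Decode the two codes seen in this report: -114 and -110.
for rc in (-114, -110):
    name = errno.errorcode.get(-rc, "unknown")   # symbolic name, e.g. EALREADY
    print(f"{rc} = {name}: {os.strerror(-rc)}")
```

On a Linux box this should report -114 as EALREADY ("Operation already in progress") and -110 as ETIMEDOUT ("Connection timed out"), matching the server_fill_super() failure above and the forced-cleanup timeout below.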
At this point the mount is unkillable. Attempts to unmount the other MDT also fail:
Aug 31 20:58:14 soak-10 kernel: Lustre: soaked-MDT0002: Client 040f43a4-0589-a1a1-1a89-accb11c79d99 (at 192.168.1.116@o2ib) reconnecting
Aug 31 20:58:14 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 040f43a4-0589-a1a1-1a89-accb11c79d99 (at 192.168.1.116@o2ib)
Aug 31 20:58:15 soak-10 kernel: Lustre: Failing over soaked-MDT0002
Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1094:ldlm_resource_complain()) soaked-MDT0000-osp-MDT0002: namespace resource [0x200000bd2:0x6:0x0].0x0 (ffff8808011aad80) refcount nonzero (1) after lock cleanup; forcing cleanup.
Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000bd2:0x6:0x0].0x0 (ffff8808011aad80) refcount = 2
Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1682:ldlm_resource_dump()) ### ### ns: soaked-MDT0000-osp-MDT0002 lock: ffff88078ded8c00/0xdf81cb0d8be12866 lrc: 2/0,1 mode: EX/EX res: [0x200000bd2:0x6:0x0].0x0 bits 0x2 rrc: 3 type: IBT flags: 0x1106401000000 nid: local remote: 0x5fb81c9b7e5be5fe expref: -99 pid: 3916 timeout: 0 lvb_type: 0
Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000bd2:0x6:0x0].0x0 (ffff8808011aad80) refcount = 2
Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
Aug 31 20:58:18 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.136@o2ib (stopping)
Aug 31 20:58:19 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.130@o2ib (stopping)
Aug 31 20:58:19 soak-10 kernel: Lustre: Skipped 1 previous similar message
Aug 31 20:58:22 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 172.16.1.48@o2ib1 (stopping)
Aug 31 20:58:22 soak-10 kernel: Lustre: Skipped 3 previous similar messages
Aug 31 20:58:24 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.138@o2ib (stopping)
Aug 31 20:58:24 soak-10 kernel: Lustre: Skipped 3 previous similar messages
Aug 31 20:58:28 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 172.16.1.46@o2ib1 (stopping)
Aug 31 20:58:28 soak-10 kernel: Lustre: Skipped 5 previous similar messages
Aug 31 20:58:38 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.131@o2ib (stopping)
Aug 31 20:58:38 soak-10 kernel: Lustre: Skipped 28 previous similar messages
Aug 31 20:58:41 soak-10 kernel: LustreError: 0-0: Forced cleanup waiting for soaked-MDT0000-osp-MDT0002 namespace with 1 resources in use, (rc=-110)
We dumped the Lustre log both before and after the unmount attempt, then crash-dumped the system.
When the system returned from the crash dump/reboot, all devices mounted fine.
We then ran for a while with only the mds_restart test, so devices were unmounted and remounted on the same node. This worked fine; the wedged state only seems to occur during failover.
Multiple Lustre logs, the system log, and full stack traces are attached.