[LU-9935] Failover mount hangs on DNE MDT Created: 31/Aug/17 Updated: 26/Oct/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak stress cluster Lustre version=2.10.0_61_g6aabd4a |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Not entirely certain what is going on here, however this is very repeatable, and we have verified that the array is quite healthy. Aug 31 20:01:27 soak-10 kernel: LNet: Service thread pid 3916 was inactive for 200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Aug 31 20:01:27 soak-10 kernel: Pid: 3916, comm: mdt01_027 Aug 31 20:01:27 soak-10 kernel: #012Call Trace: Aug 31 20:01:27 soak-10 kernel: [<ffffffff8168c969>] schedule+0x29/0x70 Aug 31 20:01:27 soak-10 kernel: [<ffffffffa1165b82>] top_trans_wait_result+0xa3/0x14e [ptlrpc] Aug 31 20:01:27 soak-10 kernel: [<ffffffff810c54e0>] ? default_wake_function+0x0/0x20 Aug 31 20:01:27 soak-10 kernel: [<ffffffffa114814b>] top_trans_stop+0x46b/0x970 [ptlrpc] Aug 31 20:01:27 soak-10 kernel: [<ffffffffa15eb569>] lod_trans_stop+0x259/0x340 [lod] Aug 31 20:01:27 soak-10 kernel: [<ffffffffa166f4d5>] ? mdd_changelog_ns_store+0x2e5/0x650 [mdd] Aug 31 20:01:27 soak-10 kernel: [<ffffffffa1687b94>] mdd_trans_stop+0x24/0x40 [mdd] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa1672753>] mdd_create+0x10b3/0x1330 [mdd] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa1538d86>] mdt_create+0x846/0xbb0 [mdt] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa153925b>] mdt_reint_create+0x16b/0x350 [mdt] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa153a760>] mdt_reint_rec+0x80/0x210 [mdt] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa151c2fb>] mdt_reint_internal+0x5fb/0x9c0 [mdt] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa1527e37>] mdt_reint+0x67/0x140 [mdt] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa11348f5>] tgt_request_handle+0x925/0x1370 [ptlrpc] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa10dd2c6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa0d6db97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa10e12a0>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] Aug 31 20:01:28 soak-10 kernel: [<ffffffffa10e0800>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc] Aug 31 20:01:28 soak-10 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0 Aug 31 20:01:28 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 Aug 31 20:01:28 soak-10 kernel: [<ffffffff816978d8>] ret_from_fork+0x58/0x90 Aug 31 20:01:28 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 Aug 31 20:01:28 soak-10 kernel: Aug 31 20:01:28 soak-10 kernel: LustreError: dumping log to /tmp/lustre-log.1504209688.3916 Mount then hangs with -114 -> EALREADY /* Operation already in progress */ Aug 31 20:01:39 soak-10 kernel: LustreError: 11348:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on /dev/mapper/360080e50001fedb80000015752012949: -114 At this time the mount is unkillable. Attempts to un-mount the other MDT also fail Aug 31 20:58:14 soak-10 kernel: Lustre: soaked-MDT0002: Client 040f43a4-0589-a1a1-1a89-accb11c79d99 (at 192.168.1.116@o2ib) reconnecting Aug 31 20:58:14 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 040f43a4-0589-a1a1-1a89-accb11c79d99 (at 192.168.1.116@o2ib) Aug 31 20:58:15 soak-10 kernel: Lustre: Failing over soaked-MDT0002 Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1094:ldlm_resource_complain()) soaked-MDT0000-osp-MDT0002: namespace resource [0x200000bd2:0x6:0x0].0x0 (ffff8808011aad80) refcount nonzero (1) after lock cleanup; forcing cleanup. Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000bd2:0x6:0x0].0x0 (ffff8808011aad80) refcount = 2 Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order): Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1682:ldlm_resource_dump()) ### ### ns: soaked-MDT0000-osp-MDT0002 lock: ffff88078ded8c00/0xdf81cb0d8be12866 lrc: 2/0,1 mode: EX/EX res: [0x200000bd2:0x6:0x0].0x0 bits 0x2 rrc: 3 type: IBT flags: 0x1106401000000 nid: local remote: 0x5fb81c9b7e5be5fe expref: -99 pid: 3916 timeout: 0 lvb_type: 0 Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000bd2:0x6:0x0].0x0 (ffff8808011aad80) refcount = 2 Aug 31 20:58:16 soak-10 kernel: LustreError: 21348:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order): Aug 31 20:58:18 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.136@o2ib (stopping) Aug 31 20:58:19 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.130@o2ib (stopping) Aug 31 20:58:19 soak-10 kernel: Lustre: Skipped 1 previous similar message Aug 31 20:58:22 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 172.16.1.48@o2ib1 (stopping) Aug 31 20:58:22 soak-10 kernel: Lustre: Skipped 3 previous similar messages Aug 31 20:58:24 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.138@o2ib (stopping) Aug 31 20:58:24 soak-10 kernel: Lustre: Skipped 3 previous similar messages Aug 31 20:58:28 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 172.16.1.46@o2ib1 (stopping) Aug 31 20:58:28 soak-10 kernel: Lustre: Skipped 5 previous similar messages Aug 31 20:58:38 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.131@o2ib (stopping) Aug 31 20:58:38 soak-10 kernel: Lustre: Skipped 28 previous similar messages Aug 31 20:58:41 soak-10 kernel: LustreError: 0-0: Forced cleanup waiting for soaked-MDT0000-osp-MDT0002 namespace with 1 resources in use, (rc=-110) Dumped the lustre log both before and after the un-mount attempt. Crash-dumped the system. |
| Comments |
| Comment by Peter Jones [ 01/Sep/17 ] |
|
Lai Could you please advise on this one? Thanks Peter |
| Comment by Cliff White (Inactive) [ 01/Sep/17 ] |
|
There may have been a configuration issue, let me re-test to confirm |
| Comment by Sarah Liu [ 06/Sep/17 ] |
|
I don't see this error after Cliff restarted. Instead soak.log shows on 9/4, when trying to fail back from soak-9 to soak-8, soak-9 hang during umount. Don't find core dump for soak-9 though. soak.log 2017-09-04 07:15:19,080:fsmgmt.fsmgmt:INFO soak-8 is up!!!
2017-09-04 07:15:30,092:fsmgmt.fsmgmt:INFO Failing over soaked-MDT0000 ...
2017-09-04 07:15:30,092:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0000 on soak-9 ...
2017-09-04 07:15:44,501:sched.sched :ERROR soak-44: client still faulty. Please check node. Next check in 180s.
2017-09-04 07:16:22,822:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on soak-9
2017-09-04 07:16:22,823:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 failed over
2017-09-04 07:16:22,823:fsmgmt.fsmgmt:INFO Wait for recovery to complete
2017-09-04 07:16:23,113:fsmgmt.fsmgmt:DEBUG Recovery Result Record: {'soak-9': {'soaked-MDT0001': 'COMPLETE', 'soaked-MDT0000': 'RECOVERING'}}
2017-09-04 07:16:23,113:fsmgmt.fsmgmt:INFO soaked-MDT0000 in status 'RECOVERING'.
...
2017-09-04 07:18:09,934:fsmgmt.fsmgmt:DEBUG Recovery Result Record: {'soak-9': {'soaked-MDT0001': 'COMPLETE', 'soaked-MDT0000': 'COMPLETE'}}
2017-09-04 07:18:09,934:fsmgmt.fsmgmt:INFO Node soak-9: 'soaked-MDT0000' recovery completed
2017-09-04 07:18:09,934:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0000 ...
2017-09-04 07:18:09,935:fsmgmt.fsmgmt:INFO Unmounting soaked-MDT0000 on soak-9 ...
soak-9 syslog Sep 4 07:18:10 soak-9 kernel: Lustre: Failing over soaked-MDT0000 Sep 4 07:18:10 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 0@lo failed: rc = -19 Sep 4 07:18:10 soak-9 kernel: LustreError: Skipped 3 previous similar messages Sep 4 07:18:10 soak-9 kernel: Lustre: soaked-MDT0000-osp-MDT0001: Connection to soaked-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete Sep 4 07:18:10 soak-9 kernel: Lustre: Skipped 1 previous similar message Sep 4 07:18:10 soak-9 kernel: LustreError: 4464:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8803f1172c00 ns: mdt-soaked-MDT0000_UUID lock: ffff880332d2a000/0x72b6987a94ebf754 lrc: 3/0,0 mode: CR/CR res: [0x20000a814:0xc695:0x0].0x0 bits 0x9 rrc: 2 type: IBT flags: 0x50200000000000 nid: 192.168.1.120@o2ib remote: 0x4d92350181a4cd1e expref: 4 pid: 4464 timeout: 0 lvb_type: 0 Sep 4 07:18:11 soak-9 kernel: Lustre: soaked-MDT0000: Not available for connect from 192.168.1.120@o2ib (stopping) Sep 4 07:18:12 soak-9 kernel: Lustre: soaked-MDT0000: Not available for connect from 172.16.1.45@o2ib1 (stopping) Sep 4 07:18:13 soak-9 kernel: LustreError: 4257:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88008e50b000 x1577570293501872/t0(0) o13->soaked-OST000a-osc-MDT0000@192.168.1.106@o2ib:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 Sep 4 07:18:13 soak-9 kernel: LustreError: 4257:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 19 previous similar messages Sep 4 07:18:13 soak-9 kernel: Lustre: soaked-MDT0000: Not available for connect from 192.168.1.107@o2ib (stopping) Sep 4 07:18:13 soak-9 kernel: Lustre: Skipped 1 previous similar message Sep 6 22:49:04 soak-9 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="1763" x-info="http://www.rsyslog.com"] start Sep 6 22:48:28 soak-9 kernel: microcode: microcode updated early to revision 0x710, date = 2013-06-17 Sep 6 22:48:28 soak-9 kernel: Initializing cgroup subsys cpuset Sep 6 22:48:28 soak-9 kernel: Initializing cgroup subsys cpu Sep 6 22:48:28 soak-9 kernel: Initializing cgroup subsys cpuacct |