[LU-6138] lfsck_namespace not progressing and lfsck_stop hangs Created: 20/Jan/15 Updated: 05/Aug/20 Resolved: 23/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lfsck | ||
| Environment: |
OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre tag 2.6.92 build #2812 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 17107 | ||||||||
| Description |
|
I ran a small version of the Stability test in the LFSCK test plan; created 5 subdirectories with 100 objects in each. One proc runs LFSCK namespace, one deletes existing files and all other procs create more directories and files. LFSCK namespace completed on MDT01, but was stuck on MDT02 with status: scanning-phase2 After a few minutes, I ran the following on MDS01: lctl lfsck_stop -A -M scratch-MDT0000 This command did not and has not returned for over 30 minutes. So, I tried to stop LFSCK on MDS02: lctl lfsck_stop -A -M scratch-MDT0001 and this command has not returned. The dmesg logs complain about losing contact with a client, but, from the user perspective, all nodes look up and responsive when I get on them. From MDS02, containing MDT02 that LFSCK namespace did not complete, dmesg contains: LustreError: 11-0: scratch-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.2.125@o2ib failed: rc = -11
Lustre: scratch-MDT0001: haven't heard from client d9f65e0e-5d98-7dfb-985b-592359f9067c (at 192.168.2.111@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff880292852c00, cur 1421640751 expire 1421640601 last 1421640517
Lustre: cli-ctl-scratch-MDT0001: Allocated super-sequence [0x00000003c0000400-0x0000000400000400):1:mdt]
LNet: Service thread pid 32420 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 32420, comm: mdt_out02_000
Call Trace:
[<ffffffff81061d12>] ? default_wake_function+0x12/0x20
[<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
[<ffffffffa0ebd14c>] lfsck_stop+0x34c/0x3f0 [lfsck]
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffffa091a842>] ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
[<ffffffffa0ebe9b0>] lfsck_in_notify+0x2a0/0x330 [lfsck]
[<ffffffffa094d694>] tgt_handle_lfsck_notify+0x64/0x150 [ptlrpc]
[<ffffffffa095597e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
[<ffffffffa0905711>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
[<ffffffff81069290>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff81529466>] ? schedule+0x176/0x3b0
[<ffffffffa09048d0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ab60>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
LustreError: dumping log to /tmp/lustre-log.1421730002.32420
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 5459:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
req@ffff880c5f8d46c0 x1490612292269532/t0(0) o1101->scratch-MDT0000-mdtlov_UUID@192.168.2.125@o2ib:471/0 lens 320/224 e 24 to 0 dl 1421730401 ref 2 fl Interpret:/0/0 rc 0/0
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: scratch-MDT0001: Client scratch-MDT0000-mdtlov_UUID (at 192.168.2.125@o2ib) reconnecting
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 000000000000000d 0 5236 8921 0x00000080
ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
[<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152acfb>] mutex_lock+0x2b/0x50
[<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
[<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
[<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
[<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
[<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
[<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff81196dd6>] ? final_putname+0x26/0x50
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: scratch-MDT0001: Client scratch-MDT0000-mdtlov_UUID (at 192.168.2.125@o2ib) reconnecting
On MDS01, containing the MDS01 that LFSCK did complete, dmesg contains: LustreError: 11-0: scratch-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.2.126@o2ib failed: rc = -11 Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:ost Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000280000400-0x00000002c0000400):4:ost Lustre: Skipped 1 previous similar message Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000300000400-0x0000000340000400):3:ost Lustre: Skipped 1 previous similar message Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000340000400-0x0000000380000400):5:ost Lustre: MGS: haven't heard from client b55379e3-6789-caf6-3733-61f9e03b0716 (at 192.168.2.111@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff8807fed38000, cur 1421640751 expire 1421640601 last 1421640517 Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000380000400-0x00000003c0000400):0:mdt Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x00000003c0000400-0x0000000400000400):1:mdt LustreError: 2296:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066 LustreError: 2297:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066 LustreError: 2300:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066 Lustre: 2301:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1421729801/real 1421729801] req@ffff880796e3a6c0 x1490612292269532/t0(0) o1101->scratch-MDT0001-osp-MDT0000@192.168.2.126@o2ib:24/4 lens 320/224 e 24 to 1 dl 1421730557 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Lustre: scratch-MDT0001-osp-MDT0000: Connection to scratch-MDT0001 (at 192.168.2.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete Lustre: scratch-MDT0001-osp-MDT0000: Connection restored to scratch-MDT0001 (at 192.168.2.126@o2ib) Lustre: 2301:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1421730557/real 1421730557] req@ffff880796e3a6c0 x1490612292269532/t0(0) o1101->scratch-MDT0001-osp-MDT0000@192.168.2.126@o2ib:24/4 lens 320/224 e 24 to 1 dl 1421731313 ref 1 fl Rpc:X/2/ffffffff rc 0/-1 Lustre: scratch-MDT0001-osp-MDT0000: Connection to scratch-MDT0001 (at 192.168.2.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete Lustre: scratch-MDT0001-osp-MDT0000: Connection restored to scratch-MDT0001 (at 192.168.2.126@o2ib) I will upload the log from /tmp/lustre-log.1421730002.32420 from MDS02. I have not tried to reproduce this error. |
| Comments |
| Comment by James Nunez (Inactive) [ 21/Jan/15 ] |
|
I've just uploaded the log described in the Description of this ticket to uploads/ Also, I cannot reproduce this problem. I've tried 5 or 6 times and cannot get LFSCK to hang. |
| Comment by James Nunez (Inactive) [ 21/Jan/15 ] |
|
I spoke too soon; I have a hung 'lctl lfsck_stop' on mds01 as described above. Looking at dmesg on mds02, there is a stack trace: Lustre: shutting down debug daemon thread...
INFO: task lfsck_namespace:9249 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lfsck_namespa D 0000000000000004 0 9249 2 0x00000080
ffff8801c9cbbbc0 0000000000000046 0000000000000000 ffff8801c9cbbd20
ffff8801c9cbbd20 ffff880acdf09000 ffff8801c9cbbbc0 ffffffffa06d37e9
ffff880825bf6638 ffff8801c9cbbfd8 000000000000fbc8 ffff880825bf6638
Call Trace:
[<ffffffffa06d37e9>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
[<ffffffffa06d3a3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffffa06d28ff>] ? lu_object_put+0x1cf/0x330 [obdclass]
[<ffffffffa06d3b1f>] lu_object_find_slice+0x1f/0x80 [obdclass]
[<ffffffffa0ee5669>] lfsck_namespace_assistant_handler_p2+0xb19/0x11b0 [lfsck]
[<ffffffff81083f30>] ? process_timeout+0x0/0x10
[<ffffffffa0ec7399>] lfsck_assistant_engine+0x14e9/0x1e00 [lfsck]
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffffa0ec5eb0>] ? lfsck_assistant_engine+0x0/0x1e00 [lfsck]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ab60>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Lustre: debug daemon will attempt to start writing to /tmp/lfsck_log_2 (1048576kB max)
I've uploaded two kernel logs from each MDS to uploads/ |
| Comment by nasf (Inactive) [ 23/Jan/15 ] |
|
James, I think the last failure is not the same as the original |
| Comment by James Nunez (Inactive) [ 03/Feb/15 ] |
|
I've run this test again with the latest master 2.6.93 tag and ran into the same problem; run LFSCK under a create/remove workload and, 'lctl lfsck_stop' hangs. We may need test with a fix to |
| Comment by nasf (Inactive) [ 04/Feb/15 ] |
|
There another two potential reasons may cause the LFSCK hung: 1) self-deadlock on OST that may cause the LFSCK on the OST cannot exit, it has been fixed by the patch http://review.whamcloud.com/#/c/13525/ |
| Comment by Gerrit Updater [ 06/Feb/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13666 |
| Comment by Gerrit Updater [ 18/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13666/ |
| Comment by nasf (Inactive) [ 19/Feb/15 ] |
|
We cannot close this ticket until |
| Comment by Andreas Dilger [ 20/Feb/15 ] |
| Comment by Gerrit Updater [ 23/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13818/ |
| Comment by Jodi Levi (Inactive) [ 23/Feb/15 ] |
|
Patch landed to master. |