[LU-6138] lfsck_namespace not progressing and lfsck_stop hangs Created: 20/Jan/15  Updated: 05/Aug/20  Resolved: 23/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: lfsck
Environment:

OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre tag 2.6.92 build #2812


Issue Links:
Related
is related to LU-5791 LFSCK 5: use bottom object for consis... Resolved
Severity: 3
Rank (Obsolete): 17107

 Description   

I ran a small version of the Stability test in the LFSCK test plan; created 5 subdirectories with 100 objects in each. One proc runs LFSCK namespace, one deletes existing files and all other procs create more directories and files.

LFSCK namespace completed on MDT01, but was stuck on MDT02 with

status: scanning-phase2

After a few minutes, I ran the following on MDS01:

lctl lfsck_stop -A -M scratch-MDT0000

This command did not and has not returned for over 30 minutes. So, I tried to stop LFSCK on MDS02:

lctl lfsck_stop -A -M scratch-MDT0001

and this command has not returned.

The dmesg logs complain about losing contact with a client, but, from the user perspective, all nodes look up and responsive when I get on them.

From MDS02, containing MDT02 that LFSCK namespace did not complete, dmesg contains:

LustreError: 11-0: scratch-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.2.125@o2ib failed: rc = -11
Lustre: scratch-MDT0001: haven't heard from client d9f65e0e-5d98-7dfb-985b-592359f9067c (at 192.168.2.111@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff880292852c00, cur 1421640751 expire 1421640601 last 1421640517
Lustre: cli-ctl-scratch-MDT0001: Allocated super-sequence [0x00000003c0000400-0x0000000400000400):1:mdt]
LNet: Service thread pid 32420 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 32420, comm: mdt_out02_000

Call Trace:
 [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
 [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0ebd14c>] lfsck_stop+0x34c/0x3f0 [lfsck]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa091a842>] ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
 [<ffffffffa0ebe9b0>] lfsck_in_notify+0x2a0/0x330 [lfsck]
 [<ffffffffa094d694>] tgt_handle_lfsck_notify+0x64/0x150 [ptlrpc]
 [<ffffffffa095597e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa0905711>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffff81069290>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81529466>] ? schedule+0x176/0x3b0
 [<ffffffffa09048d0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1421730002.32420
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 5459:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
  req@ffff880c5f8d46c0 x1490612292269532/t0(0) o1101->scratch-MDT0000-mdtlov_UUID@192.168.2.125@o2ib:471/0 lens 320/224 e 24 to 0 dl 1421730401 ref 2 fl Interpret:/0/0 rc 0/0
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: scratch-MDT0001: Client scratch-MDT0000-mdtlov_UUID (at 192.168.2.125@o2ib) reconnecting
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task lctl:5236 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 000000000000000d     0  5236   8921 0x00000080
 ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
 ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
 ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
Call Trace:
 [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
 [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
 [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
 [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
 [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
 [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
 [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81196dd6>] ? final_putname+0x26/0x50
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: scratch-MDT0001: Client scratch-MDT0000-mdtlov_UUID (at 192.168.2.125@o2ib) reconnecting

On MDS01, containing the MDS01 that LFSCK did complete, dmesg contains:

LustreError: 11-0: scratch-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.2.126@o2ib failed: rc = -11
Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:ost
Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000280000400-0x00000002c0000400):4:ost
Lustre: Skipped 1 previous similar message
Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000300000400-0x0000000340000400):3:ost
Lustre: Skipped 1 previous similar message
Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000340000400-0x0000000380000400):5:ost
Lustre: MGS: haven't heard from client b55379e3-6789-caf6-3733-61f9e03b0716 (at 192.168.2.111@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff8807fed38000, cur 1421640751 expire 1421640601 last 1421640517
Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000380000400-0x00000003c0000400):0:mdt
Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x00000003c0000400-0x0000000400000400):1:mdt
LustreError: 2296:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066
LustreError: 2297:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066
LustreError: 2300:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066
Lustre: 2301:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1421729801/real 1421729801]  req@ffff880796e3a6c0 x1490612292269532/t0(0) o1101->scratch-MDT0001-osp-MDT0000@192.168.2.126@o2ib:24/4 lens 320/224 e 24 to 1 dl 1421730557 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Lustre: scratch-MDT0001-osp-MDT0000: Connection to scratch-MDT0001 (at 192.168.2.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Lustre: scratch-MDT0001-osp-MDT0000: Connection restored to scratch-MDT0001 (at 192.168.2.126@o2ib)
Lustre: 2301:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1421730557/real 1421730557]  req@ffff880796e3a6c0 x1490612292269532/t0(0) o1101->scratch-MDT0001-osp-MDT0000@192.168.2.126@o2ib:24/4 lens 320/224 e 24 to 1 dl 1421731313 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Lustre: scratch-MDT0001-osp-MDT0000: Connection to scratch-MDT0001 (at 192.168.2.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Lustre: scratch-MDT0001-osp-MDT0000: Connection restored to scratch-MDT0001 (at 192.168.2.126@o2ib)

I will upload the log from /tmp/lustre-log.1421730002.32420 from MDS02.

I have not tried to reproduce this error.



 Comments   
Comment by James Nunez (Inactive) [ 21/Jan/15 ]

I've just uploaded the log described in the Description of this ticket to uploads/LU-6138.

Also, I cannot reproduce this problem. I've tried 5 or 6 times and cannot get LFSCK to hang.

Comment by James Nunez (Inactive) [ 21/Jan/15 ]

I spoke too soon; I have a hung 'lctl lfsck_stop' on mds01 as described above. Looking at dmesg on mds02, there is a stack trace:

Lustre: shutting down debug daemon thread...
INFO: task lfsck_namespace:9249 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lfsck_namespa D 0000000000000004     0  9249      2 0x00000080
 ffff8801c9cbbbc0 0000000000000046 0000000000000000 ffff8801c9cbbd20
 ffff8801c9cbbd20 ffff880acdf09000 ffff8801c9cbbbc0 ffffffffa06d37e9
 ffff880825bf6638 ffff8801c9cbbfd8 000000000000fbc8 ffff880825bf6638
Call Trace:
 [<ffffffffa06d37e9>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
 [<ffffffffa06d3a3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa06d28ff>] ? lu_object_put+0x1cf/0x330 [obdclass]
 [<ffffffffa06d3b1f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa0ee5669>] lfsck_namespace_assistant_handler_p2+0xb19/0x11b0 [lfsck]
 [<ffffffff81083f30>] ? process_timeout+0x0/0x10
 [<ffffffffa0ec7399>] lfsck_assistant_engine+0x14e9/0x1e00 [lfsck]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0ec5eb0>] ? lfsck_assistant_engine+0x0/0x1e00 [lfsck]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Lustre: debug daemon will attempt to start writing to /tmp/lfsck_log_2 (1048576kB max)

I've uploaded two kernel logs from each MDS to uploads/LU-6138; lfsck_log_mds01.txt and lfsck_log_mds02.txt. These logs were taken after I noticed that LFSCK namespace had not completed on MDS02.

Comment by nasf (Inactive) [ 23/Jan/15 ]

James, I think the last failure is not the same as the original LU-6138, it is another failure instance of LU-6109 because of the LFSCK engine was blocked during object_find. So I think that either the original LU-6138 is hidden by LU-6019, or they are duplication.

Comment by James Nunez (Inactive) [ 03/Feb/15 ]

I've run this test again with the latest master 2.6.93 tag and ran into the same problem; run LFSCK under a create/remove workload and, 'lctl lfsck_stop' hangs. We may need test with a fix to LU-6109 before I can see if this ticket is still an issue or not.

Comment by nasf (Inactive) [ 04/Feb/15 ]

There another two potential reasons may cause the LFSCK hung:

1) self-deadlock on OST that may cause the LFSCK on the OST cannot exit, it has been fixed by the patch http://review.whamcloud.com/#/c/13525/
2) because the LFSCK may hold reference the MDT-object for a long time, then if someone unlinked the file/dir and marked it as dying, then when the LFSCK try to locate the same MDT-object again, it will be blocked there by its former reference. I will make patch to fix that.

Comment by Gerrit Updater [ 06/Feb/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13666
Subject: LU-6138 lfsck: NOT hold reference on pre-loaded object
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9b7549a222ccc7c6726800175f00259adb65778e

Comment by Gerrit Updater [ 18/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13666/
Subject: LU-6138 lfsck: NOT hold reference on pre-loaded object
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 755c04eba3f52245ecddd5c0d7a205988181d7d1

Comment by nasf (Inactive) [ 19/Feb/15 ]

We cannot close this ticket until LU-5791 resolved.

Comment by Andreas Dilger [ 20/Feb/15 ]

http://review.whamcloud.com/13818

Comment by Gerrit Updater [ 23/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13818/
Subject: LU-6138 lfsck: set async windows size properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bcabadc92e1212b8397438e798ba7d47ee67f5fc

Comment by Jodi Levi (Inactive) [ 23/Feb/15 ]

Patch landed to master.

Generated at Sat Feb 10 01:57:32 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.