Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6138

lfsck_namespace not progressing and lfsck_stop hangs

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0
    • Lustre 2.7.0
    • OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre tag 2.6.92 build #2812
    • 3
    • 17107

    Description

      I ran a small version of the Stability test in the LFSCK test plan; created 5 subdirectories with 100 objects in each. One proc runs LFSCK namespace, one deletes existing files and all other procs create more directories and files.

      LFSCK namespace completed on MDT01, but was stuck on MDT02 with

      status: scanning-phase2
      

      After a few minutes, I ran the following on MDS01:

      lctl lfsck_stop -A -M scratch-MDT0000
      

      This command did not and has not returned for over 30 minutes. So, I tried to stop LFSCK on MDS02:

      lctl lfsck_stop -A -M scratch-MDT0001
      

      and this command has not returned.

      The dmesg logs complain about losing contact with a client, but, from the user perspective, all nodes look up and responsive when I get on them.

      From MDS02, containing MDT02 that LFSCK namespace did not complete, dmesg contains:

      LustreError: 11-0: scratch-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.2.125@o2ib failed: rc = -11
      Lustre: scratch-MDT0001: haven't heard from client d9f65e0e-5d98-7dfb-985b-592359f9067c (at 192.168.2.111@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff880292852c00, cur 1421640751 expire 1421640601 last 1421640517
      Lustre: cli-ctl-scratch-MDT0001: Allocated super-sequence [0x00000003c0000400-0x0000000400000400):1:mdt]
      LNet: Service thread pid 32420 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 32420, comm: mdt_out02_000
      
      Call Trace:
       [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
       [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
       [<ffffffffa0ebd14c>] lfsck_stop+0x34c/0x3f0 [lfsck]
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa091a842>] ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
       [<ffffffffa0ebe9b0>] lfsck_in_notify+0x2a0/0x330 [lfsck]
       [<ffffffffa094d694>] tgt_handle_lfsck_notify+0x64/0x150 [ptlrpc]
       [<ffffffffa095597e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
       [<ffffffffa0905711>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
       [<ffffffff81069290>] ? pick_next_task_fair+0xd0/0x130
       [<ffffffff81529466>] ? schedule+0x176/0x3b0
       [<ffffffffa09048d0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
       [<ffffffff8109abf6>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109ab60>] ? kthread+0x0/0xa0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      
      LustreError: dumping log to /tmp/lustre-log.1421730002.32420
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: 5459:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
        req@ffff880c5f8d46c0 x1490612292269532/t0(0) o1101->scratch-MDT0000-mdtlov_UUID@192.168.2.125@o2ib:471/0 lens 320/224 e 24 to 0 dl 1421730401 ref 2 fl Interpret:/0/0 rc 0/0
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: scratch-MDT0001: Client scratch-MDT0000-mdtlov_UUID (at 192.168.2.125@o2ib) reconnecting
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task lctl:5236 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.gefdacb7.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      lctl          D 000000000000000d     0  5236   8921 0x00000080
       ffff88102b211b58 0000000000000086 0000000000000000 ffff88083fd304c0
       ffff88102b211ae8 ffff88083fd304c0 0000000000000000 ffff88083fd304c0
       ffff880c97827058 ffff88102b211fd8 000000000000fbc8 ffff880c97827058
      Call Trace:
       [<ffffffff8152ae5e>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152acfb>] mutex_lock+0x2b/0x50
       [<ffffffffa0ebcf2f>] lfsck_stop+0x12f/0x3f0 [lfsck]
       [<ffffffffa1130c7b>] ? osp_key_init+0x6b/0x190 [osp]
       [<ffffffffa06cf30f>] ? keys_fill+0x6f/0x190 [obdclass]
       [<ffffffffa10c74cf>] mdd_iocontrol+0x1ff/0xb70 [mdd]
       [<ffffffffa06d3c23>] ? lu_context_init+0xa3/0x240 [obdclass]
       [<ffffffffa0f7f05e>] mdt_iocontrol+0x6be/0x8e0 [mdt]
       [<ffffffffa069fad6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa069b77c>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
       [<ffffffff8119c3c8>] ? do_filp_open+0x798/0xd20
       [<ffffffffa06822ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
       [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81196dd6>] ? final_putname+0x26/0x50
       [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
       [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: scratch-MDT0001: Client scratch-MDT0000-mdtlov_UUID (at 192.168.2.125@o2ib) reconnecting
      

      On MDS01, containing the MDS01 that LFSCK did complete, dmesg contains:

      LustreError: 11-0: scratch-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.2.126@o2ib failed: rc = -11
      Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:ost
      Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000280000400-0x00000002c0000400):4:ost
      Lustre: Skipped 1 previous similar message
      Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000300000400-0x0000000340000400):3:ost
      Lustre: Skipped 1 previous similar message
      Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000340000400-0x0000000380000400):5:ost
      Lustre: MGS: haven't heard from client b55379e3-6789-caf6-3733-61f9e03b0716 (at 192.168.2.111@o2ib) in 234 seconds. I think it's dead, and I am evicting it. exp ffff8807fed38000, cur 1421640751 expire 1421640601 last 1421640517
      Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x0000000380000400-0x00000003c0000400):0:mdt
      Lustre: ctl-scratch-MDT0000: super-sequence allocation rc = 0 [0x00000003c0000400-0x0000000400000400):1:mdt
      LustreError: 2296:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066
      LustreError: 2297:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066
      LustreError: 2300:0:(class_obd.c:397:class_handle_ioctl()) OBD ioctl : No Device -12066
      Lustre: 2301:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1421729801/real 1421729801]  req@ffff880796e3a6c0 x1490612292269532/t0(0) o1101->scratch-MDT0001-osp-MDT0000@192.168.2.126@o2ib:24/4 lens 320/224 e 24 to 1 dl 1421730557 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      Lustre: scratch-MDT0001-osp-MDT0000: Connection to scratch-MDT0001 (at 192.168.2.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: scratch-MDT0001-osp-MDT0000: Connection restored to scratch-MDT0001 (at 192.168.2.126@o2ib)
      Lustre: 2301:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1421730557/real 1421730557]  req@ffff880796e3a6c0 x1490612292269532/t0(0) o1101->scratch-MDT0001-osp-MDT0000@192.168.2.126@o2ib:24/4 lens 320/224 e 24 to 1 dl 1421731313 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      Lustre: scratch-MDT0001-osp-MDT0000: Connection to scratch-MDT0001 (at 192.168.2.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: scratch-MDT0001-osp-MDT0000: Connection restored to scratch-MDT0001 (at 192.168.2.126@o2ib)
      

      I will upload the log from /tmp/lustre-log.1421730002.32420 from MDS02.

      I have not tried to reproduce this error.

      Attachments

        Issue Links

          Activity

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: