Lustre / LU-6655

MDS LBUG: (ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: RHEL6, during upgrade from 2.5 to 2.7
    • Severity: 3

    Description

      While attempting a Lustre upgrade from 2.5.3 to 2.7.0 on our preproduction file system, we hit this LBUG when mounting the OSTs for the first time after the MDT had been mounted. The first OST mounted fine; the LBUG happened while the second OST was being mounted.

      There are most likely clients that still had the file system mounted and have been trying to reconnect during this time.

      The information below has been extracted from the Red Hat crash log, as we didn't have a serial console attached at the time.

      <4>Lustre: 14008:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      <0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed: 
      <0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG
      <4>Pid: 31012, comm: mdt00_001
      <4>
      <4>Call Trace:
      <4> [<ffffffffa03c8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa03c8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0674d10>] target_queue_recovery_request+0xb00/0xc10 [ptlrpc]
      <4> [<ffffffffa06b3c4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      <4> [<ffffffffa0714b3d>] tgt_request_handle+0xe8d/0x1000 [ptlrpc]
      <3>LustreError: 11-0: play01-OST0000-osc-MDT0000: operation ost_connect to node 172.23.144.18@tcp failed: rc = -16
      <4> [<ffffffffa06c45a1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
      <4> [<ffffffffa06c3760>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      <4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
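
      For context on what the assertion is checking: during recovery, target_queue_recovery_request() sorts incoming replay requests into stages, and when it queues a request for the lock-replay stage it expects the sending client's export to still be flagged as needing lock replay (exp_lock_replay_needed). The LBUG means a lock-replay request arrived from an export whose flag was no longer set. The sketch below is a simplified userspace model of that kind of check, not the actual ldlm_lib.c code; the struct and function names are invented for illustration only.

      /*
       * Simplified userspace model of the failed check, NOT the actual
       * Lustre ldlm_lib.c code; names are hypothetical.
       */
      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* stand-in for the relevant bits of obd_export */
      struct export {
              bool lock_replay_needed;        /* ~ exp_lock_replay_needed */
      };

      /* stand-in for the relevant bits of ptlrpc_request */
      struct request {
              struct export *rq_export;
              bool           is_lock_replay;  /* request belongs to the lock-replay stage */
      };

      static void queue_recovery_request(struct request *req)
      {
              if (req->is_lock_replay) {
                      /* the check that fires as the LBUG in the log above */
                      assert(req->rq_export->lock_replay_needed);
                      printf("queued lock replay request\n");
              }
      }

      int main(void)
      {
              struct export exp = { .lock_replay_needed = true };
              struct request req = { .rq_export = &exp, .is_lock_replay = true };

              queue_recovery_request(&req);   /* fine: export still needs lock replay */

              exp.lock_replay_needed = false; /* e.g. lock replay already finished for this export */
              queue_recovery_request(&req);   /* assertion fails, analogous to the LBUG */
              return 0;
      }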
      

      After power cycling the affected MDS and OSS and starting them again, we have so far not seen the LBUG again, and recovery on the MDT has completed.

      The only other information I could potentially provide is the vmcore that the Red Hat crash tooling collected automatically, as well as more lines from the vmcore-dmesg.txt file, if required.


            People

              Assignee: Zhenyu Xu (bobijam)
              Reporter: Frederik Ferner (ferner) (Inactive)
              Votes: 0
              Watchers: 6
