[LU-6655] MDS LBUG: (ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: RHEL6, during upgrade from 2.5 to 2.7
    • Severity: 3

    Description

      While attempting a Lustre upgrade from 2.5.3 to 2.7.0 on our preproduction file system, we encountered this LBUG after mounting the MDT, while mounting the OSTs for the first time. The first OST mounted fine; the LBUG happened while mounting the second OST.

      There are most likely clients out there that still have the file system mounted and have been trying to reconnect during this time.

      The information below has been extracted from the Red Hat crash log as we didn't have a serial console attached at the time.

      <4>Lustre: 14008:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      <0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed: 
      <0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG
      <4>Pid: 31012, comm: mdt00_001
      <4>
      <4>Call Trace:
      <4> [<ffffffffa03c8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa03c8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0674d10>] target_queue_recovery_request+0xb00/0xc10 [ptlrpc]
      <4> [<ffffffffa06b3c4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      <4> [<ffffffffa0714b3d>] tgt_request_handle+0xe8d/0x1000 [ptlrpc]
      <3>LustreError: 11-0: play01-OST0000-osc-MDT0000: operation ost_connect to node 172.23.144.18@tcp failed: rc = -16
      <4> [<ffffffffa06c45a1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
      <4> [<ffffffffa06c3760>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      <4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      After power cycling the affected MDT and OSS and starting again, we have so far not seen the LBUG again, and recovery on the MDT has completed.

      The only other information I could potentially provide is a vmcore that Red Hat collected automatically as well as more lines from the vmcore-dmesg.txt file if required.
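
      For anyone reading the trace: the "ASSERTION( ... ) failed" and "LBUG" console lines come from Lustre's LASSERT macro; when the asserted condition is false it dumps the stack (the lbug_with_loc/libcfs_debug_dumpstack frames above) and halts the service thread, here mdt00_001. The snippet below is only a heavily simplified, user-space sketch of that failure mode, not the real ldlm_lib.c code: obd_export and ptlrpc_request are reduced to stand-ins carrying just the fields named in the log, LASSERT is replaced by a print-and-abort stand-in, and queue_lock_replay_request() is a hypothetical helper standing in for the lock-replay branch of target_queue_recovery_request().

      #include <stdio.h>
      #include <stdlib.h>

      /* Stand-ins for the real Lustre structures -- illustration only. */
      struct obd_export {
              unsigned int exp_lock_replay_needed:1; /* set while this client is expected to replay locks */
      };

      struct ptlrpc_request {
              struct obd_export *rq_export;          /* connection the request arrived on */
      };

      /* Simplified LASSERT: report the failed expression and kill the thread,
       * which is what produces the two console lines and the stack dump. */
      #define LASSERT(cond)                                                  \
              do {                                                           \
                      if (!(cond)) {                                         \
                              fprintf(stderr, "ASSERTION( %s ) failed:\nLBUG\n", #cond); \
                              abort();                                       \
                      }                                                      \
              } while (0)

      /* Hypothetical helper: roughly the condition the lock-replay branch of
       * target_queue_recovery_request() insists on before queueing the request. */
      static void queue_lock_replay_request(struct ptlrpc_request *req)
      {
              LASSERT(req->rq_export->exp_lock_replay_needed);
              /* ...otherwise the request would be queued for lock replay... */
      }

      int main(void)
      {
              /* A client the server no longer expects to replay locks
               * (exp_lock_replay_needed == 0) reproduces the failure mode. */
              struct obd_export exp = { .exp_lock_replay_needed = 0 };
              struct ptlrpc_request req = { .rq_export = &exp };

              queue_lock_replay_request(&req);
              return 0;
      }

      The point of the sketch is only that the asserted condition depends on per-client connection state, which is why one unexpected reconnect during recovery can take the whole target down rather than just failing a single RPC.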

    Activity


            haisong Haisong Cai (Inactive) added a comment -

            By the way, we are running Lustre FE-2.7.1 with ZFS 0.6.4.2 on CentOS 6.6.7.

            Haisong

            haisong Haisong Cai (Inactive) added a comment -

            We just hit the same LBUG today on our OSS; I found this ticket while searching Jira. What happened to us was: we shut down the OSS gracefully for maintenance while the file system was still running. After we mounted the OSTs back on the OSS, we hit the LBUG within about a minute, apparently during recovery.

            [root@wombat-oss-20-5 ~]#
            Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ...
            kernel:LustreError: 13701:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed:

            Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ...
            kernel:LustreError: 13701:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG
            bobijam Zhenyu Xu added a comment -

            You are right, it's a client patch; a client without this patch connecting to an upgraded server could LBUG the server.
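
            To make the causal chain above concrete: the assertion fires on exp_lock_replay_needed, per-client recovery state that is effectively driven by what the client sends, so a single unpatched client can bring the server down. Purely as an illustration of the hardening direction, and explicitly not the actual LU-5651 change, a handler that refuses the unexpected request instead of asserting would look roughly like the sketch below (it reuses the stand-in types from the sketch in the description; queue_lock_replay_request_safe() and the -ENOTCONN return are assumptions of this sketch, not Lustre's choice).

            #include <errno.h>
            #include <stdio.h>

            /* Hypothetical hardened variant: treat client-driven state as an
             * error to return to that one client, not a condition to LASSERT on. */
            static int queue_lock_replay_request_safe(struct ptlrpc_request *req)
            {
                    if (!req->rq_export->exp_lock_replay_needed) {
                            /* an old/unpatched client sent a lock-replay request
                             * the server does not expect; refuse it instead of
                             * LBUGging the whole target */
                            fprintf(stderr,
                                    "unexpected LOCK_REPLAY request, refusing: rc = %d\n",
                                    -ENOTCONN);
                            return -ENOTCONN;
                    }
                    /* ...queue the request for lock replay as before... */
                    return 0;
            }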

            ferner Frederik Ferner (Inactive) added a comment -

            I had looked at LU-5651 but initially didn't think it was the same issue, as all our servers had been upgraded. Reading it again, I'm now suspecting there's a client-side patch which is not on our clients yet, so you might be right. Could I check that I read this right?

            Cheers,
            Frederik
            bobijam Zhenyu Xu added a comment -

            I think it's a dup of LU-5651. With all nodes upgraded to 2.7, the issue should be gone.
            pjones Peter Jones added a comment -

            Bobijam

            Could you please assist with this one?

            Thanks

            Peter

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 6
