[LU-6655] MDS LBUG: (ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: RHEL6, during upgrade from 2.5 to 2.7
    • Severity: 3

    Description

      While attempting a Lustre upgrade from 2.5.3 to 2.7.0 on our preproduction file system, we hit this LBUG after first mounting the MDT, while mounting the OSTs for the first time. The first OST mounted fine; the LBUG happened while the second OST was mounting.

      Most likely there are clients out there that never unmounted the file system and have been trying to reconnect during this time.

      The information below has been extracted from the Red Hat crash log as we didn't have a serial console attached at the time.

      <4>Lustre: 14008:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      <0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed: 
      <0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG
      <4>Pid: 31012, comm: mdt00_001
      <4>
      <4>Call Trace:
      <4> [<ffffffffa03c8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa03c8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0674d10>] target_queue_recovery_request+0xb00/0xc10 [ptlrpc]
      <4> [<ffffffffa06b3c4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      <4> [<ffffffffa0714b3d>] tgt_request_handle+0xe8d/0x1000 [ptlrpc]
      <3>LustreError: 11-0: play01-OST0000-osc-MDT0000: operation ost_connect to node 172.23.144.18@tcp failed: rc = -16
      <4> [<ffffffffa06c45a1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
      <4> [<ffffffffa06c3760>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      <4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      After power cycling the affected MDT and OSS and starting again, we have so far not seen the LBUG again, and recovery on the MDT has completed.

      The only other information I could potentially provide is a vmcore that Red Hat's crash tooling collected automatically, as well as more lines from the vmcore-dmesg.txt file, if required.
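      For reference, a sketch of pulling the kernel log out of such a vmcore with the crash utility; the vmlinux and vmcore paths below are hypothetical, and the debuginfo vmlinux must match the crashed kernel:

          # Open the dump against a matching debuginfo vmlinux (paths hypothetical)
          crash /usr/lib/debug/lib/modules/<crashed-kernel>/vmlinux /var/crash/<timestamp>/vmcore
          # Inside the crash session, write the kernel ring buffer to a file
          crash> log > vmcore-dmesg.txt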

Activity
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23205/
            Subject: LU-6655 ptlrpc: skip delayed replay requests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c1d465de13ccf0eda8020c88661c3cc4d78538ca
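            For anyone tracking the fix, one way to check which release tags include that merge, assuming a local clone of fs/lustre-release:

                # List release tags that contain the merged fix commit
                git tag --contains c1d465de13ccf0eda8020c88661c3cc4d78538ca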


            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23205
            Subject: LU-6655 ptlrpc: skip delayed replay requests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c819f1ca2d3c12348d1e7e779500d7f774f923f7

            bobijam Zhenyu Xu added a comment -

            As a workaround, you can mount the OST with recovery disabled ("-o abort_recov").

            Also, could you upload the dump file as well as the supporting files (RPMs with debuginfo)?
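            For illustration, a minimal sketch of that workaround; the device path and mount point are hypothetical:

                # Mount the OST with recovery aborted, so queued replay requests
                # from reconnecting clients are discarded rather than replayed.
                mount -t lustre -o abort_recov /dev/<ost_device> /mnt/<ost_mount>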

            bobijam Zhenyu Xu added a comment - For the workaround, you can mount the OST disabling recovery ("-o abort_recov"). And could you upload the dump file as well as all the support files (rpms with debuginfo)?

            haisong Haisong Cai (Inactive) added a comment -

            Zhenyu,

            We have carefully walked through our clients and checked the Lustre version on each of them. To our knowledge, all of them have the LU-5651 patch.

            Haisong

            bobijam Zhenyu Xu added a comment -

            Do all clients have this? A single old client (without it) could be enough to cause this LBUG on the OSS.
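            As a sketch of how to confirm that across the whole fleet (pdsh and the host range are assumptions; any parallel shell works):

                # Query the installed lustre-client version on every client
                # node and coalesce identical answers to spot stragglers.
                pdsh -w client[001-100] 'rpm -q lustre-client' | dshbak -c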


            haisong Haisong Cai (Inactive) added a comment -

            Zhenyu,

            Our clients do have the patch:

            [root@comet-02-01 ~]# rpm -qi lustre-client
            Name : lustre-client Relocations: (not relocatable)
            Version : 2.7.1 Vendor: (none)
            Release : 2.6.32_573.12.1.el6.x86_64_g965bd63 Build Date: Tue 02 Feb 2016 02:29:06 PM PST
            Install Date: Thu 11 Feb 2016 05:54:41 PM PST Build Host: comet-23-11.sdsc.edu
            Group : Utilities/System Source RPM: lustre-client-2.7.1-2.6.32_573.12.1.el6.x86_64_g965bd63.src.rpm
            Size : 2030643 License: GPL
            Signature : (none)
            URL : https://wiki.hpdd.intel.com/
            Summary : Lustre File System
            Description :
            Userspace tools and files for the Lustre file system.

            http://git.whamcloud.com/fs/lustre-release.git/commit/d730750a6311cae8a4427824867410faccc6698f is contained in the version we’re using:

            dimm:lustre-release-fe dimm$ git branch --contains d730750a6311cae8a4427824867410faccc6698f

            * (HEAD detached at 965bd63)
              b2_8_fe
            bobijam Zhenyu Xu added a comment -

            Do all clients contain the LU-5651 fix? That is a client-side issue which corrects the client's restore state; without it, the server can be confused about the client's recovery state and hit this LBUG.


            haisong Haisong Cai (Inactive) added a comment -

            By the way, we are running Lustre FE-2.7.1, with ZFS 0.6.4.2, CentOS 6.6.7

            Haisong


            haisong Haisong Cai (Inactive) added a comment -

            We just hit the same LBUG today on our OSS; I found this ticket while searching Jira. What happened was: we shut down the OSS gracefully for maintenance while the file system was still running. About a minute after we mounted the OSTs back on the OSS, we hit the LBUG. It appears to have happened during recovery.

            [root@wombat-oss-20-5 ~]#
            Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ...
            kernel:LustreError: 13701:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed:

            Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ...
            kernel:LustreError: 13701:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 6
