[LU-6655] MDS LBUG: (ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: RHEL6, during upgrade from 2.5 to 2.7
    • Severity: 3

    Description

      While attempting a Lustre upgrade from 2.5.3 to 2.7.0 on our preproduction file system, we hit this LBUG after first mounting the MDT, while mounting the OSTs for the first time. The first OST mounted fine; the LBUG happened while the second OST was mounting.

      Most likely there are clients out there that never unmounted the file system and have been trying to reconnect during this time.

      The information below has been extracted from the Red Hat crash log as we didn't have a serial console attached at the time.

      <4>Lustre: 14008:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      <0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed: 
      <0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG
      <4>Pid: 31012, comm: mdt00_001
      <4>
      <4>Call Trace:
      <4> [<ffffffffa03c8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa03c8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0674d10>] target_queue_recovery_request+0xb00/0xc10 [ptlrpc]
      <4> [<ffffffffa06b3c4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      <4> [<ffffffffa0714b3d>] tgt_request_handle+0xe8d/0x1000 [ptlrpc]
      <3>LustreError: 11-0: play01-OST0000-osc-MDT0000: operation ost_connect to node 172.23.144.18@tcp failed: rc = -16
      <4> [<ffffffffa06c45a1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
      <4> [<ffffffffa06c3760>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      <4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      After power cycling the affected MDT and OSS and starting again, we have so far not seen the LBUG again, and recovery on the MDT has completed.

      The only other information I could potentially provide is a vmcore that Red Hat's crash tooling collected automatically, as well as more lines from the vmcore-dmesg.txt file, if required.
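      For reference, a sketch of pulling the kernel log out of such a vmcore with the crash utility; the vmlinux and vmcore paths below are hypothetical, and the debuginfo vmlinux must match the crashed kernel:

          # Open the dump against a matching debuginfo vmlinux (paths hypothetical)
          crash /usr/lib/debug/lib/modules/<crashed-kernel>/vmlinux /var/crash/<timestamp>/vmcore
          # Inside the crash session, write the kernel ring buffer to a file
          crash> log > vmcore-dmesg.txt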

Activity
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23205/
            Subject: LU-6655 ptlrpc: skip delayed replay requests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c1d465de13ccf0eda8020c88661c3cc4d78538ca
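            For anyone tracking the fix, one way to check which release tags include that merge, assuming a local clone of fs/lustre-release:

                # List release tags that contain the merged fix commit
                git tag --contains c1d465de13ccf0eda8020c88661c3cc4d78538ca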


            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23205
            Subject: LU-6655 ptlrpc: skip delayed replay requests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c819f1ca2d3c12348d1e7e779500d7f774f923f7

            bobijam Zhenyu Xu added a comment -

            As a workaround, you can mount the OST with recovery disabled ("-o abort_recov").

            Also, could you upload the dump file as well as the supporting files (RPMs with debuginfo)?
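            For illustration, a minimal sketch of that workaround; the device path and mount point are hypothetical:

                # Mount the OST with recovery aborted, so queued replay requests
                # from reconnecting clients are discarded rather than replayed.
                mount -t lustre -o abort_recov /dev/<ost_device> /mnt/<ost_mount>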

            bobijam Zhenyu Xu added a comment - For the workaround, you can mount the OST disabling recovery ("-o abort_recov"). And could you upload the dump file as well as all the support files (rpms with debuginfo)?

            haisong Haisong Cai (Inactive) added a comment -

            Zhenyu,

            We have carefully walked through our clients and checked the Lustre version on each of them. To our knowledge, all of them have the LU-5651 patch.

            Haisong

            bobijam Zhenyu Xu added a comment -

            Do all clients have this? A single old client (without it) could be enough to cause this LBUG on the OSS.
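            As a sketch of how to confirm that across the whole fleet (pdsh and the host range are assumptions; any parallel shell works):

                # Query the installed lustre-client version on every client
                # node and coalesce identical answers to spot stragglers.
                pdsh -w client[001-100] 'rpm -q lustre-client' | dshbak -c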


            haisong Haisong Cai (Inactive) added a comment -

            Zhenyu,

            Our clients do have the patch:

            [root@comet-02-01 ~]# rpm -qi lustre-client
            Name : lustre-client Relocations: (not relocatable)
            Version : 2.7.1 Vendor: (none)
            Release : 2.6.32_573.12.1.el6.x86_64_g965bd63 Build Date: Tue 02 Feb 2016 02:29:06 PM PST
            Install Date: Thu 11 Feb 2016 05:54:41 PM PST Build Host: comet-23-11.sdsc.edu
            Group : Utilities/System Source RPM: lustre-client-2.7.1-2.6.32_573.12.1.el6.x86_64_g965bd63.src.rpm
            Size : 2030643 License: GPL
            Signature : (none)
            URL : https://wiki.hpdd.intel.com/
            Summary : Lustre File System
            Description :
            Userspace tools and files for the Lustre file system.

            http://git.whamcloud.com/fs/lustre-release.git/commit/d730750a6311cae8a4427824867410faccc6698f is contained in the version we’re using:

            dimm:lustre-release-fe dimm$ git branch --contains d730750a6311cae8a4427824867410faccc6698f

            * (HEAD detached at 965bd63)
              b2_8_fe
            bobijam Zhenyu Xu added a comment -

            Do all clients contain the LU-5651 fix? That is a client-side issue which corrects the client's restore state; without it, the server can be confused about the client's recovery state and hit this LBUG.


            haisong Haisong Cai (Inactive) added a comment -

            By the way, we are running Lustre FE-2.7.1, with ZFS 0.6.4.2, CentOS 6.6.7

            Haisong


            haisong Haisong Cai (Inactive) added a comment -

            We just hit the same LBUG today on our OSS; I found this ticket while searching Jira. What happened was: we shut down the OSS gracefully for maintenance while the file system was still running. About a minute after we mounted the OSTs back on the OSS, we hit the LBUG. It appears to have happened during recovery.

            [root@wombat-oss-20-5 ~]#
            Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ...
            kernel:LustreError: 13701:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed:

            Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ...
            kernel:LustreError: 13701:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 6
