[LU-6655] MDS LBUG: (ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed
Created: 27/May/15  Updated: 06/May/18  Resolved: 06/May/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.12.0

Type: Bug
Priority: Critical
Reporter: Frederik Ferner (Inactive)
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None
Environment:

RHEL6, during upgrade from 2.5 to 2.7


Issue Links:
Duplicate
Related
is related to LU-8544 recovery-double-scale test_pairwise_f... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While attempting a Lustre upgrade from 2.5.3 to 2.7.0 on our preproduction file system, we hit this LBUG after mounting the MDT, while mounting the OSTs for the first time. The first OST mounted fine; the LBUG happened while mounting the second OST.

There are most likely clients out there that haven't had the file system unmounted and have been trying to reconnect during this time.

The information below has been extracted from the Red Hat crash log as we didn't have a serial console attached at the time.

<4>Lustre: 14008:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
<0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed: 
<0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG
<4>Pid: 31012, comm: mdt00_001
<4>
<4>Call Trace:
<4> [<ffffffffa03c8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa03c8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0674d10>] target_queue_recovery_request+0xb00/0xc10 [ptlrpc]
<4> [<ffffffffa06b3c4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa0714b3d>] tgt_request_handle+0xe8d/0x1000 [ptlrpc]
<3>LustreError: 11-0: play01-OST0000-osc-MDT0000: operation ost_connect to node 172.23.144.18@tcp failed: rc = -16
<4> [<ffffffffa06c45a1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
<4> [<ffffffffa06c3760>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
<4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
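
For anyone who wants to search a saved vmcore-dmesg.txt for this signature, the following is a minimal sketch in Python. The regular expression and the function name `find_assertions` are illustrative assumptions based only on the line format quoted above; they are not part of any Lustre tooling.

```python
import re

# Pattern derived from the assertion line quoted in this ticket (an assumption,
# not an official Lustre log grammar).
ASSERT_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"ASSERTION\( (?P<expr>.+?) \) failed"
)


def find_assertions(lines):
    """Return one dict (pid, file, line, func, expr) per assertion line."""
    hits = []
    for text in lines:
        match = ASSERT_RE.search(text)
        if match:
            hits.append(match.groupdict())
    return hits


# Sample lines copied from the log excerpt in this ticket.
sample = [
    "<4>Lustre: 14008:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages",
    "<0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed: ",
    "<0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG",
]

for hit in find_assertions(sample):
    print(hit["pid"], hit["file"], hit["line"], hit["func"], hit["expr"])
```

The same pattern also matches the syslog form of the message ("kernel:LustreError: ..."), because it searches within each line rather than anchoring at the start.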

After power cycling the affected MDT and OSS and starting again, we have not seen the LBUG again so far, and recovery on the MDT has completed.

The only other information I could potentially provide is a vmcore that Red Hat collected automatically as well as more lines from the vmcore-dmesg.txt file if required.



 Comments   
Comment by Peter Jones [ 27/May/15 ]

Bobijam

Could you please assist with this one?

Thanks

Peter

Comment by Zhenyu Xu [ 28/May/15 ]

I think it's a dup of LU-5651. With all nodes upgraded to 2.7, the issue should be gone.

Comment by Frederik Ferner (Inactive) [ 28/May/15 ]

I had looked at LU-5651 but initially didn't think it was the same, as all servers had been upgraded. Reading it again, I'm now suspecting there's a client-side patch which is not on our clients yet, so you might be right. Could I check that I read this right?

Cheers,
Frederik

Comment by Zhenyu Xu [ 28/May/15 ]

You are right, it's a client patch. A client without this patch connecting to an upgraded server could LBUG the server.

Comment by Haisong Cai (Inactive) [ 31/Mar/16 ]

We just hit the same LBUG today on our OSS. While searching Jira I found this ticket. What happened to us was this: we shut down the OSS gracefully for maintenance while the file system was still running. After we mounted the OSTs back on the OSS, we hit the LBUG within about a minute. It looks like it happened during recovery.

[root@wombat-oss-20-5 ~]#
Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ...
kernel:LustreError: 13701:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed:

Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ...
kernel:LustreError: 13701:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG

Comment by Haisong Cai (Inactive) [ 31/Mar/16 ]

By the way, we are running Lustre FE-2.7.1, with ZFS 0.6.4.2, CentOS 6.6.7

Haisong

Comment by Zhenyu Xu [ 01/Apr/16 ]

Do all clients contain the LU-5651 fix? It is a client-side fix that corrects the client's restore state; without it, the server can be confused about the client's recovery state and hit this LBUG.

Comment by Haisong Cai (Inactive) [ 01/Apr/16 ]

Zhenyu,

Our clients do have the patch:

[root@comet-02-01 ~]# rpm -qi lustre-client
Name : lustre-client Relocations: (not relocatable)
Version : 2.7.1 Vendor: (none)
Release : 2.6.32_573.12.1.el6.x86_64_g965bd63 Build Date: Tue 02 Feb 2016 02:29:06 PM PST
Install Date: Thu 11 Feb 2016 05:54:41 PM PST Build Host: comet-23-11.sdsc.edu
Group : Utilities/System Source RPM: lustre-client-2.7.1-2.6.32_573.12.1.el6.x86_64_g965bd63.src.rpm
Size : 2030643 License: GPL
Signature : (none)
URL : https://wiki.hpdd.intel.com/
Summary : Lustre File System
Description :
Userspace tools and files for the Lustre file system.

http://git.whamcloud.com/fs/lustre-release.git/commit/d730750a6311cae8a4427824867410faccc6698f is contained in the version we’re using:

dimm:lustre-release-fe dimm$ git branch --contains d730750a6311cae8a4427824867410faccc6698f
* (HEAD detached at 965bd63)
  b2_8_fe

Comment by Zhenyu Xu [ 03/Apr/16 ]

Do all clients have this? A single old client without it could cause this OSS LBUG.
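
Since one unpatched client is enough, a per-client audit may help. The following is a minimal sketch in Python, assuming the site already has an inventory mapping each client hostname to its installed build string and a set of builds it has verified to contain the LU-5651 fix. The function name, the hostnames, and the build strings are hypothetical; only "g965bd63" is taken from this ticket.

```python
def clients_missing_fix(client_builds, builds_with_fix):
    """Return the sorted hostnames whose build is not in the verified set."""
    return sorted(
        host for host, build in client_builds.items()
        if build not in builds_with_fix
    )


# Hypothetical inventory for illustration. The site would need to confirm
# for itself which builds actually contain the LU-5651 fix.
builds_with_fix = {"2.7.1-g965bd63"}
client_builds = {
    "client-a": "2.7.1-g965bd63",
    "client-b": "2.5.3-gexample",
    "client-c": "2.7.1-g965bd63",
}

print(clients_missing_fix(client_builds, builds_with_fix))
```

Comparing exact build strings rather than version numbers alone is a deliberate choice in this sketch, because two clients reporting the same version could be built from different commits.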

Comment by Haisong Cai (Inactive) [ 07/Apr/16 ]

Zhenyu,

We have carefully walked through our clients and checked the Lustre version on each of them. To our knowledge, all of them have the LU-5651 patch.

Haisong

Comment by Zhenyu Xu [ 08/Apr/16 ]

As a workaround, you can mount the OST with recovery disabled ("-o abort_recov").

And could you upload the dump file as well as all the support files (rpms with debuginfo)?
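
To make the workaround concrete, here is a minimal sketch in Python that assembles the mount command with the "abort_recov" option. The helper name, the ZFS dataset, and the mount point are hypothetical placeholders; substitute the real target device and mount point for the affected OST.

```python
def abort_recovery_mount_cmd(device, mountpoint):
    """Build the argv for mounting a Lustre target with recovery aborted."""
    return ["mount", "-t", "lustre", "-o", "abort_recov", device, mountpoint]


# Hypothetical ZFS dataset and mount point, for illustration only.
cmd = abort_recovery_mount_cmd("ostpool/ost0", "/mnt/lustre/ost0")
print(" ".join(cmd))
```

The sketch only prints the command instead of running it, so it can be reviewed first. Skipping recovery means client state that would have been replayed is not, so it is worth treating as a last resort.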

Comment by Gerrit Updater [ 17/Oct/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23205
Subject: LU-6655 ptlrpc: skip delayed replay requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c819f1ca2d3c12348d1e7e779500d7f774f923f7

Comment by Gerrit Updater [ 06/May/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23205/
Subject: LU-6655 ptlrpc: skip delayed replay requests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c1d465de13ccf0eda8020c88661c3cc4d78538ca

Comment by Peter Jones [ 06/May/18 ]

Landed for 2.12
