[LU-6655] MDS LBUG: (ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed Created: 27/May/15 Updated: 06/May/18 Resolved: 06/May/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Frederik Ferner (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
RHEL6, during upgrade from 2.5 to 2.7 |
||
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While attempting a Lustre upgrade from 2.5.3 to 2.7.0 on our preproduction file system, we encountered this LBUG after first mounting the MDT, while attempting to mount the OSTs for the first time. The first OST mounted fine; while mounting the second OST, the LBUG happened. There are most likely clients out there that haven't had the file system unmounted and have been trying to reconnect during this time. The information below has been extracted from the Red Hat crash log, as we didn't have a serial console attached at the time.

<4>Lustre: 14008:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
<0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed:
<0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG
<4>Pid: 31012, comm: mdt00_001
<4>
<4>Call Trace:
<4> [<ffffffffa03c8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa03c8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0674d10>] target_queue_recovery_request+0xb00/0xc10 [ptlrpc]
<4> [<ffffffffa06b3c4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa0714b3d>] tgt_request_handle+0xe8d/0x1000 [ptlrpc]
<3>LustreError: 11-0: play01-OST0000-osc-MDT0000: operation ost_connect to node 172.23.144.18@tcp failed: rc = -16
<4> [<ffffffffa06c45a1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
<4> [<ffffffffa06c3760>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
<4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

After power cycling the affected MDT and OSS and starting again, so far we have not seen it again and recovery on the MDT has completed. The only other information I could potentially provide is a vmcore that Red Hat collected automatically, as well as more lines from the vmcore-dmesg.txt file if required. |
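For context on why a single unexpected request takes down the whole MDS thread: the failed check is an LBUG-style assertion, LASSERT(req->rq_export->exp_lock_replay_needed), in target_queue_recovery_request(), so a lock-replay request arriving for an export the server no longer expects to replay locks aborts the service thread rather than just failing that one request. The minimal, standalone C sketch below contrasts that pattern with a defensive check; it is illustrative only, every identifier in it is invented, and it is neither the actual Lustre code nor the fix that eventually landed.

/*
 * Illustrative sketch only, not Lustre code. It contrasts the behaviour in
 * the trace above, where an assertion on a per-export recovery flag kills
 * the whole service thread, with a defensive variant that rejects just the
 * offending request.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct export {
	bool lock_replay_needed;	/* per-client flag set during recovery */
};

/* Asserting variant: analogous to LASSERT()/LBUG, aborts on a clear flag. */
static void queue_lock_replay_assert(struct export *exp)
{
	assert(exp->lock_replay_needed);
	printf("lock replay request queued\n");
}

/* Defensive variant: the misbehaving client gets an error, the server lives. */
static int queue_lock_replay_checked(struct export *exp)
{
	if (!exp->lock_replay_needed) {
		fprintf(stderr, "unexpected lock replay request, rejecting\n");
		return -EPROTO;
	}
	printf("lock replay request queued\n");
	return 0;
}

int main(void)
{
	/* A client the server no longer expects to replay locks, e.g. one that
	 * reconnected outside the lock-replay phase of recovery. */
	struct export stale = { .lock_replay_needed = false };

	queue_lock_replay_checked(&stale);	/* request rejected, process keeps running */
	/* queue_lock_replay_assert(&stale); */	/* would abort(), roughly what LBUG does here */
	return 0;
}

Compiled with a plain C compiler, the checked variant simply prints the rejection and returns -EPROTO, while the asserting variant called on the same export would abort the process, which is roughly what LBUG does to the mdt00_001 thread in the trace above.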
| Comments |
| Comment by Peter Jones [ 27/May/15 ] |
|
Bobijam, could you please assist with this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 28/May/15 ] |
|
I think it's a dup of LU-5651. |
| Comment by Frederik Ferner (Inactive) [ 28/May/15 ] |
|
I had looked at LU-5651. Cheers, Frederik |
| Comment by Zhenyu Xu [ 28/May/15 ] |
|
You are right, it's a client patch; a client without this patch connecting to an upgraded server could LBUG the server. |
| Comment by Haisong Cai (Inactive) [ 31/Mar/16 ] |
|
We just hit the same LBUG today on our OSS; while searching Jira I found this ticket. What happened to us: we shut down the OSS gracefully for maintenance while the filesystem was still running. After we mounted the OSTs back on the OSS, we hit the LBUG within about a minute, apparently during recovery.

[root@wombat-oss-20-5 ~]#
Message from syslogd@wombat-oss-20-5 at Mar 31 12:18:18 ... |
| Comment by Haisong Cai (Inactive) [ 31/Mar/16 ] |
|
By the way, we are running Lustre FE-2.7.1 with ZFS 0.6.4.2 on CentOS 6.6.7. Haisong |
| Comment by Zhenyu Xu [ 01/Apr/16 ] |
|
Do all clients contain the LU-5651 patch? |
| Comment by Haisong Cai (Inactive) [ 01/Apr/16 ] |
|
Zhenyu,

Our clients do have the patch:

[root@comet-02-01 ~]# rpm -qi lustre-client

http://git.whamcloud.com/fs/lustre-release.git/commit/d730750a6311cae8a4427824867410faccc6698f is contained in the version we're using:

dimm:lustre-release-fe dimm$ git branch --contains d730750a6311cae8a4427824867410faccc6698f
|
| Comment by Zhenyu Xu [ 03/Apr/16 ] |
|
Do all clients have this? A single old client (without it) could possibly cause this OSS LBUG. |
| Comment by Haisong Cai (Inactive) [ 07/Apr/16 ] |
|
Zhenyu, we have carefully walked through our clients and checked the Lustre version on each of them. To our knowledge, all of them have the LU-5651 patch. Haisong |
| Comment by Zhenyu Xu [ 08/Apr/16 ] |
|
As a workaround, you can mount the OST with recovery disabled ("-o abort_recov"). Also, could you upload the dump file as well as all the supporting files (RPMs with debuginfo)? |
| Comment by Gerrit Updater [ 17/Oct/16 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23205 |
| Comment by Gerrit Updater [ 06/May/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23205/ |
| Comment by Peter Jones [ 06/May/18 ] |
|
Landed for 2.12 |