Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.7.0
- None
- RHEL6, during upgrade from 2.5 to 2.7
- 3
Description
While upgrading Lustre from 2.5.3 to 2.7.0 on our preproduction file system, we hit this LBUG. We had mounted the MDT and were mounting the OSTs for the first time: the first OST mounted fine, and the LBUG occurred while the second OST was mounting.
Some clients most likely still had the file system mounted and were trying to reconnect during this time.
The information below has been extracted from the Red Hat crash log as we didn't have a serial console attached at the time.
<4>Lustre: 14008:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
<0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed:
<0>LustreError: 31012:0:(ldlm_lib.c:2277:target_queue_recovery_request()) LBUG
<4>Pid: 31012, comm: mdt00_001
<4>
<4>Call Trace:
<4> [<ffffffffa03c8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa03c8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0674d10>] target_queue_recovery_request+0xb00/0xc10 [ptlrpc]
<4> [<ffffffffa06b3c4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa0714b3d>] tgt_request_handle+0xe8d/0x1000 [ptlrpc]
<3>LustreError: 11-0: play01-OST0000-osc-MDT0000: operation ost_connect to node 172.23.144.18@tcp failed: rc = -16
<4> [<ffffffffa06c45a1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
<4> [<ffffffffa06c3760>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
<4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
After power cycling the affected MDT and OSS and starting again, we have not seen the LBUG again so far, and recovery on the MDT has completed.
The only other information I can provide is the vmcore that Red Hat collected automatically, plus more lines from the vmcore-dmesg.txt file if required.
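In case it helps anyone triaging a similar crash, here is a minimal sketch of how the LustreError records could be pulled out of a vmcore-dmesg.txt dump. It assumes every record starts with a single-digit kernel log-level prefix such as `<4>` or `<0>`, as in the excerpt above. The function names `split_records` and `lustre_errors` are hypothetical helpers written for this illustration and are not part of any Lustre or Red Hat tool.

```python
import re

# Matches a kernel log-level prefix such as "<4>" or "<0>".
# Multi-character hex addresses like "<ffffffffa03c8895>" do not match.
LEVEL_PREFIX = re.compile(r"<(\d)>")


def split_records(text):
    """Split raw dmesg text into (level, message) tuples."""
    # With one capturing group, re.split returns:
    # [leading text, level, message, level, message, ...]
    parts = LEVEL_PREFIX.split(text)
    records = []
    for i in range(1, len(parts) - 1, 2):
        message = parts[i + 1].strip()
        if message:  # skip empty records such as a bare "<4>"
            records.append((int(parts[i]), message))
    return records


def lustre_errors(text):
    """Return only the messages that begin with 'LustreError:'."""
    return [msg for _, msg in split_records(text)
            if msg.startswith("LustreError:")]
```

To use it, read the dump into a string (for example `open("vmcore-dmesg.txt").read()`) and pass it to `lustre_errors`. Because it splits on the level prefixes instead of on newlines, it works whether or not the records are on separate lines.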
Issue Links
- is related to LU-8544 recovery-double-scale test_pairwise_fail: start client on trevis-54vm5 failed (Resolved)