Details
- Bug
- Resolution: Duplicate
- Minor
- None
- Lustre 2.4.3
- None
- (Git repo at https://github.com/jlan/lustre-nas)
  Server: 2.4.3-12nasS, kernel: 2.6.32-358.23.2.el6
  Client: 2.4.3-11nasC, kernel: sles11sp3 3.0.101-0.31.1.
- 3
- 17268
Description
Several of our clients panicked yesterday because of this problem. There was a ptlrpc_expire_one_request message on nbp9 (10.151.26.5):
[1422397399.479345] Lustre: 5076:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422397397/real 1422397397] req@ffff88020b91dc00 x1487556472851932/t0(0) o250->MGC10.151.26.5@o2ib@10.151.26.5@o2ib:26/25 lens 400/544 e 0 to 1 dl 1422397502 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[1422397399.507345] Lustre: 5076:0:(client.c:1878:ptlrpc_expire_one_request()) Skipped 3507 previous similar messages
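To make the timing in the message above easier to read, here is a minimal Python sketch that pulls out the `sent` and `dl` (deadline) fields and the leading log timestamp. It assumes all three values are seconds since the Unix epoch, which the matching magnitudes suggest but the report does not state. The helper name `parse_expire` is hypothetical, and the log line is shortened to the fields used.

```python
import re
from datetime import datetime, timezone

# Log line from the report, shortened ("...") to the fields used here.
line = ("[1422397399.479345] Lustre: 5076:0:(client.c:1878:ptlrpc_expire_one_request()) "
        "@@@ Request sent has failed due to network error: "
        "[sent 1422397397/real 1422397397] ... dl 1422397502 ref 1")

def parse_expire(line):
    """Return (logged, sent, deadline) in epoch seconds (hypothetical helper)."""
    logged = float(re.match(r"\[(\d+\.\d+)\]", line).group(1))
    sent = int(re.search(r"\[sent (\d+)/real \d+\]", line).group(1))
    deadline = int(re.search(r"\bdl (\d+)\b", line).group(1))
    return logged, sent, deadline

logged, sent, deadline = parse_expire(line)
# Assumption: timestamps are epoch seconds, so they can be shown as UTC.
print(datetime.fromtimestamp(sent, tz=timezone.utc).isoformat())
print("deadline - sent =", deadline - sent, "s")
print("logged - sent   =", round(logged - sent, 6), "s")
```

Under that assumption, the request was sent at 2015-01-27 22:23:17 UTC with a deadline 105 seconds later, and the failure was logged about 2.5 seconds after the send.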
The client then hit an LBUG, as shown in the dmesg output below:
[1422397952.774718] LustreError: 22882:0:(sec_null.c:196:null_free_reqbuf()) ASSERTION( req->rq_reqmsg == req->rq_reqbuf ) failed: req ffff880837378000: reqmsg ffff8808373780e8 is not reqbuf ffff8805b759a400 in null sec
[1422397952.794717] LustreError: 22882:0:(sec_null.c:196:null_free_reqbuf()) LBUG
[1422397952.802717] Pid: 22882, comm: kworker/u:1
[1422397952.806717]
[1422397952.806717] Call Trace:
[1422397952.810716] [<ffffffff81004935>] dump_trace+0x75/0x310
[1422397952.818716] [<ffffffffa062c82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
[1422397952.818716] [<ffffffffa062cd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
[1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc]
[1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc]
[1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc]
[1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc]
[1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc]
[1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet]
[1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet]
[1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd]
[1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm]
[1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350
[1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410
[1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0
[1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10
[1422397952.818716]
[1422397952.818716] Kernel panic - not syncing: LBUG
[1422397952.818716] Pid: 22882, comm: kworker/u:1 Tainted: GF NX 3.0.101-0.31.1.20140612-nasa #1
[1422397952.818716] Call Trace:
[1422397952.818716] [<ffffffff81004935>] dump_trace+0x75/0x310
[1422397952.818716] [<ffffffff8146ec63>] dump_stack+0x69/0x6f
[1422397952.818716] [<ffffffff8146ecfc>] panic+0x93/0x201
[...] +0xa3/0xb0 [libcfs]
[1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc]
[1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc]
[1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc]
[1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc]
[1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc]
[1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet]
[1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet]
[1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd]
[1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm]
[1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350
[1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410
[1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0
[1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10
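The three pointers printed by the failed assertion can be compared directly. The following is a minimal Python sketch, not part of the original report, that extracts them from the assertion text and computes how far `reqmsg` lies from the start of `req`. The helper name `parse_pointers` is hypothetical.

```python
import re

# Assertion text copied from the dmesg output in this report.
assertion = ("ASSERTION( req->rq_reqmsg == req->rq_reqbuf ) failed: "
             "req ffff880837378000: reqmsg ffff8808373780e8 is not reqbuf "
             "ffff8805b759a400 in null sec")

def parse_pointers(msg):
    """Return (req, reqmsg, reqbuf) as integers (hypothetical helper)."""
    m = re.search(
        r"req ([0-9a-f]+): reqmsg ([0-9a-f]+) is not reqbuf ([0-9a-f]+)", msg)
    return tuple(int(x, 16) for x in m.groups())

req, reqmsg, reqbuf = parse_pointers(assertion)
print("reqmsg == reqbuf:", reqmsg == reqbuf)   # the condition the assertion checks
print("reqmsg - req    :", hex(reqmsg - req))  # offset of reqmsg from req
```

For the values in this report, `reqmsg` is 0xe8 bytes past the address of `req`, while `reqbuf` points to an unrelated address. This only restates the logged numbers; it does not by itself establish the cause.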
Issue Links
- is related to
  LU-3333 lustre_msg_get_opc()) incorrect message magic: a0b03b5 LBUG (Resolved)
Duplicate of LU-3333. ~ jfc.