[LU-6172] Client LBUG in null_free_reqbuf Created: 28/Jan/15  Updated: 12/Feb/15  Resolved: 12/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Yang Sheng
Resolution: Duplicate Votes: 0
Labels: None
Environment:

(Git repo at https://github.com/jlan/lustre-nas)
Server: 2.4.3-12nasS, kernel: 2.6.32-358.23.2.el6
Client: 2.4.3-11nasC, kernel: sles11sp3 3.0.101-0.31.1.


Issue Links:
Related
is related to LU-3333 lustre_msg_get_opc()) incorrect messa... Resolved
Severity: 3
Rank (Obsolete): 17268

 Description   

We have several clients panicked yesterday on this problem. There was ptlrpc_expire_one_request message on nbp9 (10.151.26.5):
[1422397399.479345] Lustre: 5076:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422397397/real 1422397397] req@ffff88020b91dc00 x1487556472851932/t0(0) o250->MGC10.151.26.5@o2ib@10.151.26.5@o2ib:26/25 lens 400/544 e 0 to 1 dl 1422397502 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[1422397399.507345] Lustre: 5076:0:(client.c:1878:ptlrpc_expire_one_request()) Skipped 3507 previous similar messages

The client hits LBUG as shown in dmesg below:

[1422397952.774718] LustreError: 22882:0:(sec_null.c:196:null_free_reqbuf()) ASSERTION( req->rq_reqmsg == req->rq_reqbuf ) failed: req ffff880837378000: reqmsg ffff8808373780e8 is not reqbuf ffff8805b759a400 in null sec
[1422397952.794717] LustreError: 22882:0:(sec_null.c:196:null_free_reqbuf()) LBUG
[1422397952.802717] Pid: 22882, comm: kworker/u:1
[1422397952.806717]
[1422397952.806717] Call Trace:
[1422397952.810716] [<ffffffff81004935>] dump_trace+0x75/0x310
[1422397952.818716] [<ffffffffa062c82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
[1422397952.818716] [<ffffffffa062cd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
[1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc]
[1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc]
[1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc]
[1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc]
[1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc]
[1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet]
[1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet]
[1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd]
[1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm]
[1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350
[1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410
[1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0
[1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10
[1422397952.818716]
[1422397952.818716] Kernel panic - not syncing: LBUG
[1422397952.818716] Pid: 22882, comm: kworker/u:1 Tainted: GF NX 3.0.101-0.31.1.20140612-nasa #1
[1422397952.818716] Call Trace:
[1422397952.818716] [<ffffffff81004935>] dump_trace+0x75/0x310
[1422397952.818716] [<ffffffff8146ec63>] dump_stack+0x69/0x6f
[1422397952.818716] [<ffffffff8146ecfc>] panic+0x93/0x201
Broadcast message from root (Wed Jan 28 12:11:27 2015):+0xa3/0xb0 [libcfs]
[1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc]
System is being shutdown to apply patches during scheduled dedicated time. Plea[e logoff now to a]oid losing data. 1912,1 99%
[he system is goin] DOWN for reboot in 9 minutes!
[1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc]
[1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc]
[1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc]
[1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc]
[1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet]
[1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet]
[1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd]
[1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm]
[1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350
[1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410
[1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0
[1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10



 Comments   
Comment by Jay Lan (Inactive) [ 29/Jan/15 ]

The Call Trace part in the decription of the problem was cut-n-paste from a screeen that was contaminated by broadcast messages from root.

The correct stack trace should be as below:

[1422397952.818716] Kernel panic - not syncing: LBUG
[1422397952.818716] Pid: 22882, comm: kworker/u:1 Tainted: GF NX 3.0.101-0.31.1.20140612-nasa #1
[1422397952.818716] Call Trace:
[1422397952.818716] [<ffffffff81004935>] dump_trace+0x75/0x310
[1422397952.818716] [<ffffffff8146ec63>] dump_stack+0x69/0x6f
[1422397952.818716] [<ffffffff8146ecfc>] panic+0x93/0x201
[1422397952.818716] [<ffffffffa062cdc3>] lbug_with_loc+0xa3/0xb0 [libcfs]
[1422397952.818716] [<ffffffffa0914030>] null_free_reqbuf+0x260/0x2d0 [ptlrpc]
[1422397952.818716] [<ffffffffa09022e4>] sptlrpc_cli_free_reqbuf+0x44/0x130 [ptlrpc]
[1422397952.818716] [<ffffffffa08c77ad>] __ptlrpc_free_req+0x15d/0x590 [ptlrpc]
[1422397952.818716] [<ffffffffa08c7d3b>] __ptlrpc_req_finished+0x15b/0x260 [ptlrpc]
[1422397952.818716] [<ffffffffa08de527>] request_out_callback+0xb7/0x240 [ptlrpc]
[1422397952.818716] [<ffffffffa06e2da7>] lnet_msg_detach_md+0x57/0xf0 [lnet]
[1422397952.818716] [<ffffffffa06e42b9>] lnet_finalize+0x89/0x3e0 [lnet]
[1422397952.818716] [<ffffffffa085b4d7>] kiblnd_tx_done+0x117/0x3e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c372>] kiblnd_txlist_done+0x52/0x60 [ko2iblnd]
[1422397952.818716] [<ffffffffa085c463>] kiblnd_peer_connect_failed+0xe3/0x2e0 [ko2iblnd]
[1422397952.818716] [<ffffffffa08655ed>] kiblnd_cm_callback+0x4ed/0x1260 [ko2iblnd]
[1422397952.818716] [<ffffffffa04f1a02>] cma_work_handler+0x72/0xa0 [rdma_cm]
[1422397952.818716] [<ffffffff8107b4dc>] process_one_work+0x16c/0x350
[1422397952.818716] [<ffffffff8107e10a>] worker_thread+0x17a/0x410
[1422397952.818716] [<ffffffff81082476>] kthread+0x96/0xa0
[1422397952.818716] [<ffffffff8147aae4>] kernel_thread_helper+0x4/0x10

Comment by Peter Jones [ 29/Jan/15 ]

Yang Sheng

Could you please advise?

Peter

Comment by Oleg Drokin [ 02/Feb/15 ]

I suspect this might be an instance of LU-3333

Comment by Yang Sheng [ 04/Feb/15 ]

Hi, Oleg. I think you may right. Looks like not other place can cause such issue. But seem we won't landed patch to 2.4. How to handle this ticket?

Comment by Peter Jones [ 04/Feb/15 ]

Yang Sheng

NASA have their own Lustre distribution and so can pick up the patch. They will advise if they need any assistance in porting the fix back to b2_4. Alternatively, if this is a rare issue, they may wait, knowing that this issue will disappear when they upgrade to 2.5.x clients.

Peter

Comment by Jay Lan (Inactive) [ 04/Feb/15 ]

Thanks, guys!

I cherry-picked the patch to nas-2.4.3 branch with no problem.

Comment by Yang Sheng [ 09/Feb/15 ]

So can we mark this ticket as duplicated with lu-3333?

Comment by John Fuchs-Chesney (Inactive) [ 12/Feb/15 ]

Duplicate of LU-3333.
~ jfc.

Generated at Sat Feb 10 01:57:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.