[LU-1336] OSS GPF at ptlrpc_send_reply+0x470 Created: 18/Apr/12  Updated: 30/Apr/12  Resolved: 30/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Ned Bass Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

https://github.com/chaos/lustre/commits/2.1.0-24chaos


Severity: 3
Rank (Obsolete): 6411

 Description   

Back trace looks like this:

machine_kexec
crash_kexec
oops_end
die
do_general_protection
general_protection
[exception RIP: ptlrpc_send_reply+1136]
ptlrpc_send_error
target_send_reply_msg
target_send_reply
ost_handle
ptlrpc_main
kernel_thread

That RIP resolves to lustre/ptlrpc/niobuf.c:436, which in our tree is here:

434         /* There may be no rq_export during failover */
435 
436         if (unlikely(req->rq_export && req->rq_export->exp_obd &&
437                      req->rq_export->exp_obd->obd_fail)) { 
438                 /* Failed obd's only send ENODEV */
439                 req->rq_type = PTL_RPC_MSG_ERR;
440                 req->rq_status = -ENODEV;
441                 CDEBUG(D_HA, "sending ENODEV from failed obd %d\n",
442                        req->rq_export->exp_obd->obd_minor);
443         }

Server was handling many client reconnects, under similar conditions as reported in LU-1085, LU-1092, LU-1093, and LU-1094.



 Comments   
Comment by Peter Jones [ 18/Apr/12 ]

Bobi

Could you please comment on this one?

Thanks

Peter

Comment by Mikhail Pershin [ 19/Apr/12 ]

Isn't this LU-1092, for which a fix was landed:

http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=893cf2014a38c5bd94890d3522fafe55f024a958

The other tickets LU-1085, LU-1093 and LU-1094 also look like duplicates.

Comment by Ned Bass [ 19/Apr/12 ]

Hi Mikhail,

I've been tracking those separately because of the different exception sites. It would be nice if they all turned out to be symptoms of the same bug.

Can we consider landing the LU-1092 patch in 2.1? These crashes are having a pretty big impact on our production systems. Turning off OSS read cache seems to help avoid them, so we've been leaving it turned off, but that has its own severe performance impacts for some workloads. For now we'll cherry-pick the fix in our tree.

Comment by Peter Jones [ 30/Apr/12 ]

Ned

We have landed LU-1092 for 2.1.2 as well. Please reopen if this turns out not to be a duplicate after all.

Peter

Generated at Sat Feb 10 01:15:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.