[LU-1094] general protection fault in _debug_req() Created: 10/Feb/12  Updated: 30/Apr/12  Resolved: 30/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Ned Bass Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: paj
Environment:

https://github.com/chaos/lustre/commits/2.1.0-llnl
RHEL 6.2


Severity: 3
Rank (Obsolete): 6461

 Description   

We had five occurrences of this crash on OSS nodes in our classified Lustre 2.1 cluster. The timeframe coincided with LU-1085. Like the other bugs in that window, this crash was preceded by hundreds of messages like:

LustreError: 14210:0:(genops.c:1270:class_disconnect_stale_exports()) ls5-OST0349: disconnect stale client [UUID]@<unknown>

general protection fault: 0000 1 SMP
Pid: 13890, comm: ll_ost_34

machine_kexec
crash_kexec
oops_end
die
do_general_protection
general_protection
[exception RIP: strnlen+9]
string
vsnprintf
libcfs_debug_vmsg2
_debug_req
target_send_reply_msg
target_send_reply
ost_handle
ptlrpc_main
kernel_thread



 Comments   
Comment by Ned Bass [ 10/Feb/12 ]

Comment copied from LU-1085:

I did some digging in crash to see what state the ptlrpc_request was in. I pulled the request pointer from the backtrace (let's call it <addr1> to save typing). Resolving some of the strings that _debug_req() passes to libcfs_debug_vmsg2(), I see:

crash> struct ptlrpc_request.rq_import <addr1>
 rq_import = 0x0
crash> struct ptlrpc_request.rq_export <addr1>
 rq_export = <addr2>
crash> struct obd_export.exp_connection <addr2>
 exp_connection = 0x5a5a5a5a5a5a5a5a
crash> struct obd_export.exp_client_uuid <addr2>
 exp_client_uuid = { 
        uuid = "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"
 }

So the presence of a poison value and a bogus uuid (0x5a is ASCII 'Z', so both fields hold the same fill pattern) suggests this export has already been destroyed.
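
To illustrate why these two values point to freed memory, here is a minimal user-space sketch. It assumes the free path fills the object with the byte 0x5a; `fake_export`, `poison_export()` and `looks_poisoned()` are hypothetical names for illustration only, not the Lustre definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_BYTE 0x5a  /* assumed fill byte; 0x5a is ASCII 'Z' */

/* Hypothetical stand-in for the two obd_export fields inspected above. */
struct fake_export {
    void *exp_connection;
    char  uuid[40];
};

/* Overwrite the whole object with the poison byte, as a debugging
 * free path might do before releasing the memory. */
static void poison_export(struct fake_export *e)
{
    assert(e != NULL);
    memset(e, POISON_BYTE, sizeof(*e));
}

/* Return 1 if every byte in [buf, buf + len) equals the poison byte. */
static int looks_poisoned(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        if (p[i] != POISON_BYTE)
            return 0;
    return 1;
}
```

After `poison_export(&e)`, the pointer field consists only of 0x5a bytes and every character of `uuid` is 'Z', which is the same shape as the exp_connection and exp_client_uuid values in the crash output.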

For reference, here is a snippet from _debug_req() that uses these values:

void _debug_req(struct ptlrpc_request *req,
                struct libcfs_debug_msg_data *msgdata,
                const char *fmt, ... )
{
        va_list args;
        va_start(args, fmt);
        libcfs_debug_vmsg2(msgdata, fmt, args,
                           " req@%p x"LPU64"/t"LPD64"("LPD64") o%d->%s@%s:%d/%d"
                           " lens %d/%d e %d to %d dl "CFS_TIME_T" ref %d "
                           "fl "REQ_FLAGS_FMT"/%x/%x rc %d/%d\n",
                           req, req->rq_xid, req->rq_transno,
                           req->rq_reqmsg ? lustre_msg_get_transno(req->rq_reqmsg) : 0,
                           req->rq_reqmsg && req_ptlrpc_body_swabbed(req) ?
                           lustre_msg_get_opc(req->rq_reqmsg) : -1,
                           req->rq_import ? obd2cli_tgt(req->rq_import->imp_obd) :
                           req->rq_export ?
                           (char*)req->rq_export->exp_client_uuid.uuid : "<?>",
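
The ternary at the end of the snippet decides which string feeds the first %s. Below is a simplified sketch of that selection. The `mock_*` structs and `debug_req_peer_name()` are hypothetical stand-ins, and a plain `tgt_name` field replaces `obd2cli_tgt(req->rq_import->imp_obd)`. With rq_import NULL and rq_export non-NULL, as in the crash dump, the export's uuid is chosen, so the formatter reads from the export whether or not it is still valid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real Lustre structures. */
struct mock_uuid   { char uuid[40]; };
struct mock_export { struct mock_uuid exp_client_uuid; };
struct mock_import { const char *tgt_name; };

struct mock_req {
    struct mock_import *rq_import;
    struct mock_export *rq_export;
};

/* Mirrors the nested ternary in the snippet: prefer the import's
 * target name, fall back to the export's client uuid, else "<?>". */
static const char *debug_req_peer_name(const struct mock_req *req)
{
    assert(req != NULL);
    return req->rq_import ? req->rq_import->tgt_name :
           req->rq_export ? req->rq_export->exp_client_uuid.uuid : "<?>";
}
```

Note that the selection only checks whether rq_export is non-NULL; nothing in it can tell a live export from one whose memory has already been freed.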
Comment by Oleg Drokin [ 11/Feb/12 ]

I think this one may also be related to LU-106, so let's see whether runs with the patch help.

Comment by Ned Bass [ 18/Apr/12 ]

FYI, we did in fact hit this again with the LU-106 patch here:

http://review.whamcloud.com/326

Comment by Ned Bass [ 18/Apr/12 ]

Sorry, disregard previous comment. We hit a new GPF, not this one.

Comment by Peter Jones [ 30/Apr/12 ]

Believed to be a duplicate of LU-1092
