[LU-1093] unable to handle kernel paging request in target_handle_connect() Created: 10/Feb/12  Updated: 30/Apr/12  Resolved: 30/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Ned Bass Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: None
Environment:

https://github.com/chaos/lustre/commits/2.1.0-llnl
RHEL 6.2


Severity: 3
Rank (Obsolete): 6462

 Description   

We had one occurrence of this bug on a classified Lustre 2.1 OSS. The timeframe coincided with LU-1085. Like the other bugs in that window, this crash was preceded by hundreds of messages like:

LustreError: 14210:0:(genops.c:1270:class_disconnect_stale_exports()) ls5-OST0349: disconnect stale client [UUID]@<unknown>
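
To gauge how many of these messages preceded the crash on each target, the console log could be tallied with something like the following. This is only a minimal sketch: the regular expression is inferred from the single sample line above, and the names `STALE_RE` and `count_stale_disconnects` are hypothetical helpers, not part of Lustre or any of its tools.

```python
import re
from collections import Counter

# Pattern inferred from the one sample message in this ticket (an assumption);
# other Lustre versions may format the line differently.
STALE_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\(genops\.c:\d+:class_disconnect_stale_exports\(\)\) "
    r"(?P<target>\S+): disconnect stale client (?P<client>\S+)"
)


def count_stale_disconnects(lines):
    """Count 'disconnect stale client' messages per target (e.g. per OST)."""
    counts = Counter()
    for line in lines:
        match = STALE_RE.search(line)
        if match:
            counts[match.group("target")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "LustreError: 14210:0:(genops.c:1270:class_disconnect_stale_exports()) "
        "ls5-OST0349: disconnect stale client [UUID]@<unknown>",
        "BUG: unable to handle kernel paging request at 0000000100000017",
    ]
    for target, n in count_stale_disconnects(sample).items():
        print(target, n)
```

In practice the input would be the lines of the saved console log rather than the inline sample.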

BUG: unable to handle kernel paging request at 0000000100000017
IP: [<ffffffffa05ab58f>] target_handle_connect+0x9ff/0x2220 [ptlrpc]

Pid: 15974, comm: ll_ost_506
machine_kexec
crash_kexec
oops_end
no_context
__bad_area_nosemaphore
__do_page_fault
do_page_fault
page_fault
[exception RIP: target_handle_connect+2559]
dequeue_entity
__switch_to
ost_handle
ptlrpc_main
kernel_thread
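
As a cross-check on the trace above, the `IP:` line and the `[exception RIP: ...]` line should point at the same instruction. The sketch below pulls the fields out of both lines and compares the offsets. The regular expressions are inferred from the text in this ticket only, and `parse_oops` is a hypothetical helper, not part of crash(8) or any Lustre tooling.

```python
import re

# Patterns inferred from the lines quoted in this ticket (an assumption).
FAULT_RE = re.compile(r"unable to handle kernel paging request at (?P<addr>[0-9a-f]+)")
IP_RE = re.compile(
    r"IP: \[<(?P<addr>[0-9a-f]+)>\] (?P<sym>\w+)\+0x(?P<off>[0-9a-f]+)"
    r"/0x(?P<size>[0-9a-f]+)(?: \[(?P<module>\w+)\])?"
)
RIP_RE = re.compile(r"\[exception RIP: (?P<sym>\w+)\+(?P<off>\d+)\]")


def parse_oops(text):
    """Extract the faulting address and instruction location from an oops."""
    fault = FAULT_RE.search(text)
    ip = IP_RE.search(text)
    rip = RIP_RE.search(text)
    if not (fault and ip and rip):
        raise ValueError("oops text is missing an expected line")
    ip_addr = int(ip.group("addr"), 16)
    ip_off = int(ip.group("off"), 16)    # hex offset from the IP: line
    rip_off = int(rip.group("off"))      # decimal offset from the RIP line
    return {
        "fault_address": int(fault.group("addr"), 16),
        "symbol": ip.group("sym"),
        "module": ip.group("module"),
        "offset": ip_off,
        "function_size": int(ip.group("size"), 16),
        "function_start": ip_addr - ip_off,
        "offsets_agree": ip.group("sym") == rip.group("sym") and ip_off == rip_off,
    }


if __name__ == "__main__":
    oops = """BUG: unable to handle kernel paging request at 0000000100000017
IP: [<ffffffffa05ab58f>] target_handle_connect+0x9ff/0x2220 [ptlrpc]
[exception RIP: target_handle_connect+2559]"""
    info = parse_oops(oops)
    print(info["symbol"], hex(info["offset"]), info["offsets_agree"])
```

Here 0x9ff is 2559 in decimal, so the two lines agree on the faulting instruction inside target_handle_connect() in the ptlrpc module.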



 Comments   
Comment by Oleg Drokin [ 10/Feb/12 ]

The disconnect stale client message is about clients that failed to contact the server during the recovery window.
Were they dead or is there some other problem at play?

Comment by Ned Bass [ 13/Feb/12 ]

We are still trying to understand what happened, but it's hard to identify the clients because the UUIDs are all @<unknown>. It could be that they were BGP nodes that get rebooted between jobs. We suspect RPC traffic was not moving through the system well, but we don't know whether that was due to high server load or to some network or LNET router issue.

Comment by Peter Jones [ 30/Apr/12 ]

Believed to be a duplicate of LU-1092
