[LU-755] lustre had panic on network error (stale ZQ entry) Created: 12/Oct/11  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexey Lyashkov Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

RHEL6/Lustre 2.1.0


Severity: 3
Rank (Obsolete): 10310

 Description   

During iozone testing the system panicked:
@tcp has timed out for slow reply: [sent 1317906741] [real_sent 1317906741] [current 1317906748] [deadline 7s] [delay 0s] req@ffff88010df32400 x1381925181718579/t5108(5108) o-1->testfs-OST0000_UUID@192.168.123.12@tcp:6/4 lens 456/416 e 0 to 1 dl 1317906748 ref 3 fl Bulk:RX/ffffffff/ffffffff rc 0/-1
[ 4790.534571] Lustre: testfs-OST0000-osc-ffff88010e1a9400: Connection to service testfs-OST0000 via nid 192.168.123.12@tcp was lost; in progress operations using this service will wait for recovery to complete.
[ 4790.567408] Lustre: testfs-OST0000-osc-ffff88010e1a9400: Connection restored to service testfs-OST0000 using nid 192.168.123.12@tcp.
[ 4844.512099] LustreError: 31786:0:(socklnd_cb.c:2518:ksocknal_check_peer_timeouts()) Total 1 stale ZC_REQs for peer 192.168.123.12@tcp detected; the oldest(ffff8800daf1d600) timed out 10 secs ago, resid: 0, wmem: 0
[ 4844.521468] LustreError: 31786:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff8800ada53200
[ 4844.526766] LustreError: 31789:0:(client.c:1695:ptlrpc_check_set()) @@@ bulk transfer failed req@ffff88010df32400 x1381925181718581/t5108(5108) o-1->testfs-OST0000_UUID@192.168.123.12@tcp:6/4 lens 456/416 e 0 to 0 dl 1317906748 ref 2 fl Bulk:RS/ffffffff/ffffffff rc -11/-1
[ 4844.536035] LustreError: 31789:0:(client.c:1696:ptlrpc_check_set()) LBUG

The panic was triggered by error handling in client_bulk_callback(): on a network error it wakes the request and unregisters the bulk transfer, but it does not mark the request as failed.

crash> struct ptlrpc_request ffff88010df32400
struct ptlrpc_request {
...
rq_intr = 0,
rq_replied = 1,
rq_err = 0,
rq_timedout = 0,
rq_resend = 1,
rq_restart = 0,
rq_replay = 0,
rq_no_resend = 0,
rq_waiting = 0,
rq_receiving_reply = 0,
rq_no_delay = 0,
rq_net_err = 0,
rq_wait_ctx = 0,
rq_early = 0,
rq_must_unlink = 0,
rq_fake = 0,
rq_memalloc = 0,
rq_packed_final = 0,
rq_hp = 0,
rq_at_linked = 0,
rq_reply_truncate = 0,
rq_committed = 0,
rq_invalid_rqset = 0,
rq_phase = 3955285506,
rq_next_phase = 3955285506,
rq_refcount = { counter = 2 },
...
crash> p *((struct ptlrpc_bulk_desc *)0xffff8800ada53200)
$9 = {
bd_success = 0,
bd_network_rw = 0,
bd_type = 0,
bd_registered = 1,
bd_lock = {
raw_lock = { slock = 0 }
},
bd_import_generation = 0,
bd_export = 0x0,
bd_import = 0xffff88010ed33000,
bd_portal = 8,
bd_req = 0xffff88010df32400,

So the panic occurred on the same bulk descriptor (ffff8800ada53200) that failed in client_bulk_callback().



 Comments   
Comment by Colin Faber [X] (Inactive) [ 14/Aug/12 ]

This issue continues to be a problem in our mostly up-to-date 2.1.2 release.

-cf

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
