Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.1.0
-
None
-
RHEL6/Lustre 2.1.0
-
3
-
10310
Description
During iozone testing, the system panicked:
@tcp has timed out for slow reply: [sent 1317906741] [real_sent 1317906741] [current 1317906748] [deadline 7s] [delay 0s] req@ffff88010df32400 x1381925181718579/t5108(5108) o-1->testfs-OST0000_UUID@192.168.123.12@tcp:6/4 lens 456/416 e 0 to 1 dl 1317906748 ref 3 fl Bulk:RX/ffffffff/ffffffff rc 0/-1
[ 4790.534571] Lustre: testfs-OST0000-osc-ffff88010e1a9400: Connection to service testfs-OST0000 via nid 192.168.123.12@tcp was lost; in progress operations using this service will wait for recovery to complete.
[ 4790.567408] Lustre: testfs-OST0000-osc-ffff88010e1a9400: Connection restored to service testfs-OST0000 using nid 192.168.123.12@tcp.
[ 4844.512099] LustreError: 31786:0:(socklnd_cb.c:2518:ksocknal_check_peer_timeouts()) Total 1 stale ZC_REQs for peer 192.168.123.12@tcp detected; the oldest(ffff8800daf1d600) timed out 10 secs ago, resid: 0, wmem: 0
[ 4844.521468] LustreError: 31786:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff8800ada53200
[ 4844.526766] LustreError: 31789:0:(client.c:1695:ptlrpc_check_set()) @@@ bulk transfer failed req@ffff88010df32400 x1381925181718581/t5108(5108) o-1->testfs-OST0000_UUID@192.168.123.12@tcp:6/4 lens 456/416 e 0 to 0 dl 1317906748 ref 2 fl Bulk:RS/ffffffff/ffffffff rc -11/-1
[ 4844.536035] LustreError: 31789:0:(client.c:1696:ptlrpc_check_set()) LBUG
The panic was caused by an error in the client bulk callback, which wakes up the request and unregisters the bulk transfer but does not mark the request as failed.
crash> struct ptlrpc_request ffff88010df32400
struct ptlrpc_request {
...
rq_intr = 0,
rq_replied = 1,
rq_err = 0,
rq_timedout = 0,
rq_resend = 1,
rq_restart = 0,
rq_replay = 0,
rq_no_resend = 0,
rq_waiting = 0,
rq_receiving_reply = 0,
rq_no_delay = 0,
rq_net_err = 0,
rq_wait_ctx = 0,
rq_early = 0,
rq_must_unlink = 0,
rq_fake = 0,
rq_memalloc = 0,
rq_packed_final = 0,
rq_hp = 0,
rq_at_linked = 0,
rq_reply_truncate = 0,
rq_committed = 0,
rq_invalid_rqset = 0,
rq_phase = 3955285506,
rq_next_phase = 3955285506,
rq_refcount =
,
...
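The rq_phase and rq_next_phase values above are printed in decimal, which hides the phase they encode. A minimal sketch for decoding them follows; the name table is an assumption based on the enum rq_phase constants as commonly listed in lustre_net.h and should be checked against the 2.1.0 source, and decode_rq_phase is a hypothetical helper, not part of Lustre or crash.

```python
# Assumed values of enum rq_phase (verify against lustre_net.h for Lustre 2.1.0).
RQ_PHASES = {
    0xEBC0DE00: "RQ_PHASE_NEW",
    0xEBC0DE01: "RQ_PHASE_RPC",
    0xEBC0DE02: "RQ_PHASE_BULK",
    0xEBC0DE03: "RQ_PHASE_INTERPRET",
    0xEBC0DE04: "RQ_PHASE_COMPLETE",
    0xEBC0DE05: "RQ_PHASE_UNREGISTERING",
    0xEBC0DE06: "RQ_PHASE_UNDEFINED",
}


def decode_rq_phase(value):
    """Map a decimal rq_phase value from crash output to its symbolic name."""
    return RQ_PHASES.get(value, "unknown (0x%08x)" % value)


if __name__ == "__main__":
    # Value taken from the struct ptlrpc_request dump above.
    print(hex(3955285506), decode_rq_phase(3955285506))
```

Under that assumption, 3955285506 is 0xebc0de02, so the request was still in the bulk phase when the LBUG fired.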
crash> p *((struct ptlrpc_bulk_desc *)0xffff8800ada53200)
$9 = {
bd_success = 0,
bd_network_rw = 0,
bd_type = 0,
bd_registered = 1,
bd_lock = {
raw_lock =
},
bd_import_generation = 0,
bd_export = 0x0,
bd_import = 0xffff88010ed33000,
bd_portal = 8,
bd_req = 0xffff88010df32400,
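So the panic hit the same bulk descriptor that failed in client_bulk_callback(): the descriptor address (ffff8800ada53200) matches the one logged by the callback, and its bd_req points to the request that triggered the LBUG (ffff88010df32400).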
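The failure path described in this ticket can be illustrated with a small toy model. This is a sketch under stated assumptions, not Lustre code: the field names (bd_success, bd_network_rw, rq_replied, rq_err) come from the dumps above, while the classes, bulk_callback, check_request, and the mark_failed switch are hypothetical and only approximate what client_bulk_callback() and ptlrpc_check_set() are reported to do.

```python
from dataclasses import dataclass, field


class LBug(Exception):
    """Stands in for the kernel LBUG() assertion in this toy model."""


@dataclass
class BulkDesc:
    bd_success: bool = False
    bd_network_rw: bool = True  # transfer still in flight


@dataclass
class Request:
    bulk: BulkDesc = field(default_factory=BulkDesc)
    rq_replied: bool = False
    rq_err: bool = False
    woken: bool = False


def bulk_callback(req, status, mark_failed=False):
    """Hypothetical model of the bulk completion callback."""
    req.bulk.bd_network_rw = False      # transfer is no longer active
    if status == 0:
        req.bulk.bd_success = True
    elif mark_failed:                   # hypothetical: flag the request on error
        req.rq_err = True
    req.woken = True                    # wake whoever is waiting on the request


def check_request(req):
    """Hypothetical model of the completion check that hit the LBUG."""
    if req.rq_err:
        return "error"
    if req.rq_replied and not req.bulk.bd_network_rw:
        if not req.bulk.bd_success:
            # Reply arrived but the bulk failed and nobody flagged the request.
            raise LBug("bulk transfer failed")
        return "complete"
    return "pending"


if __name__ == "__main__":
    req = Request(rq_replied=True)
    bulk_callback(req, status=-5)       # status -5 as in the events.c log line
    try:
        check_request(req)
    except LBug as exc:
        print("LBUG:", exc)
```

In this model the state after the failed callback matches the dumps (rq_replied = 1, rq_err = 0, bd_success = 0, bd_network_rw = 0), which is the combination the check treats as fatal. Passing mark_failed=True shows how flagging the request would route it to the error path instead.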