Details
Type: Bug
Resolution: Duplicate
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.8.0
Labels: None
Severity: 3
Description
While testing the last pre-2.8 code I noticed heavy client reconnects to OSS servers. The error on the client side was:
Lustre: sultan-OST0008-osc-ffff8803ea302800: Connection to sultan-OST0008 (at 10.37.248.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 55 previous similar messages
Lustre: 5355:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 61 previous similar messages
Lustre: 5350:0:(client.c:2009:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434742560/real 1434742560] req@ffff8803c23fb6c0 x1504421695570504/t0(0) o8->sultan-OST0023-osc-ffff8803ea302800@10.37.248.72@o2ib1:28/4 lens 520/544 e 0 to 1 dl 1434742568 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 5350:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection restored to sultan-OST0000 (at 10.37.248.69@o2ib1)
Lustre: Skipped 27 previous similar messages
Lustre: 5356:0:(client.c:2009:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434742782/real 1434742782] req@ffff8803c1b639c0 x1504421695572244/t0(0) o400->sultan-OST0034-osc-ffff8803ea302800@10.37.248.69@o2ib1:28/4 lens 224/224 e 0 to 1 dl 1434742789 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection to sultan-OST0000 (at 10.37.248.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 41 previous similar messages
Lustre: 5356:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 73 previous similar messages
Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection restored to sultan-OST0000 (at 10.37.248.69@o2ib1)
Lustre: Skipped 41 previous similar messages
Lustre: sultan-OST0003-osc-ffff8803ea302800: Connection restored to sultan-OST0003 (at 10.37.248.72@o2ib1)
Lustre: Skipped 27 previous similar messages
and the messages seen on the OSS side are:
[20639.820176] Lustre: sultan-OST0008: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20639.829910] Lustre: Skipped 20 previous similar messages
[20676.881745] Lustre: sultan-OST000c: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20676.891462] Lustre: Skipped 29 previous similar messages
[20868.910972] Lustre: sultan-OST0004: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20868.920682] Lustre: Skipped 23 previous similar messages
[20906.993360] Lustre: sultan-OST0000: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20906.993364] Lustre: sultan-OST0004: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20906.993368] Lustre: Skipped 17 previous similar messages
[20907.018191] Lustre: Skipped 11 previous similar messages
This occurred when I ran a file-per-process IOR job across 20 nodes with 32 threads per client.
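To get a quick picture of which OSTs and which client NIDs are involved, the OSS console messages above can be tallied with a small script. The following is only a rough sketch: the function name summarize_reconnects and the regular expressions are my own, written against the message format shown in this ticket, and other Lustre versions may word these lines differently.

```python
import re
from collections import Counter

# Matches OSS console lines of the form seen in this ticket, e.g.
# [20639.820176] Lustre: sultan-OST0008: Client <uuid> (at 26@gni1) reconnecting
# (assumption: target names end in -OST followed by four hex digits)
RECONNECT_RE = re.compile(
    r"Lustre: (?P<target>\S+-OST[0-9a-fA-F]{4}): "
    r"Client (?P<uuid>\S+) \(at (?P<nid>[^)]+)\) reconnecting"
)

# Matches the rate-limiting notice that follows suppressed messages.
SKIPPED_RE = re.compile(r"Lustre: Skipped (?P<count>\d+) previous similar messages")


def summarize_reconnects(lines):
    """Count explicit reconnect messages per OST and per client NID.

    Returns (per_target, per_client, skipped). 'skipped' is the sum of the
    'Skipped N previous similar messages' counts; this sketch does not try
    to attribute those suppressed messages to a specific target.
    """
    per_target = Counter()
    per_client = Counter()
    skipped = 0
    for line in lines:
        match = RECONNECT_RE.search(line)
        if match:
            per_target[match["target"]] += 1
            per_client[match["nid"]] += 1
            continue
        match = SKIPPED_RE.search(line)
        if match:
            skipped += int(match["count"])
    return per_target, per_client, skipped
```

Feeding it the lines of a saved console log (for example, open("oss-console.log") with a hypothetical file name) returns two Counter objects and the total number of suppressed messages, which makes it easier to compare reconnect activity across OSS nodes.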