Details
Type: Bug
Resolution: Duplicate
Priority: Critical
Affects Version/s: Lustre 2.12.2
Source: our github https://github.com/jlan/lustre-nas
Description
After upgrading to 2.12.2 clients (with 2.10.8 servers) we started to see large numbers of bulk I/O timeouts.
client side
Sep 16 16:36:59 r323i3n6 kernel: [1568677019.825837] Lustre: nbp2-OST0008-osc-ffff9ad887dc1800: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 16 16:37:01 r585i7n2 kernel: [1568677021.115645] Lustre: nbp2-OST0008-osc-ffff90c85d05e000: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 16 16:37:01 r311i0n3 kernel: [1568677021.371165] Lustre: nbp2-OST0094-osc-ffff976bea358800: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 16 16:37:01 r311i4n9 kernel: [1568677021.578522] Lustre: nbp2-OST0094-osc-ffff9c68adf2d000: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
server side
[90158.366440] LustreError: 30777:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 300+0s req@ffff8affe46eb450 x1644657589074304/t0(0) o4->825a80a7-da45-880e-35d1-4a750d2cf7f0@10.151.16.212@o2ib:502/0 lens 2168/448 e 0 to 0 dl 1568676837 ref 1 fl Interpret:/2/0 rc 0/0
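For a rough sense of how often this is happening, the OSS messages can be counted with something like the following. This is only a sketch; the /var/log/messages path and the xid pattern are assumptions and may need adjusting for your setup:

  # count bulk I/O timeouts reported by the OSS (log path is an assumption)
  grep -c 'timeout on bulk' /var/log/messages
  # list the request xids involved (pattern is approximate)
  grep 'timeout on bulk' /var/log/messages | grep -oE 'x[0-9]{15,}' | sort | uniq -c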
We have been trying to gather additional debugging while we wait to update our servers.
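For reference, a trace like the one shown further down can be captured with standard lctl debugging. This is only a sketch; the debug mask, buffer size, and output file name are assumptions:

  # enable network/RPC tracing on the node (mask and buffer size are assumptions)
  lctl set_param debug="+net +rpctrace"
  lctl set_param debug_mb=1024
  # ... reproduce the bulk timeout ...
  # dump the kernel debug buffer to a file for upload
  lctl dk /tmp/client-debug.out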
I found this exchange between the server and client.
x1645242562975920 is the request that keeps timing out over and over.
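A minimal way to follow that request across the attached dumps (file names refer to the uploads below; exact paths are assumptions):

  # pull every debug line for the xid that keeps timing out
  grep x1645242562975920 r901i3n9.out
  grep x1645242562975920 oss4.out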
server
On the client
00000100:00000200:52.0:1569522293.227691:0:4645:0:(events.c:57:request_out_callback()) @@@ type 5, status 0 req@ffff9568c0c48040 x1645193981558368/t0(0) o3->nbp8-OST0107-osc-ffff958574643000@10.151.27.65@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1569522648 ref 3 fl Rpc:/2/ffffffff rc 0/-1
00000800:00000200:47.0:1569522293.227691:0:4643:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff95709daf1800] (70)--
00000400:00000200:52.0:1569522293.227698:0:4645:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff956f2881d770
00000400:00000200:52.0:1569522293.227700:0:4645:0:(lib-msg.c:816:lnet_is_health_check()) health check = 1, status = 0, hstatus = 0
00000400:00000200:52.0:1569522293.227703:0:4645:0:(lib-msg.c:630:lnet_health_check()) health check: 10.141.2.117@o2ib417->10.141.25.169@o2ib417: PUT: OK
00000800:00000200:52.0:1569522293.227706:0:4645:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff95709daf1800] (69)--
00000800:00000200:38.2:1569522293.227889:0:0:0:(o2iblnd_cb.c:3721:kiblnd_cq_completion()) conn[ffff956f289cee00] (68)++
00000800:00000200:13.0:1569522293.227905:0:4646:0:(o2iblnd_cb.c:3843:kiblnd_scheduler()) conn[ffff956f289cee00] (69)++
00000800:00000200:13.0:1569522293.227913:0:4646:0:(o2iblnd_cb.c:343:kiblnd_handle_rx()) Received d2[0] from 10.141.25.170@o2ib417
00000800:00000200:15.0:1569522293.227915:0:4644:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff956f289cee00] (70)--
00000400:00000200:13.0:1569522293.227920:0:4646:0:(lib-move.c:4114:lnet_parse()) TRACE: 10.141.2.117@o2ib417(10.141.2.117@o2ib417) <- 10.151.27.65@o2ib : PUT - for me
00000400:00000200:13.0:1569522293.227928:0:4646:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-10.151.27.65@o2ib of length 4096 into portal 8 MB=0x0
00000400:00000100:13.0:1569522293.227933:0:4646:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.151.27.65@o2ib portal 8 match 0 offset 0 length 4096: 4
So the client drops the PUT because the bulk I/O timed out, but I think that was the previous transfer. It looks to me like it is stuck timing out a previous transfer and dropping the current one.
I will upload the full debug logs that show this keeps repeating.
ftp:uploads/LU-12772/oss4.out.gz
ftp:uploads/LU-12772/r901i3n9.out.gz

This looks like LU-11951, but our 2.12 clients have that patch. FYI, we have turned off the OSC idle timeout: osc.*.idle_timeout=0
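For reference, this is how that setting is applied and checked on the clients (a sketch; a persistent setting pushed from the MGS would look different):

  # disable OSC idle disconnect on the client, as noted above
  lctl set_param osc.*.idle_timeout=0
  # verify the current value
  lctl get_param osc.*.idle_timeout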