Details
Type: Bug
Resolution: Duplicate
Priority: Critical
Affects Version/s: Lustre 2.12.2
Environment: our github https://github.com/jlan/lustre-nas
Severity: 2
Description
After upgrading to 2.12.2 clients and 2.10.8 servers, we started to see large numbers of bulk IO timeouts.
Client side:
Sep 16 16:36:59 r323i3n6 kernel: [1568677019.825837] Lustre: nbp2-OST0008-osc-ffff9ad887dc1800: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 16 16:37:01 r585i7n2 kernel: [1568677021.115645] Lustre: nbp2-OST0008-osc-ffff90c85d05e000: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 16 16:37:01 r311i0n3 kernel: [1568677021.371165] Lustre: nbp2-OST0094-osc-ffff976bea358800: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 16 16:37:01 r311i4n9 kernel: [1568677021.578522] Lustre: nbp2-OST0094-osc-ffff9c68adf2d000: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Server side:
[90158.366440] LustreError: 30777:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 300+0s req@ffff8affe46eb450 x1644657589074304/t0(0) o4->825a80a7-da45-880e-35d1-4a750d2cf7f0@10.151.16.212@o2ib:502/0 lens 2168/448 e 0 to 0 dl 1568676837 ref 1 fl Interpret:/2/0 rc 0/0
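For a rough sense of how widespread these are, the target_bulk_io timeouts can be counted per client NID on the OSS; a minimal sketch (paths and log source are assumptions, not the exact commands we ran):

# count bulk IO timeouts per client NID from the OSS kernel log (illustrative)
dmesg | grep 'target_bulk_io.*timeout on bulk' | grep -o '@[0-9.]*@o2ib' | sort | uniq -c | sort -rn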
We updated the clients to 2.12 first and didn't see these timeouts; only after rebooting the servers into 2.10.8 did the timeouts appear. However, moving one filesystem back to 2.10.6 didn't make the timeouts go away on that filesystem; only rebooting the clients stops them. I am not convinced these are actual "network" timeouts. The client and server get out of sync: the client keeps retrying the same bulk IO over and over, and the server keeps timing it out. The last set of debug logs I uploaded gives a good trace of the RPC traffic.
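For context, an RPC trace of this kind can be captured on a client with the standard lctl debug facilities; a minimal sketch, assuming default tooling (not necessarily the exact commands used for the uploaded logs):

# enable RPC tracing on the client, reproduce, then dump the debug buffer (sketch)
lctl set_param debug=+rpctrace
lctl clear                          # start from an empty debug buffer
# ... reproduce the bulk IO timeout ...
lctl dk /tmp/lustre-rpctrace.log    # dump and decode the kernel debug log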
This is exactly like https://jira.whamcloud.com/browse/LU-12543: when a user's job finishes, its processes can't be killed because of an open file (or a lock). Only an eviction or a reboot will clear up the client.
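For completeness, a manual eviction of a stuck client can be issued from the server side; a sketch assuming the OST target and client export UUID shown in the server log above (values are from this ticket, substitute as appropriate):

# on the OSS, evict the stuck client's export by UUID (sketch)
lctl set_param obdfilter.nbp2-OST0008.evict_client=825a80a7-da45-880e-35d1-4a750d2cf7f0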