Details
Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.10.2
Labels: None
Environment: Lustre 2.10.2, Lustre 2.10.3 RC1, CentOS 7.4
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
We're seeing a recurrent crash on OSSes running 2.10.2 on Oak. Many of the following log messages can be seen before the crash:
[928518.391343] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[928518.401925] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[928518.410543] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Finally, the server crashes after an NMI watchdog is triggered:
[928639.223483] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
See oak-io1-s1.vmcore-dmesg.txt for full log.
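To gauge how many of these request_in_callback() errors pile up before a lockup, the dmesg file could be tallied with a short script. The following is only a minimal sketch and not part of the original report: the function name tally_request_errors is hypothetical, and the regular expression assumes the message format shown in the excerpt above. On Linux, status -103 corresponds to -ECONNABORTED.

```python
import re
from collections import Counter

# Matches the LustreError lines emitted by request_in_callback(), e.g.
# "(events.c:304:request_in_callback()) event type 2, status -103, service ost_io"
# Assumption: the message format is exactly the one seen in this ticket.
PATTERN = re.compile(
    r"request_in_callback\(\)\) event type (?P<type>\d+), "
    r"status (?P<status>-?\d+), service (?P<service>\w+)"
)


def tally_request_errors(log_text):
    """Count request_in_callback() errors per (status, service) pair."""
    counts = Counter()
    # finditer scans the whole text, so it works whether the log has one
    # message per line or several messages run together on a single line.
    for match in PATTERN.finditer(log_text):
        key = (int(match.group("status")), match.group("service"))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Two messages copied from the excerpt above, used as sample input.
    sample = (
        "[928518.391343] LustreError: 325230:0:(events.c:304:"
        "request_in_callback()) event type 2, status -103, service ost_io\n"
        "[928518.401925] LustreError: 325230:0:(events.c:304:"
        "request_in_callback()) event type 2, status -103, service ost_io\n"
    )
    for (status, service), n in tally_request_errors(sample).most_common():
        print(f"status {status} on {service}: {n} occurrence(s)")
```

To run it against the real log, read oak-io1-s1.vmcore-dmesg.txt into a string and pass it to tally_request_errors().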
After restarting the OSS, I can see the following in the logs:
Jan 29 12:45:21 oak-io1-s1 kernel: LNet: Using FMR for registration
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1590957016747328
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 1386 previous similar messages
Jan 29 12:45:21 oak-io1-s1 kernel: Lustre: oak-OST001c: Connection restored to 7cfcbcde-275c-eb1e-9911-bb9f7ea0c616 (at 10.0.2.223@o2ib5)
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff881c45339850 x1590957016747344/t0(0) o4->7f9fc76a-8204-2cde-2104-5efdf97586ca@10.0.2.223@o2ib5:162/0 lens 3440/1152 e 0 to 0 dl 1517258732 ref 1 fl Interpret:/0/0 rc 0/0
Jan 29 12:45:21 oak-io1-s1 kernel: LNet: 325230:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 10.0.2.223@o2ib5: connected
"message length 0 too small.." can also be found in LU-10068 (fixed in 2.10.3) and LU-9983 (not clear if fully resolved).
10.0.2.223@o2ib5 is oak-gw04, a Lustre client running in a VM with SR-IOV on IB (mlx4) and serving as an SMB gateway. It had been working fine (for almost a year) as long as we were running 2.9, but we're now hitting this recurrent issue after upgrading clients and servers to the 2.10.x branch. For a while I suspected the change to map_on_demand, now set to 256 instead of 0, but after reading the changelogs, I don't think it should have such an impact... We're using mlx4 on both clients and servers on Oak (but it's connected to several LNet routers with both mlx4 and mlx5 remotely).
I tried upgrading oak-gw04 to 2.10.3 RC1 and the same thing happened. I'm now in the process of upgrading all Lustre servers to 2.10.3 RC1, because it should also fix another issue that we have, LU-10267, which is not related to this one. I will update this ticket if the issue happens again with all servers running 2.10.3 RC1...
A crash usually occurs a few hours or days after oak-gw04 comes up. Note that I don't get any OSS crash if oak-gw04 stays down. We have numerous other VMs using SR-IOV with no issues; only this one (serving SMB) is affected. This "SMB gateway" is experimental: our users love it, but it's not critical for production.
Any ideas welcome...
Thanks!
Stephane