Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.10.2
Labels: None
Environment: Lustre 2.10.2, Lustre 2.10.3 RC1, CentOS 7.4
Severity: 3
Rank (Obsolete): 9223372036854775807

We're seeing a recurrent crash on OSSes running 2.10.2 on Oak. Many of the following log messages can be seen before the crash:

[928518.391343] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[928518.401925] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[928518.410543] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
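To gauge how often these errors show up before a crash, a short script along these lines could tally them from a saved dmesg or kernel log. This is only a sketch: the regular expression is derived from the three lines quoted above, and the function name `tally_callback_errors` is made up for illustration. For reference, status -103 corresponds to ECONNABORTED on Linux.

```python
import re
from collections import Counter

# Pattern derived from the request_in_callback() lines quoted above.
PATTERN = re.compile(
    r"request_in_callback\(\)\) event type (?P<type>\d+), "
    r"status (?P<status>-?\d+), service (?P<service>\S+)"
)


def tally_callback_errors(lines):
    """Count request_in_callback errors per (service, status) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            key = (match.group("service"), int(match.group("status")))
            counts[key] += 1
    return counts


# Sample input: the three error lines from this ticket plus the NMI line,
# which must not be counted.
sample = [
    "[928518.391343] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io",
    "[928518.401925] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io",
    "[928518.410543] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io",
    "[928639.223483] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24",
]

counts = tally_callback_errors(sample)
for (service, status), n in sorted(counts.items()):
    print(f"{service} status {status}: {n}")
```

To run it against a full log, pass an open file object instead of `sample`, for example `tally_callback_errors(open("oak-io1-s1.vmcore-dmesg.txt"))`.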

Finally, the server crashes after an NMI watchdog is triggered:

[928639.223483] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24

See oak-io1-s1.vmcore-dmesg.txt for full log.

After restarting the OSS, I can see the following in the logs:

Jan 29 12:45:21 oak-io1-s1 kernel: LNet: Using FMR for registration
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1590957016747328
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 1386 previous similar messages
Jan 29 12:45:21 oak-io1-s1 kernel: Lustre: oak-OST001c: Connection restored to 7cfcbcde-275c-eb1e-9911-bb9f7ea0c616 (at 10.0.2.223@o2ib5)
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff881c45339850 x1590957016747344/t0(0) o4->7f9fc76a-8204-2cde-2104-5efdf97586ca@10.0.2.223@o2ib5:162/0 lens 3440/1152 e 0 to 0 dl 1517258732 ref 1 fl Interpret:/0/0 rc 0/0
Jan 29 12:45:21 oak-io1-s1 kernel: LNet: 325230:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 10.0.2.223@o2ib5: connected

"message length 0 too small.." can also be found in ~~LU-9983~~ (fixed in 2.10.3) and ~~LU-10068~~ (not clear if fully resolved).

10.0.2.223@o2ib5 is oak-gw04, a Lustre client running in a VM with SR-IOV on IB (mlx4) and serving as an SMB gateway. It had been working fine for almost a year while we were running 2.9, but we've had this recurrent issue since upgrading clients and servers to the 2.10.x branch. For a while I suspected the change to map_on_demand, now set to 256 instead of 0, but after reading the changelogs, I don't think it should have such an impact... We're using mlx4 on both clients and servers on Oak (but it's connected to several LNet routers, with both mlx4 and mlx5, remotely).

I tried to upgrade oak-gw04 to 2.10.3 RC1 and the same happened. I'm now in the process of upgrading all Lustre servers to 2.10.3 RC1, because it should also fix another issue that we have, ~~LU-10267~~ - not related to this issue. I will update this ticket if the issue happens again with all servers running 2.10.3 RC1...

A crash usually occurs a few hours or days after oak-gw04 comes up. Note that I don't get any OSS crashes if oak-gw04 stays down. We have numerous other VMs using SR-IOV with no issues; only this one (serving SMB) is affected. This "SMB gateway" is experimental; our users love it, but it's not critical for production.

Any ideas welcome...

Thanks!
Stephane

Attachments:
oak-io1-s1.vmcore-dmesg.txt (1.01 MB, 30/Jan/18 12:21 AM)
oak-gw04.kernel.log (52 kB, 30/Jan/18 12:21 AM)
oak-gw22_lustre.log.gz (160 kB, 22/Mar/18 10:13 PM)
oak-io1-s2_lustre.log.gz (5.18 MB, 22/Mar/18 10:13 PM)
oak-gw22_lustre_neterror.log (4.88 MB, 23/Mar/18 2:54 AM)

Assignee: Sonia Sharma (Inactive)
Reporter: Stephane Thiell
Votes: 0
Watchers: 5
Created: 30/Jan/18 12:21 AM
Updated: 13/Apr/18 7:17 PM
Resolved: 13/Apr/18 7:17 PM
