LU-10578: request_in_callback() event type 2, status -103, service ost_io

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.10.2
    • None
    • Lustre 2.10.2, Lustre 2.10.3 RC1, CentOS 7.4
    • 3

    Description

      We're seeing a recurrent crash on OSSes running 2.10.2 on Oak. Many of the following log messages can be seen before the crash:

      [928518.391343] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
      [928518.401925] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
      [928518.410543] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
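For readers triaging similar messages, the status field can be decoded into a symbolic error name. The following is a minimal sketch, assuming the status is a negated Linux errno (the usual kernel convention); the `LINUX_ERRNO` table and the `decode_status` helper are illustrative names, not part of Lustre.

```python
import re

# Subset of Linux errno values (from asm-generic/errno.h), hard-coded so the
# sketch gives the same answer regardless of the platform it runs on.
LINUX_ERRNO = {
    103: ("ECONNABORTED", "Software caused connection abort"),
    104: ("ECONNRESET", "Connection reset by peer"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches the tail of a request_in_callback() message.
STATUS_RE = re.compile(r"event type (\d+), status (-?\d+), service (\S+)")


def decode_status(line):
    """Return the event type, status, service and errno name found in a log line."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    status = int(m.group(2))
    # Assumption: status is a negated errno, so -103 maps to errno 103.
    name, meaning = LINUX_ERRNO.get(-status, ("UNKNOWN", "not in this table"))
    return {
        "event_type": int(m.group(1)),
        "status": status,
        "service": m.group(3),
        "errno": name,
        "meaning": meaning,
    }


sample = ("[928518.391343] LustreError: 325230:0:(events.c:304:"
          "request_in_callback()) event type 2, status -103, service ost_io")
print(decode_status(sample))
```

Under that assumption, status -103 corresponds to ECONNABORTED, which suggests the incoming request on ost_io was dropped because the underlying connection was aborted.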
      
      

      Finally, the server crashes after an NMI watchdog is triggered:

      [928639.223483] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
      
      

      See oak-io1-s1.vmcore-dmesg.txt for full log.

      After restarting the OSS, I can see the following in the logs:

      Jan 29 12:45:21 oak-io1-s1 kernel: LNet: Using FMR for registration
      Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
      Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
      Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1590957016747328
      Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 1386 previous similar messages
      Jan 29 12:45:21 oak-io1-s1 kernel: Lustre: oak-OST001c: Connection restored to 7cfcbcde-275c-eb1e-9911-bb9f7ea0c616 (at 10.0.2.223@o2ib5)
      Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
      Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff881c45339850 x1590957016747344/t0(0) o4->7f9fc76a-8204-2cde-2104-5efdf97586ca@10.0.2.223@o2ib5:162/0 lens 3440/1152 e 0 to 0 dl 1517258732 ref 1 fl Interpret:/0/0 rc 0/0
      Jan 29 12:45:21 oak-io1-s1 kernel: LNet: 325230:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 10.0.2.223@o2ib5: connected
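To see which code paths and which peers dominate an excerpt like the one above, the messages can be tallied by reporting function and by NID. This is a rough sketch, not a Lustre tool: the regular expressions are assumptions based only on the message layout visible in this ticket, and `summarize` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches the "(file.c:line:function())" tag carried by Lustre/LNet messages.
SITE_RE = re.compile(r"\((\w+\.c):(\d+):(\w+)\(\)\)")
# Matches an IPv4-style LNet NID such as 10.0.2.223@o2ib5.
NID_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}@\w+)")


def summarize(lines):
    """Count messages per reporting function and mentions per peer NID."""
    sites = Counter()
    nids = Counter()
    for line in lines:
        site = SITE_RE.search(line)
        if site:
            sites[site.group(3)] += 1
        for nid in NID_RE.findall(line):
            nids[nid] += 1
    return sites, nids


log = [
    "Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io",
    "Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1590957016747328",
    "Jan 29 12:45:21 oak-io1-s1 kernel: LNet: 325230:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 10.0.2.223@o2ib5: connected",
]

sites, nids = summarize(log)
print(sites.most_common())
print(nids.most_common())
```

Note that lines such as "Skipped 1386 previous similar messages" mean the kernel rate-limited its output, so counts obtained this way are a lower bound.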
      
      

      "message length 0 too small.." can also be found in LU-9983 (fixed in 2.10.3) and LU-10068 (not clear if fully resolved).

      10.0.2.223@o2ib5 is oak-gw04, a Lustre client running in a VM with SR-IOV on IB (mlx4) and serving as an SMB gateway. It had been working fine for almost a year while we were running 2.9, but we've had this recurrent issue since upgrading clients and servers to the 2.10.x branch. For a while I suspected the change of map_on_demand, now set to 256 instead of 0, but after reading the changelogs, I don't think it should have such an impact... We're using mlx4 on both clients and servers on Oak (but it's connected remotely to several LNet routers with both mlx4 and mlx5).

      I tried to upgrade oak-gw04 to 2.10.3 RC1 and the same happened. I'm now in the process of upgrading all Lustre servers to 2.10.3 RC1, because it should also fix another issue that we have, LU-10267 - not related to this issue. I will update this ticket if the issue happens again with all servers running 2.10.3 RC1...

      A crash usually occurs after oak-gw04 has been up for a few hours or days. Note that I don't get any OSS crash if oak-gw04 stays down. We have numerous other VMs using SR-IOV and have no issues with them, only with this one (serving SMB). This "SMB gateway" is experimental; our users love it, but it's not critical for production.

      Any ideas are welcome...

      Thanks!
      Stephane

      Attachments

        1. oak-gw04.kernel.log
          52 kB
        2. oak-gw22_lustre_neterror.log
          4.88 MB
        3. oak-gw22_lustre.log.gz
          160 kB
        4. oak-io1-s1.vmcore-dmesg.txt
          1.01 MB
        5. oak-io1-s2_lustre.log.gz
          5.18 MB

          People

            sharmaso Sonia Sharma (Inactive)
            sthiell Stephane Thiell