
ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Environment: TOSS 3.6-3 / RH78
      in-kernel OFED
      3.10.0-1127.8.2.1chaos.ch6.x86_64
      lustre-2.12.4_6.chaos-1.ch6.x86_64
    • Severity: 3

    Description

      Console log messages following this pattern, repeatedly, for several days:

      LustreError: 67801:0:(import.c:361:ptlrpc_invalidate_import()) lsrza-OST0000_UUID: rc = -110 waiting for callback (1 != 0)
      LustreError: 67801:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff8c3eb65f7500 x1669124751850560/t0(0) o4->lsrza-OST0000-osc-ffff8c44c608a000@172.21.3.5@o2ib700:6/4 lens 488/448 e 2 to 0 dl 1592847228 ref 1 fl Interpret:E/0/ffffffff rc -5/-1
      LustreError: 67801:0:(import.c:401:ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      

      Note that the number in the parentheses is 0. This refers to imp->imp_unregistering, an atomic variable that appears to track the number of RPC buffers we are waiting for the underlying network to unregister, so that we know no data will be lost. But there is still one RPC on the sending list, so why is imp_unregistering 0?
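When the message repeats for days, it can help to pull the count in parentheses out of the console log mechanically and confirm it never changes. The following is a minimal sketch, assuming the message format is exactly as quoted above. The function name `unregistering_counts` and the regular expression are illustrative only and are not part of Lustre.

```python
import re

# Matches the "Unregistering RPCs found (N)" console message quoted above.
# Assumption: the target name contains no whitespace and is followed by ": ".
UNREG_RE = re.compile(
    r"ptlrpc_invalidate_import\(\)\)\s+(?P<target>\S+): "
    r"Unregistering RPCs found \((?P<count>-?\d+)\)"
)


def unregistering_counts(lines):
    """Yield (target, count) for every console line that carries the message."""
    for line in lines:
        match = UNREG_RE.search(line)
        if match:
            yield match.group("target"), int(match.group("count"))


# Two of the lines from the description; only the second one should match.
sample = [
    "LustreError: 67801:0:(import.c:361:ptlrpc_invalidate_import()) "
    "lsrza-OST0000_UUID: rc = -110 waiting for callback (1 != 0)",
    "LustreError: 67801:0:(import.c:401:ptlrpc_invalidate_import()) "
    "lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? "
    "Waiting them to error out.",
]

for target, count in unregistering_counts(sample):
    print(target, count)
```

To scan one of the attached console logs instead of the inline sample, pass an open file object to the same function.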

      Attachments

        1. console.rzgenie28
          1.33 MB
        2. console.rzgenie28-20200619.gz
          235 kB
        3. console.rzgenie28-20200705.gz
          121 kB
        4. dk.rzgenie28.1594702860
          12.70 MB

        Activity

          [LU-13763] ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39380/
          Subject: LU-13763 osc: don't allow negative grants
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: f96aa90548f062e95d2ef4c9ea978ba0e08aae19

          pjones Peter Jones added a comment -

          Landed for 2.14

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39827/
          Subject: LU-13763 osc: don't allow negative grants
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: e05ccafd6ee214895d01efbb13a3757e3625a859

          gerrit Gerrit Updater added a comment -

          Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39827
          Subject: LU-13763 osc: don't allow negative grants
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: bd9497c1690b9c4d2358a287277d1d67864d6735

          ofaaland Olaf Faaland added a comment -

          The patch passes testing now. I posted a comment listing all the failures. No test failed twice: every test that failed passed when its test group was re-tested, and the failures didn't look to me like grant accounting problems.

          ofaaland Olaf Faaland added a comment -

          I rebased the patch as it was too old to be retested.

          tappro Mikhail Pershin added a comment -

          Olaf, I don't have a good idea of what to collect from that node right now, so you can bounce it. I don't see how the grant underflow could be caused by this particular situation: for that, the server would already have to consider the client to have fewer grants than it really has. I'd say such a situation could arise from the DIO grants problem and then cause the grant underflow later. There may be other scenarios as well. As for imp_unregistering being 0, I think that is OK if the request is on the sending list: it is still waiting for a reply, so it is not unregistered yet and the import counter is still 0.

          ofaaland Olaf Faaland added a comment -

          Hi Mikhail,
          Can I bounce that node?
          Do you have any thoughts on whether the stuck import was really related to the grant underflow?
          Do you have any idea why imp_unregistering was 0 when there was still one RPC on the sending list?
          Thanks!

          ofaaland Olaf Faaland added a comment -

          Mikhail,

          I have the node in this state drained. Should I keep it that way in case you want me to gather information from it, or should I go ahead and crash it and put it back into service? I won't be able to send you the crash dump, but I could extract information from the dump for you, potentially (although my crash skills are not great).

          Thanks

          tappro Mikhail Pershin added a comment - - edited

          Olaf, I've made a patch to prevent that underflow. It is worth doing in any case, because ocd_grant is received from the server, so it shouldn't be blindly trusted to always be greater than the locally consumed grants.

          E.g. in conjunction with LU-12687, that looks like a real case.
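The idea in the comment above, not trusting a server-supplied grant to exceed what the client has already consumed, can be illustrated with a small model of the arithmetic. This is a sketch only and is not the code from the merged patch. The function names and the sample numbers are hypothetical, and treating the counter as a 64-bit unsigned value is an assumption made to show what an underflow would look like.

```python
# Assumption: the available-grant counter behaves like a 64-bit unsigned integer.
U64_MASK = (1 << 64) - 1


def avail_grant_unchecked(server_grant, consumed):
    """Subtract without checking, wrapping the way unsigned C arithmetic would."""
    return (server_grant - consumed) & U64_MASK


def avail_grant_clamped(server_grant, consumed):
    """Report zero available grant when the server grant is below local consumption."""
    if server_grant < consumed:
        return 0
    return server_grant - consumed


# Hypothetical numbers: the server reports 1 MiB, the client has already consumed 2 MiB.
server_grant = 1 << 20
consumed = 2 << 20

print(avail_grant_unchecked(server_grant, consumed))  # a huge wrapped value
print(avail_grant_clamped(server_grant, consumed))    # 0
```

In this model the unchecked version turns a small shortfall into an enormous apparent grant, while the clamped version simply reports that nothing is available.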

          People

            Assignee: tappro Mikhail Pershin
            Reporter: ofaaland Olaf Faaland