Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.8.0
    • lustre-2.8.0_9.chaos
    • 3

    Description

      An OSS evicted a client on Aug 1 during a planned network outage.

       

      [Tue Aug 1 16:43:54 2017] Lustre: lsh-OST0005: haven't heard from client d40a30fc-ef66-94ff-e318-77d2c23e45f8 (at 192.168.137.212@o2ib27) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881d1aaa8c00, cur 1501631035 expire 1501630885 last 1501630808
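
The timestamps at the end of the eviction message can be cross-checked against the "227 seconds" it reports. The following is a minimal sketch, not part of the original report: the function name `parse_eviction` and the returned keys are hypothetical, and reading `cur - expire` as the eviction window is an assumption about how the OSS computes `expire`.

```python
import re

# Eviction message copied from the OSS console log above (timestamp prefix dropped).
EVICTION_LINE = (
    "Lustre: lsh-OST0005: haven't heard from client "
    "d40a30fc-ef66-94ff-e318-77d2c23e45f8 (at 192.168.137.212@o2ib27) "
    "in 227 seconds. I think it's dead, and I am evicting it. "
    "exp ffff881d1aaa8c00, cur 1501631035 expire 1501630885 last 1501630808"
)


def parse_eviction(line):
    """Extract the reported silence and the three epoch timestamps (hypothetical helper)."""
    m = re.search(r"in (\d+) seconds\..*cur (\d+) expire (\d+) last (\d+)", line)
    if m is None:
        raise ValueError("not an eviction message")
    reported, cur, expire, last = (int(g) for g in m.groups())
    return {
        "reported": reported,         # seconds of silence stated in the message
        "cur": cur,                   # server time when the message was logged
        "expire": expire,             # cutoff time used for the eviction decision
        "last": last,                 # last time the client was heard from
        "silent_for": cur - last,     # should agree with "reported"
        "window": cur - expire,       # assumed to be the eviction window in seconds
    }


info = parse_eviction(EVICTION_LINE)
print(info["silent_for"], info["window"])
```

For this message `cur - last` matches the reported 227 seconds, and `last` is earlier than `expire`, which is consistent with the client having been silent past the cutoff.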
      

      Two days later the client had still not reconnected, although both sides could lctl ping each other. The client logged this on the console:

      [Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:336:ptlrpc_invalidate_import()) lsh-OST0005_UUID: rc = -110 waiting for callback (1 != 0)
      [Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:336:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      [Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:362:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880168fc3800 x1574573141267560/t0(0) o3->lsh-OST0005-osc-ffff88203c63f800@172.19.3.22@o2ib600:6/4 lens 488/432 e 0 to 0 dl 1501630582 ref 2 fl Unregistering:ES/0/ffffffff rc -5/-1
      [Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:362:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
      [Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:378:ptlrpc_invalidate_import()) lsh-OST0005_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
      [Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:378:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
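
The "still on sending list" message carries a dump of the stuck request. As an illustration only, the sketch below splits that dump into named fields. The regular expression, the field names, and the helper `parse_request` are assumptions based on the layout of this one line, not a documented format; the errno names are the standard Linux values.

```python
import re

# Request dump copied from the client console log above.
REQ_LINE = (
    "req@ffff880168fc3800 x1574573141267560/t0(0) "
    "o3->lsh-OST0005-osc-ffff88203c63f800@172.19.3.22@o2ib600:6/4 "
    "lens 488/432 e 0 to 0 dl 1501630582 ref 2 "
    "fl Unregistering:ES/0/ffffffff rc -5/-1"
)

# Field names are hypothetical labels chosen for this sketch.
REQ_RE = re.compile(
    r"req@(?P<addr>[0-9a-f]+) "
    r"x(?P<xid>\d+)/t(?P<transno>\d+)\(\d+\) "
    r"o(?P<opcode>\d+)->(?P<target>\S+?)@(?P<nid>\S+):\d+/\d+ "
    r"lens (?P<reqlen>\d+)/(?P<replen>\d+) "
    r"e \d+ to \d+ dl (?P<deadline>\d+) ref (?P<ref>\d+) "
    r"fl (?P<phase>\w+):(?P<flags>\w*)/\S+ "
    r"rc (?P<rc>-?\d+)/(?P<reply_rc>-?\d+)"
)

# Standard Linux errno values seen in the log excerpts.
LINUX_ERRNO = {5: "EIO", 110: "ETIMEDOUT"}

INT_FIELDS = ("xid", "transno", "opcode", "reqlen", "replen",
              "deadline", "ref", "rc", "reply_rc")


def parse_request(line):
    """Return the request dump as a dict, with numeric fields converted to int."""
    m = REQ_RE.search(line)
    if m is None:
        raise ValueError("unrecognised request dump")
    fields = m.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


req = parse_request(REQ_LINE)
# Opcode 3 is believed to be OST_READ; treat that mapping as an assumption.
print(req["phase"], req["opcode"], LINUX_ERRNO.get(-req["rc"], "unknown"))
```

Read this way, the request is a bulk I/O RPC to lsh-OST0005 that already failed with -5 (EIO) but remains in the "Unregistering" phase, while the invalidate itself times out with -110 (ETIMEDOUT).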
      

      This seems quite similar to LU-8511, which was closed as a duplicate of LU-7434. That issue had two associated patches, but only https://review.whamcloud.com/#/c/18934/ was landed to 2.8 FE, whereas https://review.whamcloud.com/#/c/19953/ was not.

          Activity

            [LU-9861] Client not reconnecting to OST
            pjones Peter Jones added a comment -

            The mentioned fix has been ported, reviewed and landed to the 2.8 FE branch so closing this ticket for now. We can reopen if this same issue is hit again with a release including this change.


            jay Jinshan Xiong (Inactive) added a comment -

            Yes, I agree it looks like the symptom of LU-7434 and patch 19953 should be able to fix the problem.

            pjones Peter Jones added a comment -

            Jinshan

            In your opinion, could the described behaviour be due to this patch being missing from 2.8 FE? The patch is https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=commit;h=ac5044566b97c7f6881bed817c2ed9752a0c6d63 . If not, what is your alternative theory?

            Peter


            People

              jay Jinshan Xiong (Inactive)
              nedbass Ned Bass (Inactive)