Client is evicted by multiple OSTs on all OSSs

Details

    • Bug
    • Resolution: Duplicate
    • Minor

    Description

      During intensive testing of the LU-2827 patch, J. Simmons encountered a new issue when running with the patch on top of the current 2.5.59/master version.

      Looking at the information provided (ftp.whamcloud.com/uploads/LU-4584/20140609-run1.tbz and 20140609-run2.tbz), it appears that at some point the Client stops sending RPCs. As a result, the Client's lock cancel replies to the Servers'/OSTs' blocking AST requests are never sent, which leads to the evictions.

      The reason the Client stops sending RPCs cannot be determined from the Client's Lustre debug log at the dlmtrace level alone, but I can see that during the Client's eviction handling these RPCs were on the delayed queue.

          Activity

            bfaccini Bruno Faccini (Inactive) added a comment - Duplicate of LU-4861, as proven by on-site testing.

            simmonsja James A Simmons added a comment - Yes, this is still working for me. You can close it. If I have any problems in next week's test shot I will open another ticket.

            bfaccini Bruno Faccini (Inactive) added a comment - James, is this still working for you? If so, do you agree that we close it as a duplicate of LU-4861?

            bfaccini Bruno Faccini (Inactive) added a comment -
            > 2) The patch deals with a deadlock issue, which could explain why I saw evictions.
            Hmm, yes, that could be where your flair made the difference, because LU-4861 only reports an application hang due to this deadlock, with no Client evictions.

            simmonsja James A Simmons added a comment -
            It was the testing with 2.5.60 clients. When I updated the clients to a newer version and could not reproduce the problem, I figured some patch that had landed fixed it. So I examined the list of patches merged since the broken client. The only one that made sense was LU-4861, for two reasons:

            1) Since I was seeing evictions from the OSTs, it makes sense that a possible source of the problem could be the osc layer.

            2) The patch deals with a deadlock issue, which could explain why I saw evictions.

            It seems I'm familiar enough with the code to make a good enough educated guess at what will fix my problems.

            I have been testing the LU-4861 patch with 2.5.2 clients with excellent success so far.

            bfaccini Bruno Faccini (Inactive) added a comment - James, thanks for working and helping so hard on this. I have an additional question: what made you point to the LU-4861 patch as a possible fix?

            simmonsja James A Simmons added a comment - So far the results on my small scale system are very promising using the patch from LU-4861. If all goes well I will move it to the next scale system. If that works then we will use it in our test shot for Tuesday.

            simmonsja James A Simmons added a comment - So I updated my 2.6 tree to what is current and tried to duplicate the problem on my small scale system. I couldn't, so I moved to a larger system (not titan) and got the same result. I think it is the patch from LU-4861 that fixed this, so I'm going to try a backport of the patch to 2.5 to see what happens.

            bfaccini Bruno Faccini (Inactive) added a comment - James, I wonder whether, during some dedicated time, you could get a Client crash-dump (with a resized debug buffer and full debug trace enabled, or at least "net+dlmtrace+rpctrace" in addition to the default mask) when such an eviction scenario occurs? This can be automated if you add http://review.whamcloud.com/#/c/8875/ on top of your Client's distro.
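
The debug setup requested above could be prepared roughly as follows. This is only a sketch under assumptions: it assumes the Client's `lctl set_param` accepts the `debug=+<flag>` form to add a flag to the existing mask and exposes `debug_mb` for the debug buffer size. The helper name `build_setup_commands` and the 512 MB buffer size are hypothetical and do not come from the ticket. The script only builds and prints the commands; it does not run them.

```python
import shlex

# Flags named in the comment; each is added on top of the default mask.
EXTRA_DEBUG_FLAGS = ("net", "dlmtrace", "rpctrace")


def build_setup_commands(flags=EXTRA_DEBUG_FLAGS, buffer_mb=512):
    """Return lctl invocations as argument lists, without executing them.

    buffer_mb is an illustrative value (assumption), not one from the ticket.
    """
    if buffer_mb <= 0:
        raise ValueError("buffer_mb must be positive")
    # Resize the debug buffer first so the extra traces are not overwritten.
    commands = [["lctl", "set_param", f"debug_mb={buffer_mb}"]]
    # One call per flag; the leading '+' adds the flag to the current mask.
    for flag in flags:
        commands.append(["lctl", "set_param", f"debug=+{flag}"])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for argv in build_setup_commands():
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to try on a non-Lustre machine; the printed lines would need to be run as root on the Client before reproducing the eviction.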

            simmonsja James A Simmons added a comment - Sorry, you are right. Using 3 OSSs is the most recent setup, since I lost one OSS this last week. The original setup was 4 OSSs with 7 OSTs per OSS, so it should be 28 OSTs in total.

            People

              bfaccini Bruno Faccini (Inactive)
              bfaccini Bruno Faccini (Inactive)