Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • None
    • Lustre 2.4.0
    • Lustre 2.3.58-6chaos (github.com/chaos/lustre) on MDS.
    • 6314

    Description

      We have had some problems in recent weeks with the MDS on grove (sequoia's filesystem cluster) thrashing for anywhere from minutes to many hours while under load. While it does so, it does not appear to be handling traffic very quickly, and the node load is so high that login is nearly impossible.

      I caught it doing that for a while today during testing and dumped some SysRq info to the console.

      It looks to me like the active tasks may be spending too much time under ptlrpc_alloc_rqbd() doing vmallocs.

      Prakash had a patch to move those allocations to a slab, but it became too time-consuming to keep carrying it forward. We may need to look at reviving that.
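
      For readers unfamiliar with why a slab-style cache would help here, the following is a minimal userspace sketch of the general idea: freed buffers of one fixed size are kept on a free list and handed back out, so the expensive allocator is only reached when the list is empty. It is an analogy only, not Prakash's patch and not Lustre code. All names (buf_cache, buf_cache_get, and so on) are hypothetical, and malloc() merely stands in for the costly allocation path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout: a userspace analogy, not Lustre code. */

/* While a buffer is idle, its first bytes hold the free-list link. */
struct buf_node {
    struct buf_node *next;
};

struct buf_cache {
    size_t buf_size;        /* size of every buffer handed out */
    struct buf_node *free;  /* singly linked list of recycled buffers */
    unsigned long fresh;    /* times we fell through to malloc() */
};

static void buf_cache_init(struct buf_cache *c, size_t buf_size)
{
    assert(c != NULL);
    /* A buffer must be large enough to hold the link while idle. */
    c->buf_size = buf_size < sizeof(struct buf_node)
                      ? sizeof(struct buf_node) : buf_size;
    c->free = NULL;
    c->fresh = 0;
}

/* Prefer a recycled buffer; only allocate when the free list is empty. */
static void *buf_cache_get(struct buf_cache *c)
{
    assert(c != NULL);
    if (c->free != NULL) {
        struct buf_node *n = c->free;
        c->free = n->next;
        return n;
    }
    c->fresh++;
    return malloc(c->buf_size); /* stand-in for the expensive allocation */
}

/* Return a buffer to the cache instead of releasing it. */
static void buf_cache_put(struct buf_cache *c, void *buf)
{
    struct buf_node *n = buf;

    assert(c != NULL && buf != NULL);
    n->next = c->free;
    c->free = n;
}

/* Release every idle buffer back to the system. */
static void buf_cache_destroy(struct buf_cache *c)
{
    assert(c != NULL);
    while (c->free != NULL) {
        struct buf_node *n = c->free;
        c->free = n->next;
        free(n);
    }
}
```

      In this sketch the fresh counter only grows when the free list is empty, which is the property of interest: under a steady get/put workload the slow path is rarely taken. The sketch is single-threaded; a real cache would also need locking or per-CPU lists.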

      See attached file "console.grove-mds1.txt.bz2".

          Activity

            [LU-2708] MDS thrashing in ptlrpc_alloc_rqbd
            liang Liang Zhen (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            liang Liang Zhen (Inactive) added a comment -

            We have landed two patches for this:
            http://review.whamcloud.com/#change,4939 greatly reduces the number of threads contending on vmalloc.
            http://review.whamcloud.com/#change,4940 improves the buffer utilization rate and reduces the chance of calling vmalloc.
            So I think we can close this ticket for now.
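
            As an illustration of how the number of threads contending on an expensive allocation can be bounded, here is a minimal C11 sketch in which at most one thread at a time refills a buffer pool and every other caller backs off immediately. This is one possible approach under stated assumptions, not necessarily what change 4939 does. The struct, field, and function names are hypothetical, and the counter increment stands in for the real buffer allocation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical service state; names are illustrative, not Lustre's. */
struct svc_part {
    atomic_bool allocating; /* true while one thread is adding buffers */
    int nbufs;              /* buffers posted; only changed by the gate holder */
    int nbufs_low;          /* refill until nbufs reaches this level */
};

static void svc_part_init(struct svc_part *p, int nbufs_low)
{
    assert(p != NULL);
    atomic_init(&p->allocating, false);
    p->nbufs = 0;
    p->nbufs_low = nbufs_low;
}

/*
 * Try to refill the pool. Returns true if this caller held the gate
 * (and topped the pool up if needed), false if another thread was
 * already allocating and this caller backed off without waiting.
 */
static bool svc_part_try_refill(struct svc_part *p, int batch)
{
    assert(p != NULL && batch > 0);

    /* atomic_exchange returns the previous value: true means busy. */
    if (atomic_exchange(&p->allocating, true))
        return false;

    while (p->nbufs < p->nbufs_low)
        p->nbufs += batch; /* stand-in for the expensive allocation */

    atomic_store(&p->allocating, false);
    return true;
}
```

            The design choice in this sketch is that losing callers return immediately rather than queueing, so service threads keep handling requests instead of piling up behind the allocator.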
            jlevi Jodi Levi (Inactive) made changes -
            Affects Version/s New: Lustre 2.4.0 [ 10154 ]
            jlevi Jodi Levi (Inactive) made changes -
            Labels Original: sequoia topsequoia New: HB sequoia topsequoia
            Priority Original: Major [ 3 ] New: Blocker [ 1 ]
            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Liang Zhen [ liang ]

            liang Liang Zhen (Inactive) added a comment -

            I have posted another patch: http://review.whamcloud.com/#change,4940
            It will not fundamentally resolve the issue, but it will greatly reduce the request buffer size and count, and save a lot of memory.
            But I think it has to be reviewed by Andreas first.
            morrone Christopher Morrone (Inactive) made changes -
            Environment New: Lustre 2.3.58-6chaos (github.com/chaos/lustre) on MDS.
            morrone Christopher Morrone (Inactive) made changes -
            Link New: This issue is related to LU-2432 [ LU-2432 ]

            morrone Christopher Morrone (Inactive) added a comment -

            Related to LU-2432. However, I verified that the kernel we were running when we hit the problem noted in this ticket already includes the patch that fixes the kernel vmalloc problem.
            morrone Christopher Morrone (Inactive) made changes -
            Issue Type Original: Task [ 3 ] New: Bug [ 1 ]

            People

              liang Liang Zhen (Inactive)
              morrone Christopher Morrone (Inactive)