MDS running lustre 2.5.5+ OOM when running with Lustre 2.8 GA clients

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • Lustre 2.5.5
    • Lustre 2.5.5
    • None
    • Cray clients running an unpatched Lustre 2.8 GA build. Servers running Lustre 2.5.5 with a patch set in a RHEL 6.7 environment.
    • 3

    Description

      Today we performed a test shot on our smaller Cray Aries cluster (700 nodes) with an unpatched Lustre 2.8 GA client built specifically for this system. The tests were run against our atlas file system, which runs a RHEL 6.7 distribution with Lustre 2.5.5 plus patches. During the test shot, while running an IOR single-shared-file test across all nodes with a stripe count of 1008, the MDS ran out of memory. I attached the dmesg output to this ticket.

      Attachments

        1. mylog.dk.gz
          4.50 MB
        2. vmcore-dmesg.txt
          454 kB

          Activity

            [LU-8043] MDS running lustre 2.5.5+ OOM when running with Lustre 2.8 GA clients
            gerrit Gerrit Updater added a comment - - edited

            Comment deleted (wrong LU in commit message).

            gerrit Gerrit Updater added a comment - - edited

            Comment deleted (wrong LU in commit message).

            yujian Jian Yu added a comment -

            Thank you, Matt.

            ezell Matt Ezell added a comment -

            We attempted to reproduce this assertion on Tuesday using the same conditions as last time, but it never crashed. After that, we moved to a server with 19717 and 18060 to hopefully prevent it in the future. I think we can close this ticket and reopen if we see it again. Thanks.

            jhammond John Hammond added a comment -

            It's hard to say for sure without more information, but the failed assertion may be addressed by http://review.whamcloud.com/#/c/18060/.

            ezell Matt Ezell added a comment -

            We have been unable to reproduce on our testbed systems, and we haven't had an opportunity to reproduce on the production systems.

            jhammond John Hammond added a comment -

            I understood that ORNL was going to reproduce with a stronger debug mask. Has that been done?


            simmonsja James A Simmons added a comment -

            Any progress on fixing the assertion?

            ezell Matt Ezell added a comment -
            crash> struct ptlrpc_request.rq_pill ffff8837a0061000
              rq_pill = {
                rc_req = 0xffff8837a0061000, 
                rc_fmt = 0xffffffffa0819240 <RQF_LDLM_INTENT_LAYOUT>, 
                rc_loc = RCL_SERVER, 
                rc_area = {{4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}, {4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}
              }
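
            Every rc_area slot in the output above holds 4294967295, which is 0xFFFFFFFF, i.e. -1 stored in an unsigned 32-bit field. The following is a minimal Python sketch for pulling those values out of saved crash output and checking them. It assumes the text layout shown above; the helper name parse_rc_area is hypothetical, and reading an all -1 area as "sizes never set" is an assumption that this ticket does not confirm.

```python
import re
import struct

U32_MINUS_ONE = 0xFFFFFFFF  # bit pattern of -1 in an unsigned 32-bit field


def parse_rc_area(text):
    """Return the rc_area rows found in crash output as lists of ints.

    Hypothetical helper: assumes the 'rc_area = {{...}, {...}}' layout
    printed by the crash session above.
    """
    match = re.search(r"rc_area\s*=\s*\{(.*)\}", text, re.S)
    if match is None:
        raise ValueError("no rc_area field found")
    # Each inner {...} group is one row of the two-dimensional array.
    rows = re.findall(r"\{([^{}]*)\}", match.group(1))
    return [[int(v) for v in row.split(",") if v.strip()] for row in rows]


# Rebuild the two rows of nine values seen in the ticket.
row = ", ".join(["4294967295"] * 9)
sample = "rc_area = {{" + row + "}, {" + row + "}}"

rows = parse_rc_area(sample)
all_minus_one = all(v == U32_MINUS_ONE for r in rows for v in r)

# Reinterpret the unsigned value as a signed 32-bit integer.
as_signed = struct.unpack("<i", struct.pack("<I", 4294967295))[0]
```

            Here rows holds two lists of nine integers, all_minus_one is true for the values in this ticket, and as_signed shows the same bit pattern read as a signed integer.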
            

            People

              bzzz Alex Zhuravlev
              simmonsja James A Simmons