Lustre / LU-8043

MDS running lustre 2.5.5+ OOM when running with Lustre 2.8 GA clients

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Affects Version/s: Lustre 2.5.5
    • Fix Version/s: Lustre 2.5.5
    • Labels: None
    • Environment: Cray clients running an unpatched Lustre 2.8 GA client; servers running Lustre 2.5.5 with a patch set in a RHEL 6.7 environment.
    • Severity: 3

    Description

      Today we performed a test shot on our smaller Cray Aries cluster (700 nodes) with an unpatched Lustre 2.8 GA client specially built for this system. The tests were run against our atlas file system, which runs a RHEL 6.7 distro with Lustre 2.5.5 plus patches. During the test shot, while running an IOR single-shared-file test across all nodes with a stripe count of 1008, the MDS ran out of memory. I have attached the dmesg output to this ticket.
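
      For context, a rough sketch of the kind of run described above. This is hedged: the mount point, directory, job launcher, and IOR options below are assumptions; only the 1008-OST stripe count, the single-shared-file mode, the ~700 clients, and the testfile.out name come from this ticket.

          # Hypothetical reproduction sketch -- paths, launcher, and IOR flags are assumptions.
          lfs setstripe -c 1008 /lustre/atlas/ior                               # new files in this directory stripe across 1008 OSTs
          aprun -n 700 ior -a POSIX -w -r -e -o /lustre/atlas/ior/testfile.out  # one shared file written/read by all ~700 clients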

      Attachments

        1. mylog.dk.gz
          4.50 MB
        2. vmcore-dmesg.txt
          454 kB

        Issue Links

          Activity

            gerrit Gerrit Updater added a comment - - edited

            Comment deleted (wrong LU in commit message).

            gerrit Gerrit Updater added a comment - - edited

            Comment deleted (wrong LU in commit message).

            yujian Jian Yu added a comment -

            Thank you, Matt.

            ezell Matt Ezell added a comment -

            We attempted to reproduce this assertion on Tuesday using the same conditions as last time, but it never crashed. After that, we moved to a server build that includes patches 19717 and 18060 to hopefully prevent it in the future. I think we can close this ticket and reopen it if we see it again. Thanks.
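
            For reference, a hedged sketch of pulling those two gerrit changes onto a 2.5.5 server tree (the fs/lustre-release project path and the trailing /1 patchset numbers are assumptions; check the reviews for the patchsets that actually landed):

                # Hypothetical: cherry-pick gerrit changes 19717 and 18060 onto a local Lustre tree.
                # refs/changes/<last two digits>/<change>/<patchset> is the standard gerrit layout.
                git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/17/19717/1 && git cherry-pick FETCH_HEAD
                git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/60/18060/1 && git cherry-pick FETCH_HEAD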

            jhammond John Hammond added a comment -

            It's hard to say for sure without more information, but the failed assertion may be addressed by http://review.whamcloud.com/#/c/18060/.

            ezell Matt Ezell added a comment -

            We have been unable to reproduce on our testbed systems, and we haven't had an opportunity to reproduce on the production systems.

            jhammond John Hammond added a comment -

            I understood that ORNL was going to reproduce with a stronger debug mask. Has that been done?
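
            For anyone setting that up, a hedged sketch of widening the debug mask on the MDS before the next reproduction attempt (the exact flag list and buffer size are assumptions; use whatever mask is actually requested):

                # Hypothetical: enable more verbose Lustre debugging ahead of the reproduction run.
                lctl set_param debug="+rpctrace +dlmtrace +info"   # add RPC and DLM tracing to the debug mask
                lctl set_param debug_mb=1024                       # enlarge the in-memory debug buffer
                lctl dk /tmp/mds-debug.dk                          # dump the kernel debug log after the event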

            simmonsja James A Simmons added a comment -

            Any progress on fixing the assertion?

            ezell Matt Ezell added a comment -
            crash> struct ptlrpc_request.rq_pill ffff8837a0061000
              rq_pill = {
                rc_req = 0xffff8837a0061000, 
                rc_fmt = 0xffffffffa0819240 <RQF_LDLM_INTENT_LAYOUT>, 
                rc_loc = RCL_SERVER, 
                rc_area = {{4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}, {4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}
              }
            
            ezell Matt Ezell added a comment -

            I just attached mylog.dk.gz. Unfortunately it was only running with "normal" debug levels.

            James Simmons was running IOR when it crashed. He said it was a single shared file striped across 1008 OSTs, written from ~700 clients. "testfile.out" is likely the name of the file he was using, but I understand mti_rr might be from a previous request.

            Let me know if there's anything you want to see from the crash. We can't upload it to Intel for you to look at, but I can run commands for you and provide the output.

            Thanks!

            #5
                    mod libcfs, name lbug_with_loc, RIP 0xffffffffa0420eeb
                    frame start 0xffff883f8d5a3ce8, end 0xffff883f8d5a3d08, *base 0xffff883f8d5a3d10
                    XBT_RBX = ffffffffa0e15620
                    msgdata = ffffffffa0e15620
            
            #6
                    mod mdt, name mdt_lock_handle_fini, RIP 0xffffffffa0d9e64b
                    frame start 0xffff883f8d5a3d08, end 0xffff883f8d5a3d18, *base 0xffff883f8d5a3d40
                    XBT_RBX = ffff883f8d4ff000
            
            #7
                    mod mdt, name mdt_thread_info_fini, RIP 0xffffffffa0da4cc0
                    frame start 0xffff883f8d5a3d18, end 0xffff883f8d5a3d48, *base 0xffff883f8d5a3d90
                    XBT_RBX = ffff883f8d4ff000
                    info = ffff883f8d4ff000
            
            #8
                    mod mdt, name mdt_handle_common, RIP 0xffffffffa0daa473
                    frame start 0xffff883f8d5a3d48, end 0xffff883f8d5a3d98, *base 0xffff883f8d5a3da0
                    XBT_RBX = ffff8837a0061000
                    XBT_R12 = ffff883f8d4ff000
                    XBT_R13 = ffffffffa0e23ee0
                    req = ffff8837a0061000
                    &supported = ffff883f8d5a3d58
                    supported = ffffffff00000002 ...
                    info = ffff883f8d4ff000
                    &supported = ffff883f8d5a3d58
                    supported = ffffffff00000002 ...
                    info = ffff883f8d4ff000
                    req = ffff8837a0061000
                    set = 1
                    id = 123
                    quiet = 0
                    subsystem = 4
                    mask = 1
            #9
                    mod mdt, name mds_regular_handle, RIP 0xffffffffa0de98c5
                    frame start 0xffff883f8d5a3d98, end 0xffff883f8d5a3da8, *base 0xffff883f8d5a3ee0
                    XBT_RBX = ffff883f8df0d140
                    XBT_R12 = ffff883fa7a96800
                    XBT_R13 = ffff8837a0061000
                    XBT_R14 = 42
                    XBT_R15 = ffff883f8d418940
            
            #10
                    mod ptlrpc, name ptlrpc_main, RIP 0xffffffffa076d07e
                    frame start 0xffff883f8d5a3da8, end 0xffff883f8d5a3ee8, *base 0xffff883f8d5a3f40
                    XBT_RBX = ffff883f8df0d140
                    XBT_R12 = ffff883fa7a96800
                    XBT_R13 = ffff8837a0061000
                    XBT_R14 = 42
                    XBT_R15 = ffff883f8d418940
                    arg = ffff883f8df0d140
                    thread = ffff883f8df0d140
                    svcpt = ffff883fa7a96800
                    &svc = ffff883f8d5a3e50
                    svc = ffff883fa7f96080 ...
                    &rs = ffff883f8d5a3e68
                    rs = ffff883fa7a96868 ...
                    &env = ffff883f8d5a3e80
                    env = ffff883fa7f96080 ...
                    counter = 42
                    &rc = ffff883f8d5a3e64
                    rc = 0 ...
                    flags = 8050
                    size = 38
                    &ret = ffff883f8d5a3e80
                    ret = ffff883fa7f96080 ...
                    i = 1
                    id = c00
                    quiet = 0
                    set = 0
                    value = 0
                    subsystem = 100
                    mask = 10
                    flags = 8250
                    thread = ffff883f8df0d140
                    flags = 4
                    thread = ffff883f8df0d140
                    flags = 8
                    thread = ffff883f8df0d140
                    &lock = ffff883f8d5a3e88
                    lock = ffff883fa7a96830 ...
                    svcpt = ffff883fa7a96800
                    head = ffff883fa7a96a18
                    new = ffff883fa7a96878
                    subsystem = 100
                    mask = 200
                    svcpt = ffff883fa7a96800
                    thread = ffff883f8df0d140
                    svcpt = ffff883fa7a96800
                    &svc = ffff883f8d5a3e80
                    svc = ffff883fa7f96080 ...
                    request = ffff8837a0061000
                    &work_start = ffff883f8d5a3ea0
                    work_start = 57164916 ...
                    &work_end = ffff883f8d5a3e90
                    work_end = 57164916 ...
                    small = ffff8837a00611e0
                    large = ffff883f8d5a3ea0
                    &r = ffff883f8d5a3e40
                    r = 4 ...
                    result = 0
                    id = 50e
                    quiet = 0
                    set = 0
                    value = 0
                    req = ffff8837a0061000
                    id = 512
                    quiet = 0
                    set = 0
                    value = 0
                    subsystem = 100
                    mask = 1
                    subsystem = 100
                    mask = 100000
                    subsystem = 100
                    mask = 200
                    req = ffff8837a0061000
                    thread = ffff883f8df0d140
                    svcpt = ffff883fa7a96800
                    svcpt = ffff883fa7a96800
                    svcpt = ffff883fa7a96800
                    force = 0
                    svcpt = ffff883fa7a96800
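
            Since frame #6 shows the LBUG firing in mdt_lock_handle_fini while mdt_thread_info_fini tears down the request, a couple of crash commands that would dump the state referenced above (the addresses are taken from the backtrace; mti_rr is the member mentioned earlier in this ticket, but treat the exact 2.5.x struct layout as an assumption):

                crash> struct mdt_thread_info.mti_rr ffff883f8d4ff000     # the mdt_thread_info ("info") seen in frames #6-#8
                crash> struct ptlrpc_request.rq_pill ffff8837a0061000     # the request ("req") being finalized in frame #8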
            

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: