[LU-8043] MDS running lustre 2.5.5+ OOM when running with Lustre 2.8 GA clients Created: 19/Apr/16  Updated: 03/Sep/16  Resolved: 20/May/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.5
Fix Version/s: Lustre 2.5.5

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Cray clients running an unpatched Lustre 2.8 GA client build. Server side running Lustre 2.5.5 with a patch set in a RHEL 6.7 environment.


Attachments: File mylog.dk.gz     Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-7535 mdt_intent_layout does not care about... Resolved
is related to LU-5367 (mdt_handler.c:2782:mdt_lock_handle_f... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Today we performed a test shot on our smaller Cray Aries cluster (700 nodes) with a non-patched Lustre 2.8 GA client specially built for this system. The tests were run against our atlas file system, which runs a RHEL 6.7 distro with Lustre version 2.5.5 plus patches. During the test shot, while running an IOR single-shared-file test across all nodes with a stripe count of 1008, the MDS server ran out of memory. I attached the dmesg output to this ticket.



 Comments   
Comment by Oleg Drokin [ 20/Apr/16 ]

From James:

BTW I ran kmem -S on the vmcore.
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff883f77c23c40 osp_obj                  200   35894144  35894895 1889205     4k
ffff883f811f3c00 mdd_obj                   80   12196492  14344704 298848     4k
ffff883f810f3bc0 lod_obj                  112   12196492  13903654 408931     4k
ffff883f940a3b80 mdt_cdt_agent_req        168          0         0      0     4k
ffff883f94653b40 mdt_cdt_restore_handle   112          0         0      0     4k
ffff883f94033b00 mdt_obj                  256   12196492  13096710 873114     4k
ffff883fa7b83ac0 dynlock_cache             72          0       265      5     4k
ffff883fa7ab3a80 upd_kmem                  96          0         0      0     4k
ffff883fa7aa3a40 lqe_kmem                 200         18        19      1     4k
ffff883fa7a23a00 ldiskfs_inode_cache     1040   24405903  24405930 8135310     4k
ffff883fa7a139c0 ldiskfs_xattr             88          0         0      0     4k
ffff883fa7a03980 ldiskfs_free_data         64          1        59      1     4k
ffff883fa79f3940 ldiskfs_alloc_context    136          0        84      3     4k
ffff883fa79e3900 ldiskfs_prealloc_space   112        121       238      7     4k
ffff883fa79d38c0 ldiskfs_system_zone       40          0         0      0     4k
ffff883fa78d3880 jbd2_journal_handle       48          0       385      5     4k
ffff883fa7863840 jbd2_journal_head        112       3985      8126    239     4k
ffff883fa7853800 jbd2_revoke_table         16          2       202      1     4k
ffff883fa7f737c0 jbd2_revoke_record        32          0         0      0     4k
ffff883fb05c3780 xattr_kmem                40          0         0      0     4k
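
For scale, a quick back-of-the-envelope on the largest caches above: ldiskfs_inode_cache is 24,405,903 objects x 1040 bytes, roughly 24 GiB (8,135,310 4 KiB slabs, roughly 31 GiB counting slab overhead); osp_obj is 35,894,144 x 200 bytes, roughly 6.7 GiB; and mdt_obj, lod_obj and mdd_obj add a few GiB more. The MDT object caches alone account for tens of GiB before any other server memory use.
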
Comment by James A Simmons [ 21/Apr/16 ]

The memory exhaustion problem exposed LU-5367, which is also present in the b2_5_fe release.

Comment by Jian Yu [ 21/Apr/16 ]

Hi James,
The patch http://review.whamcloud.com/11219 for LU-5367 has already been included in that release.

Comment by Alex Zhuravlev [ 21/Apr/16 ]

Can you please confirm that you used the same 1008 stripes with the old clients and it was working fine?

Comment by James A Simmons [ 21/Apr/16 ]

Yes, we tested the 2.5 client on the same platform and the MDS did not OOM.

Comment by James A Simmons [ 21/Apr/16 ]

Oh, I do see it now. We still encountered the same issue even with that patch. I posted the assert here:

<0>[4805210.570311] LustreError: 15316:0:(mdt_handler.c:3224:mdt_lock_handle_fini()) ASSERTION( !lustre_handle_is_used(&lh->mlh_reg_lh) ) failed:
<0>[4805210.570478] LustreError: 15659:0:(mdt_handler.c:3224:mdt_lock_handle_fini()) ASSERTION( !lustre_handle_is_used(&lh->mlh_reg_lh) ) failed:
<0>[4805210.570480] LustreError: 15659:0:(mdt_handler.c:3224:mdt_lock_handle_fini()) LBUG
<4>[4805210.570481] Pid: 15659, comm: mdt00_441
<4>[4805210.570481]
<4>[4805210.570482] Call Trace:
<4>[4805210.570495] [<ffffffffa0420895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[4805210.570499] [<ffffffffa0420e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[4805210.570511] [<ffffffffa0d9e64b>] mdt_lock_handle_fini+0x4b/0x80 [mdt]
<4>[4805210.570517] [<ffffffffa0da4cc0>] mdt_thread_info_fini+0xe0/0x190 [mdt]
<4>[4805210.570523] [<ffffffffa0daa473>] mdt_handle_common+0x653/0x1470 [mdt]
<4>[4805210.570533] [<ffffffffa0de98c5>] mds_regular_handle+0x15/0x20 [mdt]
<4>[4805210.570563] [<ffffffffa076d07e>] ptlrpc_main+0xf8e/0x1af0 [ptlrpc]
<4>[4805210.570581] [<ffffffffa076c0f0>] ? ptlrpc_main+0x0/0x1af0 [ptlrpc]
<4>[4805210.570584] [<ffffffff810a0fce>] kthread+0x9e/0xc0
<4>[4805210.570587] [<ffffffff8100c28a>] child_rip+0xa/0x20
<4>[4805210.570588] [<ffffffff810a0f30>] ? kthread+0x0/0xc0
<4>[4805210.570589] [<ffffffff8100c280>] ? child_rip+0x0/0x20

Comment by Matt Ezell [ 21/Apr/16 ]

Our assertion was for a regular lock, not a pdo lock.

Comment by Matt Ezell [ 21/Apr/16 ]

I dumped the debug log from the crash dump, but I didn't see anything interesting. The most recent data is

00000004:00080000:2.0:1461078294.488964:0:15455:0:(mdt_open.c:661:mdt_mfd_set_mode()) [0x2002d12a5:0x47ab:0x0] Change mfd mode 00 -> 0302.
00000004:00080000:2.0:1461078294.488967:0:15455:0:(mdt_handler.c:2853:mdt_save_lock()) request = ffff8834683bd400 reply state = ffff881fa5957000 transno = 448115532568
00000004:00080000:2.0:1461078294.503955:0:15455:0:(mdt_open.c:661:mdt_mfd_set_mode()) [0x2002ce5a3:0x5f:0x0] Change mfd mode 00 -> 01.
00000004:00080000:2.0:1461078294.504528:0:15455:0:(mdt_open.c:661:mdt_mfd_set_mode()) [0x2002d12a5:0x47ac:0x0] Change mfd mode 00 -> 0302.
00000004:00080000:2.0:1461078294.504532:0:15455:0:(mdt_handler.c:2853:mdt_save_lock()) request = ffff88048acf6000 reply state = ffff8816e7196000 transno = 448115532572
00000004:00080000:3.0:1461078294.522700:0:15239:0:(mdt_open.c:661:mdt_mfd_set_mode()) [0x2002ccbf6:0x55:0x0] Change mfd mode 00 -> 01.
00000004:00080000:3.0:1461078294.523266:0:15239:0:(mdt_open.c:661:mdt_mfd_set_mode()) [0x2002d12a5:0x47ad:0x0] Change mfd mode 00 -> 0302.
00000004:00080000:3.0:1461078294.523269:0:15239:0:(mdt_handler.c:2853:mdt_save_lock()) request = ffff880339962400 reply state = ffff8817f1155000 transno = 448115532576
00000004:00080000:1.0:1461078294.538529:0:15601:0:(mdt_open.c:661:mdt_mfd_set_mode()) [0x2002d111a:0x1f4b0:0x0] Change mfd mode 00 -> 01.
00000004:00080000:2.0:1461078294.538598:0:15371:0:(mdt_handler.c:2853:mdt_save_lock()) request = ffff880873ac7000 reply state = ffff8834db7b5000 transno = 448115532578
00000004:00040000:0.0:1461078294.538823:0:15316:0:(mdt_handler.c:3224:mdt_lock_handle_fini()) ASSERTION( !lustre_handle_is_used(&lh->mlh_reg_lh) ) failed: 
00000004:00080000:2.0:1461078294.538953:0:15371:0:(mdt_handler.c:2853:mdt_save_lock()) request = ffff880873ac7000 reply state = ffff8834db7b5000 transno = 448115532578
00000004:00040000:2.0:1461078294.538990:0:15659:0:(mdt_handler.c:3224:mdt_lock_handle_fini()) ASSERTION( !lustre_handle_is_used(&lh->mlh_reg_lh) ) failed: 
00000004:00040000:2.0:1461078294.538993:0:15659:0:(mdt_handler.c:3224:mdt_lock_handle_fini()) LBUG

Comment by Gerrit Updater [ 21/Apr/16 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/19717
Subject: LU-8043 obdclass: use atomic allocations interrupt context only
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 07b0663db5dc84926e55cc89f0d97df9df24377f

Comment by Alex Zhuravlev [ 21/Apr/16 ]

In all the cases it is the stats allocation:

<3>[2806622.047940] LustreError: 14895:0:(lvfs_lib.c:157:lprocfs_stats_alloc_one()) LNET: out of memory at /data/buildsystem/jsimmons-atlas/rpmbuild/BUILD/lustre-2.5.5/lustre/lvfs/lvfs_lib.c:157 (tried to alloc '(stats->ls_percpu[cpuid])' = 4224)

lprocfs_stats_alloc_one() is trying to allocate a per-cpu slot:
LIBCFS_ALLOC_ATOMIC(stats->ls_percpu[cpuid], percpusize);
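
For reference, a minimal sketch of the direction the subject line of 19717 suggests; the helper name and flag choice below are illustrative, not the actual patch hunk:

#include <linux/slab.h>      /* kzalloc */
#include <linux/hardirq.h>   /* in_interrupt */

/* Illustrative only: fall back to an atomic, no-reclaim allocation only when
 * the caller cannot sleep, so a large per-cpu stats slot does not fail the
 * moment memory gets tight. */
static void *stats_alloc_percpu_slot(size_t percpusize)
{
        gfp_t flags = in_interrupt() ? GFP_ATOMIC : GFP_NOFS;

        return kzalloc(percpusize, flags);
}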

Comment by Alex Zhuravlev [ 21/Apr/16 ]

The patch above hasn't been tested yet; I uploaded it to start testing in Maloo.

Comment by Matt Ezell [ 21/Apr/16 ]

I took a look in the crash dump and found that mdt_thread_info.mti_lh[MDT_LH_OLD] is:

{
      mlh_type = MDT_NUL_LOCK, 
      mlh_reg_lh = {
        cookie = 16144317094821187230
      }, 
      mlh_reg_mode = LCK_CR, 
      mlh_pdo_lh = {
        cookie = 0
      }, 
      mlh_pdo_mode = LCK_MINMODE, 
      mlh_pdo_hash = 0, 
      mlh_rreg_lh = {
        cookie = 0
      }, 
      mlh_rreg_mode = LCK_MINMODE
    }

Also looking at the mdt_thread_info, I see:

  mti_rr = {
    rr_opcode = REINT_OPEN, 
    rr_handle = 0xffff883f36707350, 
    rr_fid1 = 0xffff883f36707330, 
    rr_fid2 = 0xffff883f36707340, 
    rr_name = 0xffff883f36707480 "testfile.out", 
    rr_namelen = 12, 
    rr_tgt = 0x0, 
    rr_tgtlen = 0, 
    rr_eadata = 0xffff883f36707490, 
    rr_eadatalen = 4096, 
    rr_logcookielen = 0, 
    rr_logcookies = 0x0, 
    rr_flags = 0
  }

Why does a create reint have a lock handle for "old"? Isn't that just used for renames and layout swaps?
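
For context, a stale non-zero cookie like the one dumped above is exactly what trips the LASSERT in the trace: lustre_handle_is_used() is just a non-zero-cookie test. Paraphrased from the b2_5 sources (not verbatim):

static inline int lustre_handle_is_used(struct lustre_handle *lh)
{
        return lh->cookie != 0;     /* any leftover cookie counts as "in use" */
}

void mdt_lock_handle_fini(struct mdt_lock_handle *lh)
{
        LASSERT(!lustre_handle_is_used(&lh->mlh_reg_lh));   /* line 3224 above */
        LASSERT(!lustre_handle_is_used(&lh->mlh_pdo_lh));
}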

Comment by James A Simmons [ 25/Apr/16 ]

After an internal discussion, we would really like to see the assertion present in the dmesg log fixed as well. It is possible that the assertion in the log is independent of the memory exhaustion problem.

Comment by Jian Yu [ 25/Apr/16 ]

Hi John,
Do you have any suggestions on the assertion failure above? Thank you.

Comment by John Hammond [ 26/Apr/16 ]

Hi Matt, could you post the debug log? Or if it's too big, could you post the lines from tasks 15316 and 15659?

Comment by John Hammond [ 26/Apr/16 ]

> Also looking at the mdt_thread_info, I see

The values in mti_rr may be from a previous request. I think mdt_intent_layout() is more likely here. Note that MDT_LH_LAYOUT has the same value as MDT_LH_OLD.
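
The aliasing is roughly this; the ordering and the omitted members are from memory of mdt_internal.h, so treat it as a sketch rather than the exact definition:

enum {
        MDT_LH_PARENT,                  /* parent lockh */
        MDT_LH_CHILD,                   /* child lockh */
        MDT_LH_OLD,                     /* old lockh for rename */
        MDT_LH_LAYOUT = MDT_LH_OLD,     /* layout intent reuses the "old" slot */
        MDT_LH_NEW,                     /* new lockh for rename */
        /* remaining slots omitted */
        MDT_LH_NR
};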

Comment by Matt Ezell [ 26/Apr/16 ]

I just attached mylog.dk.gz. Unfortunately it was only running with "normal" debug levels.

James Simmons was running IOR when it crashed. He said it was a single shared file striped across 1008 OSTs from ~700 clients. "testfile.out" is likely the name of the file he was using, but I understand mti_rr might be from a previous request.

Let me know if there's anything you want to see from the crash. We can't upload it to Intel for you to look at, but I can run commands for you and provide the output.

Thanks!

#5
        mod libcfs, name lbug_with_loc, RIP 0xffffffffa0420eeb
        frame start 0xffff883f8d5a3ce8, end 0xffff883f8d5a3d08, *base 0xffff883f8d5a3d10
        XBT_RBX = ffffffffa0e15620
        msgdata = ffffffffa0e15620

#6
        mod mdt, name mdt_lock_handle_fini, RIP 0xffffffffa0d9e64b
        frame start 0xffff883f8d5a3d08, end 0xffff883f8d5a3d18, *base 0xffff883f8d5a3d40
        XBT_RBX = ffff883f8d4ff000

#7
        mod mdt, name mdt_thread_info_fini, RIP 0xffffffffa0da4cc0
        frame start 0xffff883f8d5a3d18, end 0xffff883f8d5a3d48, *base 0xffff883f8d5a3d90
        XBT_RBX = ffff883f8d4ff000
        info = ffff883f8d4ff000

#8
        mod mdt, name mdt_handle_common, RIP 0xffffffffa0daa473
        frame start 0xffff883f8d5a3d48, end 0xffff883f8d5a3d98, *base 0xffff883f8d5a3da0
        XBT_RBX = ffff8837a0061000
        XBT_R12 = ffff883f8d4ff000
        XBT_R13 = ffffffffa0e23ee0
        req = ffff8837a0061000
        &supported = ffff883f8d5a3d58
        supported = ffffffff00000002 ...
        info = ffff883f8d4ff000
        &supported = ffff883f8d5a3d58
        supported = ffffffff00000002 ...
        info = ffff883f8d4ff000
        req = ffff8837a0061000
        set = 1
        id = 123
        quiet = 0
        subsystem = 4
        mask = 1
#9
        mod mdt, name mds_regular_handle, RIP 0xffffffffa0de98c5
        frame start 0xffff883f8d5a3d98, end 0xffff883f8d5a3da8, *base 0xffff883f8d5a3ee0
        XBT_RBX = ffff883f8df0d140
        XBT_R12 = ffff883fa7a96800
        XBT_R13 = ffff8837a0061000
        XBT_R14 = 42
        XBT_R15 = ffff883f8d418940

#10
        mod ptlrpc, name ptlrpc_main, RIP 0xffffffffa076d07e
        frame start 0xffff883f8d5a3da8, end 0xffff883f8d5a3ee8, *base 0xffff883f8d5a3f40
        XBT_RBX = ffff883f8df0d140
        XBT_R12 = ffff883fa7a96800
        XBT_R13 = ffff8837a0061000
        XBT_R14 = 42
        XBT_R15 = ffff883f8d418940
        arg = ffff883f8df0d140
        thread = ffff883f8df0d140
        svcpt = ffff883fa7a96800
        &svc = ffff883f8d5a3e50
        svc = ffff883fa7f96080 ...
        &rs = ffff883f8d5a3e68
        rs = ffff883fa7a96868 ...
        &env = ffff883f8d5a3e80
        env = ffff883fa7f96080 ...
        counter = 42
        &rc = ffff883f8d5a3e64
        rc = 0 ...
        flags = 8050
        size = 38
        &ret = ffff883f8d5a3e80
        ret = ffff883fa7f96080 ...
        i = 1
        id = c00
        quiet = 0
        set = 0
        value = 0
        subsystem = 100
        mask = 10
        flags = 8250
        thread = ffff883f8df0d140
        flags = 4
        thread = ffff883f8df0d140
        flags = 8
        thread = ffff883f8df0d140
        &lock = ffff883f8d5a3e88
        lock = ffff883fa7a96830 ...
        svcpt = ffff883fa7a96800
        head = ffff883fa7a96a18
        new = ffff883fa7a96878
        subsystem = 100
        mask = 200
        svcpt = ffff883fa7a96800
        thread = ffff883f8df0d140
        svcpt = ffff883fa7a96800
        &svc = ffff883f8d5a3e80
        svc = ffff883fa7f96080 ...
        request = ffff8837a0061000
        &work_start = ffff883f8d5a3ea0
        work_start = 57164916 ...
        &work_end = ffff883f8d5a3e90
        work_end = 57164916 ...
        small = ffff8837a00611e0
        large = ffff883f8d5a3ea0
        &r = ffff883f8d5a3e40
        r = 4 ...
        result = 0
        id = 50e
        quiet = 0
        set = 0
        value = 0
        req = ffff8837a0061000
        id = 512
        quiet = 0
        set = 0
        value = 0
        subsystem = 100
        mask = 1
        subsystem = 100
        mask = 100000
        subsystem = 100
        mask = 200
        req = ffff8837a0061000
        thread = ffff883f8df0d140
        svcpt = ffff883fa7a96800
        svcpt = ffff883fa7a96800
        svcpt = ffff883fa7a96800
        force = 0
        svcpt = ffff883fa7a96800

Comment by Matt Ezell [ 26/Apr/16 ]

crash> struct ptlrpc_request.rq_pill ffff8837a0061000
  rq_pill = {
    rc_req = 0xffff8837a0061000, 
    rc_fmt = 0xffffffffa0819240 <RQF_LDLM_INTENT_LAYOUT>, 
    rc_loc = RCL_SERVER, 
    rc_area = {{4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}, {4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}
  }

Comment by James A Simmons [ 03/May/16 ]

Any progress on fixing the assertion?

Comment by John Hammond [ 03/May/16 ]

I understood that ORNL was going to reproduce with a stronger debug mask. Has that been done?

Comment by Matt Ezell [ 03/May/16 ]

We have been unable to reproduce on our testbed systems, and we haven't had an opportunity to reproduce on the production systems.

Comment by John Hammond [ 03/May/16 ]

It's hard to say for sure without more information, but the failed assertion may be addressed by http://review.whamcloud.com/#/c/18060/.

Comment by Matt Ezell [ 19/May/16 ]

We attempted to reproduce this assertion on Tuesday using the same conditions as last time, but it never crashed. After that, we moved to a server build with patches 19717 and 18060 applied, which should hopefully prevent it in the future. I think we can close this ticket and reopen it if we see it again. Thanks.

Comment by Jian Yu [ 20/May/16 ]

Thank you, Matt.

Comment by Gerrit Updater [ 19/Jul/16 ]

Comment deleted (wrong LU in commit message).

Comment by Gerrit Updater [ 19/Jul/16 ]

Comment deleted (wrong LU in commit message).
