[LU-8435] LBUG (osc_cache.c:1290:osc_completion()) ASSERTION( equi(page->cp_state == CPS_PAGEIN, cmd == OBD_BRW_READ) ) Created: 25/Jul/16  Updated: 11/Nov/17  Resolved: 31/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Sebastien Piechurski Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: cea, p4b
Environment:

Bull Lustre distribution based on Lustre 2.7.2


Attachments: Text File crash_output.txt     Text File foreach_bt_merge.txt     Text File struct_analyze1.txt    
Issue Links:
Duplicate
Related
is related to LU-6215 Sync Lustre external tree with lustre... Resolved
is related to LU-9966 sanity test_411: fail to trigger a me... Resolved
Severity: 3
Epic: client
Rank (Obsolete): 9223372036854775807

 Description   

In the last month, one of our customers has hit a crash more than 100 times with the following signature:

[506626.555125] SLUB: Unable to allocate memory on node -1 (gfp=0x80c0)
[506626.562216]   cache: kvm_mmu_page_header(22:step_batch), object size: 168, buffer size: 168, default order: 1, min order: 0
[506626.574729]   node 0: slabs: 0, objs: 0, free: 0
[506626.579974]   node 1: slabs: 0, objs: 0, free: 0
[506626.585219]   node 2: slabs: 60, objs: 2880, free: 0
[506626.590852]   node 3: slabs: 0, objs: 0, free: 0
[506626.596112] LustreError: 41604:0:(osc_cache.c:1290:osc_completion()) ASSERTION( equi(page->cp_state == CPS_PAGEIN, cmd == OBD_BRW_READ) ) failed: cp_state:0, cmd:1
[506626.612512] LustreError: 41604:0:(osc_cache.c:1290:osc_completion()) LBUG
[506626.620186] Pid: 41604, comm: cat
[506626.623978] Call Trace:
[506626.628573]  [<ffffffffa05eb853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[506626.636448]  [<ffffffffa05ebdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
[506626.643456]  [<ffffffffa0dea859>] osc_ap_completion.isra.30+0x4d9/0x5b0 [osc]
[506626.651526]  [<ffffffffa0df558d>] osc_queue_sync_pages+0x2dd/0x350 [osc]
[506626.659108]  [<ffffffffa0de750f>] osc_io_submit+0x42f/0x530 [osc]
[506626.666037]  [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
[506626.673531]  [<ffffffffa0b8d257>] lov_io_submit+0x2a7/0x420 [lov]
[506626.680450]  [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
[506626.687961]  [<ffffffffa0c67f70>] ll_readpage+0x2d0/0x560 [lustre]
[506626.694964]  [<ffffffff8116af87>] generic_file_aio_read+0x3b7/0x750
[506626.702078]  [<ffffffffa0c98485>] vvp_io_read_start+0x3c5/0x470 [lustre]
[506626.709674]  [<ffffffffa086f965>] cl_io_start+0x65/0x130 [obdclass]
[506626.716785]  [<ffffffffa0872f85>] cl_io_loop+0xa5/0x190 [obdclass]
[506626.723797]  [<ffffffffa0c34e8c>] ll_file_io_generic+0x5fc/0xae0 [lustre]
[506626.731477]  [<ffffffffa0c35db2>] ll_file_aio_read+0x192/0x530 [lustre]
[506626.738962]  [<ffffffffa0c3621b>] ll_file_read+0xcb/0x1e0 [lustre]
[506626.745962]  [<ffffffff811dea1c>] vfs_read+0x9c/0x170
[506626.751700]  [<ffffffff811df56f>] SyS_read+0x7f/0xe0
[506626.757345]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
[506626.764138]
[506626.765990] Kernel panic - not syncing: LBUG
[506626.770850] CPU: 53 PID: 41604 Comm: cat Tainted: G           OE  ------------   3.10.0-327.22.2.el7.x86_64 #1
[506626.782104] Hardware name: BULL bullx blade/CHPU, BIOS BIOSX07.037.01.003 10/23/2015
[506626.790838]  ffffffffa0610ced 000000000f6a3070 ffff8817799eb8c0 ffffffff816360f4
[506626.799228]  ffff8817799eb940 ffffffff8162f96a ffffffff00000008 ffff8817799eb950
[506626.807618]  ffff8817799eb8f0 000000000f6a3070 ffffffffa0e01466 0000000000000246
[506626.816005] Call Trace:
[506626.818839]  [<ffffffff816360f4>] dump_stack+0x19/0x1b
[506626.824668]  [<ffffffff8162f96a>] panic+0xd8/0x1e7
[506626.830128]  [<ffffffffa05ebe5b>] lbug_with_loc+0xab/0xc0 [libcfs]
[506626.837129]  [<ffffffffa0dea859>] osc_ap_completion.isra.30+0x4d9/0x5b0 [osc]
[506626.845192]  [<ffffffffa0df558d>] osc_queue_sync_pages+0x2dd/0x350 [osc]
[506626.852766]  [<ffffffffa0de750f>] osc_io_submit+0x42f/0x530 [osc]
[506626.859702]  [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
[506626.867184]  [<ffffffffa0b8d257>] lov_io_submit+0x2a7/0x420 [lov]
[506626.874099]  [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
[506626.881611]  [<ffffffffa0c67f70>] ll_readpage+0x2d0/0x560 [lustre]
[506626.888609]  [<ffffffff8116af87>] generic_file_aio_read+0x3b7/0x750
[506626.895721]  [<ffffffffa0c98485>] vvp_io_read_start+0x3c5/0x470 [lustre]
[506626.903322]  [<ffffffffa086f965>] cl_io_start+0x65/0x130 [obdclass]
[506626.910418]  [<ffffffffa0872f85>] cl_io_loop+0xa5/0x190 [obdclass]
[506626.917420]  [<ffffffffa0c34e8c>] ll_file_io_generic+0x5fc/0xae0 [lustre]
[506626.925091]  [<ffffffffa0c35db2>] ll_file_aio_read+0x192/0x530 [lustre]
[506626.932575]  [<ffffffffa0c3621b>] ll_file_read+0xcb/0x1e0 [lustre]
[506626.939569]  [<ffffffff811dea1c>] vfs_read+0x9c/0x170
[506626.945300]  [<ffffffff811df56f>] SyS_read+0x7f/0xe0
[506626.950938]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b

The customer being a black site, we can't provide the crashdump, but will happily provide any text output you would find useful.



 Comments   
Comment by Bruno Faccini (Inactive) [ 25/Jul/16 ]

Hello Seb! I have talked with CEA people about this problem today, and it seems that, very likely due to a memory cgroup (kmem) limit, a SLUB allocation failed, which may then have triggered a buggy error-handling path in the Lustre code.

If necessary I can go on-site and analyze the crash-dump to confirm our previous assumptions.

Comment by Bruno Faccini (Inactive) [ 28/Jul/16 ]

I have been on-site to analyze crash-dumps for this problem, and also to discuss with people how/when the problem occurs and to try to understand its root cause.

Finally we discovered that the SLUB allocation failure, which likely then leads to the LBUG in the error-processing code path, is due to the memory cgroup limit also taking kernel memory into account on a per-thread basis. Thus, the obvious work-around has been to avoid kmem accounting via the "cgroup.memory=nokmem" boot parameter.

Concerning the LBUG, I am presently reviewing the different routines/code-paths possibly involved, to understand how/why the expected flag is missing. One interesting discovery I am presently investigating is that the current cl_page is the first of the 256 pages of a 1MB read, and that its underlying VM page, unlike the others, is not locked (!PG_locked).

Comment by Aurelien Degremont (Inactive) [ 28/Jul/16 ]

Actually, cgroup.memory=nokmem is not the correct workaround, as this option was introduced only recently in the Linux kernel and is not supported by RHEL7.
However, it is still possible to disable kernel memory accounting with cgroups using a different method, but it is more difficult to deploy as it requires patching the cgroup code in Slurm.

I will try to post a reproducer which does not require slurm at all.

Comment by Bruno Faccini (Inactive) [ 28/Jul/16 ]

Well, sorry Aurelien, usually I am late about features, but this time I have been too early! I must have been tired, because I thought this was the kind of parameter you had found to disable kmem accounting in cgroups, the one we briefly discussed...
So thanks for the clarification.

Comment by Aurelien Degremont (Inactive) [ 29/Jul/16 ]

No problem Bruno. We had a better understanding of these params after you left.

Here is a shell script to help you reproduce the problem.
System is RHEL 7.2.
It is easier to reproduce the problem when reading the file. The file should not be in the cache (if it is, there won't be any I/O going through the OSC to the OST).

#!/bin/bash
# usage: $0 LUSTREFILE CGROUPDIR
# ie: $0 /path/to/file/in/lustre /sys/fs/cgroup/memory/foo/test
#
# LUSTREFILE should exist and be bigger than 100MB
#
# First, root should create the parent cgroup
#   mkdir /sys/fs/cgroup/memory/foo
#   chown -R foo /sys/fs/cgroup/memory/foo
#

[[ -n "$1" ]] && OUTPUT_FILE="$1"
[[ -n "$2" ]] && CGROUP_DIR="$2"

[[ ! -d $CGROUP_DIR ]] && mkdir -p $CGROUP_DIR

echo 3G > ${CGROUP_DIR}/memory.kmem.limit_in_bytes
echo $(( $(cat ${CGROUP_DIR}/memory.usage_in_bytes) + 5000000 )) > ${CGROUP_DIR}/memory.limit_in_bytes
echo $$ > ${CGROUP_DIR}/tasks

dd if=${OUTPUT_FILE} of=/dev/null bs=1M

Comment by Ben Evans (Inactive) [ 01/Aug/16 ]

Cray has also seen this, most often when running SLURM with multiple jobs per node enabled.

Comment by Aurelien Degremont (Inactive) [ 01/Aug/16 ]

Could you verify that the memory cgroup is enabled in your SLURM environment? See /etc/slurm/cgroup.conf; SLURM tunes these files.

https://github.com/SchedMD/slurm/blob/slurm-16.05/src/plugins/task/cgroup/task_cgroup_memory.c#L305

Comment by Ben Evans (Inactive) [ 01/Aug/16 ]

the cgroup.conf looks like this:

CgroupAutomount=yes
CgroupMountpoint="/dev"
ConstrainCores=yes
ConstrainRAMSpace=yes
TaskAffinity=no

Comment by Aurelien Degremont (Inactive) [ 01/Aug/16 ]

ConstrainRAMSpace=yes is the relevant line. That's the same issue as ours. If you can afford to turn it off, this should work around the crash.
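
As an illustration (an assumption on my side, not a validated Cray configuration), the work-around would be to change that single line in /etc/slurm/cgroup.conf:

# work-around only: stop SLURM from constraining job RAM via the memory cgroup
ConstrainRAMSpace=no

Note that this disables the per-job memory limits SLURM would otherwise enforce on the node.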

Comment by Bruno Faccini (Inactive) [ 05/Aug/16 ]

Well, having a better look at the code involved in the error path, my guess is that the problem may be caused by the following piece:

2621 int osc_queue_sync_pages(const struct lu_env *env, struct osc_object *obj,
2622                          struct list_head *list, int cmd, int brw_flags)
2623 {
2624         struct client_obd     *cli = osc_cli(obj);
2625         struct osc_extent     *ext;
2626         struct osc_async_page *oap;
2627         int     page_count = 0;
2628         int     mppr       = cli->cl_max_pages_per_rpc;
2629         pgoff_t start      = CL_PAGE_EOF;
2630         pgoff_t end        = 0;
2631         ENTRY;
2632 
2633         list_for_each_entry(oap, list, oap_pending_item) {
2634                 pgoff_t index = osc_index(oap2osc(oap));
2635                 if (index > end)
2636                         end = index;
2637                 if (index < start)
2638                         start = index;
2639                 ++page_count;
2640                 mppr <<= (page_count > mppr);
2641         }
2642 
2643         ext = osc_extent_alloc(obj);   <<<<<<<<< Memory allocation failure can occur here
2644         if (ext == NULL) {
2645                 list_for_each_entry(oap, list, oap_pending_item) {
2646                         list_del_init(&oap->oap_pending_item);     <<<<<<<<<<<<
2647                         osc_ap_completion(env, cli, oap, 0, -ENOMEM);
2648                 }
2649                 RETURN(-ENOMEM);
2650         }

The list_del_init() causes the current oap's next/prev oap_pending_item list pointers to point at itself, so list_for_each_entry() selects it again and passes it to a second call of osc_ap_completion() after its processing has already been done during the first call. In particular, its associated cl_page->cp_state has already been set to CPS_CACHED in osc_ap_completion()->osc_completion()->cl_page_completion(), hence the assertion, since the state is no longer CPS_PAGEIN.

I am working on a patch to fix this, to be pushed soon.

Comment by Aurelien Degremont (Inactive) [ 05/Aug/16 ]

I agree with this analysis. cp_state is CPS_PAGEIN in almost all our crashdumps, except for 1 or 2, but this could mean that the page was freed between the 2 iterations, IMO?

list_for_each_entry + list_del_init seems a really bad idea.

Comment by Bruno Faccini (Inactive) [ 05/Aug/16 ]

And the use of list_for_each_entry_safe() + list_del() looks like a better one.
This will be my first patch attempt to fix this; I will then run it against your reproducer to see how it works.
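
For illustration only, here is a minimal sketch of that safe-iteration error path, based on the osc_queue_sync_pages() excerpt above (my reading of the idea, not the LU-6215 patch that actually landed). With the cursor advanced before the loop body runs, deleting the current entry can no longer make the iterator re-select the same oap:

        struct osc_async_page *oap, *tmp;

        ext = osc_extent_alloc(obj);
        if (ext == NULL) {
                /* the "safe" variant keeps a pointer to the next entry, so
                 * removing the current one cannot restart the iteration */
                list_for_each_entry_safe(oap, tmp, list, oap_pending_item) {
                        list_del_init(&oap->oap_pending_item);
                        osc_ap_completion(env, cli, oap, 0, -ENOMEM);
                }
                RETURN(-ENOMEM);
        }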

Comment by Bruno Faccini (Inactive) [ 05/Aug/16 ]

Well, in fact, I just found that this has already been fixed in master by Andreas' fix for LU-6215 (http://review.whamcloud.com/13956, commit eaba382df06cd6f570b16c5f759f612b057e5324). Thus I will simply back-port it to b2_7!

Comment by Gerrit Updater [ 05/Aug/16 ]

Aurelien Degremont (aurelien.degremont@cea.fr) uploaded a new patch: http://review.whamcloud.com/21745
Subject: LU-8435 tests: slab alloc error does not LBUG
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0e8a912a7dfc65d14d1f252efea043637620f518

Comment by Aurelien Degremont (Inactive) [ 05/Aug/16 ]

My bad, I missed your last comment. That's the problem with working too often with 2.7.2 only: you sometimes miss what was added in master.

I've pushed a reproducer for this in sanity. Do whatever you want with it.

Comment by Ben Evans (Inactive) [ 05/Aug/16 ]

I'm running a test on this now. Seems good, no crashes ... yet.

Comment by Bruno Faccini (Inactive) [ 14/Apr/17 ]

Hello Sebastien,
The patch has been back-ported and has landed on the b2_7_fe branch; do you agree to close?

Comment by Sebastien Piechurski [ 20/Apr/17 ]

Hi Bruno,

Thanks for the backport.
You can close.

Regards.

Comment by Peter Jones [ 20/Apr/17 ]

I think that we need the ticket to remain open until the original patch has landed on master.

Comment by James A Simmons [ 22/Apr/17 ]

Peter, the original fix https://review.whamcloud.com/#/c/13956 has already landed on master. I think this is safe to close. Or do you mean https://review.whamcloud.com/#/c/21745 ?

Comment by Peter Jones [ 23/Apr/17 ]

Yes, I meant the testing patch.

Comment by Gerrit Updater [ 31/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21745/
Subject: LU-8435 tests: slab alloc error does not LBUG
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 15dac618aabf2d5611a280bce13ca79c673f4f6d

Comment by Peter Jones [ 31/Aug/17 ]

Landed for 2.11

Comment by Oleg Drokin [ 01/Sep/17 ]

Hmm, I just had a failure in a test introduced by this patch:

[38199.302263] Lustre: DEBUG MARKER: == sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 10:34:27 (1504276467)
[38212.118675] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[38212.120795] IP: [<ffffffff811dbb04>] __memcg_kmem_get_cache+0xe4/0x220
[38212.121489] PGD 310c0a067 PUD 28e92c067 PMD 0 
[38212.122192] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[38212.122849] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) brd ext4 mbcache loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate jbd2 syscopyarea sysfillrect ata_generic sysimgblt pata_acpi ttm drm_kms_helper ata_piix drm i2c_piix4 libata serio_raw virtio_balloon pcspkr virtio_console i2c_core virtio_blk floppy nfsd ip_tables rpcsec_gss_krb5 [last unloaded: libcfs]
[38212.145920] CPU: 2 PID: 31539 Comm: dd Tainted: P        W  OE  ------------   3.10.0-debug #2
[38212.147177] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[38212.147821] task: ffff8802f2bf4800 ti: ffff880294f20000 task.ti: ffff880294f20000
[38212.152755] RIP: 0010:[<ffffffff811dbb04>]  [<ffffffff811dbb04>] __memcg_kmem_get_cache+0xe4/0x220
[38212.153730] RSP: 0018:ffff880294f237f0  EFLAGS: 00010286
[38212.154194] RAX: 0000000000000000 RBX: ffff8803232c5c40 RCX: 0000000000000002
[38212.154672] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
[38212.155168] RBP: ffff880294f23810 R08: 0000000000000000 R09: 0000000000000000
[38212.155647] R10: 0000000000000000 R11: 0000000200000007 R12: ffff8802f2bf4800
[38212.156134] R13: ffff88031f6a6000 R14: ffff8803232c5c40 R15: ffff8803232c5c40
[38212.156898] FS:  00007f1f35a4e740(0000) GS:ffff88033e440000(0000) knlGS:0000000000000000
[38212.159271] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38212.159923] CR2: 0000000000000008 CR3: 00000002f011d000 CR4: 00000000000006e0
[38212.160625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[38212.161320] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[38212.163273] Stack:
[38212.163852]  ffffffff811dba68 0000000000008050 ffff8802c59a5000 ffff8802a991ee00
[38212.165119]  ffff880294f238a0 ffffffff811cca5c ffffffffa0570615 ffffc9000ab51000
[38212.166468]  ffff880200000127 ffffffffa05a5547 ffff88028b683e80 ffff8803232c5c40
[38212.168537] Call Trace:
[38212.169340]  [<ffffffff811dba68>] ? __memcg_kmem_get_cache+0x48/0x220
[38212.170547]  [<ffffffff811cca5c>] kmem_cache_alloc+0x1ec/0x640
[38212.171879]  [<ffffffffa0570615>] ? ldlm_resource_putref+0x75/0x400 [ptlrpc]
[38212.172659]  [<ffffffffa05a5547>] ? ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[38212.174145]  [<ffffffffa07c0f0d>] ? mdc_resource_get_unused+0x14d/0x2a0 [mdc]
[38212.174871]  [<ffffffffa05a5547>] ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[38212.177273]  [<ffffffffa05a5655>] ptlrpc_request_alloc_internal+0x25/0x480 [ptlrpc]
[38212.178618]  [<ffffffffa05a5ac3>] ptlrpc_request_alloc+0x13/0x20 [ptlrpc]
[38212.179440]  [<ffffffffa07c6a60>] mdc_enqueue_base+0x6c0/0x18a0 [mdc]
[38212.180168]  [<ffffffffa07c845b>] mdc_intent_lock+0x26b/0x520 [mdc]
[38212.180869]  [<ffffffffa161dad0>] ? ll_invalidate_negative_children+0x1e0/0x1e0 [lustre]
[38212.182291]  [<ffffffffa0584ab0>] ? ldlm_expired_completion_wait+0x240/0x240 [ptlrpc]
[38212.183569]  [<ffffffffa079723d>] lmv_intent_lock+0xc0d/0x1b50 [lmv]
[38212.184289]  [<ffffffff810ac3c1>] ? in_group_p+0x31/0x40
[38212.184941]  [<ffffffffa161e5c5>] ? ll_i2suppgid+0x15/0x40 [lustre]
[38212.185667]  [<ffffffffa161e614>] ? ll_i2gids+0x24/0xb0 [lustre]
[38212.186372]  [<ffffffff811073d2>] ? from_kgid+0x12/0x20
[38212.187062]  [<ffffffffa1609275>] ? ll_prep_md_op_data+0x235/0x520 [lustre]
[38212.187754]  [<ffffffffa161dad0>] ? ll_invalidate_negative_children+0x1e0/0x1e0 [lustre]
[38212.190244]  [<ffffffffa161fd34>] ll_lookup_it+0x2a4/0xef0 [lustre]
[38212.190918]  [<ffffffffa1620ab7>] ll_atomic_open+0x137/0x12d0 [lustre]
[38212.191636]  [<ffffffff817063d7>] ? _raw_spin_unlock+0x27/0x40
[38212.192425]  [<ffffffff811f82fb>] ? lookup_dcache+0x8b/0xb0
[38212.193270]  [<ffffffff811fd551>] do_last+0xa21/0x12b0
[38212.194603]  [<ffffffff811fdea2>] path_openat+0xc2/0x4a0
[38212.195481]  [<ffffffff811ff69b>] do_filp_open+0x4b/0xb0
[38212.196351]  [<ffffffff817063d7>] ? _raw_spin_unlock+0x27/0x40
[38212.197169]  [<ffffffff8120d137>] ? __alloc_fd+0xa7/0x130
[38212.197815]  [<ffffffff811ec553>] do_sys_open+0xf3/0x1f0
[38212.198506]  [<ffffffff811ec66e>] SyS_open+0x1e/0x20
[38212.199225]  [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b
[38212.199896] Code: 01 00 00 41 f6 85 10 03 00 00 03 0f 84 f6 00 00 00 4d 85 ed 48 c7 c2 ff ff ff ff 74 07 49 63 95 98 06 00 00 48 8b 83 e0 00 00 00 <4c> 8b 64 d0 08 4d 85 e4 0f 85 d1 00 00 00 41 f6 45 10 01 0f 84 
[38212.202617] RIP  [<ffffffff811dbb04>] __memcg_kmem_get_cache+0xe4/0x220
[38212.203345]  RSP <ffff880294f237f0>

I have a crashdump if anybody is interested.

Comment by Bruno Faccini (Inactive) [ 02/Sep/17 ]

Oleg,
my guess is that this new sub-test, sanity/test_411, introduced by change #21745, sets a highly constraining kernel memory limit that is very likely to trigger some memcg/slab bug.
But I am OK to have a look at the crash dump to try to confirm.

Comment by Oleg Drokin [ 04/Sep/17 ]

Ok, thanks.
I had 4 more failures in the past 24 hours, btw.

The crashdumps are on onyx-68 in /export/crashdumps.
they are:
192.168.123.199-2017-09-01-10:34:*
192.168.123.111-2017-09-02-15:06:*
192.168.123.195-2017-09-03-13:*
192.168.123.151-2017-09-03-14:06:*
192.168.123.135-2017-09-03-14:11:*

The build tree is currently in /export/centos7-nfsroot/home/green/git/lustre-release with all the modules (I'll update it on Tuesday, but it should be good for the next 30 or so hours).

Comment by Aurelien Degremont (Inactive) [ 04/Sep/17 ]

Bruno, this was exactly the purpose of this test. It seems it discovered other memory-management issues in the client code. I/O is not really expected to succeed under such constraints; it should simply return EIO or ENOMEM, not crash.

Comment by Patrick Farrell (Inactive) [ 04/Sep/17 ]

On that note, Aurelien, I think we should add a write component to the test after the memory limit is set... Or perhaps a separate test. But either way - write under pressure would be good to have as well.

Comment by Bruno Faccini (Inactive) [ 05/Sep/17 ]

Aurelien, I know that you posted this new test to ensure that memcg limits do not cause crashes in the Lustre code, but given this new kind of crash in the kernel memcg layer, it seems you should also propose it for the kernel regression test suite!

My first crash-dump analysis points to a possible race between the lazy registration of the memcg into existing kmem_caches and concurrent slab allocations, which triggers the unexpected situation, in __memcg_kmem_get_cache(), where memcg_params has still not been initialized in the "ptlrpc_cache" kmem_cache.

It is also interesting to note that recent auto-test results for sanity/test_411 are all successes, while all these crashes have occurred during single-node sessions, and that the only kmem_caches in the system that do not have memcg_params initialized are those created by the Lustre code.

More to come.
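
For context, here is a simplified sketch (my own illustration, not the RHEL7 source verbatim; the function name is hypothetical) of why a NULL memcg_params faults inside __memcg_kmem_get_cache(): the per-memcg child-cache lookup dereferences cachep->memcg_params before indexing its memcg_caches array, so a kmem_cache left with memcg_params == NULL crashes on the first kmem-accounted allocation that reaches it.

/* Illustrative only -- approximates the 3.10-era per-memcg cache lookup.
 * If "ptlrpc_cache" was created while no kmem-limited cgroup was active,
 * its memcg_params stays NULL and this is the kind of dereference that
 * faults once a kmem-limited cgroup later drives an accounted allocation. */
static struct kmem_cache *
memcg_cache_lookup_sketch(struct kmem_cache *cachep, int memcg_idx)
{
        return cachep->memcg_params->memcg_caches[memcg_idx];
}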

Comment by Oleg Drokin [ 05/Sep/17 ]

Note that I am running an unsupported Red Hat kernel config with a lot of debugging, and I also enable SLAB, whereas Red Hat defaults to SLUB.

This does not make the bugs any less real, of course, but at times it takes some effort to convince the RH people that it's OK to have an unsupported config, because it helps to detect the bugs earlier rather than as random crashes elsewhere later.

Comment by Bruno Faccini (Inactive) [ 07/Sep/17 ]

Well, I have been quite puzzled by these crash-dumps, because according to the kernel source code, the only case where memcg_params can be NULL is when a kmem_cache is created while !memcg_kmem_enabled() (tested in memcg_register_cache(), i.e. when no memory cgroup with a kmem limit is active yet); but then all kmem_caches should have memcg_params == NULL, not only the Lustre ones! I have tested this on a fresh system after reboot with no memory cgroup configured: all kmem_caches, even the Lustre ones after a delayed Lustre start and modules load, have memcg_params == NULL.

Finally, I found a possible scenario where this can happen, and it is confirmed by the kernel logs in all the crash-dumps: the sanity/test_411 test (likely the only one to create and use a memory cgroup) has to be run at least twice, with a Lustre modules unload in between (causing the Lustre kmem_caches to be destroyed and re-created).
Then, when trying to reproduce this on a test system, I was not able to, since !memcg_kmem_enabled() could never be satisfied because of the "child/memcg" kmem_caches (created during the 1st sub-test run and memory cgroup creation) still being present, even though fully empty/freed. But this could be where the SLAB vs SLUB config makes the difference, because the SLAB mechanism includes a regular kmem_cache shrink/drain cleanup that could cause these leaked kmem_caches to be garbage-collected, whereas this never (or maybe much later) happens when running with SLUB.

In any case, the fact that memcg_params can be NULL for a recently created kmem_cache when no kmem memory cgroup is currently active (but some had already existed before!) needs to be handled by the current code. I will check whether this is already the case in recent kernels, and push a fix if not.

Oleg, this looks like a new argument to confirm the need to test Kernels with "unsupported" configs!

Comment by Bruno Faccini (Inactive) [ 08/Sep/17 ]

This problem should already have been addressed by the patch at https://patchwork.kernel.org/patch/2850152/.
But since the RHEL7 kernel version we experienced these crashes with already includes it, after digging into the related kernel source code, I think the following one is required to fully fix this issue:

[root@onyx-68 linux-3.10.0-327.22.2.el7-debug]# diff -urN mm/memcontrol.c mm/memcontrol.c.bfi
--- mm/memcontrol.c     2016-06-09 06:31:12.000000000 -0700
+++ mm/memcontrol.c.bfi 2017-09-08 07:37:18.647281366 -0700
@@ -3163,7 +3163,16 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 
        VM_BUG_ON(s->memcg_params && !s->memcg_params->is_root_cache);
 
-       if (num_groups > memcg_limited_groups_array_size) {
+       /* resize/grow existing memcg_params or allocate it if had not already
+        * been done during kmem_cache creation because none of previously used
+        * kmem memcg were present at this time (i.e.
+        * memcg_limited_groups_array_size != 0 but memcg_kmem_enabled()
+        * returned false). It could not have been necessary if
+        * memcg_caches_array_size() was not used to anticipate more than slots
+        * required and if memcg_limited_groups_array_size would simply
+        * increment upon each new kmem memcg creation.
+        */
+       if (num_groups > memcg_limited_groups_array_size || !s->memcg_params) {
                int i;
                ssize_t size = memcg_caches_array_size(num_groups);
 
@@ -3203,7 +3212,8 @@
                 * bigger than the others. And all updates will reset this
                 * anyway.
                 */
-               kfree(cur_params);
+               if (cur_params)
+                       kfree(cur_params);
        }
        return 0;
 }
[root@onyx-68 linux-3.10.0-327.22.2.el7-debug]# 

Oleg, you may want to try it? On my side, I will check whether this problem is still present in recent kernels, and report it upstream if it has not been fixed.

Comment by Bruno Faccini (Inactive) [ 08/Sep/17 ]

After browsing the kernel sources on kernel.org, I think this issue is fixed in 4.x and in more recent 3.x kernels, mainly by the changes introduced in the patch/commit "memcg: don't call memcg_update_all_caches if new cache id fits", available at
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/patch/mm/memcontrol.c?id=f3bb3043a092368a255bca5d1c6f4352c96a3b2d, where ALL kmem_caches are systematically parsed to resize/initialize a new memcg_params when a kmem memory cgroup is started.

And again, I need to insist that this bug is very unlikely to be triggered if we keep running with SLUB configured instead of switching back to SLAB.

Comment by Bruno Faccini (Inactive) [ 12/Sep/17 ]

Hmm, in fact I think the kernel patch ("memcg: don't call memcg_update_all_caches if new cache id fits") that I indicated in my previous comment is not likely to fix this after all.

After browsing more recent kernel source code again, the problem appears to be fixed in 4.x kernels because memcg_params is no longer a pointer but an embedded struct memcg_cache_params inside the kmem_cache. But it looks like the problem is still present in 3.x.

BTW, I have found a possible alternate way to fix it, with the following patch:

[root@onyx-68 linux-3.10.0-327.22.2.el7-debug]# diff -urN mm/memcontrol.c mm/memcontrol.c.bfi+
--- mm/memcontrol.c     2016-06-09 06:31:12.000000000 -0700
+++ mm/memcontrol.c.bfi+        2017-09-12 09:24:53.235452071 -0700
@@ -3213,7 +3213,7 @@ int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
 {
        size_t size = sizeof(struct memcg_cache_params);
 
-       if (!memcg_kmem_enabled())
+       if (!memcg_kmem_enabled() && memcg_limited_groups_array_size == 0)
                return 0;
 
        if (!memcg)
[root@onyx-68 linux-3.10.0-327.22.2.el7-debug]# 

where the memcg_register_cache() routine has been replaced by memcg_alloc_cache_params() in some 3.x versions, but the same fix applies.

I will try to propose either patch to the Kernel mailing list.

Comment by Bruno Faccini (Inactive) [ 19/Sep/17 ]

The problem description and the two patch proposals have been posted to cgroups@vger.kernel.org.

Comment by Bruno Faccini (Inactive) [ 29/Sep/17 ]

Hmm, still not a lot of answers to my report at http://www.spinics.net/lists/cgroups/msg18187.html ....

So, I have submitted bug report #197075 to bugzilla.kernel.org: https://bugzilla.kernel.org/show_bug.cgi?id=197075.

Comment by Bruno Faccini (Inactive) [ 05/Oct/17 ]

And here is the answer I got for BZ #197075:

Kernel memory accounting is totally unusable in 3.10, because it lacks
dcache reclaim and there are a lot of implementation bugs. The one you
caught is just one of many that have been fixed since 3.10. That's why
it stayed disabled by default until 4.x.

So IMHO the best you can do if you really want to use kernel memory
accounting is upgrade to 4.x or backport all related patches.

Comment by Andreas Dilger [ 11/Nov/17 ]

I think it makes sense to disable cgroup testing if the client is running a kernel version < 4.4, until we know there is a fix in RHEL and which version it is in.
