[LU-9966] sanity test_411: fail to trigger a memory allocation error Created: 09/Sep/17  Updated: 06/Oct/18  Resolved: 06/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Yang Sheng
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10366 sanity test 410 fails with 'no inode ... Open
is related to LU-8435 LBUG (osc_cache.c:1290:osc_completion... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/32b0aa4c-9502-11e7-ba84-5254006e85c2.

The sub-test test_411 failed with the following error:

fail to trigger a memory allocation error

test_411 is very new and has been failing since 9/1.
Some (all?) of the failures have been seen on sles12sp2/sles12sp3.

More instances:
https://testing.hpdd.intel.com/test_sets/8c6725ca-8f6c-11e7-b5c2-5254006e85c2
https://testing.hpdd.intel.com/test_sets/0a6acf7c-8f8f-11e7-b67f-5254006e85c2

Info required for matching: sanity 411



 Comments   
Comment by Bob Glossman (Inactive) [ 09/Sep/17 ]

This may fail 100% of the time on any sles12. It may never have worked except on RHEL 7.
Another instance:
https://testing.hpdd.intel.com/test_sets/85081fe4-9536-11e7-b75f-5254006e85c2

Comment by Andreas Dilger [ 09/Sep/17 ]

Bob, it would be helpful if you linked this (and other regressions) to the Jira ticket and patch that added this new test.

Comment by Bob Glossman (Inactive) [ 09/Sep/17 ]

The patch that added test 411 was https://review.whamcloud.com/21745, "LU-8435 tests: slab alloc error does not LBUG"

Comment by Bob Glossman (Inactive) [ 10/Sep/17 ]

In sles12 there is no /sys/fs/cgroup/memory/memory.kmem.limit_in_bytes.
Since test_411 relies on this file, it's no surprise that the test doesn't work.
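
The presence of the control file can be checked before the test tries to write to it. The following is a minimal sketch, assuming cgroup v1 is mounted at /sys/fs/cgroup/memory; the helper name kmem_limit_available is hypothetical and is not taken from sanity.sh.

```shell
#!/bin/bash
# Hypothetical helper (not part of sanity.sh): succeed only if the kmem
# limit control file exists under the given cgroup v1 memory directory.
kmem_limit_available() {
    local cgdir="${1:-/sys/fs/cgroup/memory}"
    [ -e "$cgdir/memory.kmem.limit_in_bytes" ]
}

# Report whether a kmem-limit based test could run on this node.
if kmem_limit_available; then
    echo "memory.kmem.limit_in_bytes is present"
else
    echo "memory.kmem.limit_in_bytes is missing, a kmem-limit test should be skipped"
fi
```

Passing the directory as an argument keeps the check usable for a child cgroup such as osc_slab_alloc as well as for the root of the hierarchy.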

Comment by Peter Jones [ 10/Sep/17 ]

Is there an equivalent function that could be used instead or should we just skip the test for sles12 (and presumably any other newer kernels)?

Comment by Bob Glossman (Inactive) [ 10/Sep/17 ]

> Is there an equivalent function that could be used instead or should we just skip the test for sles12 (and presumably any other newer kernels)?

The author of the test needs to answer that question.
As far as I can tell, there is nothing equivalent in sles12.

If the solution is to skip the test when the needed /sys entry isn't there, I can push a patch for that. There is already some skip logic there; I would just need to extend it a bit.

Comment by Yang Sheng [ 13/Sep/17 ]

https://testing.hpdd.intel.com/test_sets/57f429a2-97f8-11e7-b9c6-5254006e85c2

It looks like this test failed because of a permission issue.

== sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 01:03:59 (1505203439)
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 1.78497 s, 58.7 MB/s
/usr/lib64/lustre/tests/sanity.sh: line 16400: /sys/fs/cgroup/memory/osc_slab_alloc/memory.kmem.limit_in_bytes: Permission denied
204800+0 records in
204800+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 23.5257 s, 4.5 MB/s
 sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error 
  Trace dump:

The 'osc_slab_alloc/memory.kmem.limit_in_bytes' value cannot be changed, so the trigger action fails. I'll try to find the cause.

Thanks,
YangSheng

Comment by Bob Glossman (Inactive) [ 13/Sep/17 ]

It reports "permission denied", but I'm pretty sure that is because the entry does not exist. An easy fix is to check for the entry and skip if it doesn't exist, but I'm not sure that's the right approach.

I can push a mod that does that for inspection.

Comment by Gerrit Updater [ 13/Sep/17 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/28974
Subject: LU-9966 test: add a skip test to test_411
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6c5dcd0caca8f08eb9874b533629f09af3cc1b7f

Comment by Yang Sheng [ 13/Sep/17 ]

'CONFIG_MEMCG_KMEM' is disabled by default in sles12, so kmem.limit_in_bytes is absent. Skipping is therefore the right solution.
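
Whether the option is enabled can be read from the kernel build configuration. This is a minimal sketch, assuming the distribution installs the config as /boot/config-$(uname -r) (some kernels expose it elsewhere, or not at all); the helper name kconfig_enabled is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if option $1 is set to "y" in the kernel
# config file $2 (assumed default location: /boot/config-<release>).
kconfig_enabled() {
    local opt="$1"
    local cfg="${2:-/boot/config-$(uname -r)}"
    [ -r "$cfg" ] && grep -q "^${opt}=y" "$cfg"
}

if kconfig_enabled CONFIG_MEMCG_KMEM; then
    echo "CONFIG_MEMCG_KMEM is enabled"
else
    echo "CONFIG_MEMCG_KMEM is not enabled, or the config file is not readable"
fi
```

Checking for the control file itself is the more direct test; the config lookup is mainly useful for explaining why the file is missing.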

Comment by Bob Glossman (Inactive) [ 13/Sep/17 ]

I see that 'CONFIG_MEMCG_KMEM' is enabled in rhel7 by default. That fully explains why test 411 works on rhel7 and doesn't work on sles12.

Comment by Gerrit Updater [ 18/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28974/
Subject: LU-9966 test: add a skip test to test_411
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f6b0e358f304b006dd24524503bb16d649c5499d

Comment by Peter Jones [ 18/Sep/17 ]

Landed for 2.11

Comment by Joseph Gmitter (Inactive) [ 21/Nov/17 ]

Seeing this failure on the flr branch:
https://testing.hpdd.intel.com/test_sets/93f8d4c2-ce65-11e7-9c63-52540065bddc

Comment by Jian Yu [ 01/Feb/18 ]

Two failure instances occurred on master branch yesterday:
https://testing.hpdd.intel.com/test_sets/76978322-06b6-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/455d1326-06d9-11e8-bd00-52540065bddc

Comment by Bruno Faccini (Inactive) [ 08/Feb/18 ]

It looks like some allocation errors did occur anyway during these failed test sessions:

[ 6308.394164] Lustre: DEBUG MARKER: == sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 22:34:17 (1517438057)
[ 6311.637464] SLUB: Unable to allocate memory on node -1 (gfp=0x8050)
[ 6311.638238]   cache: kmalloc-512(0:osc_slab_alloc), object size: 512, buffer size: 512, default order: 1, min order: 0
[ 6311.638238]   node 0: slabs: 13, objs: 208, free: 0
[ 6311.670203] SLUB: Unable to allocate memory on node -1 (gfp=0x0)
[ 6311.670957]   cache: kmalloc-192(0:osc_slab_alloc), object size: 192, buffer size: 192, default order: 0, min order: 0
[ 6311.670957]   node 0: slabs: 1, objs: 21, free: 0
[ 6360.020975] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error 
[ 6360.203970] Lustre: DEBUG MARKER: sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error

But they did not cause the "dd" command to fail, as sanity/test_411 expects:

== sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 22:34:17 (1517438057)
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 3.13092 s, 33.5 MB/s
204800+0 records in
204800+0 records out
104857600 bytes (105 MB) copied, 48.1542 s, 2.2 MB/s
 sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5718:error()
  = /usr/lib64/lustre/tests/sanity.sh:17667:test_411()
  = /usr/lib64/lustre/tests/test-framework.sh:5994:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6033:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5880:run_test()
  = /usr/lib64/lustre/tests/sanity.sh:17673:main()
Dumping lctl log to /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.*.1517438110.log
CMD: trevis-10vm10,trevis-10vm11,trevis-10vm12,trevis-10vm9.trevis.hpdd.intel.com /usr/sbin/lctl dk > /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.debug_log.\$(hostname -s).1517438110.log;
         dmesg > /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.dmesg.\$(hostname -s).1517438110.log
Resetting fail_loc on all nodes...CMD: trevis-10vm10,trevis-10vm11,trevis-10vm12,trevis-10vm9.trevis.hpdd.intel.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
done.
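
The mismatch described above (allocation failures in the kernel log while dd still succeeds) can be confirmed by counting the SLUB messages in a saved dmesg file. This is a minimal sketch; the helper name count_slub_failures is hypothetical, and the message text it matches is taken from the log excerpt above.

```shell
#!/bin/bash
# Hypothetical helper: print the number of SLUB allocation-failure lines
# in the saved kernel log file $1.
count_slub_failures() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -c "SLUB: Unable to allocate memory" "$1" || true
}

# Example use (assumes a dmesg dump was saved beforehand):
#   count_slub_failures sanity.test_411.dmesg.log
```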

According to my LU-8435 analysis, the kmem/memory cgroup feature is known to be buggy with 3.x kernels (even though CONFIG_MEMCG_KMEM is configured by default in the 3.x kernels shipped in CentOS/RH distros) and is only safe to use starting with 4.x kernels. Why don't we simply skip sanity/test_411 for now, or at least add another skip check for kernel 4.x?
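
A kernel-version skip check of the kind suggested here could look like the following. This is a minimal sketch, assuming the major version is the part of `uname -r` before the first dot; the helper names kernel_major and kernel_at_least_4 are hypothetical and are not the test framework's own functions.

```shell
#!/bin/bash
# Hypothetical helper: print the major version of a kernel release string
# (defaults to the running kernel).
kernel_major() {
    local rel="${1:-$(uname -r)}"
    echo "${rel%%.*}"
}

# Hypothetical helper: succeed if the kernel major version is 4 or newer.
kernel_at_least_4() {
    [ "$(kernel_major "$1")" -ge 4 ]
}

if kernel_at_least_4; then
    echo "kernel $(uname -r): kmem cgroup test may run"
else
    echo "kernel $(uname -r): kmem cgroup test should be skipped"
fi
```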

Comment by Bruno Faccini (Inactive) [ 19/Feb/18 ]

+1 on master review for LU-10680 at https://testing.hpdd.intel.com/test_sets/9b8d6f46-150e-11e8-a10a-52540065bddc

Comment by Yang Sheng [ 23/Feb/18 ]

Hi, Bruno,

I think test_411 is only intended to verify that we do not hit an LBUG, so we can avoid checking whether dd succeeds. Is that check necessary?

Thanks,
YangSheng

Comment by Bob Glossman (Inactive) [ 24/Mar/18 ]

Another instance on master:
https://testing.hpdd.intel.com/test_sets/a0416dd0-2f8a-11e8-b6a0-52540065bddc

Comment by Bruno Faccini (Inactive) [ 30/Mar/18 ]

> I think test_411 is only intended to verify that we do not hit an LBUG, so we can avoid checking whether dd succeeds.
> Is that check necessary?

YangSheng, yes, we could do that, but we could also try to find a way to set up the conditions that cause dd to fail.

Comment by Gerrit Updater [ 04/May/18 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/32293
Subject: LU-9966 tests: sanity-411 check LBUG direct
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eb5551388e81b16704c6655f6a2d0c469b9d5262

Comment by Chris Horn [ 29/May/18 ]

Just a note: This issue is also seen with Lustre 2.11 on SLES15 RC4

Comment by James A Simmons [ 20/Aug/18 ]

We see a different but related bug with Ubuntu18.

Comment by John Hammond [ 05/Sep/18 ]

Logs show that some allocations are failing but dd is succeeding. Perhaps we should weaken the test to just check that we don't crash.

Comment by Yang Sheng [ 05/Sep/18 ]

Yes, whether the allocation fails really depends on the situation, so we should avoid verifying whether dd succeeds. I'll update the patch accordingly.
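
The weakened check discussed here (ignore the dd result and only verify that no LBUG occurred) could be sketched as below. This is an illustration under stated assumptions, not the content of the actual patch: the helper name log_shows_lbug is hypothetical, and it assumes an LBUG would appear as the literal string "LBUG" in a saved kernel log. Lines containing "DEBUG MARKER" are excluded because the test title itself contains "does not LBUG".

```shell
#!/bin/bash
# Hypothetical helper: succeed if the saved kernel log file $1 contains
# "LBUG" on any line other than a test-framework DEBUG MARKER line.
log_shows_lbug() {
    grep -v "DEBUG MARKER" "$1" | grep -q "LBUG"
}

# Example use (assumes dmesg was saved after the dd workload, whose exit
# status is deliberately ignored):
#   dmesg > /tmp/test_411.dmesg
#   if log_shows_lbug /tmp/test_411.dmesg; then echo "LBUG detected"; fi
```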

Comment by Jian Yu [ 26/Sep/18 ]

+1 on master branch:
https://testing.whamcloud.com/test_sets/8e14dad8-c18f-11e8-a9d9-52540065bddc

Comment by Gerrit Updater [ 05/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32293/
Subject: LU-9966 tests: sanity-411 check LBUG direct
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 18637db7ca6fa92ec6ea494a353a5ec46700a30e

Comment by Peter Jones [ 06/Oct/18 ]

Landed for 2.12
