Lustre / LU-9966

sanity test_411: fail to trigger a memory allocation error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.12.0
    • Fix Version: Lustre 2.12.0
    • Components: None
    • Severity: 3

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/32b0aa4c-9502-11e7-ba84-5254006e85c2.

      The sub-test test_411 failed with the following error:

      fail to trigger a memory allocation error
      

      test_411 is very new; it has been failing since 9/1.
      Some (all?) of the FAIL instances have been seen on sles12sp2/sles12sp3.

      more:
      https://testing.hpdd.intel.com/test_sets/8c6725ca-8f6c-11e7-b5c2-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/0a6acf7c-8f8f-11e7-b67f-5254006e85c2

      Info required for matching: sanity 411


          Activity

            yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/8e14dad8-c18f-11e8-a9d9-52540065bddc
            ys Yang Sheng added a comment -

            Yes, since the allocation really depends on the situation, we should avoid verifying whether dd succeeds or not. Anyway, I'll update the patch that way.

            jhammond John Hammond added a comment -

            Logs show that some allocations are failing but dd is succeeding. Perhaps we should weaken the test to just check that we don't crash.

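            For illustration, a minimal sketch of what such a weakened check could look like, in sanity.sh style; the cgroup path, the 1M limit, and the dmesg grep are assumptions rather than the actual patch under review:

              # Hypothetical weakened test_411: run dd under the kmem-limited cgroup,
              # ignore its exit status, and fail only if the client hit an LBUG.
              test_411_weak() {
                  local cg=/sys/fs/cgroup/memory/osc_slab_alloc

                  mkdir -p $cg || skip "kmem cgroup not available"
                  echo 1M > $cg/memory.kmem.limit_in_bytes ||
                      skip "cannot set kmem limit"

                  echo $$ > $cg/tasks
                  # Allocation failures may or may not make dd fail; either is fine.
                  dd if=/dev/zero of=$DIR/$tfile bs=1M count=100 || true
                  echo $$ > /sys/fs/cgroup/memory/tasks

                  # A real test would only scan messages emitted since the test began.
                  dmesg | grep -q LBUG &&
                      error "client LBUGed under kmem cgroup pressure"

                  rm -f $DIR/$tfile
                  rmdir $cg
              }

            The point is simply that SLUB allocation warnings are acceptable as long as the node stays up.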

            simmonsja James A Simmons added a comment -

            We see a different but related bug with Ubuntu18.
            hornc Chris Horn added a comment - edited

            Just a note: This issue is also seen with Lustre 2.11 on SLES15 RC4


            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/32293
            Subject: LU-9966 tests: sanity-411 check LBUG direct
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: eb5551388e81b16704c6655f6a2d0c469b9d5262


            bfaccini Bruno Faccini (Inactive) added a comment -

            > I think test_411 just intends to verify that we don't hit an LBUG, so we can avoid checking whether dd succeeds or not.
            > Is that necessary?

            YangSheng, yes, we may do that, but we could also try to find a way to set up conditions that reliably cause dd to fail.
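
            One speculative way to set such conditions, with arbitrary values and no guarantee it works on every kernel, would be to tighten both cgroup limits and use direct I/O so a failed allocation is more likely to surface as a dd error:

              # Speculative tightening (values are arbitrary): smaller limits plus
              # O_DIRECT writes give the allocation failure fewer places to hide.
              cg=/sys/fs/cgroup/memory/osc_slab_alloc
              echo 1M  > $cg/memory.kmem.limit_in_bytes
              echo 16M > $cg/memory.limit_in_bytes
              sync; echo 3 > /proc/sys/vm/drop_caches   # start from a cold cache
              echo $$ > $cg/tasks
              dd if=/dev/zero of=$DIR/$tfile bs=1M count=100 oflag=direct
              rc=$?
              echo $$ > /sys/fs/cgroup/memory/tasks
              (( rc != 0 )) || echo "dd still succeeded despite the tighter limits"
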
            bogl Bob Glossman (Inactive) added a comment - another on master: https://testing.hpdd.intel.com/test_sets/a0416dd0-2f8a-11e8-b6a0-52540065bddc
            ys Yang Sheng added a comment -

            Hi, Bruno,

            I think test_411 just intends to verify that we don't hit an LBUG, so we can avoid checking whether dd succeeds or not. Is that necessary?

            Thanks,
            YangSheng

            bfaccini Bruno Faccini (Inactive) added a comment - +1 on master review for LU-10680 at https://testing.hpdd.intel.com/test_sets/9b8d6f46-150e-11e8-a10a-52540065bddc

            bfaccini Bruno Faccini (Inactive) added a comment -

            Looks like some allocation errors did occur anyway during these failed test sessions:

            [ 6308.394164] Lustre: DEBUG MARKER: == sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 22:34:17 (1517438057)
            [ 6311.637464] SLUB: Unable to allocate memory on node -1 (gfp=0x8050)
            [ 6311.638238]   cache: kmalloc-512(0:osc_slab_alloc), object size: 512, buffer size: 512, default order: 1, min order: 0
            [ 6311.638238]   node 0: slabs: 13, objs: 208, free: 0
            [ 6311.670203] SLUB: Unable to allocate memory on node -1 (gfp=0x0)
            [ 6311.670957]   cache: kmalloc-192(0:osc_slab_alloc), object size: 192, buffer size: 192, default order: 0, min order: 0
            [ 6311.670957]   node 0: slabs: 1, objs: 21, free: 0
            [ 6360.020975] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error 
            [ 6360.203970] Lustre: DEBUG MARKER: sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error
            

            But they did not cause the "dd" command to fail, as sanity/test_411 expects:

            == sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 22:34:17 (1517438057)
            100+0 records in
            100+0 records out
            104857600 bytes (105 MB) copied, 3.13092 s, 33.5 MB/s
            204800+0 records in
            204800+0 records out
            104857600 bytes (105 MB) copied, 48.1542 s, 2.2 MB/s
             sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:5718:error()
              = /usr/lib64/lustre/tests/sanity.sh:17667:test_411()
              = /usr/lib64/lustre/tests/test-framework.sh:5994:run_one()
              = /usr/lib64/lustre/tests/test-framework.sh:6033:run_one_logged()
              = /usr/lib64/lustre/tests/test-framework.sh:5880:run_test()
              = /usr/lib64/lustre/tests/sanity.sh:17673:main()
            Dumping lctl log to /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.*.1517438110.log
            CMD: trevis-10vm10,trevis-10vm11,trevis-10vm12,trevis-10vm9.trevis.hpdd.intel.com /usr/sbin/lctl dk > /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.debug_log.\$(hostname -s).1517438110.log;
                     dmesg > /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.dmesg.\$(hostname -s).1517438110.log
            Resetting fail_loc on all nodes...CMD: trevis-10vm10,trevis-10vm11,trevis-10vm12,trevis-10vm9.trevis.hpdd.intel.com lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
            done.
            

            Since, per my LU-8435 analysis, the kmem/memory cgroup feature is known to be buggy with 3.x kernels (even though CONFIG_MEMCG_KMEM is configured by default in the 3.x kernels shipped in CentOS/RHEL distros) and is only safe to use starting with 4.x kernels, why don't we simply skip sanity/test_411 for now, or at least add another skip check that requires a 4.x kernel?

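            If that suggestion were adopted, a simple gate at the top of test_411() could be enough; the skip wording below is an assumption and nothing beyond the standard skip() helper is relied on:

              test_411() {
                  # Hypothetical skip: kmem cgroup accounting is only considered
                  # reliable on 4.x+ kernels (see the LU-8435 analysis above).
                  local kmajor=$(uname -r | cut -d. -f1)
                  (( kmajor >= 4 )) ||
                      { skip "kmem cgroup unreliable on kernel $(uname -r)"; return 0; }
                  # ... existing cgroup/dd logic follows ...
              }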

            People

              Assignee: Yang Sheng
              Reporter: Maloo