
sanity test_411: fail to trigger a memory allocation error

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0
    • Lustre 2.12.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/32b0aa4c-9502-11e7-ba84-5254006e85c2.

      The sub-test test_411 failed with the following error:

      fail to trigger a memory allocation error
      

      test_411 is very new and has been failing since 9/1.
      Some (all?) of the failures have been seen on sles12sp2/sles12sp3.

      More instances:
      https://testing.hpdd.intel.com/test_sets/8c6725ca-8f6c-11e7-b5c2-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/0a6acf7c-8f8f-11e7-b67f-5254006e85c2

      Info required for matching: sanity 411

          Activity

            [LU-9966] sanity test_411: fail to trigger a memory allocation error
            hornc Chris Horn added a comment - - edited

            Just a note: This issue is also seen with Lustre 2.11 on SLES15 RC4


            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/32293
            Subject: LU-9966 tests: sanity-411 check LBUG direct
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: eb5551388e81b16704c6655f6a2d0c469b9d5262

            bfaccini Bruno Faccini (Inactive) added a comment -

            > I think test_411 is only intended to verify that we do not hit an LBUG, so we can avoid checking whether dd succeeds or not.
            > Is that check necessary?

            YangSheng, yes, we could do that, but we could also try to find a way to set up the conditions that cause dd to fail.
            bogl Bob Glossman (Inactive) added a comment - another on master: https://testing.hpdd.intel.com/test_sets/a0416dd0-2f8a-11e8-b6a0-52540065bddc
            ys Yang Sheng added a comment -

            Hi, Bruno,

            I think test_411 is only intended to verify that we do not hit an LBUG, so we can avoid checking whether dd succeeds or not. Is that check necessary?

            Thanks,
            YangSheng

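The idea of checking for an LBUG directly, instead of relying on the exit status of dd, could be sketched roughly as below. This is only an illustration: `no_lbug_in_log` is a hypothetical helper name, not something taken from sanity.sh or test-framework.sh, and it assumes that an LBUG shows up in the console log as a line containing the string "LBUG".

```shell
# Hypothetical helper (not part of test-framework.sh).
# Takes console/dmesg text as its first argument.
# Returns 0 when the text contains no LBUG marker, 1 otherwise.
# Assumption: an LBUG is reported on a line containing "LBUG".
no_lbug_in_log() {
	local log=$1

	if printf '%s\n' "$log" | grep -q 'LBUG'; then
		return 1
	fi
	return 0
}
```

With such a helper, the test could pass or fail on something like `no_lbug_in_log "$(dmesg)"` regardless of whether dd itself succeeded, which is what the comment above proposes.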
            bfaccini Bruno Faccini (Inactive) added a comment - +1 on master review for LU-10680 at https://testing.hpdd.intel.com/test_sets/9b8d6f46-150e-11e8-a10a-52540065bddc

            bfaccini Bruno Faccini (Inactive) added a comment -

            It looks like some allocation errors did occur anyway during these failed test sessions:

            [ 6308.394164] Lustre: DEBUG MARKER: == sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 22:34:17 (1517438057)
            [ 6311.637464] SLUB: Unable to allocate memory on node -1 (gfp=0x8050)
            [ 6311.638238]   cache: kmalloc-512(0:osc_slab_alloc), object size: 512, buffer size: 512, default order: 1, min order: 0
            [ 6311.638238]   node 0: slabs: 13, objs: 208, free: 0
            [ 6311.670203] SLUB: Unable to allocate memory on node -1 (gfp=0x0)
            [ 6311.670957]   cache: kmalloc-192(0:osc_slab_alloc), object size: 192, buffer size: 192, default order: 0, min order: 0
            [ 6311.670957]   node 0: slabs: 1, objs: 21, free: 0
            [ 6360.020975] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error
            [ 6360.203970] Lustre: DEBUG MARKER: sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error

            But they did not cause the "dd" command to fail, which is what sanity/test_411 expects:

            == sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 22:34:17 (1517438057)
            100+0 records in
            100+0 records out
            104857600 bytes (105 MB) copied, 3.13092 s, 33.5 MB/s
            204800+0 records in
            204800+0 records out
            104857600 bytes (105 MB) copied, 48.1542 s, 2.2 MB/s
             sanity test_411: @@@@@@ FAIL: fail to trigger a memory allocation error
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:5718:error()
              = /usr/lib64/lustre/tests/sanity.sh:17667:test_411()
              = /usr/lib64/lustre/tests/test-framework.sh:5994:run_one()
              = /usr/lib64/lustre/tests/test-framework.sh:6033:run_one_logged()
              = /usr/lib64/lustre/tests/test-framework.sh:5880:run_test()
              = /usr/lib64/lustre/tests/sanity.sh:17673:main()
            Dumping lctl log to /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.*.1517438110.log
            CMD: trevis-10vm10,trevis-10vm11,trevis-10vm12,trevis-10vm9.trevis.hpdd.intel.com /usr/sbin/lctl dk > /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.debug_log.\$(hostname -s).1517438110.log;
                     dmesg > /home/autotest/autotest/logs/test_logs/2018-01-31/lustre-reviews-el7-x86_64--review-ldiskfs--1_8_1__54107___60bc072b-48d1-4e5e-bb15-747752d7c9b7/sanity.test_411.dmesg.\$(hostname -s).1517438110.log
            Resetting fail_loc on all nodes...CMD: trevis-10vm10,trevis-10vm11,trevis-10vm12,trevis-10vm9.trevis.hpdd.intel.com lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
            done.

            As per my LU-8435 analysis, the kmem/memory cgroup feature is known to be buggy with 3.x kernels (even though CONFIG_MEMCG_KMEM is configured by default in the 3.x kernels shipped in CentOS/RH distros) and is only safe to use starting with 4.x kernels. So why don't we simply skip sanity/test_411 for now, or at least add another skip check based on the kernel version (4.x)?
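A kernel-version skip check of the kind suggested above might look something like the following. It is a minimal sketch under stated assumptions: `kernel_major_at_least` and `maybe_skip_411` are hypothetical names introduced here for illustration, not functions from sanity.sh or test-framework.sh, and the check looks only at the major number of the `uname -r` release string.

```shell
# Hypothetical helper (not part of test-framework.sh).
# Succeeds when the major version of a kernel release string is at
# least $1. The release string defaults to the running kernel.
kernel_major_at_least() {
	local want=$1
	local release=${2:-$(uname -r)}
	local major=${release%%.*}

	# Refuse to guess when the release does not start with a number.
	case $major in
	''|*[!0-9]*) return 2 ;;
	esac
	[ "$major" -ge "$want" ]
}

# Hypothetical wrapper: returns 0 (and prints a message) when the test
# should be skipped, 1 when it should run. An unparsable release string
# is treated conservatively as "skip".
maybe_skip_411() {
	if ! kernel_major_at_least 4 "$1"; then
		echo "SKIP: kmem cgroup is not reliable before kernel 4.x"
		return 0
	fi
	return 1
}
```

Called with no argument, `maybe_skip_411` checks the running kernel. Passing a release string explicitly, such as `maybe_skip_411 "3.10.0"`, makes the logic easy to exercise without changing hosts.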
            yujian Jian Yu added a comment - Two failure instances occurred on master branch yesterday: https://testing.hpdd.intel.com/test_sets/76978322-06b6-11e8-a7cd-52540065bddc https://testing.hpdd.intel.com/test_sets/455d1326-06d9-11e8-bd00-52540065bddc
            jgmitter Joseph Gmitter (Inactive) added a comment - Seeing this failure on the flr branch: https://testing.hpdd.intel.com/test_sets/93f8d4c2-ce65-11e7-9c63-52540065bddc
            pjones Peter Jones added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28974/
            Subject: LU-9966 test: add a skip test to test_411
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f6b0e358f304b006dd24524503bb16d649c5499d

            People

              ys Yang Sheng
              maloo Maloo
              Votes: 0
              Watchers: 13
