LU-17151: sanity: test_411b Error: '(3) failed to write successfully'


    Description

      This issue was created by maloo for Serguei Smirnov <ssmirnov@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2a3fbe0b-f784-4875-bd67-6ab32aa223a3

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/99011 - 4.18.0-425.10.1.el8_7.aarch64
      servers: https://build.whamcloud.com/job/lustre-reviews/99011 - 4.18.0-477.21.1.el8_lustre.x86_64

      sanity test_411b: @@@@@@ FAIL: (3) failed to write successfully
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:6700:error()
      = /usr/lib64/lustre/tests/sanity.sh:27545:test_411b()
      = /usr/lib64/lustre/tests/test-framework.sh:7040:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:7096:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:6926:run_test()
      = /usr/lib64/lustre/tests/sanity.sh:27592:main()
      Dumping lctl log to /autotest/autotest-2/2023-09-27/lustre-reviews_review-ldiskfs-dne-arm_99011_29_0b05909e-9d3c-46a3-9f81-125f5c37cc5d//sanity.test_411b.*.1695797605.log
      CMD: trevis-108vm17.trevis.whamcloud.com,trevis-108vm18,trevis-72vm4,trevis-72vm5,trevis-72vm6 /usr/sbin/lctl dk > /autotest/autotest-2/2023-09-27/lustre-reviews_review-ldiskfs-dne-arm_99011_29_0b05909e-9d3c-46a3-9f81-125f5c37cc5d//sanity.test_411b.debug_log.$(hostname -s).1695797605.log;
      dmesg > /autotest/autotest-2/2023-09-27/lustre-reviews_review-ldiskfs-dne-arm_99011_29_0b05909e-9d3c-46a3-9f81-125f5c37cc5d//sanity.test_411b.dmesg.$(hostname -s).1695797605.log
      cache 19660800
      rss 0
      rss_huge 0
      shmem 0
      mapped_file 0
      dirty 2883584
      writeback 0
      swap 2949120
      pgpgin 21206
      pgpgout 20906
      pgfault 4034
      pgmajfault 417
      inactive_anon 0
      active_anon 0
      inactive_file 18022400
      active_file 1638400
      unevictable 0
      hierarchical_memory_limit 268435456
      hierarchical_memsw_limit 9223372036854710272
      total_cache 19660800
      total_rss 0
      total_rss_huge 0
      total_shmem 0
      total_mapped_file 0
      total_dirty 2883584
      total_writeback 0
      total_swap 2949120
      total_pgpgin 21206
      total_pgpgout 20906
      total_pgfault 4034
      total_pgmajfault 417
      total_inactive_anon 0
      total_active_anon 0
      total_inactive_file 18022400
      total_active_file 1638400
      total_unevictable 0
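
      The counters above follow the cgroup v1 memory.stat layout; hierarchical_memory_limit 268435456 is the 256 MiB hard limit the writer ran under. As a minimal sketch (the cgroup name below is an assumption, not necessarily the one sanity.sh uses), such a dump can be collected with:

      # cgroup v1 memory controller; path and group name are illustrative
      CG=/sys/fs/cgroup/memory/test_411b
      cat $CG/memory.stat                # per-group counters, as dumped above
      cat $CG/memory.limit_in_bytes      # hard limit (hierarchical_memory_limit)
      cat $CG/memory.max_usage_in_bytes  # peak usage seen inside the group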

          Activity

            nangelinas Nikitas Angelinas added a comment - edited

            There is a failure in a branch based on master at https://testing.whamcloud.com/test_sets/702d252c-180a-4b1a-a1cd-d8ab56d4d11f, which includes the fixes from this ticket, but I'm not sure whether it's due to the same issue.

            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55568/
            Subject: LU-17151 tests: increase memcg limit on x86_64
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a26b49279c86a08e84215698d5bbc3d1ebf4d939

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55568/ Subject: LU-17151 tests: increase memcg limit on x86_64 Project: fs/lustre-release Branch: master Current Patch Set: Commit: a26b49279c86a08e84215698d5bbc3d1ebf4d939

            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55568
            Subject: LU-17151 tests: increase memcg limit on x86_64
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 78fccbdb60d4686afc1c3d004e069bbd71e1f3ff

            gerrit Gerrit Updater added a comment - "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55568 Subject: LU-17151 tests: increase memcg limit on x86_64 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 78fccbdb60d4686afc1c3d004e069bbd71e1f3ff

            adilger Andreas Dilger added a comment -

            The x86_64 limit of 384MB is quite small and could be increased. That said, it shouldn't be too large, or we won't know if cgroups is actually working properly.

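            As a minimal sketch of what "increasing the memcg limit" amounts to, assuming cgroup v1; the path, the 512M value and the file name are illustrative and not taken from the merged patch:

            CG=/sys/fs/cgroup/memory/test_411b
            mkdir -p $CG
            # pick a larger (but still bounded) limit; 512M is only an example
            echo 512M > $CG/memory.limit_in_bytes
            # run the writer inside the limited group; a write failing under this
            # limit is roughly what "(3) failed to write successfully" reports
            sh -c "echo \$\$ > $CG/cgroup.procs && \
                   dd if=/dev/zero of=/mnt/lustre/f411b.0 bs=1M count=100 conv=fsync"
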
            qian_wc Qian Yingjin added a comment -

            11 failures in the past week...

            adilger Andreas Dilger added a comment -

            22 failures in the past week


            bzzz Alex Zhuravlev added a comment -

            Does trim get passed from the VM down to the host to free the memory?

            yes, here are my local runs:

            • sanity on ZFS w/o trim: 9785 MBs allocated from the host by sanity's completion
            • sanity on ZFS w/ zfs trim in run_one_logged: 8159 MBs

            adilger Andreas Dilger added a comment -

            Does trim get passed from the VM down to the host to free the memory? You could try adding a patch to this subtest to run fstrim or zfs trim on all of the targets before or during the test. However, if only the MDT is on tmpfs then I don't think these tests are using much memory there.

            It would be useful to add some debugging to see where all of the memory is used (slabinfo/meminfo) so that we can reduce the size. Some of the internal data structures are allocated/limited at mount time based on the total RAM size and not limited by the cgroup size (e.g. lu cache, max_dirty_mb, etc), and the client needs to do a better job to free this memory under pressure (e.g. registered shrinker). Should we do anything to flush the cache at the start of the test, so that the process in the cgroup is not penalized by previous allocations outside its control?
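
            A minimal sketch of the kind of debugging and pre-test flush suggested above; where it is hooked into the test is an assumption, but the interfaces are standard /proc files and Lustre lctl parameters:

            # record where client memory sits before the cgroup-limited writes start
            cat /proc/meminfo > $TMP/meminfo.before
            slabtop -o -s c | head -25 > $TMP/slab.before   # largest slab caches first
            # flush page cache, dentries and inodes so earlier subtests do not
            # penalize the process running inside the cgroup
            sync; echo 3 > /proc/sys/vm/drop_caches
            # also drop client DLM locks and the pages they pin
            $LCTL set_param ldlm.namespaces.*.lru_size=clear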

            It would also be good to improve the test scripts slightly to match proper test script style (see the sketch after this list):

            • no need for "trap 0" in cleanup_test411_cgroup() since this would clobber any other registered stack_trap calls
            • in test_411a use "stack_trap cleanup_test411_cgroup" to do the cleanup instead of calling it explicitly
            • in test_411b add "stack_trap 'rm -f $DIR/$tfile.*'" to clean up the files even if the test fails
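
            A minimal sketch of the cleanup structure these points describe; names follow sanity.sh conventions, but this is illustrative rather than the actual subtest:

            cleanup_test411_cgroup() {
                # no "trap 0" here: that would clobber other registered stack_trap handlers
                rmdir "$1" 2>/dev/null || true
            }

            test_411_sketch() {    # hypothetical subtest, not the real test_411a/b
                local cgdir=/sys/fs/cgroup/memory/$tfile
                mkdir -p $cgdir || skip "no memory cgroup support"
                # register cleanups up front so they run even if the test fails part way
                stack_trap "cleanup_test411_cgroup $cgdir"
                stack_trap "rm -f $DIR/$tfile.*"
                echo 256M > $cgdir/memory.limit_in_bytes
                sh -c "echo \$\$ > $cgdir/cgroup.procs && \
                       dd if=/dev/zero of=$DIR/$tfile.0 bs=1M count=100" ||
                        error "(3) failed to write successfully"
            }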

            bzzz Alex Zhuravlev added a comment -

            My observation with tmpfs-based ZFS is that it tends not to reuse blocks but instead allocates new ones ("inner"), and this leads to memory overuse. It can be fixed by running "zfs trim ..." once every few minutes/subtests.

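            A minimal sketch of such a periodic trim, e.g. called from run_one_logged(); the facet and pool names are assumptions, and "zfs trim" here maps to the actual "zpool trim" CLI command (OpenZFS 0.8+):

            trim_zfs_targets() {
                local facet
                [ "$FSTYPE" = zfs ] || return 0
                for facet in mds1 ost1 ost2; do
                    # pool naming is illustrative; adjust to the real pool per target
                    do_facet $facet "zpool trim lustre-$facet" || true
                done
            }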

            adilger Andreas Dilger added a comment -

            Yes, it has been using tmpfs on the VM host for the MDT since August or so. It is exported to the VM guest as a block device. The OSTs are still on HDD, I believe.

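            A minimal sketch of the setup described here, run on the VM host; the paths, sizes and libvirt domain/target names are assumptions about the autotest configuration:

            # back the MDT with a file that lives on tmpfs on the VM host
            mount -t tmpfs -o size=20g tmpfs /mnt/mdt-tmpfs
            truncate -s 16G /mnt/mdt-tmpfs/mdt0.img
            # hand it to the guest as a raw virtio block device via libvirt
            virsh attach-disk guest-mds1 /mnt/mdt-tmpfs/mdt0.img vdb \
                --driver qemu --subdriver raw --targetbus virtio --persistent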

            People

              Assignee: qian_wc Qian Yingjin
              Reporter: maloo Maloo