Lustre / LU-18451

sanity-quota: test_12b fails with "unlink mdt0 files failed" message during review-dne-zfs-part-4 test group

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      This issue was created by maloo for Bruno Faccini <bfaccini62@gmail.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/df189fdc-3b66-438c-bd15-a3d85c45a644

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/109008 - 4.18.0-513.24.1.el8_9.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/109008 - 4.18.0-513.24.1.el8_lustre.x86_64

      I have created this ticket because I found another similar failed test session for a different ticket/patch (gerrit:56950, jira:LU-18435), at https://testing.whamcloud.com/test_sets/16e25cdb-60f1-44f6-b102-72d53d52d783

          Activity


            scherementsev Sergey Cheremencev added a comment - I've created patch https://review.whamcloud.com/c/fs/lustre-release/+/57929 to address the problem.

            This is being hit about 10x per month on master. The first such failure was on 2024-10-09, and allowing a few days between a patch landing and failures appearing in test runs, it looks like patch https://review.whamcloud.com/53969 ("LU-16641 tests: fix sanity-quota_12b"), which landed on 2024-10-04, is the reason this test started failing in this manner:

            $ git log --oneline --grep quota --after 2024-10-03 --before 2024-10-10
            2a5e8e3554 LU-18247 nodemap: initialize unused fields on disk
            25896b8b88 LU-16641 tests: fix sanity-quota_12b
            71994fa608 LU-18191 tests: sanity-quota 90b racer fix
            cf2c5fe27e LU-4315 doc: remove usage of lgroff-macros
            

            It looks like the same problem existed before that patch, but with a different error message, "create failed, but expect success", when createmany could not create all of the files.

            This still points to the same root cause - that something is preventing all 2048 files from being created, which leads the test to fail later.

            I see that patch https://review.whamcloud.com/54071 ("LU-16641 quota: sync in osd_declare_create for zfs") was not landed, so possibly this is a known issue and that patch just needs reviewers and landing to solve this issue?
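If the EDQUOT really is transient because freed space on ZFS is only accounted after a transaction-group sync (as the title of patch 54071 suggests), one conceivable test-side workaround would be to retry the create after forcing a sync. The sketch below is purely illustrative and is not the real sanity-quota code; `retry_create` and the retry counts are hypothetical:

```shell
# Hypothetical helper, assuming the EDQUOT is transient: retry the
# command a few times, syncing in between so that pending frees are
# pushed out and quota accounting can catch up before the next attempt.
retry_create() {
    local attempts=$1; shift
    local i
    for ((i = 0; i < attempts; i++)); do
        "$@" && return 0   # create succeeded
        sync               # flush pending frees so quota accounting updates
        sleep 1
    done
    return 1               # still failing after all attempts
}
```

A caller would use it as `retry_create 3 createmany -m $DIR/$tfile 2048`, tuning the attempt count to the expected sync latency.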

            adilger Andreas Dilger added a comment.

            It looks like this test is failing during initial setup because the mdt0 file creates are hitting EDQUOT when trying to create 2048 files, so only 1024 (or some other smaller number) of the files are created:
            https://testing.whamcloud.com/test_sets/cb1355a4-6b09-4fba-b0f9-4dea2b7176f1

            Create 2048 files on mdt0...
            running as uid/gid/euid/egid 60000/60000/60000/60000, groups: 60000
             [createmany] [-m] [/mnt/lustre/d12b.sanity-quota/f12b.sanity-quota] [2048]
            mknod(/mnt/lustre/d12b.sanity-quota/f12b.sanity-quota1024) error: Disk quota exceeded
            total: 1024 create in 1.01 seconds: 1009.60 ops/second
            Create files on mdt1...
            running as uid/gid/euid/egid 60000/60000/60000/60000, groups: 60000
             [createmany] [-m] [/mnt/lustre/d12b.sanity-quota-1/f12b.sanity-quota] [1]
            mknod(/mnt/lustre/d12b.sanity-quota-1/f12b.sanity-quota0) error: Disk quota exceeded
            total: 0 create in 0.01 seconds: 0.00 ops/second
            Free space from mdt0...
            running as uid/gid/euid/egid 60000/60000/60000/60000, groups: 60000
             [unlinkmany] [/mnt/lustre/d12b.sanity-quota/f12b.sanity-quota] [2048]
             - unlinked 0 (time 1737706783 ; total 0 ; last 0)
            unlink(/mnt/lustre/d12b.sanity-quota/f12b.sanity-quota1024) error: No such file or directory
            total: 1024 unlinks in 2 seconds: 512.000000 unlinks/second
             sanity-quota test_12b: @@@@@@ FAIL: unlink mdt0 files failed 
            

            So there is no error reported when only 1024 files are created on mdt0 initially, but the unlinkmany call fails when the test is trying to balance space usage.

            It looks like there is some issue cleaning up the previous subtest, or the quota is set too low (unlikely), or some other reason why the user is not able to create all of the 2048 requested files.

            Either the test needs to be changed to only try to delete the actual number of created files, or ignore the error from unlinkmany, or the reason why only 1024 files can be created should be fixed. I do not think that checking that all of the files were created and returning an error from createmany makes sense, since that is unrelated to the main reason for this subtest.
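The first option above, deleting only the files that were actually created, could be sketched roughly as follows. This is an illustrative stand-in, not the real test code: plain `touch`/`rm` replace `createmany`/`unlinkmany`, and the quota limit is simulated with a counter:

```shell
# Sketch of the suggested test change: track how many files createmany
# actually produced, and clean up only those, so a partial create does
# not make the cleanup phase fail with ENOENT.
dir=$(mktemp -d)
requested=8          # the real test requests 2048; small here for illustration
quota_limit=5        # pretend EDQUOT kicks in after 5 creates

created=0
for ((i = 0; i < requested; i++)); do
    # stand-in for createmany hitting EDQUOT partway through the run
    [ "$created" -ge "$quota_limit" ] && break
    touch "$dir/f$i" && created=$((created + 1))
done
echo "created $created of $requested files"

# unlink only what exists, instead of the full requested count
unlinked=0
for f in "$dir"/f*; do
    [ -e "$f" ] && rm "$f" && unlinked=$((unlinked + 1))
done
echo "unlinked $unlinked files"
rmdir "$dir"
```

In the real test this would mean capturing the count reported by createmany (or ignoring the unlinkmany error), rather than unconditionally passing 2048 to unlinkmany.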

            adilger Andreas Dilger added a comment.

            Browsing more test results shows earlier occurrences starting on 2024-10-09.

            bfaccini-nvda Bruno Faccini added a comment.

            I have created this ticket because I found another similar failed test session for a different ticket/patch (gerrit:56950, jira:LU-18435), at https://testing.whamcloud.com/test_sets/16e25cdb-60f1-44f6-b102-72d53d52d783

            bfaccini-nvda Bruno Faccini added a comment.

            People

              scherementsev Sergey Cheremencev
              maloo Maloo