LU-18191: sanity-quota test_90b: quota info from /mnt/lustre not xxx , found xxx

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.16.0
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for emoly <emoly@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2705afb2-3a37-4d85-aedd-c5dd5e7b7adb

      test_90b failed with the following error:

      $'quota info from /mnt/lustre not '     /mnt/lustre      root   11686       0       0       -     376       0       0       -\n     /mnt/lustre quota_usr       0       0       0       -       0       0       0       -\n     /mnt/lustre quota_2usr       0   

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/107261 - 4.18.0-513.24.1.el8_9.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/107261 - 4.18.0-513.24.1.el8_lustre.x86_64

      This issue has happened more than 10 times since July.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-quota test_90b - $'quota info from /mnt/lustre not ' /mnt/lustre root 11686 0 0 - 376 0 0 -\n /mnt/lustre quota_usr 0 0 0 - 0 0 0 -\n /mnt/lustre quota_2usr 0

          Activity

            pjones Peter Jones added a comment -

            Fixed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56532/
            Subject: LU-18191 tests: sanity-quota 90b racer fix
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 71994fa608ae9641d471cbb8cf1ce487a9f761b6

            adilger Andreas Dilger added a comment -

            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56532
            Subject: LU-4315 tests: sanity-quota 90b racer fix
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c42a17a6a53355c300ee2d5f69195fc36a97a44a

            adilger Andreas Dilger added a comment - edited

            It looks like this started failing on 2024-07-19, but only in test runs of patch https://review.whamcloud.com/55683 "LU-17702 utils: 'lfs quota' MOUNT_POINT optional" until that patch landed on 2024-08-16, since test_90b is a new subtest added in that patch.

            It looks like there is a small race condition in the test. With the "unmangled" output from the test:

            cmd: /usr/bin/lfs quota -q -a -u  /mnt/lustre /mnt/lustre2
             sanity-quota test_90b: @@@@@@ FAIL: quota info from /mnt/lustre not
            '     /mnt/lustre      root   11686       0       0       -     376       0       0       -
                 /mnt/lustre quota_usr       0       0       0       -       0       0       0       -
                 /mnt/lustre quota_2usr       0       0       0       -       0       0       0       -'
            found
            '     /mnt/lustre      root   11685       0       0       -     376       0       0       -
                 /mnt/lustre quota_usr       0       0       0       -       0       0       0       -
                 /mnt/lustre quota_2usr       0       0       0       -       0       0       0       -' 
            

            so it looks like 1 block of root user block usage (11686 expected vs. 11685 found) has been freed since the start of the test. The failures appear to be mostly on ZFS filesystems, possibly caused by some quota accounting update at transaction commit after destroying objects?

            My suggestion to fix this would be to re-fetch the "$head" or "$tail" in the error case to see if it now matches and only report an error if it is still different. That will handle the race condition without adding overhead for the common case where the output matches, unlike adding a longer sleep to wait_delete_completed->wait_zfs_commit() that would slow down every test by a few seconds.
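
            The retry-on-mismatch idea suggested above can be sketched roughly as follows. This is a minimal illustration, not the code from the landed patch: fetch_reference, fetch_combined, quota_matches and usage_file are hypothetical names, and the two fetch functions stand in for the real "lfs quota" queries by reading a usage value from a scratch file so the race can be simulated.

```shell
#!/bin/bash
# Sketch only: fetch_reference and fetch_combined are hypothetical stand-ins
# for the "lfs quota" queries that the test compares. Usage is read from a
# scratch file so the demo can change it between calls.
usage_file=$(mktemp)
trap 'rm -f "$usage_file"' EXIT

fetch_reference() {
    echo "/mnt/lustre root $(cat "$usage_file") 0 0 - 376 0 0 -"
}

fetch_combined() {
    echo "/mnt/lustre root $(cat "$usage_file") 0 0 - 376 0 0 -"
}

# Return 0 if the combined output matches the reference. On a mismatch,
# re-fetch both once and fail only if they still differ.
quota_matches() {
    local reference="$1"
    local found

    found=$(fetch_combined)
    # Common case: outputs agree, no extra commands are run.
    [ "$found" = "$reference" ] && return 0

    # Slow path: usage may have changed after the reference was taken.
    reference=$(fetch_reference)
    found=$(fetch_combined)
    [ "$found" = "$reference" ] && return 0

    echo "quota info not '$reference', found '$found'" >&2
    return 1
}

# Simulate the race: usage drops by one after the reference is taken.
echo 11686 > "$usage_file"
reference=$(fetch_reference)
echo 11685 > "$usage_file"

if quota_matches "$reference"; then
    result=pass
else
    result=fail
fi
echo "$result"
```

            The extra queries only run when the first comparison fails, so the common case where the output already matches pays no additional cost, which is the trade-off argued for above compared with a longer unconditional sleep.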


            People

              Assignee: fdilger Fred Dilger
              Reporter: maloo Maloo