[LU-12757] sanity-lfsck test 36a fails with '(N) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      We see sanity-lfsck test_36a fail in resync for the last two of the following three calls to ‘lfs mirror resync’ from test 36a:

      $LFS mirror resync $DIR/$tdir/f0 ||
              error "(6) Fail to resync $DIR/$tdir/f0"
      $LFS mirror resync $DIR/$tdir/f1 ||
              error "(7) Fail to resync $DIR/$tdir/f1"
      $LFS mirror resync $DIR/$tdir/f2 ||
              error "(8) Fail to resync $DIR/$tdir/f2"

      It looks like this test started failing with these two errors on 07-September-2019 with Lustre master version 2.12.57.54.

      Looking at the suite_log for https://testing.whamcloud.com/test_sets/a5f2b938-d438-11e9-a2b6-52540065bddc, we see

      lfs mirror mirror: component 131075 not synced
      : No space left on device (28)
      lfs mirror mirror: component 131076 not synced
      : No space left on device (28)
      lfs mirror mirror: component 196613 not synced
      : No space left on device (28)
      lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f1' llapi_mirror_resync_many: No space left on device.
       sanity-lfsck test_36a: @@@@@@ FAIL: (7) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f1 
      

      Similarly, looking at the suite_log for https://testing.whamcloud.com/test_sets/42fbb9fe-d575-11e9-9fc9-52540065bddc, we see

      lfs mirror mirror: component 131075 not synced
      : No space left on device (28)
      lfs mirror mirror: component 131076 not synced
      : No space left on device (28)
      lfs mirror mirror: component 196613 not synced
      : No space left on device (28)
      lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f2' llapi_mirror_resync_many: No space left on device.
       sanity-lfsck test_36a: @@@@@@ FAIL: (8) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2 
       

      It is possible that we are running out of disk space on an OST, but it seems strange that this just started earlier this month.

      Logs for other failures are at
      https://testing.whamcloud.com/test_sessions/279dd05c-e122-4f8f-bafe-b8299e8e0e61
      https://testing.whamcloud.com/test_sessions/fe936f3a-df7d-4d23-9d28-721da7ab8f76

          Activity

            adilger Andreas Dilger added a comment -

            +1 on master:
            https://testing.whamcloud.com/test_sets/18633cde-dc95-4d48-add3-405591582c3f

            Patch https://review.whamcloud.com/35896 "LU-12687 osc: consume grants for direct I/O" is in master-next and should resolve this issue. This ticket can be closed once that patch lands and this problem is no longer seen.

            adilger Andreas Dilger added a comment -

            To fix this, we need one of the patches from LU-4664 and/or LU-12687 to be landed, so that O_DIRECT writes used by resync do not consume all of the grants.

            jamesanunez James Nunez (Inactive) added a comment -

            Reopening this ticket because the patch that landed only changes the error message formatting and does not address the problem described in this ticket.
            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36176/
            Subject: LU-12757 utils: avoid newline inside error message
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3f3a241498be7e043df7e416da7fc8722a559498

            adilger Andreas Dilger added a comment -

            It looks like the small filesystems used by sanity-lfsck.sh are causing problems in this test: "lfs mirror resync" uses O_DIRECT for its writes, which currently do not consume grants, so the client ends up holding all of the free space as grant.
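
            A toy model may help show why a small OST makes this visible. The sketch below is not Lustre code: can_write and its arguments are hypothetical, sizes are in MB, and it assumes (as a simplification) that buffered writes are covered by grant the client already holds while O_DIRECT writes must fit in the space that has not been granted to any client.

```shell
#!/bin/sh
# Toy model only; can_write and its arguments are hypothetical.
# Usage: can_write FREE_MB GRANTED_MB WRITE_MB MODE
can_write() {
	if [ "$4" = "buffered" ]; then
		# Buffered write: covered by grant the client already holds.
		avail=$2
	else
		# Direct write: assumed to need space not granted to anyone.
		avail=$(( $1 - $2 ))
	fi
	if [ "$3" -le "$avail" ]; then
		echo "ok"
	else
		echo "ENOSPC"
	fi
}

# Small OST: 64 MB free, 60 MB of it already granted.
can_write 64 60 4 buffered    # prints "ok"
can_write 64 60 8 direct      # prints "ENOSPC" (only 4 MB ungranted)
# Large OST: same grant, plenty of ungranted space.
can_write 10240 60 8 direct   # prints "ok"
```

            On a large filesystem the granted space is a small fraction of the free space, which would be consistent with the failure showing up only in the small test filesystems.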
            sarah Sarah Liu added a comment -

            Found similar error on PPC client 2.12.3
            https://testing.whamcloud.com/test_sets/d9ac1a0c-eb0e-11e9-b62b-52540065bddc

            == sanity-lfsck test 36a: rebuild LOV EA for mirrored file (1) ======================================= 22:20:32 (1570573232)
            #####
            The target MDT-object's LOV EA corrupted as to lose one of the 
            mirrors information. The layout LFSCK should rebuild the LOV EA 
            with the PFID EA of related OST-object(s) belong to the mirror.
            #####
            4+0 records in
            4+0 records out
            4194304 bytes (4.2 MB) copied, 0.147001 s, 28.5 MB/s
            4+0 records in
            4+0 records out
            4194304 bytes (4.2 MB) copied, 0.0381554 s, 110 MB/s
            4+0 records in
            4+0 records out
            4194304 bytes (4.2 MB) copied, 0.0381345 s, 110 MB/s
            lfs mirror mirror: cannot get WRITE lease, ext 1: Device or resource busy (16)
            lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f0' llapi_lease_get_ext resync failed: Device or resource busy.
             sanity-lfsck test_36a: @@@@@@ FAIL: (6) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f0 
            
            arshad512 Arshad Hussain added a comment - Seen again at https://testing.whamcloud.com/test_sets/b004f920-e795-11e9-b62b-52540065bddc

            adilger Andreas Dilger added a comment -

            Patch https://review.whamcloud.com/36176 does not fix the test failure here; it is just cosmetic, fixing the error message so that it no longer has a newline in the middle:

            lfs mirror mirror: component 131075 not synced
            : No space left on device (28)

            I also notice that progname is "lfs mirror mirror", which is also not correct. That is because of one or more of the following: lfs_setstripe_internal() appends argv[0] to progname internally (via the cmd[] buffer) and prints both progname and argv[0] explicitly in error messages, lfs_mirror() appends argv[0] to progname, and llapi_error()->error_callback_default() appends liblustreapi_cmd as well. That is very confusing and should be fixed in a separate patch.
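
            The doubled sub-command name can be reproduced with a small stand-alone sketch. The functions below are hypothetical and only mimic the layering described in the comment (one layer appends the sub-command name to progname, another prints the sub-command name again); they are not the actual lfs code paths.

```shell
#!/bin/sh
# Hypothetical illustration; not the actual lfs code paths.
progname="lfs"

# A sub-command handler appends its own name (argv[0]) to progname.
enter_subcommand() {
	progname="$progname $1"
}

# The error printer then prints progname and the sub-command name again.
report_error() {
	echo "$progname $1: $2"
}

enter_subcommand mirror
report_error mirror "component 131075 not synced"
# prints: lfs mirror mirror: component 131075 not synced
```

            Having exactly one layer own the prefix would avoid the repetition, which is presumably what the separate patch mentioned above would need to settle.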

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36176
            Subject: LU-12757 utils: avoid newline inside error message
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4d39bed4489f3b38b388ec69e449fdc65afe6f19
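
            As a rough illustration of the kind of problem this patch subject describes, the sketch below assumes that the error printer appends ": <errno text>" after the caller's message, so a message that already ends in a newline pushes the errno text onto its own line. print_error is a hypothetical stand-in, not the liblustreapi function.

```shell
#!/bin/sh
# Hypothetical stand-in for an error printer that appends the errno
# text after the caller's message.
print_error() {
	printf '%s: %s\n' "$1" "$2"
}

# Message ending in a newline: the errno text lands on a second line.
print_error "lfs mirror mirror: component 131075 not synced
" "No space left on device (28)"

# Message without the trailing newline: one line per error.
print_error "lfs mirror mirror: component 131075 not synced" \
	"No space left on device (28)"
```

            The first call reproduces the two-line shape seen in the suite_log; the second shows the single-line form.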
            adilger Andreas Dilger added a comment - - edited

            It is very likely that the culprit is:

            commit 0f670d1ca9dd5af697bfbf3b95a301c61a8b4447
            Author:     Bobi Jam <bobijam@whamcloud.com>
            AuthorDate: Wed Oct 10 14:23:55 2018 +0800
            
                LU-11239 lfs: fix mirror resync error handling
                
                This patch returns error for partially successful mirror resync.
                
                Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
                Change-Id: I9d6c9ef5aca1674ceb7a9cbc6b790f3f7276ff5d
                Reviewed-on: https://review.whamcloud.com/33537
            

            though this is just returning the error, it isn't causing the error, AFAICS.


            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)