[LU-12757] sanity-lfsck test 36a fails with '(N) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5
    • Severity: 3

    Description

      We see sanity-lfsck test_36a fail during resync on the last two of the following three calls to ‘lfs mirror resync’ in the test:

      $LFS mirror resync $DIR/$tdir/f0 ||
              error "(6) Fail to resync $DIR/$tdir/f0"
      $LFS mirror resync $DIR/$tdir/f1 ||
              error "(7) Fail to resync $DIR/$tdir/f1"
      $LFS mirror resync $DIR/$tdir/f2 ||
              error "(8) Fail to resync $DIR/$tdir/f2"
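      For reference, a minimal by-hand sketch of the same resync path might look like the following (the directory and file names, mirror counts, and write size are illustrative assumptions, not taken from the test):

      # create a mirrored file, dirty one mirror with a normal write, then resync
      mkdir -p /mnt/lustre/testdir                      # illustrative path
      lfs mirror create -N -c 1 -N -c 2 /mnt/lustre/testdir/f0
      dd if=/dev/urandom of=/mnt/lustre/testdir/f0 bs=1M count=4
      lfs mirror resync /mnt/lustre/testdir/f0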
      

      It looks like this test started failing with these two errors on 07-September-2019 with Lustre master version 2.12.57.54.

      Looking at the suite_log for https://testing.whamcloud.com/test_sets/a5f2b938-d438-11e9-a2b6-52540065bddc, we see

      lfs mirror mirror: component 131075 not synced
      : No space left on device (28)
      lfs mirror mirror: component 131076 not synced
      : No space left on device (28)
      lfs mirror mirror: component 196613 not synced
      : No space left on device (28)
      lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f1' llapi_mirror_resync_many: No space left on device.
       sanity-lfsck test_36a: @@@@@@ FAIL: (7) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f1 
      

      Similarly, looking at the suite_log for https://testing.whamcloud.com/test_sets/42fbb9fe-d575-11e9-9fc9-52540065bddc, we see

      lfs mirror mirror: component 131075 not synced
      : No space left on device (28)
      lfs mirror mirror: component 131076 not synced
      : No space left on device (28)
      lfs mirror mirror: component 196613 not synced
      : No space left on device (28)
      lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f2' llapi_mirror_resync_many: No space left on device.
       sanity-lfsck test_36a: @@@@@@ FAIL: (8) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f2 
       

      It is possible that we are running out of disk space on an OST, but it seems strange that this just started earlier this month.
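      One way to test the out-of-space theory on a live run is to compare per-target usage around the failure (standard lfs invocations; the mount point matches the logs above):

      lfs df /mnt/lustre        # per-MDT/OST block usage
      lfs df -i /mnt/lustre     # per-MDT/OST inode usage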

      Logs for other failures are at
      https://testing.whamcloud.com/test_sessions/279dd05c-e122-4f8f-bafe-b8299e8e0e61
      https://testing.whamcloud.com/test_sessions/fe936f3a-df7d-4d23-9d28-721da7ab8f76

    Activity


            jamesanunez James Nunez (Inactive) added a comment -

            The patch that Andreas referenced as the fix for this issue, https://review.whamcloud.com/#/c/35896/, landed to master on July 10, and we have not seen this issue since July 9. It looks like this issue is fixed and we can close this ticket.


            vilapa Vikentsi Lapa added a comment -

            Additional details about the test; I hope they will be useful for reproducing the issue. In most cases the failed tests contain only a single MDT (the list from the link below was analyzed for sanity-lfsck.sh/test_36a):

            https://testing.whamcloud.com/search?horizon=2332800&test_set_script_id=4f25830c-64fe-11e2-bfb2-52540035b04c&sub_test_script_id=1bd8f58e-6f10-11e8-a55d-52540065bddc&source=sub_tests#redirect

            However, sometimes tests with one MDT passed as well.

             

            For example, the output from a failed test:

            UUID                   1K-blocks        Used   Available Use% Mounted on
            lustre-MDT0000_UUID        43584        2444       37152   7% /mnt/lustre[MDT:0]
            lustre-OST0000_UUID        71100        7556       56544  12% /mnt/lustre[OST:0]
            lustre-OST0001_UUID        71100        5376       58724   9% /mnt/lustre[OST:1]
            lustre-OST0002_UUID        71100       10496       50508  18% /mnt/lustre[OST:2]
            lustre-OST0003_UUID        71100        1280       61948   3% /mnt/lustre[OST:3]

            Output from a passed test that used multiple MDTs:

            UUID                   1K-blocks        Used   Available Use% Mounted on
            lustre-MDT0000_UUID       283520        3968      277504   2% /mnt/lustre[MDT:0]
            lustre-MDT0001_UUID       283520        3200      278272   2% /mnt/lustre[MDT:1]
            lustre-MDT0002_UUID       283520        3200      278272   2% /mnt/lustre[MDT:2]
            lustre-MDT0003_UUID       283520        3200      278272   2% /mnt/lustre[MDT:3]
            lustre-OST0000_UUID       282624       16384      264192   6% /mnt/lustre[OST:0]
            lustre-OST0001_UUID       282624       10240      270336   4% /mnt/lustre[OST:1]
            lustre-OST0002_UUID       282624       22528      258048   9% /mnt/lustre[OST:2]
            lustre-OST0003_UUID       282624        4096      276480   2% /mnt/lustre[OST:3]
            

            When the number of OSTs is less than 3, the test is skipped, so in those cases there is no failure. A sketch of what such a guard typically looks like in Lustre test scripts is shown below.
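            The exact check and message in sanity-lfsck.sh may differ; this follows the usual test-framework conventions ($OSTCOUNT and skip):

            # skip the test on setups with fewer than 3 OSTs
            [ $OSTCOUNT -lt 3 ] && skip "needs >= 3 OSTs" && return 0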



            adilger Andreas Dilger added a comment -

            +1 on master:
            https://testing.whamcloud.com/test_sets/18633cde-dc95-4d48-add3-405591582c3f

            Patch https://review.whamcloud.com/35896 "LU-12687 osc: consume grants for direct I/O" is in master-next and should resolve this issue. This ticket can be closed once that patch lands and this problem is no longer seen.


            adilger Andreas Dilger added a comment -

            To fix this, we need one of the patches from LU-4664 and/or LU-12687 to land, so that the O_DIRECT writes used by resync do not consume all of the grants.


            jamesanunez James Nunez (Inactive) added a comment -

            Reopening this ticket because the patch that landed changes the error message formatting but does not address the problem described in this ticket.

            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36176/
            Subject: LU-12757 utils: avoid newline inside error message
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3f3a241498be7e043df7e416da7fc8722a559498


            adilger Andreas Dilger added a comment -

            It looks like the small filesystems used by sanity-lfsck.sh are causing problems in this test: the client is consuming all of the free space as grant, because "lfs mirror resync" does its writes with O_DIRECT, which currently does not consume grants.
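            One way to observe this from the client is to snapshot the per-OSC grant counter around a resync (osc.*.cur_grant_bytes is a standard client parameter; the file path here is taken from the logs above):

            lctl get_param osc.*.cur_grant_bytes       # grant before resync
            lfs mirror resync /mnt/lustre/d36a.sanity-lfsck/f1
            lctl get_param osc.*.cur_grant_bytes       # grant after resync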

            sarah Sarah Liu added a comment -

            Found a similar error on a PPC client running 2.12.3:
            https://testing.whamcloud.com/test_sets/d9ac1a0c-eb0e-11e9-b62b-52540065bddc

            == sanity-lfsck test 36a: rebuild LOV EA for mirrored file (1) ======================================= 22:20:32 (1570573232)
            #####
            The target MDT-object's LOV EA corrupted as to lose one of the 
            mirrors information. The layout LFSCK should rebuild the LOV EA 
            with the PFID EA of related OST-object(s) belong to the mirror.
            #####
            4+0 records in
            4+0 records out
            4194304 bytes (4.2 MB) copied, 0.147001 s, 28.5 MB/s
            4+0 records in
            4+0 records out
            4194304 bytes (4.2 MB) copied, 0.0381554 s, 110 MB/s
            4+0 records in
            4+0 records out
            4194304 bytes (4.2 MB) copied, 0.0381345 s, 110 MB/s
            lfs mirror mirror: cannot get WRITE lease, ext 1: Device or resource busy (16)
            lfs mirror: '/mnt/lustre/d36a.sanity-lfsck/f0' llapi_lease_get_ext resync failed: Device or resource busy.
             sanity-lfsck test_36a: @@@@@@ FAIL: (6) Fail to resync /mnt/lustre/d36a.sanity-lfsck/f0 
            
            arshad512 Arshad Hussain added a comment -

            Detected again at https://testing.whamcloud.com/test_sets/b004f920-e795-11e9-b62b-52540065bddc

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 8
