[LU-3633] sanity.sh test_101d failed for 'dd failed'

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.1, Lustre 2.5.0
    • Lustre 2.5.0
    • None
    • 3
    • 9355

    Description

      sanity.sh test_101d fails with 'dd failed' because there is not enough space.

      https://maloo.whamcloud.com/sub_tests/b1487964-f4b3-11e2-b8a2-52540035b04c

      console log from client:

      08:52:27:Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================== 08:52:25 (1374681145)
      08:52:38:LustreError: 24293:0:(vvp_io.c:1094:vvp_io_commit_write()) Write page 46696 of inode ffff880062769b78 failed -28
      08:52:38:LustreError: 24293:0:(vvp_io.c:1094:vvp_io_commit_write()) Write page 46696 of inode ffff880062769b78 failed -28
      08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_101d: @@@@@@ FAIL: dd failed 
      08:52:39:Lustre: DEBUG MARKER: sanity test_101d: @@@@@@ FAIL: dd failed
      08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2013-07-24/lustre-reviews-el6-x86_64--review--1_2_1__16787__-69987943176880-074027/sanity.test_101d.debug_log.$(hostname -s).1374681155.log;
      08:52:39:         dmesg > /logdir/test_logs/2013-07-24/lustre-reviews-el6-x86_64--rev
      08:52:50:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 101e: check read-ahead for small read\(1k\) for small files\(500k\) == 08:52:40 \(1374681160\)
      

          Activity

            This ticket is stopping the landing of the last 2 HSM tickets, and thereby blocking the feature freeze. Escalating to a blocker.

            doug Doug Oucharek (Inactive) added a comment

            Both LU-3640 and this ticket are a result of a single OST filling.

            jamesanunez James Nunez (Inactive) added a comment

            I see I am just reading the error messages wrong then. Thanks for the information.

            keith Keith Mannthey (Inactive) added a comment

            "I don't know this code well, but I would assume that left + tot_grant == total resource in the pool."

            I think left + tot_grant == total unused space.

            "We see the total grant drastically shrink and 'left' go to 0, i.e. nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong."

            I think it's possible that the space already allocated to files isn't included in 'left + tot_granted'.

            niu Niu Yawei (Inactive) added a comment
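
To make the accounting question in this comment concrete, here is a small Python sketch that sums `left` and `tot_grant` for the four OST console messages quoted in this ticket. It assumes Niu's reading, that the sum approximates the unused space on the OST; the function and variable names are illustrative and are not taken from the Lustre source.

```python
# (left, tot_grant) pairs in bytes, copied from the ofd_grant_space_left()
# messages for lustre-OST0001 quoted in this ticket, in log order.
samples = [
    (44511232, 59851008),
    (41365504, 56705280),
    (20393984, 37220608),
    (0, 16392448),
]


def unused_space(left, tot_grant):
    """Sum the two counters, ASSUMING left + tot_grant ~ unused space."""
    return left + tot_grant


totals = [unused_space(left, grant) for left, grant in samples]

for (left, grant), total in zip(samples, totals):
    mib = total / 2**20
    print(f"left={left:>9} tot_grant={grant:>9} sum={total:>10} ({mib:.1f} MiB)")
```

Under that assumption the sum drops from roughly 100 MiB to roughly 16 MiB over the four messages. That would be consistent with space being consumed by file data while dd runs rather than with a fixed pool, but the sketch does not prove which interpretation is correct.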

            James, your patch looks like a good test change to me. I am likely just reading the error messages wrong.

            keith Keith Mannthey (Inactive) added a comment
            jamesanunez James Nunez (Inactive) added a comment - - edited

            Proposed patch at

            http://review.whamcloud.com/#/c/7179

            This patch only fixes the issue that Niu brought up about test 101d. It does not address any issues that may exist with grants.

            keith Keith Mannthey (Inactive) added a comment - - edited

            Niu,
            The grant messages don't seem to line up for me:

             left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880
             left 41365504 < tot_grant 56705280 unstable 5242880 pending 3145728
             left 20393984 < tot_grant 37220608 unstable 4194304 pending 4194304
             left 0 < tot_grant 16392448 unstable 4194304 pending 4194304

            I don't know this code well, but I would assume that left + tot_grant == total resource in the pool.

            We see the total grant drastically shrink and "left" go to 0, i.e. nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong.


            It seems the test filesystem is running out of space. I'm not sure what consumed so much space, but I think the test script is simply wrong. Test 101d first checks whether the available space of the whole filesystem is greater than 'size', then tries to write 'size' to a file. However, the file isn't necessarily striped across all OSTs, so it's quite possible for the write to fail with out of space.

            niu Niu Yawei (Inactive) added a comment
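
The mismatch Niu describes can be illustrated with a short Python sketch. It is not the code in sanity.sh or in the proposed patch. The per-OST free-space figures, the file size, and both function names are hypothetical, and the sketch assumes that data is spread evenly over the OSTs in the file's layout.

```python
def fits_aggregate(ost_free_kb, size_kb):
    """Filesystem-wide check: is the summed free space larger than the file?"""
    return sum(ost_free_kb) > size_kb


def fits_on_stripes(ost_free_kb, size_kb, stripe_osts):
    """Per-OST check: can every OST in the layout hold its share of the file?

    ASSUMPTION: the file is split evenly over the OSTs listed in stripe_osts.
    """
    per_ost_kb = -(-size_kb // len(stripe_osts))  # ceiling division
    return all(ost_free_kb[i] > per_ost_kb for i in stripe_osts)


# Hypothetical free space (KB) on three OSTs; the second one is nearly full.
ost_free_kb = [150_000, 40_000, 150_000]
size_kb = 100_000

aggregate_ok = fits_aggregate(ost_free_kb, size_kb)
single_stripe_ok = fits_on_stripes(ost_free_kb, size_kb, stripe_osts=[1])
full_stripe_ok = fits_on_stripes(ost_free_kb, size_kb, stripe_osts=[0, 1, 2])

print(aggregate_ok, single_stripe_ok, full_stripe_ok)
```

With these made-up numbers the filesystem-wide check passes, yet a file with a single stripe on the nearly full OST cannot hold the data, which is the situation that would make dd fail with ENOSPC.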

            There is a good set of messages in the debug log for the OST as well (just filter for grant messages for ost1). Andreas mentioned on Skype that this seems like a grant accounting issue.

            I didn't get a chance to fully parse that log yet but it looks to have very good info in it.

            keith Keith Mannthey (Inactive) added a comment

            From the OST console:

            08:52:26:Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================== 08:52:25 (1374681145)
            08:52:38:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880
            08:52:38:LustreError: 842:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 41365504 < tot_grant 56705280 unstable 5242880 pending 3145728
            08:52:39:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 20393984 < tot_grant 37220608 unstable 4194304 pending 4194304
            08:52:39:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 8 previous similar messages
            08:52:39:LustreError: 840:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 0 < tot_grant 16392448 unstable 4194304 pending 4194304
            08:52:39:LustreError: 840:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 13 previous similar messages
            08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_101d: @@@@@@ FAIL: dd failed 
            
            jamesanunez James Nunez (Inactive) added a comment
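
For anyone who wants to pull the counters out of console output like the above, the following Python sketch parses the grant messages with a regular expression. The pattern is inferred only from the lines quoted in this ticket, so it is an assumption about the message format, and `parse_grant_line` is a hypothetical helper name.

```python
import re

# Pattern inferred from the ofd_grant_space_left() lines quoted in this ticket.
GRANT_RE = re.compile(
    r"(?P<ost>\S+): cli \S+ left (?P<left>\d+) <\s+tot_grant (?P<tot_grant>\d+) "
    r"unstable (?P<unstable>\d+) pending (?P<pending>\d+)"
)

COUNTERS = ("left", "tot_grant", "unstable", "pending")


def parse_grant_line(line):
    """Return the OST name and counters from one console line, or None."""
    match = GRANT_RE.search(line)
    if match is None:
        return None  # e.g. "Skipped 8 previous similar messages"
    record = {"ost": match.group("ost")}
    record.update({name: int(match.group(name)) for name in COUNTERS})
    return record


# Two lines copied from the OST console output in this ticket.
console = [
    "08:52:38:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) "
    "lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 "
    "left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880",
    "08:52:39:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) "
    "Skipped 8 previous similar messages",
]

records = [parse_grant_line(line) for line in console]
for record in records:
    print(record)
```

Lines that do not carry counters, such as the "Skipped ... similar messages" notices, come back as None so they can be filtered out before any further analysis.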

            Just for clarity -28 is:

            #define ENOSPC 28 /* No space left on device */

            keith Keith Mannthey (Inactive) added a comment
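
The same lookup can be done without opening the header file. This is a minimal Python sketch that maps the value from the client log back to its symbolic name; it relies on the kernel convention of reporting errors as negated errno values, and the text returned by `os.strerror` depends on the platform.

```python
import errno
import os

# Value taken from "Write page 46696 of inode ... failed -28" in the client log.
code = -28

# Kernel code reports errors as negative errno values, so negate before lookup.
name = errno.errorcode[-code]
message = os.strerror(-code)

print(f"{code} -> {name}: {message}")
```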

            People

              jamesanunez James Nunez (Inactive)
              niu Niu Yawei (Inactive)