Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3633

sanity.sh test_101d failed for 'dd failed'

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.1, Lustre 2.5.0
    • Lustre 2.5.0
    • None
    • 3
    • 9355

    Description

      sanity.sh test_101d 'dd failed' for not enough space.

      https://maloo.whamcloud.com/sub_tests/b1487964-f4b3-11e2-b8a2-52540035b04c

      console log from client:

      08:52:27:Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================== 08:52:25 (1374681145)
      08:52:38:LustreError: 24293:0:(vvp_io.c:1094:vvp_io_commit_write()) Write page 46696 of inode ffff880062769b78 failed -28
      08:52:38:LustreError: 24293:0:(vvp_io.c:1094:vvp_io_commit_write()) Write page 46696 of inode ffff880062769b78 failed -28
      08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_101d: @@@@@@ FAIL: dd failed 
      08:52:39:Lustre: DEBUG MARKER: sanity test_101d: @@@@@@ FAIL: dd failed
      08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2013-07-24/lustre-reviews-el6-x86_64--review--1_2_1__16787__-69987943176880-074027/sanity.test_101d.debug_log.$(hostname -s).1374681155.log;
      08:52:39:         dmesg > /logdir/test_logs/2013-07-24/lustre-reviews-el6-x86_64--rev
      08:52:50:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 101e: check read-ahead for small read\(1k\) for small files\(500k\) == 08:52:40 \(1374681160\)
      

      Attachments

        Issue Links

          Activity

            [LU-3633] sanity.sh test_101d failed for 'dd failed'
            sarah Sarah Liu added a comment -

            Hit this issue when running interop test between 2.4.0 client and 2.5 server:

            https://maloo.whamcloud.com/test_sets/bfe9f440-09ec-11e3-ad93-52540035b04c

            sarah Sarah Liu added a comment - Hit this issue when running interop test between 2.4.0 client and 2.5 server: https://maloo.whamcloud.com/test_sets/bfe9f440-09ec-11e3-ad93-52540035b04c

            Landed for 2.5

            jamesanunez James Nunez (Inactive) added a comment - Landed for 2.5

            This ticket is stopping the landing of the last 2 HSM tickets, and thereby blocking the feature freeze. Escalating to a blocker.

            doug Doug Oucharek (Inactive) added a comment - This ticket is stopping the landing of the last 2 HSM tickets, and thereby blocking the feature freeze. Escalating to a blocker.

            Both LU-3640 and this ticket are a result of a single OST filling.

            jamesanunez James Nunez (Inactive) added a comment - Both LU-3640 and this ticket are a result of a single OST filling.

            I see I am just reading the error messages wrong then. Thanks for the information.

            keith Keith Mannthey (Inactive) added a comment - I see I am just reading the error messages wrong then. Thanks for the information.

            I don't know this code well but it I would sort of assume that left + tot_grant == Total resource in the pool.

            I think left + tot_grant == Total unused space.

            We see the total grant drastically shrink and the "left" go to 0 ie nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong.

            I think it's possible, the space allocated for file isn't included in the 'left + tot_granted'.

            niu Niu Yawei (Inactive) added a comment - I don't know this code well but it I would sort of assume that left + tot_grant == Total resource in the pool. I think left + tot_grant == Total unused space. We see the total grant drastically shrink and the "left" go to 0 ie nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong. I think it's possible, the space allocated for file isn't included in the 'left + tot_granted'.

            James your patch looks like a good test change to me. I am likely just reading the error messages wrong.

            keith Keith Mannthey (Inactive) added a comment - James your patch looks like a good test change to me. I am likely just reading the error messages wrong.
            jamesanunez James Nunez (Inactive) added a comment - - edited

            Proposed patch at

            http://review.whamcloud.com/#/c/7179

            This patch only fixes the issue that Niu brought up about test 101d. It does not address any issues that may exist with grants.

            jamesanunez James Nunez (Inactive) added a comment - - edited Proposed patch at http://review.whamcloud.com/#/c/7179 This patch only fixes the issue that Niu brought up about test 101d. It does not address any issues that may exist with grants.
            keith Keith Mannthey (Inactive) added a comment - - edited

            Nui,
            The grant messages don't seem to line up for me:

             left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880
             left 41365504 < tot_grant 56705280 unstable 5242880 pending 3145728
             left 20393984 < tot_grant 37220608 unstable 4194304 pending 4194304
             left 0 <        tot_grant 16392448 unstable 4194304 pending 4194304
            

            I don't know this code well but it I would sort of assume that left + tot_grant == Total resource in the pool.

            We see the total grant drastically shrink and the "left" go to 0 ie nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong.

            keith Keith Mannthey (Inactive) added a comment - - edited Nui, The grant messages don't seem to line up for me: left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880 left 41365504 < tot_grant 56705280 unstable 5242880 pending 3145728 left 20393984 < tot_grant 37220608 unstable 4194304 pending 4194304 left 0 < tot_grant 16392448 unstable 4194304 pending 4194304 I don't know this code well but it I would sort of assume that left + tot_grant == Total resource in the pool. We see the total grant drastically shrink and the "left" go to 0 ie nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong.

            Seems like the test fs is running out of space, I'm not sure what consumed so much space, but I think the test script is simply wrong, the test 101d checks if the whole fs avail space is greater than 'size' first, then it tries to write 'size' to a file, however, the file isn't necessary full stripe file, so it's quite possible that the write failed for out of space.

            niu Niu Yawei (Inactive) added a comment - Seems like the test fs is running out of space, I'm not sure what consumed so much space, but I think the test script is simply wrong, the test 101d checks if the whole fs avail space is greater than 'size' first, then it tries to write 'size' to a file, however, the file isn't necessary full stripe file, so it's quite possible that the write failed for out of space.

            People

              jamesanunez James Nunez (Inactive)
              niu Niu Yawei (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: