[LU-3906] Failure on test suite parallel-scale test_compilebench: IOError, No space left on device

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0
    • None
    • Environment: server and client: lustre-master build #1652
    • 3
    • 10301

    Description

      https://maloo.whamcloud.com/test_sets/70ec74de-15b9-11e3-8938-52540035b04c

      client console shows:

      11:00:39:Lustre: DEBUG MARKER: == parallel-scale test compilebench: compilebench == 11:00:31 (1378317631)
      11:00:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark free space=1194928, reducing initial dirs to 1
      11:00:40:Lustre: DEBUG MARKER: free space=1194928, reducing initial dirs to 1
      11:00:40:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 1         -r 2 --makej
      11:00:40:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 1 -r 2 --makej
      11:08:26:LustreError: 8551:0:(vvp_io.c:1078:vvp_io_commit_write()) Write page 3250 of inode ffff88001916d1b8 failed -28
      11:08:28:LustreError: 8551:0:(vvp_io.c:1078:vvp_io_commit_write()) Write page 3250 of inode ffff88001916d1b8 failed -28
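The `failed -28` in the client console and the `[Errno 28]` in the compilebench traceback quoted in the comments are the same error: kernel code reports errno values as negative numbers, and 28 is ENOSPC on Linux. A minimal sketch for decoding such a return code follows; the helper name `describe_kernel_rc` is made up for illustration and is not part of Lustre or its test suite.

```python
import errno
import os


def describe_kernel_rc(rc: int) -> str:
    """Map a kernel-style negative return code (e.g. -28) to its errno name and message.

    Hypothetical helper for reading console logs; not a Lustre API.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


if __name__ == "__main__":
    # The value logged by vvp_io_commit_write() in the console output above.
    print(describe_kernel_rc(-28))
```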
      

      OST console shows:

      11:00:42:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 1         -r 2 --makej
      11:00:43:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 1 -r 2 --makej
      11:05:07:LustreError: 20425:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0004: cli lustre-OST0004_UUID/ffff880021d8d000 left 44863488 < tot_grant 47737472 unstable 0 pending 0
      11:05:07:LustreError: 20425:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 6 previous similar messages
      11:05:07:LustreError: 21683:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0004: cli 009dd603-5497-a62c-77c6-19fda0311814/ffff880021d8ec00 left 44863488 < tot_grant 47735680 unstable 0 pending 0
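Reading the field names in the `ofd_grant_space_left()` messages literally, the OST reports that the space it has left is smaller than the total grant it has already handed out to clients. The sketch below pulls the numbers out of such a line and computes the difference. The regular expression is written against the two lines quoted here only, and the function name `grant_shortfall` is an assumption for illustration.

```python
import re
from typing import Optional

# Matches the tail of the ofd_grant_space_left() console message quoted above.
GRANT_RE = re.compile(r"left (\d+) < tot_grant (\d+) unstable (\d+) pending (\d+)")


def grant_shortfall(line: str) -> Optional[int]:
    """Return tot_grant - left in bytes, or None if the line is not a grant message."""
    match = GRANT_RE.search(line)
    if match is None:
        return None
    left, tot_grant, _unstable, _pending = (int(g) for g in match.groups())
    return tot_grant - left


if __name__ == "__main__":
    sample = (
        "11:05:07:LustreError: 20425:0:(ofd_grant.c:255:ofd_grant_space_left()) "
        "lustre-OST0004: cli lustre-OST0004_UUID/ffff880021d8d000 "
        "left 44863488 < tot_grant 47737472 unstable 0 pending 0"
    )
    print(grant_shortfall(sample))
```

For the first message this gives a shortfall of 2873984 bytes, i.e. a little under 3 MB more granted than the OST has left.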
      

          Activity

            yujian Jian Yu added a comment -

            The current run_compilebench() uses lfs_df to get the free disk space. However, run_compilebench() is also run on NFS clients, which have no Lustre filesystem mounted, so we need to change lfs_df to df. Here are the patches:
            For master branch: http://review.whamcloud.com/8429
            For b2_4 branch: http://review.whamcloud.com/8430
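To illustrate the point of the change (this is not the code from the patches above, which modify the shell test framework): a free-space query that goes through the generic filesystem interface works the same way on a Lustre mount and on an NFS mount, whereas a Lustre-specific query does not. A minimal Python sketch using `os.statvfs`, which exposes the same counters that `df` reads; the function name `free_kb` is an assumption.

```python
import os


def free_kb(path: str) -> int:
    """Free space available to unprivileged users under `path`, in KB.

    Uses statvfs, so it works on any mounted filesystem (Lustre, NFS, local),
    not only on Lustre. Illustrative only; not the test-framework code.
    """
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize // 1024


if __name__ == "__main__":
    print(f"free space={free_kb('.')}")
```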

            pjones Peter Jones added a comment -

            Landed for 2.6

            yujian Jian Yu added a comment -

            Patch landed on Lustre b2_4 branch. The real issue is LU-3522, which still needs to be fixed.

            yujian Jian Yu added a comment -

            Here is the patch for the Lustre b2_4 branch to fix the space estimation code in run_compilebench(): http://review.whamcloud.com/8288

            yujian Jian Yu added a comment -

            Lustre build: http://build.whamcloud.com/job/lustre-b2_4/50/
            Distro/Arch: RHEL6.4/x86_64

            FSTYPE=zfs
            MDSCOUNT=1
            MDSSIZE=2097152
            OSTCOUNT=2
            OSTSIZE=2097152
            PTLDEBUG=-1
            DEBUG_SIZE=128

            When the parallel-scale compilebench test hit the "No space left on device" failure, the space usage of the Lustre filesystem was as follows:

            create dir kernel-0 222MB in 344.06 seconds (0.65 MB/s)
            Traceback (most recent call last):
              File "./compilebench", line 576, in <module>
                mbs = run_directory(dset.unpatched, dirname, "create dir")
              File "./compilebench", line 245, in run_directory
                fp.close()
            IOError: [Errno 28] No space left on device
            
            du -sh /mnt/lustre/*
            27M	/mnt/lustre/d0.compilebench
            
            du -sh /mnt/lustre/d0.compilebench/*
            19M	/mnt/lustre/d0.compilebench/kernel-0
            7.6M	/mnt/lustre/d0.compilebench/kernel-1
            
            lfs df -i
            UUID                      Inodes       IUsed       IFree IUse% Mounted on
            lustre-MDT0000_UUID      1149005       32051     1116954   3% /mnt/lustre[MDT:0]
            lustre-OST0000_UUID        16053       15332         721  96% /mnt/lustre[OST:0]
            lustre-OST0001_UUID        16080       15297         783  95% /mnt/lustre[OST:1]
            
            filesystem summary:      1149005       32051     1116954   3% /mnt/lustre
            
            
            lfs df -h
            UUID                       bytes        Used   Available Use% Mounted on
            lustre-MDT0000_UUID         2.0G       53.6M        1.9G   3% /mnt/lustre[MDT:0]
            lustre-OST0000_UUID         2.0G        1.9G       88.1M  96% /mnt/lustre[OST:0]
            lustre-OST0001_UUID         2.0G        1.9G       93.9M  95% /mnt/lustre[OST:1]
            
            filesystem summary:         3.9G        3.7G      182.0M  95% /mnt/lustre
            
            
             parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1 
            

            Maloo report: https://maloo.whamcloud.com/test_sets/082a9faa-4db5-11e3-8fb6-52540035b04c

            yujian Jian Yu added a comment -

            After running the compilebench test manually on the master branch, I found that one kernel directory requires about 1GB of space instead of 680MB. For two directories, the test consumes about 2GB, which should not fill up the 4GB of available space. So, this is a duplicate of LU-3522.

            Here is the patch for the master branch to fix the space estimation code in run_compilebench(): http://review.whamcloud.com/8258
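The effect of the per-directory estimate can be sketched as follows. This is an illustration under stated assumptions, not the patched test-framework code: it assumes the free-space figure is in KB (as in the `free space=1194928` console marker in the description), that the number of initial directories is capped at how many fit, and the function name `dirs_that_fit` is made up. Only the 680MB and roughly 1GB figures come from the comment above.

```python
def dirs_that_fit(free_kb: int, requested_dirs: int, mb_per_dir: int) -> int:
    """How many initial kernel directories fit into the free space.

    free_kb        free space in KB (assumed unit)
    requested_dirs number of initial directories the test would like to create
    mb_per_dir     estimated space per directory in MB (680 before, ~1024 measured)
    """
    if mb_per_dir <= 0:
        raise ValueError("mb_per_dir must be positive")
    fit = free_kb // (mb_per_dir * 1024)
    return min(requested_dirs, fit)


if __name__ == "__main__":
    # Free space from the console marker in the description.
    for estimate in (680, 1024):
        print(estimate, dirs_that_fit(1194928, 2, estimate))
```

With 900000 KB free, for example, a 680MB estimate would still allow one directory while a 1GB estimate allows none, which is the case where the smaller estimate lets the test start without enough room.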


            adilger Andreas Dilger added a comment -

            This is likely a duplicate of LU-3522 caused by the OST reserving too much grant for each client block.

            I'm not closing it yet, in case this is actually a problem of the test trying to write more data than will fit into the 4GB of space with 2 OSTs.

            yujian Jian Yu added a comment - - edited

            This is blocking the parallel-scale{,-nfsv3,-nfsv4} testing on ZFS. I'll check whether the OSTCOUNT=2 and OSTSIZE=2097152 configuration causes the out-of-space failure.

            sarah Sarah Liu added a comment -

            Seen in 2.5.51 testing: https://maloo.whamcloud.com/test_sets/2fb17d6c-47ea-11e3-a445-52540035b04c

            IOError: [Errno 28] No space left on device
            
            yujian Jian Yu added a comment - - edited

            Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
            Distro/Arch: RHEL6.4/x86_64

            FSTYPE=zfs
            MDSCOUNT=1
            MDSSIZE=2097152
            OSTCOUNT=2
            OSTSIZE=2097152

            The parallel-scale compilebench test failed as follows:

            create dir kernel-0 222MB in 459.56 seconds (0.48 MB/s)
            Traceback (most recent call last):
              File "./compilebench", line 576, in <module>
                mbs = run_directory(dset.unpatched, dirname, "create dir")
              File "./compilebench", line 245, in run_directory
                fp.close()
            IOError: [Errno 28] No space left on device
             parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1 
            

            Console log on client node:

            10:44:00:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
            11:10:17:LustreError: 13106:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 3 of inode ffff880032124678 failed -28
            11:10:18:LustreError: 13106:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 3 of inode ffff880032124678 failed -28
            

            Maloo report: https://maloo.whamcloud.com/test_sets/39210a1a-4453-11e3-8472-52540035b04c

            parallel-scale-nfsv3 and parallel-scale-nfsv4 also failed:
            https://maloo.whamcloud.com/test_sets/ce8cbd1a-4453-11e3-8472-52540035b04c
            https://maloo.whamcloud.com/test_sets/fc337542-4453-11e3-8472-52540035b04c

            Hi Lai,
            Is this similar to LU-3522?


            People

              yujian Jian Yu
              sarah Sarah Liu