[LU-3906] Failure on test suite parallel-scale test_compilebench: IOError, No space left on device Created: 08/Sep/13  Updated: 31/Dec/13  Resolved: 23/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None
Environment:

server and client: lustre-master build #1652


Issue Links:
Duplicate
duplicates LU-3522 sanity-benchmark test_iozone: "no spa... Resolved
is duplicated by LU-3912 Failure on test suite sanity-quota te... Resolved
is duplicated by LU-3913 Failure on test suite recovery-mds-sc... Resolved
Related
is related to LU-3909 Interop 2.4.0<->2.5 failure on test s... Resolved
is related to LU-3904 Failure on test suite parallel-scale ... Closed
is related to LU-3905 Failure on test suite parallel-scale ... Closed
Severity: 3
Rank (Obsolete): 10301

 Description   

https://maloo.whamcloud.com/test_sets/70ec74de-15b9-11e3-8938-52540035b04c

client console shows:

11:00:39:Lustre: DEBUG MARKER: == parallel-scale test compilebench: compilebench == 11:00:31 (1378317631)
11:00:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark free space=1194928, reducing initial dirs to 1
11:00:40:Lustre: DEBUG MARKER: free space=1194928, reducing initial dirs to 1
11:00:40:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 1         -r 2 --makej
11:00:40:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 1 -r 2 --makej
11:08:26:LustreError: 8551:0:(vvp_io.c:1078:vvp_io_commit_write()) Write page 3250 of inode ffff88001916d1b8 failed -28
11:08:28:LustreError: 8551:0:(vvp_io.c:1078:vvp_io_commit_write()) Write page 3250 of inode ffff88001916d1b8 failed -28
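
The "failed -28" in the client console follows the kernel convention of returning a negative errno; 28 is ENOSPC, the same "No space left on device" error named in the issue title. A minimal Python sketch for decoding such return codes (the helper name decode_kernel_rc is hypothetical and not part of Lustre):

```python
import errno
import os


def decode_kernel_rc(rc):
    """Map a negative kernel return code (e.g. -28) to (errno name, message)."""
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = decode_kernel_rc(-28)
print(name, "-", message)
```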

OST console shows:

11:00:42:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 1         -r 2 --makej
11:00:43:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 1 -r 2 --makej
11:05:07:LustreError: 20425:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0004: cli lustre-OST0004_UUID/ffff880021d8d000 left 44863488 < tot_grant 47737472 unstable 0 pending 0
11:05:07:LustreError: 20425:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 6 previous similar messages
11:05:07:LustreError: 21683:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0004: cli 009dd603-5497-a62c-77c6-19fda0311814/ffff880021d8ec00 left 44863488 < tot_grant 47735680 unstable 0 pending 0
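
Read literally, the ofd_grant_space_left() messages say that the space left on the OST is smaller than the total grant already promised to clients. A small Python sketch, assuming only the message layout shown above, extracts the two numbers and reports the shortfall in bytes (the function name grant_shortfall is hypothetical):

```python
import re

# Matches the "left <bytes> < tot_grant <bytes>" portion of the console message.
GRANT_RE = re.compile(r"left (\d+) < tot_grant (\d+)")


def grant_shortfall(line):
    """Return tot_grant - left in bytes, or None if the line does not match."""
    match = GRANT_RE.search(line)
    if match is None:
        return None
    left, tot_grant = int(match.group(1)), int(match.group(2))
    return tot_grant - left


line = ("lustre-OST0004: cli lustre-OST0004_UUID/ffff880021d8d000 "
        "left 44863488 < tot_grant 47737472 unstable 0 pending 0")
print(grant_shortfall(line))  # 2873984 bytes, roughly 2.7 MiB
```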


 Comments   
Comment by Jian Yu [ 04/Nov/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
Distro/Arch: RHEL6.4/x86_64

FSTYPE=zfs
MDSCOUNT=1
MDSSIZE=2097152
OSTCOUNT=2
OSTSIZE=2097152

parallel-scale test compilebench failed as follows:

create dir kernel-0 222MB in 459.56 seconds (0.48 MB/s)
Traceback (most recent call last):
  File "./compilebench", line 576, in <module>
    mbs = run_directory(dset.unpatched, dirname, "create dir")
  File "./compilebench", line 245, in run_directory
    fp.close()
IOError: [Errno 28] No space left on device
 parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1 

Console log on client node:

10:44:00:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
11:10:17:LustreError: 13106:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 3 of inode ffff880032124678 failed -28
11:10:18:LustreError: 13106:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 3 of inode ffff880032124678 failed -28

Maloo report: https://maloo.whamcloud.com/test_sets/39210a1a-4453-11e3-8472-52540035b04c

parallel-scale-nfsv3 and parallel-scale-nfsv4 also failed:
https://maloo.whamcloud.com/test_sets/ce8cbd1a-4453-11e3-8472-52540035b04c
https://maloo.whamcloud.com/test_sets/fc337542-4453-11e3-8472-52540035b04c

Hi Lai,
Is this similar to LU-3522?

Comment by Sarah Liu [ 11/Nov/13 ]

seen in 2.5.51 testing: https://maloo.whamcloud.com/test_sets/2fb17d6c-47ea-11e3-a445-52540035b04c


IOError: [Errno 28] No space left on device
Comment by Jian Yu [ 12/Nov/13 ]

This is blocking the parallel-scale{,-nfsv3,-nfsv4} testing on ZFS. I'll check whether the OSTCOUNT=2 and OSTSIZE=2097152 configuration causes the out-of-space failure.

Comment by Andreas Dilger [ 12/Nov/13 ]

This is likely a duplicate of LU-3522 caused by the OST reserving too much grant for each client block.

I'm not closing it yet, in case this is actually a problem of the test trying to write more data than will fit into the 4GB of space with 2 OSTs.

Comment by Jian Yu [ 13/Nov/13 ]

After running the compilebench test manually on the master branch, I found that the space required for one kernel directory was about 1GB, not 680MB. For two directories, the test will consume about 2GB of space, which should not fill up 4GB of space. So, this is a duplicate of LU-3522.

Here is the patch for the master branch to fix the space estimation code in run_compilebench(): http://review.whamcloud.com/8258
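
The arithmetic behind that conclusion can be sketched as follows. This is an illustrative Python model, not the shell code in run_compilebench(): the function name fit_initial_dirs is hypothetical, the 1GB per-directory figure comes from the manual measurement above, and free space is assumed to be expressed in KB.

```python
def fit_initial_dirs(free_kb, requested_dirs, per_dir_kb=1024 * 1024):
    """Return how many kernel directories fit in free_kb, capped at requested_dirs.

    per_dir_kb defaults to 1 GB (in KB), the measured size of one directory.
    """
    return min(requested_dirs, free_kb // per_dir_kb)


# Two directories (about 2GB) fit comfortably in 4GB of space.
print(fit_initial_dirs(4 * 1024 * 1024, 2))

# With the free space from the original console log (1194928), only one fits.
print(fit_initial_dirs(1194928, 2))
```

A result of 0 would mean the test cannot run at all with the available space.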

Comment by Jian Yu [ 15/Nov/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/50/
Distro/Arch: RHEL6.4/x86_64

FSTYPE=zfs
MDSCOUNT=1
MDSSIZE=2097152
OSTCOUNT=2
OSTSIZE=2097152
PTLDEBUG=-1
DEBUG_SIZE=128

When parallel-scale test compilebench hit the "No space left on device" failure, the space usage of the Lustre filesystem was as follows:

create dir kernel-0 222MB in 344.06 seconds (0.65 MB/s)
Traceback (most recent call last):
  File "./compilebench", line 576, in <module>
    mbs = run_directory(dset.unpatched, dirname, "create dir")
  File "./compilebench", line 245, in run_directory
    fp.close()
IOError: [Errno 28] No space left on device

du -sh /mnt/lustre/*
27M	/mnt/lustre/d0.compilebench

du -sh /mnt/lustre/d0.compilebench/*
19M	/mnt/lustre/d0.compilebench/kernel-0
7.6M	/mnt/lustre/d0.compilebench/kernel-1

lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID      1149005       32051     1116954   3% /mnt/lustre[MDT:0]
lustre-OST0000_UUID        16053       15332         721  96% /mnt/lustre[OST:0]
lustre-OST0001_UUID        16080       15297         783  95% /mnt/lustre[OST:1]

filesystem summary:      1149005       32051     1116954   3% /mnt/lustre


lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID         2.0G       53.6M        1.9G   3% /mnt/lustre[MDT:0]
lustre-OST0000_UUID         2.0G        1.9G       88.1M  96% /mnt/lustre[OST:0]
lustre-OST0001_UUID         2.0G        1.9G       93.9M  95% /mnt/lustre[OST:1]

filesystem summary:         3.9G        3.7G      182.0M  95% /mnt/lustre


 parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1 

Maloo report: https://maloo.whamcloud.com/test_sets/082a9faa-4db5-11e3-8fb6-52540035b04c
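
To spot nearly full OSTs in captures like the "lfs df -h" output above, a small parser can help. The Python sketch below assumes the six-column layout shown in this comment; the function name full_osts and the 90% threshold are hypothetical choices.

```python
def full_osts(lfs_df_text, threshold=90):
    """Return (UUID, available) for each OST whose Use% is at or above threshold."""
    result = []
    for line in lfs_df_text.splitlines():
        fields = line.split()
        # Target rows have exactly six fields; the header and summary rows have seven.
        if len(fields) != 6 or "[OST:" not in fields[5]:
            continue
        if not fields[4].endswith("%"):
            continue
        if int(fields[4].rstrip("%")) >= threshold:
            result.append((fields[0], fields[3]))
    return result


sample = """\
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID         2.0G       53.6M        1.9G   3% /mnt/lustre[MDT:0]
lustre-OST0000_UUID         2.0G        1.9G       88.1M  96% /mnt/lustre[OST:0]
lustre-OST0001_UUID         2.0G        1.9G       93.9M  95% /mnt/lustre[OST:1]

filesystem summary:         3.9G        3.7G      182.0M  95% /mnt/lustre
"""
print(full_osts(sample))
```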

Comment by Jian Yu [ 15/Nov/13 ]

Here is the patch for the Lustre b2_4 branch to fix the space estimation code in run_compilebench(): http://review.whamcloud.com/8288

Comment by Jian Yu [ 22/Nov/13 ]

Patch landed on Lustre b2_4 branch. The real issue is LU-3522, which still needs to be fixed.

Comment by Peter Jones [ 23/Nov/13 ]

Landed for 2.6

Comment by Jian Yu [ 28/Nov/13 ]

The current run_compilebench() uses lfs_df to get the free disk space. However, run_compilebench() is also run on NFS clients, which have no Lustre filesystem, so we need to change lfs_df to df. Here are the patches:
For master branch: http://review.whamcloud.com/8429
For b2_4 branch: http://review.whamcloud.com/8430
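
The idea of reading free space with plain df, which works on both Lustre and NFS mounts, can be sketched in Python as below. This is not the code from the patches: the function names are hypothetical, and the sample text uses made-up values purely for illustration. It relies on the POSIX "df -P" layout, where the fourth column is the available space in 1024-byte blocks.

```python
import subprocess


def free_kb_from_df(df_output):
    """Return the Available column (1024-byte blocks) from `df -P` output."""
    last_line = df_output.strip().splitlines()[-1]
    return int(last_line.split()[3])


def free_kb(path):
    """Run `df -P` on path and return its available space in KB."""
    result = subprocess.run(["df", "-P", path], check=True,
                            capture_output=True, text=True)
    return free_kb_from_df(result.stdout)


# Made-up sample output for illustration only.
sample = """\
Filesystem     1024-blocks    Used Available Capacity Mounted on
server:/export     4089446 3878912    210534      95% /mnt/lustre
"""
print(free_kb_from_df(sample))
```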
