[LU-3906] Failure on test suite parallel-scale test_compilebench: IOError, No space left on device Created: 08/Sep/13 Updated: 31/Dec/13 Resolved: 23/Nov/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sarah Liu | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server and client: lustre-master build #1652 |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10301 |
| Description |
|
https://maloo.whamcloud.com/test_sets/70ec74de-15b9-11e3-8938-52540035b04c
client console shows:
11:00:39:Lustre: DEBUG MARKER: == parallel-scale test compilebench: compilebench == 11:00:31 (1378317631)
11:00:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark free space=1194928, reducing initial dirs to 1
11:00:40:Lustre: DEBUG MARKER: free space=1194928, reducing initial dirs to 1
11:00:40:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 1 -r 2 --makej
11:00:40:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 1 -r 2 --makej
11:08:26:LustreError: 8551:0:(vvp_io.c:1078:vvp_io_commit_write()) Write page 3250 of inode ffff88001916d1b8 failed -28
11:08:28:LustreError: 8551:0:(vvp_io.c:1078:vvp_io_commit_write()) Write page 3250 of inode ffff88001916d1b8 failed -28
OST console shows:
11:00:42:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 1 -r 2 --makej
11:00:43:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 1 -r 2 --makej
11:05:07:LustreError: 20425:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0004: cli lustre-OST0004_UUID/ffff880021d8d000 left 44863488 < tot_grant 47737472 unstable 0 pending 0
11:05:07:LustreError: 20425:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 6 previous similar messages
11:05:07:LustreError: 21683:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0004: cli 009dd603-5497-a62c-77c6-19fda0311814/ffff880021d8ec00 left 44863488 < tot_grant 47735680 unstable 0 pending 0 |
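For triage, it can help to see how far the granted space exceeded what the OST had left. The sketch below pulls the `left` and `tot_grant` values out of an `ofd_grant_space_left()` message like the ones above and reports the difference in bytes. The helper name `grant_shortfall` and the regular expression are assumptions made for illustration; they are not part of Lustre or its test framework.

```python
import re

# Assumed pattern, derived only from the message text quoted in this ticket.
GRANT_RE = re.compile(r"left (\d+) < tot_grant (\d+) unstable (\d+) pending (\d+)")

def grant_shortfall(line):
    """Return tot_grant - left in bytes, or None if the line does not match."""
    match = GRANT_RE.search(line)
    if match is None:
        return None
    left = int(match.group(1))
    tot_grant = int(match.group(2))
    return tot_grant - left

# First OST console message from the description, with the prefix trimmed.
line = ("lustre-OST0004: cli lustre-OST0004_UUID/ffff880021d8d000 "
        "left 44863488 < tot_grant 47737472 unstable 0 pending 0")
shortfall = grant_shortfall(line)
print(shortfall)  # 2873984
```

For the first message the difference is 2873984 bytes, so the OST had promised clients slightly more space than it still had available.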
| Comments |
| Comment by Jian Yu [ 04/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
FSTYPE=zfs
parallel-scale test compilebench failed as follows:
create dir kernel-0 222MB in 459.56 seconds (0.48 MB/s)
Traceback (most recent call last):
File "./compilebench", line 576, in <module>
mbs = run_directory(dset.unpatched, dirname, "create dir")
File "./compilebench", line 245, in run_directory
fp.close()
IOError: [Errno 28] No space left on device
parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1
Console log on client node:
10:44:00:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
11:10:17:LustreError: 13106:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 3 of inode ffff880032124678 failed -28
11:10:18:LustreError: 13106:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 3 of inode ffff880032124678 failed -28
Maloo report: https://maloo.whamcloud.com/test_sets/39210a1a-4453-11e3-8472-52540035b04c
parallel-scale-nfsv3 and parallel-scale-nfsv4 also failed:
Hi Lai, |
| Comment by Sarah Liu [ 11/Nov/13 ] |
|
Seen in 2.5.51 testing: https://maloo.whamcloud.com/test_sets/2fb17d6c-47ea-11e3-a445-52540035b04c
IOError: [Errno 28] No space left on device |
| Comment by Jian Yu [ 12/Nov/13 ] |
|
This is blocking the parallel-scale{,-nfsv3,-nfsv4} testing on ZFS. I'll check whether the OSTCOUNT=2 and OSTSIZE=2097152 configuration causes the out-of-space failure. |
| Comment by Andreas Dilger [ 12/Nov/13 ] |
|
This is likely a duplicate of I'm not closing it yet, in case this is actually a problem of the test trying to write more data than will fit into the 4GB of space with 2 OSTs. |
| Comment by Jian Yu [ 13/Nov/13 ] |
|
After running the compilebench test manually on the master branch, I found that the space required for one kernel directory was about 1GB instead of 680MB. For two directories, the test will consume about 2GB of space, which should not fill up the 4GB of available space. So, this is a duplicate of
Here is the patch for the master branch to fix the space estimation code in run_compilebench(): http://review.whamcloud.com/8258 |
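The arithmetic in the comment above can be sketched as follows. This is only an illustration of the estimate, not the code in run_compilebench(): the function name `dirs_that_fit` is hypothetical, sizes are assumed to be in KB, and the per-directory footprint defaults to the roughly 1GB figure measured above.

```python
def dirs_that_fit(free_kb, per_dir_kb=1024 * 1024, requested=2):
    """Largest directory count, capped at `requested`, whose estimated
    footprint (per_dir_kb each) fits into free_kb."""
    return max(0, min(requested, free_kb // per_dir_kb))

# Two OSTs of 2097152 KB each give about 4GB in total, so both
# requested kernel directories (about 2GB) fit.
print(dirs_that_fit(2 * 2097152))

# With the "free space=1194928" value from the description (assumed KB),
# only one directory fits.
print(dirs_that_fit(1194928))
```

Capping at the requested count keeps the estimate from ever raising the number of initial directories above what the test asked for.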
| Comment by Jian Yu [ 15/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/50/
FSTYPE=zfs
When parallel-scale test compilebench hit the "No space left on device" failure, the space usage of the Lustre filesystem was as follows:
create dir kernel-0 222MB in 344.06 seconds (0.65 MB/s)
Traceback (most recent call last):
File "./compilebench", line 576, in <module>
mbs = run_directory(dset.unpatched, dirname, "create dir")
File "./compilebench", line 245, in run_directory
fp.close()
IOError: [Errno 28] No space left on device
du -sh /mnt/lustre/*
27M /mnt/lustre/d0.compilebench
du -sh /mnt/lustre/d0.compilebench/*
19M /mnt/lustre/d0.compilebench/kernel-0
7.6M /mnt/lustre/d0.compilebench/kernel-1
lfs df -i
UUID Inodes IUsed IFree IUse% Mounted on
lustre-MDT0000_UUID 1149005 32051 1116954 3% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 16053 15332 721 96% /mnt/lustre[OST:0]
lustre-OST0001_UUID 16080 15297 783 95% /mnt/lustre[OST:1]
filesystem summary: 1149005 32051 1116954 3% /mnt/lustre
lfs df -h
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 2.0G 53.6M 1.9G 3% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 2.0G 1.9G 88.1M 96% /mnt/lustre[OST:0]
lustre-OST0001_UUID 2.0G 1.9G 93.9M 95% /mnt/lustre[OST:1]
filesystem summary: 3.9G 3.7G 182.0M 95% /mnt/lustre
parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1
Maloo report: https://maloo.whamcloud.com/test_sets/082a9faa-4db5-11e3-8fb6-52540035b04c |
| Comment by Jian Yu [ 15/Nov/13 ] |
|
Here is the patch for the Lustre b2_4 branch to fix the space estimation code in run_compilebench(): http://review.whamcloud.com/8288 |
| Comment by Jian Yu [ 22/Nov/13 ] |
|
Patch landed on Lustre b2_4 branch. The real issue is |
| Comment by Peter Jones [ 23/Nov/13 ] |
|
Landed for 2.6 |
| Comment by Jian Yu [ 28/Nov/13 ] |
|
In the current run_compilebench(), lfs_df is used to get the free disk space. However, run_compilebench() is also run on NFS clients, which have no Lustre filesystem, so we need to change lfs_df to df. Here are the patches: |
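The point of switching to df is that it works on any mounted filesystem, Lustre or NFS. As a rough illustration of the same idea outside the test framework, the sketch below reads free space through Python's standard `os.statvfs`. It is not the content of the patches, and the helper name `free_space_kb` is hypothetical.

```python
import os

def free_space_kb(path):
    """Space available to unprivileged users on the filesystem that
    holds `path`, in KB, regardless of the filesystem type."""
    st = os.statvfs(path)
    # f_bavail counts free blocks for non-root users; f_frsize is the
    # fragment size in bytes that those block counts are expressed in.
    return st.f_bavail * st.f_frsize // 1024

# Works the same way for a Lustre mount point or an NFS mount point.
print(free_space_kb("."))
```

Using `f_bavail` rather than `f_bfree` mirrors the "Available" column that df prints, which excludes blocks reserved for root.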