[LU-3633] sanity.sh test_101d failed for 'dd failed' Created: 25/Jul/13 Updated: 13/Sep/13 Resolved: 07/Aug/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Niu Yawei (Inactive) | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9355 | ||||||||||||
| Description |
|
sanity.sh test_101d 'dd failed' for not enough space. https://maloo.whamcloud.com/sub_tests/b1487964-f4b3-11e2-b8a2-52540035b04c console log from client: 08:52:27:Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================== 08:52:25 (1374681145) 08:52:38:LustreError: 24293:0:(vvp_io.c:1094:vvp_io_commit_write()) Write page 46696 of inode ffff880062769b78 failed -28 08:52:38:LustreError: 24293:0:(vvp_io.c:1094:vvp_io_commit_write()) Write page 46696 of inode ffff880062769b78 failed -28 08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_101d: @@@@@@ FAIL: dd failed 08:52:39:Lustre: DEBUG MARKER: sanity test_101d: @@@@@@ FAIL: dd failed 08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2013-07-24/lustre-reviews-el6-x86_64--review--1_2_1__16787__-69987943176880-074027/sanity.test_101d.debug_log.$(hostname -s).1374681155.log; 08:52:39: dmesg > /logdir/test_logs/2013-07-24/lustre-reviews-el6-x86_64--rev 08:52:50:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 101e: check read-ahead for small read\(1k\) for small files\(500k\) == 08:52:40 \(1374681160\) |
| Comments |
| Comment by Keith Mannthey (Inactive) [ 25/Jul/13 ] |
|
Just for clarity -28 is: #define ENOSPC 28 /* No space left on device */ |
| Comment by James Nunez (Inactive) [ 25/Jul/13 ] |
|
From the OST console: 08:52:26:Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================== 08:52:25 (1374681145) 08:52:38:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880 08:52:38:LustreError: 842:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 41365504 < tot_grant 56705280 unstable 5242880 pending 3145728 08:52:39:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 20393984 < tot_grant 37220608 unstable 4194304 pending 4194304 08:52:39:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 8 previous similar messages 08:52:39:LustreError: 840:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 0 < tot_grant 16392448 unstable 4194304 pending 4194304 08:52:39:LustreError: 840:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 13 previous similar messages 08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_101d: @@@@@@ FAIL: dd failed |
| Comment by Keith Mannthey (Inactive) [ 25/Jul/13 ] |
|
There is a good set of messages in the debug log for the ost as well (just filter for grant messages for ost1). Andreas mentioned on skype this seems like a grant accounting issue. I didn't get a chance to fully parse that log yet but it looks to have very good info in it. |
| Comment by Niu Yawei (Inactive) [ 29/Jul/13 ] |
|
Seems like the test fs is running out of space, I'm not sure what consumed so much space, but I think the test script is simply wrong, the test 101d checks if the whole fs avail space is greater than 'size' first, then it tries to write 'size' to a file, however, the file isn't necessary full stripe file, so it's quite possible that the write failed for out of space. |
| Comment by Keith Mannthey (Inactive) [ 30/Jul/13 ] |
|
Nui, left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880 left 41365504 < tot_grant 56705280 unstable 5242880 pending 3145728 left 20393984 < tot_grant 37220608 unstable 4194304 pending 4194304 left 0 < tot_grant 16392448 unstable 4194304 pending 4194304 I don't know this code well but it I would sort of assume that left + tot_grant == Total resource in the pool. We see the total grant drastically shrink and the "left" go to 0 ie nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong. |
| Comment by James Nunez (Inactive) [ 30/Jul/13 ] |
|
Proposed patch at http://review.whamcloud.com/#/c/7179 This patch only fixes the issue that Niu brought up about test 101d. It does not address any issues that may exist with grants. |
| Comment by Keith Mannthey (Inactive) [ 30/Jul/13 ] |
|
James your patch looks like a good test change to me. I am likely just reading the error messages wrong. |
| Comment by Niu Yawei (Inactive) [ 31/Jul/13 ] |
I think left + tot_grant == Total unused space.
I think it's possible, the space allocated for file isn't included in the 'left + tot_granted'. |
| Comment by Keith Mannthey (Inactive) [ 31/Jul/13 ] |
|
I see I am just reading the error messages wrong then. Thanks for the information. |
| Comment by James Nunez (Inactive) [ 05/Aug/13 ] |
|
Both |
| Comment by Doug Oucharek (Inactive) [ 07/Aug/13 ] |
|
This ticket is stopping the landing of the last 2 HSM tickets, and thereby blocking the feature freeze. Escalating to a blocker. |
| Comment by James Nunez (Inactive) [ 07/Aug/13 ] |
|
Landed for 2.5 |
| Comment by Sarah Liu [ 21/Aug/13 ] |
|
Hit this issue when running interop test between 2.4.0 client and 2.5 server: https://maloo.whamcloud.com/test_sets/bfe9f440-09ec-11e3-ad93-52540035b04c |