[LU-3633] sanity.sh test_101d failed for 'dd failed' Created: 25/Jul/13  Updated: 13/Sep/13  Resolved: 07/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Blocker
Reporter: Niu Yawei (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3640 Test failure on test suite sanity, te... Resolved
is related to LU-3720 Test failure on test suite sanity, su... Resolved
Severity: 3
Rank (Obsolete): 9355

 Description   

sanity.sh test_101d 'dd failed' for not enough space.

https://maloo.whamcloud.com/sub_tests/b1487964-f4b3-11e2-b8a2-52540035b04c

console log from client:

08:52:27:Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================== 08:52:25 (1374681145)
08:52:38:LustreError: 24293:0:(vvp_io.c:1094:vvp_io_commit_write()) Write page 46696 of inode ffff880062769b78 failed -28
08:52:38:LustreError: 24293:0:(vvp_io.c:1094:vvp_io_commit_write()) Write page 46696 of inode ffff880062769b78 failed -28
08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_101d: @@@@@@ FAIL: dd failed 
08:52:39:Lustre: DEBUG MARKER: sanity test_101d: @@@@@@ FAIL: dd failed
08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2013-07-24/lustre-reviews-el6-x86_64--review--1_2_1__16787__-69987943176880-074027/sanity.test_101d.debug_log.$(hostname -s).1374681155.log;
08:52:39:         dmesg > /logdir/test_logs/2013-07-24/lustre-reviews-el6-x86_64--rev
08:52:50:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 101e: check read-ahead for small read\(1k\) for small files\(500k\) == 08:52:40 \(1374681160\)


 Comments   
Comment by Keith Mannthey (Inactive) [ 25/Jul/13 ]

Just for clarity -28 is:

#define ENOSPC 28 /* No space left on device */

Comment by James Nunez (Inactive) [ 25/Jul/13 ]

From the OST console:

08:52:26:Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================== 08:52:25 (1374681145)
08:52:38:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880
08:52:38:LustreError: 842:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 41365504 < tot_grant 56705280 unstable 5242880 pending 3145728
08:52:39:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 20393984 < tot_grant 37220608 unstable 4194304 pending 4194304
08:52:39:LustreError: 16787:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 8 previous similar messages
08:52:39:LustreError: 840:0:(ofd_grant.c:255:ofd_grant_space_left()) lustre-OST0001: cli 795f3143-5aac-1e12-ab67-46f017cf8245/ffff88007bd67000 left 0 < tot_grant 16392448 unstable 4194304 pending 4194304
08:52:39:LustreError: 840:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 13 previous similar messages
08:52:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_101d: @@@@@@ FAIL: dd failed 
Comment by Keith Mannthey (Inactive) [ 25/Jul/13 ]

There is a good set of messages in the debug log for the ost as well (just filter for grant messages for ost1). Andreas mentioned on skype this seems like a grant accounting issue.

I didn't get a chance to fully parse that log yet but it looks to have very good info in it.

Comment by Niu Yawei (Inactive) [ 29/Jul/13 ]

Seems like the test fs is running out of space, I'm not sure what consumed so much space, but I think the test script is simply wrong, the test 101d checks if the whole fs avail space is greater than 'size' first, then it tries to write 'size' to a file, however, the file isn't necessary full stripe file, so it's quite possible that the write failed for out of space.

Comment by Keith Mannthey (Inactive) [ 30/Jul/13 ]

Nui,
The grant messages don't seem to line up for me:

 left 44511232 < tot_grant 59851008 unstable 5242880 pending 5242880
 left 41365504 < tot_grant 56705280 unstable 5242880 pending 3145728
 left 20393984 < tot_grant 37220608 unstable 4194304 pending 4194304
 left 0 <        tot_grant 16392448 unstable 4194304 pending 4194304

I don't know this code well but it I would sort of assume that left + tot_grant == Total resource in the pool.

We see the total grant drastically shrink and the "left" go to 0 ie nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong.

Comment by James Nunez (Inactive) [ 30/Jul/13 ]

Proposed patch at

http://review.whamcloud.com/#/c/7179

This patch only fixes the issue that Niu brought up about test 101d. It does not address any issues that may exist with grants.

Comment by Keith Mannthey (Inactive) [ 30/Jul/13 ]

James your patch looks like a good test change to me. I am likely just reading the error messages wrong.

Comment by Niu Yawei (Inactive) [ 31/Jul/13 ]

I don't know this code well but it I would sort of assume that left + tot_grant == Total resource in the pool.

I think left + tot_grant == Total unused space.

We see the total grant drastically shrink and the "left" go to 0 ie nothing close to a static resource pool. Where did all the resource go? Maybe I am just looking at these error messages wrong.

I think it's possible, the space allocated for file isn't included in the 'left + tot_granted'.

Comment by Keith Mannthey (Inactive) [ 31/Jul/13 ]

I see I am just reading the error messages wrong then. Thanks for the information.

Comment by James Nunez (Inactive) [ 05/Aug/13 ]

Both LU-3640 and this ticket are a result of a single OST filling.

Comment by Doug Oucharek (Inactive) [ 07/Aug/13 ]

This ticket is stopping the landing of the last 2 HSM tickets, and thereby blocking the feature freeze. Escalating to a blocker.

Comment by James Nunez (Inactive) [ 07/Aug/13 ]

Landed for 2.5

Comment by Sarah Liu [ 21/Aug/13 ]

Hit this issue when running interop test between 2.4.0 client and 2.5 server:

https://maloo.whamcloud.com/test_sets/bfe9f440-09ec-11e3-ad93-52540035b04c

Generated at Sat Feb 10 01:35:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.