[LU-3640] Test failure on test suite sanity, test_116a, when OST is full Created: 25/Jul/13  Updated: 16/Oct/13  Resolved: 27/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3633 sanity.sh test_101d failed for 'dd fa... Resolved
Severity: 3
Rank (Obsolete): 9374

 Description   

This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/92e7707e-f4b3-11e2-b8a2-52540035b04c.

The sub-test test_116a failed with the following error:

test_116a returned 1

When an OST fills, test 116a finds the OST with the minimum available space and tries to fill 25% of that OST's remaining space, but it never checks whether there is any space available at all.

This test should check whether the minimum available space on any OST is zero and, if so, exit/skip.
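
For illustration, here is a minimal sketch of the kind of guard this implies, assuming hypothetical variable names MINV and MAXV for the smallest and largest per-OST free space (the real sanity.sh variables and its skip mechanism may differ):

# hypothetical guard before the percentage calculation in test_116a
MINV=0        # smallest available space across the OSTs, in KB
MAXV=172520   # largest available space across the OSTs, in KB
if [ "$MINV" -eq 0 ]; then
    echo "an OST is already full, skipping test_116a"   # sanity.sh would use its own skip helper
    return 0
fi
echo "ratio: $((MAXV * 100 / MINV))"   # the expression that fails when MINV is 0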

The client test_log looks like:

== sanity test 116a: stripe QOS: free space balance ===================== 22:17:35 (1374729455)
Free space priority error: get_param: /proc/{fs,sys}/{lnet,lustre}/lov/*-clilov-*/qos_prio_free: Found no match
CMD: client-26vm3 lctl set_param -n osd*.*MD*.force_sync 1
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
Waiting for local destroys to complete
OST kbytes available: 164288 172508 172284 0 164312 172300 162240
Min free space: OST 3: 0
Max free space: OST 1: 172508
Filling 25% remaining space in OST3 with 0Kb
CMD: client-26vm3 lctl get_param -n lov.*.qos_maxage
Waiting for local destroys to complete
OST kbytes available: 164068 172508 172504 0 164312 172520 162016
Min free space: OST 3: 0
Max free space: OST 5: 172520
/usr/lib64/lustre/tests/sanity.sh: line 6604: 172520 * 100 / 0: division by 0 (error token is "0")


 Comments   
Comment by James Nunez (Inactive) [ 26/Jul/13 ]

Proposed patch at:

http://review.whamcloud.com/7132

Comment by Andreas Dilger [ 31/Jul/13 ]

I don't think there is anything wrong with this patch, but I think it is fixing the symptom and not the cause of the problem.

Why is one OST full in the first place? Is some test creating huge files and not cleaning them up? Is there a missing "wait_delete_completed" after some large file is deleted? Are deleted files not causing the objects to be destroyed (which would be a serious bug that should be fixed)?

Comment by James Nunez (Inactive) [ 31/Jul/13 ]

I agree. I'm looking into which test(s) are filling an OST. In the cases of a single full OST that I've looked at, sanity test 101d is the first test that fails because of it, so I'm working back from there.

Comment by James Nunez (Inactive) [ 05/Aug/13 ]

So far, I can say that I've run into this problem twice, both times on Toro where the OSTs are ~9.77 GB:
7/31/13 https://maloo.whamcloud.com/test_sessions/951e6f9c-f985-11e2-8917-52540035b04c
7/29/13 https://maloo.whamcloud.com/test_sessions/496f53b4-f8da-11e2-977b-52540035b04c

I submitted a test run that prints the available space on the OSTs throughout the sanity suite, to try to trigger the full-OST problem again, but haven't had any luck ... yet: http://review.whamcloud.com/#/c/7207/ .
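
As a rough illustration of that kind of space reporting (the exact commands in the debugging patch may differ), per-OST free space can be dumped between tests with:

# per-OST / per-MDT usage for the mounted file system
lfs df /mnt/lustre
# available KB on each OST as seen by the client's OSC devices
lctl get_param -n osc.*OST*.kbytesavail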

Currently, some of the sanity 27* tests look like possible culprits:
sanity test 27m: create file while OST0 was full [always skipped, so not the problem]
sanity test 27n: create file with some full OSTs
sanity test 27o: create file with all full OSTs (should error)
sanity test 27p: append to a truncated file with some full OSTs
sanity test 27q: append to truncated file with all OSTs full (should error)
sanity test 27r: stripe file with some full OSTs (shouldn't LBUG)

Comment by James Nunez (Inactive) [ 06/Aug/13 ]

In the end, we didn't have to look very far from test 101d. It just looks like test 101d is filling the OST on its own. From the test suite log at http://review.whamcloud.com/#/c/7207/, we see that test 101c ends with space left on all OSTs, i.e. OST 0 is not full:

18:35:11:== sanity test 101c: check stripe_size aligned read-ahead =================== 18:35:10 (1375752910)
...
18:35:34:OST kbytes available: 148656 148672 157676 147636 157892 157672 157892
18:35:34:Min free space: OST 3: 147636
18:35:34:Max free space: OST 4: 157892

Then 101d starts and dd fails:
18:35:34:== sanity test 101d: file read with and without read-ahead enabled =================== 18:35:29 (1375752929)
18:35:34:Creating 500M test file /mnt/lustre/f.sanity.101d
18:35:45:dd: writing `/mnt/lustre/f.sanity.101d': No space left on device
18:35:45:165+0 records in
18:35:45:164+0 records out
18:35:46:172576768 bytes (173 MB) copied, 7.17172 s, 24.1 MB/s
18:35:46: sanity test_101d: @@@@@@ FAIL: dd failed

From the code, we get the free space on the whole file system and then try to write a 500 MB file, but the stripe of that file is never set. So the file system as a whole has more than 500 MB available, but no single OST does, and thus the 500 MB file fills up a single OST.
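
To make the mismatch concrete, here is an illustrative sketch (not the sanity.sh code) contrasting the whole-file-system number 101d checks against the per-OST numbers that actually limit an unstriped write; the awk/sort pipeline is just one way to pull out the minimum:

# aggregate free space across the file system (what 101d checks)
df -P /mnt/lustre
# free space per OST (what actually limits an unstriped 500 MB file)
lfs df /mnt/lustre
# smallest 'Available' value on any single OST, in KB
lfs df /mnt/lustre | awk '/OST/ { print $4 }' | sort -n | head -1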

I verified this locally by running df as in 101d:
[root@client1 tests]# df -P /mnt/lscratch
Filesystem 1024-blocks Used Available Capacity Mounted on
192.168.0.200@tcp0:/lscratch 749856 103072 601184 15% /mnt/lscratch

and we see we have more than enough space for a 500 MB file, yet:
== sanity test 101d: file read with and without read-ahead enabled =================== 07:32:58 (1375795978)
Waiting for local destroys to complete
OST kbytes available: 151696 150096 150296 149896
Min free space: OST 3: 149896
Max free space: OST 0: 151696
Found local space to be 601984 kb
Creating 500M test file /mnt/lscratch/f.sanity.101d
dd: writing `/mnt/lscratch/f.sanity.101d': No space left on device
156+0 records in
155+0 records out
163078144 bytes (163 MB) copied, 14.5852 s, 11.2 MB/s
sanity test_101d: @@@@@@ FAIL: dd failed

So I guess the assumption is that the file should be striped across all OSTs, since the check for available space is against the whole file system and not individual OSTs.

The patch for LU-3633 sets the file stripe and thus should "fix" the OST-filling problem, although there may be other tests that make a similar assumption about file striping.
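
For reference, a minimal sketch of that kind of fix (the actual LU-3633 patch may differ in detail): stripe the test file across all OSTs before writing it, so that no single OST has to hold the whole 500 MB.

# stripe the file over all available OSTs (-c -1), then write 500 MB into it
lfs setstripe -c -1 /mnt/lustre/f.sanity.101d
dd if=/dev/zero of=/mnt/lustre/f.sanity.101d bs=1M count=500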

Comment by Andreas Dilger [ 23/Sep/13 ]

Is there anything left in this bug that needs to be done, or can it be closed?

Comment by James Nunez (Inactive) [ 23/Sep/13 ]

The patch hasn't landed yet. The last batch of autotest results never posted, or at least hadn't posted more than 10 days after autotest started, so I just rebased and resubmitted earlier today.

Comment by James Nunez (Inactive) [ 27/Sep/13 ]

Patch landed for 2.5.0
