[LU-3640] Test failure on test suite sanity, test_116a, when OST is full Created: 25/Jul/13 Updated: 16/Oct/13 Resolved: 27/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9374 |
| Description |
|
This issue was created by maloo for James Nunez <james.a.nunez@intel.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/92e7707e-f4b3-11e2-b8a2-52540035b04c. The sub-test test_116a failed with the following error:
When an OST fills, test 116a finds the OST with the minimum available space and tries to fill 25% of the remaining space, but it never checks whether any space is available at all. The test should check whether the minimum available space on an OST is zero and, if so, skip; a minimal sketch of such a guard follows the log excerpt below. The client test_log looks like:

== sanity test 116a: stripe QOS: free space balance ===================== 22:17:35 (1374729455)
Free space priority error: get_param: /proc/{fs,sys}/{lnet,lustre}/lov/*-clilov-*/qos_prio_free: Found no match
CMD: client-26vm3 lctl set_param -n osd*.*MD*.force_sync 1
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
CMD: client-26vm3 lctl get_param -n osc.*MDT*.sync_*
Waiting for local destroys to complete
OST kbytes available: 164288 172508 172284 0 164312 172300 162240
Min free space: OST 3: 0
Max free space: OST 1: 172508
Filling 25% remaining space in OST3 with 0Kb
CMD: client-26vm3 lctl get_param -n lov.*.qos_maxage
Waiting for local destroys to complete
OST kbytes available: 164068 172508 172504 0 164312 172520 162016
Min free space: OST 3: 0
Max free space: OST 5: 172520
/usr/lib64/lustre/tests/sanity.sh: line 6604: 172520 * 100 / 0: division by 0 (error token is "0")
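As a point of reference, here is a minimal sketch of the zero-space guard described above, in the style of sanity.sh. The variable names MINV and MAXV and the skip message are illustrative assumptions, not the exact code of the proposed patch; skip is the standard test-framework.sh helper.

    # Sketch only: MINV/MAXV are assumed to hold the minimum and maximum
    # per-OST kbytes available, as in the "Min free space" / "Max free
    # space" lines above.
    if [ "$MINV" -eq 0 ]; then
        skip "OST with minimum free space has 0 kbytes available"
        return 0
    fi
    # Only safe once MINV > 0; an expression of this shape is what divided
    # by zero at sanity.sh line 6604 in the failing run.
    RATIO=$((MAXV * 100 / MINV))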
|
| Comments |
| Comment by James Nunez (Inactive) [ 26/Jul/13 ] |
|
Proposed patch at: |
| Comment by Andreas Dilger [ 31/Jul/13 ] |
|
I don't think there is anything wrong with this patch, but I think it is fixing the symptom and not the cause of the problem. Why is one OST full in the first place? Is some test creating huge files and not cleaning them up? Is there a missing "wait_delete_completed" after some large file is deleted? Are deleted files not causing the objects to be destroyed (which would be a serious bug that should be fixed)? |
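As background on the wait_delete_completed point: space on an OST is not freed until the object destroys triggered by an unlink have completed, so tests that create and remove large files normally follow the pattern sketched below. wait_delete_completed is the real test-framework.sh helper; the file name is illustrative only.

    # Illustrative pattern, not a specific test: remove the large file and
    # then wait for the backing OST objects to actually be destroyed, so
    # the freed space is visible to whatever runs next.
    rm -f $DIR/$tfile
    wait_delete_completed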
| Comment by James Nunez (Inactive) [ 31/Jul/13 ] |
|
I agree. I'm looking into which test(s) are filling an OST. In the cases of a single OST filling up that I've looked at, sanity test 101d is the first test that fails due to a full OST, so I'm working back from there. |
| Comment by James Nunez (Inactive) [ 05/Aug/13 ] |
|
So far, I've run into this problem twice, both times on Toro, where the OSTs are ~9.77 GB. I submitted a test run that prints the available space on the OSTs throughout the sanity suite to try to trigger the full OST problem again, but haven't had any luck ... yet: http://review.whamcloud.com/#/c/7207/ . Currently, some of the sanity 27 tests look like possible culprits: |
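For reference, per-OST free space can be dumped from the client with standard tools; the debug output added in the test run above may be produced differently, so this is only a sketch.

    # Per-OST kbytes available as seen from the client, similar to the
    # "OST kbytes available" line in the test_116a log above:
    lctl get_param -n osc.*OST*.kbytesavail
    # Human-readable per-target view:
    lfs df $MOUNT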
| Comment by James Nunez (Inactive) [ 06/Aug/13 ] |
|
In the end, we didn't have to look very far from test 101d; it looks like test 101d is filling the OST on its own. From the test suite log at http://review.whamcloud.com/#/c/7207/, we see that test 101c ends with space left on all OSTs, i.e. OST 0 is not full:

18:35:11:== sanity test 101c: check stripe_size aligned read-ahead =================== 18:35:10 (1375752910)

Then 101d starts and dd fails. From the code, we get the available space on the whole file system and then try to write a 500 MB file, but the stripe of that file is never set. So the file system has more than 500 MB of space available, but a single OST does not, and thus the 500 MB file fills a single OST. I verified this locally by running df as in 101d: we see we have more than enough for a 500 MB file, yet the dd still fails. So, I guess the assumption is that the file should be striped across all OSTs, since the check for available space is for the whole file system and not just individual OSTs. The patch for |
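To make the diagnosis concrete, the implied fix for test_101d is to stripe the large file across all OSTs before writing it, so no single OST has to hold the whole 500 MB. A hedged sketch using the usual sanity.sh conventions ($DIR, $tdir, $tfile, $LFS, error); the actual patch may differ in its details.

    # Sketch only (inside the test_101d function, assuming $DIR/$tdir
    # exists): create the file with a stripe count of -1 (all OSTs) so the
    # 500 MB write is spread out instead of filling a single OST.
    local file=$DIR/$tdir/$tfile
    $LFS setstripe -c -1 $file || error "setstripe -c -1 $file failed"
    dd if=/dev/zero of=$file bs=1M count=500 || error "dd to $file failed"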
| Comment by Andreas Dilger [ 23/Sep/13 ] |
|
Is there anything left in this bug that needs to be done, or can it be closed? |
| Comment by James Nunez (Inactive) [ 23/Sep/13 ] |
|
The patch hasn't landed yet. The last batch of autotest results never posted, or at least hadn't posted more than 10 days after autotest started, so I rebased and resubmitted earlier today. |
| Comment by James Nunez (Inactive) [ 27/Sep/13 ] |
|
Patch landed for 2.5.0 |