[LU-4846] Failover test failure on test suite replay-single test_26: No space left Created: 01/Apr/14 Updated: 08/May/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | mq115 | ||
| Environment: |
client and server: lustre-master build # 1945 RHEL6 |
| Severity: | 3 |
| Rank (Obsolete): | 13355 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/17d5d848-b1c4-11e3-9a4b-52540035b04c.

The sub-test test_26 failed with the following error. In the previous tag, 2.5.56, we did not hit this "no space left" error.

== replay-single test 26: |X| open(O_CREAT), unlink two, close one, replay, close one (test mds_cleanup_orphans) == 03:14:43 (1395483283)
CMD: client-30vm3 sync; sync; sync
Filesystem                         1K-blocks   Used  Available Use% Mounted on
client-30vm3:client-30vm7:/lustre   14449456 760996   12928224   6% /mnt/lustre
CMD: client-30vm1.lab.whamcloud.com,client-30vm5,client-30vm6 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: client-30vm1.lab.whamcloud.com,client-30vm5,client-30vm6 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: client-30vm3 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: client-30vm3 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: client-30vm3 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
multiop /mnt/lustre/f26.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.19802
open(O_RDWR|O_CREAT): No space left on device
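
Note that the filesystem above is only 6% full, yet open(O_CREAT) fails with ENOSPC. A first diagnostic step is to check per-target usage for both blocks and inodes; a minimal sketch using the standard lfs df commands (the mount point is the one from this run; running this from a client in the failed session is an assumption):

# Per-target block usage as seen from the client; /mnt/lustre is the
# mount point from the log above.
lfs df /mnt/lustre
# open(O_CREAT) can also fail with ENOSPC when the MDT runs out of
# inodes, so check inode usage as well.
lfs df -i /mnt/lustre
|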
| Comments |
| Comment by Sarah Liu [ 01/Apr/14 ] |
|
Also seen in SLES11 SP3 failover test: |
| Comment by Oleg Drokin [ 01/Apr/14 ] |
|
This test seems to have run out of space for some reason; I suspect the cause needs to be tracked down in the test scripts.
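
One way to do that tracking would be to snapshot free space around each sub-test so the leak can be attributed to a specific test; a minimal sketch, assuming the standard lfs tools on the client (space_snapshot is a hypothetical helper, not part of test-framework.sh):

# Hypothetical helper, illustrative only: record block and inode usage
# so a space leak can be pinned to the sub-test that caused it.
space_snapshot() {
	local tag=$1
	echo "== space at $tag =="
	lfs df $MOUNT
	lfs df -i $MOUNT
}
# Example usage around a suspect sub-test:
#   space_snapshot "before test_26"
#   ... run the sub-test ...
#   space_snapshot "after test_26"
|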
| Comment by Jian Yu [ 20/Aug/14 ] |
|
The replay-single out-of-space failures in hard failover test sessions on the Lustre b2_5 branch were reported in Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/80/. replay-single tests 13, 14, 15, 26, 27, 28, and 53e all hit out-of-space failures: |
| Comment by Jian Yu [ 21/Aug/14 ] |
|
More instances on the Lustre b2_5 branch: https://testing.hpdd.intel.com/test_sets/c237917a-2904-11e4-9362-5254006e85c2

Hi Hongchao, is this a duplicate of |
| Comment by Hongchao Zhang [ 22/Aug/14 ] |
|
This should be a separate issue, caused by the tests that ran before the failing one (test_15, say), since the failing test hits the error right at the start:

== replay-single test 15: open(O_CREAT), unlink |X| touch new, close == 05:23:08 (1408598588)
multiop /mnt/lustre/f15.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.794
open(O_RDWR|O_CREAT): No space left on device

One of the earlier tests probably fails to clean up after itself and triggers the issue. For debugging purposes, we can add "sync" to the failing tests to try to reclaim some free space:

index 446283c..e7a85c1 100755
--- a/lustre/tests/replay-single.sh
+++ b/lustre/tests/replay-single.sh
@@ -336,6 +336,7 @@ test_14() {
 run_test 14 "open(O_CREAT), unlink |X| close"
 
 test_15() {
+	sync
 	multiop_bg_pause $DIR/$tfile O_tSc || return 5
 	pid=$!
 	rm -f $DIR/$tfile
|
| Comment by Jian Yu [ 23/Aug/14 ] |
|
Hi Hongchao, the out-of-space failure occurred on different sub-tests of replay-single.sh in different test runs. I'm afraid we cannot determine which sub-tests need the added "sync".
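
If the leaking sub-test cannot be pinpointed, one alternative would be to sync every node before each sub-test rather than patching sub-tests one by one; a sketch under that assumption (pre_test_sync is a hypothetical hook, while do_nodes, comma_list, and nodes_list are existing test-framework.sh helpers):

# Hypothetical hook, illustrative only: flush dirty data on all nodes
# before each sub-test so space freed by earlier tests is reclaimed
# before the next open(O_CREAT).
pre_test_sync() {
	do_nodes $(comma_list $(nodes_list)) "sync; sync; sync"
}
|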
| Comment by Jian Yu [ 09/Sep/14 ] |
|
The same failure occurred consistently on the Lustre b_ieel2_0 branch: |
| Comment by Jian Yu [ 30/Oct/14 ] |
|
More instances on Lustre b2_5 branches: |
| Comment by Jian Yu [ 22/Nov/14 ] |
|
The same failure occurred on master branch: |
| Comment by Sarah Liu [ 16/Apr/15 ] |
|
Hit this in current master: |
| Comment by Hongchao Zhang [ 04/May/15 ] |
|
This should be a duplicate |
| Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ] |
|
master, build# 3264, 2.7.64 tag https://testing.hpdd.intel.com/test_sets/80a20678-9edd-11e5-87a9-5254006e85c2 |
| Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ] |
|
master, build# 3264, 2.7.64 tag |
| Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ] |
|
master, build# 3266, 2.7.64 tag |
| Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ] |
|
replay-dual test_26 failing with same issue. |
| Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ] |
|
replay-single test_18, test_21, test_48 failed with same issue. |
| Comment by James Nunez (Inactive) [ 30/Dec/15 ] |
|
|
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for hard failover: EL6.7 Server/Client |