[LU-4846] Failover test failure on test suite replay-single test_26: No space left Created: 01/Apr/14  Updated: 08/May/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: mq115
Environment:

client and server: lustre-master build # 1945 RHEL6


Issue Links:
Duplicate
is duplicated by LU-5947 Failover replay-single test_15: No sp... Closed
is duplicated by LU-7294 Failover - replay-single test_53b: m... Closed
Related
is related to LU-5526 recovery-mds-scale test failover_mds:... Resolved
is related to LU-7309 replay-single test_70b: no space left... Resolved
is related to LU-10613 replay-single tests 20c, 21, 23, 24, ... Open
Severity: 3
Rank (Obsolete): 13355

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/17d5d848-b1c4-11e3-9a4b-52540035b04c.

The sub-test test_26 failed with the following error:

test_26 failed with 5

With the previous tag, 2.5.56, we did not hit this "no space left" error.
test log:

== replay-single test 26: |X| open(O_CREAT), unlink two, close one, replay, close one (test mds_cleanup_orphans) == 03:14:43 (1395483283)
CMD: client-30vm3 sync; sync; sync
Filesystem                        1K-blocks   Used Available Use% Mounted on
client-30vm3:client-30vm7:/lustre  14449456 760996  12928224   6% /mnt/lustre
CMD: client-30vm1.lab.whamcloud.com,client-30vm5,client-30vm6 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: client-30vm1.lab.whamcloud.com,client-30vm5,client-30vm6 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: client-30vm3 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: client-30vm3 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: client-30vm3 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
multiop /mnt/lustre/f26.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.19802
open(O_RDWR|O_CREAT): No space left on device


 Comments   
Comment by Sarah Liu [ 01/Apr/14 ]

Also seen in SLES11 SP3 failover test:
https://maloo.whamcloud.com/test_sets/e76aab3c-b3b4-11e3-85be-52540035b04c

Comment by Oleg Drokin [ 01/Apr/14 ]

This test seems to have run out of space for some reason; I suspect it needs to be tracked down in the test scripts.

Comment by Jian Yu [ 20/Aug/14 ]

The replay-single out of space failures on Lustre b2_5 branch in hard failover test sessions were reported in LU-3326 before. Since that ticket was closed, let's track the replay-single out of space failures in this ticket.

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/80/
Distro/Arch: RHEL6.5/x86_64
Test Group: failover

replay-single tests 13, 14, 15, 26, 27, 28, and 53e all hit out-of-space failures:
https://testing.hpdd.intel.com/test_sets/eba92472-25eb-11e4-8ee8-5254006e85c2

Comment by Jian Yu [ 21/Aug/14 ]

Another instance on the Lustre b2_5 branch: https://testing.hpdd.intel.com/test_sets/c237917a-2904-11e4-9362-5254006e85c2

Hi Hongchao, is this a duplicate of LU-5526?

Comment by Hongchao Zhang [ 22/Aug/14 ]

this should be a separate issue, caused by the tests that ran before the failing one (say, test_15), since the failing test fails right at the beginning:

== replay-single test 15: open(O_CREAT), unlink |X|  touch new, close == 05:23:08 (1408598588)
multiop /mnt/lustre/f15.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.794
open(O_RDWR|O_CREAT): No space left on device

then one of the previous tests may have failed to clean up after itself, triggering the issue.

for debugging purposes, we can add "sync" in the failing tests to try to free some more space.

index 446283c..e7a85c1 100755
--- a/lustre/tests/replay-single.sh
+++ b/lustre/tests/replay-single.sh
@@ -336,6 +336,7 @@ test_14() {
 run_test 14 "open(O_CREAT), unlink |X| close"
 
 test_15() {
+    sync
     multiop_bg_pause $DIR/$tfile O_tSc || return 5
     pid=$!
     rm -f $DIR/$tfile
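The same idea could go a step further: flush dirty pages and then verify that a minimum amount of space is actually free before the sub-test creates files. This is only a sketch; ensure_free_space, the mount point, and the 4096 KB threshold are illustrative and not part of replay-single.sh:

```shell
#!/bin/sh
# Hypothetical helper (not in replay-single.sh): sync so that blocks freed by
# earlier sub-tests show up in df, then check available space before creating
# files. Mount point and threshold are assumptions for illustration.

ensure_free_space() {
	mnt=${1:-/tmp}     # mount point to check (assumed)
	need_kb=${2:-4096} # minimum free space in KB (assumed)

	# flush pending writes so freed blocks become visible
	sync; sync

	# with df -P, the 4th field of the second line is "Available" in KB
	avail=$(df -Pk "$mnt" | awk 'NR==2 {print $4}')
	if [ "$avail" -lt "$need_kb" ]; then
		echo "only ${avail}KB free on $mnt, need ${need_kb}KB" >&2
		return 1
	fi
	return 0
}

ensure_free_space /tmp 1 && echo "enough space"
```

A sub-test could then call `ensure_free_space $MOUNT` up front and skip (rather than fail with ENOSPC) when a previous test left the filesystem full.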
Comment by Jian Yu [ 23/Aug/14 ]

Hi Hongchao,

The out-of-space failure occurred on different sub-tests of replay-single.sh in different test runs. I'm afraid we cannot be sure which sub-tests need the added "sync".
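One way around not knowing which sub-tests to patch would be to flush once in the shared wrapper that runs every sub-test. The sketch below is illustrative only: run_one_demo and the sub-test bodies are made-up stand-ins, while the real wrapper lives in the Lustre test-framework:

```shell
#!/bin/sh
# Illustrative sketch: instead of adding "sync" to individual sub-tests,
# flush in a common wrapper so every sub-test starts with blocks freed by
# its predecessor visible. run_one_demo and test_a/test_b are hypothetical.

run_one_demo() {
	name=$1; shift
	sync; sync # reclaim space freed by the previous sub-test
	echo "== running $name =="
	"$@"
}

test_a() { echo "test_a body"; }
test_b() { echo "test_b body"; }

run_one_demo test_a test_a
run_one_demo test_b test_b
```

The trade-off is extra sync time on every sub-test, but it avoids guessing which tests are affected.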

Comment by Jian Yu [ 09/Sep/14 ]

The same failure occurred consistently on the Lustre b_ieel2_0 branch:
https://testing.hpdd.intel.com/test_sets/c5d51748-3765-11e4-a2a6-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a8f61b6a-3781-11e4-9142-5254006e85c2
https://testing.hpdd.intel.com/test_sets/5e9611e2-375e-11e4-bcac-5254006e85c2
https://testing.hpdd.intel.com/test_sets/d9e8c6e6-3a5b-11e4-9a75-5254006e85c2
https://testing.hpdd.intel.com/test_sets/b5cfc644-4765-11e4-9640-5254006e85c2
https://testing.hpdd.intel.com/test_sets/d0040858-4667-11e4-b3aa-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7ce44470-5211-11e4-af18-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f144b5fc-5301-11e4-92d7-5254006e85c2

Comment by Jian Yu [ 30/Oct/14 ]

More instances on the Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/a64f7f90-5ea6-11e4-badb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/1cd5f3a2-5483-11e4-84d2-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9c4b1bbc-7a8a-11e4-b9fd-5254006e85c2

Comment by Jian Yu [ 22/Nov/14 ]

The same failure occurred on the master branch:
https://testing.hpdd.intel.com/test_sets/c3f0ec98-6903-11e4-9d25-5254006e85c2

Comment by Sarah Liu [ 16/Apr/15 ]

Hit this in current master:
https://testing.hpdd.intel.com/test_sets/9709e12a-e1b3-11e4-87fa-5254006e85c2

Comment by Hongchao Zhang [ 04/May/15 ]

should be a duplicate

Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ]

master, build# 3264, 2.7.64 tag
Hard Failover: EL6.7 Server/Client
The same issue is blocking a series of tests: test_15, test_18, and test_21.

https://testing.hpdd.intel.com/test_sets/80a20678-9edd-11e5-87a9-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ]

master, build# 3264, 2.7.64 tag
Hard Failover: EL6.7 Server/Client - ZFS
replay-single test_18 failed and test_20 timed out because of the same issue.
https://testing.hpdd.intel.com/test_sets/2f719028-9ebc-11e5-98a4-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ]

master, build# 3266, 2.7.64 tag
Hard Failover: EL6.7 Server/SLES11 SP3 Clients
replay-single test_48 failed with the same issue.
https://testing.hpdd.intel.com/test_sets/bcb62a46-a080-11e5-85ed-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ]

replay-dual test_26 is failing with the same issue.
master, build# 3266, 2.7.64 tag
Hard Failover: EL7 Server/SLES11 SP3 Client
https://testing.hpdd.intel.com/test_sets/a6c5740c-a077-11e5-8d69-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ]

replay-single test_18, test_21, and test_48 failed with the same issue.
master, build# 3266, 2.7.64 tag
Hard Failover: EL7 Server/SLES11 SP3 Client
https://testing.hpdd.intel.com/test_sets/a8b3fb9e-a077-11e5-8d69-5254006e85c2

Comment by James Nunez (Inactive) [ 30/Dec/15 ]

LU-7309 seems to be tracking the same problem: multiple replay-single tests failing with 'No space left on device'.

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hard failover: EL6.7 Server/Client
https://testing.hpdd.intel.com/test_sets/439c1dda-bc93-11e5-8f65-5254006e85c2

Another instance found for hard failover: EL6.7 Server/Client
https://testing.hpdd.intel.com/test_sets/48226896-bc93-11e5-8f65-5254006e85c2

Generated at Sat Feb 10 01:46:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.