[LU-10613] replay-single tests 20c, 21, 23, 24, 25, 26, 30, 48, 53f, 53g, 62, 70b, 70c fail on open with ‘No space left on device’ Created: 06/Feb/18  Updated: 07/Jun/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-4846 Failover test failure on test suite r... Open
is related to LU-10609 replay-single test_48: FAIL: createma... Open
is related to LU-10612 replay-single test_48 defect causes t... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A variety of replay-single tests are failing as if the file system were full: calls like open and touch fail with “No space left on device”. A different set of replay-single tests fails in each test session. For example, the test session at https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc has tests 23, 24, 25, 26, 48, 53g, 70b, and 70c failing with ‘no space left on device’, while test session https://testing.hpdd.intel.com/test_sets/b226317a-08a2-11e8-a10a-52540065bddc has tests 20c, 21, 30, 32, 33a, 48, 53c, 53f, 53g, 70b, and 70c failing with the same error message.

For example, looking at a failover test session for lustre-master build #3703 for ldiskfs servers where replay-single fails (https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc), we see test 26 fail with the error message

'multiop_bg_pause /mnt/lustre/f26.replay-single-1 failed' 

Looking at the test_log for replay-single test_26 (and most of the other failed tests), we see that the file system is nowhere near full:

== replay-single test 26: |X| open(O_CREAT), unlink two, close one, replay, close one (test mds_cleanup_orphans) ====================================================================================================== 08:40:42 (1517820042)
CMD: onyx-49vm11 sync; sync; sync
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      5825660       48852     5253976   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276       30916     1781120   2% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       25792     1786244   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       25840     1769716   1% /mnt/lustre[OST:6]

filesystem_summary:     13532932      185700    12482072   1% /mnt/lustre

CMD: onyx-49vm5.onyx.hpdd.intel.com,onyx-49vm7,onyx-49vm8 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: onyx-49vm5.onyx.hpdd.intel.com,onyx-49vm7,onyx-49vm8 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: onyx-49vm11 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: onyx-49vm11 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: onyx-49vm11 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
multiop /mnt/lustre/f26.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.976
open(O_RDWR|O_CREAT): No space left on device
replay-single test_26: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f26.replay-single-1 failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
  = /usr/lib64/lustre/tests/replay-single.sh:613:test_26()
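
Given that the df output above shows every target at 1–2% block usage, it is worth capturing at the moment of failure whether the ENOSPC is about blocks at all, or about inodes or a single degraded target. A minimal diagnostic sketch (bash; the /mnt/lustre mount point is taken from the logs above, the helper itself is not part of the test framework):

MOUNT=${MOUNT:-/mnt/lustre}        # mount point as used in the logs above

lfs df "$MOUNT"                    # per-MDT/OST block usage (same report as above)
lfs df -i "$MOUNT"                 # per-MDT/OST inode usage; ENOSPC can also mean "out of inodes"

# A single full or inactive target can be enough for some allocations to fail,
# so check every OST line rather than only the filesystem_summary.
lfs df "$MOUNT" | awk '$5+0 > 90 {print "nearly full:", $0}'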

In the MDS dmesg, the only hint of trouble is

[  117.289138] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[  117.297983] LustreError: 1978:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x200097212:0xe:0x0]: have 0 want 1
[  117.511816] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_26: @@@@@@ FAIL: multiop_bg_pause \/mnt\/lustre\/f26.replay-single-1 failed 

This “can't lstripe objid” error makes the problem look like LU-10350, but that patch has landed and sanity-dom is not run in this test group.
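
The “have 0 want 1” part of the lod_alloc_specific() message means the MDT could not find even one OST on which to place the new object, which would turn an open(O_CREAT) into ENOSPC regardless of how much raw space df reports. One way to check this on the MDS right after the replay barrier is to look at the OSP precreate state; this is a sketch only, and the osp.* parameter names are assumed from recent Lustre releases rather than taken from these logs:

# Run on the MDS (onyx-49vm11 in the log above).
# prealloc_status is 0 when the OSP can hand out objects; a negative errno
# (e.g. -107 ENOTCONN or -28 ENOSPC) here would explain client-side ENOSPC on create.
lctl get_param osp.*.prealloc_status

# Size of the remaining precreate window per OST.
lctl get_param osp.*.prealloc_next_id osp.*.prealloc_last_id

# Confirm the MDT->OST (osp) devices are set up and active after the failover.
lctl dl | grep osp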

For each test that fails with ‘no space’, here is the test header and the error from the test log; several lines of test output have been removed to focus on the error/failure:

== replay-single test 20c: check that client eviction does not affect file content =================== 18:24:30 (1517624670)
multiop /mnt/lustre/f20c.replay-single vOw_c
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_20c: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f20c.replay-single failed 

== replay-single test 21: |X| open(O_CREAT), unlink touch new, replay, close (test mds_cleanup_orphans) 
multiop /mnt/lustre/f21.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_21: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f21.replay-single failed 

== replay-single test 30: open(O_CREAT) two, unlink two, replay, close two (test mds_cleanup_orphans) ====================================================================================================== 18:34:12 (1517625252)
multiop /mnt/lustre/f30.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_30: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f30.replay-single-1 failed 

== replay-single test 31: open(O_CREAT) two, unlink one, |X| unlink one, close two (test mds_cleanup_orphans) ====================================================================================================== 18:34:16 (1517625256)
multiop /mnt/lustre/f31.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_31: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f31.replay-single-1 failed 

== replay-single test 32: close() notices client eviction; close() after client eviction ============= 18:34:18 (1517625258)
multiop /mnt/lustre/f32.replay-single vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_32: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f32.replay-single failed 

== replay-single test 33a: fid seq shouldn't be reused after abort recovery ========================== 18:34:20 (1517625260)
open(/mnt/lustre/f33a.replay-single-0) error: No space left on device
total: 0 open/close in 0.00 seconds: 0.00 ops/second
replay-single test_33a: @@@@@@ FAIL: createmany create /mnt/lustre/f33a.replay-single failed 

== replay-single test 48: MDS->OSC failure during precreate cleanup (2824) =========================== 19:12:20 (1517627540)
Started lustre-MDT0000
CMD: onyx-48vm10 lctl set_param fail_loc=0x80000216
fail_loc=0x80000216
open(/mnt/lustre/f48.replay-single20) error: No space left on device
total: 0 open/close in 0.00 seconds: 0.00 ops/second
replay-single test_48: @@@@@@ FAIL: createmany recraete /mnt/lustre/f48.replay-single failed 

== replay-single test 53c: |X| open request and close request while two MDC requests in flight ======= 19:17:23 (1517627843)
CMD: onyx-48vm11 lctl set_param fail_loc=0x80000107
open(O_RDWR|O_CREAT): No space left on device
fail_loc=0x80000107
CMD: onyx-48vm11 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1293: kill: (3180) - No such process
replay-single test_53c: @@@@@@ FAIL: close_pid doesn't exist 

== replay-single test 53f: |X| open reply and close reply while two MDC requests in flight =========== 19:20:23 (1517628023)
CMD: onyx-48vm11 lctl set_param fail_loc=0x119
open(O_RDWR|O_CREAT): No space left on device
fail_loc=0x119
CMD: onyx-48vm11 lctl set_param fail_loc=0x8000013b
fail_loc=0x8000013b
/usr/lib64/lustre/tests/replay-single.sh: line 1398: kill: (6748) - No such process
replay-single test_53f: @@@@@@ FAIL: close_pid doesn't exist 

== replay-single test 53g: |X| drop open reply and close request while close and open are both in flight ====================================================================================================== 19:20:31 (1517628031)
CMD: onyx-48vm11 lctl set_param fail_loc=0x119
open(O_RDWR|O_CREAT): No space left on device
fail_loc=0x119
CMD: onyx-48vm11 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1436: kill: (7302) - No such process
CMD: onyx-48vm11 lctl set_param fail_loc=0
fail_loc=0
replay-single test_53g: @@@@@@ FAIL: close_pid doesn't exist 

== replay-single test 70b: dbench 1mdts recovery; 3 clients ========================================== 20:06:47 (1517630807)
onyx-48vm1: [6141] open ./clients/client0/~dmtmp/EXCEL/BEED0000 failed for handle 11169 (No space left on device)
onyx-48vm1: (6142) ERROR: handle 11169 was not found
onyx-48vm1: Child failed with status 1
onyx-48vm4: [6141] open ./clients/client0/~dmtmp/EXCEL/BEED0000 failed for handle 11169 (No space left on device)
onyx-48vm4: (6142) ERROR: handle 11169 was not found
onyx-48vm4: Child failed with status 1
onyx-48vm4: dbench: no process found
onyx-48vm1: dbench: no process found
onyx-48vm3: [6320] open ./clients/client0/~dmtmp/WORDPRO/LWPSAV0.TMP failed for handle 11188 (No space left on device)
onyx-48vm3: (6321) ERROR: handle 11188 was not found
onyx-48vm3: Child failed with status 1
onyx-48vm3: dbench: no process found
onyx-48vm3: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
onyx-48vm4: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
onyx-48vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
CMD: onyx-48vm1,onyx-48vm3,onyx-48vm4 killall -0 dbench
onyx-48vm4: dbench: no process found
onyx-48vm3: dbench: no process found
onyx-48vm1: dbench: no process found
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-48vm1,onyx-48vm3,onyx-48vm4! 

CMD: onyx-48vm1,onyx-48vm3,onyx-48vm4 killall  dbench
onyx-48vm3: dbench: no process found
onyx-48vm4: dbench: no process found
onyx-48vm1: dbench: no process found
replay-single test_70b: @@@@@@ FAIL: rundbench load on onyx-48vm1,onyx-48vm3,onyx-48vm4 failed! 
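
In the test 70b log above, dbench fails its opens with ENOSPC even though the filesystem is almost empty, and the framework then only reports that the dbench processes are gone. When this reproduces, a quick client-side check of what the OSCs believe about the OSTs may show whether the client itself sees no space; the osc.* parameter names below are assumed from current Lustre releases, so treat this as a sketch:

# Run on a dbench client (e.g. onyx-48vm1) right after the ENOSPC appears.

# Space and object counts as reported to the client by each OST; if every OSC
# shows ~0 available, the client returns ENOSPC even when server-side df looks fine.
lctl get_param osc.*.kbytesavail osc.*.filesfree

# Per-OSC grant -- exhausted grant during recovery is only an assumption worth
# ruling out here, not a confirmed cause.
lctl get_param osc.*.cur_grant_bytes

# Were any OSCs inactive at the time of the failure?
lctl get_param osc.*.active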

Logs for this failure can be found at
lustre-master build # 3703 el7 - https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc
lustre-master build # 3703 el7 servers/sles12sp3 clients - https://testing.hpdd.intel.com/test_sets/b226317a-08a2-11e8-a10a-52540065bddc
lustre-master build # 3703 sles12sp2 - https://testing.hpdd.intel.com/test_sets/4d1259a8-06d6-11e8-a7cd-52540065bddc
lustre-master build # 3702 el7 servers/sles12sp3 clients - https://testing.hpdd.intel.com/test_sets/cee4eabe-04e1-11e8-bd00-52540065bddc



 Comments   
Comment by James Nunez (Inactive) [ 13/Mar/19 ]

We are still seeing a few replay-single tests fail with 'No space left on device'. For example, see the 2.10.7 RC1 failover test session with logs at https://testing.whamcloud.com/test_sets/01b2e4f4-43fd-11e9-9720-52540065bddc.

Comment by James Nunez (Inactive) [ 27/Sep/19 ]

We are still seeing these issues for master and 2.12.3.

Looking at https://testing.whamcloud.com/test_sets/56982af2-dfec-11e9-a0ba-52540065bddc, we can see that it is unlikely that the file system is full, since the space usage printed earlier in test 70b shows:

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      5825660       48764     5254064   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276      195836     1611176  11% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       25796     1786240   2% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       25796     1786240   2% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       25796     1786240   2% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       25796     1786240   2% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       25796     1786240   2% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       25796     1786240   2% /mnt/lustre[OST:6]

filesystem_summary:     13532932      350612    12328616   3% /mnt/lustre

CMD: trevis-40vm6.trevis.whamcloud.com,trevis-40vm8,trevis-40vm9 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: trevis-40vm6.trevis.whamcloud.com,trevis-40vm8,trevis-40vm9 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi

In this case, no test other than 70b fails with 'no space left on device', so this may be a different issue?
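
If only test 70b hits this, one way to distinguish a genuinely different problem from a transient one (for example OST0000, already at 11% above, briefly filling or being skipped during recovery) is to sample space usage while the test runs instead of relying on the single df taken at test start. A minimal sketch, assuming the standard /mnt/lustre mount point; this monitor is hypothetical and not part of the test framework:

MOUNT=${MOUNT:-/mnt/lustre}      # assumption matching the logs above
LOG=/tmp/lfs_df_samples.$$       # hypothetical output file

# Sample per-target block and inode usage every 5 seconds; run this in the
# background on one client during test_70b and kill it afterwards.
while :; do
    date +%s          >> "$LOG"
    lfs df   "$MOUNT" >> "$LOG" 2>&1
    lfs df -i "$MOUNT" >> "$LOG" 2>&1
    sleep 5
done

Correlating the timestamps in the sample file with the dbench ENOSPC messages would show whether any target actually filled during recovery.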
