Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version/s: Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.7, Lustre 2.11.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.13.0
- Severity: 3
Description
A variety of replay-single tests are failing because the file system appears to be full; calls like open and touch fail with “No space left on device”. A different set of replay-single tests fails in each test session. For example, the session at https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc has tests 23, 24, 25, 26, 48, 53g, 70b, and 70c fail with no space left on device, while the session at https://testing.hpdd.intel.com/test_sets/b226317a-08a2-11e8-a10a-52540065bddc has tests 20c, 21, 30, 31, 32, 33a, 48, 53c, 53f, 53g, 70b, and 70c fail with the same error message.
For example, looking at a failover test session for lustre-master build #3703 for ldiskfs servers where replay-single fails (https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc), we see test 26 fail with the error message
'multiop_bg_pause /mnt/lustre/f26.replay-single-1 failed'
Looking at the test_log for replay-single test_26 (and for most of the other failed tests), we see that the file system is nowhere near full:
== replay-single test 26: |X| open(O_CREAT), unlink two, close one, replay, close one (test mds_cleanup_orphans) == 08:40:42 (1517820042)
CMD: onyx-49vm11 sync; sync; sync
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      5825660       48852     5253976   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276       30916     1781120   2% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       25792     1786244   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       25840     1769716   1% /mnt/lustre[OST:6]
filesystem_summary:     13532932      185700    12482072   1% /mnt/lustre
CMD: onyx-49vm5.onyx.hpdd.intel.com,onyx-49vm7,onyx-49vm8 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: onyx-49vm5.onyx.hpdd.intel.com,onyx-49vm7,onyx-49vm8 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: onyx-49vm11 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: onyx-49vm11 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: onyx-49vm11 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
multiop /mnt/lustre/f26.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.976
open(O_RDWR|O_CREAT): No space left on device
replay-single test_26: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f26.replay-single-1 failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5336:error()
= /usr/lib64/lustre/tests/replay-single.sh:613:test_26()
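The df output above shows every target at 1-2% usage, so plain block-space exhaustion looks unlikely. One quick way to confirm that across all targets is to filter `lfs df` output for anything near full. The helper below is a hypothetical sketch (the function name and threshold are mine, not part of the test framework):

```shell
# Hypothetical triage helper: read `lfs df`-style output on stdin and
# print any Lustre target whose Use% column is at or above a threshold
# (default 90%). Against the df output above it prints nothing, since
# every MDT/OST sits at 1-2% usage.
check_full_targets() {
    local threshold=${1:-90}
    awk -v t="$threshold" '
        /_UUID/ {
            use = $5              # Use% column, e.g. "2%"
            sub(/%/, "", use)
            if (use + 0 >= t)
                print $1, use "%"
        }'
}
```

Usage on a live client would be `lfs df /mnt/lustre | check_full_targets 90`; running `lfs df -i` through the same filter would rule out inode exhaustion as well.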
In the MDS dmesg, the only hint of trouble is
[  117.289138] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[  117.297983] LustreError: 1978:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x200097212:0xe:0x0]: have 0 want 1
[  117.511816] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_26: @@@@@@ FAIL: multiop_bg_pause \/mnt\/lustre\/f26.replay-single-1 failed
This “can't lstripe objid” error makes the problem look like LU-10350, but that patch has already landed, and sanity-dom is not run in this test group.
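The “have 0 want 1” message from lod_alloc_specific() suggests the MDS had no precreated OST objects available when the create arrived, which would surface as ENOSPC even with plenty of free blocks. A hedged sketch for inspecting the precreate window, assuming the osp.*.prealloc_next_id and osp.*.prealloc_last_id tunables exposed by these Lustre versions (the helper name is mine):

```shell
# Hypothetical check: compute how many precreated OST objects each OSP
# still has to hand out, from `lctl get_param` output of the form
#   osp.lustre-OST0000-osc-MDT0000.prealloc_next_id=34
#   osp.lustre-OST0000-osc-MDT0000.prealloc_last_id=65
# A window of 0 would be consistent with "have 0 want 1".
precreate_window() {
    awk -F'[=.]' '
        /prealloc_next_id=/ { next_id[$2] = $NF }
        /prealloc_last_id=/ { print $2, $NF - next_id[$2] }'
}
```

On the MDS this would be driven as `lctl get_param osp.*.prealloc_*_id | precreate_window`.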
For each test that failed with ‘no space’, here are the test header and the error from the test log. Several lines of test output have been removed to focus on the error/failure:
== replay-single test 20c: check that client eviction does not affect file content == 18:24:30 (1517624670)
multiop /mnt/lustre/f20c.replay-single vOw_c
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_20c: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f20c.replay-single failed

== replay-single test 21: |X| open(O_CREAT), unlink touch new, replay, close (test mds_cleanup_orphans)
multiop /mnt/lustre/f21.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_21: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f21.replay-single failed

== replay-single test 30: open(O_CREAT) two, unlink two, replay, close two (test mds_cleanup_orphans) == 18:34:12 (1517625252)
multiop /mnt/lustre/f30.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_30: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f30.replay-single-1 failed

== replay-single test 31: open(O_CREAT) two, unlink one, |X| unlink one, close two (test mds_cleanup_orphans) == 18:34:16 (1517625256)
multiop /mnt/lustre/f31.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_31: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f31.replay-single-1 failed

== replay-single test 32: close() notices client eviction; close() after client eviction == 18:34:18 (1517625258)
multiop /mnt/lustre/f32.replay-single vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.31377
open(O_RDWR|O_CREAT): No space left on device
replay-single test_32: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f32.replay-single failed

== replay-single test 33a: fid seq shouldn't be reused after abort recovery == 18:34:20 (1517625260)
open(/mnt/lustre/f33a.replay-single-0) error: No space left on device
total: 0 open/close in 0.00 seconds: 0.00 ops/second
replay-single test_33a: @@@@@@ FAIL: createmany create /mnt/lustre/f33a.replay-single failed

== replay-single test 48: MDS->OSC failure during precreate cleanup (2824) == 19:12:20 (1517627540)
Started lustre-MDT0000
CMD: onyx-48vm10 lctl set_param fail_loc=0x80000216
fail_loc=0x80000216
open(/mnt/lustre/f48.replay-single20) error: No space left on device
total: 0 open/close in 0.00 seconds: 0.00 ops/second
replay-single test_48: @@@@@@ FAIL: createmany recraete /mnt/lustre/f48.replay-single failed

== replay-single test 53c: |X| open request and close request while two MDC requests in flight == 19:17:23 (1517627843)
CMD: onyx-48vm11 lctl set_param fail_loc=0x80000107
open(O_RDWR|O_CREAT): No space left on device
fail_loc=0x80000107
CMD: onyx-48vm11 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1293: kill: (3180) - No such process
replay-single test_53c: @@@@@@ FAIL: close_pid doesn't exist

== replay-single test 53f: |X| open reply and close reply while two MDC requests in flight == 19:20:23 (1517628023)
CMD: onyx-48vm11 lctl set_param fail_loc=0x119
open(O_RDWR|O_CREAT): No space left on device
fail_loc=0x119
CMD: onyx-48vm11 lctl set_param fail_loc=0x8000013b
fail_loc=0x8000013b
/usr/lib64/lustre/tests/replay-single.sh: line 1398: kill: (6748) - No such process
replay-single test_53f: @@@@@@ FAIL: close_pid doesn't exist

== replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 19:20:31 (1517628031)
CMD: onyx-48vm11 lctl set_param fail_loc=0x119
open(O_RDWR|O_CREAT): No space left on device
fail_loc=0x119
CMD: onyx-48vm11 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1436: kill: (7302) - No such process
CMD: onyx-48vm11 lctl set_param fail_loc=0
fail_loc=0
replay-single test_53g: @@@@@@ FAIL: close_pid doesn't exist

== replay-single test 70b: dbench 1mdts recovery; 3 clients == 20:06:47 (1517630807)
onyx-48vm1: [6141] open ./clients/client0/~dmtmp/EXCEL/BEED0000 failed for handle 11169 (No space left on device)
onyx-48vm1: (6142) ERROR: handle 11169 was not found
onyx-48vm1: Child failed with status 1
onyx-48vm4: [6141] open ./clients/client0/~dmtmp/EXCEL/BEED0000 failed for handle 11169 (No space left on device)
onyx-48vm4: (6142) ERROR: handle 11169 was not found
onyx-48vm4: Child failed with status 1
onyx-48vm4: dbench: no process found
onyx-48vm1: dbench: no process found
onyx-48vm3: [6320] open ./clients/client0/~dmtmp/WORDPRO/LWPSAV0.TMP failed for handle 11188 (No space left on device)
onyx-48vm3: (6321) ERROR: handle 11188 was not found
onyx-48vm3: Child failed with status 1
onyx-48vm3: dbench: no process found
onyx-48vm3: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
onyx-48vm4: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
onyx-48vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
CMD: onyx-48vm1,onyx-48vm3,onyx-48vm4 killall -0 dbench
onyx-48vm4: dbench: no process found
onyx-48vm3: dbench: no process found
onyx-48vm1: dbench: no process found
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-48vm1,onyx-48vm3,onyx-48vm4!
CMD: onyx-48vm1,onyx-48vm3,onyx-48vm4 killall dbench
onyx-48vm3: dbench: no process found
onyx-48vm4: dbench: no process found
onyx-48vm1: dbench: no process found
replay-single test_70b: @@@@@@ FAIL: rundbench load on onyx-48vm1,onyx-48vm3,onyx-48vm4 failed!
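Since the set of affected tests varies from session to session, it helps to extract the ENOSPC hits from a suite log mechanically rather than by eye. A hypothetical sketch (the helper name is mine), assuming the log format shown above where each test starts with a "== replay-single test NN:" header:

```shell
# Hypothetical log triage: print the number of each replay-single test
# whose output contains "No space left on device", reading a suite log
# on stdin. Each test is reported once, even if it hits ENOSPC several
# times.
enospc_tests() {
    awk '
        /^== replay-single test / {
            test = $4             # e.g. "20c:"
            sub(/:$/, "", test)
            seen = 0
        }
        /No space left on device/ {
            if (test != "" && !seen) { print test; seen = 1 }
        }'
}
```

Run as `enospc_tests < suite_log`; comparing the lists from two sessions then shows directly which failures are session-specific.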
Logs for these failures can be found at:
lustre-master build # 3703 el7 - https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc
lustre-master build # 3703 el7 servers/sles12sp3 clients - https://testing.hpdd.intel.com/test_sets/b226317a-08a2-11e8-a10a-52540065bddc
lustre-master build # 3703 sles12sp2 - https://testing.hpdd.intel.com/test_sets/4d1259a8-06d6-11e8-a7cd-52540065bddc
lustre-master build # 3702 el7 servers/sles12sp3 clients - https://testing.hpdd.intel.com/test_sets/cee4eabe-04e1-11e8-bd00-52540065bddc
Issue Links
- is related to
- LU-10609 replay-single test_48: FAIL: createmany recraete /mnt/lustre/f48.replay-single failed - Open
- LU-10612 replay-single test_48 defect causes test failure due to error: No space left on device - Resolved
- LU-4846 Failover test failure on test suite replay-single test_26: No space left - Resolved