Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.12.1
None
3
9223372036854775807
Description
Several of the tests in parallel-scale are failing with some variant of ‘No space left on device’. One failed parallel-scale test suite is at
https://testing.hpdd.intel.com/test_sets/fde9d7ba-dae4-11e7-8027-52540065bddc
The tests that fail are compilebench, simul, connectathon, iorssf, iorfpp, ior_mdtest_parallel_ssf, ior_mdtest_parallel_fpp, and fio.
For test_compilebench, the test first checks whether the file system has enough free space and skips the test if it does not. compilebench needs about 1 GB of space to run. In this case, 12482080 KB (roughly 11.9 GiB) are free, so the test runs. The following is from the client test log:
== parallel-scale test compilebench: compilebench ==================================================== 16:17:36 (1512577056)
OPTIONS:
cbench_DIR=/usr/bin
cbench_IDIRS=2
cbench_RUNS=2
trevis-3vm1.trevis.hpdd.intel.com
trevis-3vm2
free space = 12482080 KB
./compilebench -D /mnt/lustre/d0.compilebench.5990 -i 2 -r 2 --makej
using working directory /mnt/lustre/d0.compilebench.5990, 2 intial dirs 2 runs
native unpatched native-0 222MB in 285.09 seconds (0.78 MB/s)
native patched native-0 109MB in 48.51 seconds (2.26 MB/s)
native patched compiled native-0 691MB in 169.31 seconds (4.08 MB/s)
create dir kernel-0 222MB in 132.84 seconds (1.67 MB/s)
create dir kernel-1 222MB in 150.75 seconds (1.48 MB/s)
compile dir kernel-1 680MB in 172.82 seconds (3.94 MB/s)
Traceback (most recent call last):
  File "./compilebench", line 594, in <module>
    if not compile_one_dir(dset, rnd):
  File "./compilebench", line 368, in compile_one_dir
    mbs = run_directory(ch[0], dir, "compile dir")
  File "./compilebench", line 243, in run_directory
    fp.write(buf)
IOError: [Errno 28] No space left on device
 parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1
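The free space reported in the log can be compared against the stated requirement to confirm that the skip check would not have triggered. The following is a minimal Python sketch, assuming the log contains a line of the form "free space = N KB" as shown above. The function names and the reading of "about 1 GB" as 1024 * 1024 KB are illustrative assumptions, not part of the Lustre test framework.

```python
import re

# Matches the free-space line printed in the client test log.
FREE_SPACE_RE = re.compile(r"free space = (\d+) KB")


def parse_free_space_kb(log_text):
    """Return the free space in KB reported in the log text, or None if absent."""
    match = FREE_SPACE_RE.search(log_text)
    return int(match.group(1)) if match else None


def has_enough_space(free_kb, required_kb):
    """Return True if the reported free space meets the requirement."""
    return free_kb >= required_kb


log_text = "trevis-3vm2 free space = 12482080 KB"
free_kb = parse_free_space_kb(log_text)
required_kb = 1024 * 1024  # assumption: "about 1 GB" taken as 1 GiB in KB
print(free_kb, has_enough_space(free_kb, required_kb))
```

With the value from this run the check passes by a wide margin, yet the write still fails with errno 28, which suggests that the aggregate free-space figure alone does not explain the failure.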
metabench runs with no problems, but simul fails with:
16:46:21: Process 0(trevis-3vm1.trevis.hpdd.intel.com): FAILED in create_files, write in file /mnt/lustre/d0.simul/simul_read.0: No space left on device
Similar to compilebench, connectathon checks how much space is available on the file system and skips the test if there is not enough; it needs about 40 MB. The client test log shows free space = 10654792 KB on the Lustre file system, yet the test fails:
./test5: read and write
./test5: (/mnt/lustre/d0.connectathon) 'bigfile' write failed : No space left on device
basic tests failed
 parallel-scale test_connectathon: @@@@@@ FAIL: connectathon failed: 1
The “bigfile” is 30 MB.
In the MDS (vm4) console log for this test session (https://testing.hpdd.intel.com/test_sessions/6c155f47-820d-447d-893f-15b24418827f), we see the following errors even though metabench passed:
[29644.867500] Lustre: DEBUG MARKER: == parallel-scale test metabench: metabench ========================================================== 16:38:42 (1512578322)
[29723.246493] LustreError: 19423:0:(osp_precreate.c:657:osp_precreate_send()) lustre-OST0000-osc-MDT0000: precreate fid [0x100000000:0xc7e76:0x0] < local used fid [0x100000000:0xc7e76:0x0]: rc = -116
[29723.252559] LustreError: 19423:0:(osp_precreate.c:1282:osp_precreate_thread()) lustre-OST0000-osc-MDT0000: cannot precreate objects: rc = -116
[30044.166827] sched: RT throttling activated
[30098.760027] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[30099.740367] Lustre: DEBUG MARKER: rc=0;
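To inspect the osp_precreate_send() message more closely, the two FIDs and the return code can be pulled out of the line and compared. The following is a rough Python sketch; the regular expression is written against the message text shown above, and the field names seq, oid and ver are an assumed reading of the three colon-separated FID components rather than anything taken from the Lustre source.

```python
import re
from typing import NamedTuple


class Fid(NamedTuple):
    seq: int  # assumed: first component of the FID
    oid: int  # assumed: second component of the FID
    ver: int  # assumed: third component of the FID


HEX = r"(0x[0-9a-f]+)"
PRECREATE_RE = re.compile(
    r"precreate fid \[" + HEX + ":" + HEX + ":" + HEX + r"\] "
    r"< local used fid \[" + HEX + ":" + HEX + ":" + HEX + r"\]: rc = (-?\d+)"
)


def parse_precreate_error(line):
    """Return (precreate_fid, local_used_fid, rc), or None if the line does not match."""
    match = PRECREATE_RE.search(line)
    if match is None:
        return None
    numbers = [int(value, 16) for value in match.groups()[:6]]
    return Fid(*numbers[:3]), Fid(*numbers[3:]), int(match.group(7))


line = (
    "[29723.246493] LustreError: 19423:0:(osp_precreate.c:657:osp_precreate_send()) "
    "lustre-OST0000-osc-MDT0000: precreate fid [0x100000000:0xc7e76:0x0] "
    "< local used fid [0x100000000:0xc7e76:0x0]: rc = -116"
)
precreate_fid, used_fid, rc = parse_precreate_error(line)
print(precreate_fid == used_fid, rc)
```

For the line in this log the two FIDs parse as equal even though the message text says "<", and rc is -116; on Linux, errno 116 is ESTALE.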
Similarly, for another test that passed, mdtestfpp, in the same MDS console log:
[30552.987700] Lustre: DEBUG MARKER: == parallel-scale test mdtestfpp: mdtestfpp ========================================================== 16:53:49 (1512579229)
[30792.302159] LustreError: 19423:0:(osp_precreate.c:657:osp_precreate_send()) lustre-OST0000-osc-MDT0000: precreate fid [0x100000000:0xf05cd:0x0] < local used fid [0x100000000:0xf05cd:0x0]: rc = -116
[30792.307280] LustreError: 19423:0:(osp_precreate.c:1282:osp_precreate_thread()) lustre-OST0000-osc-MDT0000: cannot precreate objects: rc = -116
[30792.307283] LustreError: 19401:0:(osp_precreate.c:1334:osp_precreate_ready_condition()) lustre-OST0000-osc-MDT0000: precreate failed opd_pre_status -116
[30792.307290] LustreError: 19401:0:(osp_precreate.c:1334:osp_precreate_ready_condition()) Skipped 1 previous similar message
[30917.997538] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[30919.134892] Lustre: DEBUG MARKER: rc=0;
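Because the precreate errors appear during tests that pass as well as tests that fail, it can help to attribute each error to the test that was running when it was logged. The following is a minimal Python sketch that walks a console log and counts LustreError lines from osp_precreate.c under the most recent test marker. The function name and the marker pattern are assumptions based on the DEBUG MARKER lines shown above.

```python
import re
from collections import Counter

# Matches test-start markers such as
# "DEBUG MARKER: == parallel-scale test metabench: metabench ===..."
MARKER_RE = re.compile(r"DEBUG MARKER: == \S+ test ([^:\s]+):")


def count_precreate_errors(console_lines):
    """Count osp_precreate.c LustreError lines per test, keyed by test name."""
    counts = Counter()
    current_test = None
    for line in console_lines:
        marker = MARKER_RE.search(line)
        if marker:
            current_test = marker.group(1)
            continue
        if current_test and "LustreError" in line and "osp_precreate.c" in line:
            counts[current_test] += 1
    return counts


console_lines = [
    "[29644.867500] Lustre: DEBUG MARKER: == parallel-scale test metabench: metabench "
    "========================================================== 16:38:42 (1512578322)",
    "[29723.246493] LustreError: 19423:0:(osp_precreate.c:657:osp_precreate_send()) "
    "lustre-OST0000-osc-MDT0000: precreate fid [0x100000000:0xc7e76:0x0] "
    "< local used fid [0x100000000:0xc7e76:0x0]: rc = -116",
    "[29723.252559] LustreError: 19423:0:(osp_precreate.c:1282:osp_precreate_thread()) "
    "lustre-OST0000-osc-MDT0000: cannot precreate objects: rc = -116",
    "[30044.166827] sched: RT throttling activated",
]
print(dict(count_precreate_errors(console_lines)))
```

Markers that do not start a test, such as the "lctl set_param" lines, do not match the pattern, so errors stay attributed to the last test that started.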
This failure looks similar to LU-7834, except that only test_compilebench fails.
Note: These failures are only seen on full test sessions.
Logs for parallel-scale ‘No space left on device’ failures with several tests failing are at
https://testing.hpdd.intel.com/test_sets/4fdaa576-daa0-11e7-9c63-52540065bddc
https://testing.hpdd.intel.com/test_sets/0e389fce-da73-11e7-8027-52540065bddc
The first time parallel-scale test_compilebench failed with this error and with osp_precreate_thread errors in the MDS console log was on 2017-11-22 with master build # 3672.
Issue Links
- is related to
  - LU-10382 posix test_1: Run POSIX testsuite on /mnt/lustre failed (Closed)
- is related to
  - LU-10350 ost-pools test 1n fails with 'failed to write to /mnt/lustre/d1n.ost-pools/file: 1' (Resolved)
  - LU-9324 sanity-pfl test 10 needs to reset the file system default layout (Resolved)