Details
- Bug
- Resolution: Unresolved
- Major
- None
- Lustre 2.10.0, Lustre 2.11.0
- 3
- 9223372036854775807
Description
Seen in every session that ran test_parallel_grouplock during tag 56 (2.9.56) testing:
https://testing.hpdd.intel.com/test_sessions/732dde3a-7e28-437b-8865-c350e9438ee4
https://testing.hpdd.intel.com/test_sessions/f14b71e9-9eda-4053-814d-fdf644925d29
https://testing.hpdd.intel.com/test_sessions/cb12c60c-613a-44b3-bfef-03c0651d2607
https://testing.hpdd.intel.com/test_sessions/30cc75b6-594f-4255-accf-24fe11bdd565
https://testing.hpdd.intel.com/test_sessions/20ddc92f-b9fe-482d-ac1b-1602a513c824
https://testing.hpdd.intel.com/test_sessions/4f7e260b-bce2-4834-b77c-a1b47527d05a
In tag 56 testing, two parallel-scale subtests (test_cascading_rw and test_parallel_grouplock) failed in 6 of 6 runs.
In tag 52-55 testing, test_cascading_rw failed in some runs, but test_parallel_grouplock passed 100% of the time.
All 6 failures followed the same sequence:
test_cascading_rw: cascading_rw failed! 1 (covered by LU-9367):
From test_log:
/usr/lib64/lustre/tests/cascading_rw is running with 4 process(es) in DEBUG mode
22:55:22: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 0)
[onyx-48vm1:12185] *** Process received signal ***
[onyx-48vm1:12185] Signal: Floating point exception (8)
[onyx-48vm1:12185] Signal code: Integer divide-by-zero (1)
[onyx-48vm1:12185] Failing at address: 0x4024c8
[onyx-48vm1:12185] [ 0] /lib64/libpthread.so.0(+0xf370) [0x7f6060bb8370]
[onyx-48vm1:12185] [ 1] /usr/lib64/lustre/tests/cascading_rw() [0x4024c8]
[onyx-48vm1:12185] [ 2] /usr/lib64/lustre/tests/cascading_rw() [0x402be0]
[onyx-48vm1:12185] [ 3] /usr/lib64/lustre/tests/cascading_rw() [0x40158e]
[onyx-48vm1:12185] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6060809b35]
[onyx-48vm1:12185] [ 5] /usr/lib64/lustre/tests/cascading_rw() [0x40169d]
[onyx-48vm1:12185] *** End of error message ***
[onyx-48vm2.onyx.hpdd.intel.com][[59688,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 12185 on node onyx-48vm1.onyx.hpdd.intel.com exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4931:error()
= /usr/lib64/lustre/tests/functions.sh:740:run_cascading_rw()
= /usr/lib64/lustre/tests/parallel-scale.sh:130:test_cascading_rw()
= /usr/lib64/lustre/tests/test-framework.sh:5207:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
= /usr/lib64/lustre/tests/parallel-scale.sh:132:main()
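To compare the crash across the six sessions, the signal report in the test_log excerpt above could be pulled out mechanically. The following is a minimal sketch, not part of the Lustre test framework: the function name `parse_signal_report` and the regular expressions are assumptions derived only from the line format visible in this excerpt, and other Open MPI versions may print the report differently.

```python
import re

# Patterns assumed from the Open MPI signal report shown in the excerpt above.
SIGNAL_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] Signal: (?P<name>.+?) \((?P<num>\d+)\)"
)
CODE_RE = re.compile(r"Signal code: (?P<code>.+?) \(\d+\)")
ADDR_RE = re.compile(r"Failing at address: (?P<addr>0x[0-9a-fA-F]+)")


def parse_signal_report(log_text):
    """Return a dict describing the first crash report in log_text, or None."""
    sig = SIGNAL_RE.search(log_text)
    if sig is None:
        return None
    # Look for the signal code and address only after the "Signal:" line.
    code = CODE_RE.search(log_text, sig.end())
    addr = ADDR_RE.search(log_text, sig.end())
    return {
        "host": sig.group("host"),
        "pid": int(sig.group("pid")),
        "signal": sig.group("name"),
        "signal_number": int(sig.group("num")),
        "signal_code": code.group("code") if code else None,
        "failing_address": addr.group("addr") if addr else None,
    }
```

If every session reports the same failing address (0x4024c8 here), that would suggest a single divide-by-zero site in cascading_rw rather than several independent crashes.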
The test_cascading_rw failure was then followed by:
test_parallel_grouplock: test failed to respond and timed out
From test_log:
parallel_grouplock subtests -t 11 PASS
Note: Subtest 11 passed only in non-DNE configurations.
Also from test_log:
CMD: trevis-52vm1.trevis.hpdd.intel.com,trevis-52vm2,trevis-52vm7,trevis-52vm8 lctl clear
+ /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12
+ chmod 0777 /mnt/lustre
drwxrwxrwx 5 root root 4096 Apr 25 13:00 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale.machines -np 5 /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12 "
/usr/lib64/lustre/tests/parallel_grouplock is running with 5 task(es) in DEBUG mode
23:38:55: Running test #/usr/lib64/lustre/tests/parallel_grouplock(iter 0)
23:38:55: Beginning subtest 12
This was the last activity seen before the test_parallel_grouplock timeout. Nothing obvious was found in any of the console or dmesg logs.
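Because the hang leaves no error in the console or dmesg logs, the quickest way to locate it across sessions is to find the last subtest that began without a matching PASS line. The sketch below assumes the two line formats seen in the test_log excerpts above ("Beginning subtest N" and "subtests -t N PASS"); the function name `find_unfinished_subtest` is hypothetical, and a subtest that failed outright rather than hung would be reported the same way.

```python
import re

# Line formats assumed from the test_log excerpts in this ticket.
BEGIN_RE = re.compile(r"Beginning subtest (\d+)")
PASS_RE = re.compile(r"subtests -t (\d+) PASS")


def find_unfinished_subtest(log_text):
    """Return the most recent subtest that began but has no PASS line.

    Returns None if every subtest that began was also reported as PASS.
    """
    passed = {int(n) for n in PASS_RE.findall(log_text)}
    started = [int(n) for n in BEGIN_RE.findall(log_text)]
    # Walk backwards so the most recently started subtest is checked first.
    for n in reversed(started):
        if n not in passed:
            return n
    return None
```

Run against the log from any of the six sessions, this would be expected to return 12, matching the "Beginning subtest 12" line that precedes the timeout.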
Issue Links
- is duplicated by
  - LU-9511 parallel-scale-stress-hw_parallel_grouplock test stuck on subtest 12, timeout 2hours, normally takes < 400sec (Resolved)
- is related to
  - LU-9793 sanity test 244 fail (Resolved)
  - LU-9963 add parallel-scale test_parallel_grouplock to ALWAYS_EXCEPT list (Resolved)
- is related to
  - LU-9479 sanity test 184d 244: don't instantiate PFL component when taking group lock (Open)
  - LU-9367 parallel-scale test_cascading_rw: cascading_rw failed! 1 (Resolved)
  - LU-9344 sanity test_244: sendfile_grouplock test12() test hung (Resolved)