Loading...

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Labels:
- always_except
Environment:

Hide
all full tests:
clients: EL7 & SLES12, master branch, v2.9.56.11, b3565
  (servers: ldiskfs & zfs, DNE & non-DNE)

one interop test:
clients: EL7, master branch, v2.9.56.11, b3565
  (servers: ldiskfs, b2_9 branch, v2.9.0, b22)

Show
all full tests: clients: EL7 & SLES12, master branch, v2.9.56.11, b3565   (servers: ldiskfs & zfs, DNE & non-DNE) one interop test: clients: EL7, master branch, v2.9.56.11, b3565   (servers: ldiskfs, b2_9 branch, v2.9.0, b22)

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

Seen everywhere test_parallel_grouplock was tested in tag 56 testing (2.9.56):
https://testing.hpdd.intel.com/test_sessions/732dde3a-7e28-437b-8865-c350e9438ee4
https://testing.hpdd.intel.com/test_sessions/f14b71e9-9eda-4053-814d-fdf644925d29
https://testing.hpdd.intel.com/test_sessions/cb12c60c-613a-44b3-bfef-03c0651d2607
https://testing.hpdd.intel.com/test_sessions/30cc75b6-594f-4255-accf-24fe11bdd565
https://testing.hpdd.intel.com/test_sessions/20ddc92f-b9fe-482d-ac1b-1602a513c824
https://testing.hpdd.intel.com/test_sessions/4f7e260b-bce2-4834-b77c-a1b47527d05a

With tag 56 testing, parallel-scale had two subtests (test_cascading_rw & test_parallel_grouplock) that failed 6 of 6 times.

With tag 52-55 testing, some instances of test_cascading_rw failing were seen, but test_parallel_grouplock passed 100% of the time.

With all 6 failures, we saw this sequence:

test_cascading_rw: cascading_rw failed! 1 (covered by ~~LU-9367~~):

From test_log:

/usr/lib64/lustre/tests/cascading_rw is running with 4 process(es) in DEBUG mode
22:55:22: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 0)
[onyx-48vm1:12185] *** Process received signal ***
[onyx-48vm1:12185] Signal: Floating point exception (8)
[onyx-48vm1:12185] Signal code: Integer divide-by-zero (1)
[onyx-48vm1:12185] Failing at address: 0x4024c8
[onyx-48vm1:12185] [ 0] /lib64/libpthread.so.0(+0xf370) [0x7f6060bb8370]
[onyx-48vm1:12185] [ 1] /usr/lib64/lustre/tests/cascading_rw() [0x4024c8]
[onyx-48vm1:12185] [ 2] /usr/lib64/lustre/tests/cascading_rw() [0x402be0]
[onyx-48vm1:12185] [ 3] /usr/lib64/lustre/tests/cascading_rw() [0x40158e]
[onyx-48vm1:12185] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6060809b35]
[onyx-48vm1:12185] [ 5] /usr/lib64/lustre/tests/cascading_rw() [0x40169d]
[onyx-48vm1:12185] *** End of error message ***
[onyx-48vm2.onyx.hpdd.intel.com][[59688,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 12185 on node onyx-48vm1.onyx.hpdd.intel.com exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
 parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4931:error()
  = /usr/lib64/lustre/tests/functions.sh:740:run_cascading_rw()
  = /usr/lib64/lustre/tests/parallel-scale.sh:130:test_cascading_rw()
  = /usr/lib64/lustre/tests/test-framework.sh:5207:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
  = /usr/lib64/lustre/tests/parallel-scale.sh:132:main()

The test_cascading_rw failure was then followed by:

test_parallel_grouplock: test failed to respond and timed out

From test_log:

parallel_grouplock subtests -t 11 PASS

Note: Only for non-DNE configs did subtest 11 pass.

Also from test_log:

CMD: trevis-52vm1.trevis.hpdd.intel.com,trevis-52vm2,trevis-52vm7,trevis-52vm8 lctl clear
+ /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12
+ chmod 0777 /mnt/lustre
drwxrwxrwx 5 root root 4096 Apr 25 13:00 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale.machines -np 5 /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12 "
/usr/lib64/lustre/tests/parallel_grouplock is running with 5 task(es) in DEBUG mode
23:38:55: Running test #/usr/lib64/lustre/tests/parallel_grouplock(iter 0)
23:38:55:	Beginning subtest 12

This was the last activity seen before the test_parallel_grouplock timeout. Nothing obvious was found in any of the console or dmesg logs.

Attachments

Issue Links

is duplicated by

LU-9511 parallel-scale-stress-hw_parallel_grouplock test stuck on subtest 12, timeout 2hours, normally takes < 400sec

Resolved

is related to

LU-9793 sanity test 244 fail

Resolved

LU-9963 add parallel-scale test_parallel_grouplock to ALWAYS_EXCEPT list

Resolved

is related to

LU-9479 sanity test 184d 244: don't instantiate PFL component when taking group lock

Open

LU-9367 parallel-scale test_cascading_rw: cascading_rw failed! 1

Resolved

LU-9344 sanity test_244: sendfile_grouplock test12() test hung

Resolved

mentioned in: Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...

(1 is related to , 8 mentioned in)

parallel-scale test_parallel_grouplock: test failed to respond and timed out

Details

Description

Attachments

Issue Links

Activity

People

Dates