[LU-9429] parallel-scale test_parallel_grouplock: test failed to respond and timed out Created: 02/May/17  Updated: 04/Jan/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Casper Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: always_except
Environment:

all full tests:
clients: EL7 & SLES12, master branch, v2.9.56.11, b3565
(servers: ldiskfs & zfs, DNE & non-DNE)

one interop test:
clients: EL7, master branch, v2.9.56.11, b3565
(servers: ldiskfs, b2_9 branch, v2.9.0, b22)


Issue Links:
Duplicate
is duplicated by LU-9511 parallel-scale-stress-hw_parallel_gro... Resolved
Related
is related to LU-9479 sanity test 184d 244: don't instantia... Open
is related to LU-9367 parallel-scale test_cascading_rw: cas... Resolved
is related to LU-9344 sanity test_244: sendfile_grouplock t... Resolved
is related to LU-9793 sanity test 244 fail Resolved
is related to LU-9963 add parallel-scale test_parallel_grou... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Seen everywhere test_parallel_grouplock was tested in tag 56 testing (2.9.56):
https://testing.hpdd.intel.com/test_sessions/732dde3a-7e28-437b-8865-c350e9438ee4
https://testing.hpdd.intel.com/test_sessions/f14b71e9-9eda-4053-814d-fdf644925d29
https://testing.hpdd.intel.com/test_sessions/cb12c60c-613a-44b3-bfef-03c0651d2607
https://testing.hpdd.intel.com/test_sessions/30cc75b6-594f-4255-accf-24fe11bdd565
https://testing.hpdd.intel.com/test_sessions/20ddc92f-b9fe-482d-ac1b-1602a513c824
https://testing.hpdd.intel.com/test_sessions/4f7e260b-bce2-4834-b77c-a1b47527d05a

With tag 56 testing, parallel-scale had two subtests (test_cascading_rw & test_parallel_grouplock) that failed 6 of 6 times.

With tag 52-55 testing, test_cascading_rw failed in some instances, but test_parallel_grouplock passed 100% of the time.

In all 6 failures, we saw this sequence:

test_cascading_rw: cascading_rw failed! 1 (covered by LU-9367):

From test_log:

/usr/lib64/lustre/tests/cascading_rw is running with 4 process(es) in DEBUG mode
22:55:22: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 0)
[onyx-48vm1:12185] *** Process received signal ***
[onyx-48vm1:12185] Signal: Floating point exception (8)
[onyx-48vm1:12185] Signal code: Integer divide-by-zero (1)
[onyx-48vm1:12185] Failing at address: 0x4024c8
[onyx-48vm1:12185] [ 0] /lib64/libpthread.so.0(+0xf370) [0x7f6060bb8370]
[onyx-48vm1:12185] [ 1] /usr/lib64/lustre/tests/cascading_rw() [0x4024c8]
[onyx-48vm1:12185] [ 2] /usr/lib64/lustre/tests/cascading_rw() [0x402be0]
[onyx-48vm1:12185] [ 3] /usr/lib64/lustre/tests/cascading_rw() [0x40158e]
[onyx-48vm1:12185] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6060809b35]
[onyx-48vm1:12185] [ 5] /usr/lib64/lustre/tests/cascading_rw() [0x40169d]
[onyx-48vm1:12185] *** End of error message ***
[onyx-48vm2.onyx.hpdd.intel.com][[59688,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 12185 on node onyx-48vm1.onyx.hpdd.intel.com exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
 parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4931:error()
  = /usr/lib64/lustre/tests/functions.sh:740:run_cascading_rw()
  = /usr/lib64/lustre/tests/parallel-scale.sh:130:test_cascading_rw()
  = /usr/lib64/lustre/tests/test-framework.sh:5207:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
  = /usr/lib64/lustre/tests/parallel-scale.sh:132:main()

The test_cascading_rw failure was then followed by:

test_parallel_grouplock: test failed to respond and timed out

From test_log:

parallel_grouplock subtests -t 11 PASS

Note: subtest 11 passed only in non-DNE configs.

Also from test_log:

CMD: trevis-52vm1.trevis.hpdd.intel.com,trevis-52vm2,trevis-52vm7,trevis-52vm8 lctl clear
+ /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12
+ chmod 0777 /mnt/lustre
drwxrwxrwx 5 root root 4096 Apr 25 13:00 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale.machines -np 5 /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12 "
/usr/lib64/lustre/tests/parallel_grouplock is running with 5 task(es) in DEBUG mode
23:38:55: Running test #/usr/lib64/lustre/tests/parallel_grouplock(iter 0)
23:38:55:	Beginning subtest 12

This was the last activity seen before the test_parallel_grouplock timeout. Nothing obvious was found in any of the console or dmesg logs.
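
For reference, the parallel_grouplock subtests exercise the Lustre group lock: some MPI ranks take a group lock on a shared file while others issue plain reads/writes that must wait for it. A minimal sketch of the group-lock step is below; the file path, gid, and header location are illustrative assumptions, not taken from the test log:

/* Minimal sketch of the group-lock step a parallel_grouplock rank performs.
 * Assumes a Lustre client mount and the lustre_user.h uapi header (its
 * install path varies by Lustre version); path and gid are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <lustre/lustre_user.h>         /* LL_IOC_GROUP_LOCK */

int main(void)
{
        const char *path = "/mnt/lustre/d0.parallel_grouplock/testfile";
        int gid = 1234;                 /* ranks share one group id */
        int fd, rc;

        fd = open(path, O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Ranks holding the same gid may do I/O concurrently; any other
         * extent lock request on the file waits until the group lock is
         * dropped. */
        rc = ioctl(fd, LL_IOC_GROUP_LOCK, gid);
        if (rc < 0)
                perror("LL_IOC_GROUP_LOCK");

        /* ... overlapping reads/writes under the group lock ... */

        if (rc == 0 && ioctl(fd, LL_IOC_GROUP_UNLOCK, gid) < 0)
                perror("LL_IOC_GROUP_UNLOCK");

        close(fd);
        return 0;
}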



 Comments   
Comment by Peter Jones [ 03/May/17 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Gerrit Updater [ 04/May/17 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26943
Subject: LU-9429 mpi: parallel_grouplock.c group_test4 hung
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cd96ac0012419916adb473cb9e2a7e21c6f8c963

Comment by James Casper [ 04/May/17 ]

parallel-scale test_parallel_grouplock TIMEOUTs in the last 16 months (in master):
> 2016-01-01 to 2017-02-02: 0 occurrences
> 2017-02-03 to 2017-04-05: 18 occurrences
> since 2017-04-05: 87 occurrences
>
> tag 55 test (b3550): 2017-04-05 (parallel_grouplock 100% passing)
>
> landed 2017-04-08:
> LU-8998 pfl: Basic data structures for composite layout — jinshan.xiong / gitweb
> LU-8998 pfl: enhance PFID EA for PFL — jinshan.xiong / gitweb
> LU-8998 pfl: layout LFSCK handles PFL file — jinshan.xiong / gitweb
> LU-8998 pfl: test cases for lfsck on PFL — jinshan.xiong / gitweb
> LU-9008 pfl: dynamic layout modification with write/truncate — jinshan.xiong / gitweb
> LU-9165 pfl: MDS handling of write intent IT_LAYOUT RPC — jinshan.xiong / gitweb
>
> LU-8998 clio: Client side implementation for PFL — jinshan.xiong / gitweb
> LU-8998 lfs: user space tools for PFL — jinshan.xiong / gitweb
> LU-8998 docs: man pages for tools of PFL — jinshan.xiong / gitweb
> LU-8998 tests: test scripts for PFL — jinshan.xiong / gitweb
>
> tag 56 test (b3565): 2017-04-23 (parallel_grouplock 100% failing)

Comment by Andreas Dilger [ 23/May/17 ]

Will this MPI application failure be fixed by LU-9490 patch https://review.whamcloud.com/27183
"LU-9490 llite: return v1/v3 layout for legacy app" ? We really shouldn't be breaking MPI applications due to PFL, and I think a workaround in the Lustre code is warranted.

Comment by Zhenyu Xu [ 24/May/17 ]

I think the group lock test hang won't be fixed by #27183.

Partial OST object instantiation triggers a layout change when an uninstantiated component extent is written; the layout is then fetched and refreshed so the IO can continue. However, taking the group lock beforehand increments lov_io::lo_active_ios (decremented in cl_put_grouplock), so the layout-refreshing IO encounters this lo_active_ios in lov_conf_set() and waits for it to drop to 0. This implies there should be no race between taking the group lock and IO that cares about layout changes (vvp_io_init() calls ll_layout_refresh() if !io->ci_ignore_layout).
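
A toy userspace model of that wait, assuming nothing beyond pthreads (the names mirror the Lustre ones, but this is not Lustre code): one thread stands in for the group-lock holder pinning the active-IO count, the other for lov_conf_set() waiting for the count to reach zero before applying the layout change. If the refresh is needed on behalf of IO running under that same group lock, the count never drops and the test hangs.

/* Toy pthread model of the lo_active_ios wait described above; this is
 * not Lustre code, only an illustration of the dependency. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int active_ios;                  /* stands in for lov_io::lo_active_ios */

static void *group_lock_holder(void *arg)
{
        pthread_mutex_lock(&lock);
        active_ios++;                   /* taking the group lock */
        pthread_mutex_unlock(&lock);

        sleep(3);                       /* IO running under the group lock */

        pthread_mutex_lock(&lock);
        active_ios--;                   /* cl_put_grouplock() */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void *layout_refresher(void *arg)
{
        /* stands in for lov_conf_set(): the refreshed layout cannot be
         * installed until no IO is active */
        pthread_mutex_lock(&lock);
        while (active_ios != 0)
                pthread_cond_wait(&cond, &lock);
        printf("layout change applied\n");
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t holder, refresher;

        pthread_create(&holder, NULL, group_lock_holder, NULL);
        sleep(1);                       /* let the group lock land first */
        pthread_create(&refresher, NULL, layout_refresher, NULL);

        pthread_join(holder, NULL);
        pthread_join(refresher, NULL);
        return 0;
}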

Comment by Andreas Dilger [ 06/Jun/17 ]

Will this test program be fixed by patch https://review.whamcloud.com/26646 "LU-9344 test: hung with sendfile_grouplock test12()" because group locking always instantiates the PFL components?

Comment by Zhenyu Xu [ 08/Jun/17 ]

I tried it on my VM; it shows that the hang is not caused by the PFL issue. It looks like those write threads are waiting for extent locks, which are blocked by the group lock.

parallel_grou S 0000000000000001     0 21418  21416 0x00000080
 ffff8800296cf788 0000000000000086 ccf27c28fd06e26f 000006b4ffffff9d
 593905fb000053aa 0000000000000000 ffff880000000001 00000000fffffffc
 ffff8800296cf798 ffff88001b92b0c4 ffff88001cf59060 ffff8800296cffd8
Call Trace:
 [<ffffffffa064e38d>] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
 [<ffffffff810640e0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0648506>] ldlm_cli_enqueue_fini+0x936/0xe30 [ptlrpc]
 [<ffffffffa06699d1>] ? ptlrpc_set_destroy+0x2d1/0x450 [ptlrpc]
 [<ffffffffa064c88d>] ldlm_cli_enqueue+0x3ad/0x7d0 [ptlrpc]
 [<ffffffffa064dd10>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [<ffffffffa098fab0>] ? osc_ldlm_blocking_ast+0x0/0x3c0 [osc]
 [<ffffffffa098f510>] ? osc_ldlm_glimpse_ast+0x0/0x340 [osc]
 [<ffffffffa097edff>] osc_enqueue_base+0x1ff/0x630 [osc]
 [<ffffffffa099056d>] osc_lock_enqueue+0x2bd/0xa00 [osc]
 [<ffffffffa0991f90>] ? osc_lock_upcall+0x0/0x530 [osc]
 [<ffffffffa04d8b9b>] cl_lock_enqueue+0x6b/0x120 [obdclass]
 [<ffffffffa0407e17>] lov_lock_enqueue+0x97/0x140 [lov]
 [<ffffffffa04d8b9b>] cl_lock_enqueue+0x6b/0x120 [obdclass]
 [<ffffffffa04d958b>] cl_lock_request+0x7b/0x200 [obdclass]
 [<ffffffffa04dd301>] cl_io_lock+0x381/0x3d0 [obdclass]
 [<ffffffffa04dd466>] cl_io_loop+0x116/0xb20 [obdclass]
 [<ffffffffa0663066>] ? interval_insert+0x296/0x410 [ptlrpc]
 [<ffffffffa0e7d9e1>] ll_file_io_generic+0x231/0xaa0 [lustre]
 [<ffffffffa0e8021d>] ll_file_aio_write+0x13d/0x280 [lustre]
 [<ffffffffa0e8049a>] ll_file_write+0x13a/0x270 [lustre]
 [<ffffffff81189ef8>] vfs_write+0xb8/0x1a0
 [<ffffffff8118b40f>] ? fget_light_pos+0x3f/0x50
 [<ffffffff8118aa31>] sys_write+0x51/0xb0
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
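
The pattern in that backtrace can be shown outside MPI with a two-process sketch: one process holds the group lock while a second writes to the same file through its own descriptor, and the writer blocks in the extent-lock enqueue (the ldlm_completion_ast wait above) until the group lock is released. The mount point, path, gid, and header location below are assumptions for illustration, not taken from the test:

/* Sketch of the extent-lock vs. group-lock conflict seen in the backtrace.
 * Assumes a Lustre client mount at /mnt/lustre and the lustre_user.h uapi
 * header; the path and gid are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <unistd.h>
#include <lustre/lustre_user.h>         /* LL_IOC_GROUP_LOCK */

int main(void)
{
        const char *path = "/mnt/lustre/grouplock_demo";
        int gid = 4711;
        int fd = open(path, O_CREAT | O_RDWR, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (ioctl(fd, LL_IOC_GROUP_LOCK, gid) < 0) {
                perror("LL_IOC_GROUP_LOCK");
                return 1;
        }

        if (fork() == 0) {
                /* child: its own descriptor is not covered by the group
                 * lock, so the buffered write must enqueue a PW extent
                 * lock and waits until the parent unlocks */
                char buf[64];
                int wfd = open(path, O_WRONLY);

                memset(buf, 'x', sizeof(buf));
                if (wfd < 0 || write(wfd, buf, sizeof(buf)) < 0)
                        perror("child write");
                else
                        printf("child: write completed\n");
                _exit(0);
        }

        sleep(5);                       /* child stays blocked here */
        if (ioctl(fd, LL_IOC_GROUP_UNLOCK, gid) < 0)
                perror("LL_IOC_GROUP_UNLOCK");

        wait(NULL);
        close(fd);
        return 0;
}
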
Comment by Andreas Dilger [ 23/Aug/17 ]

Bobijam, could you please provide a brief description of what is needed to fix this problem, and how much work it will be to implement?
