[LU-9429] parallel-scale test_parallel_grouplock: test failed to respond and timed out Created: 02/May/17 Updated: 04/Jan/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Casper | Assignee: | Zhenyu Xu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | always_except |
| Environment: |
all full tests: one interop test: |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Seen everywhere test_parallel_grouplock was tested in tag 56 testing (2.9.56):

With tag 56 testing, parallel-scale had two subtests (test_cascading_rw & test_parallel_grouplock) that failed 6 of 6 times. With tag 52-55 testing, some instances of test_cascading_rw failing were seen, but test_parallel_grouplock passed 100% of the time.

With all 6 failures, we saw this sequence:

test_cascading_rw: cascading_rw failed! 1 (covered by

From test_log:
/usr/lib64/lustre/tests/cascading_rw is running with 4 process(es) in DEBUG mode
22:55:22: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 0)
[onyx-48vm1:12185] *** Process received signal ***
[onyx-48vm1:12185] Signal: Floating point exception (8)
[onyx-48vm1:12185] Signal code: Integer divide-by-zero (1)
[onyx-48vm1:12185] Failing at address: 0x4024c8
[onyx-48vm1:12185] [ 0] /lib64/libpthread.so.0(+0xf370) [0x7f6060bb8370]
[onyx-48vm1:12185] [ 1] /usr/lib64/lustre/tests/cascading_rw() [0x4024c8]
[onyx-48vm1:12185] [ 2] /usr/lib64/lustre/tests/cascading_rw() [0x402be0]
[onyx-48vm1:12185] [ 3] /usr/lib64/lustre/tests/cascading_rw() [0x40158e]
[onyx-48vm1:12185] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6060809b35]
[onyx-48vm1:12185] [ 5] /usr/lib64/lustre/tests/cascading_rw() [0x40169d]
[onyx-48vm1:12185] *** End of error message ***
[onyx-48vm2.onyx.hpdd.intel.com][[59688,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 12185 on node onyx-48vm1.onyx.hpdd.intel.com exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4931:error()
= /usr/lib64/lustre/tests/functions.sh:740:run_cascading_rw()
= /usr/lib64/lustre/tests/parallel-scale.sh:130:test_cascading_rw()
= /usr/lib64/lustre/tests/test-framework.sh:5207:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
= /usr/lib64/lustre/tests/parallel-scale.sh:132:main()

The test_cascading_rw failure was then followed by:
test_parallel_grouplock: test failed to respond and timed out

From test_log:
parallel_grouplock subtests -t 11 PASS
Note: only for non-DNE configs did subtest 11 pass.

Also from test_log:
CMD: trevis-52vm1.trevis.hpdd.intel.com,trevis-52vm2,trevis-52vm7,trevis-52vm8 lctl clear
+ /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12
+ chmod 0777 /mnt/lustre
drwxrwxrwx 5 root root 4096 Apr 25 13:00 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale.machines -np 5 /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12 "
/usr/lib64/lustre/tests/parallel_grouplock is running with 5 task(es) in DEBUG mode
23:38:55: Running test #/usr/lib64/lustre/tests/parallel_grouplock(iter 0)
23:38:55: Beginning subtest 12

This was the last activity seen before the test_parallel_grouplock timeout. Nothing obvious was found in any of the console or dmesg logs. |
| Comments |
| Comment by Peter Jones [ 03/May/17 ] |
|
Bobijam Could you please advise on this one? Thanks Peter |
| Comment by Gerrit Updater [ 04/May/17 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26943 |
| Comment by James Casper [ 04/May/17 ] |
|
parallel-scale test_parallel_grouplock TIMEOUTs in last 16 months (in master): |
| Comment by Andreas Dilger [ 23/May/17 ] |
|
Will this MPI application failure be fixed by |
| Comment by Zhenyu Xu [ 24/May/17 ] |
|
I think the group lock test hang won't be fixed by patch #27183. Partial OST object instantiation triggers a layout change when an uninitialized component extent is written, and the layout is then fetched and refreshed so the IO can continue. However, taking the group lock first increases lov_io::lo_active_ios (decreased again in cl_put_grouplock), so the layout-refreshing IO encounters this non-zero lo_active_ios in lov_conf_set() and waits for it to drop to 0. This implies that there must be no race between taking the group lock and IO that cares about layout changes (vvp_io_init() will call ll_layout_refresh() if !io->ci_ignore_layout). |
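To illustrate the ordering described above, here is a minimal user-space sketch in C (pthreads only, not Lustre kernel code; obj_state, active_ios and layout_gen are placeholder names for the real lo_active_ios/layout machinery): a group-lock holder pins the object as IO-active, and a layout change must wait until the active-IO count drops to zero, so a layout refresh issued while the group lock is still held will wait.

/*
 * Simplified illustration only -- not Lustre source.  A group-lock
 * holder counts as an active IO on the object; a layout change
 * (analogue of lov_conf_set()) must wait until no IO is active.
 */
#include <pthread.h>

struct obj_state {
	pthread_mutex_t lock;
	pthread_cond_t  idle;       /* signalled when active_ios reaches 0 */
	int             active_ios; /* analogue of lo_active_ios */
	int             layout_gen; /* analogue of the layout generation */
};

/* Taking the group lock counts as starting an IO on the object. */
static void grouplock_get(struct obj_state *o)
{
	pthread_mutex_lock(&o->lock);
	o->active_ios++;
	pthread_mutex_unlock(&o->lock);
}

/* Releasing the group lock drops the IO count and wakes any waiter. */
static void grouplock_put(struct obj_state *o)
{
	pthread_mutex_lock(&o->lock);
	if (--o->active_ios == 0)
		pthread_cond_broadcast(&o->idle);
	pthread_mutex_unlock(&o->lock);
}

/* A layout change must wait for the object to become idle first. */
static void layout_change(struct obj_state *o)
{
	pthread_mutex_lock(&o->lock);
	while (o->active_ios > 0)
		pthread_cond_wait(&o->idle, &o->lock);
	o->layout_gen++;
	pthread_mutex_unlock(&o->lock);
}

In these terms, if a write to an uninitialized component triggers layout_change() after grouplock_get() but before grouplock_put(), the layout refresh waits until the group lock is dropped; that is the serialization (and the potential for a hang) described above.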
| Comment by Andreas Dilger [ 06/Jun/17 ] |
|
Will this test program be fixed by patch https://review.whamcloud.com/26646 " |
| Comment by Zhenyu Xu [ 08/Jun/17 ] |
|
I tried it on my VM; it shows that the hang is not caused by the PFL issue. It looks like the write threads are waiting for extent locks, which are blocked by the group lock.

parallel_grou S 0000000000000001 0 21418 21416 0x00000080
 ffff8800296cf788 0000000000000086 ccf27c28fd06e26f 000006b4ffffff9d
 593905fb000053aa 0000000000000000 ffff880000000001 00000000fffffffc
 ffff8800296cf798 ffff88001b92b0c4 ffff88001cf59060 ffff8800296cffd8
Call Trace:
 [<ffffffffa064e38d>] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
 [<ffffffff810640e0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0648506>] ldlm_cli_enqueue_fini+0x936/0xe30 [ptlrpc]
 [<ffffffffa06699d1>] ? ptlrpc_set_destroy+0x2d1/0x450 [ptlrpc]
 [<ffffffffa064c88d>] ldlm_cli_enqueue+0x3ad/0x7d0 [ptlrpc]
 [<ffffffffa064dd10>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [<ffffffffa098fab0>] ? osc_ldlm_blocking_ast+0x0/0x3c0 [osc]
 [<ffffffffa098f510>] ? osc_ldlm_glimpse_ast+0x0/0x340 [osc]
 [<ffffffffa097edff>] osc_enqueue_base+0x1ff/0x630 [osc]
 [<ffffffffa099056d>] osc_lock_enqueue+0x2bd/0xa00 [osc]
 [<ffffffffa0991f90>] ? osc_lock_upcall+0x0/0x530 [osc]
 [<ffffffffa04d8b9b>] cl_lock_enqueue+0x6b/0x120 [obdclass]
 [<ffffffffa0407e17>] lov_lock_enqueue+0x97/0x140 [lov]
 [<ffffffffa04d8b9b>] cl_lock_enqueue+0x6b/0x120 [obdclass]
 [<ffffffffa04d958b>] cl_lock_request+0x7b/0x200 [obdclass]
 [<ffffffffa04dd301>] cl_io_lock+0x381/0x3d0 [obdclass]
 [<ffffffffa04dd466>] cl_io_loop+0x116/0xb20 [obdclass]
 [<ffffffffa0663066>] ? interval_insert+0x296/0x410 [ptlrpc]
 [<ffffffffa0e7d9e1>] ll_file_io_generic+0x231/0xaa0 [lustre]
 [<ffffffffa0e8021d>] ll_file_aio_write+0x13d/0x280 [lustre]
 [<ffffffffa0e8049a>] ll_file_write+0x13a/0x270 [lustre]
 [<ffffffff81189ef8>] vfs_write+0xb8/0x1a0
 [<ffffffff8118b40f>] ? fget_light_pos+0x3f/0x50
 [<ffffffff8118aa31>] sys_write+0x51/0xb0
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b |
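For reference, the blocked-writer scenario in the stack above can be reproduced with a small stand-alone program along these lines (a sketch only, assuming the standard LL_IOC_GROUP_LOCK/LL_IOC_GROUP_UNLOCK ioctls from lustre_user.h and a mounted client at /mnt/lustre; the file name and header path are placeholders): one process takes and holds the group lock, and a second open of the same file then issues a plain write, whose extent lock enqueue waits in ldlm_completion_ast() until the group lock is released.

/*
 * Sketch of the blocking scenario, not the actual parallel_grouplock
 * test.  The child holds the file's group lock for a while; the
 * parent's ordinary write needs a conflicting extent lock and so
 * waits until the group lock is dropped.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <lustre/lustre_user.h>   /* LL_IOC_GROUP_LOCK/UNLOCK; header path varies by release */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/lustre/grouplock_demo";
	int gid = 7;                      /* arbitrary non-zero group id */
	char buf[4096] = { 0 };
	int fd;

	if (fork() == 0) {
		/* Holder: open the file, take the group lock, keep it for a while. */
		fd = open(path, O_RDWR | O_CREAT, 0644);
		if (fd < 0 || ioctl(fd, LL_IOC_GROUP_LOCK, gid) < 0) {
			perror("holder");
			_exit(1);
		}
		sleep(10);
		ioctl(fd, LL_IOC_GROUP_UNLOCK, gid);
		close(fd);
		_exit(0);
	}

	/* Writer: a separate open without the group lock.  The write needs a
	 * normal extent lock, which conflicts with the group lock, so it blocks
	 * (ldlm_cli_enqueue -> ldlm_completion_ast, as in the trace above)
	 * until the holder unlocks. */
	sleep(1);
	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("writer open");
		return 1;
	}
	if (pwrite(fd, buf, sizeof(buf), 0) < 0)
		perror("pwrite");
	close(fd);
	return 0;
}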
| Comment by Andreas Dilger [ 23/Aug/17 ] |
|
Bobijam, could you please provide a brief description of what is needed to fix this problem, and how much work it would be to implement? |