[LU-2664] Failure on test suite racer test_1: system hang when doing racer cleanup Created: 22/Jan/13 Updated: 22/Dec/17 Resolved: 22/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre-master build #1176 ofd build server: 2.3.59--PRISTINE-2.6.32-279.14.1.el6_lustre.x86_64 (x86_64) |
||
| Severity: | 3 |
| Rank (Obsolete): | 6214 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/e7af68c6-62c5-11e2-982f-52540035b04c. The sub-test test_1 failed with the following error:
racer cleanup sleeping 5 sec ... sleeping 5 sec ... Waited 5, rc=1 sleeping 10 sec ... Waited 5, rc=1 sleeping 10 sec ... Waited 20, rc=1 sleeping 20 sec ... Waited 20, rc=1 sleeping 20 sec ... Waited 50, rc=1 sleeping 40 sec ... Waited 50, rc=1 sleeping 40 sec ... Waited 110, rc=1 sleeping 80 sec ... Waited 110, rc=1 sleeping 80 sec ... Waited 230, rc=1 sleeping 160 sec ... Waited 230, rc=1 sleeping 160 sec ... Waited 470, rc=1 sleeping 320 sec ... Waited 470, rc=1 sleeping 320 sec ... Waited 950, rc=1 root 10821 0.0 0.0 106096 488 ? D 05:31 0:00 /bin/bash ./file_concat.sh /mnt/lustre2/racer 20 root 10852 0.0 0.0 106096 492 ? D 05:31 0:00 /bin/bash ./file_concat.sh /mnt/lustre2/racer 20 root 10861 0.0 0.0 106096 492 ? D 05:31 0:00 /bin/bash ./file_concat.sh /mnt/lustre/racer 20 root 11207 0.0 0.0 103248 920 ? S 05:52 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec Waited 950, rc=1 root 11453 0.0 0.0 106096 488 ? D 05:31 0:00 /bin/bash ./file_concat.sh /mnt/lustre/racer 20 |
| Comments |
| Comment by Sarah Liu [ 22/Jan/13 ] |
|
client dmesg: Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-22-ib,client-23-ib.lab.whamcloud.com DURATION=900 == 05:26:46 (1358602006) Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer INFO: task rm:10785 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. rm D 0000000000000002 0 10785 23659 0x00000080 ffff8802ec0ede08 0000000000000086 ffff880319193500 ffff880319193500 found this from ost dmesg Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-22-ib,client-23-ib.lab.whamcloud.com DURATION=900 == 05:26:46 (1358602006) Lustre: 26943:0:(client.c:1837:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1358605640/real 1358605640] req@ffff8802e86c4800 x1424596583948926/t0(0) o105->lustre-OST0004@192.168.4.23@o2ib:15/16 lens 360/224 e 0 to 1 dl 1358605647 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 26943:0:(client.c:1837:ptlrpc_expire_one_request()) Skipped 2 previous similar messages LustreError: 138-a: lustre-OST0004: A client on nid 192.168.4.23@o2ib was evicted due to a lock completion callback time out: rc -107 LustreError: 24305:0:(client.c:1037:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8803245ba000 x1424596583949475/t0(0) o104->lustre-OST0004@192.168.4.23@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1 LustreError: 24305:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 192.168.4.23@o2ib) returned 0 from blocking AST ns: filter-ffff880331f50800 lock: ffff8802eca8c400/0x429adfb43f41f771 lrc: 1/0,0 mode: --/PW res: 5/0 rrc: 3 type: EXT [319488->18446744073709551615] (req 319488->323583) flags: 0x801000000020 nid: 192.168.4.23@o2ib remote: 0x63e3b619d86f2f9 expref: 2 pid: 26976 timeout: 4316329044 lvb_type: 0 Lustre: lustre-OST0003: Client c26030b8-51fa-9927-5bd0-122ea741d475 (at 192.168.4.22@o2ib) reconnecting Lustre: lustre-OST0003: Client c26030b8-51fa-9927-5bd0-122ea741d475 (at 192.168.4.22@o2ib) refused reconnection, still busy with 1 active RPCs LustreError: 24312:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff8802fbeb2050 x1424597441216683/t0(0) o4->c26030b8-51fa-9927-5bd0-122ea741d475@192.168.4.22@o2ib:0/0 lens 488/448 e 1 to 0 dl 1358605714 ref 1 fl Interpret:/0/0 rc 0/0 LustreError: 19085:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880170880000 LustreError: 19085:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880170880000 ffff880319193500 ffff8803093e46e0 ffff880319193500 00000010fffff000 ffff880319193ab8 ffff8802ec0edfd8 000000000000fb88 ffff880319193ab8 Call Trace: [<ffffffff814ff6fe>] __mutex_lock_slowpath+0x13e/0x180 [<ffffffff814ff59b>] mutex_lock+0x2b/0x50 [<ffffffff8118b686>] do_unlinkat+0x96/0x1c0 [<ffffffff810d6d42>] ? audit_syscall_entry+0x272/0x2a0 [<ffffffff81503ade>] ? do_page_fault+0x3e/0xa0 [<ffffffff8118b912>] sys_unlinkat+0x22/0x40 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b INFO: task mv:10804 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mv D 0000000000000000 0 10804 23650 0x00000080 ffff88031a1a3ba8 0000000000000086 000000000000009e 0020000000000080 50faa02900000000 000000000000b7a7 ffffffff8100ba4e ffff88031a1a3bd8 ffff8801cdfbb098 ffff88031a1a3fd8 000000000000fb88 ffff8801cdfbb098 Call Trace: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13 [<ffffffffa1100583>] ? ll_dcompare+0xf3/0x290 [lustre] [<ffffffff814ff6fe>] __mutex_lock_slowpath+0x13e/0x180 [<ffffffff81193c88>] ? __d_lookup+0xd8/0x150 [<ffffffff814ff59b>] mutex_lock+0x2b/0x50 [<ffffffff81189b9b>] do_lookup+0x11b/0x230 [<ffffffff8118a4e4>] __link_path_walk+0x734/0x1030 [<ffffffff8113f527>] ? handle_pte_fault+0xf7/0xb50 [<ffffffffa1115a24>] ? __ll_inode_revalidate_it+0x224/0xd10 [lustre] [<ffffffff8118b06a>] path_walk+0x6a/0xe0 [<ffffffff8118b23b>] do_path_lookup+0x5b/0xa0 [<ffffffff8118bea7>] user_path_at+0x57/0xa0 [<ffffffff811808b4>] ? cp_new_stat+0xe4/0x100 [<ffffffff81180adc>] vfs_fstatat+0x3c/0x80 [<ffffffff81180b8e>] vfs_lstat+0x1e/0x20 [<ffffffff81180bb4>] sys_newlstat+0x24/0x50 [<ffffffff810d6d42>] ? audit_syscall_entry+0x272/0x2a0 [<ffffffff81503ade>] ? do_page_fault+0x3e/0xa0 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b INFO: task mv:10811 blocked for more than 120 seconds. |
| Comment by Sarah Liu [ 11/Mar/13 ] |
|
Another instance in OFED build: https://maloo.whamcloud.com/test_sets/4e449708-8834-11e2-961a-52540035b04c There are "D" processes found in the console log of MDS and clients 16:02:36:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-22-ib,client-23-ib.lab.whamcloud.com DURATION=900 == 16:02:24 (1362700944) 16:02:36:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 16:02:36:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 16:08:10:INFO: task rm:21867 blocked for more than 120 seconds. 16:08:10:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 16:08:10:rm D 0000000000000000 0 21867 7073 0x00000080 16:08:10: ffff8802e7471c48 0000000000000082 ffff8802e7471bc8 0000000000000082 16:08:10: ffff8802bf381a20 ffff880032e11960 ffffffff81aa5740 0000000000000286 16:08:10: ffff8802b87c1af8 ffff8802e7471fd8 000000000000fb88 ffff8802b87c1af8 16:08:10:Call Trace: 16:08:10: [<ffffffff810de175>] ? call_rcu_sched+0x15/0x20 16:08:10: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180 16:08:10: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50 16:08:10: [<ffffffffa0d0968c>] mdc_reint+0x3c/0x3b0 [mdc] 16:08:10: [<ffffffffa0d0a700>] mdc_unlink+0x1b0/0x500 [mdc] 16:08:10: [<ffffffffa0cc0e87>] lmv_unlink+0x257/0x830 [lmv] 16:08:10: [<ffffffffa124701d>] ll_rmdir+0x18d/0x630 [lustre] 16:08:10: [<ffffffff81183e5d>] vfs_rmdir+0xbd/0xf0 16:08:10: [<ffffffff81182aaa>] ? lookup_hash+0x3a/0x50 16:08:10: [<ffffffff811863f3>] do_rmdir+0x103/0x120 16:08:10: [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200 16:08:10: [<ffffffff8118643d>] sys_unlinkat+0x2d/0x40 16:08:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b 16:08:10:INFO: task mkdir:24973 blocked for more than 120 seconds. 16:08:10:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 16:08:10:mkdir D 0000000000000000 0 24973 7081 0x00000080 16:08:10: ffff8802d835bde8 0000000000000082 ffff8802d835be48 0000000081185038 16:08:10: 0000000000000000 ffff8802b87c0ae0 ffff8802b87c0ae0 ffff8802b87c0ae0 16:08:10: ffff8802b87c1098 ffff8802d835bfd8 000000000000fb88 ffff8802b87c1098 16:08:10:Call Trace: 16:08:10: [<ffffffff81195540>] ? mntput_no_expire+0x30/0x110 16:08:10: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180 16:08:10: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50 16:08:10: [<ffffffff81182af0>] lookup_create+0x30/0xd0 16:08:10: [<ffffffff811864d5>] sys_mkdirat+0x65/0x120 16:08:10: [<ffffffff81195540>] ? mntput_no_expire+0x30/0x110 16:08:10: [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200 16:08:10: [<ffffffff811865a8>] sys_mkdir+0x18/0x20 16:08:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b 16:08:10:INFO: task file_concat.sh:24978 blocked for more than 120 seconds. 16:08:10:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 16:08:10:file_concat.s D 0000000000000001 0 24978 7078 0x00000080 16:08:10: ffff8802f4969858 0000000000000082 ffff8802d5c3a400 0000000000000000 16:08:10: ffff8802f49697c8 ffffffffa08bc160 ffff8801d1e6c000 ffffffffa08293f0 16:08:10: ffff8803232ab098 ffff8802f4969fd8 000000000000fb88 ffff8803232ab098 16:08:10:Call Trace: 16:08:10: [<ffffffffa08293f0>] ? lustre_msg_string+0x0/0x290 [ptlrpc] 16:08:10: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180 16:08:10: [<ffffffffa0d0d5ef>] ? mdc_open_pack+0x21f/0x240 [mdc] 16:08:10: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50 16:08:10: [<ffffffffa0d120b6>] mdc_enqueue+0x706/0x18c0 [mdc] 16:08:10: [<ffffffffa0d1346d>] mdc_intent_lock+0x1fd/0x64a [mdc] 16:08:10: [<ffffffffa12476c0>] ? ll_md_blocking_ast+0x0/0x710 [lustre] 16:08:10: [<ffffffffa0808560>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 16:08:10: [<ffffffffa0cb5977>] ? lmv_fid_alloc+0x117/0x4b0 [lmv] 16:08:10: [<ffffffffa0ccc969>] lmv_intent_open+0x1e9/0x8b0 [lmv] 16:08:10: [<ffffffffa12476c0>] ? ll_md_blocking_ast+0x0/0x710 [lustre] 16:08:10: [<ffffffffa0ccd2db>] lmv_intent_lock+0x2ab/0x370 [lmv] 16:08:10: [<ffffffffa12476c0>] ? ll_md_blocking_ast+0x0/0x710 [lustre] 16:08:10: [<ffffffffa124678e>] ? ll_i2gids+0x2e/0xd0 [lustre] 16:08:10: [<ffffffffa122db9a>] ? ll_prep_md_op_data+0xfa/0x3a0 [lustre] 16:08:10: [<ffffffffa124bc11>] ll_lookup_it+0x3a1/0xbf0 [lustre] 16:08:10: [<ffffffffa12476c0>] ? ll_md_blocking_ast+0x0/0x710 [lustre] 16:08:10: [<ffffffffa121817d>] ? ll_inode_permission+0xfd/0x220 [lustre] 16:08:10: [<ffffffffa124c4ec>] ll_lookup_nd+0x8c/0x430 [lustre] 16:08:10: [<ffffffff8118ea87>] ? d_alloc+0x137/0x1b0 16:08:10: [<ffffffff81182402>] __lookup_hash+0x102/0x160 16:08:10: [<ffffffff8115eb22>] ? kmem_cache_alloc+0x182/0x190 16:08:10: [<ffffffff81182aaa>] lookup_hash+0x3a/0x50 16:08:10: [<ffffffff81186e56>] do_filp_open+0x2c6/0xd60 16:08:10: [<ffffffff8104338c>] ? __do_page_fault+0x1ec/0x480 16:08:10: [<ffffffff814eca45>] ? page_fault+0x25/0x30 16:08:10: [<ffffffff81193272>] ? alloc_fd+0x92/0x160 16:08:10: [<ffffffff81173a39>] do_sys_open+0x69/0x140 16:08:10: [<ffffffff81173b50>] sys_open+0x20/0x30 16:08:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Andreas Dilger [ 22/Dec/17 ] |
|
Close old bug that has not been hit in a long time. |