[LU-2664] Failure on test suite racer test_1: system hang when doing racer cleanup Created: 22/Jan/13  Updated: 22/Dec/17  Resolved: 22/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

lustre-master build #1176 ofd build

server: 2.3.59--PRISTINE-2.6.32-279.14.1.el6_lustre.x86_64 (x86_64)
client: 2.3.59--PRISTINE-2.6.32-279.14.1.el6.x86_64 (x86_64)


Severity: 3
Rank (Obsolete): 6214

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/e7af68c6-62c5-11e2-982f-52540035b04c.

The sub-test test_1 failed with the following error:

test failed to respond and timed out

racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
Waited 5, rc=1 sleeping 10 sec ...
Waited 5, rc=1 sleeping 10 sec ...
Waited 20, rc=1 sleeping 20 sec ...
Waited 20, rc=1 sleeping 20 sec ...
Waited 50, rc=1 sleeping 40 sec ...
Waited 50, rc=1 sleeping 40 sec ...
Waited 110, rc=1 sleeping 80 sec ...
Waited 110, rc=1 sleeping 80 sec ...
Waited 230, rc=1 sleeping 160 sec ...
Waited 230, rc=1 sleeping 160 sec ...
Waited 470, rc=1 sleeping 320 sec ...
Waited 470, rc=1 sleeping 320 sec ...
Waited 950, rc=1 root     10821  0.0  0.0 106096   488 ?        D    05:31   0:00 /bin/bash ./file_concat.sh /mnt/lustre2/racer 20
root     10852  0.0  0.0 106096   492 ?        D    05:31   0:00 /bin/bash ./file_concat.sh /mnt/lustre2/racer 20
root     10861  0.0  0.0 106096   492 ?        D    05:31   0:00 /bin/bash ./file_concat.sh /mnt/lustre/racer 20
root     11207  0.0  0.0 103248   920 ?        S    05:52   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Waited 950, rc=1 root     11453  0.0  0.0 106096   488 ?        D    05:31   0:00 /bin/bash ./file_concat.sh /mnt/lustre/racer 20
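
The "Waited N, rc=1 ... sleeping M sec" lines above come from a cleanup loop that repeatedly re-checks whether the racer helper scripts (here file_concat.sh, stuck in the "D" state) have exited, doubling the sleep between checks. A minimal illustrative sketch of that kind of backoff loop (not the actual racer.sh/test-framework code; racer_done is a hypothetical stand-in for the real check) is:

# Illustrative exponential-backoff wait, in the spirit of the log above.
# racer_done() is a hypothetical check that the racer child scripts
# (file_create.sh, file_concat.sh, ...) have all exited.
racer_done() {
    ! pgrep -f 'file_.*\.sh' >/dev/null
}

waited=0
delay=5
until racer_done; do
    echo "Waited $waited, rc=1 sleeping $delay sec ..."
    sleep "$delay"
    waited=$((waited + delay))
    delay=$((delay * 2))
done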


 Comments   
Comment by Sarah Liu [ 22/Jan/13 ]

client dmesg:

Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-22-ib,client-23-ib.lab.whamcloud.com DURATION=900 == 05:26:46 (1358602006)
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
INFO: task rm:10785 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rm            D 0000000000000002     0 10785  23659 0x00000080
 ffff8802ec0ede08 0000000000000086 ffff880319193500 ffff880319193500

Found this in the OST dmesg (the client stack trace continues after it):

Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-22-ib,client-23-ib.lab.whamcloud.com DURATION=900 == 05:26:46 (1358602006)
Lustre: 26943:0:(client.c:1837:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1358605640/real 1358605640]  req@ffff8802e86c4800 x1424596583948926/t0(0) o105->lustre-OST0004@192.168.4.23@o2ib:15/16 lens 360/224 e 0 to 1 dl 1358605647 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 26943:0:(client.c:1837:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
LustreError: 138-a: lustre-OST0004: A client on nid 192.168.4.23@o2ib was evicted due to a lock completion callback time out: rc -107
LustreError: 24305:0:(client.c:1037:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8803245ba000 x1424596583949475/t0(0) o104->lustre-OST0004@192.168.4.23@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
LustreError: 24305:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 192.168.4.23@o2ib) returned 0 from blocking AST ns: filter-ffff880331f50800 lock: ffff8802eca8c400/0x429adfb43f41f771 lrc: 1/0,0 mode: --/PW res: 5/0 rrc: 3 type: EXT [319488->18446744073709551615] (req 319488->323583) flags: 0x801000000020 nid: 192.168.4.23@o2ib remote: 0x63e3b619d86f2f9 expref: 2 pid: 26976 timeout: 4316329044 lvb_type: 0
Lustre: lustre-OST0003: Client c26030b8-51fa-9927-5bd0-122ea741d475 (at 192.168.4.22@o2ib) reconnecting
Lustre: lustre-OST0003: Client c26030b8-51fa-9927-5bd0-122ea741d475 (at 192.168.4.22@o2ib) refused reconnection, still busy with 1 active RPCs
LustreError: 24312:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff8802fbeb2050 x1424597441216683/t0(0) o4->c26030b8-51fa-9927-5bd0-122ea741d475@192.168.4.22@o2ib:0/0 lens 488/448 e 1 to 0 dl 1358605714 ref 1 fl Interpret:/0/0 rc 0/0
LustreError: 19085:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880170880000
LustreError: 19085:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880170880000
 ffff880319193500 ffff8803093e46e0 ffff880319193500 00000010fffff000
 ffff880319193ab8 ffff8802ec0edfd8 000000000000fb88 ffff880319193ab8
Call Trace:
 [<ffffffff814ff6fe>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff814ff59b>] mutex_lock+0x2b/0x50
 [<ffffffff8118b686>] do_unlinkat+0x96/0x1c0
 [<ffffffff810d6d42>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81503ade>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8118b912>] sys_unlinkat+0x22/0x40
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task mv:10804 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mv            D 0000000000000000     0 10804  23650 0x00000080
 ffff88031a1a3ba8 0000000000000086 000000000000009e 0020000000000080
 50faa02900000000 000000000000b7a7 ffffffff8100ba4e ffff88031a1a3bd8
 ffff8801cdfbb098 ffff88031a1a3fd8 000000000000fb88 ffff8801cdfbb098
Call Trace:
 [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
 [<ffffffffa1100583>] ? ll_dcompare+0xf3/0x290 [lustre]
 [<ffffffff814ff6fe>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff81193c88>] ? __d_lookup+0xd8/0x150
 [<ffffffff814ff59b>] mutex_lock+0x2b/0x50
 [<ffffffff81189b9b>] do_lookup+0x11b/0x230
 [<ffffffff8118a4e4>] __link_path_walk+0x734/0x1030
 [<ffffffff8113f527>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffffa1115a24>] ? __ll_inode_revalidate_it+0x224/0xd10 [lustre]
 [<ffffffff8118b06a>] path_walk+0x6a/0xe0
 [<ffffffff8118b23b>] do_path_lookup+0x5b/0xa0
 [<ffffffff8118bea7>] user_path_at+0x57/0xa0
 [<ffffffff811808b4>] ? cp_new_stat+0xe4/0x100
 [<ffffffff81180adc>] vfs_fstatat+0x3c/0x80
 [<ffffffff81180b8e>] vfs_lstat+0x1e/0x20
 [<ffffffff81180bb4>] sys_newlstat+0x24/0x50
 [<ffffffff810d6d42>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81503ade>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task mv:10811 blocked for more than 120 seconds.
Comment by Sarah Liu [ 11/Mar/13 ]

Another instance was seen in an OFED build: https://maloo.whamcloud.com/test_sets/4e449708-8834-11e2-961a-52540035b04c

There are "D" processes found in the console log of MDS and clients

16:02:36:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-22-ib,client-23-ib.lab.whamcloud.com DURATION=900 == 16:02:24 (1362700944)
16:02:36:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
16:02:36:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
16:08:10:INFO: task rm:21867 blocked for more than 120 seconds.
16:08:10:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:08:10:rm            D 0000000000000000     0 21867   7073 0x00000080
16:08:10: ffff8802e7471c48 0000000000000082 ffff8802e7471bc8 0000000000000082
16:08:10: ffff8802bf381a20 ffff880032e11960 ffffffff81aa5740 0000000000000286
16:08:10: ffff8802b87c1af8 ffff8802e7471fd8 000000000000fb88 ffff8802b87c1af8
16:08:10:Call Trace:
16:08:10: [<ffffffff810de175>] ? call_rcu_sched+0x15/0x20
16:08:10: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
16:08:10: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
16:08:10: [<ffffffffa0d0968c>] mdc_reint+0x3c/0x3b0 [mdc]
16:08:10: [<ffffffffa0d0a700>] mdc_unlink+0x1b0/0x500 [mdc]
16:08:10: [<ffffffffa0cc0e87>] lmv_unlink+0x257/0x830 [lmv]
16:08:10: [<ffffffffa124701d>] ll_rmdir+0x18d/0x630 [lustre]
16:08:10: [<ffffffff81183e5d>] vfs_rmdir+0xbd/0xf0
16:08:10: [<ffffffff81182aaa>] ? lookup_hash+0x3a/0x50
16:08:10: [<ffffffff811863f3>] do_rmdir+0x103/0x120
16:08:10: [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200
16:08:10: [<ffffffff8118643d>] sys_unlinkat+0x2d/0x40
16:08:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
16:08:10:INFO: task mkdir:24973 blocked for more than 120 seconds.
16:08:10:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:08:10:mkdir         D 0000000000000000     0 24973   7081 0x00000080
16:08:10: ffff8802d835bde8 0000000000000082 ffff8802d835be48 0000000081185038
16:08:10: 0000000000000000 ffff8802b87c0ae0 ffff8802b87c0ae0 ffff8802b87c0ae0
16:08:10: ffff8802b87c1098 ffff8802d835bfd8 000000000000fb88 ffff8802b87c1098
16:08:10:Call Trace:
16:08:10: [<ffffffff81195540>] ? mntput_no_expire+0x30/0x110
16:08:10: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
16:08:10: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
16:08:10: [<ffffffff81182af0>] lookup_create+0x30/0xd0
16:08:10: [<ffffffff811864d5>] sys_mkdirat+0x65/0x120
16:08:10: [<ffffffff81195540>] ? mntput_no_expire+0x30/0x110
16:08:10: [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200
16:08:10: [<ffffffff811865a8>] sys_mkdir+0x18/0x20
16:08:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
16:08:10:INFO: task file_concat.sh:24978 blocked for more than 120 seconds.
16:08:10:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:08:10:file_concat.s D 0000000000000001     0 24978   7078 0x00000080
16:08:10: ffff8802f4969858 0000000000000082 ffff8802d5c3a400 0000000000000000
16:08:10: ffff8802f49697c8 ffffffffa08bc160 ffff8801d1e6c000 ffffffffa08293f0
16:08:10: ffff8803232ab098 ffff8802f4969fd8 000000000000fb88 ffff8803232ab098
16:08:10:Call Trace:
16:08:10: [<ffffffffa08293f0>] ? lustre_msg_string+0x0/0x290 [ptlrpc]
16:08:10: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
16:08:10: [<ffffffffa0d0d5ef>] ? mdc_open_pack+0x21f/0x240 [mdc]
16:08:10: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
16:08:10: [<ffffffffa0d120b6>] mdc_enqueue+0x706/0x18c0 [mdc]
16:08:10: [<ffffffffa0d1346d>] mdc_intent_lock+0x1fd/0x64a [mdc]
16:08:10: [<ffffffffa12476c0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
16:08:10: [<ffffffffa0808560>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
16:08:10: [<ffffffffa0cb5977>] ? lmv_fid_alloc+0x117/0x4b0 [lmv]
16:08:10: [<ffffffffa0ccc969>] lmv_intent_open+0x1e9/0x8b0 [lmv]
16:08:10: [<ffffffffa12476c0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
16:08:10: [<ffffffffa0ccd2db>] lmv_intent_lock+0x2ab/0x370 [lmv]
16:08:10: [<ffffffffa12476c0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
16:08:10: [<ffffffffa124678e>] ? ll_i2gids+0x2e/0xd0 [lustre]
16:08:10: [<ffffffffa122db9a>] ? ll_prep_md_op_data+0xfa/0x3a0 [lustre]
16:08:10: [<ffffffffa124bc11>] ll_lookup_it+0x3a1/0xbf0 [lustre]
16:08:10: [<ffffffffa12476c0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
16:08:10: [<ffffffffa121817d>] ? ll_inode_permission+0xfd/0x220 [lustre]
16:08:10: [<ffffffffa124c4ec>] ll_lookup_nd+0x8c/0x430 [lustre]
16:08:10: [<ffffffff8118ea87>] ? d_alloc+0x137/0x1b0
16:08:10: [<ffffffff81182402>] __lookup_hash+0x102/0x160
16:08:10: [<ffffffff8115eb22>] ? kmem_cache_alloc+0x182/0x190
16:08:10: [<ffffffff81182aaa>] lookup_hash+0x3a/0x50
16:08:10: [<ffffffff81186e56>] do_filp_open+0x2c6/0xd60
16:08:10: [<ffffffff8104338c>] ? __do_page_fault+0x1ec/0x480
16:08:10: [<ffffffff814eca45>] ? page_fault+0x25/0x30
16:08:10: [<ffffffff81193272>] ? alloc_fd+0x92/0x160
16:08:10: [<ffffffff81173a39>] do_sys_open+0x69/0x140
16:08:10: [<ffffffff81173b50>] sys_open+0x20/0x30
16:08:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Comment by Andreas Dilger [ 22/Dec/17 ]

Closing this old bug, as it has not been hit in a long time.
