[LU-547] 1.8<->2.1 / 2.3<->2.4 interop: dbench kept doing cleanup Created: 28/Jul/11 Updated: 16/Aug/16 Resolved: 16/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.4.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.7 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jian Yu | Assignee: | Oleg Drokin |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre Clients: / Lustre Servers: |
| Attachments: | dbench-debug_log.tar.bz2 |
| Severity: | 3 |
| Rank (Obsolete): | 6058 |
| Description |
|
While running the sanity-benchmark test suite, the dbench test kept doing cleanup and could not stop:

<~snip~>
  1    737819     5.51 MB/sec  execute 598 sec  latency 38819.862 ms
  1    737819     5.50 MB/sec  execute 599 sec  latency 39820.034 ms
  1           cleanup 600 sec
  1           cleanup 601 sec
......
  1           cleanup 2897 sec
  1           cleanup 2898 sec
......

On client node fat-amd-3-ib:

# ps auxww
<~snip~>
500  18628  0.0  0.0 107264 2552 pts/0  S+  00:50  0:00 bash rundbench -D /mnt/lustre/d0.fat-amd-3-ib 1
500  18636  0.0  0.0   6092  596 pts/0  S+  00:50  0:00 dbench -c client.txt 1
500  18637  2.2  0.0   8140 2540 pts/0  S+  00:50  2:00 dbench -c client.txt 1

# echo t > /proc/sysrq-trigger
<~snip~>
dbench        S 0000000000000009     0 18636  18628 0x00000080
 ffff880219239e78 0000000000000086 0000000000000000 ffffffff812688c5
 ffff880323c50f98 ffff880218872068 0000000000000000 0000000100b91979
 ffff880218af3a78 ffff880219239fd8 000000000000f598 ffff880218af3a78
Call Trace:
 [<ffffffff812688c5>] ? rb_insert_color+0x125/0x160
 [<ffffffff8106b325>] do_wait+0x1c5/0x250
 [<ffffffff8106b453>] sys_wait4+0xa3/0x100
 [<ffffffff810699b0>] ? child_wait_callback+0x0/0x70
 [<ffffffff8100b425>] ? int_check_syscall_exit_work+0x34/0x3d
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

dbench        S 000000000000000c     0 18637  18636 0x00000080
 ffff880419933a98 0000000000000082 0000000000000000 000000004e31176c
 ffff880419933a78 ffffffffa04a2cac ffff880419933a78 ffff8803191cb000
 ffff8804197b05f8 ffff880419933fd8 000000000000f598 ffff8804197b05f8
Call Trace:
 [<ffffffffa04a2cac>] ? ptlrpc_at_adj_net_latency+0x7c/0x230 [ptlrpc]
 [<ffffffffa049c00d>] ldlm_flock_completion_ast+0x61d/0x9f0 [ptlrpc]
 [<ffffffffa046c15f>] ? lock_res_and_lock+0x5f/0xe0 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa048a565>] ldlm_cli_enqueue_fini+0x6c5/0xba0 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa048e074>] ldlm_cli_enqueue+0x344/0x7a0 [ptlrpc]
 [<ffffffffa0677edd>] ll_file_flock+0x47d/0x6b0 [lustre]
 [<ffffffffa049b9f0>] ? ldlm_flock_completion_ast+0x0/0x9f0 [ptlrpc]
 [<ffffffff811bc243>] vfs_lock_file+0x23/0x40
 [<ffffffff811bc497>] fcntl_setlk+0x177/0x320
 [<ffffffff8107ff06>] ? group_send_sig_info+0x26/0x70
 [<ffffffff81080212>] ? sys_kill+0x92/0x190
 [<ffffffff811845f7>] sys_fcntl+0x197/0x530
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/4f04bb24-b8f5-11e0-8bdf-52540025f9af

Please find the attached dbench-debug_log.tar.bz2 for debug logs. |
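The second trace above shows the dbench worker stuck in a blocking fcntl() record-lock request (sys_fcntl -> fcntl_setlk -> ll_file_flock -> ldlm_cli_enqueue), waiting for an LDLM flock lock that is never granted, while its parent simply waits in wait4(). A minimal sketch for reproducing and capturing this state, based on the commands visible in the ps output above (the debug-log path is only an example):

# bash rundbench -D /mnt/lustre/d0.$(hostname) 1
...wait for the run to wedge in the "cleanup NNN sec" phase...
# echo t > /proc/sysrq-trigger          # dump all task stacks to the console
# lctl dk > /tmp/dbench-debug_log.txt   # dump the Lustre kernel debug log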
| Comments |
| Comment by Jian Yu [ 28/Jul/11 ] |
|
sanity-quota test 8 also hit the same issue. |
| Comment by Peter Jones [ 29/Jul/11 ] |
|
Oleg will look into this one. He suspects the same root cause as LU-451. |
| Comment by Oleg Drokin [ 01/Aug/11 ] |
|
I now think this is in fact a different issue than LU-451. Unfortunately, the debug logs don't shed enough light on the situation; the client log is empty. I wonder whether this dbench test works between a 1.8 client and a 1.8 server if you mount the client with -o flock? (I don't think we mount with flock by default in either 1.8 or 2.1.) |
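For reference, remounting a client with flock enabled would look like the sketch below; the MGS NID and filesystem name are placeholders, not values from this test cluster:

# umount /mnt/lustre
# mount -t lustre -o flock mgs@tcp0:/lustre /mnt/lustre
# grep ' /mnt/lustre ' /proc/mounts    # "flock" should now appear in the option list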
| Comment by Jian Yu [ 02/Aug/11 ] |
The issue could be reproduced consistently with a 1.8.6-wc1 client and a 2.0.66.0 server. Once the Toro cluster is back, I'll reproduce the issue and gather more debug logs.

The "flock" option is used by default in our test suite on both the b1_8 and master branches, and the dbench test passed on 1.8.6-wc1 with the flock option. |
| Comment by Jian Yu [ 04/Aug/11 ] |
|
Hi Oleg, |
| Comment by Oleg Drokin [ 10/Aug/11 ] |
|
Since we don't enable flock support by default, let's not mount with -o flock in interop testing for now. |
| Comment by Sarah Liu [ 29/Nov/12 ] |
|
Looks like the same issue seen in 2.1.3<->2.4 interop testing:

server: 2.3 RHEL6

test log shows:

  1        cleanup 7149 sec
  1        cleanup 7150 sec
  1        cleanup 7151 sec
  1        cleanup 7152 sec
  1        cleanup 7153 sec

client console shows:

17:38:39:Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench == 17:38:24 (1353980304)
17:39:21:Lustre: 3001:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980345/real 1353980345] req@ffff880074163400 x1419751045755358/t0(0) o2->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 440/432 e 0 to 1 dl 1353980352 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
17:39:21:Lustre: lustre-OST0002-osc-ffff8800749b5c00: Connection to lustre-OST0002 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
17:39:21:Lustre: 3002:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980345/real 1353980345] req@ffff880075a1d000 x1419751045755356/t0(0) o4->lustre-OST0005-osc-ffff8800749b5c00@10.10.4.155@tcp:6/4 lens 488/448 e 0 to 1 dl 1353980352 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
17:39:21:Lustre: lustre-OST0005-osc-ffff8800749b5c00: Connection to lustre-OST0005 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
17:39:22:Lustre: 3002:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980350/real 0] req@ffff880075a1d800 x1419751045755403/t0(0) o4->lustre-OST0005-osc-ffff8800749b5c00@10.10.4.155@tcp:6/4 lens 488/448 e 0 to 1 dl 1353980357 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
17:39:22:Lustre: lustre-OST0003-osc-ffff8800749b5c00: Connection to lustre-OST0003 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
17:39:22:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980352/real 0] req@ffff880075de8000 x1419751045755406/t0(0) o8->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980358 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
17:39:22:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
17:39:22:Lustre: lustre-OST0004-osc-ffff8800749b5c00: Connection to lustre-OST0004 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
17:39:22:Lustre: Skipped 3 previous similar messages
17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980358/real 0] req@ffff880037692000 x1419751045755416/t0(0) o8->lustre-OST0004-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980364 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 25 previous similar messages
17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980358/real 0] req@ffff880037730000 x1419751045755415/t0(0) o8->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980369 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
17:39:44:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980368/real 0] req@ffff880037739000 x1419751045755422/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980379 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
17:39:44:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 1 previous similar message
17:40:26:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980393/real 1353980394] req@ffff8800376fd000 x1419751045755451/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980418 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:40:26:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 18 previous similar messages
17:41:07:INFO: task dbench:8110 blocked for more than 120 seconds.
17:41:07:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:41:07:dbench        D 0000000000000000     0  8110   8109 0x00000080
17:41:07: ffff880075d8ba58 0000000000000082 ffff880075d8ba08 ffff8800746ac0d0
17:41:07: 0000000000000286 0000000000000003 0000000000000001 0000000000000286
17:41:07: ffff880076192638 ffff880075d8bfd8 000000000000fb88 ffff880076192638
17:41:07:Call Trace:
17:41:07: [<ffffffffa04367da>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
17:41:07: [<ffffffff814fe7d5>] schedule_timeout+0x215/0x2e0
17:41:08: [<ffffffffa077bfbc>] ? ptlrpc_request_bufs_pack+0x5c/0x80 [ptlrpc]
17:41:08: [<ffffffffa0791c30>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
17:41:08: [<ffffffff814fe453>] wait_for_common+0x123/0x180
17:41:08: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
17:41:08: [<ffffffff814fe56d>] wait_for_completion+0x1d/0x20
17:41:08: [<ffffffffa099b1d4>] osc_io_setattr_end+0xc4/0x1a0 [osc]
17:41:08: [<ffffffffa0a1dc50>] ? lov_io_end_wrapper+0x0/0x100 [lov]
17:41:08: [<ffffffffa0628a80>] cl_io_end+0x60/0x150 [obdclass]
17:41:08: [<ffffffffa0629350>] ? cl_io_start+0x0/0x140 [obdclass]
17:41:08: [<ffffffffa0a1dd41>] lov_io_end_wrapper+0xf1/0x100 [lov]
17:41:08: [<ffffffffa0a1d7ce>] lov_io_call+0x8e/0x130 [lov]
17:41:08: [<ffffffffa0a1f42c>] lov_io_end+0x4c/0x110 [lov]
17:41:08: [<ffffffffa0628a80>] cl_io_end+0x60/0x150 [obdclass]
17:41:08: [<ffffffffa062dce2>] cl_io_loop+0xc2/0x1b0 [obdclass]
17:41:08: [<ffffffffa0ad3148>] cl_setattr_ost+0x208/0x2d0 [lustre]
17:41:08: [<ffffffffa0aa2472>] ll_setattr_raw+0x752/0xfd0 [lustre]
17:41:08: [<ffffffffa0aa2d4b>] ll_setattr+0x5b/0xf0 [lustre]
17:41:08: [<ffffffff81197368>] notify_change+0x168/0x340
17:41:08: [<ffffffff81187e25>] ? putname+0x35/0x50
17:41:08: [<ffffffff811aacdc>] utimes_common+0xdc/0x1b0
17:41:08: [<ffffffff811aae8e>] do_utimes+0xde/0xf0
17:41:08: [<ffffffff811ab023>] sys_utime+0x23/0x90
17:41:08: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
17:41:20:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980453/real 1353980454] req@ffff8800376fd000 x1419751045755487/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980478 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:41:20:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
17:42:25:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1353980543/real 1353980543] req@ffff8800756cc400 x1419751045755544/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980568 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:42:25:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
17:43:07:INFO: task dbench:8110 blocked for more than 120 seconds. |
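Given the repeated "Connection ... was lost" messages, it would also be worth capturing the client's OSC import state while the test is wedged; a sketch using standard 2.x client parameters:

# lctl get_param osc.*.import   # current connection/import status for each OST
# lctl get_param osc.*.state    # import state history (shows connection flapping)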
| Comment by James A Simmons [ 16/Aug/16 ] |
|
Old blocker for an unsupported version. |