[LU-547] 1.8<->2.1 / 2.3<->2.4 interop: dbench kept doing cleanup Created: 28/Jul/11  Updated: 16/Aug/16  Resolved: 16/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.4.0, Lustre 1.8.6
Fix Version/s: Lustre 2.1.0, Lustre 1.8.7

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: Oleg Drokin
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Lustre Clients:
Tag: 1.8.6-wc1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.2.1.el6)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)
ENABLE_QUOTA=yes

Lustre Servers:
Tag: v2_0_66_0
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.2.1.el6_lustre)
Build: http://newbuild.whamcloud.com/job/lustre-master/228/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)


Attachments: File dbench-debug_log.tar.bz2    
Severity: 3
Rank (Obsolete): 6058

 Description   

While running the sanity-benchmark test suite, the dbench test kept doing cleanup and could not stop:

<~snip~>
   1    737819     5.51 MB/sec  execute 598 sec  latency 38819.862 ms
   1    737819     5.50 MB/sec  execute 599 sec  latency 39820.034 ms
   1  cleanup 600 sec
   1  cleanup 601 sec
   ......
   1  cleanup 2897 sec
   1  cleanup 2898 sec
   ......

On client node fat-amd-3-ib:

# ps auxww
<~snip~>
500      18628  0.0  0.0 107264  2552 pts/0    S+   00:50   0:00 bash rundbench -D /mnt/lustre/d0.fat-amd-3-ib 1
500      18636  0.0  0.0   6092   596 pts/0    S+   00:50   0:00 dbench -c client.txt 1
500      18637  2.2  0.0   8140  2540 pts/0    S+   00:50   2:00 dbench -c client.txt 1

# echo t > /proc/sysrq-trigger
<~snip~>
dbench        S 0000000000000009     0 18636  18628 0x00000080
 ffff880219239e78 0000000000000086 0000000000000000 ffffffff812688c5
 ffff880323c50f98 ffff880218872068 0000000000000000 0000000100b91979
 ffff880218af3a78 ffff880219239fd8 000000000000f598 ffff880218af3a78
Call Trace:
 [<ffffffff812688c5>] ? rb_insert_color+0x125/0x160
 [<ffffffff8106b325>] do_wait+0x1c5/0x250
 [<ffffffff8106b453>] sys_wait4+0xa3/0x100
 [<ffffffff810699b0>] ? child_wait_callback+0x0/0x70
 [<ffffffff8100b425>] ? int_check_syscall_exit_work+0x34/0x3d
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
dbench        S 000000000000000c     0 18637  18636 0x00000080
 ffff880419933a98 0000000000000082 0000000000000000 000000004e31176c
 ffff880419933a78 ffffffffa04a2cac ffff880419933a78 ffff8803191cb000
 ffff8804197b05f8 ffff880419933fd8 000000000000f598 ffff8804197b05f8
Call Trace:
 [<ffffffffa04a2cac>] ? ptlrpc_at_adj_net_latency+0x7c/0x230 [ptlrpc]
 [<ffffffffa049c00d>] ldlm_flock_completion_ast+0x61d/0x9f0 [ptlrpc]
 [<ffffffffa046c15f>] ? lock_res_and_lock+0x5f/0xe0 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa048a565>] ldlm_cli_enqueue_fini+0x6c5/0xba0 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa048e074>] ldlm_cli_enqueue+0x344/0x7a0 [ptlrpc]
 [<ffffffffa0677edd>] ll_file_flock+0x47d/0x6b0 [lustre]
 [<ffffffffa049b9f0>] ? ldlm_flock_completion_ast+0x0/0x9f0 [ptlrpc]
 [<ffffffff811bc243>] vfs_lock_file+0x23/0x40
 [<ffffffff811bc497>] fcntl_setlk+0x177/0x320
 [<ffffffff8107ff06>] ? group_send_sig_info+0x26/0x70
 [<ffffffff81080212>] ? sys_kill+0x92/0x190
 [<ffffffff811845f7>] sys_fcntl+0x197/0x530
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
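
A lighter-weight alternative to sysrq-t, when only one task is of interest, is to read the per-process stack directly (a sketch, assuming the kernel exposes /proc/<pid>/stack, which RHEL6 2.6.32 kernels generally do; 18637 is the blocked dbench worker from the ps output above):

# cat /proc/18637/stack
# grep State /proc/18637/status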

Maloo report: https://maloo.whamcloud.com/test_sets/4f04bb24-b8f5-11e0-8bdf-52540025f9af

Please see the attached dbench-debug_log.tar.bz2 for the debug logs.



 Comments   
Comment by Jian Yu [ 28/Jul/11 ]

sanity-quota test 8 also hit the same issue:
https://maloo.whamcloud.com/test_sets/700c103e-b8cc-11e0-8bdf-52540025f9af

Comment by Peter Jones [ 29/Jul/11 ]

Oleg will look into this one. He suspects the same root cause as LU-451.

Comment by Oleg Drokin [ 01/Aug/11 ]

I now think this is in fact different from LU-451 and probably related to LU-104 interop, even though that seemed to work well for me at the time.

Unfortunately, the debug logs don't shed enough light on the situation: the client log is empty.
The MDS log only lists the current lock that we are blocking due to conflicts, but I see no other lock activity, so I have no way to know what it was conflicting with.

I wonder whether this dbench test works between a 1.8 client and a 1.8 server if you mount the client with -o flock? (I don't think we mount with flock by default in either 1.8 or 2.1.)

Comment by Jian Yu [ 02/Aug/11 ]

Unfortunately, the debug logs don't shed enough light on the situation: the client log is empty.
The MDS log only lists the current lock that we are blocking due to conflicts, but I see no other lock activity, so I have no way to know what it was conflicting with.

The issue can be reproduced consistently with the 1.8.6-wc1 client and 2.0.66.0 server configuration. Once the Toro cluster is back, I'll reproduce the issue and gather more debug logs.

I wonder whether this dbench test works between a 1.8 client and a 1.8 server if you mount the client with -o flock? (I don't think we mount with flock by default in either 1.8 or 2.1.)

"flock" option is used by default in our test suite on both b1_8 and master branches:
MOUNTOPT=${MOUNTOPT:-"-o user_xattr,acl,flock"}

and the dbench test passed on 1.8.6-wc1 with the flock option:
https://maloo.whamcloud.com/test_sets/7deb786c-a269-11e0-aee5-52540025f9af
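
For anyone reproducing this by hand, a minimal sketch of mounting a client with flock enabled (the MGS NID, fsname, and mount point below are placeholders; adjust for the local cluster):

# mount -t lustre -o user_xattr,acl,flock mgs@o2ib:/lustre /mnt/lustre
# mount -t lustre | grep flock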

Comment by Jian Yu [ 04/Aug/11 ]

Hi Oleg,
I reproduced the issue and gathered more debug logs while running the dbench test with "PTLDEBUG=-1":
https://maloo.whamcloud.com/test_sets/94f6dce0-be57-11e0-8bdf-52540025f9af
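
For reference, PTLDEBUG=-1 makes the test framework enable the full debug mask; a rough by-hand equivalent would be the following sketch (the log path is arbitrary):

# lctl set_param debug=-1
# lctl clear
  ... reproduce the hang ...
# lctl dk > /tmp/lustre-debug.log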

Comment by Oleg Drokin [ 10/Aug/11 ]

Since we don't enable flock support by default, let's not mount with -o flock in interop testing for now.
I don't think this is such an important use case anyway. We can add some extra interop code in 1.8.7 if this poses a problem for real customers in the future.

Comment by Sarah Liu [ 29/Nov/12 ]

This looks like the same issue seen in 2.1.3<->2.4 interop testing:
https://maloo.whamcloud.com/test_sets/07ffc460-397f-11e2-9fda-52540035b04c

server: 2.3 RHEL6
client: lustre master build #1065

The test log shows:

   1  cleanup 7149 sec
   1  cleanup 7150 sec
   1  cleanup 7151 sec
   1  cleanup 7152 sec
   1  cleanup 7153 sec

The client console shows:

17:38:39:Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench == 17:38:24 (1353980304)
17:39:21:Lustre: 3001:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980345/real 1353980345]  req@ffff880074163400 x1419751045755358/t0(0) o2->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 440/432 e 0 to 1 dl 1353980352 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
17:39:21:Lustre: lustre-OST0002-osc-ffff8800749b5c00: Connection to lustre-OST0002 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
17:39:21:Lustre: 3002:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980345/real 1353980345]  req@ffff880075a1d000 x1419751045755356/t0(0) o4->lustre-OST0005-osc-ffff8800749b5c00@10.10.4.155@tcp:6/4 lens 488/448 e 0 to 1 dl 1353980352 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
17:39:21:Lustre: lustre-OST0005-osc-ffff8800749b5c00: Connection to lustre-OST0005 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
17:39:22:Lustre: 3002:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980350/real 0]  req@ffff880075a1d800 x1419751045755403/t0(0) o4->lustre-OST0005-osc-ffff8800749b5c00@10.10.4.155@tcp:6/4 lens 488/448 e 0 to 1 dl 1353980357 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
17:39:22:Lustre: lustre-OST0003-osc-ffff8800749b5c00: Connection to lustre-OST0003 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
17:39:22:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980352/real 0]  req@ffff880075de8000 x1419751045755406/t0(0) o8->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980358 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
17:39:22:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
17:39:22:Lustre: lustre-OST0004-osc-ffff8800749b5c00: Connection to lustre-OST0004 (at 10.10.4.155@tcp) was lost; in progress operations using this service will wait for recovery to complete
17:39:22:Lustre: Skipped 3 previous similar messages
17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980358/real 0]  req@ffff880037692000 x1419751045755416/t0(0) o8->lustre-OST0004-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980364 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 25 previous similar messages
17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980358/real 0]  req@ffff880037730000 x1419751045755415/t0(0) o8->lustre-OST0002-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980369 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
17:39:33:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
17:39:44:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1353980368/real 0]  req@ffff880037739000 x1419751045755422/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980379 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
17:39:44:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 1 previous similar message
17:40:26:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980393/real 1353980394]  req@ffff8800376fd000 x1419751045755451/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980418 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:40:26:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 18 previous similar messages
17:41:07:INFO: task dbench:8110 blocked for more than 120 seconds.
17:41:07:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:41:07:dbench        D 0000000000000000     0  8110   8109 0x00000080
17:41:07: ffff880075d8ba58 0000000000000082 ffff880075d8ba08 ffff8800746ac0d0
17:41:07: 0000000000000286 0000000000000003 0000000000000001 0000000000000286
17:41:07: ffff880076192638 ffff880075d8bfd8 000000000000fb88 ffff880076192638
17:41:07:Call Trace:
17:41:07: [<ffffffffa04367da>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
17:41:07: [<ffffffff814fe7d5>] schedule_timeout+0x215/0x2e0
17:41:08: [<ffffffffa077bfbc>] ? ptlrpc_request_bufs_pack+0x5c/0x80 [ptlrpc]
17:41:08: [<ffffffffa0791c30>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
17:41:08: [<ffffffff814fe453>] wait_for_common+0x123/0x180
17:41:08: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
17:41:08: [<ffffffff814fe56d>] wait_for_completion+0x1d/0x20
17:41:08: [<ffffffffa099b1d4>] osc_io_setattr_end+0xc4/0x1a0 [osc]
17:41:08: [<ffffffffa0a1dc50>] ? lov_io_end_wrapper+0x0/0x100 [lov]
17:41:08: [<ffffffffa0628a80>] cl_io_end+0x60/0x150 [obdclass]
17:41:08: [<ffffffffa0629350>] ? cl_io_start+0x0/0x140 [obdclass]
17:41:08: [<ffffffffa0a1dd41>] lov_io_end_wrapper+0xf1/0x100 [lov]
17:41:08: [<ffffffffa0a1d7ce>] lov_io_call+0x8e/0x130 [lov]
17:41:08: [<ffffffffa0a1f42c>] lov_io_end+0x4c/0x110 [lov]
17:41:08: [<ffffffffa0628a80>] cl_io_end+0x60/0x150 [obdclass]
17:41:08: [<ffffffffa062dce2>] cl_io_loop+0xc2/0x1b0 [obdclass]
17:41:08: [<ffffffffa0ad3148>] cl_setattr_ost+0x208/0x2d0 [lustre]
17:41:08: [<ffffffffa0aa2472>] ll_setattr_raw+0x752/0xfd0 [lustre]
17:41:08: [<ffffffffa0aa2d4b>] ll_setattr+0x5b/0xf0 [lustre]
17:41:08: [<ffffffff81197368>] notify_change+0x168/0x340
17:41:08: [<ffffffff81187e25>] ? putname+0x35/0x50
17:41:08: [<ffffffff811aacdc>] utimes_common+0xdc/0x1b0
17:41:08: [<ffffffff811aae8e>] do_utimes+0xde/0xf0
17:41:08: [<ffffffff811ab023>] sys_utime+0x23/0x90
17:41:08: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
17:41:20:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1353980453/real 1353980454]  req@ffff8800376fd000 x1419751045755487/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980478 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:41:20:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
17:42:25:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1353980543/real 1353980543]  req@ffff8800756cc400 x1419751045755544/t0(0) o8->lustre-OST0000-osc-ffff8800749b5c00@10.10.4.155@tcp:28/4 lens 400/544 e 0 to 1 dl 1353980568 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:42:25:Lustre: 3000:0:(client.c:1826:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
17:43:07:INFO: task dbench:8110 blocked for more than 120 seconds.
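
The console log here points at lost OST connections rather than a flock conflict; a quick sanity check from the client would be the import state of the affected OSCs, e.g. (a sketch, using the OST NID from the log above):

# lctl ping 10.10.4.155@tcp
# lctl get_param osc.*.import | grep -E 'state|target'
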
Comment by James A Simmons [ 16/Aug/16 ]

Old blocker for an unsupported version.
