[LU-5970] 2.4.3<->2.5.4 interop: sanity-benchmark test bonnie hung Created: 02/Dec/14  Updated: 14/Aug/16  Resolved: 14/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre client build: https://build.hpdd.intel.com/job/lustre-b2_5/104
Lustre server build: https://build.hpdd.intel.com/job/lustre-b2_4/73 (2.4.3)


Severity: 3
Rank (Obsolete): 16673

 Description

sanity-benchmark test bonnie hung.

On client node:

bonnie++      D 0000000000000001     0 12696  12355 0x00000080
 ffff88006cd17ae8 0000000000000082 00000000ffffffff 000027e0a71a87f8
 ffff88006d38bec0 ffff8800398b1830 0000000000aebe00 ffffffffab83d239
 ffff880069133ab8 ffff88006cd17fd8 000000000000fbc8 ffff880069133ab8
Call Trace:
 [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8111f750>] ? sync_page_killable+0x0/0x40
 [<ffffffff815295e3>] io_schedule+0x73/0xc0
 [<ffffffff8111f73d>] sync_page+0x3d/0x50
 [<ffffffff8111f75e>] sync_page_killable+0xe/0x40
 [<ffffffff81529e7a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff8111f667>] __lock_page_killable+0x67/0x70
 [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
 [<ffffffff8111f50e>] ? find_get_page+0x1e/0xa0
 [<ffffffff81121394>] generic_file_aio_read+0x4b4/0x700
 [<ffffffffa06ec41b>] vvp_io_read_start+0x22b/0x410 [lustre] 
 [<ffffffffa0c46d2a>] cl_io_start+0x6a/0x140 [obdclass]
 [<ffffffffa0c4b434>] cl_io_loop+0xb4/0x1b0 [obdclass]
 [<ffffffffa068e326>] ll_file_io_generic+0x2a6/0x610 [lustre] 
 [<ffffffffa068e7cf>] ll_file_aio_read+0x13f/0x2c0 [lustre] 
 [<ffffffffa068ef3c>] ll_file_read+0x16c/0x2a0 [lustre] 
 [<ffffffff81189645>] vfs_read+0xb5/0x1a0
 [<ffffffff81189781>] sys_read+0x51/0x90
 [<ffffffff810e1bfe>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

On OSS node:

ll_ost_io00_0 D 0000000000000000     0  9922      2 0x00000080
 ffff88007aa67790 0000000000000046 ffff88007aa67758 ffff88007aa67754
 ffff880079f02378 ffff88007f823040 ffff880002316740 0000000000000400
 ffff88006b02daf8 ffff88007aa67fd8 000000000000fb88 ffff88006b02daf8
Call Trace:
 [<ffffffffa0ce06cb>] osd_do_bio+0x68b/0x800 [osd_ldiskfs]
 [<ffffffffa0d4402c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0d442d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
 [<ffffffffa0ce29d8>] osd_read_prep+0x338/0x3b0 [osd_ldiskfs]
 [<ffffffffa0db2bd3>] ofd_preprw_read+0x253/0x7f0 [ofd]
 [<ffffffffa0db38ea>] ofd_preprw+0x77a/0x1480 [ofd]
 [<ffffffffa077dff1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc] 
 [<ffffffffa0d68e0c>] obd_preprw+0x12c/0x3d0 [ost]
 [<ffffffffa0d707c0>] ost_brw_read+0xd60/0x1340 [ost]
 [<ffffffff81281876>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa0486beb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs] 
 [<ffffffffa077ee9c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 
 [<ffffffffa077eff8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc] 
 [<ffffffffa0d78028>] ost_handle+0x2ac8/0x48e0 [ost]
 [<ffffffffa04830f4>] ? libcfs_id2str+0x74/0xb0 [libcfs] 
 [<ffffffffa078e3b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] 
 [<ffffffffa04775de>] ? cfs_timer_arm+0xe/0x10 [libcfs] 
 [<ffffffffa0488d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 
 [<ffffffffa0785719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 
 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
 [<ffffffffa078f74e>] ptlrpc_main+0xace/0x1700 [ptlrpc] 
 [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 
 [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report:
https://testing.hpdd.intel.com/test_sets/dfee42f4-7a67-11e4-be53-5254006e85c2
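
For reference, task stacks like the ones above can be captured on a hung node; a minimal sketch with generic commands (not taken from this ticket's logs):

  # Dump all task stacks to the kernel log via sysrq (requires root):
  echo t > /proc/sysrq-trigger
  # Flush the Lustre kernel debug buffer to a file for later inspection:
  lctl dk /tmp/lustre-debug.log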



 Comments
Comment by Jian Yu [ 02/Dec/14 ]

This is a regression introduced by Lustre b2_5 build #104.
Here is a for-test-only patch to reproduce the failure: http://review.whamcloud.com/12908

Comment by Jian Yu [ 02/Dec/14 ]

Lustre client build: https://build.hpdd.intel.com/job/lustre-b2_5/104
Lustre server build: https://build.hpdd.intel.com/job/lustre-b2_4/73 (2.4.3)

replay-ost-single test 5 also hit the same failure:

LNet: Service thread pid 9041 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 9041, comm: ll_ost_io00_022

Call Trace:
 [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8150ed93>] io_schedule+0x73/0xc0
 [<ffffffff8125ed08>] get_request_wait+0x108/0x1d0
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81255fbb>] ? elv_merge+0x1cb/0x200
 [<ffffffff8125ee6b>] blk_queue_bio+0x9b/0x5d0
 [<ffffffff8125d51e>] generic_make_request+0x25e/0x520
 [<ffffffffa0002f77>] ? dm_merge_bvec+0xc7/0x100 [dm_mod] 
 [<ffffffff8125d86d>] submit_bio+0x8d/0x120
 [<ffffffffa05bd41e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
 [<ffffffffa0cdfffc>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
 [<ffffffffa0ce041c>] osd_do_bio+0x3dc/0x800 [osd_ldiskfs]
 [<ffffffffa0d4402c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
 [<ffffffffa0d442d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
 [<ffffffffa0ce31a8>] osd_write_commit+0x328/0x610 [osd_ldiskfs]
 [<ffffffffa0db1e54>] ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa0db4bbd>] ofd_commitrw+0x5cd/0xbb0 [ofd]
 [<ffffffffa05287e5>] ? lprocfs_counter_add+0x125/0x182 [lvfs]
 [<ffffffffa0d691d8>] obd_commitrw+0x128/0x3d0 [ost]
 [<ffffffffa0d731d1>] ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffff81281876>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa073e1a0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] 
 [<ffffffffa0d7942b>] ost_handle+0x3ecb/0x48e0 [ost]
 [<ffffffffa04830f4>] ? libcfs_id2str+0x74/0xb0 [libcfs] 
 [<ffffffffa078e3b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] 
 [<ffffffffa04775de>] ? cfs_timer_arm+0xe/0x10 [libcfs] 
 [<ffffffffa0488d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 
 [<ffffffffa0785719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 
 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
 [<ffffffffa078f74e>] ptlrpc_main+0xace/0x1700 [ptlrpc] 
 [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 
 [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report: https://testing.hpdd.intel.com/test_sets/32c28f34-7a69-11e4-be53-5254006e85c2

Comment by Andreas Dilger [ 03/Dec/14 ]

I'm moving the affects version to 2.4.3 since the problem appears to be entirely on the server side and unrelated to the client.

Comment by Oleg Drokin [ 03/Dec/14 ]

At least in the second occurrence we can see that the OST is excessively slow, but requests do complete after 210 seconds.
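
One way to check whether the OSS back end is the bottleneck is to inspect the bulk I/O histograms and service-time estimates on the OSS; a sketch, assuming the usual obdfilter/ost tunables for this vintage:

  # Per-OST block I/O size and latency histograms:
  lctl get_param obdfilter.*.brw_stats
  # Service-time estimates feeding adaptive timeouts on the ost_io service:
  lctl get_param ost.OSS.ost_io.timeouts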

Comment by Jian Yu [ 04/Dec/14 ]

Here is a for-test-only patch to reproduce the failure: http://review.whamcloud.com/12908

sanity-benchmark was run 6 times and all runs passed.
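
A minimal sketch of how such repeated runs can be driven from lustre/tests (the install path and the ONLY variable are illustrative of the standard test framework):

  cd /usr/lib64/lustre/tests
  # Run only the bonnie subtest six times, stopping on the first failure:
  for i in $(seq 6); do
      ONLY=bonnie sh sanity-benchmark.sh || break
  done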

Comment by James A Simmons [ 14/Aug/16 ]

Last comments suggest this is fixed.
