[LU-5970] 2.4.3<->2.5.4 interop: sanity-benchmark test bonnie hung Created: 02/Dec/14 Updated: 14/Aug/16 Resolved: 14/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre client build: https://build.hpdd.intel.com/job/lustre-b2_5/104 |
||
| Severity: | 3 |
| Rank (Obsolete): | 16673 |
| Description |
|
sanity-benchmark test bonnie hung. On client node: bonnie++ D 0000000000000001 0 12696 12355 0x00000080 ffff88006cd17ae8 0000000000000082 00000000ffffffff 000027e0a71a87f8 ffff88006d38bec0 ffff8800398b1830 0000000000aebe00 ffffffffab83d239 ffff880069133ab8 ffff88006cd17fd8 000000000000fbc8 ffff880069133ab8 Call Trace: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0 [<ffffffff8111f750>] ? sync_page_killable+0x0/0x40 [<ffffffff815295e3>] io_schedule+0x73/0xc0 [<ffffffff8111f73d>] sync_page+0x3d/0x50 [<ffffffff8111f75e>] sync_page_killable+0xe/0x40 [<ffffffff81529e7a>] __wait_on_bit_lock+0x5a/0xc0 [<ffffffff8111f667>] __lock_page_killable+0x67/0x70 [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50 [<ffffffff8111f50e>] ? find_get_page+0x1e/0xa0 [<ffffffff81121394>] generic_file_aio_read+0x4b4/0x700 [<ffffffffa06ec41b>] vvp_io_read_start+0x22b/0x410 [lustre] [<ffffffffa0c46d2a>] cl_io_start+0x6a/0x140 [obdclass] [<ffffffffa0c4b434>] cl_io_loop+0xb4/0x1b0 [obdclass] [<ffffffffa068e326>] ll_file_io_generic+0x2a6/0x610 [lustre] [<ffffffffa068e7cf>] ll_file_aio_read+0x13f/0x2c0 [lustre] [<ffffffffa068ef3c>] ll_file_read+0x16c/0x2a0 [lustre] [<ffffffff81189645>] vfs_read+0xb5/0x1a0 [<ffffffff81189781>] sys_read+0x51/0x90 [<ffffffff810e1bfe>] ? __audit_syscall_exit+0x25e/0x290 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b On OSS node: ll_ost_io00_0 D 0000000000000000 0 9922 2 0x00000080 ffff88007aa67790 0000000000000046 ffff88007aa67758 ffff88007aa67754 ffff880079f02378 ffff88007f823040 ffff880002316740 0000000000000400 ffff88006b02daf8 ffff88007aa67fd8 000000000000fb88 ffff88006b02daf8 Call Trace: [<ffffffffa0ce06cb>] osd_do_bio+0x68b/0x800 [osd_ldiskfs] [<ffffffffa0d4402c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs] [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa0d442d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] [<ffffffffa0ce29d8>] osd_read_prep+0x338/0x3b0 [osd_ldiskfs] [<ffffffffa0db2bd3>] ofd_preprw_read+0x253/0x7f0 [ofd] [<ffffffffa0db38ea>] ofd_preprw+0x77a/0x1480 [ofd] [<ffffffffa077dff1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc] [<ffffffffa0d68e0c>] obd_preprw+0x12c/0x3d0 [ost] [<ffffffffa0d707c0>] ost_brw_read+0xd60/0x1340 [ost] [<ffffffff81281876>] ? vsnprintf+0x336/0x5e0 [<ffffffffa0486beb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs] [<ffffffffa077ee9c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] [<ffffffffa077eff8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc] [<ffffffffa0d78028>] ost_handle+0x2ac8/0x48e0 [ost] [<ffffffffa04830f4>] ? libcfs_id2str+0x74/0xb0 [libcfs] [<ffffffffa078e3b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] [<ffffffffa04775de>] ? cfs_timer_arm+0xe/0x10 [libcfs] [<ffffffffa0488d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] [<ffffffffa0785719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 [<ffffffffa078f74e>] ptlrpc_main+0xace/0x1700 [ptlrpc] [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Maloo report: |
| Comments |
| Comment by Jian Yu [ 02/Dec/14 ] |
|
This is a regression failure introduced by Lustre b2_5 build #104. |
| Comment by Jian Yu [ 02/Dec/14 ] |
|
Lustre client build: https://build.hpdd.intel.com/job/lustre-b2_5/104 replay-ost-single test 5 also hit the same failure: LNet: Service thread pid 9041 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Pid: 9041, comm: ll_ost_io00_022 Call Trace: [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0 [<ffffffff8150ed93>] io_schedule+0x73/0xc0 [<ffffffff8125ed08>] get_request_wait+0x108/0x1d0 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff81255fbb>] ? elv_merge+0x1cb/0x200 [<ffffffff8125ee6b>] blk_queue_bio+0x9b/0x5d0 [<ffffffff8125d51e>] generic_make_request+0x25e/0x520 [<ffffffffa0002f77>] ? dm_merge_bvec+0xc7/0x100 [dm_mod] [<ffffffff8125d86d>] submit_bio+0x8d/0x120 [<ffffffffa05bd41e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass] [<ffffffffa0cdfffc>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs] [<ffffffffa0ce041c>] osd_do_bio+0x3dc/0x800 [osd_ldiskfs] [<ffffffffa0d4402c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs] [<ffffffffa0d442d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] [<ffffffffa0ce31a8>] osd_write_commit+0x328/0x610 [osd_ldiskfs] [<ffffffffa0db1e54>] ofd_commitrw_write+0x684/0x11b0 [ofd] [<ffffffffa0db4bbd>] ofd_commitrw+0x5cd/0xbb0 [ofd] [<ffffffffa05287e5>] ? lprocfs_counter_add+0x125/0x182 [lvfs] [<ffffffffa0d691d8>] obd_commitrw+0x128/0x3d0 [ost] [<ffffffffa0d731d1>] ost_brw_write+0xea1/0x15d0 [ost] [<ffffffff81281876>] ? vsnprintf+0x336/0x5e0 [<ffffffffa073e1a0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [<ffffffffa0d7942b>] ost_handle+0x3ecb/0x48e0 [ost] [<ffffffffa04830f4>] ? libcfs_id2str+0x74/0xb0 [libcfs] [<ffffffffa078e3b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] [<ffffffffa04775de>] ? cfs_timer_arm+0xe/0x10 [libcfs] [<ffffffffa0488d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] [<ffffffffa0785719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 [<ffffffffa078f74e>] ptlrpc_main+0xace/0x1700 [ptlrpc] [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] [<ffffffffa078ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Maloo report: https://testing.hpdd.intel.com/test_sets/32c28f34-7a69-11e4-be53-5254006e85c2 |
| Comment by Andreas Dilger [ 03/Dec/14 ] |
|
I'm moving the affects version to 2.4.3 since the problem appears to be totally on the server and unrelated to the client. |
| Comment by Oleg Drokin [ 03/Dec/14 ] |
|
at least in the second occurence we can see that OST is excessively slow but requests do complete after 210 seconds. |
| Comment by Jian Yu [ 04/Dec/14 ] |
sanity-benchmark was run for 6 times and all passed. |
| Comment by James A Simmons [ 14/Aug/16 ] |
|
Last comments suggest this is fixed. |