[LU-1752] Test failure on test suite obdfilter-survey, subtest test_1c Created: 15/Aug/12  Updated: 23/Aug/12  Resolved: 23/Aug/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6350

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8220cf66-e4f5-11e1-9681-52540035b04c.

The sub-test test_1c failed with the following error:

test failed to respond and timed out

12:46:22:Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch == 12:46:20 (1344800780)
12:46:22:Lustre: DEBUG MARKER: lctl dl | grep obdfilter
12:46:23:Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
12:46:23:Lustre: Echo OBD driver; http://www.lustre.org/
12:47:02:Lustre: 5083:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0002: slow direct_io 32s
12:47:05:Lustre: 5419:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0002: slow direct_io 30s
12:47:06:Lustre: 5364:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0002: slow direct_io 31s
12:47:06:Lustre: 5364:0:(filter_io_26.c:772:filter_commitrw_write()) Skipped 9 previous similar messages
12:47:08:Lustre: 5086:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0001: slow direct_io 38s
12:47:08:Lustre: 5086:0:(filter_io_26.c:772:filter_commitrw_write()) Skipped 17 previous similar messages
12:47:09:Lustre: 5705:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0001: slow direct_io 34s
12:47:09:Lustre: 5705:0:(filter_io_26.c:772:filter_commitrw_write()) Skipped 38 previous similar messages
12:47:13:Lustre: 5706:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0001: slow direct_io 38s
12:47:13:Lustre: 5706:0:(filter_io_26.c:772:filter_commitrw_write()) Skipped 47 previous similar messages
12:47:22:Lustre: 5346:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0005: slow direct_io 36s
12:47:24:Lustre: 5346:0:(filter_io_26.c:772:filter_commitrw_write()) Skipped 155 previous similar messages
12:47:38:Lustre: 5220:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0004: slow direct_io 36s
12:47:39:Lustre: 5220:0:(filter_io_26.c:772:filter_commitrw_write()) Skipped 271 previous similar messages
12:48:11:Lustre: 5628:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0000: slow direct_io 33s
12:48:11:Lustre: 5628:0:(filter_io_26.c:772:filter_commitrw_write()) Skipped 492 previous similar messages
12:49:15:Lustre: 5901:0:(filter_io_26.c:772:filter_commitrw_write()) lustre-OST0005: slow direct_io 30s
12:49:16:Lustre: 5901:0:(filter_io_26.c:772:filter_commitrw_write()) Skipped 1089 previous similar messages
12:52:38:INFO: task jbd2/dm-0-8:29189 blocked for more than 120 seconds.
12:52:38:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
12:52:38:jbd2/dm-0-8   D 0000000000000000     0 29189      2 0x00000080
12:52:38: ffff880069127d20 0000000000000046 ffff880069127c90 ffffffff810920e6
12:52:38: ffff880069127cd0 ffffffff81039678 ffff8800ffffffff ffff880069938040
12:52:38: ffff8800699385f8 ffff880069127fd8 000000000000fb88 ffff8800699385f8
12:52:38:Call Trace:
12:52:38: [<ffffffff810920e6>] ? autoremove_wake_function+0x16/0x40
12:52:38: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
12:52:38: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
12:52:38: [<ffffffffa03a493f>] jbd2_journal_commit_transaction+0x19f/0x1530 [jbd2]
12:52:38: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
12:52:38: [<ffffffff8107e00c>] ? lock_timer_base+0x3c/0x70
12:52:38: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
12:52:38: [<ffffffffa03ab128>] kjournald2+0xb8/0x220 [jbd2]
12:52:38: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
12:52:38: [<ffffffffa03ab070>] ? kjournald2+0x0/0x220 [jbd2]
12:52:38: [<ffffffff81091d66>] kthread+0x96/0xa0
12:52:38: [<ffffffff8100c14a>] child_rip+0xa/0x20
12:52:38: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
12:52:38: [<ffffffff8100c140>] ? child_rip+0x0/0x20


 Comments   
Comment by Keith Mannthey (Inactive) [ 15/Aug/12 ]

Is there any more the the OST kernel messages in /var/log/messages? The direct IO the the DM devices seemed to just stop for several of the OSTs. Did the underlying IO fail?

Comment by Peter Jones [ 19/Aug/12 ]

Bobijam

Could you please look into this one in more detail?

Thanks

Peter

Comment by Zhenyu Xu [ 20/Aug/12 ]
OSS stack trace of major blocked threads
14:51:15:jbd2/dm-0-8   D 0000000000000000     0 29189      2 0x00000080
14:51:15: ffff880069127d20 0000000000000046 ffff880069127c90 ffffffff810920e6
14:51:15: ffff880069127cd0 ffffffff81039678 ffff8800ffffffff ffff880069938040
14:51:15: ffff8800699385f8 ffff880069127fd8 000000000000fb88 ffff8800699385f8
14:51:15:Call Trace:
14:51:15: [<ffffffff810920e6>] ? autoremove_wake_function+0x16/0x40
14:51:15: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
14:51:15: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
14:51:15: [<ffffffffa03a493f>] jbd2_journal_commit_transaction+0x19f/0x1530 [jbd2]
14:51:15: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
14:51:15: [<ffffffff8107e00c>] ? lock_timer_base+0x3c/0x70
14:51:16: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:51:16: [<ffffffffa03ab128>] kjournald2+0xb8/0x220 [jbd2]
14:51:16: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:51:16: [<ffffffffa03ab070>] ? kjournald2+0x0/0x220 [jbd2]
14:51:16: [<ffffffff81091d66>] kthread+0x96/0xa0
14:51:16: [<ffffffff8100c14a>] child_rip+0xa/0x20
14:51:16: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
14:51:16: [<ffffffff8100c140>] ? child_rip+0x0/0x20
14:51:35:jbd2/dm-1-8   D 0000000000000000     0 29544      2 0x00000080
14:51:35: ffff8800679bbd20 0000000000000046 ffff88006af48040 0000000000000000
14:51:35: ffff8800145d0000 ffff880002216728 000000000000035f ffff88006ce54aa0
14:51:35: ffff88006ce55058 ffff8800679bbfd8 000000000000fb88 ffff88006ce55058
14:51:35:Call Trace:
14:51:35: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
14:51:35: [<ffffffffa03a493f>] jbd2_journal_commit_transaction+0x19f/0x1530 [jbd2]
14:51:35: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
14:51:35: [<ffffffff8107e00c>] ? lock_timer_base+0x3c/0x70
14:51:35: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:51:35: [<ffffffffa03ab128>] kjournald2+0xb8/0x220 [jbd2]
14:51:35: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:51:36: [<ffffffffa03ab070>] ? kjournald2+0x0/0x220 [jbd2]
14:51:36: [<ffffffff81091d66>] kthread+0x96/0xa0
14:51:36: [<ffffffff8100c14a>] child_rip+0xa/0x20
14:51:36: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
14:51:36: [<ffffffff8100c140>] ? child_rip+0x0/0x20
14:51:37:jbd2/dm-2-8   D 0000000000000000     0 29878      2 0x00000080
14:51:37: ffff880062127d20 0000000000000046 ffff880062127c90 ffffffff810920e6
14:51:37: ffff880044ca1000 ffff880002216728 00000000000004e1 ffff88007d352080
14:51:38: ffff88007d352638 ffff880062127fd8 000000000000fb88 ffff88007d352638
14:51:38:Call Trace:
14:51:38: [<ffffffff810920e6>] ? autoremove_wake_function+0x16/0x40
14:51:38: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
14:51:38: [<ffffffffa03a493f>] jbd2_journal_commit_transaction+0x19f/0x1530 [jbd2]
14:51:38: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
14:51:38: [<ffffffff8107e00c>] ? lock_timer_base+0x3c/0x70
14:51:38: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:51:38: [<ffffffffa03ab128>] kjournald2+0xb8/0x220 [jbd2]
14:51:38: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:51:38: [<ffffffffa03ab070>] ? kjournald2+0x0/0x220 [jbd2]
14:51:38: [<ffffffff81091d66>] kthread+0x96/0xa0
14:51:38: [<ffffffff8100c14a>] child_rip+0xa/0x20
14:51:38: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
14:51:38: [<ffffffff8100c140>] ? child_rip+0x0/0x20
14:51:41:jbd2/dm-3-8   D 0000000000000000     0 30247      2 0x00000080
14:51:41: ffff88006d207d20 0000000000000046 ffff880064824ae0 0000000000000000
14:51:41: ffff880014fa3c00 ffff880002216728 0000000000000318 ffff88001e651540
14:51:41: ffff88001e651af8 ffff88006d207fd8 000000000000fb88 ffff88001e651af8
14:51:41:Call Trace:
14:51:41: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
14:51:41: [<ffffffffa03a493f>] jbd2_journal_commit_transaction+0x19f/0x1530 [jbd2]
14:51:41: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
14:51:41: [<ffffffff8107e00c>] ? lock_timer_base+0x3c/0x70
14:51:41: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:51:41: [<ffffffffa03ab128>] kjournald2+0xb8/0x220 [jbd2]
14:51:41: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:51:41: [<ffffffffa03ab070>] ? kjournald2+0x0/0x220 [jbd2]
14:51:41: [<ffffffff81091d66>] kthread+0x96/0xa0
14:51:41: [<ffffffff8100c14a>] child_rip+0xa/0x20
14:51:41: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
14:51:42: [<ffffffff8100c140>] ? child_rip+0x0/0x20
14:52:05:lctl          D 0000000000000000     0  5080   5076 0x00000080
14:52:05: ffff880016eab638 0000000000000086 ffff880016eab6c8 ffffffffa0a239d8
14:52:05: ffff880079605540 ffff88007bf42680 ffff880016eab628 0000000000000000
14:52:05: ffff880079605af8 ffff880016eabfd8 000000000000fb88 ffff880079605af8
14:52:05:Call Trace:
14:52:05: [<ffffffffa0a239d8>] ? quota_chk_acq_common+0x88/0xb50 [lquota]
14:52:05: [<ffffffffa0a22a80>] ? quota_acquire_common+0x0/0xf0 [lquota]
14:52:05: [<ffffffff814ff2fe>] __mutex_lock_slowpath+0x13e/0x180
14:52:05: [<ffffffff814ff19b>] mutex_lock+0x2b/0x50
14:52:05: [<ffffffffa0b61f74>] filter_commitrw_write+0xe14/0x2eb8 [obdfilter]
14:52:05: [<ffffffffa0b56654>] ? filter_preprw_write+0x1064/0x22f0 [obdfilter]
14:52:05: [<ffffffffa04626c2>] ? cfs_hash_rw_lock+0x12/0x30 [libcfs]
14:52:05: [<ffffffffa05ddbbb>] ? cl_env_put+0x1cb/0x370 [obdclass]
14:52:05: [<ffffffffa0b5551d>] filter_commitrw+0x27d/0x2a0 [obdfilter]
14:52:05: [<ffffffffa0b586e0>] ? filter_preprw+0x80/0xa0 [obdfilter]
14:52:06: [<ffffffffa0b9d1da>] echo_client_brw_ioctl+0x82a/0x13e0 [obdecho]
14:52:06: [<ffffffff81162447>] ? cache_grow+0x217/0x320
14:52:07: [<ffffffffa05d86cb>] ? lu_context_init+0xab/0x260 [obdclass]
14:52:07: [<ffffffffa0ba56a0>] echo_client_iocontrol+0xbb0/0x2950 [obdecho]
14:52:07: [<ffffffff81114307>] ? unlock_page+0x27/0x30
14:52:07: [<ffffffffa044cbe0>] ? cfs_alloc+0x30/0x60 [libcfs]
14:52:07: [<ffffffffa0586d8a>] ? obd_ioctl_getdata+0x13a/0x1160 [obdclass]
14:52:07: [<ffffffffa059a51f>] class_handle_ioctl+0x12cf/0x1ea0 [obdclass]
14:52:07: [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
14:52:08: [<ffffffffa05862ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
14:52:08: [<ffffffff8118dff2>] vfs_ioctl+0x22/0xa0
14:52:08: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
14:52:08: [<ffffffff8118e194>] do_vfs_ioctl+0x84/0x580
14:52:08: [<ffffffff8118e711>] sys_ioctl+0x81/0xa0
14:52:09: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
14:52:09:lctl          D 0000000000000000     0  5081   5076 0x00000080
14:52:09: ffff8800270ed638 0000000000000082 ffff8800270ed6c8 ffffffffa0a239d8
14:52:09: ffff88000002bb00 ffff88007bf42680 ffff8800270ed628 0000000000000000
14:52:10: ffff88004c5c7af8 ffff8800270edfd8 000000000000fb88 ffff88004c5c7af8
14:52:10:Call Trace:
14:52:10: [<ffffffffa0a239d8>] ? quota_chk_acq_common+0x88/0xb50 [lquota]
14:52:10: [<ffffffffa0a22a80>] ? quota_acquire_common+0x0/0xf0 [lquota]
14:52:10: [<ffffffff814ff2fe>] __mutex_lock_slowpath+0x13e/0x180
14:52:10: [<ffffffff814ff19b>] mutex_lock+0x2b/0x50
14:52:10: [<ffffffffa0b61f74>] filter_commitrw_write+0xe14/0x2eb8 [obdfilter]
14:52:11: [<ffffffffa0b56654>] ? filter_preprw_write+0x1064/0x22f0 [obdfilter]
14:52:11: [<ffffffffa04626c2>] ? cfs_hash_rw_lock+0x12/0x30 [libcfs]
14:52:11: [<ffffffffa05ddbbb>] ? cl_env_put+0x1cb/0x370 [obdclass]
14:52:11: [<ffffffffa0b5551d>] filter_commitrw+0x27d/0x2a0 [obdfilter]
14:52:11: [<ffffffffa0b586e0>] ? filter_preprw+0x80/0xa0 [obdfilter]
14:52:11: [<ffffffffa0b9d1da>] echo_client_brw_ioctl+0x82a/0x13e0 [obdecho]
14:52:11: [<ffffffffa05d86cb>] ? lu_context_init+0xab/0x260 [obdclass]
14:52:11: [<ffffffffa0ba56a0>] echo_client_iocontrol+0xbb0/0x2950 [obdecho]
14:52:11: [<ffffffff81134425>] ? shmem_getpage+0x225/0x9f0
14:52:11: [<ffffffffa044cbe0>] ? cfs_alloc+0x30/0x60 [libcfs]
14:52:12: [<ffffffffa0586d8a>] ? obd_ioctl_getdata+0x13a/0x1160 [obdclass]
14:52:12: [<ffffffffa059a51f>] class_handle_ioctl+0x12cf/0x1ea0 [obdclass]
14:52:12: [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
14:52:12: [<ffffffffa05862ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
14:52:13: [<ffffffff8118dff2>] vfs_ioctl+0x22/0xa0
14:52:13: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
14:52:13: [<ffffffff8118e194>] do_vfs_ioctl+0x84/0x580
14:52:13: [<ffffffff8118e711>] sys_ioctl+0x81/0xa0
14:52:13: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
14:52:16:lctl          D 0000000000000000     0  5084   5075 0x00000080
14:52:16: ffff88002005d648 0000000000000082 ffff8800ffffffff 0000000000a22e38
14:52:17: ffffffffa0b7f700 000000000000004e ffff880037d35450 0000000400000000
14:52:17: ffff88005487fab8 ffff88002005dfd8 000000000000fb88 ffff88005487fab8
14:52:17:Call Trace:
14:52:17: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
14:52:17: [<ffffffffa03a3072>] start_this_handle+0x282/0x500 [jbd2]
14:52:17: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:52:17: [<ffffffffa03a34f0>] jbd2_journal_start+0xd0/0x110 [jbd2]
14:52:17: [<ffffffffa03ffad8>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
14:52:17: [<ffffffffa04c7088>] fsfilt_ldiskfs_start+0x78/0x630 [fsfilt_ldiskfs]
14:52:17: [<ffffffffa0b560cd>] filter_preprw_write+0xadd/0x22f0 [obdfilter]
14:52:17: [<ffffffffa05d797c>] ? lu_object_find_at+0x8c/0x450 [obdclass]
14:52:17: [<ffffffffa04626c2>] ? cfs_hash_rw_lock+0x12/0x30 [libcfs]
14:52:17: [<ffffffffa04617b8>] ? cfs_hash_multi_bd_lock+0x68/0xd0 [libcfs]
14:52:17: [<ffffffffa05ddbbb>] ? cl_env_put+0x1cb/0x370 [obdclass]
14:52:17: [<ffffffffa044cbae>] ? cfs_free+0xe/0x10 [libcfs]
14:52:17: [<ffffffffa0b3b8d7>] ? filter_unpackmd+0x6d7/0xa70 [obdfilter]
14:52:17: [<ffffffffa0b586e0>] filter_preprw+0x80/0xa0 [obdfilter]
14:52:18: [<ffffffffa0b9cf6a>] echo_client_brw_ioctl+0x5ba/0x13e0 [obdecho]
14:52:18: [<ffffffffa05d86cb>] ? lu_context_init+0xab/0x260 [obdclass]
14:52:18: [<ffffffffa0ba56a0>] echo_client_iocontrol+0xbb0/0x2950 [obdecho]
14:52:18: [<ffffffff81134425>] ? shmem_getpage+0x225/0x9f0
14:52:18: [<ffffffff81127061>] ? __alloc_pages_nodemask+0x111/0x940
14:52:18: [<ffffffffa044cbe0>] ? cfs_alloc+0x30/0x60 [libcfs]
14:52:18: [<ffffffffa0586d8a>] ? obd_ioctl_getdata+0x13a/0x1160 [obdclass]
14:52:18: [<ffffffffa059a51f>] class_handle_ioctl+0x12cf/0x1ea0 [obdclass]
14:52:18: [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
14:52:18: [<ffffffffa05862ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
14:52:18: [<ffffffff8118dff2>] vfs_ioctl+0x22/0xa0
14:52:18: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
14:52:18: [<ffffffff8118e194>] do_vfs_ioctl+0x84/0x580
14:52:18: [<ffffffff8118e711>] sys_ioctl+0x81/0xa0
14:52:18: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
14:52:20:lctl          D 0000000000000000     0  5086   5074 0x00000080
14:52:20: ffff8800676a96e8 0000000000000086 ffff8800676a9688 ffffffffa03fd4df
14:52:20: ffff8800676a9708 ffffffffa03a2ed5 ffff880000000020 ffff8800676a9698
14:52:20: ffff88006030a638 ffff8800676a9fd8 000000000000fb88 ffff88006030a638
14:52:20:Call Trace:
14:52:21: [<ffffffffa03fd4df>] ? ldiskfs_statfs+0xef/0x1d0 [ldiskfs]
14:52:21: [<ffffffffa03a2ed5>] ? start_this_handle+0xe5/0x500 [jbd2]
14:52:21: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
14:52:21: [<ffffffffa03aad85>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
14:52:21: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
14:52:21: [<ffffffffa04c848d>] fsfilt_ldiskfs_commit_wait+0x6d/0xf0 [fsfilt_ldiskfs]
14:52:21: [<ffffffffa0b5623d>] filter_preprw_write+0xc4d/0x22f0 [obdfilter]
14:52:21: [<ffffffffa05d797c>] ? lu_object_find_at+0x8c/0x450 [obdclass]
14:52:21: [<ffffffffa04626c2>] ? cfs_hash_rw_lock+0x12/0x30 [libcfs]
14:52:21: [<ffffffffa04617b8>] ? cfs_hash_multi_bd_lock+0x68/0xd0 [libcfs]
14:52:21: [<ffffffffa05ddbbb>] ? cl_env_put+0x1cb/0x370 [obdclass]
14:52:21: [<ffffffffa044cbae>] ? cfs_free+0xe/0x10 [libcfs]
14:52:21: [<ffffffffa0b3b8d7>] ? filter_unpackmd+0x6d7/0xa70 [obdfilter]
14:52:21: [<ffffffffa0b586e0>] filter_preprw+0x80/0xa0 [obdfilter]
14:52:21: [<ffffffffa0b9cf6a>] echo_client_brw_ioctl+0x5ba/0x13e0 [obdecho]
14:52:21: [<ffffffffa05d86cb>] ? lu_context_init+0xab/0x260 [obdclass]
14:52:22: [<ffffffffa0ba56a0>] echo_client_iocontrol+0xbb0/0x2950 [obdecho]
14:52:22: [<ffffffff81134425>] ? shmem_getpage+0x225/0x9f0
14:52:22: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
14:52:22: [<ffffffffa044cbe0>] ? cfs_alloc+0x30/0x60 [libcfs]
14:52:22: [<ffffffffa0586d8a>] ? obd_ioctl_getdata+0x13a/0x1160 [obdclass]
14:52:22: [<ffffffffa059a51f>] class_handle_ioctl+0x12cf/0x1ea0 [obdclass]
14:52:22: [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
14:52:22: [<ffffffffa05862ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
14:52:22: [<ffffffff8118dff2>] vfs_ioctl+0x22/0xa0
14:52:22: [<ffffffff8118e194>] do_vfs_ioctl+0x84/0x580
14:52:22: [<ffffffff814fdca0>] ? thread_return+0x4e/0x76e
14:52:22: [<ffffffff8118e711>] sys_ioctl+0x81/0xa0
14:52:22: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Comment by Zhenyu Xu [ 21/Aug/12 ]

niu,

All this failure happens on master, and all stack trace shows in OSS console log has this, while on orion_quota, orion_head_sync and b2_1 does not have this issue.

20:50:58:lctl D 0000000000000000 0 8218 8202 0x00000080
20:50:59: ffff880030ef1638 0000000000000082 ffff880030ef16c8 ffffffffa0a0b9d8
20:50:59: ffff88000002bb00 ffff88002ba7e1d0 ffff880030ef1628 0000000000000000
20:50:59: ffff8800610fc638 ffff880030ef1fd8 000000000000fb88 ffff8800610fc638
20:51:00:Call Trace:
20:51:00: [] ? quota_chk_acq_common+0x88/0xb50 [lquota]
20:51:00: [] ? quota_acquire_common+0x0/0xf0 [lquota]
20:51:00: [] __mutex_lock_slowpath+0x13e/0x180
20:51:00: [] mutex_lock+0x2b/0x50
20:51:00: [] filter_commitrw_write+0xe14/0x2eb8 [obdfilter]
20:51:00: [] ? filter_preprw_write+0x1064/0x22f0 [obdfilter]
20:51:00: [] ? cfs_hash_rw_lock+0x12/0x30 [libcfs]
20:51:00: [] ? cl_env_put+0x1cb/0x370 [obdclass]
20:51:00: [] filter_commitrw+0x27d/0x2a0 [obdfilter]
20:51:00: [] ? filter_preprw+0x80/0xa0 [obdfilter]
20:51:00: [] echo_client_brw_ioctl+0x82a/0x13e0 [obdecho]
20:51:00: [] ? lu_context_init+0xab/0x260 [obdclass]
20:51:00: [] echo_client_iocontrol+0xbb0/0x2950 [obdecho]
20:51:01: [] ? shmem_getpage+0x225/0x9f0
20:51:02: [] ? cfs_alloc+0x30/0x60 [libcfs]
20:51:02: [] ? obd_ioctl_getdata+0x13a/0x1160 [obdclass]
20:51:02: [] class_handle_ioctl+0x12cf/0x1ea0 [obdclass]
20:51:02: [] ? handle_mm_fault+0x1e4/0x2b0
20:51:02: [] obd_class_ioctl+0x4b/0x190 [obdclass]
20:51:02: [] vfs_ioctl+0x22/0xa0
20:51:02: [] ? pvclock_clocksource_read+0x58/0xd0
20:51:02: [] do_vfs_ioctl+0x84/0x580
20:51:02: [] sys_ioctl+0x81/0xa0
20:51:02: [] system_call_fastpath+0x16/0x1b

could you check whether master branch missed some patch?

Comment by Niu Yawei (Inactive) [ 22/Aug/12 ]

I don't think it's quota related, and looks the stack trace isn't 100% accurate, that's why the two quota functions listed in the trace.

What I see in the log is that many io threads is waiting on journal commit (and some are waiting on inode semaphore), not sure why the journal commit can't finish in time.

Comment by Zhenyu Xu [ 23/Aug/12 ]

Need an I/O expert's help.

Comment by Jodi Levi (Inactive) [ 23/Aug/12 ]

Duplicate of LU-657

Generated at Sat Feb 10 01:19:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.