[LU-745] ost-pools test 23 hung Created: 10/Oct/11 Updated: 23/Mar/21 Resolved: 29/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3, Lustre 1.8.8, Lustre 1.8.7 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jian Yu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Branch: b1_8 |
||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 4844 | ||||
| Description |
|
ost-pools test 23 hung as follows: <~snip~>
running as UID 60000, GID 60000 [dd] [if=/dev/zero] [of=/mnt/lustre/d0.ost-pools/d23/dir/f23-quota] [bs=1024] [count=1024] [seek=2048]
dd: writing `/mnt/lustre/d0.ost-pools/d23/dir/f23-quota': Disk quota exceeded
dd: closing output file `/mnt/lustre/d0.ost-pools/d23/dir/f23-quota': Input/output error
Disk quotas for user quota_usr (uid 60000):
Filesystem kbytes quota limit grace files quota limit grace
/mnt/lustre/d0.ost-pools/d23/dir
2048 8192 8192 - 1 0 0 -
lustre-MDT0000_UUID
0 - 1024 - 1 - 0 -
lustre-OST0000_UUID
2048* - 1024 - - - - -
lustre-OST0001_UUID
0 - 1024 - - - - -
lustre-OST0002_UUID
0 - 1024 - - - - -
lustre-OST0003_UUID
0 - 1024 - - - - -
lustre-OST0004_UUID
0 - 1024 - - - - -
lustre-OST0005_UUID
0 - 1024 - - - - -
lustre-OST0006_UUID
0 - 1024 - - - - -
second run
Maloo report: https://maloo.whamcloud.com/test_sets/76e559f2-f24b-11e0-908b-52540025f9af |
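For reference, the failure in the log above corresponds roughly to the following command sequence (an illustrative sketch only; the real ost-pools.sh test_23 uses test-framework helpers such as runas, sets up an OST pool for the directory, which is omitted here, and its exact limits may differ):

  # Illustrative reproduction of the quota-exceeded write seen in the log
  # above; the paths, uid and the 8192 kB block limit come from the lfs quota
  # output, everything else is an assumption.
  MNT=/mnt/lustre
  DIR=$MNT/d0.ost-pools/d23/dir
  FILE=$DIR/f23-quota

  mkdir -p $DIR
  chown quota_usr $DIR

  # 8 MB block hard limit for quota_usr (the "8192" limit in the report)
  lfs setquota -u quota_usr -b 0 -B 8192 -i 0 -I 0 $MNT

  # Write as quota_usr (uid/gid 60000 in the log); dd is expected to fail
  # with EDQUOT once the per-OST quota grant is exhausted, as in the log.
  su quota_usr -c "dd if=/dev/zero of=$FILE bs=1024 count=1024 seek=2048"

  # Show per-OST usage, matching the report
  lfs quota -v -u quota_usr $MNT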
| Comments |
| Comment by Peter Jones [ 13/Oct/11 ] |
|
Niu, could you please look into this possible 1.8.7 blocker as your top priority? Thanks, Peter |
| Comment by Sarah Liu [ 14/Oct/11 ] |
|
ost-pools passed on manual test: https://maloo.whamcloud.com/test_sets/ec71c41c-f69c-11e0-a451-52540025f9af |
| Comment by Niu Yawei (Inactive) [ 17/Oct/11 ] |
|
It looks like the only difference between the autotest and the manual test is that the autotest was running on a VM, whereas the manual test was running on real hardware. From the OSS stack traces (autotest), we can see lots of jbd2 threads stuck in journal commit:

jbd2/dm-0-8   D ffff8100025219e8     0 10033     23         10034  6332 (L-TLB)
 ffff81007a871c80 0000000000000046 0000810067e960c0 0000000000000008
 ffff810072aa7fc0 000000000000000a ffff810067e960c0 ffff81007eb9b860
 00002d04b9e9b0d1 000000000004afa9 ffff810067e962a8 000000008006e649
Call Trace:
 [<ffffffff8006ec8f>] do_gettimeofday+0x40/0x90
 [<ffffffff8005a43f>] getnstimeofday+0x10/0x29
 [<ffffffff800155f8>] sync_buffer+0x0/0x3f
 [<ffffffff800637ce>] io_schedule+0x3f/0x67
 [<ffffffff80015633>] sync_buffer+0x3b/0x3f
 [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
 [<ffffffff800155f8>] sync_buffer+0x0/0x3f
 [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a2e2d>] wake_bit_function+0x0/0x23
 [<ffffffff887449f0>] :jbd2:jbd2_journal_commit_transaction+0xab4/0x1120
 [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8003d8b0>] lock_timer_base+0x1b/0x3c
 [<ffffffff88748209>] :jbd2:kjournald2+0x9a/0x1ec
 [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a2be7>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8874816f>] :jbd2:kjournald2+0x0/0x1ec
 [<ffffffff800a2be7>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003276f>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a2be7>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032671>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

And lots of OST I/O threads were waiting on journal commit:

ll_ost_io_03  D 0000000102d49df7     0 10070      1         10071 10069 (L-TLB)
 ffff810063ba1810 0000000000000046 00000100457bb3d8 ffff8100133b68c0
 0000000900001307 000000000000000a ffff810011cb3040 ffff81005c374820
 00002d04ca79a8e2 0000000000001a5a ffff810011cb3228 00000000ffffffff
Call Trace:
 [<ffffffff8002e024>] __wake_up+0x38/0x4f
 [<ffffffff887477f3>] :jbd2:jbd2_log_wait_commit+0xa3/0xf5
 [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88b9190b>] :fsfilt_ldiskfs:fsfilt_ldiskfs_commit_wait+0xab/0xd0
 [<ffffffff88c60144>] :obdfilter:filter_commitrw_write+0x1e14/0x2dd0
 [<ffffffff88bade26>] :ost:ost_checksum_bulk+0x2b6/0x5a0
 [<ffffffff88bb4d09>] :ost:ost_brw_write+0x1c99/0x2480
 [<ffffffff8001aaaa>] vsnprintf+0x5df/0x627
 [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
 [<ffffffff888f30a8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
 [<ffffffff88bb809e>] :ost:ost_handle+0x2bae/0x55b0
 [<ffffffff888ad19a>] :ptlrpc:lock_res_and_lock+0xba/0xd0
 [<ffffffff88761868>] :libcfs:libcfs_ip_addr2str+0x38/0x40
 [<ffffffff889026d9>] :ptlrpc:ptlrpc_server_handle_request+0x989/0xe00
 [<ffffffff88902e35>] :ptlrpc:ptlrpc_wait_event+0x2e5/0x310
 [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
 [<ffffffff88903dc6>] :ptlrpc:ptlrpc_main+0xf66/0x1120
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88902e60>] :ptlrpc:ptlrpc_main+0x0/0x1120
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

One OST I/O thread is a little bit different:

ll_ost_io_30  D ffff81007eb9b898     0 12653      1         12654 12652 (L-TLB)
 ffff810038eb7380 0000000000000046 0000000000000000 ffffffff8002349d
 0000000000000001 0000000000000009 ffff8100029bf0c0 ffff81007eb9b860
 00002d04c7a2bd6c 00000000002c10b7 ffff8100029bf2a8 0000000063c3a670
Call Trace:
 [<ffffffff8002349d>] mempool_alloc+0x31/0xe7
 [<ffffffff887427ef>] :jbd2:do_get_write_access+0x212/0x418
 [<ffffffff800a2e2d>] wake_bit_function+0x0/0x23
 [<ffffffff88742a17>] :jbd2:jbd2_journal_get_write_access+0x22/0x33
 [<ffffffff88b3a153>] :ldiskfs:__ldiskfs_journal_get_write_access+0x53/0x80
 [<ffffffff88b436a5>] :ldiskfs:ldiskfs_reserve_inode_write+0x45/0xa0
 [<ffffffff88b43749>] :ldiskfs:ldiskfs_mark_inode_dirty+0x49/0x1e0
 [<ffffffff88137415>] :dm_mod:__map_bio+0x4a/0x123
 [<ffffffff88b497e8>] :ldiskfs:ldiskfs_dirty_inode+0x38/0x70
 [<ffffffff80013dbd>] __mark_inode_dirty+0x29/0x16e
 [<ffffffff88b52388>] :ldiskfs:ldiskfs_mb_new_blocks+0x4c8/0x500
 [<ffffffff80019e69>] __getblk+0x25/0x22c
 [<ffffffff800300cb>] __up_write+0x27/0xf2
 [<ffffffff88b95217>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x327/0x640
 [<ffffffff88b3d14e>] :ldiskfs:ldiskfs_ext_get_blocks+0xae/0x1770
 [<ffffffff88b3b9d1>] :ldiskfs:ldiskfs_ext_find_extent+0x281/0x2f0
 [<ffffffff88b3bc7a>] :ldiskfs:ldiskfs_ext_walk_space+0x23a/0x2b0
 [<ffffffff88b94ef0>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
 [<ffffffff8006457b>] __down_write_nested+0x12/0x92
 [<ffffffff88b91453>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0x103/0x150
 [<ffffffff88742d45>] :jbd2:start_this_handle+0x31d/0x3cf
 [<ffffffff88b44eba>] :ldiskfs:ldiskfs_get_blocks+0xaa/0x210
 [<ffffffff88c5d631>] :obdfilter:filter_direct_io+0x10b1/0x10d0
 [<ffffffff88c5faa9>] :obdfilter:filter_commitrw_write+0x1779/0x2dd0
 [<ffffffff80062ff2>] thread_return+0x62/0xfe
 [<ffffffff88bade94>] :ost:ost_checksum_bulk+0x324/0x5a0
 [<ffffffff88bb4d09>] :ost:ost_brw_write+0x1c99/0x2480
 [<ffffffff8001aaaa>] vsnprintf+0x5df/0x627
 [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
 [<ffffffff888f30a8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
 [<ffffffff88bb809e>] :ost:ost_handle+0x2bae/0x55b0
 [<ffffffff8008dff2>] dequeue_task+0x18/0x37
 [<ffffffff88761868>] :libcfs:libcfs_ip_addr2str+0x38/0x40
 [<ffffffff889026d9>] :ptlrpc:ptlrpc_server_handle_request+0x989/0xe00
 [<ffffffff88902e35>] :ptlrpc:ptlrpc_wait_event+0x2e5/0x310
 [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
 [<ffffffff88903dc6>] :ptlrpc:ptlrpc_main+0xf66/0x1120
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88902e60>] :ptlrpc:ptlrpc_main+0x0/0x1120
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

I don't see the exact reason yet and will investigate further. |
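As an editorial aside, per-task stack dumps like the ones above can be collected on a hung OSS with the kernel's sysrq facility; the report does not say how these traces were gathered, so this is just a general sketch:

  # On the OSS node (as root): dump every task's stack to the kernel log
  # with sysrq 't', then read it back and pick out the interesting threads.
  echo 1 > /proc/sys/kernel/sysrq
  echo t > /proc/sysrq-trigger
  dmesg | grep -A 40 'll_ost_io\|jbd2'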
| Comment by Niu Yawei (Inactive) [ 18/Oct/11 ] |
|
I suspect the culprit could be a jbd2 bug. From the stack traces we can see some OST I/O threads waiting in do_get_write_access() while holding i_data_sem, which made other I/O threads block waiting for i_data_sem; because those threads never stopped their journal handles, the transaction could not commit, leaving lots of other I/O threads stuck in jbd2_log_wait_commit(). |
| Comment by Niu Yawei (Inactive) [ 18/Oct/11 ] |
|
Hi Yujian, I posted a patch at http://review.whamcloud.com/1541. Could you verify whether it resolves the problem? Thank you. |
| Comment by Jian Yu [ 18/Oct/11 ] |
Let's wait for the autotest results first since the issue only occurred in autotest runs before. |
| Comment by Niu Yawei (Inactive) [ 19/Oct/11 ] |
|
Hi Chris, the ost-pools test is currently not enabled in the patch-verification autotest runs. Could you add it, or is there another way to verify the patch? Thanks. |
| Comment by Build Master (Inactive) [ 21/Oct/11 ] |
|
Integrated in Johann Lombardi : 3d371fb1743cf171b3c770fea1d8f8ce864a3896
|
| Comment by Peter Jones [ 03/Nov/11 ] |
|
Niu, does this fix need to be ported to master? Peter |
| Comment by Niu Yawei (Inactive) [ 04/Nov/11 ] |
Yes, I think so. |
| Comment by Niu Yawei (Inactive) [ 08/Nov/11 ] |
|
patch for master: http://review.whamcloud.com/1675 |
| Comment by Build Master (Inactive) [ 21/Nov/11 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Niu Yawei (Inactive) [ 29/Dec/11 ] |
|
landed for b1_8 & master |
| Comment by Jian Yu [ 11/May/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/194/
The same issue occurred: https://maloo.whamcloud.com/test_sets/861fb64e-9b4f-11e1-a0a0-52540035b04c |
| Comment by Jian Yu [ 13/Aug/12 ] |
|
RHEL6.3/x86_64 (2.1.3 RC1 Servers + 1.8.8-wc1 Clients): |
| Comment by Niu Yawei (Inactive) [ 05/Nov/12 ] |
|
I think this is a different failure: in its "second run", ost-pools.sh test_23() tries to completely fill the OST pool with a single dd write, which will definitely time out on a large system. It seems we should backport the ost-pools test fix from master, which checks the available space of the pool and skips the test if it is larger than some maximum value. |
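A sketch of the kind of check described above (illustrative only; the pool name, filesystem name, and threshold below are hypothetical, and the actual master fix may be implemented differently):

  # Skip the pool-filling pass when the pool has too much free space to
  # fill with a single dd in reasonable time.
  FSNAME=lustre                      # assumed filesystem name (matches the OST UUIDs above)
  POOL=testpool                      # hypothetical pool name
  MNT=/mnt/lustre
  MAXFREE=$((4 * 1024 * 1024))       # illustrative threshold: 4 GB in kB

  # Sum the available kB of the OSTs that belong to the pool.
  avail=0
  for ost in $(lfs pool_list $FSNAME.$POOL | grep _UUID); do
      free=$(lfs df $MNT | awk -v o="$ost" '$1 == o { print $4 }')
      avail=$((avail + ${free:-0}))
  done

  if [ "$avail" -gt "$MAXFREE" ]; then
      echo "Pool $POOL has ${avail} kB available (> ${MAXFREE} kB), skipping the fill test"
      exit 0
  fi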
| Comment by Andreas Dilger [ 29/Sep/15 ] |
|
Close old bug. |