[LU-302] ll_ost_io_* threads hung Created: 10/May/11  Updated: 11/May/11  Resolved: 11/May/11

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 1.8.6

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Yang Sheng
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre Branch: b1_8
Lustre Build: http://newbuild.whamcloud.com/job/lustre-reviews/363/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/
Kernel Version: 2.6.18-238.9.1.el5_lustre.20110509050254


Severity: 3
Rank (Obsolete): 10122

 Description   

While running the runtests test, the ll_ost_io_* threads hung as follows:

Lustre: DEBUG MARKER: copying files from /etc /bin to /mnt/lustre/runtest.5368/etc /bin at Tue May 10 02:14:07 PDT 2011
Lustre: Service thread pid 6575 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 6575, comm: ll_ost_io_03

Call Trace:
 [<ffffffff8006466c>] __down_read+0x7a/0x92
 [<ffffffff88bb2b0f>] ldiskfs_ext_walk_space+0xdf/0x2d0 [ldiskfs]
 [<ffffffff88c0bf10>] ldiskfs_ext_new_extent_cb+0x0/0x650 [fsfilt_ldiskfs]
 [<ffffffff8006456b>] __down_write_nested+0x12/0x92
 [<ffffffff88c0846d>] fsfilt_map_nblocks+0xfd/0x150 [fsfilt_ldiskfs]
 [<ffffffff88c69a7d>] filter_direct_io+0x46d/0xd50 [obdfilter]
 [<ffffffff88c08be7>] fsfilt_ldiskfs_setattr+0x1a7/0x250 [fsfilt_ldiskfs]
 [<ffffffff88c6c840>] filter_commitrw_write+0x1800/0x2be0 [obdfilter]
 [<ffffffff8005c33c>] cache_alloc_refill+0x106/0x186
 [<ffffffff88c24eed>] ost_checksum_bulk+0x37d/0x5a0 [ost]
 [<ffffffff88c2bd09>] ost_brw_write+0x1c99/0x2480 [ost]
 [<ffffffff8001aa2d>] vsnprintf+0x5df/0x627
 [<ffffffff88945f25>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
 [<ffffffff889460d8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88c2f09e>] ost_handle+0x2bae/0x55b0 [ost]
 [<ffffffff8890019a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
 [<ffffffff887b4a87>] libcfs_next_nidstring+0x37/0x50 [libcfs]
 [<ffffffff889556f9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
 [<ffffffff88955e55>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
 [<ffffffff8008c86f>] __wake_up_common+0x3e/0x68
 [<ffffffff88956de6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88955e80>] ptlrpc_main+0x0/0x1120 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1305018888.6575

Please refer to the following Maloo report for more logs:
https://maloo.whamcloud.com/test_sets/ecc7177c-7aec-11e0-b5bf-52540025f9af

This issue is blocking testing on the b1_8 branch.



 Comments   
Comment by Peter Jones [ 10/May/11 ]

Johann

Yu Jian suspects that this issue may be due to a recent landing. Could you please have a quick look and then assign as necessary? This is blocking all 1.8.x testing atm

Regards

Peter

Comment by Andreas Dilger [ 10/May/11 ]

This is probably the ext_walk_space locking change that also hit some of the newer kernels, probably because the RHEL 5 kernel backported the same change.

I think we need a better configure check to determine whether ext_walk_space needs to be locked by the caller or internally. I proposed a way to do this using "grep -A" in the original RHEL6 bug in bugzilla that Kalpak was working on.
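A minimal sketch of the kind of configure-time check described above, using "grep -A" to decide whether ext4_ext_walk_space() takes i_data_sem internally (in which case the fsfilt caller must not take it again, which is the deadlock seen in this ticket). The function name, the context depth of 10 lines, and the yes/no convention are all assumptions for illustration, not the actual patch:

```shell
# Sketch: probe the kernel source to see whether ext4_ext_walk_space()
# does its own i_data_sem locking. Returns "yes" if the lock is taken
# inside the function (so the build should define something like
# WALK_SPACE_HAS_I_DATA_SEM), "no" if the caller must lock.
check_walk_space_sem() {
    # $1: path to the kernel's fs/ext4/extents.c (or ldiskfs equivalent)
    if grep -A 10 'ext4_ext_walk_space' "$1" 2>/dev/null \
         | grep -q 'i_data_sem'; then
        echo yes    # locking is internal to ext4_ext_walk_space()
    else
        echo no     # caller is expected to hold i_data_sem itself
    fi
}
```

A real check would live in lustre's configure machinery and turn the result into a -D define for fsfilt_ldiskfs; the grep approach is only as robust as the context window, which is why the exact pattern matters.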

Comment by Peter Jones [ 10/May/11 ]

YangSheng

It seems that this is something that you have been working on

Regards

Peter

Comment by Yang Sheng [ 11/May/11 ]

It looks like RHEL 5.6 backported upstream commit fab3a549e204172236779f502eccb4f9bf0dc87d ("ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)"), so we may need to land the fix from https://bugzilla.lustre.org/show_bug.cgi?id=23780. The original patch was against SLES11 SP1. As a further solution, how about changing down_read() to down_write() in ext4_ext_walk_space:
/* find extent for this block */
down_read(&EXT4_I(inode)->i_data_sem);
path = ext4_ext_find_extent(inode, block, path);
up_read(&EXT4_I(inode)->i_data_sem);
and trying to push that change upstream? Combined with a WALK_SPACE_HAS_I_DATA_SEM configure check, we could handle this situation.

Comment by Johann Lombardi (Inactive) [ 11/May/11 ]

This should be addressed by http://review.whamcloud.com/#change,491

Comment by Yang Sheng [ 11/May/11 ]

So just porting LU-216 to b1_8 should be enough to resolve this issue.

Generated at Sat Feb 10 01:05:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.