[LU-302] ll_ost_io_* threads hung Created: 10/May/11 Updated: 11/May/11 Resolved: 11/May/11 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | Lustre 1.8.6 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Yang Sheng |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Branch: b1_8 |
||
| Severity: | 3 |
| Rank (Obsolete): | 10122 |
| Description |
|
While running the runtests test, the ll_ost_io_* threads hung as follows:

Lustre: DEBUG MARKER: copying files from /etc /bin to /mnt/lustre/runtest.5368/etc /bin at Tue May 10 02:14:07 PDT 2011
Lustre: Service thread pid 6575 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 6575, comm: ll_ost_io_03
Call Trace:
 [<ffffffff8006466c>] __down_read+0x7a/0x92
 [<ffffffff88bb2b0f>] ldiskfs_ext_walk_space+0xdf/0x2d0 [ldiskfs]
 [<ffffffff88c0bf10>] ldiskfs_ext_new_extent_cb+0x0/0x650 [fsfilt_ldiskfs]
 [<ffffffff8006456b>] __down_write_nested+0x12/0x92
 [<ffffffff88c0846d>] fsfilt_map_nblocks+0xfd/0x150 [fsfilt_ldiskfs]
 [<ffffffff88c69a7d>] filter_direct_io+0x46d/0xd50 [obdfilter]
 [<ffffffff88c08be7>] fsfilt_ldiskfs_setattr+0x1a7/0x250 [fsfilt_ldiskfs]
 [<ffffffff88c6c840>] filter_commitrw_write+0x1800/0x2be0 [obdfilter]
 [<ffffffff8005c33c>] cache_alloc_refill+0x106/0x186
 [<ffffffff88c24eed>] ost_checksum_bulk+0x37d/0x5a0 [ost]
 [<ffffffff88c2bd09>] ost_brw_write+0x1c99/0x2480 [ost]
 [<ffffffff8001aa2d>] vsnprintf+0x5df/0x627
 [<ffffffff88945f25>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
 [<ffffffff889460d8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88c2f09e>] ost_handle+0x2bae/0x55b0 [ost]
 [<ffffffff8890019a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
 [<ffffffff887b4a87>] libcfs_next_nidstring+0x37/0x50 [libcfs]
 [<ffffffff889556f9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
 [<ffffffff88955e55>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
 [<ffffffff8008c86f>] __wake_up_common+0x3e/0x68
 [<ffffffff88956de6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff88955e80>] ptlrpc_main+0x0/0x1120 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1305018888.6575

Please refer to the following Maloo report for more logs:

The issue is blocking the testing on the b1_8 branch. |
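One possible reading of the trace above: __down_write_nested appears under fsfilt_map_nblocks and __down_read appears under ldiskfs_ext_walk_space, which would be consistent with the same thread taking a semaphore for write and then trying to take it again for read. The sketch below only illustrates that pattern and is not Lustre or kernel code. It assumes a non-reentrant threading.Lock as a stand-in for the kernel rw-semaphore, and the function names map_nblocks and walk_space are hypothetical. A timeout is used so the example terminates; a real semaphore would block indefinitely.

```python
import threading

# Stand-in for the inode's semaphore (assumption: a plain non-reentrant
# lock is enough to show the self-deadlock pattern).
i_data_sem = threading.Lock()


def walk_space(locks_internally: bool, timeout: float = 0.2) -> str:
    """Hypothetical callee: optionally takes the lock itself."""
    if locks_internally:
        # If the caller already holds the lock, this acquire cannot succeed.
        if not i_data_sem.acquire(timeout=timeout):
            return "hung"
        try:
            return "walked"
        finally:
            i_data_sem.release()
    return "walked"


def map_nblocks(callee_locks_internally: bool) -> str:
    """Hypothetical caller: always takes the lock before walking."""
    with i_data_sem:
        return walk_space(callee_locks_internally)


if __name__ == "__main__":
    print(map_nblocks(False))  # callee relies on the caller's lock
    print(map_nblocks(True))   # callee re-takes the lock the caller holds
```

The point of the sketch is that the caller and the callee must agree on who owns the locking; when both take the lock, the second acquire never completes.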
| Comments |
| Comment by Peter Jones [ 10/May/11 ] |
|
Johann, Yu Jian suspects that this issue may be due to a recent landing. Could you please have a quick look and then assign as necessary? This is blocking all 1.8.x testing at the moment. Regards, Peter |
| Comment by Andreas Dilger [ 10/May/11 ] |
|
This is probably the ext_walk_space locking change that also hit some of the newer kernels, probably because the RHEL 5 kernel backported some change. I think we need a better configure check to determine whether ext_walk_space needs to be locked by the caller or internally. I proposed a way to do this using "grep -A" in the original RHEL6 bug in bugzilla that Kalpak was working on. |
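The "grep -A" idea in the comment above could be sketched roughly as follows: find the definition of the walk function in the kernel source text and look at the lines that follow it for a down_read() on i_data_sem. This is a heuristic illustration only, not the check proposed in bugzilla. The helper name walk_space_locks_internally, the 60-line window, and both sample snippets are assumptions made up for the example; they are abbreviated and are not real kernel source.

```python
import re


def walk_space_locks_internally(source: str,
                                func: str = "ext4_ext_walk_space",
                                window: int = 60) -> bool:
    """Heuristic: does `func` take i_data_sem for read near its definition?"""
    lines = source.splitlines()
    define = re.compile(r"\b" + re.escape(func) + r"\s*\(")
    lock = re.compile(r"\bdown_read\s*\(.*i_data_sem")
    for i, line in enumerate(lines):
        # Treat an unindented line that names the function and does not end
        # in ';' as the definition (skips prototypes and indented calls).
        if (define.search(line)
                and not line[:1].isspace()
                and not line.rstrip().endswith(";")):
            # Equivalent in spirit to "grep -A <window>" after the match.
            return any(lock.search(l) for l in lines[i + 1:i + 1 + window])
    return False


# Made-up, abbreviated samples for illustration only.
PATCHED = """\
int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
                        ext4_lblk_t num, ext_prepare_callback func,
                        void *cbdata)
{
        down_read(&EXT4_I(inode)->i_data_sem);
        path = ext4_ext_find_extent(inode, block, path);
        up_read(&EXT4_I(inode)->i_data_sem);
}
"""

UNPATCHED = """\
int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
                        ext4_lblk_t num, ext_prepare_callback func,
                        void *cbdata)
{
        path = ext4_ext_find_extent(inode, block, path);
}
"""

if __name__ == "__main__":
    print(walk_space_locks_internally(PATCHED))
    print(walk_space_locks_internally(UNPATCHED))
```

A text match like this can be fooled by comments or unusual formatting, so a configure check built on it would still need to be validated against each supported kernel.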
| Comment by Peter Jones [ 10/May/11 ] |
|
YangSheng, it seems that this is something that you have been working on. Regards, Peter |
| Comment by Yang Sheng [ 11/May/11 ] |
|
It looks like RHEL5.6 backported upstream commit fab3a549e204172236779f502eccb4f9bf0dc87d (ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)), so we may need to land the fix from https://bugzilla.lustre.org/show_bug.cgi?id=23780. The original patch was against SLES11 SP1. As a further solution, how about changing down_read() to down_write() in ext4_ext_walk_space: |
| Comment by Johann Lombardi (Inactive) [ 11/May/11 ] |
|
This should be addressed by http://review.whamcloud.com/#change,491 |
| Comment by Yang Sheng [ 11/May/11 ] |
|
So just porting LU-216 to b1_8 is enough to resolve this issue. |