[LU-5023] sanity-lfsck test_12: OST lfsck_stop hung Created: 07/May/14 Updated: 03/Feb/15 Resolved: 03/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 13903 | ||||||||
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com> This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4d306b04-d1ed-11e3-a9d1-52540035b04c. searching in maloo I see 8 similar failures reported starting 5/1/2014 The sub-test test_12 failed with the following error:
Info required for matching: sanity-lfsck 12 |
| Comments |
| Comment by Oleg Drokin [ 08/May/14 ] |
|
There appears to be some deadlock on ost: 01:01:32:INFO: task ll_ost00_004:12219 blocked for more than 120 seconds. 01:01:32: Not tainted 2.6.32-431.11.2.el6_lustre.g324fa81.x86_64 #1 01:01:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 01:01:32:ll_ost00_004 D 0000000000000000 0 12219 2 0x00000080 01:01:32: ffff880079891bb0 0000000000000046 0000000000000000 ffff88006c70facd 01:01:32: ffffffffa08c3b3d ffffffffa08c3b3b ffff880079891cc0 ffff88006c710000 01:01:32: ffff88007d7fe5f8 ffff880079891fd8 000000000000fbc8 ffff88007d7fe5f8 01:01:32:Call Trace: 01:01:32: [<ffffffff81529b0e>] __mutex_lock_slowpath+0x13e/0x180 01:01:32: [<ffffffffa0864143>] ? null_alloc_rs+0xf3/0x390 [ptlrpc] 01:01:32: [<ffffffff815299ab>] mutex_lock+0x2b/0x50 01:01:32: [<ffffffffa0dd9cff>] lfsck_stop+0x12f/0x410 [lfsck] 01:01:32: [<ffffffffa0827505>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] 01:01:32: [<ffffffffa084f086>] ? __req_capsule_get+0x166/0x6e0 [ptlrpc] 01:01:32: [<ffffffffa0dde214>] lfsck_in_notify+0x444/0x5f0 [lfsck] 01:01:32: [<ffffffffa0882852>] tgt_handle_lfsck_notify+0x62/0x150 [ptlrpc] 01:01:32: [<ffffffffa088a93c>] tgt_request_handle+0x23c/0xac0 [ptlrpc] 01:01:32: [<ffffffffa0839b3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc] 01:01:32: [<ffffffffa0838e20>] ? ptlrpc_main+0x0/0x1980 [ptlrpc] 01:01:32: [<ffffffff8109aee6>] kthread+0x96/0xa0 01:01:33: [<ffffffff8100c20a>] child_rip+0xa/0x20 01:01:33: [<ffffffff8109ae50>] ? kthread+0x0/0xa0 01:01:33: [<ffffffff8100c200>] ? child_rip+0x0/0x20 |
| Comment by Andreas Dilger [ 08/May/14 ] |
|
LFSCK patches that have landed around that time: commit a8c9796fddaeda0572a64cedb471b2eead78c506
CommitDate: Tue Apr 29 17:22:17 2014 +0000
LU-4941 lfsck: check LOV EA header properly
Reviewed-on: http://review.whamcloud.com/10045
commit 63e7b15a1baa1aae03dfdc965e0777cf754ff29a
CommitDate: Wed Apr 30 03:41:14 2014 +0000
LU-4895 lfsck: not create object to repair dangling by default
Reviewed-on: http://review.whamcloud.com/9989
commit 1dbba329174e6c7f7712f01fc4e44c44400fbc92
CommitDate: Wed Apr 30 03:42:00 2014 +0000
LU-4556 tests: speed up sanity-lfsck and sanity-scrub tests
Reviewed-on: http://review.whamcloud.com/9704
Not sure if any of these are the cause, but likely candidates. Jodi thought that this hadn't been hit since the initial group of failures, so it is possible that a later patch may have fixed it? |
| Comment by nasf (Inactive) [ 09/May/14 ] |
|
It cannot say that there is deadlock on the OST. The real case is probably like that: 1) The LFSCK thread on the OST is doing something but it is blocked. So we need to find out what the LFSCK thread was doing at that time. Unfortunately, the current log was not enough to indicate that. I will make a debug patch to collect more logs. |
| Comment by nasf (Inactive) [ 09/May/14 ] |
|
The debug patch for more logs: |
| Comment by Bob Glossman (Inactive) [ 09/May/14 ] |
|
more: |
| Comment by nasf (Inactive) [ 12/May/14 ] |
|
> https://maloo.whamcloud.com/test_sets/2eada7a0-d4c7-11e3-808a-52540035b04c In above failure cases, the test_11b has failed because improperly FID seq checking and error handler. I have updated the patch 10276 to resolve that. |
| Comment by Gerrit Updater [ 03/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10276/ |
| Comment by Peter Jones [ 03/Feb/15 ] |
|
Landed for 2.7 |