[LU-5023] sanity-lfsck test_12: OST lfsck_stop hung Created: 07/May/14  Updated: 03/Feb/15  Resolved: 03/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5049 Test failure on test suite sanity-scr... Resolved
Severity: 3
Rank (Obsolete): 13903

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4d306b04-d1ed-11e3-a9d1-52540035b04c.

searching in maloo I see 8 similar failures reported starting 5/1/2014
suspect something happened then to start causing these failures
Don't think it's specific to the mod under test.

The sub-test test_12 failed with the following error:

test failed to respond and timed out

Info required for matching: sanity-lfsck 12



 Comments   
Comment by Oleg Drokin [ 08/May/14 ]

There appears to be some deadlock on ost:

01:01:32:INFO: task ll_ost00_004:12219 blocked for more than 120 seconds.
01:01:32:      Not tainted 2.6.32-431.11.2.el6_lustre.g324fa81.x86_64 #1
01:01:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
01:01:32:ll_ost00_004  D 0000000000000000     0 12219      2 0x00000080
01:01:32: ffff880079891bb0 0000000000000046 0000000000000000 ffff88006c70facd
01:01:32: ffffffffa08c3b3d ffffffffa08c3b3b ffff880079891cc0 ffff88006c710000
01:01:32: ffff88007d7fe5f8 ffff880079891fd8 000000000000fbc8 ffff88007d7fe5f8
01:01:32:Call Trace:
01:01:32: [<ffffffff81529b0e>] __mutex_lock_slowpath+0x13e/0x180
01:01:32: [<ffffffffa0864143>] ? null_alloc_rs+0xf3/0x390 [ptlrpc]
01:01:32: [<ffffffff815299ab>] mutex_lock+0x2b/0x50
01:01:32: [<ffffffffa0dd9cff>] lfsck_stop+0x12f/0x410 [lfsck]
01:01:32: [<ffffffffa0827505>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
01:01:32: [<ffffffffa084f086>] ? __req_capsule_get+0x166/0x6e0 [ptlrpc]
01:01:32: [<ffffffffa0dde214>] lfsck_in_notify+0x444/0x5f0 [lfsck]
01:01:32: [<ffffffffa0882852>] tgt_handle_lfsck_notify+0x62/0x150 [ptlrpc]
01:01:32: [<ffffffffa088a93c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
01:01:32: [<ffffffffa0839b3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
01:01:32: [<ffffffffa0838e20>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
01:01:32: [<ffffffff8109aee6>] kthread+0x96/0xa0
01:01:33: [<ffffffff8100c20a>] child_rip+0xa/0x20
01:01:33: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
01:01:33: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Comment by Andreas Dilger [ 08/May/14 ]

LFSCK patches that have landed around that time:

commit a8c9796fddaeda0572a64cedb471b2eead78c506
CommitDate: Tue Apr 29 17:22:17 2014 +0000
    LU-4941 lfsck: check LOV EA header properly
    Reviewed-on: http://review.whamcloud.com/10045
commit 63e7b15a1baa1aae03dfdc965e0777cf754ff29a
CommitDate: Wed Apr 30 03:41:14 2014 +0000
    LU-4895 lfsck: not create object to repair dangling by default    
    Reviewed-on: http://review.whamcloud.com/9989
commit 1dbba329174e6c7f7712f01fc4e44c44400fbc92
CommitDate: Wed Apr 30 03:42:00 2014 +0000
    LU-4556 tests: speed up sanity-lfsck and sanity-scrub tests
    Reviewed-on: http://review.whamcloud.com/9704

Not sure if any of these are the cause, but likely candidates. Jodi thought that this hadn't been hit since the initial group of failures, so it is possible that a later patch may have fixed it?

Comment by nasf (Inactive) [ 09/May/14 ]

It cannot say that there is deadlock on the OST. The real case is probably like that:

1) The LFSCK thread on the OST is doing something but it is blocked.
2) At that time, the first lfsck_stop RPC from the MDT comes in, the RPC handler (on the OST) holds the lfsck::li_mutex to signal the LFSCK thread, then wait until the LFSCK thread stopped.
3) Because the LFSCK thread is blocked on some unknown event, it cannot respond the stop signal, so the first stop RPC is blocked there.
4) Then the subsequent lfsck_stop RPC handlers will be blocked at the lfsck::li_mutex by the first lfsck_stop RPC handler.

So we need to find out what the LFSCK thread was doing at that time. Unfortunately, the current log was not enough to indicate that. I will make a debug patch to collect more logs.

Comment by nasf (Inactive) [ 09/May/14 ]

The debug patch for more logs:

http://review.whamcloud.com/10276

Comment by Bob Glossman (Inactive) [ 09/May/14 ]

more:
https://maloo.whamcloud.com/test_sets/2eada7a0-d4c7-11e3-808a-52540035b04c
https://maloo.whamcloud.com/test_sets/465c7ec4-d33e-11e3-a102-52540035b04c
https://maloo.whamcloud.com/test_sets/542c478e-d75b-11e3-9d16-52540035b04c

Comment by nasf (Inactive) [ 12/May/14 ]

> https://maloo.whamcloud.com/test_sets/2eada7a0-d4c7-11e3-808a-52540035b04c
> https://maloo.whamcloud.com/test_sets/465c7ec4-d33e-11e3-a102-52540035b04c
> https://maloo.whamcloud.com/test_sets/542c478e-d75b-11e3-9d16-52540035b04c

In above failure cases, the test_11b has failed because improperly FID seq checking and error handler. I have updated the patch 10276 to resolve that.

Comment by Gerrit Updater [ 03/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10276/
Subject: LU-5023 tests: check FID seq properly for sanity-lfsck t_11b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 89a53691076ba45ef415de511ed840a0e3cbdd2c

Comment by Peter Jones [ 03/Feb/15 ]

Landed for 2.7

Generated at Sat Feb 10 01:47:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.