Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.9.0
    • Environment: Soak test cluster, tip of lustre-master, lustre: 2.8.59_79_gb8811a0
    • Severity: 3
    • 9223372036854775807

    Description

      After the previous failover, MDT0003 is failed back to lola-11:
      Nov 8 22:10:14 lola-10 kernel: Lustre: server umount soaked-MDT0003 complete

      An MDS failover fault is induced, and lola-11 fails MDT0003 over to lola-10:

      2016-11-08 22:24:25,959:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0003 ...
      2016-11-08 22:24:25,959:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...
      

      Recovery starts on the MDS:

      Nov 8 22:25:29 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450

      Task blocked on lola-10:

      Nov  8 22:29:17 lola-10 kernel: INFO: task mdt_out01_007:6305 blocked for more than 120 seconds.
      Nov  8 22:29:17 lola-10 kernel:      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
      Nov  8 22:29:17 lola-10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Nov  8 22:29:17 lola-10 kernel: mdt_out01_007 D 0000000000000018     0  6305      2 0x00000080
      Nov  8 22:29:17 lola-10 kernel: ffff8808104b7b40 0000000000000046 0000000000000000 ffff8808104b7d00
      Nov  8 22:29:17 lola-10 kernel: ffff8804337ea800 ffff880824300118 000014206fe1f7c1 ffff880824300118
      Nov  8 22:29:17 lola-10 kernel: ffff8808104b7b40 00000001014d3a26 ffff880834add068 ffff8808104b7fd8
      Nov  8 22:29:17 lola-10 kernel: Call Trace:
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa07bca3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09dd442>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09b2af0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa07bd7cc>] dt_locate_at+0x1c/0xa0 [obdclass]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a24147>] out_handle+0x1067/0x18d0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa067ac8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a142ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a1b4bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09c7bc1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09c6e90>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      Nov  8 22:29:18 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Second blocked task:

      Nov  8 22:29:32 lola-10 kernel: Call Trace:
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa07bca3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09dd442>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09b2af0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa07bd7cc>] dt_locate_at+0x1c/0xa0 [obdclass]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a24147>] out_handle+0x1067/0x18d0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa067ac8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a142ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a1b4bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09c7bc1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09c6e90>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      The recovery timer completes, but recovery does not:

      mdt.soaked-MDT0003.recovery_status=
      status: RECOVERING
      recovery_start: 1478672734
      time_remaining: 0
      connected_clients: 21/21
      req_replay_clients: 4
      lock_replay_clients: 5
      completed_clients: 16
      evicted_clients: 0
      replayed_requests: 7
      queued_requests: 4
      next_transno: 77309522798
      

      System finally notices:

      Nov  9 08:22:35 lola-10 kernel: Lustre: soaked-MDT0003: Recovery already passed deadline 588:42, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      

      Attached lustre-log dumps from lola-10
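
      For reference, a sketch of the commands commonly used on the MDS in this situation; the target and host names are the ones from this ticket, and exact behaviour may vary by Lustre version:

      # Read the recovery state of the stuck target (the source of the output quoted above)
      lctl get_param mdt.soaked-MDT0003.recovery_status

      # Dump the kernel debug buffer to a file (one way lustre-log dumps like the attached ones are produced)
      lctl dk /tmp/lustre-log.lola-10.txt

      # Last resort, as suggested by the console message: abort the stuck recovery on this MDT
      # (clients that have not finished replay will be evicted and must reconnect)
      lctl --device soaked-MDT0003 abort_recovery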

    Activity

            [LU-8815] MDS fails to complete recovery
            yong.fan nasf (Inactive) added a comment - New build: https://build.hpdd.intel.com/job/lustre-reviews/46754/
            cliffw Cliff White (Inactive) added a comment (edited)

            We have also moved to RHEL 7 with the new soak; can you re-trigger your build? Jenkins no longer has the bits. I get RPMs from Jenkins.

            yong.fan nasf (Inactive) added a comment -
            https://review.whamcloud.com/#/c/25260/ patch set 4 build status:
            Overall Build Status: SUCCESS
            Builder CentOS 6.7 x86_64 (BUILD) succeeded (build successful) - http://build.lustre.org/builders/CentOS%206.7%20x86_64%20%28BUILD%29/builds/7691
            Builder CentOS 6.8 x86_64 (BUILD) succeeded (build successful) - http://build.lustre.org/builders/CentOS%206.8%20x86_64%20%28BUILD%29/builds/3827
            Builder CentOS 7.2 x86_64 (BUILD) succeeded (build successful) - http://build.lustre.org/builders/CentOS%207.2%20x86_64%20%28BUILD%29/builds/7677
            Builder Ubuntu 14.04 x86_64 (BUILD) succeeded (build successful) - http://build.lustre.org/builders/Ubuntu%2014.04%20x86_64%20%28BUILD%29/builds/7637

            cliffw Cliff White (Inactive) added a comment -
            We have been busy with the move, but have not seen any recovery issues. I should be able to try the patch this week; is there a build?

            yong.fan nasf (Inactive) added a comment -
            Any update with the patch applied? Thanks!

            cliffw Cliff White (Inactive) added a comment -
            We are moving soak hardware this week, will add that to the list.

            yong.fan nasf (Inactive) added a comment -
            Cliff, do you have any chance to reproduce the issue with the patch https://review.whamcloud.com/#/c/25260/ applied? Thanks!

            yong.fan nasf (Inactive) added a comment - +1 on master: https://testing.hpdd.intel.com/test_sets/b54d8c9c-eee8-11e6-9756-5254006e85c2

            yong.fan nasf (Inactive) added a comment -
            Cliff, would you please verify the debug patch https://review.whamcloud.com/#/c/25260/? That can tell us which FID caused the blocking. Thanks!

            gerrit Gerrit Updater added a comment -
            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/25260
            Subject: LU-8815 obdclass: debug information for lu_object_find_at
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e8c1940e2e7b0afa0b54d139c70359425bf92d40
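
            For anyone who wants to test the change before an official build is available, a sketch of pulling it straight from Gerrit (assuming the standard fs/lustre-release repository and Gerrit ref layout; patch set 1 as listed above):

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            # Gerrit refs are refs/changes/<last two digits>/<change number>/<patch set>
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/60/25260/1
            git checkout -b lu-8815-debug FETCH_HEAD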

            yong.fan nasf (Inactive) added a comment -
            According to the bug description, it seems that one MDT was trying to recover/replay some cross-MDT RPC, which triggered an OUT_ATTR_GET RPC. Unfortunately, that OUT RPC was blocked on the remote MDT inside lu_object_find_at(). Usually, lu_object_find_at() will block if the target object is in RAM but marked as dying. If someone holds a reference on the dying object, then others will be blocked until all the references have been released. So we need to find out what the target object is, and who holds the reference on that dying object.

            Do we have any clue about the first question: what is the target object?
            Di, could it be an update_log file on the remote MDT?

            As for the OI scrub stack trace in the comment above, that seems unrelated, because OI scrub operates on the inode directly and does NOT reference the upper-layer object.
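
            As a rough sketch of the data that could help answer those questions on the blocked MDS, using standard kernel and lctl facilities (buffer size and debug mask below are only examples):

            # Dump the stacks of all tasks in uninterruptible sleep, to see every thread stuck in lu_object_find_at()
            echo w > /proc/sysrq-trigger

            # Enlarge the Lustre debug buffer and enable verbose debugging before reproducing the hang
            lctl set_param debug_mb=512
            lctl set_param debug=-1

            # Once the hang is reproduced, dump the debug buffer for analysis
            lctl dk /tmp/lustre-debug.lola-10.txt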

    People

      Assignee: yong.fan nasf (Inactive)
      Reporter: cliffw Cliff White (Inactive)
