Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.7.0
    • Labels: None
    • Severity: 3
    • 9223372036854775807

    Description

      As mentioned in LU-6683, I ran into a situation where lctl lfsck_stop just hangs indefinitely.

      I have managed to reproduce this twice:

      Start LFSCK (using lctl lfsck_start -M play01-MDT0000 -t layout); this crashes the OSS servers. Reboot the servers and restart the OSTs. Attempting to stop the LFSCK in this state just hangs; I waited more than an hour and it was still hanging. Unmounting the MDT in this situation also appears to hang (after 30 minutes I power cycled the MDS).

      Attachments

        1. 15.lctl.tgz
          631 kB
        2. lustre.dmesg.bz2
          37 kB
        3. lustre.log.bz2
          1.38 MB


          Activity

            [LU-6684] lctl lfsck_stop hangs

            yong.fan nasf (Inactive) added a comment - The patch has been landed to master.

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17032/
            Subject: LU-6684 lfsck: stop lfsck even if some servers offline
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: afcf3026c6ad203b9882eaeac76326357f26fe71

            yong.fan nasf (Inactive) added a comment -

            There are several cases:

            1) The LFSCK/OI scrub is running on the MDS that is to be remounted.

            1.1) If the MDT is umounted while the LFSCK/OI scrub is running in the background, the LFSCK/OI scrub status will be marked as paused. When the MDT is remounted, after recovery is done, the paused LFSCK/OI scrub will be resumed from the latest checkpoint and its status will be set back to the one it had before being paused.

            1.2) If the MDT crashed while the LFSCK/OI scrub was running in the background, there is no time for the LFSCK/OI scrub to change its status. When the MDT is remounted, its status will be marked as crashed, and after recovery is done, the crashed LFSCK/OI scrub will be resumed from the latest checkpoint and its status will be set back to the one it had before the crash.

            2) Assume the LFSCK/OI scrub is running on one MDT_a, and another related server MDT_b/OST_c is to be remounted.

            2.1) If the LFSCK on MDT_a needs to talk with the MDT_b/OST_c that is umounted/crashed for verification, the LFSCK on MDT_a will get the related connection failure, know that this peer server has left the LFSCK, and go ahead to verify the rest of the system; it will neither wait forever nor fail out unless you specified "-e abort". So the LFSCK on MDT_a can finish eventually, and the status will be 'partial' if no other failure happened.

            2.2) If we want to stop the LFSCK on MDT_a, then MDT_a needs to notify the related peer MDT_b/OST_c to stop the LFSCK as well. If it finds that the peer server MDT_b/OST_c is already offline, the LFSCK on MDT_a will go ahead and handle the stop process anyway.

            In this ticket, we hit trouble in the 2.2) case. Because the LFSCK did not detect that OST_c was offline, lfsck_stop was blocked by the reconnection to OST_c.
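            To make case 2.2) concrete, here is a minimal userspace C model (not the actual Lustre code; the names lfsck_tgt, broadcast_stop and the connection states are invented for illustration) of the behaviour the fix aims for: when the stop request is broadcast to peer targets, a peer whose connection is already known to be down is skipped and only counted as unreachable, instead of the stop path blocking until that peer reconnects.

              #include <stdio.h>

              /* Hypothetical connection states, loosely modelled on an import's state. */
              enum conn_state { CONN_FULL, CONN_DISCONNECTED };

              /* Hypothetical peer target taking part in the LFSCK. */
              struct lfsck_tgt {
                  const char      *name;
                  enum conn_state  state;
              };

              /*
               * Broadcast "stop" to all peers.  Offline peers are skipped rather than
               * waited for, so the stop operation always completes.
               * Returns the number of peers that could not be notified.
               */
              static int broadcast_stop(struct lfsck_tgt *tgts, int count)
              {
                  int skipped = 0;

                  for (int i = 0; i < count; i++) {
                      if (tgts[i].state != CONN_FULL) {
                          printf("%s offline, skipping stop notification\n", tgts[i].name);
                          skipped++;
                          continue;
                      }
                      printf("sending stop notification to %s\n", tgts[i].name);
                  }
                  return skipped;
              }

              int main(void)
              {
                  struct lfsck_tgt tgts[] = {
                      { "MDT_b", CONN_FULL },
                      { "OST_c", CONN_DISCONNECTED },
                  };

                  int skipped = broadcast_stop(tgts, 2);
                  printf("stop done, %d peer(s) unreachable\n", skipped);
                  return 0;
              }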

            maximus Ashish Purkar (Inactive) added a comment -

            Andreas and Fan,
            What will happen in dry-run mode of OI scrub if MDS recovery happens, or if an MDT/OST goes down and is reconnecting? Attaching log file 15.lctl.tgz for reference.

            • Here the MDS goes into recovery while the OI scrub operation is underway.
            • The lfsck ns assistant stage2 is restarted and the post operation is done.
            • The test expects the dry-run to complete in 6 seconds, but due to the failover and the MDS undergoing recovery it takes longer (> 6 sec).

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17032
            Subject: LU-6684 lfsck: stop lfsck even if some servers offline
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 51f3f69fb300c5f65cbed46a99ec8307cdc9a4f4

            yong.fan nasf (Inactive) added a comment - It is NOT important whether the OST/MDT is deactivated manually before or after the "lctl lfsck_stop", so there is no "right order" involved. Since the ptlrpcd thread can handle the deactivate event, is it still necessary to introduce new SIGINT handlers?

            adilger Andreas Dilger added a comment -

            Nasf, I don't think we can require users to do things "in the right order" for them to work (i.e. to deactivate the OST/MDT manually before running "lctl lfsck_stop") if the OST is down. It definitely seems preferable to allow lfsck_stop to work properly regardless of the connection state.

            Would it be possible to allow the threads to be woken up by SIGINT but have them return -EINTR or -EAGAIN to the callers, and let the callers decide whether to retry in that case? I agree it isn't good to actually kill the ptlrpc threads. Maybe ptlrpc_set_wait() could be interruptible and cause ptlrpcd to abort those RPCs? It seems that something like this is already close to possible.
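            As a rough userspace illustration of the "return -EINTR and let the caller decide" idea (this is only a sketch, not ptlrpc_set_wait() or any real Lustre code; wait_for_reply and the retry policy are invented), the pattern is that the blocking wait is interruptible and reports EINTR, and the caller, not the wait primitive, chooses between retrying and giving up:

              #include <errno.h>
              #include <signal.h>
              #include <stdio.h>
              #include <time.h>

              static volatile sig_atomic_t got_sigint;

              static void on_sigint(int sig)
              {
                  (void)sig;
                  got_sigint = 1;    /* just record it; the blocked call returns EINTR */
              }

              /* Interruptible "wait for reply": 0 on success, -EINTR if interrupted. */
              static int wait_for_reply(int seconds)
              {
                  struct timespec ts = { .tv_sec = seconds, .tv_nsec = 0 };

                  if (nanosleep(&ts, NULL) == -1 && errno == EINTR)
                      return -EINTR;
                  return 0;
              }

              int main(void)
              {
                  struct sigaction sa = { .sa_handler = on_sigint };

                  sigaction(SIGINT, &sa, NULL);    /* no SA_RESTART, so waits return EINTR */

                  /* The caller decides what EINTR means: here, stop retrying. */
                  for (int attempt = 0; attempt < 5; attempt++) {
                      int rc = wait_for_reply(10);

                      if (rc == 0) {
                          printf("reply received\n");
                          return 0;
                      }
                      if (rc == -EINTR) {
                          printf("interrupted, aborting instead of resending\n");
                          return 1;
                      }
                  }
                  return 1;
              }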

            yong.fan nasf (Inactive) added a comment -

            In theory, we can do that. But the LWI is declared inside the ptlrpc layer. If we make the (LFSCK) thread that is waiting on the LWI handle SIGKILL, that means any thread (not only the LFSCK engine, but also other RPC service threads, ptlrpcd threads, and so on) can be killed by a user via "kill -9 $PID". That is not what we want, especially since someone may do it by mistake.

            If we want SIGKILL to be handled only by the LFSCK engine, then we need some mechanism to make the ptlrpc layer distinguish the LFSCK engine from other threads. But within the current server-side API and stack framework, that is difficult to do without some very ugly hack.
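            For reference, kernel threads only see signals they explicitly opt into, so one way to let a signal reach the LFSCK engine without making every ptlrpc or service thread killable is for that single kthread to call allow_signal() itself. The following is only a generic kthread sketch of that opt-in pattern (standard kernel APIs, not Lustre's LWI machinery; the names engine_fn and lfsck_demo are made up, and header locations can differ between kernel versions):

              #include <linux/init.h>
              #include <linux/kernel.h>
              #include <linux/err.h>
              #include <linux/kthread.h>
              #include <linux/module.h>
              #include <linux/delay.h>
              #include <linux/sched/signal.h>

              static struct task_struct *engine;

              static int engine_fn(void *arg)
              {
                      /* Only this thread opts in; other kthreads keep ignoring signals. */
                      allow_signal(SIGINT);

                      while (!kthread_should_stop()) {
                              msleep_interruptible(1000);  /* stands in for the blocking RPC wait */

                              if (signal_pending(current)) {
                                      pr_info("engine: interrupted, cleaning up\n");
                                      flush_signals(current);
                                      break;
                              }
                      }
                      /* Stay alive until kthread_stop() is called, then return. */
                      while (!kthread_should_stop())
                              msleep_interruptible(100);
                      return 0;
              }

              static int __init demo_init(void)
              {
                      engine = kthread_run(engine_fn, NULL, "lfsck_demo");
                      return IS_ERR(engine) ? PTR_ERR(engine) : 0;
              }

              static void __exit demo_exit(void)
              {
                      kthread_stop(engine);
              }

              module_init(demo_init);
              module_exit(demo_exit);
              MODULE_LICENSE("GPL");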

            adilger Andreas Dilger added a comment - It would be possible for "lctl lfsck_stop" to send SIGINT or SIGKILL to the lfsck thread to interrupt it, if it has the right LWI handler in out_remote_sync().
            yong.fan nasf (Inactive) added a comment (edited) -

            The stack trace is clear, as follows:

            lfsck_layout  S 0000000000000003     0  3643      2 0x00000000
             ffff880158e75a40 0000000000000046 0000000000000000 0000000000000000
             ffff8802a40b0ef0 ffff8802a40b0ec0 00020c3c3569f55d ffff8802a40b0ef0
             ffff880158e75a10 000000012256a121 ffff88012c1b05f8 ffff880158e75fd8
            Call Trace:
             [<ffffffff8152b102>] schedule_timeout+0x192/0x2e0
             [<ffffffff810874f0>] ? process_timeout+0x0/0x10
             [<ffffffffa09bd0e2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
             [<ffffffffa09b29c0>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
             [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
             [<ffffffffa09c7dc6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
             [<ffffffffa09bd741>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
             [<ffffffffa0a356d1>] out_remote_sync+0x111/0x200 [ptlrpc]
             [<ffffffffa144ca92>] osp_attr_get+0x352/0x600 [osp]
             [<ffffffffa1219e50>] lfsck_layout_assistant_handler_p1+0x530/0x19f0 [lfsck]
             [<ffffffffa066b1c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa11e06e6>] lfsck_assistant_engine+0x496/0x1de0 [lfsck]
             [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
             [<ffffffffa11e0250>] ? lfsck_assistant_engine+0x0/0x1de0 [lfsck]
             [<ffffffff8109e66e>] kthread+0x9e/0xc0
             [<ffffffff8100c20a>] child_rip+0xa/0x20
             [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            The thread was stuck inside the ptlrpc layer, so the LFSCK control flags could not wake it up. At that time, the target OSS was down, so the RPC (for attr_get, not for lfsck_stop) expired, which triggered the re-connection, and the RPC would be re-sent once the connection recovered. Because LFSCK shares the same RPC handling logic with cross-MDT operations (from the OSP => ptlrpc point of view they are indistinguishable), we cannot simply mark the RPC as no_resend.
            On the other hand, the LFSCK engine is a background thread and cannot receive ctrl-C, but maybe we could use "kill -9 $PID" for that. However, I am not sure whether we should allow someone to kill the background LFSCK engine with SIGKILL instead of going through the lfsck_stop interface.
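            The shape of the eventual behaviour change can be sketched as a plain userspace model (invented names like rpc_done, peer_offline and stop_requested; this is not the real ptlrpc/OSP code): the wait around the reconnect-and-resend cycle periodically re-checks whether the LFSCK has been asked to stop or the peer is known to be offline, and gives up in that case rather than waiting for the reconnection to succeed.

              #include <stdbool.h>
              #include <stdio.h>
              #include <unistd.h>

              /* Invented flags standing in for the real RPC/import/LFSCK state. */
              static bool rpc_done;        /* reply arrived */
              static bool peer_offline;    /* connection to the peer is known to be down */
              static bool stop_requested;  /* lfsck_stop has been issued */

              /*
               * Wait for the reply in bounded slices.  Each wakeup re-checks the stop
               * and connection state, so the caller is never stuck behind an endless
               * reconnect-and-resend cycle.
               */
              static int wait_for_rpc(int max_slices)
              {
                  for (int i = 0; i < max_slices; i++) {
                      if (rpc_done)
                          return 0;
                      if (stop_requested || peer_offline) {
                          printf("giving up: stop=%d offline=%d\n",
                                 stop_requested, peer_offline);
                          return -1;
                      }
                      sleep(1);    /* one bounded wait slice */
                  }
                  return -1;       /* overall deadline exceeded */
              }

              int main(void)
              {
                  peer_offline = true;     /* simulate the OST_c case from this ticket */
                  return wait_for_rpc(30) == 0 ? 0 : 1;
              }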


            adilger Andreas Dilger added a comment -

            I think that stopping the MDT or OST in such a case is too much. Is the RPC stuck at the ptlrpc layer? Is it the RPC sent by lfsck_stop itself to the OSS to stop the layout LFSCK that is stuck, or is lfsck_stop stuck waiting for something else? Is this RPC sent by ptlrpcd, or could ctrl-C interrupt the wait like a normal user process?

            Having a stack trace would be useful. Fan Yong, can you please create a sanity-lfsck test case for this and then collect a stack trace, so it is clearer what is stuck and where.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 13
