[LU-6684] lctl lfsck_stop hangs Created: 03/Jun/15  Updated: 06/Dec/17  Resolved: 02/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frederik Ferner (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File 15.lctl.tgz     File lustre.dmesg.bz2     File lustre.log.bz2    
Issue Links:
Duplicate
is duplicated by LU-7662 lfsck don't complete Resolved
Related
is related to LU-10321 MDS - umount hangs during failback Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

As mentioned in LU-6683, I ran into a situation where lctl lfsck_stop just hangs indefinitely.

I have managed to reproduce this twice:

Start lfsck (using lctl lfsck_start -M play01-MDT0000 -t layout); this crashes the OSS servers. Reboot the servers and restart the OSTs. Attempting to stop the lfsck in this state just hangs; I waited more than an hour and it was still hanging. Unmounting the MDT in this situation also appears to hang (after 30 minutes I power cycled the MDS).
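For reference, a sketch of the sequence that reproduces the hang (the MDT name is taken from the command above; the lfsck_stop form follows the standard lctl syntax):

  # on the MDS, start the layout LFSCK
  lctl lfsck_start -M play01-MDT0000 -t layout

  # ... OSS servers crash; reboot them and remount the OSTs ...

  # attempting to stop the LFSCK now hangs indefinitely
  lctl lfsck_stop -M play01-MDT0000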



 Comments   
Comment by Peter Jones [ 03/Jun/15 ]

Fan Yong

Could you please advise on this one?

Thanks

Peter

Comment by nasf (Inactive) [ 04/Jun/15 ]

Hi Frederik, if you can reproduce the issue, then please do the following on the MDT:

1) echo -1 > /proc/sys/lnet/debug
2) lctl clear
3) dmesg -c
4) when the lfsck_stop hung, "lctl dk > /tmp/lustre.log"
5) echo t > /proc/sysrq-trigger
6) dmesg > /tmp/lustre.dmesg

Please attach the lustre.log and lustre.dmesg. Thanks!
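
Put together as a single sequence on the MDS (the same commands as the numbered steps above):

  echo -1 > /proc/sys/lnet/debug     # enable full Lustre debug logging
  lctl clear                         # clear the Lustre debug buffer
  dmesg -c                           # clear the kernel ring buffer
  # ... reproduce the hang; then, while lfsck_stop is hung:
  lctl dk > /tmp/lustre.log          # dump the Lustre debug log
  echo t > /proc/sysrq-trigger       # dump all task stacks to dmesg
  dmesg > /tmp/lustre.dmesg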

Comment by Frederik Ferner (Inactive) [ 04/Jun/15 ]

I have reproduced it, files are attached.

(Note this was before applying the patch from LU-6683, so all OSSes were down while the lfsck_stop was hanging.)

Comment by nasf (Inactive) [ 06/Jun/15 ]

According to the log, lfsck_stop was waiting for the layout LFSCK thread to exit, but that thread was sending an RPC to the OST. At that time, the connection between the MDT and the OST was broken, and the MDT was trying to reconnect, but the reconnect RPC kept expiring and the reconnection kept being retried...

Comment by Andreas Dilger [ 07/Jun/15 ]

I think that stopping the MDT or OST in such a case is too much. Is the RPC stuck at the ptlrpc layer? Is it the RPC sent by lfsck_stop itself to the OSS to stop layout lfsck that is stuck or is lfsck_stop stuck waiting for something else? Is this RPC sent by ptlrpcd or could ctrl-C interrupt the wait like some normal user process?

Having a stack trace would be useful. Fan Yong, can you please create a sanity-lfsck test case for this and then collect a stack trace so it is clearer what is stuck and where.

Comment by nasf (Inactive) [ 07/Jun/15 ]

The stack trace is clear, as follows:

lfsck_layout  S 0000000000000003     0  3643      2 0x00000000
 ffff880158e75a40 0000000000000046 0000000000000000 0000000000000000
 ffff8802a40b0ef0 ffff8802a40b0ec0 00020c3c3569f55d ffff8802a40b0ef0
 ffff880158e75a10 000000012256a121 ffff88012c1b05f8 ffff880158e75fd8
Call Trace:
 [<ffffffff8152b102>] schedule_timeout+0x192/0x2e0
 [<ffffffff810874f0>] ? process_timeout+0x0/0x10
 [<ffffffffa09bd0e2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
 [<ffffffffa09b29c0>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa09c7dc6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
 [<ffffffffa09bd741>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa0a356d1>] out_remote_sync+0x111/0x200 [ptlrpc]
 [<ffffffffa144ca92>] osp_attr_get+0x352/0x600 [osp]
 [<ffffffffa1219e50>] lfsck_layout_assistant_handler_p1+0x530/0x19f0 [lfsck]
 [<ffffffffa066b1c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa11e06e6>] lfsck_assistant_engine+0x496/0x1de0 [lfsck]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa11e0250>] ? lfsck_assistant_engine+0x0/0x1de0 [lfsck]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

It was inside the ptlrpc layer, so the LFSCK control flags could not wake up such a thread. At that time, the target OSS was down, so the RPC (for attr_get, not for lfsck_stop) expired, which triggered the reconnection, and the RPC would be re-sent once the connection recovered. Because LFSCK shares the same RPC handling logic with cross-MDT operations (from the OSP => ptlrpc view they are indistinguishable), we cannot simply mark the RPC as no_resend.
On the other hand, the LFSCK engine is a background thread and cannot receive Ctrl-C, but maybe we could use "kill -9 $PID" for that. However, I am not sure whether we should allow someone to kill the background LFSCK engine via SIGKILL instead of through the lfsck_stop interface.
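For illustration only, such a manual kill would look something like this on the MDS (the kthread name comes from the stack trace above; as discussed, the signal would only take effect if the wait inside ptlrpc were made interruptible):

  # find the background LFSCK engine kthread
  ps -e -o pid,comm | grep lfsck
  #  3643 lfsck_layout

  # forcibly signal it; with the current code the thread stays blocked
  # in ptlrpc_set_wait(), so this alone does not unstick it
  kill -9 3643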

Comment by Andreas Dilger [ 07/Jun/15 ]

It would be possible for "lctl lfsck_stop" to send SIGINT or SIGKILL to the lfsck thread to interrupt it, if it has the right LWI handler in out_remote_sync().

Comment by nasf (Inactive) [ 08/Jun/15 ]

In theory, we can do that. But the LWI is declared inside the ptlrpc layer. If we want the (LFSCK) thread that is waiting on the LWI to handle SIGKILL, that means any thread (not only the LFSCK engine, but also other RPC service threads, ptlrpcd threads, and so on) could be killed by a user via "kill -9 $PID". That is not what we want, especially since someone may do it by mistake.

If we want the SIGKILL to be handled only by the LFSCK engine, then we need some mechanism for the ptlrpc layer to distinguish the LFSCK engine from other threads. But within the current server-side API and stack framework, that is difficult to do without some very ugly hack.

Comment by Andreas Dilger [ 05/Aug/15 ]

Nasf, I don't think we can require users to do things "in the right order" for them to work (i.e. to deactivate the OST/MDT manually before running "lctl lfsck_stop") if the OST is down. It definitely seems preferable to allow lfsck_stop to work properly regardless of the connection state.

Would it be possible to allow the threads to be woken up by SIGINT but have them return -EINTR or -EAGAIN to the callers, letting the callers decide whether to retry in that case? I agree it isn't good to actually kill the ptlrpc threads. Maybe ptlrpc_set_wait() could be interruptible and cause ptlrpcd to abort those RPCs? It seems that something like this is already close to possible.
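
For context, the manual workaround mentioned above would look roughly like this on the MDS (the device name here is illustrative; deactivating the MDT-side import via lctl is the standard mechanism):

  # find the MDT-side device for the dead OST
  lctl dl | grep OST0000

  # deactivate it so the MDT stops trying to reconnect
  lctl --device play01-OST0000-osc-MDT0000 deactivate

  # lfsck_stop should then no longer block on that OST
  lctl lfsck_stop -M play01-MDT0000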

Comment by nasf (Inactive) [ 05/Aug/15 ]

It is NOT important whether the OST/MDT is deactivated manually before or after "lctl lfsck_stop", so this is not a matter of "the right order". Since the ptlrpcd thread can handle the deactivate event, is it still necessary to introduce new SIGINT handlers?

Comment by Gerrit Updater [ 03/Nov/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17032
Subject: LU-6684 lfsck: stop lfsck even if some servers offline
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 51f3f69fb300c5f65cbed46a99ec8307cdc9a4f4

Comment by Ashish Purkar (Inactive) [ 30/Nov/15 ]

Andreas and Fan,
What will happen in the dry-run mode of OI scrub if MDS recovery happens, or if an MDT/OST goes down and is reconnecting? Attaching log file 15.lctl.tgz for reference.

  • Here the MDS goes into recovery while the OI scrub operation is underway.
  • The lfsck ns assistant stage2 is restarted and the post operation is done.
  • The test expects the dry-run to complete in 6 seconds, but due to the failover and the MDS undergoing recovery, it takes longer (> 6 sec).
Comment by nasf (Inactive) [ 09/Dec/15 ]

There are several cases:

1) The LFSCK/OI scrub is running on the MDS that is to be remounted.

1.1) If the MDT is unmounted while the LFSCK/OI scrub is running in the background, the LFSCK/OI scrub status will be marked as paused. When the MDT is remounted and recovery has completed, the paused LFSCK/OI scrub will resume from the latest checkpoint, and its status will be restored to what it was before being paused.

1.2) If the MDT crashed while the LFSCK/OI scrub was running in the background, there is no time for the LFSCK/OI scrub to change its status. When the MDT is remounted, its status will be marked as crashed, and after recovery has completed, the crashed LFSCK/OI scrub will resume from the latest checkpoint, with its status restored to what it was before the crash.

2) Assume the LFSCK/OI scrub is running on one MDT_a, and another related server MDT_b/OST_c is to be remounted.

2.1) If the LFSCK on MDT_a needs to talk for verification with the MDT_b/OST_c that has been unmounted or has crashed, then the LFSCK on MDT_a will get a connection failure and will know that one of the peer servers has left the LFSCK. The LFSCK on MDT_a will then go ahead and verify the remaining part of the system; it will neither wait forever nor fail out, unless you specified "-e abort". So the LFSCK on MDT_a can finish eventually, and its status will be 'partial' if no other failure happened.

2.2) If we want to stop the LFSCK on MDT_a, then MDT_a needs to notify the related peer MDT_b/OST_c to stop its LFSCK as well. If it finds that the peer server MDT_b/OST_c is already offline, the LFSCK on MDT_a will go ahead and handle the stop process itself.

In this ticket, we hit trouble in case 2.2): because the LFSCK did not detect that OST_c was offline, lfsck_stop was blocked by the reconnection to OST_c.
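
For reference, the resulting LFSCK state can be checked on the MDS through the lfsck proc interface, along these lines (the fsname is illustrative; mdd.*.lfsck_layout is the standard status parameter):

  # on MDT_a, inspect the layout LFSCK state
  lctl get_param -n mdd.play01-MDT0000.lfsck_layout
  # the output contains a "status:" line, e.g. completed, partial,
  # paused or crashed, matching the cases described above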

Comment by Gerrit Updater [ 14/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17032/
Subject: LU-6684 lfsck: stop lfsck even if some servers offline
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: afcf3026c6ad203b9882eaeac76326357f26fe71

Comment by nasf (Inactive) [ 14/Jan/16 ]

The patch has been landed to master.

Comment by Jian Yu [ 15/Jan/16 ]

sanity-lfsck test 32 still hung on master branch:

stop LFSCK
CMD: onyx-57vm7 /usr/sbin/lctl lfsck_stop -M lustre-MDT0000

https://testing.hpdd.intel.com/test_sets/e45d9b64-bbac-11e5-acbb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/34b63ba8-bb61-11e5-acbb-5254006e85c2

Comment by Andreas Dilger [ 18/Jan/16 ]

And I verified that these two failures are on commits that include the fix that was recently landed here.

Comment by James Nunez (Inactive) [ 19/Jan/16 ]

More failures on master and all have the previous patch landed for this ticket:
2016-01-15 15:29:21 - https://testing.hpdd.intel.com/test_sets/48126330-bbce-11e5-8506-5254006e85c2
2016-01-15 20:20:20 - https://testing.hpdd.intel.com/test_sets/7ec04c5e-bbfa-11e5-acbb-5254006e85c2
2016-01-16 00:40:11 - https://testing.hpdd.intel.com/test_sets/4988556c-bc05-11e5-8f65-5254006e85c2
2016-01-18 22:08:02 - https://testing.hpdd.intel.com/test_sets/3a54dfd8-be63-11e5-92e8-5254006e85c2
2016-01-18 22:59:29 - https://testing.hpdd.intel.com/test_sets/642d055a-be69-11e5-92e8-5254006e85c2
2016-01-18 23:21:01 - https://testing.hpdd.intel.com/test_sets/c75e157e-be6e-11e5-b113-5254006e85c2
2016-01-19 07:37:19 - https://testing.hpdd.intel.com/test_sets/325db7ae-beb4-11e5-8c8a-5254006e85c2
2016-01-19 12:10:06 - https://testing.hpdd.intel.com/test_sets/144d9d36-bed9-11e5-ad7e-5254006e85c2
2016-01-19 22:11:45 - https://testing.hpdd.intel.com/test_sets/a2f0fede-bf2e-11e5-a659-5254006e85c2
2016-01-19 22:26:33 - https://testing.hpdd.intel.com/test_sets/dc0ed974-bf2f-11e5-8f04-5254006e85c2
2016-01-19 23:59:17 - https://testing.hpdd.intel.com/test_sets/01d6b960-bf3f-11e5-8f04-5254006e85c2
2016-01-21 11:12:25 - https://testing.hpdd.intel.com/test_sets/cd343b46-c061-11e5-a8e5-5254006e85c2
2016-01-21 13:03:12 - https://testing.hpdd.intel.com/test_sets/c0b04f0e-c070-11e5-956d-5254006e85c2
2016-01-21 14:41:59 - https://testing.hpdd.intel.com/test_sets/e4cffce0-c07f-11e5-a8e5-5254006e85c2
2016-01-21 21:40:44 - https://testing.hpdd.intel.com/test_sets/85d45ece-c0bc-11e5-9620-5254006e85c2
2016-01-22 03:45:40 - https://testing.hpdd.intel.com/test_sets/abf22bb0-c0ec-11e5-8d88-5254006e85c2

Comment by Gerrit Updater [ 20/Jan/16 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/18059
Subject: Revert "LU-6684 lfsck: stop lfsck even if some servers offline"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2505fd07b29ebfddcd29f16954908f6fe4670276

Comment by Gerrit Updater [ 21/Jan/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18082
Subject: LU-6684 lfsck: set the lfsck notify as interruptable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 68c078328be253735658fcf43fa98afff936ec6c

Comment by Bob Glossman (Inactive) [ 22/Jan/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/85d45ece-c0bc-11e5-9620-5254006e85c2

Comment by James A Simmons [ 27/Jan/16 ]

This is also delaying the landing of several patches.

Comment by Bob Glossman (Inactive) [ 28/Jan/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/150c07e2-c575-11e5-825e-5254006e85c2

Comment by Jian Yu [ 31/Jan/16 ]

This is blocking patch review testing on master branch:
https://testing.hpdd.intel.com/test_sets/a29caebe-c709-11e5-9b6d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/fbfee2be-c70f-11e5-a037-5254006e85c2

Comment by Gerrit Updater [ 02/Feb/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18082/
Subject: LU-6684 lfsck: set the lfsck notify as interruptable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 069a9cf551c2e985ea254a1c570b22ed1d72d914

Comment by nasf (Inactive) [ 02/Feb/16 ]

The patch has been landed to master.

Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ]

Another instance found for tag 2.7.66 for Full - EL6.7 Server/EL6.7 Client
On master, build# 3314
https://testing.hpdd.intel.com/test_sets/35490a0c-ca6e-11e5-9215-5254006e85c2
Date : 02/02/2016 Time: 9:20 am MST

Comment by nasf (Inactive) [ 04/Feb/16 ]

Another instance found for tag 2.7.66 for Full - EL6.7 Server/EL6.7 Client
On master, build# 3314
https://testing.hpdd.intel.com/test_sets/35490a0c-ca6e-11e5-9215-5254006e85c2
Date : 02/02/2016 Time: 9:20 am MST

The patch 18082 has been landed just after new tag 2.7.66, please test the latest master.
