[LU-6684] lctl lfsck_stop hangs Created: 03/Jun/15 Updated: 06/Dec/17 Resolved: 02/Feb/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frederik Ferner (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
As mentioned in the linked ticket, I have managed to reproduce this twice: starting lfsck (using lctl lfsck_start -M play01-MDT0000 -t layout) crashes the OSS servers; after rebooting the servers and restarting the OSTs, attempting to stop the lfsck in this state just hangs. I waited more than an hour and it was still hanging. Unmounting the MDT in this situation also appears to hang (after 30 minutes I power cycled the MDS). |
| Comments |
| Comment by Peter Jones [ 03/Jun/15 ] |
|
Fan Yong, could you please advise on this one? Thanks, Peter |
| Comment by nasf (Inactive) [ 04/Jun/15 ] |
|
Hi Frederik, if you can reproduce the issue, then please do the following on the MDT: 1) echo -1 > /proc/sys/lnet/debug. Then please attach the lustre.log and lustre.dmesg. Thanks! |
| Comment by Frederik Ferner (Inactive) [ 04/Jun/15 ] |
|
I have reproduced it; files are attached. (Note this was before applying the patch from the related ticket.) |
| Comment by nasf (Inactive) [ 06/Jun/15 ] |
|
According to the log, lfsck_stop was waiting for the layout LFSCK thread to exit, but that thread was sending an RPC to the OST. At that time the connection between the MDT and the OST was broken; the MDT was trying to reconnect, but the reconnect RPC expired and the reconnect was retried again and again... |
| Comment by Andreas Dilger [ 07/Jun/15 ] |
|
I think that stopping the MDT or OST in such a case is too much. Is the RPC stuck at the ptlrpc layer? Is it the RPC sent by lfsck_stop itself to the OSS (to stop the layout lfsck) that is stuck, or is lfsck_stop stuck waiting for something else? Is this RPC sent by ptlrpcd, or could ctrl-C interrupt the wait as for a normal user process? Having a stack trace would be useful. Fan Yong, can you please create a sanity-lfsck test case for this and then collect a stack trace so it is clearer what is stuck and where. |
| Comment by nasf (Inactive) [ 07/Jun/15 ] |
|
The stack trace is clear, as follows:

lfsck_layout    S 0000000000000003     0  3643      2 0x00000000
 ffff880158e75a40 0000000000000046 0000000000000000 0000000000000000
 ffff8802a40b0ef0 ffff8802a40b0ec0 00020c3c3569f55d ffff8802a40b0ef0
 ffff880158e75a10 000000012256a121 ffff88012c1b05f8 ffff880158e75fd8
Call Trace:
 [<ffffffff8152b102>] schedule_timeout+0x192/0x2e0
 [<ffffffff810874f0>] ? process_timeout+0x0/0x10
 [<ffffffffa09bd0e2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
 [<ffffffffa09b29c0>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa09c7dc6>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
 [<ffffffffa09bd741>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa0a356d1>] out_remote_sync+0x111/0x200 [ptlrpc]
 [<ffffffffa144ca92>] osp_attr_get+0x352/0x600 [osp]
 [<ffffffffa1219e50>] lfsck_layout_assistant_handler_p1+0x530/0x19f0 [lfsck]
 [<ffffffffa066b1c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa11e06e6>] lfsck_assistant_engine+0x496/0x1de0 [lfsck]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa11e0250>] ? lfsck_assistant_engine+0x0/0x1de0 [lfsck]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

The thread was blocked inside the ptlrpc layer, so the LFSCK control flags could not wake it up. At that time the target OSS was down, so the RPC (for attr_get, not for lfsck_stop) expired, which triggered the reconnection, and the RPC would be re-sent once the connection recovered. Because LFSCK shares the same RPC-handling logic with cross-MDT operations (from the OSP => ptlrpc view they are indistinguishable), we cannot simply mark the RPC as no_resend. |
| Comment by Andreas Dilger [ 07/Jun/15 ] |
|
It would be possible for "lctl lfsck_stop" to send SIGINT or SIGKILL to the lfsck thread to interrupt it, if it has the right LWI handler in out_remote_sync(). |
| Comment by nasf (Inactive) [ 08/Jun/15 ] |
|
In theory we can do that, but the LWI is declared inside the ptlrpc layer. If we make any thread waiting on that LWI handle SIGKILL, then any thread (not only the LFSCK engine, but also other RPC service threads, ptlrpcd threads, and so on) could be killed by a user via "kill -9 $PID". That is not what we want, especially since someone may do it by mistake. If we want SIGKILL to be handled only by the LFSCK engine, then we need some mechanism for the ptlrpc layer to distinguish the LFSCK engine from other threads. But within the current server-side API and stack framework, that is difficult to do without a very ugly hack. |
| Comment by Andreas Dilger [ 05/Aug/15 ] |
|
Nasf, I don't think we can require users to do things "in the right order" for them to work (i.e. to deactivate the OST/MDT manually before running "lctl lfsck_stop") if the OST is down. It definitely seems preferable to allow lfsck_stop to work properly regardless of the connection state. Would it be possible to allow the threads to be woken up by SIGINT but have them return -EINTR or -EAGAIN to the callers, which then decide whether to retry? I agree it isn't good to actually kill the ptlrpc threads. Maybe ptlrpc_set_wait() could be made interruptible and cause ptlrpcd to abort those RPCs? It seems that something like this is already close to possible. |
| Comment by nasf (Inactive) [ 05/Aug/15 ] |
|
It is NOT important whether the OST/MDT is deactivated manually before or after "lctl lfsck_stop", so there is no "right order" involved. Since the ptlrpcd thread can handle the deactivate event, is it still necessary to introduce new SIGINT handlers? |
| Comment by Gerrit Updater [ 03/Nov/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17032 |
| Comment by Ashish Purkar (Inactive) [ 30/Nov/15 ] |
|
Andreas and Fan,
|
| Comment by nasf (Inactive) [ 09/Dec/15 ] |
|
There are several cases:
1) The LFSCK/OI scrub is running on the MDT that is to be remounted.
1.1) If the MDT is umounted while the LFSCK/OI scrub is running in the background, the LFSCK/OI scrub status will be marked as paused. When the MDT is remounted and recovery is done, the paused LFSCK/OI scrub will be resumed from the latest checkpoint, and its status will be restored to what it was before the pause.
1.2) If the MDT crashed while the LFSCK/OI scrub was running in the background, there was no time for the LFSCK/OI scrub to change its status. When the MDT is remounted, the status will be marked as crashed; after recovery is done, the crashed LFSCK/OI scrub will be resumed from the latest checkpoint, and its status will be restored to what it was before the crash.
2) Assume the LFSCK/OI scrub is running on one MDT_a, and another related server MDT_b/OST_c is to be remounted.
2.1) If the LFSCK on MDT_a needs to talk with MDT_b/OST_c for verification while that server is umounted/crashed, the LFSCK on MDT_a will get a connection failure, learn that this peer has left the LFSCK, and go ahead to verify the rest of the system; it neither waits forever nor fails out unless you specified "-e abort". So the LFSCK on MDT_a can finish eventually, and its status will be 'partial' if no other failure happened.
2.2) If we want to stop the LFSCK on MDT_a, then MDT_a needs to notify the related peer MDT_b/OST_c to stop its LFSCK as well. If it finds that the peer MDT_b/OST_c is already offline, the LFSCK on MDT_a should go ahead with the stop process. In this ticket we hit trouble in case 2.2): because the LFSCK did not detect that OST_c was offline, lfsck_stop was blocked by the reconnection to OST_c. |
| Comment by Gerrit Updater [ 14/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17032/ |
| Comment by nasf (Inactive) [ 14/Jan/16 ] |
|
The patch has been landed to master. |
| Comment by Jian Yu [ 15/Jan/16 ] |
|
sanity-lfsck test 32 still hung on master branch: stop LFSCK CMD: onyx-57vm7 /usr/sbin/lctl lfsck_stop -M lustre-MDT0000 https://testing.hpdd.intel.com/test_sets/e45d9b64-bbac-11e5-acbb-5254006e85c2 |
| Comment by Andreas Dilger [ 18/Jan/16 ] |
|
And I verified that these two failures are on commits that include the fix that was recently landed here. |
| Comment by James Nunez (Inactive) [ 19/Jan/16 ] |
|
More failures on master and all have the previous patch landed for this ticket: |
| Comment by Gerrit Updater [ 20/Jan/16 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/18059 |
| Comment by Gerrit Updater [ 21/Jan/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18082 |
| Comment by Bob Glossman (Inactive) [ 22/Jan/16 ] |
|
another on master: |
| Comment by James A Simmons [ 27/Jan/16 ] |
|
This is also delaying the landing of several patches. |
| Comment by Bob Glossman (Inactive) [ 28/Jan/16 ] |
|
another on master: |
| Comment by Jian Yu [ 31/Jan/16 ] |
|
This is blocking patch review testing on master branch: |
| Comment by Gerrit Updater [ 02/Feb/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18082/ |
| Comment by nasf (Inactive) [ 02/Feb/16 ] |
|
The patch has been landed to master. |
| Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ] |
|
Another instance found for tag 2.7.66 for Full - EL6.7 Server/EL6.7 Client |
| Comment by nasf (Inactive) [ 04/Feb/16 ] |
Patch 18082 landed just after the new tag 2.7.66; please test the latest master. |