Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3

    Description

      As mentioned in LU-6683, I ran into a situation where lctl lfsck_stop just hangs indefinitely.

      I have managed to reproduce this twice:

      Start LFSCK (using lctl lfsck_start -M play01-MDT0000 -t layout); this crashes the OSS servers. Reboot the servers and restart the OSTs. Attempting to stop the LFSCK in this state just hangs: I waited more than an hour and it was still hanging. Unmounting the MDT in this situation also appears to hang (after 30 minutes I power cycled the MDS).
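
      For reference, the failing sequence reduces to this sketch (the target name play01-MDT0000 is from this report; the OSS reboot and OST restart steps depend on the local setup):

        # Start a layout LFSCK on the MDT; in this report this crashed the OSS servers.
        lctl lfsck_start -M play01-MDT0000 -t layout

        # After rebooting the OSSes and remounting the OSTs, try to stop the scan.
        # This is the command that hangs indefinitely.
        lctl lfsck_stop -M play01-MDT0000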

      Attachments

        1. 15.lctl.tgz
          631 kB
        2. lustre.dmesg.bz2
          37 kB
        3. lustre.log.bz2
          1.38 MB

          Activity

            [LU-6684] lctl lfsck_stop hangs

            simmonsja James A Simmons added a comment - This is also delaying the landing of several patches.
            bogl Bob Glossman (Inactive) added a comment - another on master: https://testing.hpdd.intel.com/test_sets/85d45ece-c0bc-11e5-9620-5254006e85c2

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18082
            Subject: LU-6684 lfsck: set the lfsck notify as interruptable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 68c078328be253735658fcf43fa98afff936ec6c

            gerrit Gerrit Updater added a comment -

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/18059
            Subject: Revert "LU-6684 lfsck: stop lfsck even if some servers offline"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2505fd07b29ebfddcd29f16954908f6fe4670276

            jamesanunez James Nunez (Inactive) added a comment - edited

            More failures on master, all of which include the patch previously landed for this ticket:
            2016-01-15 15:29:21 - https://testing.hpdd.intel.com/test_sets/48126330-bbce-11e5-8506-5254006e85c2
            2016-01-15 20:20:20 - https://testing.hpdd.intel.com/test_sets/7ec04c5e-bbfa-11e5-acbb-5254006e85c2
            2016-01-16 00:40:11 - https://testing.hpdd.intel.com/test_sets/4988556c-bc05-11e5-8f65-5254006e85c2
            2016-01-18 22:08:02 - https://testing.hpdd.intel.com/test_sets/3a54dfd8-be63-11e5-92e8-5254006e85c2
            2016-01-18 22:59:29 - https://testing.hpdd.intel.com/test_sets/642d055a-be69-11e5-92e8-5254006e85c2
            2016-01-18 23:21:01 - https://testing.hpdd.intel.com/test_sets/c75e157e-be6e-11e5-b113-5254006e85c2
            2016-01-19 07:37:19 - https://testing.hpdd.intel.com/test_sets/325db7ae-beb4-11e5-8c8a-5254006e85c2
            2016-01-19 12:10:06 - https://testing.hpdd.intel.com/test_sets/144d9d36-bed9-11e5-ad7e-5254006e85c2
            2016-01-19 22:11:45 - https://testing.hpdd.intel.com/test_sets/a2f0fede-bf2e-11e5-a659-5254006e85c2
            2016-01-19 22:26:33 - https://testing.hpdd.intel.com/test_sets/dc0ed974-bf2f-11e5-8f04-5254006e85c2
            2016-01-19 23:59:17 - https://testing.hpdd.intel.com/test_sets/01d6b960-bf3f-11e5-8f04-5254006e85c2
            2016-01-21 11:12:25 - https://testing.hpdd.intel.com/test_sets/cd343b46-c061-11e5-a8e5-5254006e85c2
            2016-01-21 13:03:12 - https://testing.hpdd.intel.com/test_sets/c0b04f0e-c070-11e5-956d-5254006e85c2
            2016-01-21 14:41:59 - https://testing.hpdd.intel.com/test_sets/e4cffce0-c07f-11e5-a8e5-5254006e85c2
            2016-01-21 21:40:44 - https://testing.hpdd.intel.com/test_sets/85d45ece-c0bc-11e5-9620-5254006e85c2
            2016-01-22 03:45:40 - https://testing.hpdd.intel.com/test_sets/abf22bb0-c0ec-11e5-8d88-5254006e85c2

            adilger Andreas Dilger added a comment - And I verified that these two failures are on commits that include the fix that was recently landed here.
            yujian Jian Yu added a comment -

            sanity-lfsck test 32 still hung on master branch:

            stop LFSCK
            CMD: onyx-57vm7 /usr/sbin/lctl lfsck_stop -M lustre-MDT0000
            

            https://testing.hpdd.intel.com/test_sets/e45d9b64-bbac-11e5-acbb-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/34b63ba8-bb61-11e5-acbb-5254006e85c2

            yong.fan nasf (Inactive) added a comment - The patch has been landed to master.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17032/
            Subject: LU-6684 lfsck: stop lfsck even if some servers offline
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: afcf3026c6ad203b9882eaeac76326357f26fe71

            yong.fan nasf (Inactive) added a comment -

            There are several cases:

            1) The LFSCK/OI scrub is running on the MDS that is to be remounted.

            1.1) If the MDT is umounted while the LFSCK/OI scrub is running in the background, the LFSCK/OI scrub status will be marked as paused. When the MDT is mounted again, after recovery completes, the paused LFSCK/OI scrub will be resumed from the latest checkpoint and its status will be restored to what it was before the pause.

            1.2) If the MDT crashes while the LFSCK/OI scrub is running in the background, there is no time for the LFSCK/OI scrub to change its status. When the MDT is mounted again, its status will be marked as crashed, and after recovery completes, the crashed LFSCK/OI scrub will be resumed from the latest checkpoint and its status will be restored to what it was before the crash.

            2) Assume the LFSCK/OI scrub is running on one MDT (MDT_a) and another related server (MDT_b/OST_c) is to be remounted.

            2.1) If the LFSCK on MDT_a needs to talk with the umounted/crashed MDT_b/OST_c for verification, the LFSCK on MDT_a will get a connection failure, from which it knows that the peer server has left the LFSCK. The LFSCK on MDT_a will then go ahead and verify the part of the system it can reach; it neither waits forever nor fails out, unless "-e abort" was specified. So the LFSCK on MDT_a can finish eventually, and its status will be 'partial' if no other failure happened.

            2.2) If we want to stop the LFSCK on MDT_a, then MDT_a needs to notify the related peer MDT_b/OST_c to stop its LFSCK as well. If it finds that the peer server MDT_b/OST_c is already offline, the LFSCK on MDT_a will go ahead and handle the stop process itself.

            In this ticket we hit trouble in case 2.2): because the LFSCK did not detect that OST_c was offline, lfsck_stop was blocked by the reconnection attempts to OST_c (the LFSCK state involved can be inspected as sketched below).
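
            (A minimal sketch of how to inspect that state from userspace, assuming a layout LFSCK on a target named lustre-MDT0000:)

                # Dump the layout LFSCK state on the MDT (status, latest checkpoint, ...).
                lctl get_param -n mdd.lustre-MDT0000.lfsck_layout

                # Per case 2.1) above, a run that skipped an offline OST should end with:
                #   status: partial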

            maximus Ashish Purkar (Inactive) added a comment -

            Andreas and Fan,
            What happens in dry-run mode of the OI scrub if MDS recovery happens, or if an MDT/OST goes down and is reconnecting? Attaching log file 15.lctl.tgz for reference.

            • Here the MDS is going into recovery while the OI scrub operation is underway.
            • The lfsck ns assistant stage 2 is restarted and the post operation is done.
            • The test expects the dry run to complete in 6 seconds, but due to the failover and the MDS undergoing recovery, it takes longer (> 6 sec); see the sketch below.
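
            (A sketch of the dry-run invocation in question, assuming stock lctl options and a target named lustre-MDT0000:)

                # Start a namespace LFSCK in dry-run mode: scan and report, but repair nothing.
                lctl lfsck_start -M lustre-MDT0000 -t namespace --dryrun on

                # Poll its progress; with the MDS still in recovery this can take well
                # over the 6 seconds the test allows.
                lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace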

            People

              Assignee:
              yong.fan nasf (Inactive)
              Reporter:
              ferner Frederik Ferner (Inactive)
              Votes:
              0
              Watchers:
              13

              Dates

                Created:
                Updated:
                Resolved: