[LU-10419] LFSCK fails to start, hangs systems. Created: 20/Dec/17 Updated: 01/Aug/18 Resolved: 14/Jun/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak performance cluster - Lustre version=2.10.2_4_gb151f34 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We do OSS failover, then trigger LFSCK with:
lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A
The lfsck_start command hangs, LFSCK is not started, the clients wedge in state 'comp', and the entire system wedges. I have dumped Lustre logs from all MDS nodes (attached). I have crash-dumped all the MDT nodes, and the dumps are available on Spirit. lfsck_layout is unkillable. |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 20/Dec/17 ] |
|
Assigning this to Fan Yong so that it is in his queue when he returns from vacation. |
| Comment by Gerrit Updater [ 08/Jan/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30768 |
| Comment by Cliff White (Inactive) [ 10/Jan/18 ] |
|
We are hitting this issue on b2_10.3 RC1 - need the patch ported over there |
| Comment by Gerrit Updater [ 11/Jan/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30831 |
| Comment by Cliff White (Inactive) [ 17/Jan/18 ] |
|
The patch for master is way out of date, uses an old kernel, and contains old bugs ( |
| Comment by nasf (Inactive) [ 18/Jan/18 ] |
|
https://review.whamcloud.com/#/c/30768/ set 2 is against the latest master, here is the Jenkins build: |
| Comment by Gerrit Updater [ 25/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30768/ |
| Comment by Peter Jones [ 25/Jan/18 ] |
|
Landed for 2.11 |
| Comment by Cliff White (Inactive) [ 23/Feb/18 ] |
|
Seeing this again on a DNE-enabled system. version=2.10.57_58_gf24340c |
| Comment by nasf (Inactive) [ 24/Feb/18 ] |
|
Where can I get related logs? Thanks! |
| Comment by Cliff White (Inactive) [ 27/Feb/18 ] |
|
Logs are on spirit /scratch/logs/syslogs and /scratch/logs/console. The crash dumps are in /scratch/dumps on spirit. |
| Comment by Gerrit Updater [ 01/Mar/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31475 |
| Comment by Cliff White (Inactive) [ 06/Mar/18 ] |
|
With the current patch, lfsck does not stop. We are currently also seeing mount timeouts. I have crash-dumped soak-8 while lfsck was hanging; logs are available on spirit. |
| Comment by Gerrit Updater [ 08/Mar/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31475/ |
| Comment by Peter Jones [ 08/Mar/18 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 09/Mar/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/31600 |
| Comment by Gerrit Updater [ 09/Mar/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31600/ |
| Comment by nasf (Inactive) [ 13/Mar/18 ] |
The LFSCK master engine was blocked when sending an OUT_ATTR_GET RPC to MDT2, which may be offline or in recovery. We expect lfsck_stop() to wake up the blocked LFSCK engines and make them exit, but we only signal (SIGINT) the LFSCK assistant engines and forget to do that for the LFSCK master engine.
So the trouble is not related to the patch https://review.whamcloud.com/31475/. I will make another patch to notify the master engine on lfsck_stop(). |
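The failure mode described in the comment above can be illustrated with a small, self-contained model. This is only a sketch and not Lustre code: the `Engine` class, `stop_all()`, `run_scenario()`, and the engine names are hypothetical, and a `threading.Event` stands in for an engine blocked on a reply from a remote target that never answers. It shows that a stop routine which notifies only the assistant engines leaves the master engine blocked, while notifying the master as well lets every engine exit.

```python
import threading


class Engine:
    """Hypothetical stand-in for an engine thread blocked on a remote reply."""

    def __init__(self, name):
        self.name = name
        self.wakeup = threading.Event()   # set when a reply arrives or on stop
        self.exited = threading.Event()   # set once the engine thread returns
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Block until woken. In this sketch the remote reply never arrives,
        # so only a stop notification can release the engine.
        self.wakeup.wait()
        self.exited.set()

    def start(self):
        self.thread.start()

    def signal_stop(self):
        self.wakeup.set()


def stop_all(master, assistants, notify_master):
    """Notify the assistants, and the master only if notify_master is True."""
    for engine in assistants:
        engine.signal_stop()
    if notify_master:
        master.signal_stop()


def run_scenario(notify_master, timeout=0.5):
    """Return {engine name: exited within timeout?} after a stop request."""
    master = Engine("master")
    assistants = [Engine("layout"), Engine("namespace")]  # illustrative names
    engines = [master] + assistants
    for engine in engines:
        engine.start()
    stop_all(master, assistants, notify_master)
    return {engine.name: engine.exited.wait(timeout) for engine in engines}


if __name__ == "__main__":
    print("assistants only:", run_scenario(notify_master=False))
    print("master included:", run_scenario(notify_master=True))
```

In the first scenario the master entry stays `False` because nothing ever wakes it, which mirrors the reported hang; in the second, all engines exit.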
| Comment by Gerrit Updater [ 13/Mar/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31627 |
| Comment by Gerrit Updater [ 14/Jun/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31627/ |
| Comment by Gerrit Updater [ 01/Aug/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/30831/ |