[LU-10419] LFSCK fails to start, hangs systems. Created: 20/Dec/17  Updated: 01/Aug/18  Resolved: 14/Jun/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3
Fix Version/s: Lustre 2.12.0, Lustre 2.10.5

Type: Bug
Priority: Critical
Reporter: Cliff White (Inactive)
Assignee: nasf (Inactive)
Resolution: Fixed
Votes: 0
Labels: soak
Environment:

Soak performance cluster - Lustre version=2.10.2_4_gb151f34


Attachments: soak-10.lustre.log.gz, soak-11.lustre.log.gz, soak-8.lustre.log.gz, soak-9.lustre.log.gz
Issue Links:
Duplicate
is duplicated by LU-10036: lfsck stuck in init on multiple MDS (Resolved)
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We do OSS failover, trigger LFSCK:


lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A

The lfsck start hangs and lfsck is never started; the clients wedge in state 'comp' and the entire system wedges. I have dumped Lustre logs from all MDS nodes and attached them. I have crash-dumped all the MDT nodes, and the dumps are available on Spirit. lfsck_layout is unkillable.



 Comments   
Comment by Joseph Gmitter (Inactive) [ 20/Dec/17 ]

Assigning this to Fan Yong so that it is in his queue when he returns from vacation.

Comment by Gerrit Updater [ 08/Jan/18 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30768
Subject: LU-10419 lfsck: no delay for notify RPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d8827a8ce44db121f80223dc7189e32f5bf3fd45

Comment by Cliff White (Inactive) [ 10/Jan/18 ]

We are hitting this issue on b2_10.3 RC1; we need the patch ported over there.

Comment by Gerrit Updater [ 11/Jan/18 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30831
Subject: LU-10419 lfsck: no delay for notify RPC
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: e20f5d402a8a6547544b33d4890b6910e7cf9f95

Comment by Cliff White (Inactive) [ 17/Jan/18 ]

The patch for master is way out of date: it uses an old kernel and contains old bugs (LU-10459). Can you move the patch to the tip of current master so that it is testable?

Comment by nasf (Inactive) [ 18/Jan/18 ]

https://review.whamcloud.com/#/c/30768/ set 2 is against the latest master, here is the Jenkins build:
https://build.hpdd.intel.com/job/lustre-reviews/53764/

Comment by Gerrit Updater [ 25/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30768/
Subject: LU-10419 lfsck: no delay for notify RPC
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 39816213632cf9083530f1a8b644459d13e3c980

Comment by Peter Jones [ 25/Jan/18 ]

Landed for 2.11

Comment by Cliff White (Inactive) [ 23/Feb/18 ]

Seeing this again on a DNE-enabled system, version=2.10.57_58_gf24340c.
I can crash-dump the systems if desired.

Comment by nasf (Inactive) [ 24/Feb/18 ]

cliffw,

Where can I get related logs?

Thanks!

Comment by Cliff White (Inactive) [ 27/Feb/18 ]

Logs are on spirit /scratch/logs/syslogs and /scratch/logs/console. The crash dumps are in /scratch/dumps on spirit.

Comment by Gerrit Updater [ 01/Mar/18 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31475
Subject: LU-10419 lfsck: skip dead target
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: be9f2eedf5039fa6308460aca6a84daa6b8003b1

Comment by Cliff White (Inactive) [ 06/Mar/18 ]

With the current patch, lfsck does not stop. Currently also having mount timeouts. I have crash-dumped soak-8 while lfsck was hanging; logs are available on spirit.
/scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2018-03-06-19:16:47

Comment by Gerrit Updater [ 08/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31475/
Subject: LU-10419 lfsck: skip dead target
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 012834c5e7c7be50ff117cee4ac473d7fee4294d

Comment by Peter Jones [ 08/Mar/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 09/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/31600
Subject: Revert "LU-10419 lfsck: skip dead target"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1387fa1c012dfdf5eb4f90efeb06edd45788064f

Comment by Gerrit Updater [ 09/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31600/
Subject: Revert "LU-10419 lfsck: skip dead target"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9ba637b8949b1b8a5f2506e654a9b62d5c0cc245

Comment by nasf (Inactive) [ 13/Mar/18 ]

With the current patch, lfsck does not stop. Currently also having mount timeouts. I have crash-dumped soak-8 while lfsck was hanging; logs are available on spirit.
/scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2018-03-06-19:16:47

The LFSCK master engine was blocked while sending an OUT_ATTR_GET RPC to MDT2, which may have been offline or in recovery. We expect lfsck_stop() to wake up the blocked LFSCK engines and make them exit, but we only signal (SIGINT) the LFSCK assistant engines and forget to do the same for the LFSCK master engine.

 

So the trouble is not related to the patch https://review.whamcloud.com/31475/.

I will make another patch to notify the master engine on lfsck_stop().
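
For readers following the analysis in this comment, here is a toy model of the stop path. It is only a sketch in Python, not Lustre code: the Engine class, the run_scenario helper, the signal_master flag and the engine names are all hypothetical, and a threading.Event stands in for the signal that interrupts a blocked RPC wait. It assumes, as described above, that every engine is blocked on a target that never replies.

```python
import threading


class Engine:
    """Hypothetical stand-in for an LFSCK engine blocked on a remote RPC."""

    def __init__(self, name):
        self.name = name
        self._interrupt = threading.Event()  # models the signal sent on stop
        self.exited = threading.Event()
        # Daemon thread, so an engine that is never signalled cannot keep
        # the interpreter alive at exit.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Models an RPC wait on an offline target: it only returns
        # when the engine is interrupted.
        self._interrupt.wait()
        self.exited.set()

    def signal(self):
        self._interrupt.set()


def lfsck_stop(master, assistants, signal_master):
    """Toy stop path; signal_master=False models the behaviour reported here."""
    for engine in assistants:
        engine.signal()
    if signal_master:
        master.signal()


def run_scenario(signal_master, timeout=0.5):
    """Start all engines, stop them, and report which ones exited in time."""
    master = Engine("master")
    assistants = [Engine("layout_assistant"), Engine("namespace_assistant")]
    engines = [master, *assistants]
    for engine in engines:
        engine.start()
    lfsck_stop(master, assistants, signal_master)
    return {engine.name: engine.exited.wait(timeout) for engine in engines}


if __name__ == "__main__":
    print("stop without signalling master:", run_scenario(signal_master=False))
    print("stop with master signalled:   ", run_scenario(signal_master=True))
```

In this model the master engine stays blocked whenever the stop path skips it, which matches the unkillable LFSCK seen on soak-8; signalling it alongside the assistants lets every engine exit.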

Comment by Gerrit Updater [ 13/Mar/18 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31627
Subject: LU-10419 lfsck: single master engine when stop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3e7d1a41711cfb0a12b941a88bf8c0bf3b4cc89

Comment by Gerrit Updater [ 14/Jun/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31627/
Subject: LU-10419 lfsck: signal master engine when stop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1ece380412efd5dba2a8c345830f0456a4922301

Comment by Gerrit Updater [ 01/Aug/18 ]

John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/30831/
Subject: LU-10419 lfsck: no delay for notify RPC
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 9fef9ad10b26a4338c22105e66308ead5408173e

Generated at Sat Feb 10 02:34:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.