[LU-11419] lfsck does not complete phase2 Created: 22/Sep/18  Updated: 07/Jan/19  Resolved: 29/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: SC Admin (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

x86_64, zfs, 3 MDTs, all on 1 MDS, 2.10.4 + many patches.


Attachments: File messages-grep-vslurm.txt.gz     File messages-warble2.txt.gz    
Issue Links:
Related
is related to LU-11201 NMI watchdog: BUG: soft lockup in lfs... Resolved
is related to LU-11111 crash doing LFSCK: orph_index_insert(... Resolved
is related to LU-10888 'lctl abort_recovery' allow aborting ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

I presume this is related to LU-11111 and LU-10888.

lctl lfsck_start -M dagg-MDT0000 -t namespace -A -n
completed ok

lctl lfsck_start -M dagg-MDT0000 -t namespace -A
completed on mdt1 and mdt2 but stuck on mdt0.

this is the summary of repairs, and mdt0 did not progress from here:

[warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace | egrep 'status:|repaired|checked_'  | grep -v ' 0$'
status: scanning-phase2
checked_phase1: 33226737
checked_phase2: 10901477
dangling_repaired: 28
striped_shards_repaired: 102
name_hash_repaired: 51
status: completed
checked_phase1: 32652269
checked_phase2: 12379442
dangling_repaired: 28
striped_shards_repaired: 125
status: completed
checked_phase1: 32662678
checked_phase2: 12378342
unmatched_pairs_repaired: 1
dangling_repaired: 11
striped_shards_repaired: 96
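
Because `-n` drops the parameter names, the three blocks above are only distinguishable by position. A minimal sketch of a filter that tags each line with a running block index is shown here. The function name `label_lfsck_blocks` is hypothetical, and it assumes that every block in the filtered output begins with a `status:` line, as it does above. Whether index 0 corresponds to MDT0000 depends on the order in which lctl expands the wildcard, which is also an assumption.

```shell
# Hypothetical helper: prefix each line of filtered lfsck_namespace output
# with the index of the block it belongs to.
# Assumption: every block starts with a "status:" line.
label_lfsck_blocks() {
    awk '
        /^status:/ { n++ }                            # a status line opens a new block
        n > 0      { printf "[%d] %s\n", n - 1, $0 }  # tag the line with its block index
    '
}
```

It would be appended to the pipeline above, e.g. `... | grep -v ' 0$' | label_lfsck_blocks`.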

lfsck_namespace was using 100% of a cpu but the checked_phase2 counter wasn't going up.
kill -9 on lfsck_namespace didn't work
I didn't try lctl lfsck_stop this time.
mdt0 wouldn't umount. had to reset the MDS.
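
The stall described above (thread busy, `checked_phase2` not moving) can be checked mechanically by sampling the counter twice. This is only a sketch: the helper names `first_checked_phase2` and `phase2_progress` are hypothetical, and it assumes the counter appears as a `checked_phase2: <number>` line, as in the output above.

```shell
# Hypothetical helper: print the first checked_phase2 value found on stdin.
first_checked_phase2() {
    awk '$1 == "checked_phase2:" { print $2; exit }'
}

# Hypothetical helper: compare two samples of the counter.
# $1 = earlier sample, $2 = later sample
phase2_progress() {
    if [ "$2" -gt "$1" ]; then
        echo "advancing"
    else
        echo "stalled"
    fi
}
```

On the MDS, one sample could be taken with `lctl get_param -n mdd.dagg-MDT0000.lfsck_namespace | first_checked_phase2`, followed by a sleep and a second sample, with both values passed to `phase2_progress`.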

I did a sysrq 't' and 'w' before resetting the MDS, and those start at Sep 23 00:18:42 in the attached messages file.

hopefully that might help.
please let us know if there's something else we can help with.

cheers,
robin



 Comments   
Comment by Peter Jones [ 23/Sep/18 ]

Lai

Could you please assist here?

Thanks

Peter

Comment by Lai Siyao [ 25/Sep/18 ]

This looks to be the same as LU-11201, can you apply patch https://review.whamcloud.com/#/c/32958/ and try lfsck again?

Comment by SC Admin (Inactive) [ 26/Sep/18 ]

Hi Lai,

thanks. applied https://review.whamcloud.com/#/c/33078/ from LU-11201 but no change.

I left it for about 10 extra hours after phase2 counters on mdt0 stopped incrementing, but nothing changed. as per previous report of this in LU-11111 I couldn't stop the lfsck and had to reset the MDS.

this is as far as it got ->

[warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace | egrep 'status:|repaired|checked_|speed'  | grep -v ' 0$'
status: scanning-phase2
checked_phase1: 33091005
checked_phase2: 10550536
dangling_repaired: 31
striped_shards_repaired: 28
name_hash_repaired: 18
average_speed_phase1: 874 items/sec
average_speed_phase2: 807 objs/sec
average_speed_total: 857 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: 7 objs/sec
status: completed
checked_phase1: 32500602
checked_phase2: 12505620
dangling_repaired: 29
striped_shards_repaired: 28
name_hash_repaired: 56
average_speed_phase1: 890 items/sec
average_speed_phase2: 1923 objs/sec
average_speed_total: 1046 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
status: completed
checked_phase1: 32512235
checked_phase2: 12504486
linkea_repaired: 1
dangling_repaired: 14
striped_shards_repaired: 28
average_speed_phase1: 896 items/sec
average_speed_phase2: 1923 objs/sec
average_speed_total: 1052 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A

I'll attach syslog for today. it includes a couple of 'echo t > /proc/sysrq-trigger' in case that helps you work out where lfsck namespace is stuck.

cheers,
robin

Comment by Lai Siyao [ 27/Sep/18 ]

Can you enable more debug: 'lctl set_param debug="+trace lfsck"' on dagg-MDT0000 when running lfsck, and then collect debug logs? That should help locate the dead loop it may be falling into.
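
One possible way to do the collection, as a rough sketch rather than a prescribed procedure: enable the debug flags quoted above while lfsck is running, clear the kernel debug buffer, wait a while, then dump the buffer to a file with `lctl dk`. The function name, the `LCTL` override (useful for a dry run with `LCTL=echo`), the `LFSCK_DEBUG_WAIT` variable and the default 60 second wait are all assumptions for illustration.

```shell
# Allow a dry run by overriding the command, e.g. LCTL=echo (assumption).
LCTL=${LCTL:-lctl}

# Hypothetical helper: capture a window of debug log while lfsck is running.
# $1 = file to write the debug log to
collect_lfsck_debug() {
    "$LCTL" set_param debug="+trace lfsck"   # flags exactly as quoted in the request above
    "$LCTL" clear                            # drop older entries from the debug buffer
    sleep "${LFSCK_DEBUG_WAIT:-60}"          # let the looping thread fill the buffer
    "$LCTL" dk "$1"                          # dump the kernel debug buffer to the file
}
```

With trace enabled the buffer fills quickly, so a short wait taken once the phase2 counter has stopped moving is likely to be more useful than a long one.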

Comment by Gerrit Updater [ 29/Sep/18 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33252
Subject: LU-11419 lfsck: lfsck_namespace_shrink_linkea() dead loop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 22503a1db2c3b6a7f3a12829ee2484ea95a25913

Comment by Lai Siyao [ 29/Sep/18 ]

Hi Robin, I just uploaded a patch, you can wait for it to pass autotest, and then apply on your system and test again.

Comment by Gerrit Updater [ 29/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33252/
Subject: LU-11419 lfsck: lfsck_namespace_shrink_linkea() dead loop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 20a603c42ecc1a5c6f1b3d5a0e31b2b323777abb

Comment by Peter Jones [ 29/Oct/18 ]

Landed for 2.12
