[LU-11419] lfsck does not complete phase2 Created: 22/Sep/18 Updated: 07/Jan/19 Resolved: 29/Oct/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.4 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | SC Admin (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
x86_64, zfs, 3 MDTs, all on 1 MDS, 2.10.4 + many patches. |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi,

I presume this is related to

  lctl lfsck_start -M dagg-MDT0000 -t namespace -A -n
  lctl lfsck_start -M dagg-MDT0000 -t namespace -A

this is the summary of repairs, and mdt0 did not progress from here:

  [warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace | egrep 'status:|repaired|checked_' | grep -v ' 0$'
  status: scanning-phase2
  checked_phase1: 33226737
  checked_phase2: 10901477
  dangling_repaired: 28
  striped_shards_repaired: 102
  name_hash_repaired: 51

  status: completed
  checked_phase1: 32652269
  checked_phase2: 12379442
  dangling_repaired: 28
  striped_shards_repaired: 125

  status: completed
  checked_phase1: 32662678
  checked_phase2: 12378342
  unmatched_pairs_repaired: 1
  dangling_repaired: 11
  striped_shards_repaired: 96

lfsck_namespace was using 100% of a cpu but the checked_phase2 counter wasn't going up.

I did a sysrq 't' and 'w' before resetting the MDS and those start at hopefully that might help.

cheers, |
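The filtered output above can also be read programmatically. The following is a minimal Python sketch, not part of Lustre. It assumes that each target's block begins with a `status:` line and that blocks appear in MDT index order (both inferred from the quoted output only). `parse_lfsck_summary` and `unfinished` are hypothetical helper names, and the input should be the command output without the shell prompt line.

```python
# Output quoted in the description, one "key: value" pair per line.
SAMPLE = """\
status: scanning-phase2
checked_phase1: 33226737
checked_phase2: 10901477
dangling_repaired: 28
striped_shards_repaired: 102
name_hash_repaired: 51
status: completed
checked_phase1: 32652269
checked_phase2: 12379442
dangling_repaired: 28
striped_shards_repaired: 125
status: completed
checked_phase1: 32662678
checked_phase2: 12378342
unmatched_pairs_repaired: 1
dangling_repaired: 11
striped_shards_repaired: 96
"""


def parse_lfsck_summary(text):
    """Split filtered lfsck_namespace output into one dict per target.

    Assumption: every target's block starts with a 'status:' line.
    """
    targets = []
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "status":
            # A new status line opens the next target's block.
            targets.append({"status": value})
        elif targets:
            # Plain counters become ints; values such as "874 items/sec"
            # or "N/A" are kept as strings.
            targets[-1][key] = int(value) if value.isdigit() else value
    return targets


def unfinished(targets):
    """Return list positions of targets whose status is not 'completed'."""
    return [i for i, t in enumerate(targets) if t.get("status") != "completed"]


if __name__ == "__main__":
    print(unfinished(parse_lfsck_summary(SAMPLE)))  # → [0]
```

On the quoted output this singles out the first block, which matches the report that only the first MDT stayed in scanning-phase2.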
| Comments |
| Comment by Peter Jones [ 23/Sep/18 ] |
|
Lai, could you please assist here? Thanks, Peter |
| Comment by Lai Siyao [ 25/Sep/18 ] |
|
This looks to be the same as |
| Comment by SC Admin (Inactive) [ 26/Sep/18 ] |
|
Hi Lai,

thanks. applied https://review.whamcloud.com/#/c/33078/ from

I left it for about 10 extra hours after phase2 counters on mdt0 stopped incrementing, but nothing changed.

as per previous report of this in this is as far as it got ->

  [warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace | egrep 'status:|repaired|checked_|speed' | grep -v ' 0$'
  status: scanning-phase2
  checked_phase1: 33091005
  checked_phase2: 10550536
  dangling_repaired: 31
  striped_shards_repaired: 28
  name_hash_repaired: 18
  average_speed_phase1: 874 items/sec
  average_speed_phase2: 807 objs/sec
  average_speed_total: 857 items/sec
  real_time_speed_phase1: N/A
  real_time_speed_phase2: 7 objs/sec

  status: completed
  checked_phase1: 32500602
  checked_phase2: 12505620
  dangling_repaired: 29
  striped_shards_repaired: 28
  name_hash_repaired: 56
  average_speed_phase1: 890 items/sec
  average_speed_phase2: 1923 objs/sec
  average_speed_total: 1046 items/sec
  real_time_speed_phase1: N/A
  real_time_speed_phase2: N/A

  status: completed
  checked_phase1: 32512235
  checked_phase2: 12504486
  linkea_repaired: 1
  dangling_repaired: 14
  striped_shards_repaired: 28
  average_speed_phase1: 896 items/sec
  average_speed_phase2: 1923 objs/sec
  average_speed_total: 1052 items/sec
  real_time_speed_phase1: N/A
  real_time_speed_phase2: N/A

I'll attach syslog for today. it includes a couple of 'echo t > /proc/sysrq-trigger' in case that helps you work out where lfsck namespace is stuck.

cheers, |
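To put a number on how long the checked_phase2 counter has been flat, periodic readings of it can be compared. The following is a minimal Python sketch, not a Lustre tool. `stalled_for` is a hypothetical helper, and the timestamps and the first reading in the example are illustrative; only the value 10550536 comes from the output above.

```python
def stalled_for(samples):
    """Return how many seconds the latest counter value has been unchanged.

    samples: list of (unix_time, checked_phase2) tuples, oldest first.
    """
    if not samples:
        return 0
    last_time, last_count = samples[-1]
    since = last_time
    # Walk backwards until the counter differs from its latest value.
    for when, count in reversed(samples):
        if count != last_count:
            break
        since = when
    return last_time - since


if __name__ == "__main__":
    # Illustrative readings: the counter reaches 10550536 and then stays
    # there for ten hours (36000 seconds).
    readings = [(0, 10500000), (3600, 10550536), (39600, 10550536)]
    print(stalled_for(readings))  # → 36000
```

A long flat interval on a target that still reports scanning-phase2, together with a busy lfsck_namespace thread, is the symptom described in this ticket.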
| Comment by Lai Siyao [ 27/Sep/18 ] |
|
Can you enable more debugging with 'lctl set_param debug="+trace lfsck"' on dagg-MDT0000 while lfsck is running, and then collect the debug logs? This should help locate the dead loop it may be falling into. |
| Comment by Gerrit Updater [ 29/Sep/18 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33252 |
| Comment by Lai Siyao [ 29/Sep/18 ] |
|
Hi Robin, I just uploaded a patch. Once it passes autotest, you can apply it on your system and test again. |
| Comment by Gerrit Updater [ 29/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33252/ |
| Comment by Peter Jones [ 29/Oct/18 ] |
|
Landed for 2.12 |