Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0
    • Lustre 2.10.4
    • None
    • x86_64, zfs, 3 MDTs, all on 1 MDS, 2.10.4 + many patches.
    • 3
    • 9223372036854775807

    Description

      Hi,

      I presume this is related to LU-11111 and LU-10888.

      lctl lfsck_start -M dagg-MDT0000 -t namespace -A -n
      completed ok

      lctl lfsck_start -M dagg-MDT0000 -t namespace -A
      completed on mdt1 and mdt2 but stuck on mdt0.

      this is the summary of repairs, and mdt0 did not progress from here:

      [warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace | egrep 'status:|repaired|checked_'  | grep -v ' 0$'
      status: scanning-phase2
      checked_phase1: 33226737
      checked_phase2: 10901477
      dangling_repaired: 28
      striped_shards_repaired: 102
      name_hash_repaired: 51
      status: completed
      checked_phase1: 32652269
      checked_phase2: 12379442
      dangling_repaired: 28
      striped_shards_repaired: 125
      status: completed
      checked_phase1: 32662678
      checked_phase2: 12378342
      unmatched_pairs_repaired: 1
      dangling_repaired: 11
      striped_shards_repaired: 96
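
The concatenated output above is easier to read once it is split per target. The following is only a minimal sketch: `lfsck_summary` is a hypothetical helper name, and it assumes that each target's record contains a `status:` line before its `checked_phase2:` line and that records appear in target order (MDT0000 first), as they do in the output above.

```shell
# Summarize per-target LFSCK status from concatenated lfsck_namespace output.
# Assumption: every "status:" line starts a new target's record, and targets
# are listed in index order. The target index printed is just the record count.
lfsck_summary() {
    awk '
        $1 == "status:"         { n++; status[n] = $2 }
        $1 == "checked_phase2:" { phase2[n] = $2 }
        END {
            for (i = 1; i <= n; i++)
                printf "target %d: %s (checked_phase2=%s)\n", i - 1, status[i], phase2[i]
        }
    '
}

# Usage on the MDS (not run here):
#   lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace | lfsck_summary
```

With the output above this would make it obvious that only the first target is still in scanning-phase2 while the other two have completed.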
      

      lfsck_namespace was using 100% of a cpu but the checked_phase2 counter wasn't going up.
      kill -9 on lfsck_namespace didn't work
      I didn't try lctl lfsck_stop this time.
      mdt0 wouldn't umount. had to reset the MDS.

      I did a sysrq 't' and 'w' before resetting the MDS and those start at
      Sep 23 00:18:42
      in the attached messages file.

      hopefully that might help.
      please let us know if there's something else we can help with.

      cheers,
      robin

          Activity

            [LU-11419] lfsck does not complete phase2
            pjones Peter Jones added a comment -

            Landed for 2.12

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33252/
            Subject: LU-11419 lfsck: lfsck_namespace_shrink_linkea() dead loop
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 20a603c42ecc1a5c6f1b3d5a0e31b2b323777abb

            laisiyao Lai Siyao added a comment -

            Hi Robin, I just uploaded a patch, you can wait for it to pass autotest, and then apply on your system and test again.

            gerrit Gerrit Updater added a comment -

            Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33252
            Subject: LU-11419 lfsck: lfsck_namespace_shrink_linkea() dead loop
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 22503a1db2c3b6a7f3a12829ee2484ea95a25913

            laisiyao Lai Siyao added a comment -

            Can you enable more debug: 'lctl set_param debug="+trace lfsck"' on dagg-MDT0000 when running lfsck, and then collect debug logs? This can help locate what dead loop it may fall into.
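
The request above could be carried out roughly as follows. This is a sketch, not a verified procedure: `collect_lfsck_debug`, the `RUN` switch, and the log path are hypothetical names introduced for illustration. The `set_param` and `lfsck_start` lines are taken from this ticket, and `lctl dk` is the usual way to dump the kernel debug buffer to a file. By default the function only prints the commands, so nothing touches the MDS unless `RUN=1` is set.

```shell
# Print (default) or execute the debug-collection steps for one MDT.
# RUN=1 executes the commands on the MDS; otherwise they are only echoed.
# Note: the dry-run echo drops the shell quoting around "+trace lfsck".
collect_lfsck_debug() {
    target=$1      # e.g. dagg-MDT0000
    logfile=$2     # hypothetical output path for the debug log
    run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

    # Add trace and lfsck messages to the current debug mask.
    run lctl set_param debug="+trace lfsck"
    # Start the namespace LFSCK on all targets, as in the original report.
    run lctl lfsck_start -M "$target" -t namespace -A
    # Once checked_phase2 has stopped advancing, dump the kernel debug buffer.
    run lctl dk "$logfile"
}

# Dry run:
#   collect_lfsck_debug dagg-MDT0000 /tmp/lfsck-debug.log
```

In practice the dump would be taken by hand some time after the stall is observed rather than immediately after `lfsck_start`, so treat the ordering here as a checklist only.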

            scadmin SC Admin added a comment -

            Hi Lai,

            thanks. applied https://review.whamcloud.com/#/c/33078/ from LU-11201 but no change.

            I left it for about 10 extra hours after phase2 counters on mdt0 stopped incrementing, but nothing changed. as per previous report of this in LU-11111 I couldn't stop the lfsck and had to reset the MDS.

            this is as far as it got ->

            [warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace | egrep 'status:|repaired|checked_|speed'  | grep -v ' 0$'
            status: scanning-phase2
            checked_phase1: 33091005
            checked_phase2: 10550536
            dangling_repaired: 31
            striped_shards_repaired: 28
            name_hash_repaired: 18
            average_speed_phase1: 874 items/sec
            average_speed_phase2: 807 objs/sec
            average_speed_total: 857 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: 7 objs/sec
            status: completed
            checked_phase1: 32500602
            checked_phase2: 12505620
            dangling_repaired: 29
            striped_shards_repaired: 28
            name_hash_repaired: 56
            average_speed_phase1: 890 items/sec
            average_speed_phase2: 1923 objs/sec
            average_speed_total: 1046 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            status: completed
            checked_phase1: 32512235
            checked_phase2: 12504486
            linkea_repaired: 1
            dangling_repaired: 14
            striped_shards_repaired: 28
            average_speed_phase1: 896 items/sec
            average_speed_phase2: 1923 objs/sec
            average_speed_total: 1052 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
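
One way to confirm that the phase2 counter has stopped incrementing is to sample it twice and compare. The helpers below are a sketch under assumptions: `first_phase2` and `phase2_progress` are hypothetical names, and the sampling interval in the usage comment is arbitrary.

```shell
# Print the first checked_phase2 value found on stdin
# (the first target's record when several are concatenated).
first_phase2() {
    awk '$1 == "checked_phase2:" { print $2; exit }'
}

# Compare two samples of the counter and report whether it advanced.
phase2_progress() {
    before=$1
    after=$2
    if [ "$after" -gt "$before" ]; then
        echo "progressing (+$((after - before)))"
    else
        echo "stalled at $after"
    fi
}

# Usage on the MDS (not run here):
#   a=$(lctl get_param -n mdd.dagg-MDT0000.lfsck_namespace | first_phase2)
#   sleep 600
#   b=$(lctl get_param -n mdd.dagg-MDT0000.lfsck_namespace | first_phase2)
#   phase2_progress "$a" "$b"
```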
            

            I'll attach syslog for today. it includes a couple of 'echo t > /proc/sysrq-trigger' in case that helps you work out where lfsck namespace is stuck.

            cheers,
            robin

            laisiyao Lai Siyao added a comment -

            This looks to be the same as LU-11201, can you apply patch https://review.whamcloud.com/#/c/32958/ and try lfsck again?

            pjones Peter Jones added a comment -

            Lai

            Could you please assist here?

            Thanks

            Peter

            People

              laisiyao Lai Siyao
              scadmin SC Admin