Details


    Description

      Hi,

      finds were hanging on the main filesystem from one client, and the processes looked to be unkillable. I rebooted the client running the finds and restarted the find sweep, but they hung again.

      I then failed over all the MDTs to one MDS (we have 2), and that went OK. I then failed all the MDTs back to the other MDS and it LBUG'd:

       kernel: LustreError: 49321:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
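
      For context, a manual failover of one of these ZFS-backed MDTs would look roughly like the sketch below. The pool, dataset and mountpoint names are taken from the mount listing later in this ticket; the exact procedure is an assumption, since in practice the failover is normally driven by the HA stack:

      # stop the target on the old MDS, then move the pool to the peer and mount it there
      [warble1]root: umount /lustre/dagg/MDT1
      [warble1]root: zpool export warble1-dagg-MDT1-pool
      [warble2]root: zpool import warble1-dagg-MDT1-pool
      [warble2]root: mount -t lustre warble1-dagg-MDT1-pool/MDT1 /lustre/dagg/MDT1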
      

      since then 2 of the MDTs won't connect. they are stuck in the WAITING state and never get to RECOVERING or COMPLETE.

      [warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status
      status: WAITING
      non-ready MDTs:  0000
      recovery_start: 1523093864
      time_waited: 388
      
      [warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status
      status: WAITING
      non-ready MDTs:  0000
      recovery_start: 1523093864
      time_waited: 391
      

      the other MDT is ok.

      [warble2]root: cat /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status
      status: COMPLETE
      recovery_start: 1523093168
      recovery_duration: 30
      completed_clients: 122/122
      replayed_requests: 0
      last_transno: 214748364800
      VBR: DISABLED
      IR: DISABLED
      

      I've tried unmounting and remounting a few times, but time_waited just keeps incrementing. it gets to 900s, spits out a message, and then looks like it keeps going forever.
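
      For anyone following along, the same recovery state can be polled with lctl get_param instead of catting /proc directly; this reads the same entries as the commands above (output trimmed to the status lines):

      [warble1]root: lctl get_param mdt.dagg-MDT000*.recovery_status
      mdt.dagg-MDT0001.recovery_status=
      status: WAITING
      ...
      mdt.dagg-MDT0002.recovery_status=
      status: WAITING
      ...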

      any ideas?

      cheers,
      robin

      Attachments

        1. conman-warble1-traces.txt
          1.63 MB
        2. warble1.log-20180408.gz
          115 kB
        3. warble1-traces.txt
          1.54 MB
        4. warbles.txt
          456 kB
        5. warbles-messages-20180408.txt
          1.22 MB
        6. zfs-list.warble1.txt
          1 kB
        7. zpool-status.warble1.txt
          3 kB


          Activity

            [LU-10887] 2 MDTs stuck in WAITING

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31915/
            Subject: LU-10887 lfsck: offer shard's mode when re-create it
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7d48050b7a3ba0b9db2ff823bc6fbc3091506597

            yong.fan nasf (Inactive) added a comment -

            The object reference leak issue will be fixed via https://review.whamcloud.com/#/c/31431/ (2.10.4)
            pjones Peter Jones added a comment -

            Robin

            Yes - 2.10.x is an LTS branch. We'd prefer to keep the ticket severity at S1 so that we correctly categorize it when we run future reports, but we understand that the "all hands on deck" period is past and we're focusing on RCA and preventive actions to avoid future scenarios.

            Peter

            scadmin SC Admin added a comment -

            Hi Peter,

            oh, I see I haven't been reading roadmaps enough. I thought the plan was to keep folks rolling forward with 2.x releases and we were OK with that. I didn't realise 2.10.x was an LTS.

            is it appropriate to drop this from severity 1 now? the fs is up and we're reasonably confident it'll stay that way.

            cheers,
            robin

            pjones Peter Jones added a comment -

            Robin

            We have no immediate plans to do a 2.11.1 release but we can queue up this fix for inclusion in 2.10.4

            Peter

            scadmin SC Admin added a comment -

            oh, that makes sense. nice find!
            hopefully we'll move to 2.11 around the 2.11.1 timeframe.

            cheers,
            robin


            yong.fan nasf (Inactive) added a comment -

            I know the reason. You need the patch: https://review.whamcloud.com/29228
            That patch has landed in Lustre 2.11, but not in Lustre 2.10.3 yet.
            scadmin SC Admin added a comment -
            [warble2]root: mount -l -t lustre
            warble2-apps-MDT0-pool/MDT0 on /lustre/apps/MDT0 type lustre (ro)
            warble2-home-MDT0-pool/MDT0 on /lustre/home/MDT0 type lustre (ro)
            warble1-MGT-pool/MGT on /lustre/MGT type lustre (ro)
            warble1-images-MDT0-pool/MDT0 on /lustre/images/MDT0 type lustre (ro)
            warble2-dagg-MDT0-pool/MDT0 on /lustre/dagg/MDT0 type lustre (ro)
            warble1-dagg-MDT1-pool/MDT1 on /lustre/dagg/MDT1 type lustre (ro)
            warble1-dagg-MDT2-pool/MDT2 on /lustre/dagg/MDT2 type lustre (ro)
            

            all the dagg MDTs were definitely mounted with -o skip_lfsck (the by-hand invocation is sketched below).
            odd.

            cheers,
            robin
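
            For reference, the by-hand mount referred to above would look roughly like this for one of the dagg MDTs; the pool/dataset and mountpoint are taken from the listing above, and the exact invocation is an assumption:

            [warble2]root: mount -t lustre -o skip_lfsck warble1-dagg-MDT1-pool/MDT1 /lustre/dagg/MDT1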


            yong.fan nasf (Inactive) added a comment -

            What is the output on the MDT?

            mount -l -t lustre
            scadmin SC Admin added a comment -

            Hi Andreas,

            the system seems stable with all the MDTs on warble2 and mounted by hand with -o skip_lfsck. so yup, happy to leave it that way for a while - both for the patches to mature, and for us to swap out and try to diagnose potentially bad bits of warble1 hardware.

            obviously at some stage we'll have to test (a hopefully improved) warble1 with Lustre again though, and unfortunately there are shared components (raid controller) that we can't test separately and offline.

            Hi Fan Yong, so CentOS 7 doesn't show any extra options in the 'mount' output, unfortunately. even when I mount by hand I just get 'lustre (ro)'. I guess you're using a different OS. I don't suppose there's another way to see if -o skip_lfsck was asked for? 'dk', /proc, /sys? but anyway, no hurry 'cos we don't have any HA now, so we can't re-test anyway.

            cheers,
            robin
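
            One possible after-the-fact check (purely a suggestion, not something confirmed anywhere in this ticket): dump the kernel debug buffer that 'dk' refers to and grep it for lfsck-related lines; whether anything about the mount options actually shows up depends on the debug mask that was active, and the dump path below is just an example:

            [warble2]root: lctl dk /tmp/lustre-dk.txt
            [warble2]root: grep -i lfsck /tmp/lustre-dk.txt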


            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31929
            Subject: LU-10887 mdt: ldlm lock should not pin object
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bc77552f9fbc09b1fcc3f29151fbfc0b47fcfbb1

            People

              yong.fan nasf (Inactive)
              scadmin SC Admin
              Votes: 0
              Watchers: 13
