Details


    Description

      Hi,

      find runs were hanging on the main filesystem from one client, and the processes looked to be unkillable. I rebooted the client running the finds and restarted the find sweep, but they hung again.

      I then failed over all the MDTs to one MDS (we have 2), and that went OK. I then failed all the MDTs back to the other MDS and it LBUG'd:

       kernel: LustreError: 49321:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      

      Since then 2 of the MDTs won't connect. They are stuck in the WAITING state and never get to RECOVERING or COMPLETE.

      [warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status
      status: WAITING
      non-ready MDTs:  0000
      recovery_start: 1523093864
      time_waited: 388
      
      [warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status
      status: WAITING
      non-ready MDTs:  0000
      recovery_start: 1523093864
      time_waited: 391
      

      The other MDT is OK:

      [warble2]root: cat /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status
      status: COMPLETE
      recovery_start: 1523093168
      recovery_duration: 30
      completed_clients: 122/122
      replayed_requests: 0
      last_transno: 214748364800
      VBR: DISABLED
      IR: DISABLED
      

      I've tried unmounting and remounting a few times, but time_waited just keeps incrementing. It gets to 900s, spits out a message, and then by the looks of it keeps going forever.
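
      For reference, a rough sketch of what "aborting the recovery by force" would presumably look like here, assuming the standard abort_recov mount option and lctl abort_recovery command are the right tools (the mount point and dataset names below are placeholders, not our real ones):

      # keep an eye on a stuck target's recovery state
      watch -n 10 cat /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status

      # abort recovery on a running target instead of waiting it out
      lctl --device dagg-MDT0001 abort_recovery

      # or unmount and remount the target with recovery aborted
      umount /mnt/dagg-mdt1
      mount -t lustre -o abort_recov mdt1pool/mdt1 /mnt/dagg-mdt1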

      Any ideas?

      cheers,
      robin

      Attachments

        1. conman-warble1-traces.txt
          1.63 MB
        2. warble1.log-20180408.gz
          115 kB
        3. warble1-traces.txt
          1.54 MB
        4. warbles.txt
          456 kB
        5. warbles-messages-20180408.txt
          1.22 MB
        6. zfs-list.warble1.txt
          1 kB
        7. zpool-status.warble1.txt
          3 kB


          Activity

            [LU-10887] 2 MDTs stuck in WAITING

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32076/
            Subject: LU-10887 lfsck: offer shard's mode when re-create it
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 4118b5317b23d6d2d6c09ca1cdc797ec027622c8

            scadmin SC Admin added a comment -

            Yes, that patch is what I've called lu10887-lfsck.patch and it is applied.

            cheers,
            robin

            pjones Peter Jones added a comment -

            I've created a new ticket - LU-10988 - to track this further investigation. Can we please move any further discussion there? Keeping a 1:1 mapping between tickets and fixes helps with tracking which fixes land in which releases.

            yong.fan nasf (Inactive) added a comment - - edited

            usr/src/lustre-2.10.3/lu10887-lfsck.patch

            So you have applied the patch https://review.whamcloud.com/31915/ on the MDT, right? If so, there must be other corner cases in lfsck_namespace_repair_dangling() that miss setting the "mode". I will investigate further. In any case, the existing patches are still valid.

            scadmin SC Admin added a comment -

            Hi,

            We tested the warble1 hardware as much as we could for about a week and found no hardware issues. We also replaced the SAS cards and cables just to be safe.

            warble1 is now running 3.10.0-693.21.1.el7.x86_64 and ZFS 0.7.8, and has these patches applied:

            usr/src/lustre-2.10.3/lu10212-estale.patch
            usr/src/lustre-2.10.3/lu10707-ksocklnd-revert-jiffies.patch
            usr/src/lustre-2.10.3/lu10707-lnet-route-jiffies.patch
            usr/src/lustre-2.10.3/lu10887-lfsck.patch
            usr/src/lustre-2.10.3/lu8990-put-root.patch
            

            When the dagg MDTs were mounted on warble1 they COMPLETED recovery OK, and then about 5 seconds later it hit an LBUG in lfsck:

            ...
            2018-05-02 22:06:06 [ 2919.828067] Lustre: dagg-MDT0000: Client 22c84389-af1f-9970-0e9b-70c3a4861afd (at 10.8.49.155@tcp201) reconnecting
            2018-05-02 22:06:06 [ 2919.828113] Lustre: dagg-MDT0002: Recovery already passed deadline 0:31. If you do not want to wait more, please abort the recovery by force.
            2018-05-02 22:06:38 [ 2951.686211] Lustre: dagg-MDT0002: recovery is timed out, evict stale exports
            2018-05-02 22:06:38 [ 2951.694197] Lustre: dagg-MDT0002: disconnecting 1 stale clients
            2018-05-02 22:06:38 [ 2951.736799] Lustre: 24680:0:(ldlm_lib.c:2544:target_recovery_thread()) too long recovery - read logs
            2018-05-02 22:06:38 [ 2951.746774] Lustre: dagg-MDT0002: Recovery over after 6:24, of 125 clients 124 recovered and 1 was evicted.
            2018-05-02 22:06:38 [ 2951.746775] LustreError: dumping log to /tmp/lustre-log.1525262798.24680
            2018-05-02 22:06:44 [ 2957.910031] LustreError: 33236:0:(dt_object.c:213:dt_mode_to_dft()) LBUG
            2018-05-02 22:06:44 [ 2957.917615] Pid: 33236, comm: lfsck_namespace
            2018-05-02 22:06:44 [ 2957.922760]
            2018-05-02 22:06:44 [ 2957.922760] Call Trace:
            2018-05-02 22:06:44 [ 2957.928142]  [<ffffffffc06457ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            2018-05-02 22:06:44 [ 2957.935374]  [<ffffffffc064583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            2018-05-02 22:06:44 [ 2957.942270]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
            2018-05-02 22:06:44 [ 2957.949398]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
            2018-05-02 22:06:44 [ 2957.957911]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
            2018-05-02 22:06:44 [ 2957.965289]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
            2018-05-02 22:06:44 [ 2957.974129]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
            2018-05-02 22:06:44 [ 2957.983217]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-02 22:06:44 [ 2957.989375]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
            2018-05-02 22:06:44 [ 2957.997154]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
            2018-05-02 22:06:44 [ 2958.003538]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-02 22:06:44 [ 2958.009648]  [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
            2018-05-02 22:06:44 [ 2958.016449]  [<ffffffffc11405c0>] ? lfsck_assistant_engine+0x0/0x20b0 [lfsck]
            2018-05-02 22:06:44 [ 2958.024186]  [<ffffffff810b4031>] kthread+0xd1/0xe0
            2018-05-02 22:06:44 [ 2958.029662]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
            2018-05-02 22:06:44 [ 2958.035220]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
            2018-05-02 22:06:44 [ 2958.041197]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
            2018-05-02 22:06:44 [ 2958.046723]
            2018-05-02 22:06:44 [ 2958.048771] Kernel panic - not syncing: LBUG
            2018-05-02 22:06:44 [ 2958.053576] CPU: 2 PID: 33236 Comm: lfsck_namespace Tainted: P           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
            2018-05-02 22:06:44 [ 2958.065051] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.3.7 02/08/2018
            2018-05-02 22:06:44 [ 2958.073066] Call Trace:
            2018-05-02 22:06:44 [ 2958.076060]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
            2018-05-02 22:06:44 [ 2958.081738]  [<ffffffff816a8634>] panic+0xe8/0x21f
            2018-05-02 22:06:44 [ 2958.087058]  [<ffffffffc0645854>] lbug_with_loc+0x64/0xb0 [libcfs]
            2018-05-02 22:06:44 [ 2958.093781]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
            2018-05-02 22:06:44 [ 2958.100741]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
            2018-05-02 22:06:45 [ 2958.109091]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
            2018-05-02 22:06:45 [ 2958.116294]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
            2018-05-02 22:06:45 [ 2958.124963]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
            2018-05-02 22:06:45 [ 2958.133889]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-02 22:06:45 [ 2958.139866]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
            2018-05-02 22:06:45 [ 2958.147492]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
            2018-05-02 22:06:45 [ 2958.153724]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-02 22:06:45 [ 2958.159689]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
            2018-05-02 22:06:45 [ 2958.165742]  [<ffffffffc11405c0>] ? lfsck_master_engine+0x1310/0x1310 [lfsck]
            2018-05-02 22:06:45 [ 2958.173343]  [<ffffffff810b4031>] kthread+0xd1/0xe0
            2018-05-02 22:06:45 [ 2958.178685]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
            2018-05-02 22:06:45 [ 2958.185227]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
            2018-05-02 22:06:45 [ 2958.191065]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
            2018-05-02 22:06:45 [ 2958.197613] Kernel Offset: disabled
            

            I've failed the MDTs back to warble2 and mounted them by hand with -o skip_lfsck.
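
            A rough sketch of what that looks like, assuming the usual syntax for ZFS-backed targets (the pool/dataset and mount point names below are placeholders for our real ones):

            # mount each dagg MDT by hand on warble2 with lfsck disabled
            mount -t lustre -o skip_lfsck mdt1pool/mdt1 /lustre/dagg-MDT0001
            mount -t lustre -o skip_lfsck mdt2pool/mdt2 /lustre/dagg-MDT0002

            # a namespace lfsck that is already running can presumably also be stopped via lctl, e.g.
            lctl lfsck_stop -M dagg-MDT0000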

            cheers,
            robin


            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32076
            Subject: LU-10887 lfsck: offer shard's mode when re-create it
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6510f5123b273419ee33a0ec6aaf297124dae155

            pjones Peter Jones added a comment -

            Marking this as fixed in 2.12 so that this fix is also queued up for 2.10.4. Our goal would be for Swinburne to be running a vanilla release without the need for additional patches.

            yong.fan nasf (Inactive) added a comment - - edited

            Robin,

            https://review.whamcloud.com/31915/ has been landed to master, https://review.whamcloud.com/#/c/31431/ has been landed to 2.10.4.
            Is there anything else you need for this ticket?


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31915/
            Subject: LU-10887 lfsck: offer shard's mode when re-create it
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7d48050b7a3ba0b9db2ff823bc6fbc3091506597


            yong.fan nasf (Inactive) added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31929
            Subject: LU-10887 mdt: ldlm lock should not pin object
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bc77552f9fbc09b1fcc3f29151fbfc0b47fcfbb1

            The object reference leak issue will be fixed via https://review.whamcloud.com/#/c/31431/ (2.10.4)


            People

              yong.fan nasf (Inactive)
              scadmin SC Admin
              Votes: 0
              Watchers: 13
