Details

    Description

      Hi,

      Finds were hanging on the main filesystem from one client, and the processes looked to be unkillable. I rebooted the client running the finds and restarted the find sweep, but they hung again.
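
      Processes stuck like this are usually in uninterruptible sleep; a quick way to confirm that on the client (assuming standard procps tools) is something like:

      ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'

      or echo w > /proc/sysrq-trigger to dump the blocked tasks to dmesg.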

      I then failed over all the MDTs to one MDS (we have 2), and that went OK. I then failed all the MDTs back to the other MDS and it LBUG'd:

       kernel: LustreError: 49321:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      

      Since then, 2 of the MDTs won't connect. They are stuck in the WAITING state and never get to RECOVERING or COMPLETE.

      [warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status
      status: WAITING
      non-ready MDTs:  0000
      recovery_start: 1523093864
      time_waited: 388
      
      [warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status
      status: WAITING
      non-ready MDTs:  0000
      recovery_start: 1523093864
      time_waited: 391
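
      For reference, the same counters can be read for all local MDTs in one go with lctl (assuming the standard parameter paths):

      [warble1]root: lctl get_param mdt.dagg-MDT*.recovery_status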
      

      The other MDT is OK:

      [warble2]root: cat /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status
      status: COMPLETE
      recovery_start: 1523093168
      recovery_duration: 30
      completed_clients: 122/122
      replayed_requests: 0
      last_transno: 214748364800
      VBR: DISABLED
      IR: DISABLED
      

      I've tried unmounting and remounting a few times, but time_waited just keeps incrementing. It gets to 900s, spits out a message, and then looks like it keeps going forever.
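
      If waiting indefinitely isn't acceptable, recovery on a stuck target can usually be aborted instead; a sketch, assuming the MDT is addressed by its device name:

      [warble1]root: lctl --device dagg-MDT0001 abort_recovery

      (or mount the target with -o abort_recov); either way evicts any clients that haven't finished replay.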

      any ideas?

      cheers,
      robin

      Attachments

        1. conman-warble1-traces.txt
          1.63 MB
        2. warble1.log-20180408.gz
          115 kB
        3. warble1-traces.txt
          1.54 MB
        4. warbles.txt
          456 kB
        5. warbles-messages-20180408.txt
          1.22 MB
        6. zfs-list.warble1.txt
          1 kB
        7. zpool-status.warble1.txt
          3 kB

          Activity

            [LU-10887] 2 MDTs stuck in WAITING
            scadmin SC Admin added a comment -

            Yes, that patch is what I've called lu10887-lfsck.patch, and it is applied.

            cheers,
            robin

            pjones Peter Jones added a comment -

            I've created a new ticket - LU-10988 - to track the further investigation. Could we please move any further discussion there? Keeping a 1:1 mapping between tickets and fixes helps with tracking which fixes land in which releases.

            yong.fan nasf (Inactive) added a comment - - edited

            usr/src/lustre-2.10.3/lu10887-lfsck.patch

            So you have applied the patch https://review.whamcloud.com/31915/ on the MDT, right? If so, there must be other corner cases where lfsck_namespace_repair_dangling() misses setting the “mode”. I will investigate further. In any case, the existing patches are still valid.
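
            In the meantime, the current namespace LFSCK state on the MDTs can be inspected with (assuming the standard mdd parameter path):

            lctl get_param -n mdd.dagg-MDT*.lfsck_namespace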

            scadmin SC Admin added a comment -

            Hi,

            We tested the warble1 hardware all we could for about a week and found no hardware issues. We also replaced the SAS cards and cables just to be safe.

            warble1 is now running kernel 3.10.0-693.21.1.el7.x86_64 and ZFS 0.7.8, and has these patches applied:

            usr/src/lustre-2.10.3/lu10212-estale.patch
            usr/src/lustre-2.10.3/lu10707-ksocklnd-revert-jiffies.patch
            usr/src/lustre-2.10.3/lu10707-lnet-route-jiffies.patch
            usr/src/lustre-2.10.3/lu10887-lfsck.patch
            usr/src/lustre-2.10.3/lu8990-put-root.patch
            

            When the dagg MDTs were mounted on warble1, recovery COMPLETED OK, and then about 5 seconds later it hit an LBUG in lfsck:

            ...
            2018-05-02 22:06:06 [ 2919.828067] Lustre: dagg-MDT0000: Client 22c84389-af1f-9970-0e9b-70c3a4861afd (at 10.8.49.155@tcp201) reconnecting
            2018-05-02 22:06:06 [ 2919.828113] Lustre: dagg-MDT0002: Recovery already passed deadline 0:31. If you do not want to wait more, please abort the recovery by force.
            2018-05-02 22:06:38 [ 2951.686211] Lustre: dagg-MDT0002: recovery is timed out, evict stale exports
            2018-05-02 22:06:38 [ 2951.694197] Lustre: dagg-MDT0002: disconnecting 1 stale clients
            2018-05-02 22:06:38 [ 2951.736799] Lustre: 24680:0:(ldlm_lib.c:2544:target_recovery_thread()) too long recovery - read logs
            2018-05-02 22:06:38 [ 2951.746774] Lustre: dagg-MDT0002: Recovery over after 6:24, of 125 clients 124 recovered and 1 was evicted.
            2018-05-02 22:06:38 [ 2951.746775] LustreError: dumping log to /tmp/lustre-log.1525262798.24680
            2018-05-02 22:06:44 [ 2957.910031] LustreError: 33236:0:(dt_object.c:213:dt_mode_to_dft()) LBUG
            2018-05-02 22:06:44 [ 2957.917615] Pid: 33236, comm: lfsck_namespace
            2018-05-02 22:06:44 [ 2957.922760]
            2018-05-02 22:06:44 [ 2957.922760] Call Trace:
            2018-05-02 22:06:44 [ 2957.928142]  [<ffffffffc06457ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            2018-05-02 22:06:44 [ 2957.935374]  [<ffffffffc064583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            2018-05-02 22:06:44 [ 2957.942270]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
            2018-05-02 22:06:44 [ 2957.949398]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
            2018-05-02 22:06:44 [ 2957.957911]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
            2018-05-02 22:06:44 [ 2957.965289]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
            2018-05-02 22:06:44 [ 2957.974129]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
            2018-05-02 22:06:44 [ 2957.983217]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-02 22:06:44 [ 2957.989375]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
            2018-05-02 22:06:44 [ 2957.997154]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
            2018-05-02 22:06:44 [ 2958.003538]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-02 22:06:44 [ 2958.009648]  [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
            2018-05-02 22:06:44 [ 2958.016449]  [<ffffffffc11405c0>] ? lfsck_assistant_engine+0x0/0x20b0 [lfsck]
            2018-05-02 22:06:44 [ 2958.024186]  [<ffffffff810b4031>] kthread+0xd1/0xe0
            2018-05-02 22:06:44 [ 2958.029662]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
            2018-05-02 22:06:44 [ 2958.035220]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
            2018-05-02 22:06:44 [ 2958.041197]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
            2018-05-02 22:06:44 [ 2958.046723]
            2018-05-02 22:06:44 [ 2958.048771] Kernel panic - not syncing: LBUG
            2018-05-02 22:06:44 [ 2958.053576] CPU: 2 PID: 33236 Comm: lfsck_namespace Tainted: P           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
            2018-05-02 22:06:44 [ 2958.065051] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.3.7 02/08/2018
            2018-05-02 22:06:44 [ 2958.073066] Call Trace:
            2018-05-02 22:06:44 [ 2958.076060]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
            2018-05-02 22:06:44 [ 2958.081738]  [<ffffffff816a8634>] panic+0xe8/0x21f
            2018-05-02 22:06:44 [ 2958.087058]  [<ffffffffc0645854>] lbug_with_loc+0x64/0xb0 [libcfs]
            2018-05-02 22:06:44 [ 2958.093781]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
            2018-05-02 22:06:44 [ 2958.100741]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
            2018-05-02 22:06:45 [ 2958.109091]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
            2018-05-02 22:06:45 [ 2958.116294]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
            2018-05-02 22:06:45 [ 2958.124963]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
            2018-05-02 22:06:45 [ 2958.133889]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-02 22:06:45 [ 2958.139866]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
            2018-05-02 22:06:45 [ 2958.147492]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
            2018-05-02 22:06:45 [ 2958.153724]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            2018-05-02 22:06:45 [ 2958.159689]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
            2018-05-02 22:06:45 [ 2958.165742]  [<ffffffffc11405c0>] ? lfsck_master_engine+0x1310/0x1310 [lfsck]
            2018-05-02 22:06:45 [ 2958.173343]  [<ffffffff810b4031>] kthread+0xd1/0xe0
            2018-05-02 22:06:45 [ 2958.178685]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
            2018-05-02 22:06:45 [ 2958.185227]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
            2018-05-02 22:06:45 [ 2958.191065]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
            2018-05-02 22:06:45 [ 2958.197613] Kernel Offset: disabled
            

            I've failed the MDTs back to warble2 and mounted them by hand with -o skip_lfsck.
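
            The manual mounts were roughly of the form below (dataset and mountpoint names here are illustrative), and a namespace lfsck that is already running can also be stopped explicitly:

            mount -t lustre -o skip_lfsck mdtpool/dagg-mdt1 /mnt/dagg-MDT0001
            lctl lfsck_stop -M dagg-MDT0001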

            cheers,
            robin


            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32076
            Subject: LU-10887 lfsck: offer shard's mode when re-create it
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6510f5123b273419ee33a0ec6aaf297124dae155

            pjones Peter Jones added a comment -

            Marking this as fixed in 2.12 so that this fix is also queued up for 2.10.4. Our goal would be for Swinburne to be running a vanilla release without the need for additional patches.

            yong.fan nasf (Inactive) added a comment - - edited

            Robin,

            https://review.whamcloud.com/31915/ has landed on master, and https://review.whamcloud.com/#/c/31431/ has landed for 2.10.4.
            Is there anything else you need for this ticket?


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31915/
            Subject: LU-10887 lfsck: offer shard's mode when re-create it
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7d48050b7a3ba0b9db2ff823bc6fbc3091506597


            yong.fan nasf (Inactive) added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31929
            Subject: LU-10887 mdt: ldlm lock should not pin object
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bc77552f9fbc09b1fcc3f29151fbfc0b47fcfbb1

            The object reference leak issue will be fixed via https://review.whamcloud.com/#/c/31431/ (2.10.4)

            pjones Peter Jones added a comment -

            Robin

            Yes - 2.10.x is an LTS branch. We'd prefer to keep the ticket severity at S1 so that we correctly categorize it when we run future reports, but we understand that the "all hands on deck" period is past and we're focusing on RCA and preventive actions to avoid future scenarios.

            Peter

            scadmin SC Admin added a comment -

            Hi Peter,

            Oh, I see I haven't been reading the roadmaps closely enough. I thought the plan was to keep folks rolling forward with 2.x releases, and we were OK with that. I didn't realise 2.10.x was an LTS.

            Is it appropriate to drop this from severity 1 now? The filesystem is up and we're reasonably confident it'll stay that way.

            cheers,
            robin


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: scadmin SC Admin
              Votes: 0
              Watchers: 13
