Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.10.3
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
    • Labels: None
    • Severity: 3

      Description

      Hi,

      We tested the warble1 hardware as thoroughly as we could for about a week and found no hardware issues. We also replaced the SAS cards and cables just to be safe.

      warble1 is now running kernel 3.10.0-693.21.1.el7.x86_64 and ZFS 0.7.8, and has these patches applied:

      usr/src/lustre-2.10.3/lu10212-estale.patch
      usr/src/lustre-2.10.3/lu10707-ksocklnd-revert-jiffies.patch
      usr/src/lustre-2.10.3/lu10707-lnet-route-jiffies.patch
      usr/src/lustre-2.10.3/lu10887-lfsck.patch
      usr/src/lustre-2.10.3/lu8990-put-root.patch
      

      When the dagg MDTs were mounted on warble1 they COMPLETED recovery ok, and then about 5 seconds later the node hit an LBUG in lfsck.

      ...
      2018-05-02 22:06:06 [ 2919.828067] Lustre: dagg-MDT0000: Client 22c84389-af1f-9970-0e9b-70c3a4861afd (at 10.8.49.155@tcp201) reconnecting
      2018-05-02 22:06:06 [ 2919.828113] Lustre: dagg-MDT0002: Recovery already passed deadline 0:31. If you do not want to wait more, please abort the recovery by force.
      2018-05-02 22:06:38 [ 2951.686211] Lustre: dagg-MDT0002: recovery is timed out, evict stale exports
      2018-05-02 22:06:38 [ 2951.694197] Lustre: dagg-MDT0002: disconnecting 1 stale clients
      2018-05-02 22:06:38 [ 2951.736799] Lustre: 24680:0:(ldlm_lib.c:2544:target_recovery_thread()) too long recovery - read logs
      2018-05-02 22:06:38 [ 2951.746774] Lustre: dagg-MDT0002: Recovery over after 6:24, of 125 clients 124 recovered and 1 was evicted.
      2018-05-02 22:06:38 [ 2951.746775] LustreError: dumping log to /tmp/lustre-log.1525262798.24680
      2018-05-02 22:06:44 [ 2957.910031] LustreError: 33236:0:(dt_object.c:213:dt_mode_to_dft()) LBUG
      2018-05-02 22:06:44 [ 2957.917615] Pid: 33236, comm: lfsck_namespace
      2018-05-02 22:06:44 [ 2957.922760]
      2018-05-02 22:06:44 [ 2957.922760] Call Trace:
      2018-05-02 22:06:44 [ 2957.928142]  [<ffffffffc06457ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      2018-05-02 22:06:44 [ 2957.935374]  [<ffffffffc064583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      2018-05-02 22:06:44 [ 2957.942270]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
      2018-05-02 22:06:44 [ 2957.949398]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
      2018-05-02 22:06:44 [ 2957.957911]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
      2018-05-02 22:06:44 [ 2957.965289]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
      2018-05-02 22:06:44 [ 2957.974129]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
      2018-05-02 22:06:44 [ 2957.983217]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:44 [ 2957.989375]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
      2018-05-02 22:06:44 [ 2957.997154]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
      2018-05-02 22:06:44 [ 2958.003538]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:44 [ 2958.009648]  [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
      2018-05-02 22:06:44 [ 2958.016449]  [<ffffffffc11405c0>] ? lfsck_assistant_engine+0x0/0x20b0 [lfsck]
      2018-05-02 22:06:44 [ 2958.024186]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      2018-05-02 22:06:44 [ 2958.029662]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      2018-05-02 22:06:44 [ 2958.035220]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      2018-05-02 22:06:44 [ 2958.041197]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      2018-05-02 22:06:44 [ 2958.046723]
      2018-05-02 22:06:44 [ 2958.048771] Kernel panic - not syncing: LBUG
      2018-05-02 22:06:44 [ 2958.053576] CPU: 2 PID: 33236 Comm: lfsck_namespace Tainted: P           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
      2018-05-02 22:06:44 [ 2958.065051] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.3.7 02/08/2018
      2018-05-02 22:06:44 [ 2958.073066] Call Trace:
      2018-05-02 22:06:44 [ 2958.076060]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      2018-05-02 22:06:44 [ 2958.081738]  [<ffffffff816a8634>] panic+0xe8/0x21f
      2018-05-02 22:06:44 [ 2958.087058]  [<ffffffffc0645854>] lbug_with_loc+0x64/0xb0 [libcfs]
      2018-05-02 22:06:44 [ 2958.093781]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
      2018-05-02 22:06:44 [ 2958.100741]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
      2018-05-02 22:06:45 [ 2958.109091]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
      2018-05-02 22:06:45 [ 2958.116294]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
      2018-05-02 22:06:45 [ 2958.124963]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
      2018-05-02 22:06:45 [ 2958.133889]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:45 [ 2958.139866]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
      2018-05-02 22:06:45 [ 2958.147492]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
      2018-05-02 22:06:45 [ 2958.153724]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:45 [ 2958.159689]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
      2018-05-02 22:06:45 [ 2958.165742]  [<ffffffffc11405c0>] ? lfsck_master_engine+0x1310/0x1310 [lfsck]
      2018-05-02 22:06:45 [ 2958.173343]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      2018-05-02 22:06:45 [ 2958.178685]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      2018-05-02 22:06:45 [ 2958.185227]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      2018-05-02 22:06:45 [ 2958.191065]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      2018-05-02 22:06:45 [ 2958.197613] Kernel Offset: disabled
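
For anyone triaging a console log like the one above, here is a minimal sketch that pulls the stack frames out of the text and separates the frames marked `?` (addresses found on the stack that the unwinder could not confirm as part of the call chain) from the rest. The names `FRAME_RE`, `Frame`, `parse_frames`, and `reliable_call_chain` are hypothetical helpers written for this illustration and are not part of Lustre or any kernel tooling. The sample lines are copied from the trace in this report.

```python
import re
from typing import List, NamedTuple, Optional

# Matches one stack frame such as:
#   [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
#   [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+"                           # return address
    r"(?P<guess>\?\s+)?"                            # '?' = unconfirmed frame
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"  # symbol+offset/size
    r"(?:\s+\[(?P<module>\w+)\])?"                  # module; absent for core kernel
)


class Frame(NamedTuple):
    symbol: str
    module: Optional[str]  # None when the frame is in the core kernel
    reliable: bool         # False when the frame was printed with '?'


def parse_frames(log_text: str) -> List[Frame]:
    """Return every stack frame found in the log text, in printed order."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append(
                Frame(m.group("symbol"), m.group("module"), m.group("guess") is None)
            )
    return frames


def reliable_call_chain(log_text: str) -> List[str]:
    """Symbols of the frames not marked '?', innermost first."""
    return [f.symbol for f in parse_frames(log_text) if f.reliable]


SAMPLE = """\
2018-05-02 22:06:44 [ 2957.928142]  [<ffffffffc06457ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
2018-05-02 22:06:44 [ 2957.935374]  [<ffffffffc064583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
2018-05-02 22:06:44 [ 2957.942270]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
2018-05-02 22:06:44 [ 2957.949398]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
2018-05-02 22:06:44 [ 2957.957911]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
2018-05-02 22:06:44 [ 2958.024186]  [<ffffffff810b4031>] kthread+0xd1/0xe0
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        marker = " " if frame.reliable else "?"
        print(f"{marker} {frame.symbol} [{frame.module or 'kernel'}]")
```

Filtering on `reliable` leaves the chain that matters here: `lfsck_namespace_repair_dangling` (lfsck) calling `dt_mode_to_dft` (obdclass), which then calls `lbug_with_loc`.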
      

      I've failed the MDTs back to warble2 and mounted them by hand with -o skip_lfsck.

      cheers,
      robin

              People

              • Assignee: yong.fan nasf (Inactive)
              • Reporter: scadmin SC Admin