Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Severity: 3

    Description

      Hi,

      We tested the warble1 hardware as thoroughly as we could for about a week and found no hardware issues. We also replaced the SAS cards and cables just to be safe.

      warble1 is now running kernel 3.10.0-693.21.1.el7.x86_64 and ZFS 0.7.8, and has these patches applied:

      usr/src/lustre-2.10.3/lu10212-estale.patch
      usr/src/lustre-2.10.3/lu10707-ksocklnd-revert-jiffies.patch
      usr/src/lustre-2.10.3/lu10707-lnet-route-jiffies.patch
      usr/src/lustre-2.10.3/lu10887-lfsck.patch
      usr/src/lustre-2.10.3/lu8990-put-root.patch
      

      When the dagg MDTs were mounted on warble1, they COMPLETED recovery ok, and then about 5 seconds later the node hit an LBUG in lfsck.

      ...
      2018-05-02 22:06:06 [ 2919.828067] Lustre: dagg-MDT0000: Client 22c84389-af1f-9970-0e9b-70c3a4861afd (at 10.8.49.155@tcp201) reconnecting
      2018-05-02 22:06:06 [ 2919.828113] Lustre: dagg-MDT0002: Recovery already passed deadline 0:31. If you do not want to wait more, please abort the recovery by force.
      2018-05-02 22:06:38 [ 2951.686211] Lustre: dagg-MDT0002: recovery is timed out, evict stale exports
      2018-05-02 22:06:38 [ 2951.694197] Lustre: dagg-MDT0002: disconnecting 1 stale clients
      2018-05-02 22:06:38 [ 2951.736799] Lustre: 24680:0:(ldlm_lib.c:2544:target_recovery_thread()) too long recovery - read logs
      2018-05-02 22:06:38 [ 2951.746774] Lustre: dagg-MDT0002: Recovery over after 6:24, of 125 clients 124 recovered and 1 was evicted.
      2018-05-02 22:06:38 [ 2951.746775] LustreError: dumping log to /tmp/lustre-log.1525262798.24680
      2018-05-02 22:06:44 [ 2957.910031] LustreError: 33236:0:(dt_object.c:213:dt_mode_to_dft()) LBUG
      2018-05-02 22:06:44 [ 2957.917615] Pid: 33236, comm: lfsck_namespace
      2018-05-02 22:06:44 [ 2957.922760]
      2018-05-02 22:06:44 [ 2957.922760] Call Trace:
      2018-05-02 22:06:44 [ 2957.928142]  [<ffffffffc06457ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      2018-05-02 22:06:44 [ 2957.935374]  [<ffffffffc064583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      2018-05-02 22:06:44 [ 2957.942270]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
      2018-05-02 22:06:44 [ 2957.949398]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
      2018-05-02 22:06:44 [ 2957.957911]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
      2018-05-02 22:06:44 [ 2957.965289]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
      2018-05-02 22:06:44 [ 2957.974129]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
      2018-05-02 22:06:44 [ 2957.983217]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:44 [ 2957.989375]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
      2018-05-02 22:06:44 [ 2957.997154]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
      2018-05-02 22:06:44 [ 2958.003538]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:44 [ 2958.009648]  [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
      2018-05-02 22:06:44 [ 2958.016449]  [<ffffffffc11405c0>] ? lfsck_assistant_engine+0x0/0x20b0 [lfsck]
      2018-05-02 22:06:44 [ 2958.024186]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      2018-05-02 22:06:44 [ 2958.029662]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      2018-05-02 22:06:44 [ 2958.035220]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      2018-05-02 22:06:44 [ 2958.041197]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      2018-05-02 22:06:44 [ 2958.046723]
      2018-05-02 22:06:44 [ 2958.048771] Kernel panic - not syncing: LBUG
      2018-05-02 22:06:44 [ 2958.053576] CPU: 2 PID: 33236 Comm: lfsck_namespace Tainted: P           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
      2018-05-02 22:06:44 [ 2958.065051] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.3.7 02/08/2018
      2018-05-02 22:06:44 [ 2958.073066] Call Trace:
      2018-05-02 22:06:44 [ 2958.076060]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      2018-05-02 22:06:44 [ 2958.081738]  [<ffffffff816a8634>] panic+0xe8/0x21f
      2018-05-02 22:06:44 [ 2958.087058]  [<ffffffffc0645854>] lbug_with_loc+0x64/0xb0 [libcfs]
      2018-05-02 22:06:44 [ 2958.093781]  [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass]
      2018-05-02 22:06:44 [ 2958.100741]  [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck]
      2018-05-02 22:06:45 [ 2958.109091]  [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass]
      2018-05-02 22:06:45 [ 2958.116294]  [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck]
      2018-05-02 22:06:45 [ 2958.124963]  [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck]
      2018-05-02 22:06:45 [ 2958.133889]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:45 [ 2958.139866]  [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck]
      2018-05-02 22:06:45 [ 2958.147492]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
      2018-05-02 22:06:45 [ 2958.153724]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
      2018-05-02 22:06:45 [ 2958.159689]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
      2018-05-02 22:06:45 [ 2958.165742]  [<ffffffffc11405c0>] ? lfsck_master_engine+0x1310/0x1310 [lfsck]
      2018-05-02 22:06:45 [ 2958.173343]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      2018-05-02 22:06:45 [ 2958.178685]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      2018-05-02 22:06:45 [ 2958.185227]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      2018-05-02 22:06:45 [ 2958.191065]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      2018-05-02 22:06:45 [ 2958.197613] Kernel Offset: disabled
      

      I've failed the MDTs back to warble2 and mounted them by hand with -o skip_lfsck.

      cheers,
      robin
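
The trace above shows the LBUG raised in dt_mode_to_dft(), called from lfsck_namespace_repair_dangling(). Together with the subject of the patch that was later merged ("load object attr when prepare LFSCK request"), this suggests, though the ticket does not say so outright, that the mode passed in had no valid file-type bits, as would happen if the object's attributes had not been loaded. The following is a minimal userspace sketch of that kind of mode-to-type mapping, not the Lustre source: the enum and function names are hypothetical, and it returns -EINVAL where a kernel helper might assert instead.

```c
#define _DEFAULT_SOURCE          /* expose S_IFLNK / S_IFSOCK on glibc */
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical stand-in for a "format type" enum; not Lustre's definition. */
enum sketch_format_type {
    SKETCH_DFT_REGULAR,
    SKETCH_DFT_DIR,
    SKETCH_DFT_SYM,
    SKETCH_DFT_NODE,
};

/*
 * Map the file-type bits of a POSIX mode to a format type.
 * Returns 0 and stores the type in *out on success, or -EINVAL when the
 * mode carries no recognised file type (for example mode == 0, which is
 * what an attribute structure that was never filled in would contain).
 */
int sketch_mode_to_dft(uint32_t mode, enum sketch_format_type *out)
{
    switch (mode & S_IFMT) {
    case S_IFREG:
        *out = SKETCH_DFT_REGULAR;
        return 0;
    case S_IFDIR:
        *out = SKETCH_DFT_DIR;
        return 0;
    case S_IFLNK:
        *out = SKETCH_DFT_SYM;
        return 0;
    case S_IFCHR:
    case S_IFBLK:
    case S_IFIFO:
    case S_IFSOCK:
        *out = SKETCH_DFT_NODE;
        return 0;
    default:
        /* No usable type bits: report it to the caller rather than assert. */
        return -EINVAL;
    }
}
```

With this shape, a caller holding an unloaded attribute (mode 0) gets an error it can handle, whereas a mapping that asserts on the default branch takes the whole node down, as in the panic above.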

          Activity

            [LU-10988] LBUG in lfsck
            jgmitter Joseph Gmitter (Inactive) made changes -
            Labels Original: LTS
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-17 [ JFC-17 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-20 [ JFC-20 ]

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32522/
            Subject: LU-10988 lfsck: load object attr when prepare LFSCK request
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 21d33c11abdb2908fa5fa43bb942a97615c6f7a2

            mdiep Minh Diep made changes -
            Fix Version/s New: Lustre 2.10.5 [ 14003 ]

            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32522
            Subject: LU-10988 lfsck: load object attr when prepare LFSCK request
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6b3b1a51fe759f769f86f8ff968290754a467ff1

            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-20 [ JFC-20 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-17 [ JFC-17 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.12.0 [ 13495 ]
            Resolution New: Fixed [ 1 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.12


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: scadmin SC Admin
              Votes: 0
              Watchers: 4
