[LU-10134] LBUG lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.3
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Environment: Soak cluster, lustre-master build 3654
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      During a soak mds_failover test run, soak-9 was powered off and MDT0001 failed over to soak-8.

      2017-10-17 22:18:47,423:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0001 on soak-8
      Soak-8 log:
      Oct 17 22:19:43 soak-8 kernel: Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 36 clients reconnect
      Oct 17 22:19:51 soak-8 kernel: Lustre: soaked-MDT0001: Connection restored to 7553a398-a79b-c096-e685-e7284e2b17df (at 172.16.1.47@o2ib1)
      

      Recovery completes, then the system hits an LBUG and panics:

      Oct 17 22:21:03 soak-8 kernel: Lustre: soaked-MDT0001: Recovery over after 1:20, of 36 clients 36 recovered and 0 were evicted.
      Oct 17 22:21:08 soak-8 kernel: LustreError: 3745:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
      Oct 17 22:21:08 soak-8 kernel: LustreError: 3745:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) LBUG
      Oct 17 22:21:08 soak-8 kernel: Pid: 3745, comm: lfsck
      Oct 17 22:21:08 soak-8 kernel: #012Call Trace:
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc0dc37ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc0dc383c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc14ef398>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc14e65a9>] lfsck_double_scan+0x59/0x200 [lfsck]
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc143550a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff811deec3>] ? kfree+0x103/0x140
      Oct 17 22:21:09 soak-8 kernel: [<ffffffffc14eb134>] lfsck_master_engine+0x494/0x12b0 [lfsck]
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      Oct 17 22:21:09 soak-8 kernel: [<ffffffffc14eaca0>] ? lfsck_master_engine+0x0/0x12b0 [lfsck]
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Oct 17 22:21:09 soak-8 kernel:
      Oct 17 22:21:09 soak-8 kernel: Kernel panic - not syncing: LBUG
      

      Crash dump from soak-8 is available on spirit cluster at: /scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2017-10-17-22:21:34
      vmcore-dmesg.txt attached
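
      The failed assertion says that the request list must be empty by the time the double scan runs, and the patch subject on this ticket ("not add requests if engine out of work") suggests requests were still being queued after the engine had stopped. The following is only a minimal userspace sketch of that pattern, not the Lustre code: `struct engine`, `struct req`, `out_of_work` and every function name here are hypothetical, and a pthread mutex stands in for whatever locking the kernel code actually uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pending request; not Lustre's request structure. */
struct req {
	struct req *next;
};

/* Hypothetical engine state; "out_of_work" is an assumed flag name. */
struct engine {
	pthread_mutex_t lock;
	struct req *head;	/* pending requests, newest first */
	bool out_of_work;	/* set once the engine stops taking work */
};

static void engine_init(struct engine *e)
{
	pthread_mutex_init(&e->lock, NULL);
	e->head = NULL;
	e->out_of_work = false;
}

/* Queue a request unless the engine has stopped.
 * Returns 0 if queued, -1 if refused. The flag is tested under the
 * same lock that protects the list, so a request cannot slip in
 * after engine_finish() has drained it. */
static int engine_add_request(struct engine *e, struct req *r)
{
	int rc = -1;

	pthread_mutex_lock(&e->lock);
	if (!e->out_of_work) {
		r->next = e->head;
		e->head = r;
		rc = 0;
	}
	pthread_mutex_unlock(&e->lock);
	return rc;
}

/* Stop accepting work and unlink everything still pending.
 * Returns the number of requests that were dropped. */
static int engine_finish(struct engine *e)
{
	int dropped = 0;

	pthread_mutex_lock(&e->lock);
	e->out_of_work = true;
	while (e->head != NULL) {
		struct req *r = e->head;

		e->head = r->next;
		r->next = NULL;
		dropped++;
	}
	pthread_mutex_unlock(&e->lock);
	return dropped;
}

/* The invariant the assertion checks: nothing is left queued. */
static bool engine_queue_empty(struct engine *e)
{
	bool empty;

	pthread_mutex_lock(&e->lock);
	empty = (e->head == NULL);
	pthread_mutex_unlock(&e->lock);
	return empty;
}
```

      In this sketch, a caller that races with shutdown gets -1 from `engine_add_request()` instead of leaving a stray entry behind, so a later `assert(engine_queue_empty(&e))` holds. Without the flag check, a request added after the drain would trip exactly that kind of assertion.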

          Activity

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30375/
            Subject: LU-10134 lfsck: not add requests if engine out of work
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 02065756c83057d3acc35792b744078f2739025d

            pjones Peter Jones added a comment -

            Landed for 2.11

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30165/
            Subject: LU-10134 lfsck: not add requests if engine out of work
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f22a0ab6c37db2d983451ec01e869ed8d3226cb2

            jamesanunez James Nunez (Inactive) added a comment -

            I ported Fan Yong's patch to b2_10 to see if this will help with the issues seen on the soak cluster.

            gerrit Gerrit Updater added a comment -

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30375
            Subject: LU-10134 lfsck: not add requests if engine out of work
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 90df962c58fb18367291dfff8698fdb1e70d9188

            cliffw Cliff White (Inactive) added a comment -

            After the first assert hits, attempting to restart the systems results in further asserts. We cannot start soak at this time because of this; 3 of our 4 MDS nodes are now halted.

            soak-10.log:Dec  5 16:44:33 soak-10 kernel: LustreError: 2593:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
            soak-10.log:Dec  5 16:44:34 soak-10 kernel: Kernel panic - not syncing: LBUG
            soak-8.log:Dec  5 16:44:31 soak-8 kernel: LustreError: 2529:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
            soak-8.log:Dec  5 16:44:32 soak-8 kernel: Kernel panic - not syncing: LBUG
            soak-9.log:Dec  5 00:32:11 soak-9 kernel: LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
            soak-9.log:Dec  5 00:32:11 soak-9 kernel: Kernel panic - not syncing: LBUG

            cliffw Cliff White (Inactive) added a comment -

            We just hit this assert on b2.10.2-RC1; a crash dump is available on Spirit.

            [ 1256.526444] Lustre: Failing over soaked-MDT0000
            [ 1256.536364] LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
            [ 1256.537017] LustreError: 2303:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88080e0d1800 ns: mdt-soaked-MDT0000_UUID lock: ffff880811089600/0x18106beaab642978 lrc: 3/0,0 mode: PR/PR res: [0x200000405:0x1:0x0].0x0 bits 0x13 rrc: 2 type: IBT flags: 0x50200000000000 nid: 192.168.1.117@o2ib remote: 0xd36fb2fda127ff7 expref: 3 pid: 2303 timeout: 0 lvb_type: 0
            [ 1256.580979] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.117@o2ib (stopping)
            [ 1256.614544] LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
            [ 1256.625905] Pid: 2796, comm: lfsck
            [ 1256.631630]
            Call Trace:
            [ 1256.639771] [<ffffffffc0dd57ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [ 1256.648850] [<ffffffffc0dd583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [ 1256.657599] [<ffffffffc14964c8>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
            [ 1256.667697] [<ffffffffc148d6a9>] lfsck_double_scan+0x59/0x200 [lfsck]
            [ 1256.676632] [<ffffffffc143344a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
            [ 1256.686386] [<ffffffff811dee33>] ? kfree+0x103/0x140
            [ 1256.693515] [<ffffffffc1492204>] lfsck_master_engine+0x494/0x12f0 [lfsck]
            [ 1256.702670] [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
            [ 1256.710967] [<ffffffffc1491d70>] ? lfsck_master_engine+0x0/0x12f0 [lfsck]
            [ 1256.720038] [<ffffffff810b099f>] kthread+0xcf/0xe0
            [ 1256.726787] [<ffffffff810bf114>] ? finish_task_switch+0x54/0x160
            [ 1256.734873] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
            [ 1256.741642] [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
            [ 1256.748880] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
            [ 1256.755609]
            [ 1256.758443] Kernel panic - not syncing: LBUG

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30165
            Subject: LU-10134 lfsck: not add requests if engine out of work
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bd6411b44714543fae3e841ffd055e67f105d24b

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Fan Yong,

            Can you please look at this issue? It has become a roadblock for soak testing on master.

            Thanks.
            Joe

            cliffw Cliff White (Inactive) added a comment -

            After this incident, the system is completely non-recoverable. Any attempt to re-mount MDT0 causes this LBUG to hit again and again. Multiple core dumps are available on spirit.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 6