[LU-10134] LBUG lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.3
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Environment: Soak cluster, lustre-master build 3654
    • Severity: 3

    Description

      During a soak mds_failover test run, soak-9 was powered off and MDT0001 failed over to soak-8.

      2017-10-17 22:18:47,423:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0001 on soak-8
      Soak-8 log:
      Oct 17 22:19:43 soak-8 kernel: Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 36 clients reconnect
      Oct 17 22:19:51 soak-8 kernel: Lustre: soaked-MDT0001: Connection restored to 7553a398-a79b-c096-e685-e7284e2b17df (at 172.16.1.47@o2ib1)
      

      Recovery completes, then the system hits the LBUG and panics:

      Oct 17 22:21:03 soak-8 kernel: Lustre: soaked-MDT0001: Recovery over after 1:20, of 36 clients 36 recovered and 0 were evicted.
      Oct 17 22:21:08 soak-8 kernel: LustreError: 3745:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
      Oct 17 22:21:08 soak-8 kernel: LustreError: 3745:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) LBUG
      Oct 17 22:21:08 soak-8 kernel: Pid: 3745, comm: lfsck
      Oct 17 22:21:08 soak-8 kernel: #012Call Trace:
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc0dc37ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc0dc383c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc14ef398>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc14e65a9>] lfsck_double_scan+0x59/0x200 [lfsck]
      Oct 17 22:21:08 soak-8 kernel: [<ffffffffc143550a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff811deec3>] ? kfree+0x103/0x140
      Oct 17 22:21:09 soak-8 kernel: [<ffffffffc14eb134>] lfsck_master_engine+0x494/0x12b0 [lfsck]
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      Oct 17 22:21:09 soak-8 kernel: [<ffffffffc14eaca0>] ? lfsck_master_engine+0x0/0x12b0 [lfsck]
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
      Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Oct 17 22:21:09 soak-8 kernel:
      Oct 17 22:21:09 soak-8 kernel: Kernel panic - not syncing: LBUG
      

      Crash dump from soak-8 is available on spirit cluster at: /scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2017-10-17-22:21:34
      vmcore-dmesg.txt attached

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30165/
            Subject: LU-10134 lfsck: not add requests if engine out of work
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f22a0ab6c37db2d983451ec01e869ed8d3226cb2
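
The patch subject ("not add requests if engine out of work") and the failing assertion (`list_empty(&lad->lad_req_list)`) together suggest one reading of the fix: once the engine has no more work, nothing new should be queued, so the request list is empty when the double scan checks it. The toy model below illustrates only that idea. It is not the Lustre code and not the content of the patch; `ToyAssistant` and all of its methods are hypothetical names.

```python
import threading
from collections import deque


class ToyAssistant:
    """Hypothetical stand-in for a request list shared by a producer and an engine."""

    def __init__(self):
        self._lock = threading.Lock()
        self._req_list = deque()
        self._engine_active = True

    def add_request(self, req):
        """Queue req only while the engine is still working; return True if queued."""
        with self._lock:
            if not self._engine_active:
                # Engine is out of work: refuse instead of leaving a stale entry behind.
                return False
            self._req_list.append(req)
            return True

    def engine_finish(self):
        """Mark the engine as out of work and hand back whatever was still queued."""
        with self._lock:
            self._engine_active = False
            drained = list(self._req_list)
            self._req_list.clear()
        return drained

    def double_scan(self):
        """Analogue of the failing check: the list must be empty at this point."""
        with self._lock:
            assert not self._req_list, "request list not empty at double scan"
```

In this model, a request that arrives after `engine_finish()` is rejected rather than queued, so `double_scan()` cannot trip over a leftover entry.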

            jamesanunez James Nunez (Inactive) added a comment -

            I ported Fan Yong's patch to b2_10 to see if this will help with the issues seen on the soak cluster.

            gerrit Gerrit Updater added a comment -

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30375
            Subject: LU-10134 lfsck: not add requests if engine out of work
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 90df962c58fb18367291dfff8698fdb1e70d9188

            cliffw Cliff White (Inactive) added a comment -

            After the first assert hits, attempting to restart the systems results in further asserts, so we can't start soak at this time. 3 out of our 4 MDS nodes are now halted.

            soak-10.log:Dec  5 16:44:33 soak-10 kernel: LustreError: 2593:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
            soak-10.log:Dec  5 16:44:34 soak-10 kernel: Kernel panic - not syncing: LBUG
            soak-8.log:Dec  5 16:44:31 soak-8 kernel: LustreError: 2529:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
            soak-8.log:Dec  5 16:44:32 soak-8 kernel: Kernel panic - not syncing: LBUG
            soak-9.log:Dec  5 00:32:11 soak-9 kernel: LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
            soak-9.log:Dec  5 00:32:11 soak-9 kernel: Kernel panic - not syncing: LBUG
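
The grep output above has a regular shape, so the nodes that hit the LBUG can be listed mechanically. This is a minimal sketch, assuming the `<host> kernel: LustreError: <pid>:0:(<file>:<line>:<func>()) <message>` syslog layout seen in this ticket. The regex and the `parse_lustre_error` and `hosts_with_lbug` helpers are illustrative names, not part of Lustre.

```python
import re

# Assumed layout (taken from the syslog lines in this ticket):
#   <host> kernel: LustreError: <pid>:0:(<file>:<line>:<func>()) <message>
LUSTRE_ERROR = re.compile(
    r"(?P<host>\S+) kernel: LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)


def parse_lustre_error(log_line):
    """Return the fields of one LustreError syslog line, or None if it does not match."""
    match = LUSTRE_ERROR.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    fields["msg"] = fields["msg"].strip()
    return fields


def hosts_with_lbug(log_lines):
    """Collect the hosts whose LustreError message is exactly 'LBUG', sorted by name."""
    hosts = set()
    for log_line in log_lines:
        fields = parse_lustre_error(log_line)
        if fields is not None and fields["msg"] == "LBUG":
            hosts.add(fields["host"])
    return sorted(hosts)
```

Lines in the bare dmesg format (timestamp in brackets, no host name) do not match this pattern and would need a separate expression.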

            cliffw Cliff White (Inactive) added a comment -

            We just hit this assert on b2.10.2-RC1. A crash dump is available on Spirit.

            [ 1256.526444] Lustre: Failing over soaked-MDT0000
            [ 1256.536364] LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
            [ 1256.537017] LustreError: 2303:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88080e0d1800 ns: mdt-soaked-MDT0000_UUID lock: ffff880811089600/0x18106beaab642978 lrc: 3/0,0 mode: PR/PR res: [0x200000405:0x1:0x0].0x0 bits 0x13 rrc: 2 type: IBT flags: 0x50200000000000 nid: 192.168.1.117@o2ib remote: 0xd36fb2fda127ff7 expref: 3 pid: 2303 timeout: 0 lvb_type: 0
            [ 1256.580979] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.117@o2ib (stopping)
            [ 1256.614544] LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
            [ 1256.625905] Pid: 2796, comm: lfsck
            [ 1256.631630]
            Call Trace:
            [ 1256.639771] [<ffffffffc0dd57ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [ 1256.648850] [<ffffffffc0dd583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [ 1256.657599] [<ffffffffc14964c8>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
            [ 1256.667697] [<ffffffffc148d6a9>] lfsck_double_scan+0x59/0x200 [lfsck]
            [ 1256.676632] [<ffffffffc143344a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
            [ 1256.686386] [<ffffffff811dee33>] ? kfree+0x103/0x140
            [ 1256.693515] [<ffffffffc1492204>] lfsck_master_engine+0x494/0x12f0 [lfsck]
            [ 1256.702670] [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
            [ 1256.710967] [<ffffffffc1491d70>] ? lfsck_master_engine+0x0/0x12f0 [lfsck]
            [ 1256.720038] [<ffffffff810b099f>] kthread+0xcf/0xe0
            [ 1256.726787] [<ffffffff810bf114>] ? finish_task_switch+0x54/0x160
            [ 1256.734873] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
            [ 1256.741642] [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
            [ 1256.748880] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
            [ 1256.755609]
            [ 1256.758443] Kernel panic - not syncing: LBUG

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30165
            Subject: LU-10134 lfsck: not add requests if engine out of work
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bd6411b44714543fae3e841ffd055e67f105d24b

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Fan Yong,

            Can you please look at this issue? It has become a roadblock for soak testing on master.

            Thanks.
            Joe

            cliffw Cliff White (Inactive) added a comment -

            After this incident, the system is completely non-recoverable. Any attempt to re-mount MDT0 causes this LBUG to hit again and again. Multiple core dumps are available on spirit.

            cliffw Cliff White (Inactive) added a comment -

            Hit this issue again with 2.10.55. It seems to reproduce consistently.
            A vmcore is available on soak.

            Nov 15 15:20:09 soak-9 systemd: Stopping User Slice of root.
            Nov 15 15:20:18 soak-9 kernel: Lustre: soaked-MDT0000: Recovery over after 1:42, of 32 clients 32 recovered and 0 were evicted.
            Nov 15 15:20:18 soak-9 kernel: LustreError: 27404:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
            Nov 15 15:20:18 soak-9 kernel: LustreError: 27404:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) LBUG
            Nov 15 15:20:18 soak-9 kernel: Pid: 27404, comm: lfsck
            Nov 15 15:20:18 soak-9 kernel: #012Call Trace:
            Nov 15 15:20:18 soak-9 kernel: [<ffffffffc0dcd7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            Nov 15 15:20:18 soak-9 kernel: [<ffffffffc0dcd83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            Nov 15 15:20:18 soak-9 kernel: [<ffffffffc14a0398>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
            Nov 15 15:20:18 soak-9 kernel: [<ffffffffc14975a9>] lfsck_double_scan+0x59/0x200 [lfsck]
            Nov 15 15:20:18 soak-9 kernel: [<ffffffffc143c4ea>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
            Nov 15 15:20:18 soak-9 kernel: [<ffffffff811dee33>] ? kfree+0x103/0x140
            Nov 15 15:20:18 soak-9 kernel: [<ffffffffc149c134>] lfsck_master_engine+0x494/0x12b0 [lfsck]
            Nov 15 15:20:18 soak-9 kernel: [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
            Nov 15 15:20:18 soak-9 kernel: [<ffffffffc149bca0>] ? lfsck_master_engine+0x0/0x12b0 [lfsck]
            Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
            Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
            Nov 15 15:20:18 soak-9 kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
            Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
            Nov 15 15:20:18 soak-9 kernel:
            Nov 15 15:20:18 soak-9 kernel: Kernel panic - not syncing: LBUG

            cliffw Cliff White (Inactive) added a comment -

            Hit this issue again with latest 2.8-fe

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 6
