[LU-10134] LBUG lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed: Created: 18/Oct/17  Updated: 04/Jan/18  Resolved: 17/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.2
Fix Version/s: Lustre 2.11.0, Lustre 2.10.3

Type: Bug
Priority: Critical
Reporter: Cliff White (Inactive)
Assignee: nasf (Inactive)
Resolution: Fixed
Votes: 0
Labels: soak
Environment: Soak cluster, lustre-master build 3654
Attachments: vmcore-dmesg.txt, vmcore-dmesg.txt
Issue Links:
Related
is related to LU-8647 lfsck_namespace_double_scan()) ASSERT... Resolved
Severity: 3

 Description   

During a soak mds_failover test run, soak-9 was powered off and MDT0001 failed over to soak-8.

2017-10-17 22:18:47,423:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0001 on soak-8
Soak-8 log:
Oct 17 22:19:43 soak-8 kernel: Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 36 clients reconnect
Oct 17 22:19:51 soak-8 kernel: Lustre: soaked-MDT0001: Connection restored to 7553a398-a79b-c096-e685-e7284e2b17df (at 172.16.1.47@o2ib1)

Recovery completes, then the system hits the LBUG and panics:

Oct 17 22:21:03 soak-8 kernel: Lustre: soaked-MDT0001: Recovery over after 1:20, of 36 clients 36 recovered and 0 were evicted.
Oct 17 22:21:08 soak-8 kernel: LustreError: 3745:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
Oct 17 22:21:08 soak-8 kernel: LustreError: 3745:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) LBUG
Oct 17 22:21:08 soak-8 kernel: Pid: 3745, comm: lfsck
Oct 17 22:21:08 soak-8 kernel: #012Call Trace:
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc0dc37ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc0dc383c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc14ef398>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc14e65a9>] lfsck_double_scan+0x59/0x200 [lfsck]
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc143550a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
Oct 17 22:21:09 soak-8 kernel: [<ffffffff811deec3>] ? kfree+0x103/0x140
Oct 17 22:21:09 soak-8 kernel: [<ffffffffc14eb134>] lfsck_master_engine+0x494/0x12b0 [lfsck]
Oct 17 22:21:09 soak-8 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
Oct 17 22:21:09 soak-8 kernel: [<ffffffffc14eaca0>] ? lfsck_master_engine+0x0/0x12b0 [lfsck]
Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Oct 17 22:21:09 soak-8 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Oct 17 22:21:09 soak-8 kernel:
Oct 17 22:21:09 soak-8 kernel: Kernel panic - not syncing: LBUG

Crash dump from soak-8 is available on spirit cluster at: /scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2017-10-17-22:21:34
vmcore-dmesg.txt attached



 Comments   
Comment by James Nunez (Inactive) [ 18/Oct/17 ]

Fan Yong - Would you please comment on this issue?

Thank you

Comment by Cliff White (Inactive) [ 01/Nov/17 ]

Hit this issue again with latest 2.8-fe

Comment by Cliff White (Inactive) [ 15/Nov/17 ]

Hit this issue again with 2.10.55. It seems to reproduce consistently.
A vmcore is available on soak.

Nov 15 15:20:09 soak-9 systemd: Stopping User Slice of root.
Nov 15 15:20:18 soak-9 kernel: Lustre: soaked-MDT0000: Recovery over after 1:42, of 32 clients 32 recovered and 0 were evicted.
Nov 15 15:20:18 soak-9 kernel: LustreError: 27404:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
Nov 15 15:20:18 soak-9 kernel: LustreError: 27404:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) LBUG
Nov 15 15:20:18 soak-9 kernel: Pid: 27404, comm: lfsck
Nov 15 15:20:18 soak-9 kernel: #012Call Trace:
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc0dcd7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc0dcd83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc14a0398>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc14975a9>] lfsck_double_scan+0x59/0x200 [lfsck]
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc143c4ea>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
Nov 15 15:20:18 soak-9 kernel: [<ffffffff811dee33>] ? kfree+0x103/0x140
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc149c134>] lfsck_master_engine+0x494/0x12b0 [lfsck]
Nov 15 15:20:18 soak-9 kernel: [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc149bca0>] ? lfsck_master_engine+0x0/0x12b0 [lfsck]
Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
Nov 15 15:20:18 soak-9 kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
Nov 15 15:20:18 soak-9 kernel:
Nov 15 15:20:18 soak-9 kernel: Kernel panic - not syncing: LBUG

Comment by Cliff White (Inactive) [ 16/Nov/17 ]

After this incident, the system is completely non-recoverable. Any attempt to re-mount MDT0 causes this LBUG to hit again, and again. Multiple core dumps are available on spirit.

Comment by Joseph Gmitter (Inactive) [ 17/Nov/17 ]

Hi Fan Yong,

Can you please look at this issue? It has become a roadblock for soak testing on master.

Thanks.
Joe

Comment by Gerrit Updater [ 19/Nov/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30165
Subject: LU-10134 lfsck: not add requests if engine out of work
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bd6411b44714543fae3e841ffd055e67f105d24b
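
The patch subject ("not add requests if engine out of work") suggests the idea of refusing to queue new requests once the consuming engine has stopped, so that the request list is guaranteed to be empty when the double scan begins. The following is only a minimal sketch of that idea, not the contents of review 30165: `struct assistant`, `engine_running`, `assistant_add_req()`, `assistant_stop()` and `double_scan_precondition()` are hypothetical names, and `req_list` merely stands in for `lad_req_list`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real Lustre structures. */
struct req {
    struct req *next;
};

struct assistant {
    struct req *req_list;   /* plays the role of lad_req_list */
    bool engine_running;    /* assumed flag: engine can still consume work */
};

/* Queue a request only while the engine can still drain it.
 * Returns true if the request was queued, false if it was rejected. */
static bool assistant_add_req(struct assistant *a, struct req *r)
{
    if (!a->engine_running)
        return false;       /* caller keeps ownership of r */
    r->next = a->req_list;
    a->req_list = r;
    return true;
}

/* The engine discards whatever is still queued and stops accepting work. */
static void assistant_stop(struct assistant *a)
{
    a->req_list = NULL;
    a->engine_running = false;
}

/* Analogue of the failed assertion: the list must be empty at this point. */
static bool double_scan_precondition(const struct assistant *a)
{
    return a->req_list == NULL;
}
```

In this sketch, a producer that calls `assistant_add_req()` after `assistant_stop()` gets `false` back instead of leaving a stale entry on the list, so the precondition checked before the double scan continues to hold.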

Comment by Cliff White (Inactive) [ 05/Dec/17 ]

We just hit this assert on b2.10.2-RC1. A crash dump is available on Spirit.


[ 1256.526444] Lustre: Failing over soaked-MDT0000
[ 1256.536364] LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
[ 1256.537017] LustreError: 2303:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88080e0d1800 ns: mdt-soaked-MDT0000_UUID lock: ffff880811089600/0x18106beaab642978 lrc: 3/0,0 mode: PR/PR res: [0x200000405:0x1:0x0].0x0 bits 0x13 rrc: 2 type: IBT flags: 0x50200000000000 nid: 192.168.1.117@o2ib remote: 0xd36fb2fda127ff7 expref: 3 pid: 2303 timeout: 0 lvb_type: 0
[ 1256.580979] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.117@o2ib (stopping)
[ 1256.614544] LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
[ 1256.625905] Pid: 2796, comm: lfsck
[ 1256.631630]
Call Trace:
[ 1256.639771] [<ffffffffc0dd57ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1256.648850] [<ffffffffc0dd583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1256.657599] [<ffffffffc14964c8>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
[ 1256.667697] [<ffffffffc148d6a9>] lfsck_double_scan+0x59/0x200 [lfsck]
[ 1256.676632] [<ffffffffc143344a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
[ 1256.686386] [<ffffffff811dee33>] ? kfree+0x103/0x140
[ 1256.693515] [<ffffffffc1492204>] lfsck_master_engine+0x494/0x12f0 [lfsck]
[ 1256.702670] [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
[ 1256.710967] [<ffffffffc1491d70>] ? lfsck_master_engine+0x0/0x12f0 [lfsck]
[ 1256.720038] [<ffffffff810b099f>] kthread+0xcf/0xe0
[ 1256.726787] [<ffffffff810bf114>] ? finish_task_switch+0x54/0x160
[ 1256.734873] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 1256.741642] [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[ 1256.748880] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 1256.755609]
[ 1256.758443] Kernel panic - not syncing: LBUG

Comment by Cliff White (Inactive) [ 05/Dec/17 ]

After the first assert hits, attempting to restart the systems results in further asserts, so we cannot start soak at this time. Three of our four MDS nodes are now halted.

soak-10.log:Dec  5 16:44:33 soak-10 kernel: LustreError: 2593:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
soak-10.log:Dec  5 16:44:34 soak-10 kernel: Kernel panic - not syncing: LBUG
soak-8.log:Dec  5 16:44:31 soak-8 kernel: LustreError: 2529:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
soak-8.log:Dec  5 16:44:32 soak-8 kernel: Kernel panic - not syncing: LBUG
soak-9.log:Dec  5 00:32:11 soak-9 kernel: LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
soak-9.log:Dec  5 00:32:11 soak-9 kernel: Kernel panic - not syncing: LBUG

Comment by Gerrit Updater [ 05/Dec/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30375
Subject: LU-10134 lfsck: not add requests if engine out of work
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 90df962c58fb18367291dfff8698fdb1e70d9188

Comment by James Nunez (Inactive) [ 05/Dec/17 ]

I ported Fan Yong's patch to b2_10 to see if this will help with the issues seen on the soak cluster.

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30165/
Subject: LU-10134 lfsck: not add requests if engine out of work
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f22a0ab6c37db2d983451ec01e869ed8d3226cb2

Comment by Peter Jones [ 17/Dec/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 04/Jan/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30375/
Subject: LU-10134 lfsck: not add requests if engine out of work
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 02065756c83057d3acc35792b744078f2739025d
