[LU-10134] LBUG lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed: Created: 18/Oct/17 Updated: 04/Jan/18 Resolved: 17/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.10.2 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak cluster, lustre-master build 3654 |
||
| Severity: | 3 | ||||||||
| Description |
|
During a soak mds_failover test run, soak-9 was powered off and MDT0001 failed over to soak-8:
2017-10-17 22:18:47,423:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0001 on soak-8
Soak-8 log:
Oct 17 22:19:43 soak-8 kernel: Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 36 clients reconnect
Oct 17 22:19:51 soak-8 kernel: Lustre: soaked-MDT0001: Connection restored to 7553a398-a79b-c096-e685-e7284e2b17df (at 172.16.1.47@o2ib1)
Recovery completes, system hits LBUG and dies:
Oct 17 22:21:03 soak-8 kernel: Lustre: soaked-MDT0001: Recovery over after 1:20, of 36 clients 36 recovered and 0 were evicted.
Oct 17 22:21:08 soak-8 kernel: LustreError: 3745:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
Oct 17 22:21:08 soak-8 kernel: LustreError: 3745:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) LBUG
Oct 17 22:21:08 soak-8 kernel: Pid: 3745, comm: lfsck
Oct 17 22:21:08 soak-8 kernel: #012Call Trace:
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc0dc37ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc0dc383c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc14ef398>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc14e65a9>] lfsck_double_scan+0x59/0x200 [lfsck]
Oct 17 22:21:08 soak-8 kernel: [<ffffffffc143550a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
Oct 17 22:21:09 soak-8 kernel: [<ffffffff811deec3>] ? kfree+0x103/0x140
Oct 17 22:21:09 soak-8 kernel: [<ffffffffc14eb134>] lfsck_master_engine+0x494/0x12b0 [lfsck]
Oct 17 22:21:09 soak-8 kernel: [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
Oct 17 22:21:09 soak-8 kernel: [<ffffffffc14eaca0>] ? lfsck_master_engine+0x0/0x12b0 [lfsck]
Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Oct 17 22:21:09 soak-8 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Oct 17 22:21:09 soak-8 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Oct 17 22:21:09 soak-8 kernel:
Oct 17 22:21:09 soak-8 kernel: Kernel panic - not syncing: LBUG
Crash dump from soak-8 is available on spirit cluster at: /scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2017-10-17-22:21:34 |
| Comments |
| Comment by James Nunez (Inactive) [ 18/Oct/17 ] |
|
Fan Yong - Would you please comment on this issue? Thank you |
| Comment by Cliff White (Inactive) [ 01/Nov/17 ] |
|
Hit this issue again with latest 2.8-fe |
| Comment by Cliff White (Inactive) [ 15/Nov/17 ] |
|
Hit this issue again with 2.10.55. Seems to be pretty constant.
Nov 15 15:20:09 soak-9 systemd: Stopping User Slice of root.
Nov 15 15:20:18 soak-9 kernel: Lustre: soaked-MDT0000: Recovery over after 1:42, of 32 clients 32 recovered and 0 were evicted.
Nov 15 15:20:18 soak-9 kernel: LustreError: 27404:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
Nov 15 15:20:18 soak-9 kernel: LustreError: 27404:0:(lfsck_namespace.c:4571:lfsck_namespace_double_scan()) LBUG
Nov 15 15:20:18 soak-9 kernel: Pid: 27404, comm: lfsck
Nov 15 15:20:18 soak-9 kernel: #012Call Trace:
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc0dcd7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc0dcd83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc14a0398>] lfsck_namespace_double_scan+0x108/0x140 [lfsck]
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc14975a9>] lfsck_double_scan+0x59/0x200 [lfsck]
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc143c4ea>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
Nov 15 15:20:18 soak-9 kernel: [<ffffffff811dee33>] ? kfree+0x103/0x140
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc149c134>] lfsck_master_engine+0x494/0x12b0 [lfsck]
Nov 15 15:20:18 soak-9 kernel: [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
Nov 15 15:20:18 soak-9 kernel: [<ffffffffc149bca0>] ? lfsck_master_engine+0x0/0x12b0 [lfsck]
Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
Nov 15 15:20:18 soak-9 kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
Nov 15 15:20:18 soak-9 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
Nov 15 15:20:18 soak-9 kernel:
Nov 15 15:20:18 soak-9 kernel: Kernel panic - not syncing: LBUG |
| Comment by Cliff White (Inactive) [ 16/Nov/17 ] |
|
After this incident, the system is completely non-recoverable. Any attempt to re-mount MDT0 causes this LBUG to hit again, and again. Multiple core dumps are available on spirit. |
| Comment by Joseph Gmitter (Inactive) [ 17/Nov/17 ] |
|
Hi Fan Yong, Can you please look at this issue? It has become a roadblock for soak testing on master. Thanks. |
| Comment by Gerrit Updater [ 19/Nov/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30165 |
| Comment by Cliff White (Inactive) [ 05/Dec/17 ] |
|
We just hit this assert on b2.10.2-RC1; a crash dump is available on Spirit.
[ 1256.526444] Lustre: Failing over soaked-MDT0000 |
| Comment by Cliff White (Inactive) [ 05/Dec/17 ] |
|
After the first assert hits, attempting to restart the systems results in further asserts. Can't start soak at this time, due to this. 3 out of our 4 MDS are now halted.
soak-10.log:Dec 5 16:44:33 soak-10 kernel: LustreError: 2593:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
soak-10.log:Dec 5 16:44:34 soak-10 kernel: Kernel panic - not syncing: LBUG
soak-8.log:Dec 5 16:44:31 soak-8 kernel: LustreError: 2529:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
soak-8.log:Dec 5 16:44:32 soak-8 kernel: Kernel panic - not syncing: LBUG
soak-9.log:Dec 5 00:32:11 soak-9 kernel: LustreError: 2796:0:(lfsck_namespace.c:4582:lfsck_namespace_double_scan()) LBUG
soak-9.log:Dec 5 00:32:11 soak-9 kernel: Kernel panic - not syncing: LBUG |
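Editorial note: a per-host summary like the one above could be produced with a small script. The following is only a minimal sketch, assuming syslog lines in the format quoted in this ticket, with an optional `<file>.log:` prefix as emitted by grep. The names `LBUG_RE` and `summarize_lbugs` are hypothetical and are not part of Lustre or the soak tooling.

```python
import re
from collections import defaultdict

# Hypothetical pattern for the "LustreError: ... LBUG" lines quoted in this ticket.
# The leading "<file>.log:" prefix (as added by grep over several files) is optional.
LBUG_RE = re.compile(
    r"^(?:(?P<file>\S+\.log):)?"
    r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+LustreError:\s+"
    r"(?P<pid>\d+):\d+:"
    r"\((?P<src>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+LBUG\s*$"
)


def summarize_lbugs(lines):
    """Group LBUG hits by host as (timestamp, function, source line) tuples.

    Lines that do not match, such as the "Kernel panic" lines, are ignored.
    """
    hits = defaultdict(list)
    for raw in lines:
        m = LBUG_RE.match(raw.strip())
        if m:
            hits[m["host"]].append((m["ts"], m["func"], int(m["line"])))
    return dict(hits)
```

Feeding it the six lines above would yield one entry each for soak-8, soak-9, and soak-10, all pointing at `lfsck_namespace_double_scan`.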
| Comment by Gerrit Updater [ 05/Dec/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30375 |
| Comment by James Nunez (Inactive) [ 05/Dec/17 ] |
|
I ported Fan Yong's patch to b2_10 to see if this will help with the issues seen on the soak cluster. |
| Comment by Gerrit Updater [ 17/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30165/ |
| Comment by Peter Jones [ 17/Dec/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 04/Jan/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30375/ |