[LU-14425] BUG: unable to handle kernel paging request at ffffffffffffffff Created: 11/Feb/21 Updated: 23/Mar/21 Resolved: 23/Mar/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Shaun Tancheff | Assignee: | WC Triage |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
[10444.024266] BUG: unable to handle kernel paging request at ffffffffffffffff
[10444.034709] IP: 0xffffffffffffffff
[10444.038729] PGD 1e0c067 P4D 1e0c067 PUD 1e0e067 PMD 0
[10444.046081] Oops: 0010 [#1] SMP PTI
[10444.050413] CPU: 1 PID: 2493 Comm: kgnilnd_sd_00 Tainted: P O 4.12.14-197.7_5.0.96-cray_ari_c #1 SLE15 (unreleased)
[10444.069060] Hardware name: Cray Inc. Cascade/Cascade, BIOS 4.6.5 09/05/2019
[10444.078578] task: ffff880f898c31c0 task.stack: ffffc90007610000
[10444.087691] RIP: 0010:0xffffffffffffffff
[10444.092405] RSP: 0018:ffffc90007613ae0 EFLAGS: 00010286
[10444.099635] RAX: ffff880f87d11bc0 RBX: ffff88078e307000 RCX: 00000000ffffffff
[10444.109846] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88078e5e8f18
[10444.119712] RBP: ffffc90007613b90 R08: 0000000000000000 R09: 0000000000000000
[10444.130956] R10: 0000000000000000 R11: ffffea001e0c1a60 R12: ffff88078e5e8e58
[10444.142241] R13: 0000000000000001 R14: 0000000000000000 R15: ffffffffffffffff
[10444.152325] FS: 0000000000000000(0000) GS:ffff88085f840000(0000) knlGS:0000000000000000
[10444.164237] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10444.172192] CR2: ffffffffffffffff CR3: 0000000001e0a001 CR4: 00000000001606e0
[10444.182560] Call Trace:
[10444.184378] ? lnet_finalize+0x750/0x1130 [lnet]
[10444.190241] ? _kgnilnd_schedule_conn+0x128/0x3a0 [kgnilnd]
[10444.197527] kgnilnd_recv+0x5b6/0xcb0 [kgnilnd]
[10444.203490] lnet_ni_recv+0x5a/0x2d0 [lnet]
[10444.209107] lnet_recv_put+0x7f/0xb0 [lnet]
[10444.214445] lnet_parse_local+0x642/0xdc0 [lnet]
[10444.220245] ? gni_cq_get_event+0x47/0xb10 [kgni_ari]
[10444.227225] lnet_parse+0xaf3/0x1120 [lnet]
[10444.231471] ? kmem_cache_alloc+0x1cb/0x5e0
[10444.236043] ? kgnilnd_check_fma_send_cq+0xdf5/0x1090 [kgnilnd]
[10444.244100] kgnilnd_check_fma_rx+0x166d/0x1e90 [kgnilnd]
[10444.251562] ? lock_timer_base+0x6b/0x90
[10444.256140] kgnilnd_process_conns+0x527/0xe50 [kgnilnd]
[10444.263252] ? kgnilnd_process_mapped_tx+0x574/0x810 [kgnilnd]
[10444.271873] kgnilnd_scheduler+0x199/0x5b0 [kgnilnd]
[10444.278550] ? wait_woken+0x80/0x80
[10444.282602] kthread+0x121/0x140
[10444.285913] ? kgnilnd_process_conns+0xe50/0xe50 [kgnilnd]
[10444.293607] ? kthread_create_worker_on_cpu+0x50/0x50
[10444.300612] ret_from_fork+0x3a/0x50
[10444.304848] Code: Bad RIP value.
[10444.308187] Modules linked in: mgc(O) lustre(O) lmv(O) mdc(O) fid(O) lov(O) fld(O) osc(O) ptlrpc(O) obdclass(O) pm_api(O) xpmem(O) cmsr(O) bpmcdmod(O) x86_pkg_temp_thermal freemem(O) pcie_link_bw_monitor(O) ib_core(O) kdreg(O) kgnilnd(O) dvs(O) dvsipc(O) dvsipc_lnet(O) lnet(O) libcfs(O) dvsproc(O) gpcd_ari(O) ipogif_ari(O) kgni_ari(O) hwerr(PO) rca(O) heartbeat(O) simplex(PO) hss_os(PO) ghal_ari(O) craytrace(O)
[10444.369702] CR2: ffffffffffffffff
[10444.374673] ---[ end trace bd9de6e45567314b ]---
[10444.379464] RIP: 0010:0xffffffffffffffff
[10444.383581] RSP: 0018:ffffc90007613ae0 EFLAGS: 00010286
[10444.389141] RAX: ffff880f87d11bc0 RBX: ffff88078e307000 RCX: 00000000ffffffff
[10444.397849] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88078e5e8f18
[10444.406839] RBP: ffffc90007613b90 R08: 0000000000000000 R09: 0000000000000000
[10444.415731] R10: 0000000000000000 R11: ffffea001e0c1a60 R12: ffff88078e5e8e58
[10444.425544] R13: 0000000000000001 R14: 0000000000000000 R15: ffffffffffffffff
[10444.435234] FS: 0000000000000000(0000) GS:ffff88085f840000(0000) knlGS:0000000000000000
[10444.447830] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10444.455364] CR2: ffffffffffffffff CR3: 0000000001e0a001 CR4: 00000000001606e0
[10444.465273] Kernel panic - not syncing: Fatal exception
[10445.505720] Shutting down cpus with NMI
[10445.520518] Kernel Offset: disabled
[10445.524528] ---[ end Kernel panic - not syncing: Fatal exception

The crash appears to be here:

static void
lnet_msg_detach_md(struct lnet_msg *msg, int status)
{
	struct lnet_libmd *md = msg->msg_md;
	lnet_handler_t handler = NULL;
	int cpt = lnet_cpt_of_cookie(md->md_lh.lh_cookie);
	int unlink;

	lnet_res_lock(cpt);
	while (md->md_flags & LNET_MD_FLAG_HANDLING)
		/* An event handler is running - wait for it to
		 * complete to avoid races.
		 */
		lnet_md_wait_handling(md, cpt);

	/* Now it's safe to drop my caller's ref */
	md->md_refcount--;
	LASSERT(md->md_refcount >= 0);

	unlink = lnet_md_unlinkable(md);
	if (md->md_handler) {
		if ((md->md_flags & LNET_MD_FLAG_ABORTED) && !status) {
			msg->msg_ev.status = -ETIMEDOUT;
			CDEBUG(D_NET, "md 0x%p already unlinked\n", md);
		} else {
			msg->msg_ev.status = status;
		}
		msg->msg_ev.unlinked = unlink;
		handler = md->md_handler;
		if (!unlink)
			md->md_flags |= LNET_MD_FLAG_HANDLING;
	}

	if (unlink || (md->md_refcount == 0 &&
		       md->md_threshold == LNET_MD_THRESH_INF))
		lnet_detach_rsp_tracker(md, cpt);

	msg->msg_md = NULL;
	if (unlink)
		lnet_md_unlink(md);

	lnet_res_unlock(cpt);

	if (handler) {
		handler(&msg->msg_ev);
		if (!unlink) {
			lnet_res_lock(cpt);	/* <<-- crash is here */
			md->md_flags &= ~LNET_MD_FLAG_HANDLING;
			wake_up_var(md);
			lnet_res_unlock(cpt);
		}
	}
}
Reverted |
| Comments |
| Comment by Andreas Dilger [ 11/Feb/21 ] |
|
What version is this? |
| Comment by Shaun Tancheff [ 11/Feb/21 ] |
|
master from commit: 9cd651aead327ae4589b58dde5818b068c89b3e5 |
| Comment by Andreas Dilger [ 11/Feb/21 ] |
|
Shaun, what kind of load was this under, and how often have you hit it? Since 2.14.0 is almost out the door, unless this is a "falls over continuously and is unusable" kind of bug, then it may just miss 2.14.0 and be landed afterward. That said, I'm not aware of any particular requirement for the original |
| Comment by Andreas Dilger [ 11/Feb/21 ] |
|
Oh, I didn't notice before - is this only affecting kgnilnd? That would put it into the realm of "not many people will use vanilla 2.14.0 in this config", since they will almost certainly be running a Cray-supplied client, so that pretty much takes it off the 2.14.0 candidate list, and a fix can be included later. |
| Comment by Shaun Tancheff [ 12/Feb/21 ] |
|
The load was an I/O stress test with 40 client nodes. I will be bringing the system up to 2.14.0 RC2 with the required Cray additions (gnilnd fixes) and |
| Comment by Cory Spitz [ 14/Feb/21 ] |
|
adilger, while we have yet to experience this bug in an environment without gnilnd clients, it doesn't seem like a defect with the patch from |
| Comment by Andreas Dilger [ 14/Feb/21 ] |
|
This is totally outside my area of expertise, but is there anything unusual in kgnilnd_recv() that is not happening in other LNDs? Also, it seems a bit strange that it is crashing when accessing "cpt", since CPTs shouldn't really be changing except at startup, but I don't know this part of the code very well. |
| Comment by Neil Brown [ 14/Feb/21 ] |
|
I'm not convinced that you've identified the crash location correctly. I assume you found the location of the top address of the stack:

lnet_finalize+0x750/0x1130

This looks to be a return address that was pushed onto the stack when handler() was called. The "?" at the start of the line indicates that this address is not part of the current call stack; it is just an old return address that happens to be on the stack.

The place we should be looking is:

kgnilnd_recv+0x5b6/0xcb0

If you can identify the line of code which this address corresponds to, that should help identify the root problem. |
| Comment by Chris Horn [ 23/Mar/21 ] |
|
This turned out to be a bug with DVS. We're porting DVS to work with some of these recent changes to the LNet API (namely |