Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.12.2
-
None
-
3
-
9223372036854775807
Description
Disconnecting from an idle import results in a NULL pointer dereference if the ptlrpc request cannot be allocated due to insufficient free memory.
From console log:
> 2019-07-28T07:17:40.193910-05:00 c3-0c1s14n3 page allocation failure: order:0 > 2019-07-28T07:17:40.193923-05:00 c3-0c1s14n3 / mems_allowed=0-1 > 2019-07-28T07:17:40.193938-05:00 c3-0c1s14n3 LustreError: 21724:0:(client.c:893:__ptlrpc_request_alloc()) request allocation out of memory > 2019-07-28T07:17:40.193953-05:00 c3-0c1s14n3 , mode:0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask= > 2019-07-28T07:17:40.193967-05:00 c3-0c1s14n3 BUG: unable to handle kernel > 2019-07-28T07:17:40.193992-05:00 c3-0c1s14n3 (null) > 2019-07-28T07:17:40.194007-05:00 c3-0c1s14n3 NULL pointer dereference at 0000000000000160 > 2019-07-28T07:17:40.194073-05:00 c3-0c1s14n3 IP: ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc] > 2019-07-28T07:17:40.194088-05:00 c3-0c1s14n3 PGD 0 > 2019-07-28T07:17:40.194103-05:00 c3-0c1s14n3 memfill2 cpuset= > 2019-07-28T07:17:40.194116-05:00 c3-0c1s14n3 P4D 0 > 2019-07-28T07:17:40.194131-05:00 c3-0c1s14n3 Oops: 0002 [#1] SMP PTI > 2019-07-28T07:17:40.194146-05:00 c3-0c1s14n3 CPU: 25 PID: 21724 Comm: kworker/25:1 Tainted: P O 4.12.14-150.17_5.0.84-cray_ari_c #1 SLE15 (unreleased) > 2019-07-28T07:17:40.194161-05:00 c3-0c1s14n3 Hardware name: Cray Inc. Cascade/Cascade, BIOS 5.11 03/22/2018 > 2019-07-28T07:17:40.194175-05:00 c3-0c1s14n3 2352114 > 2019-07-28T07:17:40.194190-05:00 c3-0c1s14n3 Workqueue: ptlrpc_pinger ptlrpc_pinger_main [ptlrpc] > 2019-07-28T07:17:40.194204-05:00 c3-0c1s14n3 mems_allowed=0-1 > 2019-07-28T07:17:40.194218-05:00 c3-0c1s14n3 task: ffff8815c1696000 task.stack: ffffc9000753c000 > 2019-07-28T07:17:40.194232-05:00 c3-0c1s14n3 RIP: 0010:ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc] > 2019-07-28T07:17:40.194246-05:00 c3-0c1s14n3 RSP: 0000:ffffc9000753fe18 EFLAGS: 00010296 > 2019-07-28T07:17:40.194261-05:00 c3-0c1s14n3 RAX: 0000000000000001 RBX: ffff881f8c920000 RCX: 0000000000000000 > 2019-07-28T07:17:40.194276-05:00 c3-0c1s14n3 RDX: 0000000000000336 RSI: 0000000000000000 RDI: ffff881f90478c80 > 2019-07-28T07:17:40.194290-05:00 c3-0c1s14n3 RBP: ffffc9000753fe28 R08: 0000000000000000 R09: ffff88118101a358 > 2019-07-28T07:17:40.194304-05:00 c3-0c1s14n3 R10: ffffc9000753f960 R11: ffff88118101a392 R12: 0000000000000000 > 2019-07-28T07:17:40.194319-05:00 c3-0c1s14n3 R13: 0000000000000000 R14: ffff881f8c920278 R15: ffff881f8c920020 > 2019-07-28T07:17:40.194333-05:00 c3-0c1s14n3 FS: 0000000000000000(0000) GS:ffff88203f5c0000(0000) knlGS:0000000000000000 > 2019-07-28T07:17:40.194347-05:00 c3-0c1s14n3 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > 2019-07-28T07:17:40.194362-05:00 c3-0c1s14n3 CR2: 0000000000000160 CR3: 0000000001e0a002 CR4: 00000000003606e0 > 2019-07-28T07:17:40.194376-05:00 c3-0c1s14n3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > 2019-07-28T07:17:40.194390-05:00 c3-0c1s14n3 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > 2019-07-28T07:17:40.194404-05:00 c3-0c1s14n3 Call Trace: > 2019-07-28T07:17:40.194419-05:00 c3-0c1s14n3 ptlrpc_pinger_main+0x53b/0x900 [ptlrpc] > 2019-07-28T07:17:40.194432-05:00 c3-0c1s14n3 process_one_work+0x260/0x4a0 > 2019-07-28T07:17:40.194446-05:00 c3-0c1s14n3 worker_thread+0x45/0x420 > 2019-07-28T07:17:40.194460-05:00 c3-0c1s14n3 kthread+0x127/0x140 > 2019-07-28T07:17:40.194475-05:00 c3-0c1s14n3 ? trace_event_raw_event_workqueue_execute_start+0x90/0x90 > 2019-07-28T07:17:40.194490-05:00 c3-0c1s14n3 ? kthread_create_on_node+0x40/0x40 > 2019-07-28T07:17:40.194504-05:00 c3-0c1s14n3 ret_from_fork+0x3a/0x50 > 2019-07-28T07:17:40.194591-05:00 c3-0c1s14n3 CR2: 0000000000000160 > 2019-07-28T07:17:40.194606-05:00 c3-0c1s14n3 ---[ end trace 26bbd436eb3f80b0 ]--- > 2019-07-28T07:17:40.194620-05:00 c3-0c1s14n3 RIP: 0010:ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc] > 2019-07-28T07:17:40.194645-05:00 c3-0c1s14n3 RSP: 0000:ffffc9000753fe18 EFLAGS: 00010296 > 2019-07-28T07:17:40.194672-05:00 c3-0c1s14n3 RAX: 0000000000000001 RBX: ffff881f8c920000 RCX: 0000000000000000 > 2019-07-28T07:17:40.194686-05:00 c3-0c1s14n3 RDX: 0000000000000336 RSI: 0000000000000000 RDI: ffff881f90478c80 > 2019-07-28T07:17:40.194701-05:00 c3-0c1s14n3 RBP: ffffc9000753fe28 R08: 0000000000000000 R09: ffff88118101a358 > 2019-07-28T07:17:40.194715-05:00 c3-0c1s14n3 R10: ffffc9000753f960 R11: ffff88118101a392 R12: 0000000000000000 > 2019-07-28T07:17:40.194729-05:00 c3-0c1s14n3 R13: 0000000000000000 R14: ffff881f8c920278 R15: ffff881f8c920020 > 2019-07-28T07:17:40.194744-05:00 c3-0c1s14n3 FS: 0000000000000000(0000) GS:ffff88203f5c0000(0000) knlGS:0000000000000000 > 2019-07-28T07:17:40.194758-05:00 c3-0c1s14n3 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > 2019-07-28T07:17:40.194772-05:00 c3-0c1s14n3 CR2: 0000000000000160 CR3: 0000000001e0a002 CR4: 00000000003606e0 > 2019-07-28T07:17:40.194788-05:00 c3-0c1s14n3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > 2019-07-28T07:17:40.194802-05:00 c3-0c1s14n3 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > 2019-07-28T07:17:40.194818-05:00 c3-0c1s14n3 Kernel panic - not syncing: Fatal exception > 2019-07-28T07:17:40.194832-05:00 c3-0c1s14n3 Shutting down cpus with NMI > 2019-07-28T07:17:40.194846-05:00 c3-0c1s14n3 Kernel Offset: disabled > 2019-07-28T07:17:40.194860-05:00 c3-0c1s14n3 ---[ end Kernel panic - not syncing: Fatal exception
From dklog:
crash_x86_64> epython dk | grep ':21724:' 00000100:00080000:25.0:1564316239.230647:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0001-osc-ffff880f866d1000: disconnect after 81s idle 00000100:00080000:25.0:1564316239.230650:000000:21724:0:(import.c:1624:ptlrpc_disconnect_prep_req()) ffff881f8aeee000 snx11242-OST0002_UUID: changing import state from FULL to CONNECTING 00000100:00080000:25.0:1564316239.230651:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0002-osc-ffff880f866d1000: disconnect after 80s idle 00000100:00020000:25.0:1564316255.177461:000000:21724:0:(client.c:893:__ptlrpc_request_alloc()) request allocation out of memory 00000100:00080000:25.0:1564316255.180754:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0003-osc-ffff880f866d1000: disconnect after 97s idle crash_x86_64> epython obd_devs obd_device obd_name ip_address client_obd obd_import imp_state ish_time index conn_cnt ======================================================================================================================================================== 0xffff881fa37703c0 snx11242-OST0003 10.155.10.8@o2ib3003 0xffff881fa3770868 0xffff881f8c920000 FULL 1564316158 5 141
obd_import 0xffff881f8c920000 is idle so ptlrpc_disconnect_and_idle_import() is called.
Node is out of memory so ptlrpc_disconnect_prep_req(imp) returns NULL but error handling only checks for a bad pointer. Doesn't handle the NULL case.
1757 int ptlrpc_disconnect_and_idle_import(struct obd_import *imp) 1758 { ... 1775 req = ptlrpc_disconnect_prep_req(imp); 1776 if (IS_ERR(req)) 1777 RETURN(PTR_ERR(req)); 1778 1779 CDEBUG_LIMIT(imp->imp_idle_debug, "%s: disconnect after %llus idle\n", 1780 imp->imp_obd->obd_name, 1781 ktime_get_real_seconds() - imp->imp_last_reply_time); 1782 req->rq_interpret_reply = ptlrpc_disconnect_idle_interpret; <---- NULL pointer dereference
Attachments
Issue Links
- is related to
-
LU-7236 OST connect and disconnect on demand
- Resolved