[LU-12799] NULL pointer dereference IP: ptlrpc_disconnect_and_idle_import Created: 24/Sep/19  Updated: 10/Feb/20  Resolved: 23/Oct/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4

Type: Bug Priority: Major
Reporter: Ann Koehler (Inactive) Assignee: Ann Koehler (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Disconnecting from an idle import results in a NULL pointer dereference if the ptlrpc request cannot be allocated due to insufficient free memory.

From console log:

> 2019-07-28T07:17:40.193910-05:00 c3-0c1s14n3 page allocation failure: order:0
> 2019-07-28T07:17:40.193923-05:00 c3-0c1s14n3 / mems_allowed=0-1
> 2019-07-28T07:17:40.193938-05:00 c3-0c1s14n3 LustreError: 21724:0:(client.c:893:__ptlrpc_request_alloc()) request allocation out of memory
> 2019-07-28T07:17:40.193953-05:00 c3-0c1s14n3 , mode:0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=
> 2019-07-28T07:17:40.193967-05:00 c3-0c1s14n3 BUG: unable to handle kernel
> 2019-07-28T07:17:40.193992-05:00 c3-0c1s14n3 (null)
> 2019-07-28T07:17:40.194007-05:00 c3-0c1s14n3 NULL pointer dereference at 0000000000000160
> 2019-07-28T07:17:40.194073-05:00 c3-0c1s14n3 IP: ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc]
> 2019-07-28T07:17:40.194088-05:00 c3-0c1s14n3 PGD 0
> 2019-07-28T07:17:40.194103-05:00 c3-0c1s14n3 memfill2 cpuset=
> 2019-07-28T07:17:40.194116-05:00 c3-0c1s14n3 P4D 0
> 2019-07-28T07:17:40.194131-05:00 c3-0c1s14n3 Oops: 0002 [#1] SMP PTI
> 2019-07-28T07:17:40.194146-05:00 c3-0c1s14n3 CPU: 25 PID: 21724 Comm: kworker/25:1 Tainted: P           O     4.12.14-150.17_5.0.84-cray_ari_c #1 SLE15 (unreleased)
> 2019-07-28T07:17:40.194161-05:00 c3-0c1s14n3 Hardware name: Cray Inc. Cascade/Cascade, BIOS 5.11 03/22/2018
> 2019-07-28T07:17:40.194175-05:00 c3-0c1s14n3 2352114
> 2019-07-28T07:17:40.194190-05:00 c3-0c1s14n3 Workqueue: ptlrpc_pinger ptlrpc_pinger_main [ptlrpc]
> 2019-07-28T07:17:40.194204-05:00 c3-0c1s14n3 mems_allowed=0-1
> 2019-07-28T07:17:40.194218-05:00 c3-0c1s14n3 task: ffff8815c1696000 task.stack: ffffc9000753c000
> 2019-07-28T07:17:40.194232-05:00 c3-0c1s14n3 RIP: 0010:ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc]
> 2019-07-28T07:17:40.194246-05:00 c3-0c1s14n3 RSP: 0000:ffffc9000753fe18 EFLAGS: 00010296
> 2019-07-28T07:17:40.194261-05:00 c3-0c1s14n3 RAX: 0000000000000001 RBX: ffff881f8c920000 RCX: 0000000000000000
> 2019-07-28T07:17:40.194276-05:00 c3-0c1s14n3 RDX: 0000000000000336 RSI: 0000000000000000 RDI: ffff881f90478c80
> 2019-07-28T07:17:40.194290-05:00 c3-0c1s14n3 RBP: ffffc9000753fe28 R08: 0000000000000000 R09: ffff88118101a358
> 2019-07-28T07:17:40.194304-05:00 c3-0c1s14n3 R10: ffffc9000753f960 R11: ffff88118101a392 R12: 0000000000000000
> 2019-07-28T07:17:40.194319-05:00 c3-0c1s14n3 R13: 0000000000000000 R14: ffff881f8c920278 R15: ffff881f8c920020
> 2019-07-28T07:17:40.194333-05:00 c3-0c1s14n3 FS:  0000000000000000(0000) GS:ffff88203f5c0000(0000) knlGS:0000000000000000
> 2019-07-28T07:17:40.194347-05:00 c3-0c1s14n3 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2019-07-28T07:17:40.194362-05:00 c3-0c1s14n3 CR2: 0000000000000160 CR3: 0000000001e0a002 CR4: 00000000003606e0
> 2019-07-28T07:17:40.194376-05:00 c3-0c1s14n3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 2019-07-28T07:17:40.194390-05:00 c3-0c1s14n3 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> 2019-07-28T07:17:40.194404-05:00 c3-0c1s14n3 Call Trace:
> 2019-07-28T07:17:40.194419-05:00 c3-0c1s14n3 ptlrpc_pinger_main+0x53b/0x900 [ptlrpc]
> 2019-07-28T07:17:40.194432-05:00 c3-0c1s14n3 process_one_work+0x260/0x4a0
> 2019-07-28T07:17:40.194446-05:00 c3-0c1s14n3 worker_thread+0x45/0x420
> 2019-07-28T07:17:40.194460-05:00 c3-0c1s14n3 kthread+0x127/0x140
> 2019-07-28T07:17:40.194475-05:00 c3-0c1s14n3 ? trace_event_raw_event_workqueue_execute_start+0x90/0x90
> 2019-07-28T07:17:40.194490-05:00 c3-0c1s14n3 ? kthread_create_on_node+0x40/0x40
> 2019-07-28T07:17:40.194504-05:00 c3-0c1s14n3 ret_from_fork+0x3a/0x50

> 2019-07-28T07:17:40.194591-05:00 c3-0c1s14n3 CR2: 0000000000000160
> 2019-07-28T07:17:40.194606-05:00 c3-0c1s14n3 ---[ end trace 26bbd436eb3f80b0 ]---
> 2019-07-28T07:17:40.194620-05:00 c3-0c1s14n3 RIP: 0010:ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc]
> 2019-07-28T07:17:40.194645-05:00 c3-0c1s14n3 RSP: 0000:ffffc9000753fe18 EFLAGS: 00010296
> 2019-07-28T07:17:40.194672-05:00 c3-0c1s14n3 RAX: 0000000000000001 RBX: ffff881f8c920000 RCX: 0000000000000000
> 2019-07-28T07:17:40.194686-05:00 c3-0c1s14n3 RDX: 0000000000000336 RSI: 0000000000000000 RDI: ffff881f90478c80
> 2019-07-28T07:17:40.194701-05:00 c3-0c1s14n3 RBP: ffffc9000753fe28 R08: 0000000000000000 R09: ffff88118101a358
> 2019-07-28T07:17:40.194715-05:00 c3-0c1s14n3 R10: ffffc9000753f960 R11: ffff88118101a392 R12: 0000000000000000
> 2019-07-28T07:17:40.194729-05:00 c3-0c1s14n3 R13: 0000000000000000 R14: ffff881f8c920278 R15: ffff881f8c920020
> 2019-07-28T07:17:40.194744-05:00 c3-0c1s14n3 FS:  0000000000000000(0000) GS:ffff88203f5c0000(0000) knlGS:0000000000000000
> 2019-07-28T07:17:40.194758-05:00 c3-0c1s14n3 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2019-07-28T07:17:40.194772-05:00 c3-0c1s14n3 CR2: 0000000000000160 CR3: 0000000001e0a002 CR4: 00000000003606e0
> 2019-07-28T07:17:40.194788-05:00 c3-0c1s14n3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 2019-07-28T07:17:40.194802-05:00 c3-0c1s14n3 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> 2019-07-28T07:17:40.194818-05:00 c3-0c1s14n3 Kernel panic - not syncing: Fatal exception
> 2019-07-28T07:17:40.194832-05:00 c3-0c1s14n3 Shutting down cpus with NMI
> 2019-07-28T07:17:40.194846-05:00 c3-0c1s14n3 Kernel Offset: disabled
> 2019-07-28T07:17:40.194860-05:00 c3-0c1s14n3 ---[ end Kernel panic - not syncing: Fatal exception

From dklog:

crash_x86_64> epython dk | grep ':21724:'
00000100:00080000:25.0:1564316239.230647:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0001-osc-ffff880f866d1000: disconnect after 81s idle
00000100:00080000:25.0:1564316239.230650:000000:21724:0:(import.c:1624:ptlrpc_disconnect_prep_req()) ffff881f8aeee000 snx11242-OST0002_UUID: changing import state from FULL to CONNECTING
00000100:00080000:25.0:1564316239.230651:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0002-osc-ffff880f866d1000: disconnect after 80s idle
00000100:00020000:25.0:1564316255.177461:000000:21724:0:(client.c:893:__ptlrpc_request_alloc()) request allocation out of memory
00000100:00080000:25.0:1564316255.180754:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0003-osc-ffff880f866d1000: disconnect after 97s idle

crash_x86_64> epython obd_devs
obd_device          obd_name               	ip_address             client_obd          obd_import          imp_state    ish_time   index   conn_cnt
========================================================================================================================================================
0xffff881fa37703c0  snx11242-OST0003      	10.155.10.8@o2ib3003  	 0xffff881fa3770868  0xffff881f8c920000  FULL       1564316158     5   141

obd_import 0xffff881f8c920000 is idle so ptlrpc_disconnect_and_idle_import() is called.
Node is out of memory so ptlrpc_disconnect_prep_req(imp) returns NULL but error handling only checks for a bad pointer. Doesn't handle the NULL case.

1757 int ptlrpc_disconnect_and_idle_import(struct obd_import *imp)
1758 {
...
1775         req = ptlrpc_disconnect_prep_req(imp);
1776         if (IS_ERR(req))
1777                 RETURN(PTR_ERR(req));
1778
1779         CDEBUG_LIMIT(imp->imp_idle_debug, "%s: disconnect after %llus idle\n",
1780                      imp->imp_obd->obd_name,
1781                      ktime_get_real_seconds() - imp->imp_last_reply_time);
1782         req->rq_interpret_reply = ptlrpc_disconnect_idle_interpret;    <---- NULL pointer dereference


 Comments   
Comment by Gerrit Updater [ 24/Sep/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36282
Subject: LU-12799 ptlrpc: return proper error code
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 108a7f819ad5241c095779a6f70ba524606b3328

Comment by Gerrit Updater [ 22/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36282/
Subject: LU-12799 ptlrpc: return proper error code
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9e2620d75cce1e1b4855704ddd9a994ce8e8d650

Comment by Peter Jones [ 23/Oct/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 08/Jan/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37164
Subject: LU-12799 ptlrpc: return proper error code
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ef4895857f85ad4f15c50b3db39732644c97b309

Comment by Gerrit Updater [ 27/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37164/
Subject: LU-12799 ptlrpc: return proper error code
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 939d8bb0cf88babe1b9812bc03e07d7811bf2770

Generated at Sat Feb 10 02:55:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.