Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12799

NULL pointer dereference IP: ptlrpc_disconnect_and_idle_import

    XMLWordPrintable

Details

    • 3
    • 9223372036854775807

    Description

      Disconnecting from an idle import results in a NULL pointer dereference if the ptlrpc request cannot be allocated due to insufficient free memory.

      From console log:

      > 2019-07-28T07:17:40.193910-05:00 c3-0c1s14n3 page allocation failure: order:0
      > 2019-07-28T07:17:40.193923-05:00 c3-0c1s14n3 / mems_allowed=0-1
      > 2019-07-28T07:17:40.193938-05:00 c3-0c1s14n3 LustreError: 21724:0:(client.c:893:__ptlrpc_request_alloc()) request allocation out of memory
      > 2019-07-28T07:17:40.193953-05:00 c3-0c1s14n3 , mode:0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=
      > 2019-07-28T07:17:40.193967-05:00 c3-0c1s14n3 BUG: unable to handle kernel
      > 2019-07-28T07:17:40.193992-05:00 c3-0c1s14n3 (null)
      > 2019-07-28T07:17:40.194007-05:00 c3-0c1s14n3 NULL pointer dereference at 0000000000000160
      > 2019-07-28T07:17:40.194073-05:00 c3-0c1s14n3 IP: ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc]
      > 2019-07-28T07:17:40.194088-05:00 c3-0c1s14n3 PGD 0
      > 2019-07-28T07:17:40.194103-05:00 c3-0c1s14n3 memfill2 cpuset=
      > 2019-07-28T07:17:40.194116-05:00 c3-0c1s14n3 P4D 0
      > 2019-07-28T07:17:40.194131-05:00 c3-0c1s14n3 Oops: 0002 [#1] SMP PTI
      > 2019-07-28T07:17:40.194146-05:00 c3-0c1s14n3 CPU: 25 PID: 21724 Comm: kworker/25:1 Tainted: P           O     4.12.14-150.17_5.0.84-cray_ari_c #1 SLE15 (unreleased)
      > 2019-07-28T07:17:40.194161-05:00 c3-0c1s14n3 Hardware name: Cray Inc. Cascade/Cascade, BIOS 5.11 03/22/2018
      > 2019-07-28T07:17:40.194175-05:00 c3-0c1s14n3 2352114
      > 2019-07-28T07:17:40.194190-05:00 c3-0c1s14n3 Workqueue: ptlrpc_pinger ptlrpc_pinger_main [ptlrpc]
      > 2019-07-28T07:17:40.194204-05:00 c3-0c1s14n3 mems_allowed=0-1
      > 2019-07-28T07:17:40.194218-05:00 c3-0c1s14n3 task: ffff8815c1696000 task.stack: ffffc9000753c000
      > 2019-07-28T07:17:40.194232-05:00 c3-0c1s14n3 RIP: 0010:ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc]
      > 2019-07-28T07:17:40.194246-05:00 c3-0c1s14n3 RSP: 0000:ffffc9000753fe18 EFLAGS: 00010296
      > 2019-07-28T07:17:40.194261-05:00 c3-0c1s14n3 RAX: 0000000000000001 RBX: ffff881f8c920000 RCX: 0000000000000000
      > 2019-07-28T07:17:40.194276-05:00 c3-0c1s14n3 RDX: 0000000000000336 RSI: 0000000000000000 RDI: ffff881f90478c80
      > 2019-07-28T07:17:40.194290-05:00 c3-0c1s14n3 RBP: ffffc9000753fe28 R08: 0000000000000000 R09: ffff88118101a358
      > 2019-07-28T07:17:40.194304-05:00 c3-0c1s14n3 R10: ffffc9000753f960 R11: ffff88118101a392 R12: 0000000000000000
      > 2019-07-28T07:17:40.194319-05:00 c3-0c1s14n3 R13: 0000000000000000 R14: ffff881f8c920278 R15: ffff881f8c920020
      > 2019-07-28T07:17:40.194333-05:00 c3-0c1s14n3 FS:  0000000000000000(0000) GS:ffff88203f5c0000(0000) knlGS:0000000000000000
      > 2019-07-28T07:17:40.194347-05:00 c3-0c1s14n3 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > 2019-07-28T07:17:40.194362-05:00 c3-0c1s14n3 CR2: 0000000000000160 CR3: 0000000001e0a002 CR4: 00000000003606e0
      > 2019-07-28T07:17:40.194376-05:00 c3-0c1s14n3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > 2019-07-28T07:17:40.194390-05:00 c3-0c1s14n3 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      > 2019-07-28T07:17:40.194404-05:00 c3-0c1s14n3 Call Trace:
      > 2019-07-28T07:17:40.194419-05:00 c3-0c1s14n3 ptlrpc_pinger_main+0x53b/0x900 [ptlrpc]
      > 2019-07-28T07:17:40.194432-05:00 c3-0c1s14n3 process_one_work+0x260/0x4a0
      > 2019-07-28T07:17:40.194446-05:00 c3-0c1s14n3 worker_thread+0x45/0x420
      > 2019-07-28T07:17:40.194460-05:00 c3-0c1s14n3 kthread+0x127/0x140
      > 2019-07-28T07:17:40.194475-05:00 c3-0c1s14n3 ? trace_event_raw_event_workqueue_execute_start+0x90/0x90
      > 2019-07-28T07:17:40.194490-05:00 c3-0c1s14n3 ? kthread_create_on_node+0x40/0x40
      > 2019-07-28T07:17:40.194504-05:00 c3-0c1s14n3 ret_from_fork+0x3a/0x50
      
      > 2019-07-28T07:17:40.194591-05:00 c3-0c1s14n3 CR2: 0000000000000160
      > 2019-07-28T07:17:40.194606-05:00 c3-0c1s14n3 ---[ end trace 26bbd436eb3f80b0 ]---
      > 2019-07-28T07:17:40.194620-05:00 c3-0c1s14n3 RIP: 0010:ptlrpc_disconnect_and_idle_import+0x19d/0x290 [ptlrpc]
      > 2019-07-28T07:17:40.194645-05:00 c3-0c1s14n3 RSP: 0000:ffffc9000753fe18 EFLAGS: 00010296
      > 2019-07-28T07:17:40.194672-05:00 c3-0c1s14n3 RAX: 0000000000000001 RBX: ffff881f8c920000 RCX: 0000000000000000
      > 2019-07-28T07:17:40.194686-05:00 c3-0c1s14n3 RDX: 0000000000000336 RSI: 0000000000000000 RDI: ffff881f90478c80
      > 2019-07-28T07:17:40.194701-05:00 c3-0c1s14n3 RBP: ffffc9000753fe28 R08: 0000000000000000 R09: ffff88118101a358
      > 2019-07-28T07:17:40.194715-05:00 c3-0c1s14n3 R10: ffffc9000753f960 R11: ffff88118101a392 R12: 0000000000000000
      > 2019-07-28T07:17:40.194729-05:00 c3-0c1s14n3 R13: 0000000000000000 R14: ffff881f8c920278 R15: ffff881f8c920020
      > 2019-07-28T07:17:40.194744-05:00 c3-0c1s14n3 FS:  0000000000000000(0000) GS:ffff88203f5c0000(0000) knlGS:0000000000000000
      > 2019-07-28T07:17:40.194758-05:00 c3-0c1s14n3 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > 2019-07-28T07:17:40.194772-05:00 c3-0c1s14n3 CR2: 0000000000000160 CR3: 0000000001e0a002 CR4: 00000000003606e0
      > 2019-07-28T07:17:40.194788-05:00 c3-0c1s14n3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > 2019-07-28T07:17:40.194802-05:00 c3-0c1s14n3 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      > 2019-07-28T07:17:40.194818-05:00 c3-0c1s14n3 Kernel panic - not syncing: Fatal exception
      > 2019-07-28T07:17:40.194832-05:00 c3-0c1s14n3 Shutting down cpus with NMI
      > 2019-07-28T07:17:40.194846-05:00 c3-0c1s14n3 Kernel Offset: disabled
      > 2019-07-28T07:17:40.194860-05:00 c3-0c1s14n3 ---[ end Kernel panic - not syncing: Fatal exception
      

      From dklog:

      crash_x86_64> epython dk | grep ':21724:'
      00000100:00080000:25.0:1564316239.230647:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0001-osc-ffff880f866d1000: disconnect after 81s idle
      00000100:00080000:25.0:1564316239.230650:000000:21724:0:(import.c:1624:ptlrpc_disconnect_prep_req()) ffff881f8aeee000 snx11242-OST0002_UUID: changing import state from FULL to CONNECTING
      00000100:00080000:25.0:1564316239.230651:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0002-osc-ffff880f866d1000: disconnect after 80s idle
      00000100:00020000:25.0:1564316255.177461:000000:21724:0:(client.c:893:__ptlrpc_request_alloc()) request allocation out of memory
      00000100:00080000:25.0:1564316255.180754:000000:21724:0:(import.c:1781:ptlrpc_disconnect_and_idle_import()) snx11242-OST0003-osc-ffff880f866d1000: disconnect after 97s idle
      
      crash_x86_64> epython obd_devs
      obd_device          obd_name               	ip_address             client_obd          obd_import          imp_state    ish_time   index   conn_cnt
      ========================================================================================================================================================
      0xffff881fa37703c0  snx11242-OST0003      	10.155.10.8@o2ib3003  	 0xffff881fa3770868  0xffff881f8c920000  FULL       1564316158     5   141
      

      obd_import 0xffff881f8c920000 is idle so ptlrpc_disconnect_and_idle_import() is called.
      Node is out of memory so ptlrpc_disconnect_prep_req(imp) returns NULL but error handling only checks for a bad pointer. Doesn't handle the NULL case.

      1757 int ptlrpc_disconnect_and_idle_import(struct obd_import *imp)
      1758 {
      ...
      1775         req = ptlrpc_disconnect_prep_req(imp);
      1776         if (IS_ERR(req))
      1777                 RETURN(PTR_ERR(req));
      1778
      1779         CDEBUG_LIMIT(imp->imp_idle_debug, "%s: disconnect after %llus idle\n",
      1780                      imp->imp_obd->obd_name,
      1781                      ktime_get_real_seconds() - imp->imp_last_reply_time);
      1782         req->rq_interpret_reply = ptlrpc_disconnect_idle_interpret;    <---- NULL pointer dereference
      

      Attachments

        Issue Links

          Activity

            People

              amk Ann Koehler (Inactive)
              amk Ann Koehler (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: