Details
- Bug
- Resolution: Fixed
- Blocker
- None
- None
- Hyperion iwc126 client
- 3
- 9099
Description
During an IOR fpp (file-per-process) run, a client panicked.
This was on:
Lustre: Lustre: Build Version: 2.4.52--PRISTINE-2.6.32-358.11.1.el6.x86_64
In the crashdump the following was seen:
LNetError: 23379:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.52/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
LNetError: 23379:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 199035354 total bytes allocated by lnet
BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
IP: [<ffffffffa0990d5a>] ptlrpc_register_bulk+0x46a/0x9d0 [ptlrpc]
PGD 1a4e1b067 PUD 12fa01067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127
CPU 7
Modules linked in: lmv(U) fld(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sha512_generic sha256_generic crc32c_intel ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support i7core_edac edac_core ioatdma dca shpchp nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core e1000e be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
Pid: 23379, comm: ptlrpcd_1 Tainted: G W --------------- 2.6.32-358.11.1.el6.x86_64 #1 Dell XS23-TY /XS23-TY
RIP: 0010:[<ffffffffa0990d5a>] [<ffffffffa0990d5a>] ptlrpc_register_bulk+0x46a/0x9d0 [ptlrpc]
RSP: 0018:ffff8803017dbb10 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff880084340000 RCX: 00051e3eebeda3b4
RDX: 0000000000000000 RSI: ffffffffa09fc2c0 RDI: ffffffffa0a3e520
RBP: ffff8803017dbbd0 R08: 0000000000000000 R09: 00000000fffffff4
R10: 0000000000000002 R11: 0000000000000000 R12: 00000000fffffff4
R13: 00051e3eebeda3b4 R14: 0000000000000000 R15: 00051e3eebeda3b4
FS: 0000000000000000(0000) GS:ffff8801c58c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000000b8 CR3: 00000001bad21000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ptlrpcd_1 (pid: 23379, threadinfo ffff8803017da000, task ffff8803017d9500)
Stack:
ffff8800843400a0 0000000100000100 00000102000000d2 ffff880084340058
<d> 0000000000000023 00000000a07625a0 ffff8801833b2400 00000001e8c219c4
<d> ffff8800843400a0 0000000100000100 00000102000000d2 ffff880084340058
Call Trace:
[<ffffffffa0991fa2>] ptl_send_rpc+0x232/0xc40 [ptlrpc]
[<ffffffff81281484>] ? snprintf+0x34/0x40
[<ffffffffa0718fe1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa09876bb>] ptlrpc_send_new_req+0x45b/0x7a0 [ptlrpc]
[<ffffffffa098b3a8>] ptlrpc_check_set+0x878/0x1b20 [ptlrpc]
[<ffffffffa09b76cb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
[<ffffffff8109705c>] ? remove_wait_queue+0x3c/0x50
[<ffffffffa09b7b50>] ptlrpcd+0x190/0x380 [ptlrpc]
[<ffffffff81063310>] ? default_wake_function+0x0/0x20
[<ffffffffa09b79c0>] ? ptlrpcd+0x0/0x380 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810968a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: f0 48 c7 05 05 d8 0a 00 50 e5 a3 a0 c7 05 f3 d7 0a 00 00 00 02 00 4c 89 e9 48 8b 43 10 48 c7 c6 c0 c2 9f a0 48 c7 c7 20 e5 a3 a0 <48> 8b 90 b8 00 00 00 31 c0 48 83 c2 0c e8 34 82 d8 ff 48 8b 7d
RIP [<ffffffffa0990d5a>] ptlrpc_register_bulk+0x46a/0x9d0 [ptlrpc]
RSP <ffff8803017dbb10>
CR2: 00000000000000b8
There were some non-fatal memory allocation errors in the log before those messages; I will attach the full console log.
This was 1 of 100 clients.
Attachments
Issue Links
- is related to LU-3598 Failure on test suite sanity-benchmark test_iozone: page allocation failure (Resolved)
At first glance it looks like the system is running out of memory; the LNET errors are a symptom of that. The crash happens in ptlrpc (ptlrpc_register_bulk). My guess is that a memory allocation fails, the result is not checked for NULL, and the NULL pointer is subsequently dereferenced, causing the crash. The problem appears to be twofold.
1. We are not handling the out-of-memory scenario correctly somewhere. Are we expecting the system to run out of memory under specific load cases?
2. There is a memory leak somewhere, which will cause us to get into the problem identified in (1).
Will investigate further.