Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3585

Client panic during IOR single file per process: Lnet out of Memory

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.1, Lustre 2.5.0
    • None
    • None
    • Hyperion iwc126 client
    • 3
    • 9099

    Description

      During an ior fpp run a client paniced.
      This was on:
      Lustre: Lustre: Build Version: 2.4.52--PRISTINE-2.6.32-358.11.1.el6.x86_64

      In the crashdump the following was seen:

      LNetError: 23379:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.52/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
      LNetError: 23379:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 199035354 total bytes allocated by lnet
      BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
      IP: [<ffffffffa0990d5a>] ptlrpc_register_bulk+0x46a/0x9d0 [ptlrpc]
      PGD 1a4e1b067 PUD 12fa01067 PMD 0 
      Oops: 0000 [#1] SMP 
      last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127
      CPU 7 
      Modules linked in: lmv(U) fld(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sha512_generic sha256_generic crc32c_intel ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support i7core_edac edac_core ioatdma dca shpchp nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core e1000e be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
      
      Pid: 23379, comm: ptlrpcd_1 Tainted: G        W  ---------------    2.6.32-358.11.1.el6.x86_64 #1 Dell        XS23-TY     /XS23-TY     
      RIP: 0010:[<ffffffffa0990d5a>]  [<ffffffffa0990d5a>] ptlrpc_register_bulk+0x46a/0x9d0 [ptlrpc]
      RSP: 0018:ffff8803017dbb10  EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff880084340000 RCX: 00051e3eebeda3b4
      RDX: 0000000000000000 RSI: ffffffffa09fc2c0 RDI: ffffffffa0a3e520
      RBP: ffff8803017dbbd0 R08: 0000000000000000 R09: 00000000fffffff4
      R10: 0000000000000002 R11: 0000000000000000 R12: 00000000fffffff4
      R13: 00051e3eebeda3b4 R14: 0000000000000000 R15: 00051e3eebeda3b4
      FS:  0000000000000000(0000) GS:ffff8801c58c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00000000000000b8 CR3: 00000001bad21000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ptlrpcd_1 (pid: 23379, threadinfo ffff8803017da000, task ffff8803017d9500)
      Stack:
       ffff8800843400a0 0000000100000100 00000102000000d2 ffff880084340058
      <d> 0000000000000023 00000000a07625a0 ffff8801833b2400 00000001e8c219c4
      <d> ffff8800843400a0 0000000100000100 00000102000000d2 ffff880084340058
      Call Trace:
       [<ffffffffa0991fa2>] ptl_send_rpc+0x232/0xc40 [ptlrpc]
       [<ffffffff81281484>] ? snprintf+0x34/0x40
       [<ffffffffa0718fe1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa09876bb>] ptlrpc_send_new_req+0x45b/0x7a0 [ptlrpc]
       [<ffffffffa098b3a8>] ptlrpc_check_set+0x878/0x1b20 [ptlrpc]
       [<ffffffffa09b76cb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
       [<ffffffff8109705c>] ? remove_wait_queue+0x3c/0x50
       [<ffffffffa09b7b50>] ptlrpcd+0x190/0x380 [ptlrpc]
       [<ffffffff81063310>] ? default_wake_function+0x0/0x20
       [<ffffffffa09b79c0>] ? ptlrpcd+0x0/0x380 [ptlrpc]
       [<ffffffff81096936>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810968a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Code: f0 48 c7 05 05 d8 0a 00 50 e5 a3 a0 c7 05 f3 d7 0a 00 00 00 02 00 4c 89 e9 48 8b 43 10 48 c7 c6 c0 c2 9f a0 48 c7 c7 20 e5 a3 a0 <48> 8b 90 b8 00 00 00 31 c0 48 83 c2 0c e8 34 82 d8 ff 48 8b 7d 
      RIP  [<ffffffffa0990d5a>] ptlrpc_register_bulk+0x46a/0x9d0 [ptlrpc]
       RSP <ffff8803017dbb10>
      CR2: 00000000000000b8
      

      There we some non-fatal memory allocation errors in the log before those messages and I will attached the full console log.

      This was 1 of 100 clients.

      Attachments

        Issue Links

          Activity

            [LU-3585] Client panic during IOR single file per process: Lnet out of Memory

            That fpatch has landed. This will fix the issue seen in the kdump.

            keith Keith Mannthey (Inactive) added a comment - That fpatch has landed. This will fix the issue seen in the kdump.

            Keith,
            Can you confirm the fix now that the patch has landed to master and either reduce the priority or close this bug as appropriate?
            Thank you!

            jlevi Jodi Levi (Inactive) added a comment - Keith, Can you confirm the fix now that the patch has landed to master and either reduce the priority or close this bug as appropriate? Thank you!
            ashehata Amir Shehata (Inactive) added a comment - Gerrit inspection at: http://review.whamcloud.com/#/c/7121/

            After debugging the crash dump with Keith we verified that indeed desc->bd_export is NULL. We also verified that desc->bd_import is a non NULL value. From the code bd_import is used on the sender (client) side while bd_export is used on the receiver (server side) side. The crash occurs on the client side. Also ptlrpc_register_bulk() is always called from ptl_send_rpc(). In ptl_send_rpc the assumption in the code is that bd_import is none NULL. Same in ptlrpc_register_bulk(), as the peer is extracted from the bd_import structure. This leads me to believe that the CERROR statements should be looking in bd_import and not bd_export. It's likely that this was never hit because we didn't have a test that ran out of memory.

            A safe fix, I believe, is to change the CERROR to dereference bd_import rather than bd_export.

            This will most likely fix the crash. However, it doesn't address the reason that the system is running out of memory. However, as far as I understand under the load the test is creating it is expected that all system memory would be consumed.

            I'll work on a fix.

            ashehata Amir Shehata (Inactive) added a comment - After debugging the crash dump with Keith we verified that indeed desc->bd_export is NULL. We also verified that desc->bd_import is a non NULL value. From the code bd_import is used on the sender (client) side while bd_export is used on the receiver (server side) side. The crash occurs on the client side. Also ptlrpc_register_bulk() is always called from ptl_send_rpc(). In ptl_send_rpc the assumption in the code is that bd_import is none NULL. Same in ptlrpc_register_bulk(), as the peer is extracted from the bd_import structure. This leads me to believe that the CERROR statements should be looking in bd_import and not bd_export. It's likely that this was never hit because we didn't have a test that ran out of memory. A safe fix, I believe, is to change the CERROR to dereference bd_import rather than bd_export. This will most likely fix the crash. However, it doesn't address the reason that the system is running out of memory. However, as far as I understand under the load the test is creating it is expected that all system memory would be consumed. I'll work on a fix.

            I arrived to that conclusion as well. Actually desc->bd_export is NULL and when we try to access exp_obd we crash. exp_obd is @ 184 offset in the bd_export structure. Which matches the messages:
            BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
            What I'm trying to find out now is under what circumstances desc->bd_export would be NULL. I'm trying to get Liang's help since he's a more knowledgeable in this area.

            A simple fix would be to check for NULL before printing (or not printing at all), but without knowing exactly why we're getting in this scenario in the first place, I don't believe this would be a good fix.

            One thing to try is to take out this print and try to reproduce the issue again. I would like to see if it will crash some where else. Maybe we can get more information this way.

            I'll talk to Keith to see if we can collaborate to try that out.

            ashehata Amir Shehata (Inactive) added a comment - I arrived to that conclusion as well. Actually desc->bd_export is NULL and when we try to access exp_obd we crash. exp_obd is @ 184 offset in the bd_export structure. Which matches the messages: BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8 What I'm trying to find out now is under what circumstances desc->bd_export would be NULL. I'm trying to get Liang's help since he's a more knowledgeable in this area. A simple fix would be to check for NULL before printing (or not printing at all), but without knowing exactly why we're getting in this scenario in the first place, I don't believe this would be a good fix. One thing to try is to take out this print and try to reproduce the issue again. I would like to see if it will crash some where else. Maybe we can get more information this way. I'll talk to Keith to see if we can collaborate to try that out.
            keith Keith Mannthey (Inactive) added a comment - - edited

            1. Amir we are not trying to run the system out of memory but the load bay be triggering it.

            I was able to spend some time with the kdump today and I have found the following:

            We know:

               [exception RIP: ptlrpc_register_bulk+1130]
                RIP: ffffffffa097fd5a  RSP: ffff8801a45c1b10  RFLAGS: 00010282
                RAX: 0000000000000000  RBX: ffff8800931ba000  RCX: 00051e339ee2b698
                RDX: 0000000000000000  RSI: ffffffffa09eb2c0  RDI: ffffffffa0a2d520
                RBP: ffff8801a45c1bd0   R8: 0000000000000000   R9: 00000000fffffff4
                R10: 0000000000000001  R11: 0000000000000000  R12: 00000000fffffff4
                R13: 00051e339ee2b698  R14: 0000000000000000  R15: 00051e339ee2b698
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
            

            Note rax and rdx are both 0.

            0xffffffffa097fcea <ptlrpc_register_bulk+1018>: nopw   0x0(%rax,%rax,1)
            /usr/src/debug/lustre-2.4.52/libcfs/include/libcfs/libcfs_debug.h: 211
            0xffffffffa097fcf0 <ptlrpc_register_bulk+1024>: mov    $0xfffffff4,%eax
            0xffffffffa097fcf5 <ptlrpc_register_bulk+1029>: jmp    0xffffffffa097fcd8 <ptlrpc_register_bulk+1000>
            0xffffffffa097fcf7 <ptlrpc_register_bulk+1031>: nopw   0x0(%rax,%rax,1)
            /usr/src/debug/lustre-2.4.52/lustre/ptlrpc/niobuf.c: 374
            0xffffffffa097fd00 <ptlrpc_register_bulk+1040>: movl   $0x100,0xad826(%rip)        # 0xffffffffa0a2d530 <msgdata.73466+16>
            0xffffffffa097fd0a <ptlrpc_register_bulk+1050>: movq   $0xffffffffa09eaea8,0xad80b(%rip)        # 0xffffffffa0a2d520 <msgdata.73466>
            0xffffffffa097fd15 <ptlrpc_register_bulk+1061>: mov    %r12d,%r9d
            0xffffffffa097fd18 <ptlrpc_register_bulk+1064>: movq   $0xffffffffa09cee40,0xad805(%rip)        # 0xffffffffa0a2d528 <msgdata.73466+8>
            0xffffffffa097fd23 <ptlrpc_register_bulk+1075>: movl   $0x178,0xad807(%rip)        # 0xffffffffa0a2d534 <msgdata.73466+20>
            0xffffffffa097fd2d <ptlrpc_register_bulk+1085>: mov    %r14d,%r8d
            0xffffffffa097fd30 <ptlrpc_register_bulk+1088>: movq   $0xffffffffa0a2d550,0xad805(%rip)        # 0xffffffffa0a2d540 <msgdata.73466+32>
            0xffffffffa097fd3b <ptlrpc_register_bulk+1099>: movl   $0x20000,0xad7f3(%rip)        # 0xffffffffa0a2d538 <msgdata.73466+24>
            0xffffffffa097fd45 <ptlrpc_register_bulk+1109>: mov    %r13,%rcx
            0xffffffffa097fd48 <ptlrpc_register_bulk+1112>: mov    0x10(%rbx),%rax
            0xffffffffa097fd4c <ptlrpc_register_bulk+1116>: mov    $0xffffffffa09eb2c0,%rsi
            0xffffffffa097fd53 <ptlrpc_register_bulk+1123>: mov    $0xffffffffa0a2d520,%rdi
            0xffffffffa097fd5a <ptlrpc_register_bulk+1130>: mov    0xb8(%rax),%rdx         <== Died. rax rdx both 0
            0xffffffffa097fd61 <ptlrpc_register_bulk+1137>: xor    %eax,%eax
            0xffffffffa097fd63 <ptlrpc_register_bulk+1139>: add    $0xc,%rdx
            0xffffffffa097fd67 <ptlrpc_register_bulk+1143>: callq  0xffffffffa0713fa0 <libcfs_debug_msg>
            /usr/src/debug/lustre-2.4.52/lustre/ptlrpc/niobuf.c: 377
            0xffffffffa097fd6c <ptlrpc_register_bulk+1148>: mov    -0x40(%rbp),%rdi
            0xffffffffa097fd70 <ptlrpc_register_bulk+1152>: callq  0xffffffffa07866a0 <LNetMEUnlink>
            

            I believe we are in this part of the code lustre/ptlrpc/niobuf.c, the CERROR at line 374:

                        /* About to let the network at it... */
                            rc = LNetMDAttach(me_h, md, LNET_UNLINK,
                                              &desc->bd_mds[posted_md]);
                            if (rc != 0) {
                                    CERROR("%s: LNetMDAttach failed x"LPU64"/%d: rc = %d\n",
                                           desc->bd_export->exp_obd->obd_name, xid,
                                           posted_md, rc);
                                    rc2 = LNetMEUnlink(me_h);
                                    LASSERT(rc2 == 0);
                                    break;
                            }
            

            It seems we fail the LNetMDAttach with ENOMEM (or???) and then we blow up formatting the CERROR. Maybe desc>bd_export->exp_obd->obd_name is not correctly setup for some reason?

            I guess maloo does not like my words, this looks quite odd in my browser.

            keith Keith Mannthey (Inactive) added a comment - - edited 1. Amir we are not trying to run the system out of memory but the load bay be triggering it. I was able to spend some time with the kdump today and I have found the following: We know: [exception RIP: ptlrpc_register_bulk+1130] RIP: ffffffffa097fd5a RSP: ffff8801a45c1b10 RFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff8800931ba000 RCX: 00051e339ee2b698 RDX: 0000000000000000 RSI: ffffffffa09eb2c0 RDI: ffffffffa0a2d520 RBP: ffff8801a45c1bd0 R8: 0000000000000000 R9: 00000000fffffff4 R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffff4 R13: 00051e339ee2b698 R14: 0000000000000000 R15: 00051e339ee2b698 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 Note rax and rdx are both 0. 0xffffffffa097fcea <ptlrpc_register_bulk+1018>: nopw 0x0(%rax,%rax,1) /usr/src/debug/lustre-2.4.52/libcfs/include/libcfs/libcfs_debug.h: 211 0xffffffffa097fcf0 <ptlrpc_register_bulk+1024>: mov $0xfffffff4,%eax 0xffffffffa097fcf5 <ptlrpc_register_bulk+1029>: jmp 0xffffffffa097fcd8 <ptlrpc_register_bulk+1000> 0xffffffffa097fcf7 <ptlrpc_register_bulk+1031>: nopw 0x0(%rax,%rax,1) /usr/src/debug/lustre-2.4.52/lustre/ptlrpc/niobuf.c: 374 0xffffffffa097fd00 <ptlrpc_register_bulk+1040>: movl $0x100,0xad826(%rip) # 0xffffffffa0a2d530 <msgdata.73466+16> 0xffffffffa097fd0a <ptlrpc_register_bulk+1050>: movq $0xffffffffa09eaea8,0xad80b(%rip) # 0xffffffffa0a2d520 <msgdata.73466> 0xffffffffa097fd15 <ptlrpc_register_bulk+1061>: mov %r12d,%r9d 0xffffffffa097fd18 <ptlrpc_register_bulk+1064>: movq $0xffffffffa09cee40,0xad805(%rip) # 0xffffffffa0a2d528 <msgdata.73466+8> 0xffffffffa097fd23 <ptlrpc_register_bulk+1075>: movl $0x178,0xad807(%rip) # 0xffffffffa0a2d534 <msgdata.73466+20> 0xffffffffa097fd2d <ptlrpc_register_bulk+1085>: mov %r14d,%r8d 0xffffffffa097fd30 <ptlrpc_register_bulk+1088>: movq $0xffffffffa0a2d550,0xad805(%rip) # 0xffffffffa0a2d540 <msgdata.73466+32> 0xffffffffa097fd3b <ptlrpc_register_bulk+1099>: movl $0x20000,0xad7f3(%rip) # 0xffffffffa0a2d538 <msgdata.73466+24> 0xffffffffa097fd45 <ptlrpc_register_bulk+1109>: mov %r13,%rcx 0xffffffffa097fd48 <ptlrpc_register_bulk+1112>: mov 0x10(%rbx),%rax 0xffffffffa097fd4c <ptlrpc_register_bulk+1116>: mov $0xffffffffa09eb2c0,%rsi 0xffffffffa097fd53 <ptlrpc_register_bulk+1123>: mov $0xffffffffa0a2d520,%rdi 0xffffffffa097fd5a <ptlrpc_register_bulk+1130>: mov 0xb8(%rax),%rdx <== Died. rax rdx both 0 0xffffffffa097fd61 <ptlrpc_register_bulk+1137>: xor %eax,%eax 0xffffffffa097fd63 <ptlrpc_register_bulk+1139>: add $0xc,%rdx 0xffffffffa097fd67 <ptlrpc_register_bulk+1143>: callq 0xffffffffa0713fa0 <libcfs_debug_msg> /usr/src/debug/lustre-2.4.52/lustre/ptlrpc/niobuf.c: 377 0xffffffffa097fd6c <ptlrpc_register_bulk+1148>: mov -0x40(%rbp),%rdi 0xffffffffa097fd70 <ptlrpc_register_bulk+1152>: callq 0xffffffffa07866a0 <LNetMEUnlink> I believe we are in this part of the code lustre/ptlrpc/niobuf.c, the CERROR at line 374: /* About to let the network at it... */ rc = LNetMDAttach(me_h, md, LNET_UNLINK, &desc->bd_mds[posted_md]); if (rc != 0) { CERROR("%s: LNetMDAttach failed x"LPU64"/%d: rc = %d\n", desc->bd_export->exp_obd->obd_name, xid, posted_md, rc); rc2 = LNetMEUnlink(me_h); LASSERT(rc2 == 0); break; } It seems we fail the LNetMDAttach with ENOMEM (or???) and then we blow up formatting the CERROR. Maybe desc >bd_export->exp_obd->obd_name is not correctly setup for some reason? I guess maloo does not like my words, this looks quite odd in my browser.

            At first glance it looks like the system is running out of memory. The LNET errors are a symptom of the system running out of memory. The crash seems to be happening in ptl. So I would think that a memory allocation fails, but NULL is not checked and then subsequently the NULL is deferenced causing the crash. The problem appears to be two fold.
            1. We're not handling out of memory scenario correctly some where. Are we expecting the system to run out of memory under specific load cases?
            2. There is a memory leak some where, which will cause us to get into the problem identified in (1).

            Will investigate further.

            ashehata Amir Shehata (Inactive) added a comment - At first glance it looks like the system is running out of memory. The LNET errors are a symptom of the system running out of memory. The crash seems to be happening in ptl. So I would think that a memory allocation fails, but NULL is not checked and then subsequently the NULL is deferenced causing the crash. The problem appears to be two fold. 1. We're not handling out of memory scenario correctly some where. Are we expecting the system to run out of memory under specific load cases? 2. There is a memory leak some where, which will cause us to get into the problem identified in (1). Will investigate further.
            pjones Peter Jones added a comment -

            Amir

            Can you advise anything here?

            Peter

            pjones Peter Jones added a comment - Amir Can you advise anything here? Peter
            keith Keith Mannthey (Inactive) added a comment - - edited

            We seem to be in ptlrpc_register_bulk in both cases this function was mostly rewritten as part of LU-1431 ptlrpc: Support for over 1MB bulk I/O RPC: http://review.whamcloud.com/4993 .

            keith Keith Mannthey (Inactive) added a comment - - edited We seem to be in ptlrpc_register_bulk in both cases this function was mostly rewritten as part of LU-1431 ptlrpc: Support for over 1MB bulk I/O RPC: http://review.whamcloud.com/4993 .

            Just to note: This same issue has been reported in a different context in LU_3598.

            LNetError: 23379:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.52/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
            LNetError: 23379:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 199035354 total bytes allocated by lnet
            BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
            

            Is the common error.

            keith Keith Mannthey (Inactive) added a comment - Just to note: This same issue has been reported in a different context in LU_3598. LNetError: 23379:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.52/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208) LNetError: 23379:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 199035354 total bytes allocated by lnet BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8 Is the common error.

            People

              ashehata Amir Shehata (Inactive)
              keith Keith Mannthey (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: