LU-3010 (Lustre)

client crashes on RHEL6 with Lustre 1.8.8

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.8
    • Labels: None
    • Severity: 3
    • 7331

    Description

      After upgrading the clients at NOAA to RHEL6 and Lustre 1.8.8, we're seeing kernel panics. Here is the analysis from Red Hat:

      I have checked the provided vmcore, and below is my diagnosis. It seems
      the libcfs module is leading to a crash due to a NULL pointer dereference.
      libcfs is a third-party module not shipped by Red Hat; it is probably
      provided by Lustre. Please check with the Lustre vendor for further
      investigation. A more detailed analysis from the vmcore follows.

      Please go through the analysis below and contact the Lustre team. If any
      further assistance is required on the OS side, do let us know.

      ~~~~~~~~~~~~~~~~~~~~~~~~~~
      [1] OOPS at "kiblnd_sd_04"

      KERNEL: vmlinux
      DUMPFILE: vmcore.flat [PARTIAL DUMP]
      CPUS: 24
      DATE: Thu Mar 7 07:10:37 2013
      UPTIME: 1 days, 11:34:39
      LOAD AVERAGE: 6.40, 6.05, 6.59
      TASKS: 655
      NODENAME: r10i0n5
      RELEASE: 2.6.32-279.5.2.el6.x86_64
      VERSION: #1 SMP Tue Aug 14 11:36:39 EDT 2012
      MACHINE: x86_64 (3466 Mhz)
      MEMORY: 24 GB
      PANIC: "Oops: 0010 [#1] SMP " (check log for details)
      PID: 2889
      COMMAND: "kiblnd_sd_04"
      TASK: ffff8803263b7540 [THREAD_INFO: ffff880325c14000]
      CPU: 12
      STATE: TASK_RUNNING (PANIC)

      • The following PID is the one that probably hit the panic.
        crash> ps | grep 2889
        > 2889 2 12 ffff8803263b7540 RU 0.0 0 0 [kiblnd_sd_04]
        ~~~~~~~~~~~~~~~~~~~~~~~~~~

      [2] Seems like there was some resource constraint for the concerned PID 2889.
      crash> bt
      PID: 2889 TASK: ffff8803263b7540 CPU: 12 COMMAND: "kiblnd_sd_04"
      #0 [ffff880325c155a0] machine_kexec at ffffffff8103281b
      #1 [ffff880325c15600] crash_kexec at ffffffff810ba8e2
      #2 [ffff880325c156d0] oops_end at ffffffff81501510
      #3 [ffff880325c15700] no_context at ffffffff81043bab
      #4 [ffff880325c15750] __bad_area_nosemaphore at ffffffff81043e35
      #5 [ffff880325c157a0] bad_area_nosemaphore at ffffffff81043f03
      #6 [ffff880325c157b0] __do_page_fault at ffffffff81044661
      #7 [ffff880325c158d0] do_page_fault at ffffffff815034ee
      #8 [ffff880325c15900] page_fault at ffffffff815008a5
      #9 [ffff880325c15a58] libcfs_debug_dumpstack at ffffffffa04808f5 [libcfs]
      #10 [ffff880325c15a78] lbug_with_loc at ffffffffa0480f25 [libcfs]
      #11 [ffff880325c15ac8] libcfs_assertion_failed at ffffffffa0489696 [libcfs]
      #12 [ffff880325c15b18] lnet_match_md at ffffffffa04e7cdc [lnet]
      #13 [ffff880325c15be8] lnet_parse at ffffffffa04ece8a [lnet]
      #14 [ffff880325c15ce8] kiblnd_handle_rx at ffffffffa06ca2fb [ko2iblnd]
      #15 [ffff880325c15d78] kiblnd_rx_complete at ffffffffa06caea2 [ko2iblnd]
      #16 [ffff880325c15df8] kiblnd_complete at ffffffffa06cb092 [ko2iblnd]
      #17 [ffff880325c15e38] kiblnd_scheduler at ffffffffa06cb3af [ko2iblnd]
      #18 [ffff880325c15f48] kernel_thread at ffffffff8100c14a
      ~~~~~~~~~~~~~~~~~~~~~~~~~~

      [3] I see a few segfaults and LustreError messages prior to the crash:

      crash> log

      PROLOGUE-CHKNODE Wed Mar 6 18:55:45 UTC 2013 Job 16422368: chk_node.pl
      2.3-ad-2012.06.22. Starting Checks ... 16422368.bqs1.zeus.fairmont.rdhp
      cs.noaa.gov
      remap_polar_net[778]: segfault at 100000009 ip 0000000000410f6c sp
      00007fffffffd4f0 error 4 in remap_polar_netcdf.exe[400000+16c000]
      EPILOGUE-CHKNODE Wed Mar 6 18:55:48 UTC 2013 Job 16422363: chk_node.pl
      2.3-ad-2012.06.22. Starting Checks ... 16422363.bqs1.zeus.fairmont.rdhp
      cs.noaa.gov
      EPILOGUE-CHKNODE Wed Mar 6 18:55:51 UTC 2013 Job 16422308: chk_node.pl
      2.3-ad-2012.06.22. Starting Checks ... 16422308.bqs1.zeus.fairmont.rdhp
      cs.noaa.gov
      remap_polar_net[1382]: segfault at 100000009 ip 0000000000410f6c sp
      00007fffffffd530 error 4 in remap_polar_netcdf.exe[400000+16c000]
      EPILOGUE-CHKNODE Wed Mar 6 18:55:56 UTC 2013 Job 16422368: chk_node.pl
      2.3-ad-2012.06.22. Starting Checks ... 16422368.bqs1.zeus.fairmont.rdhpcs.noaa.gov
      .
      .
      .
      .
      PROLOGUE-CHKNODE Wed Mar 6 22:41:51 UTC 2013 Job 16433865: chk_node.pl
      2.3-ad-2012.06.22. Starting Checks ... 16433865.bqs1.zeus.fairmont.rdhpcs.noaa.gov
      LustreError: 12575:0:(file.c:3331:ll_inode_revalidate_fini()) failure -2 inode
      406359028
      LustreError: 12905:0:(file.c:3331:ll_inode_revalidate_fini()) failure -2 inode
      406847678
      EPILOGUE-CHKNODE Wed Mar 6 22:43:48 UTC 2013 Job 16433865: chk_node.pl
      2.3-ad-2012.06.22. Starting Checks ... 16433865.bqs1.zeus.fairmont.rdhpcs.noaa.gov
      .
      .
      .
      .
      PROLOGUE-CHKNODE Thu Mar 7 11:53:11 UTC 2013 Job 16479113: chk_node.pl
      2.3-ad-2012.06.22. Starting Checks ... 16479113.bqs1.zeus.fairmont.rdhpcs.noaa.gov
      LustreError: 2889:0:(lib-move.c:184:lnet_match_md()) ASSERTION(me == md->md_me)
      failed
      LustreError: 2889:0:(lib-move.c:184:lnet_match_md()) LBUG
      Pid: 2889, comm: kiblnd_sd_04
      Call Trace:
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<(null)>] (null)
      PGD 334320067 PUD 336739067 PMD 0
      Oops: 0010 [#1] SMP
      last sysfs file:
      /sys/devices/pci0000:00/0000:00:09.0/0000:02:00.0/infiniband/mlx4_1/hca_type
      CPU 12
      Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U)
      ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf
      ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa
      mlx4_ib ib_mad iw_cxgb4 iw_cxgb3 ib_core xpmem(U) xp gru xvma(U) numatools(U)
      microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma
      shpchp ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs
      lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb4i
      cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi
      scsi_transport_iscsi [last unloaded: ipmi_msghandler]

      Pid: 2889, comm: kiblnd_sd_04 Not tainted 2.6.32-279.5.2.el6.x86_64 #1 SGI.COM
      AltixICE8400IP105/X8DTT-HallieS
      RIP: 0010:[<0000000000000000>] [<(null)>] (null)
      RSP: 0018:ffff880325c159b8 EFLAGS: 00010246
      RAX: ffff880325c15a1c RBX: ffff880325c15a10 RCX: ffffffffa048c320
      RDX: ffff880325c15a50 RSI: ffff880325c15a10 RDI: ffff880325c14000
      RBP: ffff880325c15a50 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000cbe0
      R13: ffffffffa048c320 R14: 0000000000000000 R15: ffff8800282c3fc0
      FS: 00007ffff7fe8700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 000000032c5fd000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process kiblnd_sd_04 (pid: 2889, threadinfo ffff880325c14000, task ffff8803263b7540)
      Stack:
      ffffffff8100e520 ffff880325c15a1c ffff8803263b7540 ffffffffa04f8914
      <d> 00000000a04fb8d8 ffff880325c14000 ffff880325c15fd8 ffff880325c14000
      <d> 000000000000000c ffff8800282c0000 ffff880325c15a50 ffff880325c15a20
      Call Trace:
      [<ffffffff8100e520>] ? dump_trace+0x190/0x3b0
      [<ffffffffa04808f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa0480f25>] lbug_with_loc+0x75/0xe0 [libcfs]
      [<ffffffffa0489696>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      [<ffffffffa04e7cdc>] lnet_match_md+0x2fc/0x350 [lnet]
      [<ffffffffa048172e>] ? cfs_free+0xe/0x10 [libcfs]
      [<ffffffffa04f73af>] ? lnet_nid2peer_locked+0x2f/0x540 [lnet]
      [<ffffffffa04ece8a>] lnet_parse+0x108a/0x1b30 [lnet]
      [<ffffffffa06ca2fb>] kiblnd_handle_rx+0x2cb/0x600 [ko2iblnd]
      [<ffffffff8104f2bd>] ? check_preempt_curr+0x6d/0x90
      [<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
      [<ffffffffa06caea2>] kiblnd_rx_complete+0x252/0x3e0 [ko2iblnd]
      [<ffffffff81060262>] ? default_wake_function+0x12/0x20
      [<ffffffff8104e309>] ? __wake_up_common+0x59/0x90
      [<ffffffffa06cb092>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
      [<ffffffffa06cb3af>] kiblnd_scheduler+0x29f/0x760 [ko2iblnd]
      [<ffffffff81127e40>] ? __free_pages+0x60/0xa0
      [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffffa06cb110>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
      [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: Bad RIP value.
      RIP [<(null)>] (null)
      RSP <ffff880325c159b8>
      CR2: 0000000000000000
      ~~~~~~~~~~~~~~~~~~~~~~~~~~

      Activity

            [LU-3010] client crashes on RHEL6 with Lustre 1.8.8

            Hi Kit, a vmcore with free pages included would be a bit more helpful. Including free pages also means kdump doesn't have to walk the data structures that track free pages, so even if those structures are corrupted, kdump can still create the dump; kdump may hang if asked to exclude free pages when those tracking structures are corrupted. At this point, I think kernel-debug is the best way to go - we'd be extremely lucky to find anything useful by going through that much code almost blindly, like trying to walk out of a dark rain forest with little guidance. The callback pointer corruption was a good indication that somebody else stepped on our toes, because our code doesn't change that pointer at all, and I've already double-checked all the pointer arithmetic code in lnet to make sure it was not ourselves who shot our own feet.

            Oleg, thanks for the comment on file join. I saw errors like:
            LustreError: 12905:0:(file.c:3331:ll_inode_revalidate_fini()) failure -2 inode 406847678

            ... before the assertion happened. Does this ring any alarm bells for you?

            isaac Isaac Huang (Inactive)
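The free-pages point above comes down to makedumpfile's dump level: bit 16 of the `-d` value means "exclude free pages", so dropping that bit keeps free pages in the vmcore. A hedged sketch of the `/etc/kdump.conf` change (the exact existing collector line on these nodes is an assumption):

```shell
# /etc/kdump.conf - core_collector controls what makedumpfile filters out.
# Dump-level bits: 1=zero pages, 2=non-private cache, 4=private cache,
# 8=user process data, 16=free pages; "-d 31" filters all five.
#
# Typical default (free pages excluded):
#   core_collector makedumpfile -c -d 31
#
# To keep free pages in the dump, drop bit 16 from the level:
#   core_collector makedumpfile -c -d 15
```

After editing, the kdump service needs to be restarted for the new collector line to take effect.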
            green Oleg Drokin added a comment -

            Isaac: ll_file_join is (was) a known problematic area that was removed in later versions.
            It's not used by anybody anyway.


            Hi Isaac,

            Thanks for the analysis. I will try to get the debug kernel installed, but it might take a little bit of time. Would the full vmcore be of any use? Also this seems to appear fairly consistently with RHEL6 clients. Do you think it would be possible to look at differences in the kernel functions that the Lustre client calls? Perhaps there are too many, but if at all possible, it would be good to keep looking for the cause.

            kitwestneat Kit Westneat (Inactive)

            I've been digging through the crash dump, and it appeared that the kernel slab allocation states got corrupted somehow.

            The assertion failure on the callback pointer in ptlrpc_master_callback() was an indication of memory corruption, because the 'callback' pointer in the MD object is NEVER changed by the code after initialization. It looks almost impossible for this to be the result of racing code - no code changes that 'callback' pointer at all. Use-after-free can also be ruled out: the MD object is bigger than a page, and the crash dump contained only the initial part of the object, because the rest of the object resided in a free page that was not included in the partial dump - if it were use-after-free, both pages should have been excluded from the dump. Moreover, the initial part of the MD object contained a correct reference counter. So I was led to believe that the SLAB allocation state got screwed up and the same chunk of memory was returned to multiple callers, and thus the 'callback' pointer got stepped on by someone else unknowingly.

            In the past, we've seen similar bugs where the root cause was never found, and the solution was to disable some experimental Lustre feature that looked suspicious. In fact, the culprit might not be Lustre at all - any kernel code is technically capable of such screw-ups. My suggestions:
            1. Kit, would it be possible to run the kernel-debug RPM (and corresponding Lustre modules which might need to be rebuilt) on a couple of clients where the error has been observed? The kernel-debug kernel should have enabled many SLAB and VM debugging options, and I think that'd catch any SLAB/VM issues much earlier and would move us closer to the root cause.

            2. Doug, I think it'd make sense to have some Lustre folks double-check the errors from the file system level. I was glancing over the code, and some of it didn't look good, e.g.:
            ll_file_join():
            2696 tail = igrab(tail_filp->f_dentry->d_inode);

            igrab() can return NULL, yet the code doesn't check for that. Maybe ll_file_join() isn't even compiled in; it was just something that raised my eyebrows.

            isaac Isaac Huang (Inactive)

            Thanks - this one worked perfectly, no warning at all.

            isaac Isaac Huang (Inactive)
            kitwestneat Kit Westneat (Inactive) added a comment - Correct kernel-debug RPMs:
            http://eu.ddn.com:8080/lustre/kernel-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm
            http://eu.ddn.com:8080/lustre/kernel-debuginfo-common-x86_64-2.6.32-279.5.2.el6.x86_64.rpm

            Oh doh, I uploaded the wrong debuginfo, 279.19.1 vs 279.5.2. I am uploading the correct one now and will send you the link.

            kitwestneat Kit Westneat (Inactive)

            Strange - with the RHEL debuginfo, I got the same version warning, and then crash failed with other errors. I'll continue debugging with the CentOS debuginfo, which seemed to work despite the version warning. Also, it seems the original report was based on a different crash dump, because both the DATE and UPTIME in this dump differ. It'd give me more data if that dump were also available. Although this dump had a different failure, it's similar in many ways.

            isaac Isaac Huang (Inactive)
            kitwestneat Kit Westneat (Inactive) added a comment - Here are the kernel-debug RPMs:
            http://eu.ddn.com:8080/lustre/kernel-debuginfo-2.6.32-279.19.1.el6.x86_64.rpm
            http://eu.ddn.com:8080/lustre/kernel-debuginfo-common-x86_64-2.6.32-279.19.1.el6.x86_64.rpm
            Thanks, Kit

            Hi Isaac,

            It looks like the RHEL and CentOS kernel RPMs are slightly different, and they were running the RHEL one. I am getting the RHEL debuginfo now and will update the ticket when I get it uploaded. If you happen to have an RHN account, that should work too.

            kitwestneat Kit Westneat (Inactive)

            Hi Kit,

            The kernel-debug-2.6.32-279.5.2.el6.x86_64.rpm you pointed to is a kernel with all sorts of run-time debugging options enabled; usually it's not used on production systems. Also, when I used the vmlinux from the corresponding debuginfo RPM, crash wouldn't even start. I then tried the debuginfo RPM for the normal kernel (i.e. without debugging options) of the same version, and crash would start, but with a warning about a kernel version inconsistency between vmlinux and the dumpfile. Can you please verify the client kernel version and point me to where the vmlinux is?

            With vmlinux from kernel-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm, I got a panic that's different from the one reported. Instead of "(lib-move.c:184:lnet_match_md()) ASSERTION(me == md->md_me) failed" in process kiblnd_sd_04, I got:
            (events.c:418:ptlrpc_master_callback()) ASSERTION(callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback) failed
            in kiblnd_sd_01.

            isaac Isaac Huang (Inactive)

            People

              isaac Isaac Huang (Inactive)
              kitwestneat Kit Westneat (Inactive)
              Votes: 0
              Watchers: 10