OSS node(s) crash with Kernel oops

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • Lustre 1.8.6
    • Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • 3
    • 21,804
    • 6507

    Description

      Sorry if this is a duplicate, but I couldn't find a similar bug.

      Failure is restricted to OSS nodes and occurs as follows:

      1. One OSS node crashes. Heartbeat manages to take over the resources on the standby node smoothly.

      There is no indication of IB errors in opensm.log, and no errors in /var/log/messages or /var/log/warn. No resource (CPU, memory, network, disk) is exhausted (I can provide the collectl files if needed). One thing worth noting is that the 'ldiskfs_inode_cache' slab grows steadily to over 1 GB until the node crashes (numslabs, objects, size). See the attached collectl excerpt for the slab output.
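For anyone who wants to watch the same slab without collectl, here is a minimal sketch that computes the memory held by one cache from `/proc/slabinfo`-style text. It assumes the slabinfo 2.x column order (name, active_objs, num_objs, objsize, ...). The function name `slab_usage_bytes` and the numbers in `SAMPLE` are made up for illustration and are not taken from the affected nodes.

```python
def slab_usage_bytes(slabinfo_text, cache_name):
    """Return num_objs * objsize for cache_name, or None if the cache is absent."""
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Data lines look like: name active_objs num_objs objsize objperslab pagesperslab : ...
        if len(fields) >= 4 and fields[0] == cache_name:
            num_objs = int(fields[2])
            objsize = int(fields[3])
            return num_objs * objsize
    return None


# Illustrative input only (hypothetical values, not from the ticket).
SAMPLE = """slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor>
ldiskfs_inode_cache 1000 1200 1024 4 1 : tunables 54 27 8 : slabdata 300 300 0
"""

usage = slab_usage_bytes(SAMPLE, "ldiskfs_inode_cache")
print(usage)
```

On a live server the text would come from reading `/proc/slabinfo` (usually as root) at regular intervals, so that growth over time becomes visible.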

      Anyway, we found the following message in the console log file (conman):

      jf92o05 login: BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
      IP: [<ffffffffa09bbdbd>] ost_rw_prolong_locks+0x18d/0x460 [ost]
      PGD 0
      Oops: 0000 [1] SMP
      last sysfs file: /sys/kernel/uevent_seqnum
      CPU 0
      Modules linked in: obdfilter(N) fsfilt_ldiskfs(N) ost(N) mgc(N) ldiskfs(N) lustre(N) lov(N) mdc(N) lquota(N) osc(N) ko2iblnd(N) ptlrpc(N) obdclass(N) lnet(N) lvfs(N) libcfs(N) quota_v2(N) quota_tree(N) jbd2(N) crc16(N) edd(N) nfs(N) lockd(N) nfs_acl(N) sunrpc(N) rdma_ucm(N) ib_sdp(N) rdma_cm(N) iw_cm(N) ib_addr(N) ib_ipoib(N) ib_cm(N) ib_sa(N) ipv6(N) ib_uverbs(N) ib_umad(N) iw_nes(N) libcrc32c(N) iw_cxgb3(N) cxgb3(N) ib_ipath(N) cpufreq_conservative(N) cpufreq_userspace(N) cpufreq_powersave(N) acpi_cpufreq(N) mlx4_ib(N) ib_mthca(N) ib_mad(N) ib_core(N) fuse(N) dm_crypt(N) crypto_blkcipher(N) loop(N) dm_round_robin(N) dm_multipath(N) scsi_dh(N) sr_mod(N) cdrom(N) ide_pci_generic(N) jmicron(N) ide_core(N) ata_generic(N) snd_hda_intel(N) thermal(N) snd_pcm(N) snd_timer(N) rtc_cmos(N) snd_page_alloc(N) ahci(N) processor(N) pata_jmicron(N) snd_hwdep(N) rtc_core(N) lpfc(N) libata(N) ses(N) thermal_sys(N) snd(N) rtc_lib(N) mlx4_core(N) pcspkr(N) i2c_i801(N) ohci1394(N) e1000e(N) serio_raw(N) enclosure(N) igb(N) soundcore(N) joydev(N) scsi_transport_fc(N) button(N) ieee1394(N) i2c_core(N) scsi_tgt(N) hwmon(N) dock(N) sg(N) linear(N) usbhid(N) hid(N) ff_memless(N) uhci_hcd(N) ehci_hcd(N) sd_mod(N) crc_t10dif(N) usbcore(N) dm_snapshot(N) dm_mod(N) ext3(N) jbd(N) mbcache(N) aacraid(N) scsi_mod(N) [last unloaded: libcfs]
      Supported: No
      Pid: 24183, comm: ll_ost_io_71 Tainted: G 2.6.27.39-0.1_lustre.1.8.4-default #1
      RIP: 0010:[<ffffffffa09bbdbd>] [<ffffffffa09bbdbd>] ost_rw_prolong_locks+0x18d/0x460 [ost]
      RSP: 0018:ffff8805bbd3bd00 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff8805bbd3bd40
      RDX: ffffffffa09bb480 RSI: ffff8805bbd3bd80 RDI: 0000000000000258
      RBP: ffff8801d97c41b0 R08: 0000000000000006 R09: 0000000000000000
      R10: ffff8805d0548c00 R11: ffff8805d9b5eb80 R12: 0000000000000006
      R13: ffff8801d97c40c8 R14: ffff8802ba95dc00 R15: ffff8805bbd3bd40
      FS: 00007fefa37f96f0(0000) GS:ffffffff80a33080(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00000000000000c8 CR3: 0000000000201000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ll_ost_io_71 (pid: 24183, threadinfo ffff8805bbd3a000, task ffff8805bbd38100)
      Stack: ffffffff80a23680 0000000000000000 ffff88062a43e7c0 ffffffffa07fd790
      ffff8805bbd3be40 ffffffff80498e16 0000000000000000 ffffffffffffffff
      ffff880815a27e00 00000000138da000 00000000138dafff 0000000000000000
      Call Trace:
      [<ffffffffa09bc1bb>] ost_rw_hpreq_check+0x12b/0x2b0 [ost]
      [<ffffffffa076c9c3>] ptlrpc_main+0xef3/0x15f0 [ptlrpc]
      [<ffffffff8020cf49>] child_rip+0xa/0x11

      2. Some time later, the node that took over the resources of the crashed node hangs as well.

      The log files and resource usage show the same picture (no resource is exhausted). The 'ldiskfs_inode_cache' slab again grows continuously before the server crashes (hangs), but its size stays moderate (~200 MB).

      A similar message appears in this node's console log file:

      -Separator ---- Sun Dec 11 20:10:01 CET 2011 ----
      general protection fault: 0000 [1] SMP
      last sysfs file: /sys/kernel/uevent_seqnum
      CPU 0
      Modules linked in: obdfilter(N) fsfilt_ldiskfs(N) ost(N) mgc(N) ldiskfs(N) lustre(N) lov(N) mdc(N) lquota(N) osc(N) ko2iblnd(N) ptlrpc(N) obdclass(N) lnet(N) lvfs(N) libcfs(N) quota_v2(N) quota_tree(N) jbd2(N) crc16(N) edd(N) nfs(N) lockd(N) nfs_acl(N) sunrpc(N) rdma_ucm(N) ib_sdp(N) rdma_cm(N) iw_cm(N) ib_addr(N) ib_ipoib(N) ib_cm(N) ib_sa(N) ipv6(N) ib_uverbs(N) ib_umad(N) iw_nes(N) libcrc32c(N) iw_cxgb3(N) cxgb3(N) ib_ipath(N) cpufreq_conservative(N) cpufreq_userspace(N) cpufreq_powersave(N) acpi_cpufreq(N) mlx4_ib(N) ib_mthca(N) ib_mad(N) ib_core(N) fuse(N) dm_crypt(N) crypto_blkcipher(N) loop(N) dm_round_robin(N) dm_multipath(N) scsi_dh(N) sr_mod(N) cdrom(N) ide_pci_generic(N) jmicron(N) ide_core(N) ata_generic(N) thermal(N) snd_hda_intel(N) snd_pcm(N) processor(N) snd_timer(N) ahci(N) pata_jmicron(N) rtc_cmos(N) snd_page_alloc(N) ses(N) lpfc(N) thermal_sys(N) ohci1394(N) libata(N) rtc_core(N) snd_hwdep(N) scsi_transport_fc(N) mlx4_core(N) enclosure(N) hwmon(N) i2c_i801(N) dock(N) joydev(N) rtc_lib(N) button(N) pcspkr(N) ieee1394(N) snd(N) serio_raw(N) igb(N) scsi_tgt(N) e1000e(N) soundcore(N) i2c_core(N) sg(N) linear(N) usbhid(N) hid(N) ff_memless(N) uhci_hcd(N) ehci_hcd(N) sd_mod(N) crc_t10dif(N) usbcore(N) dm_snapshot(N) dm_mod(N) ext3(N) jbd(N) mbcache(N) aacraid(N) scsi_mod(N) [last unloaded: libcfs]
      Supported: No
      Pid: 20502, comm: ll_ost_io_80 Tainted: G 2.6.27.39-0.1_lustre.1.8.4-default #1
      RIP: 0010:[<ffffffffa075ce94>] [<ffffffffa075ce94>] lustre_msg_buf+0x4/0x90 [ptlrpc]
      RSP: 0000:ffff8805cf82bdb0 EFLAGS: 00010282
      RAX: 0000000000000008 RBX: ffff88026b76a808 RCX: aaaaaaaaaaaaaaab
      RDX: 0000000000000018 RSI: 0000000000000002 RDI: 5a5a5a5a5a5a5a5a
      RBP: 0000000000000001 R08: ffff8805f0dae900 R09: 0000000000000000
      R10: 000000004ee5023d R11: ffff880c2d53edc0 R12: ffff88026b76a800
      R13: 0000000000000001 R14: ffff88026b76a800 R15: ffff8803067bc608
      FS: 00007f03bd6456f0(0000) GS:ffffffff80a33080(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000001ab9348 CR3: 0000000000201000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ll_ost_io_80 (pid: 20502, threadinfo ffff8805cf82a000, task ffff8805cf828800)
      Stack: ffff88026b76a800 ffff88026b76a808 ffff8805f5c6c800 ffff88026b76a808
      ffff8805f5c6c800 ffffffffa09b913b ffff88026b76a800 ffffffffa09bab0c
      0000000000000000 ffff8803067bc540 ffff8805f5c6c800 ffff88026b76a800
      Call Trace:
      [<ffffffffa09b913b>] ost_rw_hpreq_check+0xab/0x2b0 [ost]
      [<ffffffffa07699c3>] ptlrpc_main+0xef3/0x15f0 [ptlrpc]
      [<ffffffff8020cf49>] child_rip+0xa/0x11
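The two console dumps fault in different functions but share the same call path through `ost_rw_hpreq_check`. As a rough sketch of how such dumps can be compared mechanically, the snippet below extracts the faulting symbol and the call-trace frames from oops text. The helper name `summarize_oops` and the regular expression are my own assumptions about the line layout seen above, not part of any Lustre or kernel tool. The two input strings are excerpts copied from the logs in this ticket.

```python
import re

# One frame: "[<address>] symbol+0xoffset/0xsize [module]" (module is optional).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<sym>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)


def summarize_oops(text):
    """Return ((rip_symbol, rip_module), [(symbol, module), ...]) for one oops dump."""
    rip = None
    trace = []
    in_trace = False
    for raw in text.splitlines():
        line = raw.strip()
        m = FRAME_RE.search(line)
        if line.startswith("RIP:") and m:
            rip = (m.group("sym"), m.group("mod"))
        elif line.startswith("Call Trace:"):
            in_trace = True
        elif in_trace and m:
            trace.append((m.group("sym"), m.group("mod")))
    return rip, trace


FIRST = """RIP: 0010:[<ffffffffa09bbdbd>] [<ffffffffa09bbdbd>] ost_rw_prolong_locks+0x18d/0x460 [ost]
Call Trace:
[<ffffffffa09bc1bb>] ost_rw_hpreq_check+0x12b/0x2b0 [ost]
[<ffffffffa076c9c3>] ptlrpc_main+0xef3/0x15f0 [ptlrpc]
[<ffffffff8020cf49>] child_rip+0xa/0x11
"""

SECOND = """RIP: 0010:[<ffffffffa075ce94>] [<ffffffffa075ce94>] lustre_msg_buf+0x4/0x90 [ptlrpc]
Call Trace:
[<ffffffffa09b913b>] ost_rw_hpreq_check+0xab/0x2b0 [ost]
[<ffffffffa07699c3>] ptlrpc_main+0xef3/0x15f0 [ptlrpc]
[<ffffffff8020cf49>] child_rip+0xa/0x11
"""

rip1, trace1 = summarize_oops(FIRST)
rip2, trace2 = summarize_oops(SECOND)
# Frames that appear in both call traces, in the order of the first trace.
shared = [frame for frame in trace1 if frame in trace2]
print(rip1, rip2, shared)
```

Frames without a module suffix, such as `child_rip`, are reported with `None` as the module.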

      This time the system was broken. After the second node was booted manually, the system was operational again.

      The incident is restricted to two server node pairs. It has recurred for the past 3 weeks, roughly every 7 days (every weekend, but that might be by chance).

          Activity

            [LU-912] OSS node(s) crash with Kernel oops

            brian Brian Murrell (Inactive) added a comment - Duplicate of Lustre Bugzilla bug 21804.

            heckes Frank Heckes (Inactive) added a comment - OK, we need to update to 1.8.6. Many thanks for the pointer to the Bugzilla entry. I'm sorry for creating yet another ticket. Our problem is that we don't want to change the OS distribution (SLES 11), but that's a different story. You can close the ticket.

            johann Johann Lombardi (Inactive) added a comment - This looks like bugzilla 21804.

            heckes Frank Heckes (Inactive) added a comment - SLAB info of second crashed node.

            heckes Frank Heckes (Inactive) added a comment - SLAB info of first crashed node.

            People

              wc-triage WC Triage
              heckes Frank Heckes (Inactive)