Lustre / LU-975

kernel panic on OSS when using LVM mirror regionsize greater than 512k


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 1.8.7
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • Environment: Lustre 1.8.5, OS RHEL 5.5
    • Severity: 3
    • 24,546
    • 6493

    Description

      Our customer is running Lustre 1.8.5 (from Oracle) on RHEL 5.5. The OST disks are mirrored with LVM. If the LVM regionsize is set to anything greater than the default of 512k, the OSSs randomly crash with:
      Sep 3 06:25:09 sklusp02a kernel: Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
      Sep 3 06:25:09 sklusp02a kernel: [<ffffffff8822b2fd>] :dm_mod:dispatch_io+0xb9/0x19b
      Sep 3 06:25:09 sklusp02a kernel: PGD 11e6408067 PUD 11e640d067 PMD 0
      Sep 3 06:25:09 sklusp02a kernel: Oops: 0000 [1] SMP
      Sep 3 06:25:09 sklusp02a kernel: last sysfs file: /devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-1/target1:0:1/1:0:1:3/timeout
      Sep 3 06:25:09 sklusp02a kernel: CPU 1
      Sep 3 06:25:09 sklusp02a kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U)
      lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) dm_log_clustered(U) lock_dlm(U) gfs2(U) dlm(U) configfs(U)
      mptctl(U) mptbase(U) ipmi_watchdog(U) ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) i2c_dev(U) i2c_core(U) lockd(U) sunrpc(U) bonding(U) ipv6(U)
      xfrm_nalgo(U) crypto_api(U) dm_round_robin(U) dm_multipath(U) scsi_dh(U) parport_pc(U) lp(U) parport(U) sg(U) shpchp(U) hpilo(U) pcspkr(U)
      serio_raw(U) bnx2x(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U)
      usb_storage(U) qla2xxx(U) scsi_transport_fc(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
      Sep 3 06:25:09 sklusp02a kernel: Pid: 10065, comm: kmirrord Tainted: G 2.6.18-194.17.1.el5_lustre.1.8.5 #1
      Sep 3 06:25:09 sklusp02a kernel: RIP: 0010:[<ffffffff8822b2fd>] [<ffffffff8822b2fd>] :dm_mod:dispatch_io+0xb9/0x19b
      Sep 3 06:25:09 sklusp02a kernel: RSP: 0018:ffff8111e367fb60 EFLAGS: 00010206
      Sep 3 06:25:09 sklusp02a kernel: RAX: 00000000264af800 RBX: 0000000000000000 RCX: ffffffff8008cf93
      Sep 3 06:25:09 sklusp02a kernel: RDX: 0000000000000050 RSI: ffff8111d9ab00c0 RDI: 0000000000000001
      Sep 3 06:25:09 sklusp02a kernel: RBP: 0000000000000800 R08: 0000000000000000 R09: ffff8111edda3040
      Sep 3 06:25:09 sklusp02a kernel: R10: 0000000000000001 R11: ffffffff80044fcd R12: ffff8111e367fc40
      Sep 3 06:25:09 sklusp02a kernel: R13: ffff8111e367fdc0 R14: ffff811212d01e00 R15: 0000000000000000
      Sep 3 06:25:09 sklusp02a kernel: FS: 0000000000000000(0000) GS:ffff81121ffb09c0(0000) knlGS:0000000000000000
      Sep 3 06:25:09 sklusp02a kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      Sep 3 06:25:09 sklusp02a kernel: CR2: 0000000000000040 CR3: 00000011e6842000 CR4: 00000000000006e0
      Sep 3 06:25:09 sklusp02a kernel: Process kmirrord (pid: 10065, threadinfo ffff8111e367e000, task ffff8111edda3040)

      With Lustre 1.8.2 and RHEL 5.4 there was no issue using a regionsize of 4M. The customer uses the larger regionsize to speed up remirroring after a system crash.
      The customer logged this issue with Oracle Lustre support, and Oracle suggested upgrading to their 1.8.7 release. In the meantime the customer switched from Oracle to Whamcloud support.
      We currently plan to upgrade to Whamcloud 1.8.7. Our question is whether this issue with the LVM mirror regionsize is known to Whamcloud, and whether the upgrade to 1.8.7 will solve it.
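
      For reference, the mirror configuration in question can be set up roughly as follows. This is only a sketch: the volume group and LV names (vg_ost, ost0001) and the LV size are hypothetical placeholders, and the 4M value matches the regionsize the customer uses instead of the 512k default.

      # create a mirrored OST logical volume with a 4M mirror region size
      # (vg_ost/ost0001 and the 1T size are placeholder values)
      lvcreate -m 1 --regionsize 4M -L 1T -n ost0001 vg_ost
      # confirm the region size actually in use for the mirror
      lvs -o +regionsize vg_ost/ost0001

      Formatting and mounting the LV as an OST is otherwise unchanged; only the --regionsize value differs from the default setup.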

    Attachments

    Activity

    People

      Assignee: wc-triage (WC Triage)
      Reporter: hpsk (HP Slovakia team) (Inactive)
      Votes: 0
      Watchers: 1

    Dates

      Created:
      Updated:
      Resolved: