Lustre / LU-3188

IOR fails due to client stack overrun

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.4.0
    • Environment: Hyperion/LLNL
    • Severity: 3
    • Rank: 7781

    Description

      This is currently killing all IOR runs on Hyperion:

      2013-04-17 15:55:20 BUG: scheduling while atomic: ior/44672/0x10000002
      2013-04-17 15:55:20 BUG: unable to handle kernel paging request at fffffffceb9ee000
      2013-04-17 15:55:20 IP: [<ffffffff810568e4>] update_curr+0x144/0x1f0
      2013-04-17 15:55:20 PGD 1a87067 PUD 0
      2013-04-17 15:55:20 Thread overran stack, or stack corrupted
      2013-04-17 15:55:20 Oops: 0000 [#1] SMP
      2013-04-17 15:55:20 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127
      2013-04-17 15:55:20 CPU 25
      2013-04-17 15:55:20 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sha512_generic sha256_generic ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ahci shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
      2013-04-17 15:55:20
      2013-04-17 15:55:20 Pid: 44672, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0HYFFG
      2013-04-17 15:55:20 RIP: 0010:[<ffffffff810568e4>]  [<ffffffff810568e4>] update_curr+0x144/0x1f0
      2013-04-17 15:55:20 RSP: 0018:ffff88089c523db8  EFLAGS: 00010086
      2013-04-17 15:55:20 RAX: ffff88086f748080 RBX: ffffffffad3be048 RCX: ffff880877f101c0
      2013-04-17 15:55:20 RDX: 00000000000192d8 RSI: 0000000000000000 RDI: ffff88086f7480b8
      2013-04-17 15:55:20 RBP: ffff88089c523de8 R08: ffffffff8160bb65 R09: 0000000000000007
      2013-04-17 15:55:20 R10: 0000000000000010 R11: 0000000000000007 R12: ffff88089c536768
      2013-04-17 15:55:20 R13: 000000000080f9df R14: 0000a8ac18cddce3 R15: ffff88086f748080
      2013-04-17 15:55:20 FS:  00002aaaafebf8c0(0000) GS:ffff88089c520000(0000) knlGS:0000000000000000
      2013-04-17 15:55:20 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2013-04-17 15:55:20 CR2: fffffffceb9ee000 CR3: 000000105cb6c000 CR4: 00000000000407e0
      2013-04-17 15:55:20 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2013-04-17 15:55:20 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2013-04-17 15:55:20 Process ior (pid: 44672, threadinfo ffff8807ad3be000, task ffff88086f748080)
      2013-04-17 15:55:20 Stack:
      2013-04-17 15:55:20  ffff88089c523dc8 ffffffff81013783 ffff88086f7480b8 ffff88089c536768
      2013-04-17 15:55:20 <d> 0000000000000000 0000000000000000 ffff88089c523e18 ffffffff81056e9b
      2013-04-17 15:55:20 <d> ffff88089c536700 0000000000000019 0000000000016700 0000000000000019
      2013-04-17 15:55:20 Call Trace:
      
      2013-04-17 15:55:20  <IRQ>
      2013-04-17 15:55:20  [<ffffffff81013783>] ? native_sched_clock+0x13/0x80
      2013-04-17 15:55:20 BUG: unable to handle kernel paging request at 000000000001400d
      2013-04-17 15:55:20 IP: [<ffffffff8100f4dd>] print_context_stack+0xad/0x140
      2013-04-17 15:55:20 PGD 1067f2c067 PUD 105b956067 PMD 0
      2013-04-17 15:55:20 Thread overran stack, or stack corrupted
      2013-04-17 15:55:20 Oops: 0000 [#2] SMP
      2013-04-17 15:55:20 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127
      2013-04-17 15:55:20 CPU 25
      2013-04-17 15:55:20 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sha512_generic sha256_generic ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ahci shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
      2013-04-17 15:55:21
      2013-04-17 15:55:21 Pid: 44672, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0HYFFG
      2013-04-17 15:55:21 RIP: 0010:[<ffffffff8100f4dd>]  [<ffffffff8100f4dd>] print_context_stack+0xad/0x140
      2013-04-17 15:55:21 RSP: 0018:ffff88089c5238c8  EFLAGS: 00010006
      2013-04-17 15:55:21 RAX: 0000000000013625 RBX: ffff88089c523dc0 RCX: 00000000000016f5
      2013-04-17 15:55:21 RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
      2013-04-17 15:55:21 RBP: ffff88089c523928 R08: 0000000000000000 R09: ffffffff8163fde0
      2013-04-17 15:55:21 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88089c523de8
      2013-04-17 15:55:21 R13: ffff8807ad3be000 R14: ffffffff81600460 R15: ffff88089c523fc0
      2013-04-17 15:55:21 FS:  00002aaaafebf8c0(0000) GS:ffff88089c520000(0000) knlGS:0000000000000000
      2013-04-17 15:55:21 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2013-04-17 15:55:21 CR2: 000000000001400d CR3: 000000105cb6c000 CR4: 00000000000407e0
      2013-04-17 15:55:21 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2013-04-17 15:55:21 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2013-04-17 15:55:21 Process ior (pid: 44672, threadinfo ffff8807ad3be000, task ffff88086f748080)
      

      The stack dump never completes. Instead, the node hits this BUG in a loop until the kernel stack is corrupt, and then it reboots.
      Will retest with SWL to see whether a change in parameters helps.
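
      To confirm the looping behavior described above from a saved console log, a small script can count the Oops entries and collect the faulting addresses. The following is only a minimal sketch: the function name `summarize_console` and the summary keys are hypothetical, and the patterns simply mirror the console lines quoted in this report.

```python
import re

# Patterns mirror the console lines quoted in this report.
OOPS_RE = re.compile(r"Oops: [0-9a-f]{4} \[#(\d+)\]")
FAULT_RE = re.compile(
    r"BUG: unable to handle kernel paging request at ([0-9a-f]+)"
)
OVERRUN_MSG = "Thread overran stack, or stack corrupted"


def summarize_console(lines):
    """Count oopses and collect faulting addresses (hypothetical helper)."""
    summary = {"oops_ids": [], "fault_addrs": [], "overrun_msgs": 0}
    for line in lines:
        match = OOPS_RE.search(line)
        if match:
            summary["oops_ids"].append(int(match.group(1)))
        match = FAULT_RE.search(line)
        if match:
            summary["fault_addrs"].append(match.group(1))
        if OVERRUN_MSG in line:
            summary["overrun_msgs"] += 1
    return summary


# Sample lines copied from the log in this report.
sample = [
    "2013-04-17 15:55:20 BUG: unable to handle kernel paging request at fffffffceb9ee000",
    "2013-04-17 15:55:20 Thread overran stack, or stack corrupted",
    "2013-04-17 15:55:20 Oops: 0000 [#1] SMP",
    "2013-04-17 15:55:20 BUG: unable to handle kernel paging request at 000000000001400d",
    "2013-04-17 15:55:20 Thread overran stack, or stack corrupted",
    "2013-04-17 15:55:20 Oops: 0000 [#2] SMP",
]
result = summarize_console(sample)
print(result)
```

      The same function accepts an open file object, so it could be pointed at the attached console log, for example `summarize_console(open("console.iwc108", errors="replace"))`. A steadily increasing list of Oops numbers would be consistent with the repeated faults seen here.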

      Attachments

        1. console.iwc108
          1.20 MB
          Keith Mannthey

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 10
