LU-9230: soft lockup on v2.9 Lustre clients (ldlm?)

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0, Lustre 2.10.7
    • Lustre 2.9.0
    • None
    • RHEL7.3
    • 3

    Description

      Hi,

      We recently configured a new filesystem running Lustre v2.9, and the clients sporadically lock up every 2-3 days. The servers seem to be fine, and rebooting an affected client brings it back to life.

      This is the exact same workload (backups with hard links) that we ran on a v2.8-based cluster, where we did not see any similar issues. Here is a snippet of the logs from the moment the problems started:

      Mar 20 00:50:27 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]
      Mar 20 00:50:27 foxtrot1 kernel: Modules linked in: mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic nfsv3 nfs crypto_null fscache libcfs(OE) bonding intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm irqbypass sg ipmi_devintf acpi_pad acpi_power_meter ipmi_si sb_edac ipmi_msghandler edac_core mei_me mei shpchp lpc_ich nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp scsi_transport_iscsi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 i2c_algo_bit ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea lrw sysfillrect gf128mul
      Mar 20 00:50:27 foxtrot1 kernel: sysimgblt glue_helper fb_sys_fops ablk_helper cryptd ttm dm_multipath drm ahci libahci bnx2x libata i2c_core ptp pps_core megaraid_sas ntb mdio libcrc32c wmi fjes sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cdrom]
      Mar 20 00:50:27 foxtrot1 kernel: CPU: 30 PID: 3201 Comm: ldlm_poold Tainted: G           OE  ------------   3.10.0-514.el7_lustre.x86_64 #1
      Mar 20 00:50:27 foxtrot1 kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
      Mar 20 00:50:27 foxtrot1 kernel: task: ffff881ff4613ec0 ti: ffff881ff47d0000 task.ti: ffff881ff47d0000
      Mar 20 00:50:27 foxtrot1 kernel: RIP: 0010:[<ffffffff8168da62>]  [<ffffffff8168da62>] _raw_spin_lock+0x32/0x50
      Mar 20 00:50:27 foxtrot1 kernel: RSP: 0018:ffff881ff47d3c80  EFLAGS: 00000202
      Mar 20 00:50:27 foxtrot1 kernel: RAX: 0000000000006091 RBX: ffffffff810cde74 RCX: 000000000000c172
      Mar 20 00:50:27 foxtrot1 kernel: RDX: 000000000000c174 RSI: 000000000000c174 RDI: ffff883fee6bb418
      Mar 20 00:50:27 foxtrot1 kernel: RBP: ffff881ff47d3c80 R08: 0000000000000000 R09: 0000000000000001
      Mar 20 00:50:27 foxtrot1 kernel: R10: 000000010b885bfc R11: 0000000000000400 R12: 0000000000000000
      Mar 20 00:50:27 foxtrot1 kernel: R13: ffff881fff956cc0 R14: ffff881fff956cc0 R15: ffff881ff8f1f400
      Mar 20 00:50:27 foxtrot1 kernel: FS:  0000000000000000(0000) GS:ffff881fffbc0000(0000) knlGS:0000000000000000
      Mar 20 00:50:27 foxtrot1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Mar 20 00:50:27 foxtrot1 kernel: CR2: 00007fbeb5124f88 CR3: 00000000019ba000 CR4: 00000000000407e0
      Mar 20 00:50:27 foxtrot1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Mar 20 00:50:27 foxtrot1 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Mar 20 00:50:27 foxtrot1 kernel: Stack:
      Mar 20 00:50:27 foxtrot1 kernel: ffff881ff47d3cb0 ffffffffa0bc619c ffff883e78f13400 0000000000000000
      Mar 20 00:50:27 foxtrot1 kernel: 0000400000000000 0000008000000000 ffff881ff47d3d30 ffffffffa0be5d57
      Mar 20 00:50:27 foxtrot1 kernel: ffff881ff47d3d48 000000000bd9af10 ffffffffa0be3d70 00000000000cfc1f
      Mar 20 00:50:27 foxtrot1 kernel: Call Trace:
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0bc619c>] ldlm_lock_remove_from_lru_check+0x7c/0x1a0 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0be5d57>] ldlm_prepare_lru_list+0x257/0x480 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0be3d70>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0beadb1>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0bfe109>] ldlm_cli_pool_recalc+0x249/0x260 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0bfe767>] ldlm_pool_recalc+0x107/0x1d0 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0c0017c>] ldlm_pools_recalc+0x21c/0x3d0 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0c003c5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0c00330>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:50:27 foxtrot1 kernel: Code: 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 55 83 e2 fe 0f b7 f2 48 89 e5 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 66 66 66 90 66 66 
      Mar 20 00:50:55 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]
      Mar 20 00:50:55 foxtrot1 kernel: Modules linked in: mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic nfsv3 nfs crypto_null fscache libcfs(OE) bonding intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm irqbypass sg ipmi_devintf acpi_pad acpi_power_meter ipmi_si sb_edac ipmi_msghandler edac_core mei_me mei shpchp lpc_ich nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp scsi_transport_iscsi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 i2c_algo_bit ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea lrw sysfillrect gf128mul
      Mar 20 00:50:55 foxtrot1 kernel: sysimgblt glue_helper fb_sys_fops ablk_helper cryptd ttm dm_multipath drm ahci libahci bnx2x libata i2c_core ptp pps_core megaraid_sas ntb mdio libcrc32c wmi fjes sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cdrom]
      Mar 20 00:50:55 foxtrot1 kernel: CPU: 30 PID: 3201 Comm: ldlm_poold Tainted: G           OEL ------------   3.10.0-514.el7_lustre.x86_64 #1
      Mar 20 00:50:55 foxtrot1 kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
      Mar 20 00:50:55 foxtrot1 kernel: task: ffff881ff4613ec0 ti: ffff881ff47d0000 task.ti: ffff881ff47d0000
      Mar 20 00:50:55 foxtrot1 kernel: RIP: 0010:[<ffffffff8168da62>]  [<ffffffff8168da62>] _raw_spin_lock+0x32/0x50
      Mar 20 00:50:55 foxtrot1 kernel: RSP: 0018:ffff881ff47d3c80  EFLAGS: 00000216
      Mar 20 00:50:55 foxtrot1 kernel: RAX: 0000000000003624 RBX: ffffffff810cde74 RCX: 000000000000123a
      Mar 20 00:50:55 foxtrot1 kernel: RDX: 0000000000001244 RSI: 0000000000001244 RDI: ffff883fee6bb418
      Mar 20 00:50:55 foxtrot1 kernel: RBP: ffff881ff47d3c80 R08: 0000000000000000 R09: 0000000000000001
      Mar 20 00:50:55 foxtrot1 kernel: R10: 000000010b885de8 R11: 0000000000000400 R12: 0000000000000000
      Mar 20 00:50:55 foxtrot1 kernel: R13: ffff881fff956cc0 R14: ffff881fff956cc0 R15: ffff881ff8f1f400
      Mar 20 00:50:55 foxtrot1 kernel: FS:  0000000000000000(0000) GS:ffff881fffbc0000(0000) knlGS:0000000000000000
      Mar 20 00:50:55 foxtrot1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Mar 20 00:50:55 foxtrot1 kernel: CR2: 00007fbeb5124f88 CR3: 00000000019ba000 CR4: 00000000000407e0
      Mar 20 00:50:55 foxtrot1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Mar 20 00:50:55 foxtrot1 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Mar 20 00:50:55 foxtrot1 kernel: Stack:
      Mar 20 00:50:55 foxtrot1 kernel: ffff881ff47d3cb0 ffffffffa0bc619c ffff883da559b200 0000000000000000
      Mar 20 00:50:55 foxtrot1 kernel: 0000400000000000 0000008000000000 ffff881ff47d3d30 ffffffffa0be5d57
      Mar 20 00:50:55 foxtrot1 kernel: ffff881ff47d3d48 000000000bd9af10 ffffffffa0be3d70 00000000000cfb78
      Mar 20 00:50:55 foxtrot1 kernel: Call Trace:
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0bc619c>] ldlm_lock_remove_from_lru_check+0x7c/0x1a0 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0be5d57>] ldlm_prepare_lru_list+0x257/0x480 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0be3d70>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0beadb1>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0bfe109>] ldlm_cli_pool_recalc+0x249/0x260 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0bfe767>] ldlm_pool_recalc+0x107/0x1d0 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0c0017c>] ldlm_pools_recalc+0x21c/0x3d0 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0c003c5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0c00330>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:50:55 foxtrot1 kernel: Code: 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 55 83 e2 fe 0f b7 f2 48 89 e5 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 66 66 66 90 66 66 
      Mar 20 00:51:03 foxtrot1 kernel: INFO: rcu_sched self-detected stall on CPU { 30}  (t=60001 jiffies g=32391243 c=32391242 q=296406)
      Mar 20 00:51:03 foxtrot1 kernel: Task dump for CPU 30:
      Mar 20 00:51:03 foxtrot1 kernel: ldlm_poold      R  running task        0  3201      2 0x00000008
      Mar 20 00:51:03 foxtrot1 kernel: ffff881ff4613ec0 0000000063a8c183 ffff881fffbc3db0 ffffffff810c41d8
      Mar 20 00:51:03 foxtrot1 kernel: 000000000000001e ffffffff81a1a780 ffff881fffbc3dc8 ffffffff810c7a79
      Mar 20 00:51:03 foxtrot1 kernel: 000000000000000f ffff881fffbc3df8 ffffffff811372a0 ffff881fffbd01c0
      Mar 20 00:51:03 foxtrot1 kernel: Call Trace:
      Mar 20 00:51:03 foxtrot1 kernel: <IRQ>  [<ffffffff810c41d8>] sched_show_task+0xa8/0x110
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810c7a79>] dump_cpu_task+0x39/0x70
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff811372a0>] rcu_dump_cpu_stacks+0x90/0xd0
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff8113a9f2>] rcu_check_callbacks+0x442/0x720
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810f2f80>] ? tick_sched_handle.isra.13+0x60/0x60
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff81099177>] update_process_times+0x47/0x80
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810f2f45>] tick_sched_handle.isra.13+0x25/0x60
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810f2fc1>] tick_sched_timer+0x41/0x70
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b4862>] __hrtimer_run_queues+0xd2/0x260
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b4e00>] hrtimer_interrupt+0xb0/0x1e0
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff8169819c>] ? call_softirq+0x1c/0x30
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810510d7>] local_apic_timer_interrupt+0x37/0x60
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff81698e0f>] smp_apic_timer_interrupt+0x3f/0x60
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff8169735d>] apic_timer_interrupt+0x6d/0x80
      Mar 20 00:51:03 foxtrot1 kernel: <EOI>  [<ffffffff810cde74>] ? update_curr+0x104/0x190
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff8168da67>] ? _raw_spin_lock+0x37/0x50
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0bc619c>] ldlm_lock_remove_from_lru_check+0x7c/0x1a0 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0be5d57>] ldlm_prepare_lru_list+0x257/0x480 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0be3d70>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0beadb1>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0bfe109>] ldlm_cli_pool_recalc+0x249/0x260 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0c0017c>] ldlm_pools_recalc+0x21c/0x3d0 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0c003c5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0c00330>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:51:05 foxtrot1 systemd: Started Session 3948 of user root.
      Mar 20 00:51:05 foxtrot1 systemd: Starting Session 3948 of user root.
      Mar 20 00:51:06 foxtrot1 systemd: Started Session 3949 of user root.
      Mar 20 00:51:06 foxtrot1 systemd: Starting Session 3949 of user root.
      Mar 20 00:51:31 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 22s! [ldlm_poold:3201]
      Mar 20 00:51:31 foxtrot1 kernel: Modules linked in: mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic nfsv3 nfs crypto_null fscache libcfs(OE) bonding intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm irqbypass sg ipmi_devintf acpi_pad acpi_power_meter ipmi_si sb_edac ipmi_msghandler edac_core mei_me mei shpchp lpc_ich nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp scsi_transport_iscsi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 i2c_algo_bit ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea lrw sysfillrect gf128mul
      Mar 20 00:51:31 foxtrot1 kernel: sysimgblt glue_helper fb_sys_fops ablk_helper cryptd ttm dm_multipath drm ahci libahci bnx2x libata i2c_core ptp pps_core megaraid_sas ntb mdio libcrc32c wmi fjes sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cdrom]
      Mar 20 00:51:31 foxtrot1 kernel: CPU: 30 PID: 3201 Comm: ldlm_poold Tainted: G           OEL ------------   3.10.0-514.el7_lustre.x86_64 #1
      Mar 20 00:51:31 foxtrot1 kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
      Mar 20 00:51:31 foxtrot1 kernel: task: ffff881ff4613ec0 ti: ffff881ff47d0000 task.ti: ffff881ff47d0000
      Mar 20 00:51:31 foxtrot1 kernel: RIP: 0010:[<ffffffff8168da62>]  [<ffffffff8168da62>] _raw_spin_lock+0x32/0x50
      Mar 20 00:51:31 foxtrot1 kernel: RSP: 0018:ffff881ff47d3cb0  EFLAGS: 00000212
      Mar 20 00:51:31 foxtrot1 kernel: RAX: 0000000000004ab0 RBX: 00000000000055da RCX: 00000000000055fc
      Mar 20 00:51:31 foxtrot1 kernel: RDX: 0000000000005618 RSI: 0000000000005618 RDI: ffff883fee6bb418
      Mar 20 00:51:31 foxtrot1 kernel: RBP: ffff881ff47d3cb0 R08: ffff882e7bf42b90 R09: 0000000000000001
      Mar 20 00:51:31 foxtrot1 kernel: R10: 000000010b885f78 R11: 0000000000000400 R12: 0000000000000000
      Mar 20 00:51:31 foxtrot1 kernel: R13: 0000000000000001 R14: 000000010b885f78 R15: 0000000000000400
      Mar 20 00:51:31 foxtrot1 kernel: FS:  0000000000000000(0000) GS:ffff881fffbc0000(0000) knlGS:0000000000000000
      Mar 20 00:51:31 foxtrot1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Mar 20 00:51:31 foxtrot1 kernel: CR2: 00007fbeb5124f88 CR3: 00000000019ba000 CR4: 00000000000407e0
      Mar 20 00:51:31 foxtrot1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Mar 20 00:51:31 foxtrot1 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Mar 20 00:51:31 foxtrot1 kernel: Stack:
      Mar 20 00:51:31 foxtrot1 kernel: ffff881ff47d3d30 ffffffffa0be5dc2 ffff881ff47d3d48 000000000bd9af10
      Mar 20 00:51:31 foxtrot1 kernel: ffffffffa0be3d70 00000000000cfae7 000000010b885f78 ffff883fee6bb418
      Mar 20 00:51:31 foxtrot1 kernel: ffff883fee6bb400 000cfae7000003b4 ffff883fee6bb448 ffff881ff47d3d48
      Mar 20 00:51:31 foxtrot1 kernel: Call Trace:
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0be5dc2>] ldlm_prepare_lru_list+0x2c2/0x480 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0be3d70>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0beadb1>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0bfe109>] ldlm_cli_pool_recalc+0x249/0x260 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0bfe767>] ldlm_pool_recalc+0x107/0x1d0 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0c0017c>] ldlm_pools_recalc+0x21c/0x3d0 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0c003c5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0c00330>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:51:31 foxtrot1 kernel: Code: 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 55 83 e2 fe 0f b7 f2 48 89 e5 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 66 66 66 90 66 66 
      ..
      ..
      

      These messages continue to flood the logs, and the affected machine is very slow until we reboot it. We have now seen this on three different clients.
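
To gauge how often the watchdog fires and which thread it blames, a syslog file such as the attached messages.gz could be summarized with a short script. This is only a minimal sketch, assuming the "NMI watchdog: BUG: soft lockup" line format shown in the snippet above; the function name `summarize_soft_lockups` and the `sample` list are hypothetical and exist purely for illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the log snippet in this report:
#   "... kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Count soft-lockup reports per (command name, PID, CPU)."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            key = (match.group("comm"), int(match.group("pid")), int(match.group("cpu")))
            counts[key] += 1
    return counts


# Hypothetical input: the three watchdog lines quoted above plus one unrelated line.
sample = [
    "Mar 20 00:50:27 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]",
    "Mar 20 00:50:55 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]",
    "Mar 20 00:51:05 foxtrot1 systemd: Started Session 3948 of user root.",
    "Mar 20 00:51:31 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 22s! [ldlm_poold:3201]",
]

for (comm, pid, cpu), n in summarize_soft_lockups(sample).items():
    print(f"{comm}[{pid}] on CPU {cpu}: {n} soft-lockup reports")
```

On a real client the same function could be fed an open file handle (for example from `gzip.open(path, "rt")`) instead of the in-memory list, which would show whether the reports always name the same thread and CPU.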

      On a related note, could we downgrade the client to v2.8 and keep v2.9 running on the servers as a potential quick fix for this instability?

      Attachments

        1. ldlm-locks.log.gz
          41 kB
        2. messages.gz
          94 kB
        3. sysrqt.txt.gz
          93 kB
        4. sysrq-txt
          240 kB

            People

              ys Yang Sheng
              daire Daire Byrne (Inactive)