
soft lockup on v2.9 Lustre clients (ldlm?)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.7
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Environment: RHEL7.3
    • Severity: 3

    Description

      Hi,

      We have recently configured a new filesystem running v2.9 and the clients are sporadically falling over every 2-3 days. The servers seem to be fine and a reboot of the clients seems to bring everything back to life.

      This is the exact same workload that we ran on a v2.8 based cluster (backups with hard links) and we have not seen any similar issues. Here is a snippet of the logs from the moment it started to have issues:

      Mar 20 00:50:27 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]
      Mar 20 00:50:27 foxtrot1 kernel: Modules linked in: mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic nfsv3 nfs crypto_null fscache libcfs(OE) bonding intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm irqbypass sg ipmi_devintf acpi_pad acpi_power_meter ipmi_si sb_edac ipmi_msghandler edac_core mei_me mei shpchp lpc_ich nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp scsi_transport_iscsi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 i2c_algo_bit ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea lrw sysfillrect gf128mul
      Mar 20 00:50:27 foxtrot1 kernel: sysimgblt glue_helper fb_sys_fops ablk_helper cryptd ttm dm_multipath drm ahci libahci bnx2x libata i2c_core ptp pps_core megaraid_sas ntb mdio libcrc32c wmi fjes sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cdrom]
      Mar 20 00:50:27 foxtrot1 kernel: CPU: 30 PID: 3201 Comm: ldlm_poold Tainted: G           OE  ------------   3.10.0-514.el7_lustre.x86_64 #1
      Mar 20 00:50:27 foxtrot1 kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
      Mar 20 00:50:27 foxtrot1 kernel: task: ffff881ff4613ec0 ti: ffff881ff47d0000 task.ti: ffff881ff47d0000
      Mar 20 00:50:27 foxtrot1 kernel: RIP: 0010:[<ffffffff8168da62>]  [<ffffffff8168da62>] _raw_spin_lock+0x32/0x50
      Mar 20 00:50:27 foxtrot1 kernel: RSP: 0018:ffff881ff47d3c80  EFLAGS: 00000202
      Mar 20 00:50:27 foxtrot1 kernel: RAX: 0000000000006091 RBX: ffffffff810cde74 RCX: 000000000000c172
      Mar 20 00:50:27 foxtrot1 kernel: RDX: 000000000000c174 RSI: 000000000000c174 RDI: ffff883fee6bb418
      Mar 20 00:50:27 foxtrot1 kernel: RBP: ffff881ff47d3c80 R08: 0000000000000000 R09: 0000000000000001
      Mar 20 00:50:27 foxtrot1 kernel: R10: 000000010b885bfc R11: 0000000000000400 R12: 0000000000000000
      Mar 20 00:50:27 foxtrot1 kernel: R13: ffff881fff956cc0 R14: ffff881fff956cc0 R15: ffff881ff8f1f400
      Mar 20 00:50:27 foxtrot1 kernel: FS:  0000000000000000(0000) GS:ffff881fffbc0000(0000) knlGS:0000000000000000
      Mar 20 00:50:27 foxtrot1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Mar 20 00:50:27 foxtrot1 kernel: CR2: 00007fbeb5124f88 CR3: 00000000019ba000 CR4: 00000000000407e0
      Mar 20 00:50:27 foxtrot1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Mar 20 00:50:27 foxtrot1 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Mar 20 00:50:27 foxtrot1 kernel: Stack:
      Mar 20 00:50:27 foxtrot1 kernel: ffff881ff47d3cb0 ffffffffa0bc619c ffff883e78f13400 0000000000000000
      Mar 20 00:50:27 foxtrot1 kernel: 0000400000000000 0000008000000000 ffff881ff47d3d30 ffffffffa0be5d57
      Mar 20 00:50:27 foxtrot1 kernel: ffff881ff47d3d48 000000000bd9af10 ffffffffa0be3d70 00000000000cfc1f
      Mar 20 00:50:27 foxtrot1 kernel: Call Trace:
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0bc619c>] ldlm_lock_remove_from_lru_check+0x7c/0x1a0 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0be5d57>] ldlm_prepare_lru_list+0x257/0x480 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0be3d70>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0beadb1>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0bfe109>] ldlm_cli_pool_recalc+0x249/0x260 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0bfe767>] ldlm_pool_recalc+0x107/0x1d0 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0c0017c>] ldlm_pools_recalc+0x21c/0x3d0 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0c003c5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0c00330>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
      Mar 20 00:50:27 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:50:27 foxtrot1 kernel: Code: 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 55 83 e2 fe 0f b7 f2 48 89 e5 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 66 66 66 90 66 66 
      Mar 20 00:50:55 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]
      Mar 20 00:50:55 foxtrot1 kernel: Modules linked in: mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic nfsv3 nfs crypto_null fscache libcfs(OE) bonding intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm irqbypass sg ipmi_devintf acpi_pad acpi_power_meter ipmi_si sb_edac ipmi_msghandler edac_core mei_me mei shpchp lpc_ich nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp scsi_transport_iscsi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 i2c_algo_bit ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea lrw sysfillrect gf128mul
      Mar 20 00:50:55 foxtrot1 kernel: sysimgblt glue_helper fb_sys_fops ablk_helper cryptd ttm dm_multipath drm ahci libahci bnx2x libata i2c_core ptp pps_core megaraid_sas ntb mdio libcrc32c wmi fjes sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cdrom]
      Mar 20 00:50:55 foxtrot1 kernel: CPU: 30 PID: 3201 Comm: ldlm_poold Tainted: G           OEL ------------   3.10.0-514.el7_lustre.x86_64 #1
      Mar 20 00:50:55 foxtrot1 kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
      Mar 20 00:50:55 foxtrot1 kernel: task: ffff881ff4613ec0 ti: ffff881ff47d0000 task.ti: ffff881ff47d0000
      Mar 20 00:50:55 foxtrot1 kernel: RIP: 0010:[<ffffffff8168da62>]  [<ffffffff8168da62>] _raw_spin_lock+0x32/0x50
      Mar 20 00:50:55 foxtrot1 kernel: RSP: 0018:ffff881ff47d3c80  EFLAGS: 00000216
      Mar 20 00:50:55 foxtrot1 kernel: RAX: 0000000000003624 RBX: ffffffff810cde74 RCX: 000000000000123a
      Mar 20 00:50:55 foxtrot1 kernel: RDX: 0000000000001244 RSI: 0000000000001244 RDI: ffff883fee6bb418
      Mar 20 00:50:55 foxtrot1 kernel: RBP: ffff881ff47d3c80 R08: 0000000000000000 R09: 0000000000000001
      Mar 20 00:50:55 foxtrot1 kernel: R10: 000000010b885de8 R11: 0000000000000400 R12: 0000000000000000
      Mar 20 00:50:55 foxtrot1 kernel: R13: ffff881fff956cc0 R14: ffff881fff956cc0 R15: ffff881ff8f1f400
      Mar 20 00:50:55 foxtrot1 kernel: FS:  0000000000000000(0000) GS:ffff881fffbc0000(0000) knlGS:0000000000000000
      Mar 20 00:50:55 foxtrot1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Mar 20 00:50:55 foxtrot1 kernel: CR2: 00007fbeb5124f88 CR3: 00000000019ba000 CR4: 00000000000407e0
      Mar 20 00:50:55 foxtrot1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Mar 20 00:50:55 foxtrot1 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Mar 20 00:50:55 foxtrot1 kernel: Stack:
      Mar 20 00:50:55 foxtrot1 kernel: ffff881ff47d3cb0 ffffffffa0bc619c ffff883da559b200 0000000000000000
      Mar 20 00:50:55 foxtrot1 kernel: 0000400000000000 0000008000000000 ffff881ff47d3d30 ffffffffa0be5d57
      Mar 20 00:50:55 foxtrot1 kernel: ffff881ff47d3d48 000000000bd9af10 ffffffffa0be3d70 00000000000cfb78
      Mar 20 00:50:55 foxtrot1 kernel: Call Trace:
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0bc619c>] ldlm_lock_remove_from_lru_check+0x7c/0x1a0 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0be5d57>] ldlm_prepare_lru_list+0x257/0x480 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0be3d70>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0beadb1>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0bfe109>] ldlm_cli_pool_recalc+0x249/0x260 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0bfe767>] ldlm_pool_recalc+0x107/0x1d0 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0c0017c>] ldlm_pools_recalc+0x21c/0x3d0 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0c003c5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffffa0c00330>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
      Mar 20 00:50:55 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:50:55 foxtrot1 kernel: Code: 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 55 83 e2 fe 0f b7 f2 48 89 e5 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 66 66 66 90 66 66 
      Mar 20 00:51:03 foxtrot1 kernel: INFO: rcu_sched self-detected stall on CPU { 30}  (t=60001 jiffies g=32391243 c=32391242 q=296406)
      Mar 20 00:51:03 foxtrot1 kernel: Task dump for CPU 30:
      Mar 20 00:51:03 foxtrot1 kernel: ldlm_poold      R  running task        0  3201      2 0x00000008
      Mar 20 00:51:03 foxtrot1 kernel: ffff881ff4613ec0 0000000063a8c183 ffff881fffbc3db0 ffffffff810c41d8
      Mar 20 00:51:03 foxtrot1 kernel: 000000000000001e ffffffff81a1a780 ffff881fffbc3dc8 ffffffff810c7a79
      Mar 20 00:51:03 foxtrot1 kernel: 000000000000000f ffff881fffbc3df8 ffffffff811372a0 ffff881fffbd01c0
      Mar 20 00:51:03 foxtrot1 kernel: Call Trace:
      Mar 20 00:51:03 foxtrot1 kernel: <IRQ>  [<ffffffff810c41d8>] sched_show_task+0xa8/0x110
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810c7a79>] dump_cpu_task+0x39/0x70
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff811372a0>] rcu_dump_cpu_stacks+0x90/0xd0
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff8113a9f2>] rcu_check_callbacks+0x442/0x720
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810f2f80>] ? tick_sched_handle.isra.13+0x60/0x60
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff81099177>] update_process_times+0x47/0x80
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810f2f45>] tick_sched_handle.isra.13+0x25/0x60
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810f2fc1>] tick_sched_timer+0x41/0x70
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b4862>] __hrtimer_run_queues+0xd2/0x260
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b4e00>] hrtimer_interrupt+0xb0/0x1e0
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff8169819c>] ? call_softirq+0x1c/0x30
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810510d7>] local_apic_timer_interrupt+0x37/0x60
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff81698e0f>] smp_apic_timer_interrupt+0x3f/0x60
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff8169735d>] apic_timer_interrupt+0x6d/0x80
      Mar 20 00:51:03 foxtrot1 kernel: <EOI>  [<ffffffff810cde74>] ? update_curr+0x104/0x190
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff8168da67>] ? _raw_spin_lock+0x37/0x50
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0bc619c>] ldlm_lock_remove_from_lru_check+0x7c/0x1a0 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0be5d57>] ldlm_prepare_lru_list+0x257/0x480 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0be3d70>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0beadb1>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0bfe109>] ldlm_cli_pool_recalc+0x249/0x260 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0c0017c>] ldlm_pools_recalc+0x21c/0x3d0 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0c003c5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffffa0c00330>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
      Mar 20 00:51:03 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:51:05 foxtrot1 systemd: Started Session 3948 of user root.
      Mar 20 00:51:05 foxtrot1 systemd: Starting Session 3948 of user root.
      Mar 20 00:51:06 foxtrot1 systemd: Started Session 3949 of user root.
      Mar 20 00:51:06 foxtrot1 systemd: Starting Session 3949 of user root.
      Mar 20 00:51:31 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 22s! [ldlm_poold:3201]
      Mar 20 00:51:31 foxtrot1 kernel: Modules linked in: mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic nfsv3 nfs crypto_null fscache libcfs(OE) bonding intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm irqbypass sg ipmi_devintf acpi_pad acpi_power_meter ipmi_si sb_edac ipmi_msghandler edac_core mei_me mei shpchp lpc_ich nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp scsi_transport_iscsi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 i2c_algo_bit ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea lrw sysfillrect gf128mul
      Mar 20 00:51:31 foxtrot1 kernel: sysimgblt glue_helper fb_sys_fops ablk_helper cryptd ttm dm_multipath drm ahci libahci bnx2x libata i2c_core ptp pps_core megaraid_sas ntb mdio libcrc32c wmi fjes sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cdrom]
      Mar 20 00:51:31 foxtrot1 kernel: CPU: 30 PID: 3201 Comm: ldlm_poold Tainted: G           OEL ------------   3.10.0-514.el7_lustre.x86_64 #1
      Mar 20 00:51:31 foxtrot1 kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
      Mar 20 00:51:31 foxtrot1 kernel: task: ffff881ff4613ec0 ti: ffff881ff47d0000 task.ti: ffff881ff47d0000
      Mar 20 00:51:31 foxtrot1 kernel: RIP: 0010:[<ffffffff8168da62>]  [<ffffffff8168da62>] _raw_spin_lock+0x32/0x50
      Mar 20 00:51:31 foxtrot1 kernel: RSP: 0018:ffff881ff47d3cb0  EFLAGS: 00000212
      Mar 20 00:51:31 foxtrot1 kernel: RAX: 0000000000004ab0 RBX: 00000000000055da RCX: 00000000000055fc
      Mar 20 00:51:31 foxtrot1 kernel: RDX: 0000000000005618 RSI: 0000000000005618 RDI: ffff883fee6bb418
      Mar 20 00:51:31 foxtrot1 kernel: RBP: ffff881ff47d3cb0 R08: ffff882e7bf42b90 R09: 0000000000000001
      Mar 20 00:51:31 foxtrot1 kernel: R10: 000000010b885f78 R11: 0000000000000400 R12: 0000000000000000
      Mar 20 00:51:31 foxtrot1 kernel: R13: 0000000000000001 R14: 000000010b885f78 R15: 0000000000000400
      Mar 20 00:51:31 foxtrot1 kernel: FS:  0000000000000000(0000) GS:ffff881fffbc0000(0000) knlGS:0000000000000000
      Mar 20 00:51:31 foxtrot1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Mar 20 00:51:31 foxtrot1 kernel: CR2: 00007fbeb5124f88 CR3: 00000000019ba000 CR4: 00000000000407e0
      Mar 20 00:51:31 foxtrot1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Mar 20 00:51:31 foxtrot1 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Mar 20 00:51:31 foxtrot1 kernel: Stack:
      Mar 20 00:51:31 foxtrot1 kernel: ffff881ff47d3d30 ffffffffa0be5dc2 ffff881ff47d3d48 000000000bd9af10
      Mar 20 00:51:31 foxtrot1 kernel: ffffffffa0be3d70 00000000000cfae7 000000010b885f78 ffff883fee6bb418
      Mar 20 00:51:31 foxtrot1 kernel: ffff883fee6bb400 000cfae7000003b4 ffff883fee6bb448 ffff881ff47d3d48
      Mar 20 00:51:31 foxtrot1 kernel: Call Trace:
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0be5dc2>] ldlm_prepare_lru_list+0x2c2/0x480 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0be3d70>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0beadb1>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0bfe109>] ldlm_cli_pool_recalc+0x249/0x260 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0bfe767>] ldlm_pool_recalc+0x107/0x1d0 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0c0017c>] ldlm_pools_recalc+0x21c/0x3d0 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0c003c5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffffa0c00330>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
      Mar 20 00:51:31 foxtrot1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Mar 20 00:51:31 foxtrot1 kernel: Code: 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 55 83 e2 fe 0f b7 f2 48 89 e5 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 66 66 66 90 66 66 
      ..
      ..
      

      It continues to spam the logs and the server is very slow until we reboot it. We have seen this on 3 different clients now.
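A log this repetitive is easier to triage when summarized. The following is a minimal sketch, assuming the syslog line format shown in the excerpt above; the function name `summarize` and the three regular expressions are illustrative and not part of any Lustre tooling. It counts soft lockup reports per CPU and task, the symbol at RIP, and the first resolved frame of each call trace.

```python
import re
from collections import Counter

# Watchdog line, e.g. "BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)
# RIP line; captures the symbol name that precedes "+0xOFF/0xLEN".
RIP_RE = re.compile(r"RIP: .*\]\s+(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# A resolved call-trace frame that starts right after "kernel: ".
# Frames prefixed with "?" or "<IRQ>" deliberately do not match.
FRAME_RE = re.compile(r"kernel: \[<[0-9a-f]+>\] (?P<symbol>\w+)\+0x")


def summarize(lines):
    """Return (lockups, rip_symbols, first_frames) as Counters."""
    lockups, rips, first_frames = Counter(), Counter(), Counter()
    expect_first_frame = False
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            lockups[(int(m["cpu"]), m["task"], int(m["pid"]))] += 1
            continue
        m = RIP_RE.search(line)
        if m:
            rips[m["symbol"]] += 1
            continue
        if "Call Trace:" in line:
            expect_first_frame = True
            continue
        if expect_first_frame:
            # Only the line directly after "Call Trace:" is considered.
            m = FRAME_RE.search(line)
            if m:
                first_frames[m["symbol"]] += 1
            expect_first_frame = False
    return lockups, rips, first_frames


if __name__ == "__main__":
    sample = [
        "Mar 20 00:50:27 foxtrot1 kernel: NMI watchdog: BUG: soft lockup - CPU#30 stuck for 23s! [ldlm_poold:3201]",
        "Mar 20 00:50:27 foxtrot1 kernel: RIP: 0010:[<ffffffff8168da62>]  [<ffffffff8168da62>] _raw_spin_lock+0x32/0x50",
        "Mar 20 00:50:27 foxtrot1 kernel: Call Trace:",
        "Mar 20 00:50:27 foxtrot1 kernel: [<ffffffffa0bc619c>] ldlm_lock_remove_from_lru_check+0x7c/0x1a0 [ptlrpc]",
    ]
    for counter in summarize(sample):
        print(dict(counter))
```

Run against the full messages file, a summary like this makes it quick to confirm that every report comes from the same thread spinning in the same spinlock path, as the excerpt suggests.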

      On a related note, could we downgrade the client to v2.8 and keep v2.9 running on the servers as a potential quick fix for this instability?

      Attachments

        1. ldlm-locks.log.gz
          41 kB
        2. messages.gz
          94 kB
        3. sysrqt.txt.gz
          93 kB
        4. sysrq-txt
          240 kB

        Issue Links

          Activity

            [LU-9230] soft lockup on v2.9 Lustre clients (ldlm?)

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33130/
            Subject: LU-9230 ldlm: speed up preparation for list of lock cancel
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 21f7f777172d68ce61734487e23cf237900f927d

            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33130
            Subject: LU-9230 ldlm: speed up preparation for list of lock cancel
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: e5c0983ac7b2404c51c130420befc68057fdb4a6
            pjones Peter Jones added a comment -

            Mahmoud

            Please open a new ticket that refers to this one to request support help

            Peter


            mhanafi Mahmoud Hanafi added a comment -

            Can we get a backport to 2.10.5?
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26327/
            Subject: LU-9230 ldlm: speed up preparation for list of lock cancel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 651f2cdd2d8df1d4318f874993ab0706d16ce490

            paf Patrick Farrell (Inactive) added a comment -

            LU-9313 describes and addresses more or less the same problem but uses a different approach. The two approaches could probably be complementary, though. LU-9313 also includes a description of a simple way to reproduce the problem, which may be interesting. (FWIW, LU-9313 has been in production at Cray customers for around a year.)

            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/26327
            Subject: LU-9230 ldlm: speed up preparation for list of lock cancel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b40347a46a320ca23bd2d34b3108061bb2bd76f2

            adilger Andreas Dilger added a comment -

            Daire, if the MDT is 4x faster than before, it means that 4x as many locks will be in memory before they will be expired from the LRU due to age.
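The scaling in the comment above can be checked with simple arithmetic. This is a rough sketch, assuming locks enter the client LRU at a steady rate and leave only when they reach a maximum age; the function name, the optional cap, and the numbers are hypothetical and are not taken from this ticket or from Lustre defaults.

```python
def steady_state_lru_locks(lock_rate_per_s, max_age_s, lru_cap=None):
    """Estimate locks resident in the LRU when aging is the only eviction.

    Under a steady arrival rate, the resident count is rate * residence time.
    An optional cap models a fixed LRU size limit (hypothetical parameter).
    """
    resident = lock_rate_per_s * max_age_s
    if lru_cap is not None:
        resident = min(resident, lru_cap)
    return resident


if __name__ == "__main__":
    max_age_s = 3600        # hypothetical maximum lock age, in seconds
    old_rate = 1000         # hypothetical locks per second on the slower MDT
    new_rate = 4 * old_rate # an MDT that is 4x faster

    print(steady_state_lru_locks(old_rate, max_age_s))
    print(steady_state_lru_locks(new_rate, max_age_s))
```

With the maximum age held constant, the resident lock count grows linearly with the rate, so a 4x faster MDT leaves 4x as many locks for the pool thread to walk, which would be consistent with the lockup appearing only on the new filesystem.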
            ys Yang Sheng added a comment -

            Hi, Daire,

            Yes, this issue should be easy to trigger under a heavy workload, regardless of whether the client runs 2.8 or 2.9. So this is a good chance to verify whether it is fixed.

            Thanks,
            YangSheng

            People

              Assignee: ys Yang Sheng
              Reporter: daire Daire Byrne (Inactive)
              Votes: 0
              Watchers: 14
