Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.10.2
None
3
Description
We get quite a few soft lockups on our Lustre gateways (Lustre clients that export Lustre filesystems over NFS). Example:
Nov 13 00:26:06 foxtrot2 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [nfsd:11973]
Nov 13 00:26:06 foxtrot2 kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rsync:36079]
Nov 13 00:26:06 foxtrot2 kernel: Modules linked in: vfat fat dm_service_time mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase nfsv3 nfs fscache osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) dell_rbu libcfs(OE) bonding sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support kvm joydev dcdbas irqbypass sg shpchp ipmi_si ipmi_devintf ipmi_msghandler lpc_ich mei_me mei acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp mgag200 i2c_algo_bit drm_kms_helper scsi_transport_iscsi bnx2x syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ahci drm ghash_clmulni_intel
Nov 13 00:26:06 foxtrot2 kernel: libahci aesni_intel dm_multipath libata lrw gf128mul glue_helper ablk_helper cryptd megaraid_sas i2c_core ptp pps_core mdio libcrc32c wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: usb_storage]
Nov 13 00:26:06 foxtrot2 kernel: CPU: 1 PID: 36079 Comm: rsync Tainted: G W OE ------------ 3.10.0-693.5.2.el7_lustre.x86_64 #1
Nov 13 00:26:06 foxtrot2 kernel: Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.5.4 01/22/2016
Nov 13 00:26:06 foxtrot2 kernel: task: ffff883ff8a04f10 ti: ffff8815a1200000 task.ti: ffff8815a1200000
Nov 13 00:26:06 foxtrot2 kernel: RIP: 0010:[<ffffffff810fa332>] [<ffffffff810fa332>] native_queued_spin_lock_slowpath+0x112/0x1e0
Nov 13 00:26:06 foxtrot2 kernel: RSP: 0018:ffff8815a1203700 EFLAGS: 00000246
Nov 13 00:26:06 foxtrot2 kernel: RAX: 0000000000000000 RBX: ffff883fff017880 RCX: 0000000000090000
Nov 13 00:26:06 foxtrot2 kernel: RDX: ffff883fff4d7880 RSI: 0000000001390101 RDI: ffff881ff99da818
Nov 13 00:26:06 foxtrot2 kernel: RBP: ffff8815a1203700 R08: ffff883fff017880 R09: 0000000000000000
Nov 13 00:26:06 foxtrot2 kernel: R10: 0004c5dab524ba0b R11: 0000000000000000 R12: 0004c5dab524ba0b
Nov 13 00:26:06 foxtrot2 kernel: R13: 0000000000000000 R14: 0004c5dab39dc857 R15: ffff8815a12036e8
Nov 13 00:26:06 foxtrot2 kernel: FS: 00007f0ff1094740(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
Nov 13 00:26:06 foxtrot2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 13 00:26:06 foxtrot2 kernel: CR2: 00007fd6cb1e9000 CR3: 000000163eff9000 CR4: 00000000001407e0
Nov 13 00:26:06 foxtrot2 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 13 00:26:06 foxtrot2 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 13 00:26:06 foxtrot2 kernel: Stack:
Nov 13 00:26:06 foxtrot2 kernel: ffff8815a1203710 ffffffff8169e6bf ffff8815a1203720 ffffffff816abbf0
Nov 13 00:26:06 foxtrot2 kernel: ffff8815a12037a0 ffffffffc0c2d421 ffff8815a12037e0 ffffffffc0c2ba60
Nov 13 00:26:06 foxtrot2 kernel: 0000000000000000 00000161000ab602 0004c5dab524ba0b ffff88130fb65c00
Nov 13 00:26:06 foxtrot2 kernel: Call Trace:
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8169e6bf>] queued_spin_lock_slowpath+0xb/0xf
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff816abbf0>] _raw_spin_lock+0x20/0x30
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c2d421>] ldlm_prepare_lru_list+0x361/0x4e0 [ptlrpc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c2ba60>] ? ldlm_cancel_aged_no_wait_policy+0x70/0x70 [ptlrpc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c30c5a>] ldlm_cancel_lru_local+0x1a/0x30 [ptlrpc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c30e8e>] ldlm_prep_elc_req+0x21e/0x490 [ptlrpc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c31128>] ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc07c67a3>] mdc_intent_getattr_pack.isra.15+0x93/0x280 [mdc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc07c8f3b>] mdc_enqueue_base+0x9fb/0x18f0 [mdc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff810c45a3>] ? try_to_wake_up+0x183/0x340
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc07ca6cb>] mdc_intent_lock+0x26b/0x520 [mdc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c66243>] ? reply_in_callback+0x143/0x5e0 [ptlrpc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0972e30>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c2c7a0>] ? ldlm_expired_completion_wait+0x240/0x240 [ptlrpc]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0910e4f>] lmv_intent_lock+0x5cf/0x1b50 [lmv]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff810b8a01>] ? in_group_p+0x31/0x40
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc09738c5>] ? ll_i2suppgid+0x15/0x40 [lustre]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0973914>] ? ll_i2gids+0x24/0xb0 [lustre]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81114b02>] ? from_kgid+0x12/0x20
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0972e30>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0974feb>] ll_lookup_it+0x29b/0xee0 [lustre]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff810c8f28>] ? __enqueue_entity+0x78/0x80
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0976fbb>] ll_lookup_nd+0xbb/0x190 [lustre]
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8120b3dd>] lookup_real+0x1d/0x50
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8120bcb2>] __lookup_hash+0x42/0x60
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff816a13e2>] lookup_slow+0x42/0xa7
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8120f25b>] path_lookupat+0x77b/0x7b0
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff811df623>] ? kmem_cache_alloc+0x193/0x1e0
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81211c9f>] ? getname_flags+0x4f/0x1a0
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8120f2bb>] filename_lookup+0x2b/0xc0
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81212e37>] user_path_at_empty+0x67/0xc0
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81212ea1>] user_path_at+0x11/0x20
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff812063e3>] vfs_fstatat+0x63/0xc0
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff812069b1>] SYSC_newlstat+0x31/0x60
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81206c3e>] SyS_newlstat+0xe/0x10
Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
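To put a number on "quite a few", something like the following could tally the watchdog reports per process name. This is only a rough sketch: the regular expression is written against the "soft lockup" line format shown in the log above, and the function name `count_soft_lockups` is made up for illustration.

```python
import re
from collections import Counter

# Matches watchdog lines of the form seen in the log above, e.g.
#   "... BUG: soft lockup - CPU#0 stuck for 23s! [nfsd:11973]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Return a Counter of soft-lockup reports keyed by process name.

    `lines` is any iterable of syslog lines (hypothetical helper, not
    part of any Lustre or kernel tooling).
    """
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts
```

It can be fed an open log file directly, for example `count_soft_lockups(open(path))`, where `path` is wherever the kernel messages are logged on the gateway (an assumption; the location is not stated in this report).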
Campbell
Even the servers only need to be patched if you are using the project quotas feature. The patches that gave performance improvements in past versions have now been upstreamed and many customers prefer the simplified admin over project quotas...
Peter