Lustre / LU-12377

stuck in ldlm thread on client when a large number of locks were canceled

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version: master

    Description

      When the client unlinked files that had a huge number of extent locks (e.g. strided files), one of the client's kworker threads went to 100% CPU and never came back. It looks like a deadlock in an ldlm thread on the client caused by many cancel calls.

      Tasks: 380 total,   2 running, 378 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us,  6.6 sy,  0.0 ni, 87.1 id,  6.2 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem : 13182929+total, 12711252+free,  2287060 used,  2429716 buff/cache
      KiB Swap:  1048572 total,  1048572 free,        0 used. 12699912+avail Mem 
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                 
        377 root      20   0       0      0      0 R 100.0  0.0  32:53.50 kworker/14:1    
      
      Jun  2 18:00:54 c210 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [ldlm_bl_43:4442]
      Jun  2 18:00:54 c210 kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) sunrpc sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt cryptd joydev iTCO_vendor_support sg ipmi_si shpchp ipmi_devintf wmi mei_me mei ioatdma ipmi_msghandler i2c_i801 pcspkr lpc_ich binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit
      Jun  2 18:00:54 c210 kernel: drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci ixgbe drm libahci libata mlx5_core(OE) crct10dif_pclmul crct10dif_common crc32c_intel megaraid_sas mlxfw(OE) devlink mdio i2c_core mlx_compat(OE) ptp pps_core dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      Jun  2 18:00:54 c210 kernel: CPU: 0 PID: 4442 Comm: ldlm_bl_43 Kdump: loaded Tainted: G           OEL ------------   3.10.0-862.el7.x86_64 #1
      Jun  2 18:00:54 c210 kernel: Hardware name: Supermicro SYS-1027R-WC1RT/X9DRW-CF/CTF, BIOS 3.0b 04/09/2014
      Jun  2 18:00:54 c210 kernel: task: ffffa04c3bb3eeb0 ti: ffffa031b58a0000 task.ti: ffffa031b58a0000
      Jun  2 18:00:54 c210 kernel: RIP: 0010:[<ffffffff86b08802>]  [<ffffffff86b08802>] native_queued_spin_lock_slowpath+0x122/0x200
      Jun  2 18:00:54 c210 kernel: RSP: 0018:ffffa031b58a3b48  EFLAGS: 00000246
      Jun  2 18:00:54 c210 kernel: RAX: 0000000000000000 RBX: ffffffffc0f84b7f RCX: 0000000000010000
      Jun  2 18:00:54 c210 kernel: RDX: ffffa04c7fcd9700 RSI: 0000000000590001 RDI: ffffa045b18e4948
      Jun  2 18:00:54 c210 kernel: RBP: ffffa031b58a3b48 R08: ffffa03c3fc19700 R09: 0000000000000000
      Jun  2 18:00:54 c210 kernel: R10: 0000000000000000 R11: fffff018b1d64480 R12: ffffa040a5c9b7c8
      Jun  2 18:00:54 c210 kernel: R13: ffffa03f8d5a13f8 R14: ffffa040a5c9b8b0 R15: 00000000000a5aff
      Jun  2 18:00:54 c210 kernel: FS:  0000000000000000(0000) GS:ffffa03c3fc00000(0000) knlGS:0000000000000000
      Jun  2 18:00:54 c210 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Jun  2 18:00:54 c210 kernel: CR2: 00007fd700006bc9 CR3: 000000203be54000 CR4: 00000000001607f0
      Jun  2 18:00:54 c210 kernel: Call Trace:
      Jun  2 18:00:54 c210 kernel: [<ffffffff8710842a>] queued_spin_lock_slowpath+0xb/0xf
      Jun  2 18:00:54 c210 kernel: [<ffffffff87115680>] _raw_spin_lock+0x20/0x30
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0c6914a>] cl_object_attr_lock+0x1a/0x20 [obdclass]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0f6cf86>] osc_ldlm_blocking_ast+0x2f6/0x3a0 [osc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dc730a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0c43892>] ? class_handle_unhash+0x32/0x50 [obdclass]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dd21d0>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dd676a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dd6e8b>] ldlm_cancel_lru_local+0x2b/0x30 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dd8076>] ldlm_cli_cancel+0x216/0x650 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffff86bf62b1>] ? __slab_free+0x81/0x2f0
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0f6ce0a>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0ddc7dd>] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0ddd101>] ldlm_bl_thread_main+0x531/0x700 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0ddcbd0>] ? ldlm_handle_bl_callback+0x4e0/0x4e0 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffff86abae31>] kthread+0xd1/0xe0
      Jun  2 18:00:54 c210 kernel: [<ffffffff86ac9667>] ? finish_task_switch+0x57/0x170
      Jun  2 18:00:54 c210 kernel: [<ffffffff86abad60>] ? insert_kthread_work+0x40/0x40
      Jun  2 18:00:54 c210 kernel: [<ffffffff8711f61d>] ret_from_fork_nospec_begin+0x7/0x21
      Jun  2 18:00:54 c210 kernel: [<ffffffff86abad60>] ? insert_kthread_work+0x40/0x40
      Jun  2 18:00:54 c210 kernel: Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 00 97 01 00 48 03 14 c5 a0 53 73 87 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 
      

          Activity


            In the LU-12377 case, lru_resize was disabled and all locks were kept in the client caches, and we didn't hit the same problem during the io500 run. But once io500 finished and removed all the files created by the IORs, a huge number of locks were canceled, and then we hit this issue.
            In LU-12832, lru_resize was enabled and many locks were being canceled in LRU fashion during the io500 run. In addition to that, a huge number of cancels happened at ior_hard_read.
            So the only difference is the trigger point of the cancels (after removing files vs. LRU cancels at read), but both eventually hit the same issue. I agree LU-12377 is the same as LU-12832.

            sihara Shuichi Ihara added a comment
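
            For reference, a minimal sketch of how lru_resize is typically toggled on a client (the parameter is the one named in this ticket; the numeric value is only illustrative, not taken from these runs):

              # A fixed, non-zero lru_size disables lru_resize and pins the LRU
              # at that many cached locks per namespace (as in the io500 runs above):
              lctl set_param ldlm.namespaces.*.lru_size=400

              # Setting lru_size back to 0 re-enables dynamic LRU resizing (lru_resize):
              lctl set_param ldlm.namespaces.*.lru_size=0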

            Yup, this looks like a duplicate.

            wshilong Wang Shilong (Inactive) added a comment

            This seems similar to LU-12832?

            adilger Andreas Dilger added a comment

            Lock conflicts.  Even if write/read is on the same clients (no shuffle, I think it's called?), because the job is unaligned writes from many processes to the same stripes, locks are constantly getting cancelled during the job, and replaced by other different locks.  The result is that at the end of the write, the locks that exist don't match up (like, at all) with what's needed to read the data back.  Often, the lock required for any particular read belongs to another client, or doesn't quite cover the whole region (or it does cover the whole region, but it conflicts with a read that is going to happen on another client... etc etc.).

            So the read has to cancel most of the write locks to complete.  Probably almost all of them.

            pfarrell Patrick Farrell (Inactive) added a comment

            Extent locks are held in the LRU until either they age out, the lru size is exceeded (or the server says the client has too many, if lru resize is enabled), or there is a conflict. In this case, the conflict is from the read.

            Yeah, reducing ldlm.namespaces.*.lru_max_age also helped; it canceled extent locks aggressively in the background.
            I didn't reproduce the same call trace after that change. BTW, why were other locks canceled even before they reached lru_max_age?

            sihara Shuichi Ihara added a comment
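
            A sketch of the lru_max_age tuning described above; note that the unit differs by release (seconds on older clients, milliseconds on newer ones), so the value here is only illustrative and should be checked against your version:

              # Show the current age limit and per-namespace lock counts:
              lctl get_param ldlm.namespaces.*.lru_max_age ldlm.namespaces.*.lock_count

              # Lower the age limit so idle extent locks are canceled in the
              # background before a conflict-driven cancel storm can build up:
              lctl set_param ldlm.namespaces.*.lru_max_age=600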

            My understanding is that it happens after the write, during the read.  The reads cancel the write locks because the write locks are all messy and don't tend to match up with what's required for reading the file.  (This is because the unaligned i/o from multiple clients ends up creating many locks of various unusual sizes & many are also cancelled, so the mapping of write locks across the file is quite irregular.)

            Extent locks are held in the LRU until either they age out, the lru size is exceeded (or the server says the client has too many, if lru resize is enabled), or there is a conflict.  In this case, the conflict is from the read.

            Extent locks are being cancelled as needed, but the problem is that when we try to cancel many, many extent locks on the same resource (in this case, 1 at a time from many threads) we see heavy contention on the lock protecting the resource.  This causes BL callbacks to take an extremely long time, resulting in evictions.

            pfarrell Patrick Farrell (Inactive) added a comment

            Why does this happen after the read, and why can't extent locks be canceled from time to time? As far as I observed with ldlm.namespaces.*.lock_count, locks of other types (open locks?) are canceled automatically, but extent locks are not?

            sihara Shuichi Ihara added a comment
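
            A sketch of the observation described above, using the lock_count parameter named in this ticket (the *osc*/*mdc* globs assume the usual client namespace naming):

              # Extent (data) locks live in the OSC namespaces, metadata locks
              # (open, layout, ...) in the MDC namespaces:
              lctl get_param ldlm.namespaces.*osc*.lock_count
              lctl get_param ldlm.namespaces.*mdc*.lock_count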

            No, it looks like the batch size is already pretty reasonable - Likely less than 64.

            Basically, it looks like our performance when cancelling a huge number of locks at once on the same resource & in the same namespace isn't very good.  This has been improved previously, but it's not really solved.

            This is related to LU-9213...

            I'm now thinking the issue might actually be that we're cancelling one lock at a time, from an absurd number of threads...  Because the existing write locks are not going to be cancelled until they are explicitly called back - We're just cancelling one lock at a time, each time a bl callback comes in.

            Some sort of mechanism to recognize this case might be in order...  If we've got a bunch of locks cancelled in a row, over and over, on a particular resource, then we should probably start cancelling all unused locks on that resource...

            Will think on this a little.

            pfarrell Patrick Farrell (Inactive) added a comment
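
            As a point of comparison to the per-resource idea above, the whole client LRU can already be flushed by hand with the standard lru_size=clear control; a sketch:

              # Cancel all unused (LRU) locks in every namespace on this client:
              lctl set_param ldlm.namespaces.*.lru_size=clear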

            So this is happening because we're trying to cancel too many locks on the hard write/hard read file.  It's taking too long to do that, and so some requests are timing out, leading to evictions, etc, etc.

            The way to fix this in general is probably to reduce the number of locks that are cancelled each time we start doing a cancellation.  This is just a guess, but if the problem is latency of lock cancellations, reducing the count should help with that.

            In the meantime, setting a low-ish lru_size limit should take care of the problem.  I'm trying 10000 currently, but we may be able to safely go much higher (previously I believe we had hundreds of thousands of locks on each client).  I'm not sure if it will help performance or not (it may hurt it slightly), but it should prevent this problem from occurring.

            pfarrell Patrick Farrell (Inactive) added a comment
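
            A sketch of the workaround described above, restricted to the OSC (extent lock) namespaces; 10000 is the value being tried here, not a general recommendation, and a non-zero lru_size also disables lru_resize:

              # Cap the cached-lock LRU so no OSC namespace accumulates hundreds
              # of thousands of extent locks:
              lctl set_param ldlm.namespaces.*osc*.lru_size=10000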

            I see this issue quite often at ior_hard_read in the io500 test, when ldlm cancels a huge number of extent locks for a striped file that was created with small chunks (47 KB) and strided writes by ior_hard_write.

            sihara Shuichi Ihara added a comment (edited)
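
            For context, the io500 ior-hard phase does strided, unaligned 47008-byte transfers from every rank into one shared file, roughly of the form below (the rank count, segment count, and file path are placeholders, and the exact options come from the io500 configuration, so treat this as an approximation):

              # Shared-file, segmented 47008-byte writes (ior_hard_write), later
              # read back by ior_hard_read:
              mpirun -np <ranks> ior -w -a POSIX -b 47008 -t 47008 -s <segments> -o /lustre/io500/ior_hard/file
              mpirun -np <ranks> ior -r -a POSIX -b 47008 -t 47008 -s <segments> -o /lustre/io500/ior_hard/file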

            People

              Assignee: wshilong Wang Shilong (Inactive)
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 7
