Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
None
-
None
-
master
-
3
-
9223372036854775807
Description
when client unlinked files which have huge amount of extent locks (e.g. strided file) and one of client's kworker thread get 100% and never back. it looks deadlock in ldlm thread on client at many cancel calls.
Tasks: 380 total, 2 running, 378 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.0 us, 6.6 sy, 0.0 ni, 87.1 id, 6.2 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 13182929+total, 12711252+free, 2287060 used, 2429716 buff/cache KiB Swap: 1048572 total, 1048572 free, 0 used. 12699912+avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 377 root 20 0 0 0 0 R 100.0 0.0 32:53.50 kworker/14:1
Jun 2 18:00:54 c210 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [ldlm_bl_43:4442] Jun 2 18:00:54 c210 kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) sunrpc sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt cryptd joydev iTCO_vendor_support sg ipmi_si shpchp ipmi_devintf wmi mei_me mei ioatdma ipmi_msghandler i2c_i801 pcspkr lpc_ich binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit Jun 2 18:00:54 c210 kernel: drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci ixgbe drm libahci libata mlx5_core(OE) crct10dif_pclmul crct10dif_common crc32c_intel megaraid_sas mlxfw(OE) devlink mdio i2c_core mlx_compat(OE) ptp pps_core dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] Jun 2 18:00:54 c210 kernel: CPU: 0 PID: 4442 Comm: ldlm_bl_43 Kdump: loaded Tainted: G OEL ------------ 3.10.0-862.el7.x86_64 #1 Jun 2 18:00:54 c210 kernel: Hardware name: Supermicro SYS-1027R-WC1RT/X9DRW-CF/CTF, BIOS 3.0b 04/09/2014 Jun 2 18:00:54 c210 kernel: task: ffffa04c3bb3eeb0 ti: ffffa031b58a0000 task.ti: ffffa031b58a0000 Jun 2 18:00:54 c210 kernel: RIP: 0010:[<ffffffff86b08802>] [<ffffffff86b08802>] native_queued_spin_lock_slowpath+0x122/0x200 Jun 2 18:00:54 c210 kernel: RSP: 0018:ffffa031b58a3b48 EFLAGS: 00000246 Jun 2 18:00:54 c210 kernel: RAX: 0000000000000000 RBX: ffffffffc0f84b7f RCX: 0000000000010000 Jun 2 18:00:54 c210 kernel: RDX: ffffa04c7fcd9700 RSI: 0000000000590001 RDI: ffffa045b18e4948 Jun 2 18:00:54 c210 kernel: RBP: ffffa031b58a3b48 R08: ffffa03c3fc19700 R09: 0000000000000000 Jun 2 18:00:54 c210 kernel: R10: 0000000000000000 R11: fffff018b1d64480 R12: ffffa040a5c9b7c8 Jun 2 18:00:54 c210 kernel: R13: ffffa03f8d5a13f8 R14: ffffa040a5c9b8b0 R15: 00000000000a5aff Jun 2 18:00:54 c210 kernel: FS: 0000000000000000(0000) GS:ffffa03c3fc00000(0000) knlGS:0000000000000000 Jun 2 18:00:54 c210 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jun 2 18:00:54 c210 kernel: CR2: 00007fd700006bc9 CR3: 000000203be54000 CR4: 00000000001607f0 Jun 2 18:00:54 c210 kernel: Call Trace: Jun 2 18:00:54 c210 kernel: [<ffffffff8710842a>] queued_spin_lock_slowpath+0xb/0xf Jun 2 18:00:54 c210 kernel: [<ffffffff87115680>] _raw_spin_lock+0x20/0x30 Jun 2 18:00:54 c210 kernel: [<ffffffffc0c6914a>] cl_object_attr_lock+0x1a/0x20 [obdclass] Jun 2 18:00:54 c210 kernel: [<ffffffffc0f6cf86>] osc_ldlm_blocking_ast+0x2f6/0x3a0 [osc] Jun 2 18:00:54 c210 kernel: [<ffffffffc0dc730a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc] Jun 2 18:00:54 c210 kernel: [<ffffffffc0c43892>] ? class_handle_unhash+0x32/0x50 [obdclass] Jun 2 18:00:54 c210 kernel: [<ffffffffc0dd21d0>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc] Jun 2 18:00:54 c210 kernel: [<ffffffffc0dd676a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc] Jun 2 18:00:54 c210 kernel: [<ffffffffc0dd6e8b>] ldlm_cancel_lru_local+0x2b/0x30 [ptlrpc] Jun 2 18:00:54 c210 kernel: [<ffffffffc0dd8076>] ldlm_cli_cancel+0x216/0x650 [ptlrpc] Jun 2 18:00:54 c210 kernel: [<ffffffff86bf62b1>] ? __slab_free+0x81/0x2f0 Jun 2 18:00:54 c210 kernel: [<ffffffffc0f6ce0a>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc] Jun 2 18:00:54 c210 kernel: [<ffffffffc0ddc7dd>] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc] Jun 2 18:00:54 c210 kernel: [<ffffffffc0ddd101>] ldlm_bl_thread_main+0x531/0x700 [ptlrpc] Jun 2 18:00:54 c210 kernel: [<ffffffffc0ddcbd0>] ? ldlm_handle_bl_callback+0x4e0/0x4e0 [ptlrpc] Jun 2 18:00:54 c210 kernel: [<ffffffff86abae31>] kthread+0xd1/0xe0 Jun 2 18:00:54 c210 kernel: [<ffffffff86ac9667>] ? finish_task_switch+0x57/0x170 Jun 2 18:00:54 c210 kernel: [<ffffffff86abad60>] ? insert_kthread_work+0x40/0x40 Jun 2 18:00:54 c210 kernel: [<ffffffff8711f61d>] ret_from_fork_nospec_begin+0x7/0x21 Jun 2 18:00:54 c210 kernel: [<ffffffff86abad60>] ? insert_kthread_work+0x40/0x40 Jun 2 18:00:54 c210 kernel: Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 00 97 01 00 48 03 14 c5 a0 53 73 87 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b
Attachments
Issue Links
- is related to
-
LU-12832 soft lockup in ldlm_bl_xx threads at read for a single shared strided file
-
- Resolved
-
In
LU-12377case, it disabled lru_resize and keep all locks in caches on clients and didn't hit same problem during io500 run. But once io500 finished and it removed all files which was created by IORs, huge amount of lock were canceled, then hit that issue.In
LU-12832, lru_resize was enabled and many locks were canceling with LRU manner during io500 run. In addition to that, the huge amount of cancel happened at ior_hard_read.So, this is just difference on tirger point of cancel (after remove files vs cacnel with LRU at read), but both of them hit same issue eventually. I agree
LU-12377is same withLU-12832.