[LU-12377] stuck in ldlm thread on client when large amount of locks were canceled Created: 03/Jun/19 Updated: 30/Oct/19 Resolved: 30/Oct/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | master |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When a client unlinked files that held a huge number of extent locks (e.g. a strided file), one of the client's kworker threads went to 100% CPU and never came back. It looks like a deadlock in an ldlm thread on the client during the many cancel calls.

Tasks: 380 total, 2 running, 378 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 6.6 sy, 0.0 ni, 87.1 id, 6.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13182929+total, 12711252+free, 2287060 used, 2429716 buff/cache
KiB Swap: 1048572 total, 1048572 free, 0 used. 12699912+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
377 root 20 0 0 0 0 R 100.0 0.0 32:53.50 kworker/14:1

Jun 2 18:00:54 c210 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [ldlm_bl_43:4442]
Jun 2 18:00:54 c210 kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) sunrpc sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt cryptd joydev iTCO_vendor_support sg ipmi_si shpchp ipmi_devintf wmi mei_me mei ioatdma ipmi_msghandler i2c_i801 pcspkr lpc_ich binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit
Jun 2 18:00:54 c210 kernel: drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci ixgbe drm libahci libata mlx5_core(OE) crct10dif_pclmul crct10dif_common crc32c_intel megaraid_sas mlxfw(OE) devlink mdio i2c_core mlx_compat(OE) ptp pps_core dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
Jun 2 18:00:54 c210 kernel: CPU: 0 PID: 4442 Comm: ldlm_bl_43 Kdump: loaded Tainted: G OEL ------------ 3.10.0-862.el7.x86_64 #1
Jun 2 18:00:54 c210 kernel: Hardware name: Supermicro SYS-1027R-WC1RT/X9DRW-CF/CTF, BIOS 3.0b 04/09/2014
Jun 2 18:00:54 c210 kernel: task: ffffa04c3bb3eeb0 ti: ffffa031b58a0000 task.ti: ffffa031b58a0000
Jun 2 18:00:54 c210 kernel: RIP: 0010:[<ffffffff86b08802>] [<ffffffff86b08802>] native_queued_spin_lock_slowpath+0x122/0x200
Jun 2 18:00:54 c210 kernel: RSP: 0018:ffffa031b58a3b48 EFLAGS: 00000246
Jun 2 18:00:54 c210 kernel: RAX: 0000000000000000 RBX: ffffffffc0f84b7f RCX: 0000000000010000
Jun 2 18:00:54 c210 kernel: RDX: ffffa04c7fcd9700 RSI: 0000000000590001 RDI: ffffa045b18e4948
Jun 2 18:00:54 c210 kernel: RBP: ffffa031b58a3b48 R08: ffffa03c3fc19700 R09: 0000000000000000
Jun 2 18:00:54 c210 kernel: R10: 0000000000000000 R11: fffff018b1d64480 R12: ffffa040a5c9b7c8
Jun 2 18:00:54 c210 kernel: R13: ffffa03f8d5a13f8 R14: ffffa040a5c9b8b0 R15: 00000000000a5aff
Jun 2 18:00:54 c210 kernel: FS: 0000000000000000(0000) GS:ffffa03c3fc00000(0000) knlGS:0000000000000000
Jun 2 18:00:54 c210 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 2 18:00:54 c210 kernel: CR2: 00007fd700006bc9 CR3: 000000203be54000 CR4: 00000000001607f0
Jun 2 18:00:54 c210 kernel: Call Trace:
Jun 2 18:00:54 c210 kernel: [<ffffffff8710842a>] queued_spin_lock_slowpath+0xb/0xf
Jun 2 18:00:54 c210 kernel: [<ffffffff87115680>] _raw_spin_lock+0x20/0x30
Jun 2 18:00:54 c210 kernel: [<ffffffffc0c6914a>] cl_object_attr_lock+0x1a/0x20 [obdclass]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0f6cf86>] osc_ldlm_blocking_ast+0x2f6/0x3a0 [osc]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0dc730a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0c43892>] ? class_handle_unhash+0x32/0x50 [obdclass]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0dd21d0>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0dd676a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0dd6e8b>] ldlm_cancel_lru_local+0x2b/0x30 [ptlrpc]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0dd8076>] ldlm_cli_cancel+0x216/0x650 [ptlrpc]
Jun 2 18:00:54 c210 kernel: [<ffffffff86bf62b1>] ? __slab_free+0x81/0x2f0
Jun 2 18:00:54 c210 kernel: [<ffffffffc0f6ce0a>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0ddc7dd>] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0ddd101>] ldlm_bl_thread_main+0x531/0x700 [ptlrpc]
Jun 2 18:00:54 c210 kernel: [<ffffffffc0ddcbd0>] ? ldlm_handle_bl_callback+0x4e0/0x4e0 [ptlrpc]
Jun 2 18:00:54 c210 kernel: [<ffffffff86abae31>] kthread+0xd1/0xe0
Jun 2 18:00:54 c210 kernel: [<ffffffff86ac9667>] ? finish_task_switch+0x57/0x170
Jun 2 18:00:54 c210 kernel: [<ffffffff86abad60>] ? insert_kthread_work+0x40/0x40
Jun 2 18:00:54 c210 kernel: [<ffffffff8711f61d>] ret_from_fork_nospec_begin+0x7/0x21
Jun 2 18:00:54 c210 kernel: [<ffffffff86abad60>] ? insert_kthread_work+0x40/0x40
Jun 2 18:00:54 c210 kernel: Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 00 97 01 00 48 03 14 c5 a0 53 73 87 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b |
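A minimal sketch, assuming a standard RHEL 7 client, of how to confirm where the spinning thread is stuck (the PIDs come from the top/watchdog output above; everything else is a generic kernel interface):

    # Kernel stack of the busy kworker / ldlm_bl thread reported by top
    cat /proc/377/stack
    cat /proc/4442/stack

    # Or dump backtraces of all CPUs into the kernel log to catch the lockup site
    echo l > /proc/sysrq-trigger
    dmesg | tail -n 200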
| Comments |
| Comment by Patrick Farrell (Inactive) [ 03/Jun/19 ] |
|
Ihara, can you give more details about what was running when you did this? For example, what was the exact job that created the file that was being unlinked?

The sequence of events shown in the log appears to be several evictions for unknown reasons:

Jun 2 17:53:47 c210 kernel: LustreError: 167-0: es90-OST000e-osc-ffffa04c7febc800: This client was evicted by es90-OST000e; in progress operations using this service will fail.
[...]
Jun 2 17:54:18 c210 kernel: LustreError: 167-0: es90-OST000f-osc-ffffa04c7febc800: This client was evicted by es90-OST000f; in progress operations using this service will fail.

Then the LDLM thread gets a soft lockup while trying to evict, which is typical if there is a busy import with many, many locks on it (the eviction code is not super efficient about how it handles locking during cleanup, and it never sleeps). Here's the trace on that first soft lockup:

Jun 2 17:55:16 c210 kernel: [<ffffffff8710842a>] queued_spin_lock_slowpath+0xb/0xf
Jun 2 17:55:16 c210 kernel: [<ffffffff87115680>] _raw_spin_lock+0x20/0x30
Jun 2 17:55:16 c210 kernel: [<ffffffffc0c6914a>] cl_object_attr_lock+0x1a/0x20 [obdclass]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0f6cf86>] osc_ldlm_blocking_ast+0x2f6/0x3a0 [osc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0dc730a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0c43892>] ? class_handle_unhash+0x32/0x50 [obdclass]
Jun 2 17:55:16 c210 kernel: [<ffffffff86bf62b1>] ? __slab_free+0x81/0x2f0
Jun 2 17:55:16 c210 kernel: [<ffffffffc0dd21d0>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0dd81e1>] ldlm_cli_cancel+0x381/0x650 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0dc9759>] cleanup_resource+0x1f9/0x330 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0dc98c5>] ldlm_resource_clean+0x35/0x50 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0ab2370>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0dc9890>] ? cleanup_resource+0x330/0x330 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0dc9890>] ? cleanup_resource+0x330/0x330 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0ab5515>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0dc87c0>] ldlm_namespace_cleanup+0x30/0xc0 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0f580c9>] osc_import_event+0x459/0xfc0 [osc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0e16330>] ? ptlrpc_import_recovery_state_machine+0xc20/0xc20 [ptlrpc]
Jun 2 17:55:16 c210 kernel: [<ffffffffc0e141a4>] ptlrpc_invalidate_import+0x2a4/0xae0 [ptlrpc]

Note specifically invalidate_import - this is cleanup from an eviction.

So it would be good to get logs from the OSSes. I see various messages of this sort shortly before the evictions:

Jun 2 17:52:02 c210 kernel: Lustre: 25173:0:(client.c:2141:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1559519513/real 1559519515] req@ffffa038b0a41680 x1635157931013696/t0(0) o103->es90-OST0000-osc-ffffa04c7febc800@172.16.254.190@o2ib:17/18 lens 3584/224 e 0 to 1 dl 1559519520 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jun 2 17:52:02 c210 kernel: Lustre: es90-OST0000-osc-ffffa04c7febc800: Connection to es90-OST0000 (at 172.16.254.190@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jun 2 17:52:02 c210 kernel: Lustre: Skipped 7 previous similar messages
Jun 2 17:52:05 c210 kernel: Lustre: 25177:0:(client.c:2141:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1559519516/real 1559519518] req@ffffa03fb631d100 x1635157931013984/t0(0) o400->MGC172.16.254.190@o2ib@172.16.254.190@o2ib:26/25 lens 224/224 e 0 to 1 dl 1559519523 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 2 17:52:05 c210 kernel: Lustre: 25177:0:(client.c:2141:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Jun 2 17:52:05 c210 kernel: LustreError: 166-1: MGC172.16.254.190@o2ib: Connection to MGS (at 172.16.254.190@o2ib) was lost; in progress operations using this service will fail
Jun 2 17:52:05 c210 kernel: LNetError: 25160:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Jun 2 17:52:05 c210 kernel: LNetError: 25160:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 172.16.254.160@o2ib (7): c: 0, oc: 0, rc: 8
Jun 2 17:52:10 c210 kernel: LNetError: 25161:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
Jun 2 17:52:10 c210 kernel: LNetError: 25161:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
[...]
Jun 2 17:52:36 c210 kernel: LustreError: 11290:0:(service.c:2216:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-172.16.254.191@o2ib: deadline 6:5s ago#012 req@ffffa035793f9200 x1635232884665680/t0(0) o104->LOV_OSC_UUID@172.16.254.191@o2ib:611/0 lens 296/0 e 0 to 0 dl 1559519551 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Jun 2 17:52:36 c210 kernel: LustreError: 11290:0:(service.c:2216:ptlrpc_server_handle_request()) Skipped 3317 previous similar messages
Jun 2 17:52:36 c210 kernel: Lustre: 11290:0:(service.c:2254:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:5s); client may timeout. req@ffffa035793f9200 x1635232884665680/t0(0) o104->LOV_OSC_UUID@172.16.254.191@o2ib:611/0 lens 296/0 e 0 to 0 dl 1559519551 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Jun 2 17:52:36 c210 kernel: Lustre: 11290:0:(service.c:2254:ptlrpc_server_handle_request()) Skipped 3318 previous similar messages
Jun 2 17:53:20 c210 kernel: Lustre: MGC172.16.254.190@o2ib: Connection restored to 172.16.254.190@o2ib (at 172.16.254.190@o2ib)
Jun 2 17:53:20 c210 kernel: Lustre: Skipped 7 previous similar messages
Jun 2 17:53:42 c210 kernel: Lustre: 6327:0:(client.c:2141:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1559519611/real 1559519612] req@ffffa04b5872da00 x1635157931055392/t0(0) o101->MGC172.16.254.190@o2ib@172.16.254.190@o2ib:26/25 lens 328/344 e 0 to 1 dl 1559519621 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jun 2 17:53:42 c210 kernel: Lustre: 6327:0:(client.c:2141:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Jun 2 17:53:42 c210 kernel: LustreError: 166-1: MGC172.16.254.190@o2ib: Connection to MGS (at 172.16.254.190@o2ib) was lost; in progress operations using this service will fail
Jun 2 17:53:45 c210 kernel: LNetError: 22059:0:(o2iblnd_cb.c:3307:kiblnd_cm_callback()) 172.16.254.161@o2ib DISCONNECTED
Jun 2 17:53:47 c210 kernel: Lustre: es90-MDT0007-mdc-ffffa04c7febc800: Connection restored to 172.16.254.163@o2ib (at 172.16.254.163@o2ib)

This indicates significant network trouble. Note in particular this message:

Jun 2 17:52:36 c210 kernel: Lustre: 11290:0:(service.c:2254:ptlrpc_server_handle_request()) Skipped 3318 previous similar messages

After that, I see what are basically repeats of that first back trace, plus one other thread that appears to be waiting for it. Then there are some task hung messages that are not Lustre, waiting for I/O somewhere... So the soft lockup looks more like eviction of a client with a large number of locks leading to very slow eviction and warning messages. This is a known problem I've seen several times, which I should probably open an LU for. But I think the network problems are more likely the root of the issue? |
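A minimal sketch of how those OSS-side logs could be gathered, assuming lctl is available on the servers (the debug mask and file names here are only examples):

    # On the OSS, widen the Lustre debug mask before reproducing the problem
    lctl set_param debug="+dlmtrace +rpctrace"

    # After the evictions, dump the Lustre debug ring buffer to a file
    lctl dk /tmp/oss-debug.log

    # Console messages around the eviction window are also useful
    dmesg -T | grep -i -e lustre -e lnet > /tmp/oss-dmesg.log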
| Comment by Patrick Farrell (Inactive) [ 12/Jul/19 ] |
|
Ihara, if you don't have more details on this one, we probably won't be able to get anywhere with it - also, it looks distinctly like there were network issues. I am leaning towards closing this because we don't have enough info to pinpoint a cause, and this isn't an ongoing issue. So let me know if you feel differently and/or have more info. |
| Comment by Patrick Farrell (Inactive) [ 16/Aug/19 ] |
|
Unfortunately, not enough info to get traction here from the one instance reported. |
| Comment by Shuichi Ihara [ 18/Sep/19 ] |
|
I see this issue quite often at ior_hard_read in the io500 test. |
| Comment by Patrick Farrell (Inactive) [ 18/Sep/19 ] |
|
So this is happening because we're trying to cancel too many locks on the hard write/hard read file. It's taking too long to do that, so some requests are timing out, leading to evictions, etc.

The way to fix this in general is probably to reduce the number of locks that are cancelled each time we start doing a cancellation. This is just a guess, but if the problem is the latency of lock cancellations, reducing the count should help with that.

In the meantime, setting a low-ish lru_size limit should take care of the problem. I'm trying 10000 currently, but we may be able to safely go much higher (previously I believe we had hundreds of thousands of locks on each client). I'm not sure whether it will help performance (it may hurt it slightly), but it should prevent this problem from occurring. |
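A minimal sketch of that workaround on the client, assuming standard Lustre tunables (the value 10000 is just the one being tried here):

    # Cap the LDLM lock LRU per namespace; a non-zero value also disables LRU resize
    lctl set_param ldlm.namespaces.*.lru_size=10000

    # Check how many locks each OSC namespace currently holds
    lctl get_param ldlm.namespaces.*osc*.lock_count

    # Setting it back to 0 re-enables dynamic LRU sizing
    lctl set_param ldlm.namespaces.*.lru_size=0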
| Comment by Patrick Farrell (Inactive) [ 18/Sep/19 ] |
|
No, it looks like the batch size is already pretty reasonable - likely less than 64. Basically, it looks like our performance when cancelling a huge number of locks at once on the same resource and in the same namespace isn't very good. This has been improved previously, but it's not really solved. This is related to

I'm now thinking the issue might actually be that we're cancelling one lock at a time, from an absurd number of threads... because the existing write locks are not going to be cancelled until they are explicitly called back - we're just cancelling one lock at a time, each time a bl callback comes in. Some sort of mechanism to recognize this case might be in order... If we've got a bunch of locks cancelled in a row, over and over, on a particular resource, then we should probably start cancelling all unused locks on that resource... Will think on this a little. |
| Comment by Shuichi Ihara [ 18/Sep/19 ] |
|
Why does this happen after the read, and why can't the extent locks be cancelled from time to time? As far as I observed in ldlm.namespaces.*.lock_count, locks of other types (open locks?) are cancelled automatically, but the extent locks are not? |
| Comment by Patrick Farrell (Inactive) [ 18/Sep/19 ] |
|
My understanding is that it happens after the write, during the read. The reads cancel the write locks because the write locks are all messy and don't tend to match up with what's required for reading the file. (This is because the unaligned i/o from multiple clients ends up creating many locks of various unusual sizes & many are also cancelled, so the mapping of write locks across the file is quite irregular.) Extent locks are held in the LRU until either they age out, the lru size is exceeded (or the server says the client has too many, if lru resize is enabled), or there is a conflict. In this case, the conflict is from the read. Extent locks are being cancelled as needed, but the problem is that when we try to cancel many, many extent locks on the same resource (in this case, 1 at a time from many threads) we see heavy contention on the lock protecting the resource. This causes BL callbacks to take an extremely long time, resulting in evictions. |
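One way to watch this while the read phase runs is to sample the per-namespace lock counts; a minimal sketch, assuming a client shell (the interval is arbitrary):

    # Current extent lock counts and LRU limits on the OSC namespaces
    lctl get_param ldlm.namespaces.*osc*.lock_count ldlm.namespaces.*osc*.lru_size

    # Repeat every few seconds during the read to watch the cancellations drain
    watch -n 5 'lctl get_param ldlm.namespaces.*osc*.lock_count'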
| Comment by Shuichi Ihara [ 19/Sep/19 ] |
|
Yeah, reducing ldlm.namespaces.*.lru_max_age also helped; it cancelled extent locks aggressively in the background. |
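For reference, a minimal sketch of that tuning (the value is only an example; the units of lru_max_age differ between Lustre versions - milliseconds on newer clients, seconds on older ones - so check the current value first):

    # Check the current max age (and infer the units from its magnitude)
    lctl get_param ldlm.namespaces.*osc*.lru_max_age

    # Age unused extent locks out of the LRU more aggressively, e.g. ~10 minutes
    lctl set_param ldlm.namespaces.*osc*.lru_max_age=600000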
| Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ] |
|
Lock conflicts. Even if write/read is on the same clients (no shuffle, I think it's called?), because the job is unaligned writes from many processes to the same stripes, locks are constantly getting cancelled during the job and replaced by different locks. The result is that at the end of the write, the locks that exist don't match up (like, at all) with what's needed to read the data back. Often, the lock required for any particular read belongs to another client, or doesn't quite cover the whole region (or it does cover the whole region, but it conflicts with a read that is going to happen on another client... etc.). So the read has to cancel most of the write locks to complete. Probably almost all of them. |
| Comment by Andreas Dilger [ 29/Oct/19 ] |
|
This seems similar to |
| Comment by Wang Shilong (Inactive) [ 30/Oct/19 ] |
|
Yup, this looks like a duplicate. |
| Comment by Shuichi Ihara [ 30/Oct/19 ] |
|
In |