[LU-12250] MDS with high rate of ldlm_cancel and Prolong DOM lock Created: 30/Apr/19 Updated: 24/Jul/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Oleg Drokin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I'm investigating a metadata slowdown we had tonight on Fir. A simple find was super slow. However, when I started to gather stats, the performance came back, so now it seems OK. I can still see a lot of ldlm_cancel RPCs, though, so I wanted to report it. I have a script (that I can share if needed) that takes a 5-second sample of Lustre RPCs on the MDS, and it shows a high rate of ldlm_cancel RPCs. I also see a lot of "Prolong DOM lock" messages in the full logs. For example:

Total_RPC_count_for_NID NID LND# RPC_type:count,RPC_type:count,...
3718 sh-107-42-ib0.ib o2ib4 mds_close:1285,ldlm_enqueue:1213,ldlm_cancel:1220

Also attaching a 5-second full rpctrace/rpc dump of fir-md1-s2 (MDT0001 and MDT0003) as fir-md1-s2-rpctrace-rpc-20190430-5s.log.gz. I wonder, could the patch for |
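For reference, the kind of 5-second sampling described above could be approximated roughly as follows. This is only a sketch, not the actual script: it assumes the rpctrace debug flag and the "Handling RPC" lines from lctl dk, whose trailing fields differ between Lustre versions, and the regex only catches o2ib NIDs.

# Turn on RPC tracing, capture ~5 seconds of debug log, then turn it off again
lctl set_param debug=+rpctrace
lctl clear
sleep 5
lctl dk > /tmp/mds-rpc-sample.log
lctl set_param debug=-rpctrace

# Rough per-client request counts for the sample window
grep 'Handling RPC' /tmp/mds-rpc-sample.log | \
    grep -o '[0-9.]\+@o2ib[0-9]*' | sort | uniq -c | sort -rn | head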
| Comments |
| Comment by Stephane Thiell [ 30/Apr/19 ] |
|
The RPC sample file is attached as fir-md1-s2-lrpc-sample-best.log |
| Comment by Stephane Thiell [ 30/Apr/19 ] |
|
We added the patch from |
| Comment by Stephane Thiell [ 30/Apr/19 ] |
|
Looking more closely, the high rate of `ldlm_cancel` mainly comes from jobs from the same user opening many files read-only from hundreds of nodes, and it looks like the same set of files everywhere. An strace of one of her processes shows this:

[pid 308636] openat(4, "users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 56
[pid 184783] openat(4, "users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 30
[pid 185061] openat(4, "users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 6
[pid 308710] openat(4, "users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 6
[pid 317296] openat(4, "users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 28
[pid 310705] openat(4, "users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 6
[pid 336814] openat(4, "users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits", O_RDONLY|O_LARGEFILE|O_NOFOLLOW) = 28
[root@sh-107-55 ~]# lfs getstripe /fir/users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits
/fir/users/ogoann/proj/mcmc/grid_models/x4d8c2t3_mtab.fits
lcm_layout_gen: 8
lcm_mirror_count: 1
lcm_entry_count: 6
lcme_id: 1
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 131072
lmm_stripe_count: 0
lmm_stripe_size: 131072
lmm_pattern: mdt
lmm_layout_gen: 0
lmm_stripe_offset: 0
lcme_id: 2
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 131072
lcme_extent.e_end: 16777216
lmm_stripe_count: 1
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 29
lmm_objects:
- 0: { l_ost_idx: 29, l_fid: [0xcc0000400:0x2ad71b:0x0] }
lcme_id: 3
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 16777216
lcme_extent.e_end: 1073741824
lmm_stripe_count: 2
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 28
lmm_objects:
- 0: { l_ost_idx: 28, l_fid: [0xd80000400:0x2ad810:0x0] }
- 1: { l_ost_idx: 23, l_fid: [0x940000400:0x2ad834:0x0] }
lcme_id: 4
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 1073741824
lcme_extent.e_end: 34359738368
lmm_stripe_count: 4
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
lcme_id: 5
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 34359738368
lcme_extent.e_end: 274877906944
lmm_stripe_count: 8
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
lcme_id: 6
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 274877906944
lcme_extent.e_end: EOF
lmm_stripe_count: 16
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
Can the read-on-open feature of DOM impact performance in that particular case and trigger the ldlm_cancel traffic, even for read-only files? I'm just wondering; right now things are working fine otherwise. |
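For context, a PFL layout with a 128 KiB DoM first component like the one shown above would typically be created roughly as follows; this is a reconstruction from the getstripe output with a placeholder directory path, not the exact command used on Fir:

# Sketch: directory default layout matching the component boundaries above
# (128K on the MDT, then 1/2/4/8/16-stripe components out to EOF)
lfs setstripe -E 128K -L mdt -E 16M -c 1 -E 1G -c 2 \
              -E 32G -c 4 -E 256G -c 8 -E -1 -c 16 /fir/path/to/dir

With that first component on the MDT, an open can return the first 128 KiB of file data straight from the MDS, which is what the read-on-open question above refers to.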
| Comment by Oleg Drokin [ 30/Apr/19 ] |
|
It's possible that these things are just related. When you open a lot of files, that creates a lot of locks, which sets in motion the mechanism to reduce lock pressure and cancel some of the other outstanding locks - hence all the lock cancels. The other facet of this is that when you do an open with DoM, if the DoM code decides to return any data back, it first needs to read that data on the server. This might be introducing slowdowns into RPC handling, leading to the formation of an incoming RPC queue as all the threads are busy doing opens, so everything else gets stuck in the queue behind them. You can test this theory by looking at server-side stats like this when the problem is occurring: |
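Something along these lines, using the per-MDS stats file that Stephane reads in the next comment, would show whether the request queue is building up (a sketch, not Oleg's original example; on recent releases lctl get_param mds.MDS.mdt.stats reaches the same counters):

# Watch queue depth and wait time on the MDS every 5 seconds
while true; do
    date
    grep -E 'req_qdepth|req_waittime|req_active|ldlm_ibits_enqueue|mds_reint_open' \
        /sys/kernel/debug/lustre/mds/MDS/mdt/stats
    sleep 5
done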
| Comment by Stephane Thiell [ 30/Apr/19 ] |
|
Awesome, thanks Oleg!! This is super helpful. I didn't graph these yet, so I've added the metrics from /sys/kernel/debug/lustre/mds/MDS/mdt/stats to our Grafana (I'm taking the max + average (sum/samples) for each entry). Interesting that those are per MDS and not per MDT. Note that we have 2 MDTs per MDS. Right now the numbers look very good I guess, but I restarted the MDTs this morning:

[root@fir-md1-s2 scripts.d]# cat /sys/kernel/debug/lustre/mds/MDS/mdt/stats
snapshot_time 1556651797.816324198 secs.nsecs
req_waittime 471139359 samples [usec] 3 53973 7349538359 360519029729
req_qdepth 471139359 samples [reqs] 0 6 1917287 1956293
req_active 471139360 samples [reqs] 1 111 769497155 1918543425
req_timeout 471139360 samples [sec] 1 20 836111081 8135540191
reqbuf_avail 998191708 samples [bufs] 32 96 63808464245 4079030772021
ldlm_glimpse_enqueue 3327771 samples [reqs] 1 1 3327771 3327771
ldlm_flock_enqueue 8883527 samples [reqs] 1 1 8883527 8883527
ldlm_ibits_enqueue 456100785 samples [reqs] 1 1 456100785 456100785
mds_reint_setattr 846838 samples [reqs] 1 1 846838 846838
mds_reint_create 18230 samples [reqs] 1 1 18230 18230
mds_reint_link 396 samples [reqs] 1 1 396 396
mds_reint_unlink 439211 samples [reqs] 1 1 439211 439211
mds_reint_rename 14528 samples [reqs] 1 1 14528 14528
mds_reint_open 448013575 samples [reqs] 1 1 448013575 448013575
mds_reint_setxattr 491 samples [reqs] 1 1 491 491
ost_sync 3440 samples [usec] 70 12032 434598 261919830
mds_getattr 2106 samples [usec] 13 1239 53990 4786790
mds_getattr_lock 1 samples [usec] 96 96 96 9216
mds_connect 5404 samples [usec] 5 999369 19610008 11100644683594
mds_disconnect 2 samples [usec] 70 86 156 12296
mds_statfs 10543 samples [usec] 9 312 586244 37145030
mds_sync 3440 samples [usec] 75 233664 61955554 7498190710542
mds_quotactl 2688 samples [usec] 28 1075 145337 9915395
mds_getxattr 143070 samples [usec] 12 1371 3196191 77352011
mds_hsm_state_set 578 samples [usec] 201 12304 266279 543288397
obd_ping 1311390 samples [usec] 0 460 18172871 311334181 |
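For the "max + average (sum/samples)" derivation: each stats line is name, sample count, the word "samples", unit, min, max, sum and (for some counters) sum of squares, so the average is simply sum divided by the sample count. A quick sketch using that field layout:

# Average value per counter (sum / samples), based on the field layout shown above
awk '$3 == "samples" && NF >= 7 { printf "%-24s avg %.1f %s\n", $1, $7/$2, $4 }' \
    /sys/kernel/debug/lustre/mds/MDS/mdt/stats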
| Comment by Oleg Drokin [ 01/May/19 ] |
|
The reason these are per MDS is that thread pools are shared across each node. For the same reason, if you have a single slow OST on an OSS node, it would quickly clog all IO threads on the entire node and all OSTs would feel slow. So keep in mind that a single target can affect the performance of the whole node if it makes threads run for a long time. There were some discussions about letting NRS impose per-target limits on a node to avoid this, but I don't remember seeing any patches. If you are graphing this stuff, definitely do it for all thread pools (there are multiple on an MDS) and add the OSS counterparts too, I guess. |
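For graphing all the thread pools, enumerating the per-service stats and thread counters on both MDS and OSS nodes is a reasonable starting point (sketch; exact service names vary somewhat by Lustre version):

# On an MDS: per-service stats and thread pool sizes
lctl list_param mds.MDS.*.stats
lctl get_param mds.MDS.*.threads_started mds.MDS.*.threads_max

# On an OSS: the equivalent counters for the IO service pools
lctl list_param ost.OSS.*.stats
lctl get_param ost.OSS.*.threads_started ost.OSS.*.threads_max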
| Comment by Stephane Thiell [ 06/May/19 ] |
|
I don't graph that for all thread pools yet (I'll check, thanks). But in the following screenshot you can see that the max req_qdepth went up to 80 and the max req_waittime up to 26 seconds during the last few days. Our two MDSes are now very different in terms of load:
Highest RPC counts per client (5-second sample):
6956 sh-113-01-ib0.ib o2ib4 mds_close:2070,ldlm_enqueue:2729,ldlm_cancel:2157
7358 sh-113-03-ib0.ib o2ib4 mds_close:2188,ldlm_enqueue:2873,ldlm_cancel:2297
7627 sh-112-16-ib0.ib o2ib4 mds_close:2290,ldlm_enqueue:2975,ldlm_cancel:2362
7767 10.9.115.10 o2ib4 mds_close:2292,ldlm_enqueue:3054,ldlm_cancel:2421
8089 sh-112-14-ib0.ib o2ib4 mds_close:2449,ldlm_enqueue:3147,ldlm_cancel:2493
8960 fir-io4-s2 o2ib7 obd_ping:7,quota_dqacq:8953
10906 0 lo ldlm_enqueue:141,ldlm_cancel:177,ldlm_bl_callback:1,obd_ping:1,1000:38,quota_dqacq:10548
17767 fir-io3-s2 o2ib7 quota_dqacq:17767
fir-io3-s2 and fir-io4-s2 are both OSS nodes here. htop for fir-md1-s1:
But otherwise the filesystem seems fine and usable. |
| Comment by Stephane Thiell [ 18/Sep/19 ] |
|
Oleg, I found your presentation "Lustre Locking overview" at https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf. In slide 11, you wrote “If you have a lot of RAM, it makes sense to increase these values”, so I'm wondering if increasing the following values would make sense to reduce the rate of ldlm_cancel RPCs overall.

2.12.2+:
[root@fir-md1-s2 ~]# lctl get_param ldlm.lock_limit_mb
ldlm.lock_limit_mb=77216
[root@fir-md1-s2 ~]# lctl get_param ldlm.lock_reclaim_threshold_mb
ldlm.lock_reclaim_threshold_mb=51478

2.10.8+:
[root@oak-md1-s2 ~]# lctl get_param ldlm.lock_limit_mb
ldlm.lock_limit_mb=38549
[root@oak-md1-s2 ~]# lctl get_param ldlm.lock_reclaim_threshold_mb
ldlm.lock_reclaim_threshold_mb=25699 |
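If these are raised, the change can be made at runtime on the servers; the numbers below are purely illustrative (roughly double the 2.12.2 values reported above), and the reclaim threshold should stay below the hard limit. This is a sketch, not a tested recommendation for Fir:

# Current values
lctl get_param ldlm.lock_limit_mb ldlm.lock_reclaim_threshold_mb

# Example only: roughly double the values shown above for fir-md1-s2;
# size these to the MDS RAM and watch ldlm.namespaces.*.lock_count afterwards
lctl set_param ldlm.lock_reclaim_threshold_mb=102956
lctl set_param ldlm.lock_limit_mb=154432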
| Comment by Dominique Martinet (Inactive) [ 24/Jul/20 ] |
|
By the way, this is old, but if you have SELinux enabled (even just in permissive mode, and given the number of xattr requests it probably is), this actually could be The fix only made its way into 2.13 (the 2.12 backport got reverted immediately), so that will only help once both servers and clients are upgraded. I'm told Atos backported it (with |
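A quick way to check whether SELinux-driven xattr traffic is a factor (sketch; mds_getxattr is the counter visible in the stats dump earlier in this ticket):

# On a client: SELinux mode (Enforcing / Permissive / Disabled)
getenforce

# On the MDS: see how quickly the getxattr counter grows over a minute
lctl get_param mds.MDS.mdt.stats | grep mds_getxattr; sleep 60
lctl get_param mds.MDS.mdt.stats | grep mds_getxattr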