[LU-7861] MDS Contention during unlinks due to llog spinlock Created: 10/Mar/16  Updated: 09/Feb/17  Resolved: 22/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.5
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Matt Ezell Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

2.5.5-g1241c21-CHANGED-2.6.32-573.12.1.el6.atlas.x86_64


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We see intermittent periods of interactive "slowness" on our production systems. Looking at the MDS, load can go quite high. Running perf, we see that osp_sync_add_rec() is spending a lot of time in _spin_lock. I believe this is the serialized addition of unlink records to the llog so that the OST objects will be removed.

-   29.60%    29.60%  [kernel]                   [k] _spin_lock
   - _spin_lock
      - 77.51% osp_sync_add_rec
           osp_sync_add
      + 7.79% task_rq_lock
      + 6.68% try_to_wake_up
      + 1.32% osp_statfs
      + 1.26% kmem_cache_free
      + 1.12% cfs_percpt_lock

I used jobstats to confirm that we had at least two jobs doing a significant number of unlinks at the time. When multiple MDS threads attempt unlinks, they serialize on this lock, spinning and tying up CPUs in the meantime.

I believe the following code is responsible:

osp_sync.c:421
                spin_lock(&d->opd_syn_lock);
                d->opd_syn_changes++;
                spin_unlock(&d->opd_syn_lock);

How can we improve this situation?

  • Is the spin_lock here just to protect opd_syn_changes (so it could be changed to an atomic; see the sketch after this list), or does it enforce additional synchronization? Would a mutex be appropriate here, or would the context switches kill us in a different way?
  • Does it make sense to support multiple llogs per device and hash objects to the different llogs so they can be appended to in parallel? Are there assumptions of ordering for llogs?
  • Something else?
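
To make the first bullet concrete, below is a minimal sketch of the atomic variant, under the assumption that opd_syn_lock protects nothing but the counter at this call site (which is exactly the open question). The reduced struct and helper names are illustrative stand-ins, not the real struct osp_device:

    /*
     * Hedged sketch of option 1: track pending llog changes with an
     * atomic_t so the hot path no longer takes opd_syn_lock.
     * ASSUMPTION: the lock serializes nothing else at this call site.
     * osp_sync_sketch is a stand-in, not the real struct osp_device.
     */
    #include <linux/atomic.h>

    struct osp_sync_sketch {
            atomic_t opd_syn_changes;       /* unlink records queued for the llog */
    };

    /* hot path: one increment per new unlink record, no spinning */
    static inline void osp_sync_add_change(struct osp_sync_sketch *d)
    {
            atomic_inc(&d->opd_syn_changes);
    }

    /* reader side: a plain snapshot is enough for "is there work pending?" */
    static inline int osp_sync_changes_pending(struct osp_sync_sketch *d)
    {
            return atomic_read(&d->opd_syn_changes);
    }

A mutex, by contrast, would put every unlinking service thread to sleep on the same counter update, trading spinning for context switches; an atomic avoids both, but only if nothing else relies on the lock for ordering.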


 Comments   
Comment by James A Simmons [ 10/Mar/16 ]

Looking at the uses of opd_syn_changes, I noticed that in several places it is not protected by opd_syn_lock.

Comment by Joseph Gmitter (Inactive) [ 10/Mar/16 ]

Hi Alex,
Can you have a look at this issue?
Thanks.
Joe

Comment by Alex Zhuravlev [ 15/Mar/16 ]

I'm trying to reproduce the case. Also, I don't think osp_sync_add_rec() is the issue itself; I'd rather suspect osp_sync_inflight_conflict().

Comment by Matt Ezell [ 16/Mar/16 ]

Hi Alex-

b2_5_fe doesn't have osp_sync_inflight_conflict(). Let us know if there is any additional information we can provide to help.

Thanks,
~Matt

Comment by Alex Zhuravlev [ 29/Mar/16 ]

Well, I can't reproduce this locally, but I've got a prototype patch which is in testing now.

Comment by Gerrit Updater [ 30/Mar/16 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/19211
Subject: LU-7861 osp: replace the hot spinlock with atomic trackers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e49e195abf46258a8e8c9f6ff194f21917f52a19
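
(Going only by the subject line, "atomic trackers" presumably means converting the counters the OSP sync code keeps under opd_syn_lock, such as the pending-change count and the in-flight destroy RPC count, into atomics, so the sync thread can be woken and can test its conditions lock-free. The sketch below is a guess at that shape for illustration, not the content of the patch; all names are assumptions.)

    /*
     * Hedged guess at the "atomic trackers" shape, NOT the actual patch:
     * with the counters atomic, producers only need an atomic_inc() plus
     * a wake_up(), and the OSP sync thread evaluates its wake-up
     * condition without taking opd_syn_lock.  All names are illustrative.
     */
    #include <linux/atomic.h>
    #include <linux/kthread.h>
    #include <linux/wait.h>

    struct osp_sync_trackers {
            atomic_t opd_syn_changes;        /* records queued, not yet processed */
            atomic_t opd_syn_rpc_in_flight;  /* destroy RPCs outstanding on the OST */
            wait_queue_head_t opd_syn_waitq; /* init_waitqueue_head() at setup */
    };

    /* the sync thread's main loop, heavily simplified */
    static int osp_sync_thread_sketch(void *arg)
    {
            struct osp_sync_trackers *t = arg;

            while (!kthread_should_stop()) {
                    wait_event(t->opd_syn_waitq,
                               atomic_read(&t->opd_syn_changes) > 0 ||
                               kthread_should_stop());

                    while (atomic_read(&t->opd_syn_changes) > 0) {
                            /* ... read one llog record, send the OST destroy ... */
                            atomic_dec(&t->opd_syn_changes);
                            atomic_inc(&t->opd_syn_rpc_in_flight);
                    }
            }
            return 0;
    }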

Comment by Alex Zhuravlev [ 25/Apr/16 ]

Matt, can you tell how you measured (or noticed) that slowness? The patch above should improve this specific case, as there will be less contention on that lock, but it'd be great to have a reproducer. My understanding is that you were running a few jobs, then two jobs started doing lots of unlinks concurrently (many clients involved), right? Then some jobs doing something different (open/create, ls, stat?) were seeing noticeably higher latency?
Note that massive unlinks by themselves usually put significant load on the MDS: many synchronous disk reads, llog records, etc.
I'm not saying it's impossible to improve, but at the moment I can't say how much the patch will improve this case.

Comment by Matt Ezell [ 26/Apr/16 ]

Your description of the situation is accurate. I would guess this would be hard to reproduce on a small system. With 2.5, you only get a single metadata-modifying RPC per client. You might want to either do multiple mounts per client, set fail_loc=0x804, or try a newer server that supports LU-5319 (though it's possible that other changes make this less noticeable on 2.8 servers). My guess is that when you have many MDT threads servicing unlink requests in parallel, most of the cores on the MDS end up spinning on the same lock, preventing the other MDT threads from running. If most of the MDT threads fill up with unlink requests that end up blocked, you may also run low on idle MDT threads, so other requests (stat or readdir, for example) end up waiting for a thread to service them, which shows up as latency.

I'm not sure we have a good reproducer, since we observed this due to user behavior in production. I would expect that a parallel mdtest (especially if the files are pre-created and you just use -r) would show this.

Comment by Gerrit Updater [ 22/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19211/
Subject: LU-7861 osp: replace the hot spinlock with atomic trackers
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5908965847d5535fc5def6621922e5ed00051e46

Comment by Joseph Gmitter (Inactive) [ 22/Jun/16 ]

Patch has landed to master for 2.9.0
