Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 2.5.5
- Fix Version: None
- Environment: 2.5.5-g1241c21-CHANGED-2.6.32-573.12.1.el6.atlas.x86_64
Description
We see intermittent periods of interactive "slowness" on our production systems. Looking at the MDS, load can go quite high. Running perf, we see that osp_sync_add_rec is spending a lot of time in a spin_lock. I believe this is the serial addition of unlink records to the llog so the OST objects will be removed.
    - 29.60% 29.60% [kernel]  [k] _spin_lock
       - _spin_lock
          - 77.51% osp_sync_add_rec
               osp_sync_add
          + 7.79% task_rq_lock
          + 6.68% try_to_wake_up
          + 1.32% osp_statfs
          + 1.26% kmem_cache_free
          + 1.12% cfs_percpt_lock
I used jobstats to confirm that we had at least two jobs doing a significant number of unlinks at the time. When multiple MDS threads attempt to do unlinks, they serialize on this lock, spinning and blocking the CPUs in the meantime.
I believe the following code is responsible:
osp_sync.c:421
    spin_lock(&d->opd_syn_lock);
    d->opd_syn_changes++;
    spin_unlock(&d->opd_syn_lock);
How can we improve this situation? A few options:
- Is the spin_lock here just to protect opd_syn_changes (so it could be changed to an atomic), or does it enforce additional synchronization? Would a mutex be appropriate here, or would the context switches kill us in a different way?
- Does it make sense to support multiple llogs per device and hash objects to the different llogs so they can be appended to in parallel? Are there assumptions of ordering for llogs?
- Something else?