When writing to many files, one bottleneck on a client
currently seems to be the grant code, specifically
spinning on the lock taken around osc_enter_cache_try.
The contention is only on that one call, so there's
no obvious way to refactor the locking. Instead, we can
look at where time is going inside the function.
Two things that stand out:
obd_dirty_pages is an atomic, but it is always accessed
under the cl_loi_list_lock, so it can be a regular long.
In my perf tracing, the atomic add_return on it accounts
for 50% of the time in this function.
The assert_spin_locked in osc_consume_write_grant generates
an atomic read of the cl_loi_list_lock lock value. This
isn't too painful, but it would be nice to cut it out of
the hot path. There is already a comment saying the
cl_loi_list_lock must be held, and this is considered
enough in most places in Lustre.
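
The two changes above can be sketched in user space as follows. This is
only an illustration under stated assumptions, not the Lustre code: the
demo_* names, the struct layout, and the DEMO_DEBUG_LOCKS switch are all
hypothetical, and a pthread mutex stands in for cl_loi_list_lock. It
assumes, as described above, that the dirty counter is only ever touched
with that one lock held.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for the client state; not Lustre code. */
struct demo_client {
	pthread_mutex_t list_lock;   /* plays the role of cl_loi_list_lock */
	int             lock_held;   /* debug aid, written only by the holder */
	long            dirty_pages; /* plain long: only touched under list_lock */
	long            dirty_max;   /* limit on dirty pages */
	long            avail_grant; /* bytes of grant still available */
};

/* The lock check exists only in debug builds, so the hot path pays nothing. */
#ifdef DEMO_DEBUG_LOCKS
#define demo_assert_locked(c) assert((c)->lock_held)
#else
#define demo_assert_locked(c) ((void)0)
#endif

static void demo_lock(struct demo_client *c)
{
	pthread_mutex_lock(&c->list_lock);
	c->lock_held = 1;
}

static void demo_unlock(struct demo_client *c)
{
	c->lock_held = 0;
	pthread_mutex_unlock(&c->list_lock);
}

/* Caller must hold c->list_lock (documented contract, no runtime check). */
static void demo_consume_grant(struct demo_client *c, long bytes)
{
	demo_assert_locked(c);
	c->avail_grant -= bytes;
	c->dirty_pages++;	/* ordinary increment, serialized by the lock */
}

/*
 * Caller must hold c->list_lock.
 * Returns 1 if the page was admitted, 0 if grant or the dirty limit ran out.
 */
static int demo_enter_cache_try(struct demo_client *c, long bytes)
{
	demo_assert_locked(c);
	if (c->avail_grant < bytes || c->dirty_pages + 1 > c->dirty_max)
		return 0;
	demo_consume_grant(c, bytes);
	return 1;
}
```

A caller takes demo_lock(), calls demo_enter_cache_try() for each page,
then calls demo_unlock(). Because the lock already serializes every
access, the counter update is a plain increment rather than an atomic
read-modify-write, and the lock check disappears unless the debug switch
is defined.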
mpirun -np 36 $IOR -o $LUSTRE -w -t 1M -b 2G -i 1 -F
That's 36 processes on one client, writing to separate
files.
Looking in perf, the change is huge:
I go from spending 60% of the time in osc_enter_cache_try
to less than 1%.