Details
- Question/Request
- Resolution: Unresolved
- Minor
- None
- Lustre 2.4.3
- 13557
Description
After observing substantial read I/O on our systems during an OST umount, I took a look at exactly what was causing it. The root cause turns out to be that ldlm_cancel_locks_for_export() calls ofd_lvbo_update() to update the LVB from disk for every lock as it is canceled. When there are millions of locks on the server, this translates into a huge amount of I/O.
After reading through the code it's not at all clear to me why this is done. How could the LVB be out of date? Why is this update required before the lock can be canceled?
[<ffffffffa031c9dc>] cv_wait_common+0x8c/0x100 [spl]
[<ffffffffa031ca68>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa046353b>] zio_wait+0xfb/0x1b0 [zfs]
[<ffffffffa03d16bd>] dbuf_read+0x3fd/0x740 [zfs]
[<ffffffffa03d1b89>] __dbuf_hold_impl+0x189/0x480 [zfs]
[<ffffffffa03d1f06>] dbuf_hold_impl+0x86/0xc0 [zfs]
[<ffffffffa03d2f80>] dbuf_hold+0x20/0x30 [zfs]
[<ffffffffa03d9767>] dmu_buf_hold+0x97/0x1d0 [zfs]
[<ffffffffa042de8f>] zap_get_leaf_byblk+0x4f/0x2a0 [zfs]
[<ffffffffa042e14a>] zap_deref_leaf+0x6a/0x80 [zfs]
[<ffffffffa042e510>] fzap_lookup+0x60/0x120 [zfs]
[<ffffffffa0433f11>] zap_lookup_norm+0xe1/0x190 [zfs]
[<ffffffffa0434053>] zap_lookup+0x33/0x40 [zfs]
[<ffffffffa0cf0710>] osd_fid_lookup+0xb0/0x2e0 [osd_zfs]
[<ffffffffa0cea311>] osd_object_init+0x1a1/0x6d0 [osd_zfs]
[<ffffffffa06efc9d>] lu_object_alloc+0xcd/0x300 [obdclass]
[<ffffffffa06f0805>] lu_object_find_at+0x205/0x360 [obdclass]
[<ffffffffa06f0976>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0d80575>] ofd_object_find+0x35/0xf0 [ofd]
[<ffffffffa0d90486>] ofd_lvbo_update+0x366/0xdac [ofd]
[<ffffffffa0831828>] ldlm_cancel_locks_for_export_cb+0x88/0x200 [ptlrpc]
[<ffffffffa059178f>] cfs_hash_for_each_relax+0x17f/0x360 [libcfs]
[<ffffffffa0592fde>] cfs_hash_for_each_empty+0xfe/0x1e0 [libcfs]
[<ffffffffa082c05f>] ldlm_cancel_locks_for_export+0x2f/0x40 [ptlrpc]
[<ffffffffa083b804>] server_disconnect_export+0x64/0x1a0 [ptlrpc]
[<ffffffffa0d717fa>] ofd_obd_disconnect+0x6a/0x1f0 [ofd]
[<ffffffffa06b5d77>] class_disconnect_export_list+0x337/0x660 [obdclass]
[<ffffffffa06b6496>] class_disconnect_exports+0x116/0x2f0 [obdclass]
[<ffffffffa06de9cf>] class_cleanup+0x16f/0xda0 [obdclass]
[<ffffffffa06e06bc>] class_process_config+0x10bc/0x1c80 [obdclass]
[<ffffffffa06e13f9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
[<ffffffffa071615c>] server_put_super+0x5bc/0xf00 [obdclass]
[<ffffffff8118461b>] generic_shutdown_super+0x5b/0xe0
[<ffffffff81184706>] kill_anon_super+0x16/0x60
[<ffffffffa06e3256>] lustre_kill_super+0x36/0x60 [obdclass]
[<ffffffff81184ea7>] deactivate_super+0x57/0x80
[<ffffffff811a2d2f>] mntput_no_expire+0xbf/0x110
[<ffffffff811a379b>] sys_umount+0x7b/0x3a0
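To make the cost concrete, here is a minimal, self-contained C sketch of the pattern the trace above shows. Every struct and function in it is an illustrative stand-in that only mirrors the real names; it is not the actual Lustre 2.4.3 code.

/* Simplified model of the per-lock LVB refresh shown in the stack above.
 * All types and functions here are illustrative stand-ins, not the real
 * Lustre implementation. */
#include <stdio.h>

struct lvb { long size; long blocks; };
struct resource { struct lvb lvb; };
struct lock { struct resource *res; };

/* Stand-in for ofd_lvbo_update(): refresh the LVB from on-disk state.
 * In the real trace this is where osd_fid_lookup()/dbuf_read() end up
 * waiting on synchronous ZFS reads. */
static void lvbo_update_from_disk(struct resource *res)
{
	res->lvb.size = 0;	/* pretend we just re-read it from the OSD */
}

/* Stand-in for ldlm_cancel_locks_for_export(): every lock held by the
 * departing export has its resource LVB refreshed from disk, so N locks
 * means N reads even though the export is being torn down. */
static void cancel_locks_for_export(struct lock *locks, long nlocks)
{
	for (long i = 0; i < nlocks; i++)
		lvbo_update_from_disk(locks[i].res);
}

int main(void)
{
	enum { NLOCKS = 4 };	/* imagine millions on a busy OST */
	struct resource res[NLOCKS];
	struct lock locks[NLOCKS];

	for (int i = 0; i < NLOCKS; i++)
		locks[i].res = &res[i];

	cancel_locks_for_export(locks, NLOCKS);
	printf("refreshed %d LVBs from disk on disconnect\n", NLOCKS);
	return 0;
}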
What might make sense for lock cancellation in the client-eviction case is to mark the LVB stale in the resource, so it is refreshed from disk only if the lock is used again. That would avoid updating the LVB repeatedly while canceling many locks, and avoids any work at all if the resource is never used again.
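As a rough illustration of that idea (again with made-up stand-in types rather than the real ldlm structures), the eviction path would only set a flag, and the disk read would be deferred to the next user of the resource, if there ever is one:

#include <stdbool.h>

struct lvb { long size; long blocks; };

struct resource {
	struct lvb lvb;
	bool lvb_stale;		/* hypothetical flag set instead of re-reading */
};

/* Eviction-time cancel: no disk I/O, just mark the cached LVB as suspect. */
static void cancel_lock_on_eviction(struct resource *res)
{
	res->lvb_stale = true;
}

/* The first later user of the resource pays the refresh cost, and only if
 * that use ever happens. */
static void resource_get_lvb(struct resource *res, struct lvb *out)
{
	if (res->lvb_stale) {
		/* stand-in for ofd_lvbo_update() reading from disk */
		res->lvb_stale = false;
	}
	*out = res->lvb;
}

With this model, canceling millions of locks during an umount touches only memory, and the disk read happens at most once per resource afterwards.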
I also notice that the comment in ldlm_glimpse_ast() implies that filter_intent_policy() is handling this, but the new ofd_intent_policy() uses ldlm_glimpse_locks(), which does not appear to call ldlm_res_lvbo_update(res, NULL, 1) if the glimpse fails.
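For reference, here is a sketch of the step that appears to be missing. The only names taken from the ticket are ofd_intent_policy()/ldlm_glimpse_locks()/ldlm_res_lvbo_update(); everything below is an illustrative stand-in rather than the real ptlrpc code.

struct resource { long lvb_size; };

/* Stand-in for ldlm_res_lvbo_update(res, NULL, 1): refresh the LVB from
 * disk when the client-provided value cannot be trusted. */
static void lvbo_update(struct resource *res)
{
	res->lvb_size = 0;	/* pretend the value was re-read from the OSD */
}

/* Stand-in for the glimpse path: if the glimpse AST to the client fails
 * (client evicted, unreachable, etc.), fall back to the on-disk LVB the
 * way the ldlm_glimpse_ast() comment says filter_intent_policy() did. */
static void glimpse_done(struct resource *res, int glimpse_rc)
{
	if (glimpse_rc != 0)
		lvbo_update(res);
}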
So it looks like there are a few improvements that could be made: