[LU-10413] Side-effect of 'stat' on data writeback Created: 20/Dec/17 Updated: 06/May/18 Resolved: 06/May/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mikhail Pershin | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
There is quite old code in ldlm_handle_gl_callback() which flushes data on glimpse:

    static void ldlm_handle_gl_callback()
    {
            ...
            if (lock->l_granted_mode == LCK_PW &&
                !lock->l_readers && !lock->l_writers &&
                ktime_after(ktime_get(),
                            ktime_add(lock->l_last_used, ktime_set(10, 0)))) {
                    unlock_res_and_lock(lock);
                    if (ldlm_bl_to_thread_lock(ns, NULL, lock))
                            ldlm_handle_bl_callback(ns, NULL, lock);
                    EXIT;
                    return;
            }
            ...
    }

It flushes the lock's data if the lock has stayed on the client for more than 10 seconds. That means we have a sort of additional, uncontrolled flusher for client dirty data. For example, a regular 'stat' operation (stat-ahead?) on some client may work as a 'flusher' for all other clients with dirty data. I see two problems with that: such flushes happen regardless of the client's local VM writeback settings, and they happen even when no other client actually intends to use the file.

Solutions:

1. Remove the code entirely if it is no longer needed.
2. Keep the code but exclude glimpses from stat-ahead and similar things. The original idea of this code was to flush dirty data because there is a sign (a glimpse) that somebody is going to use the file; with stat-ahead that is not true anymore.
3. Keep the code but introduce a parameter for it instead of the hardcoded 10 seconds, with zero meaning the feature is disabled (a sketch follows). |
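For solution 3, here is a minimal sketch of the decision logic with the hardcoded 10 seconds replaced by a tunable, modelled as self-contained userspace C; the name dirty_age_limit and the helper should_cancel_on_glimpse() are illustrative assumptions, not the actual patch:

    #include <stdbool.h>
    #include <time.h>

    /* Hypothetical tunable, in seconds; 0 disables flush-on-glimpse. */
    static unsigned int dirty_age_limit = 10;

    /* Model of the glimpse-callback check above: cancel (and thereby
     * flush) a PW lock only if it has no active users and has been
     * idle longer than the configured limit. */
    static bool should_cancel_on_glimpse(bool granted_pw, int readers,
                                         int writers, time_t last_used,
                                         time_t now)
    {
            if (dirty_age_limit == 0)
                    return false;   /* feature disabled */
            return granted_pw && readers == 0 && writers == 0 &&
                   now > last_used + (time_t)dirty_age_limit;
    }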
| Comments |
| Comment by Mikhail Pershin [ 20/Dec/17 ] |
|
I found this while investigating false-negative results in tests 101[a,b,c] in sanityn. The tests use VM kernel settings to disable flushes, but flushes still happen due to such glimpses. This code is very old, from 2007 at least, and there are no details about why it was introduced, whether it was just an optimization, a fix, or something else. |
| Comment by Mikhail Pershin [ 20/Dec/17 ] |
|
Another bad effect of this is related to unlink. There is a mechanism in Lustre to notify a client to discard data instead of writing it back, but it becomes broken if data stays on the client for more than 10 seconds: there will be a glimpse from the client doing the unlink, and the data will be flushed before the unlink itself. |
| Comment by Andreas Dilger [ 25/Dec/17 ] |
|
The reason this code was added is so that if one client was writing to a file, but is now idle, and a second client starts accessing that file, the first client does not hold on to the idle lock until it is cancelled by LRU aging, which would prevent the second client from getting a lock on that object/extent. With the current code, if the first client's lock is idle for 10s (i.e. no longer actively being written) then the glimpse callback will notify the client that there is another client interested in the lock, and cause the lock to be cancelled. Otherwise, a DLM lock can live in the LRU for up to max_lru_age (3900s by default) if there is not much DLM activity on the client's OSC. That means other clients cannot cache fstat() data for ls -l until the lock drops from the first client's LRU. If you think there is a problem with the 10s "idle glimpse cancel" time limit, this could be exposed via a tunable /proc or /sys file. |
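As a rough illustration of exposing such a limit, a standard kernel module parameter would surface it under /sys/module/<module>/parameters; the name dirty_age_limit is a hypothetical placeholder, and the eventual patch may well use Lustre's own lprocfs machinery instead:

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* Hypothetical knob: idle age in seconds after which a glimpse may
     * cancel an unused PW lock; 0 disables the behaviour. */
    static unsigned int dirty_age_limit = 10;
    module_param(dirty_age_limit, uint, 0644);
    MODULE_PARM_DESC(dirty_age_limit,
                     "idle seconds before a glimpse cancels an unused PW lock (0=off)");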
| Comment by Mikhail Pershin [ 28/Dec/17 ] |
|
Andreas, I observed that this code causes flushes when they are not expected and when the file is not being used by another client, i.e. there is no open/read/write, etc. In my case it was just a 'stat' request before unlink: it flushed data that should actually have been discarded on unlink, so it breaks the 'discard-on-unlink' mechanism at least. I am also afraid that with this code and stat-ahead we have a sort of 'Lustre-wide' flusher which flushes dirty data every 10 seconds, regardless of local client VM settings and regardless of whether the file is going to be used or not. |
| Comment by Andreas Dilger [ 03/Jan/18 ] |
|
The reason this is needed is that if another client starts accessing this file (e.g. ls -l in some directory), we want that client to be able to cache the lock without having to refetch the lock every time that the file is accessed. Otherwise, the glimpse will never cause the cancellation of a PW DLM lock on the client, even if the client is using stat() on the file repeatedly. On the first access by another client, the initial client would flush the dirty pages from cache and cancel the lock, and on the second glimpse access the client would be able to get the DLM lock on the file and cache the file attributes locally. I think it is reasonable to expose this tunable via a /proc parameter. It might also make sense to increase this by default to be the same as the client dirty_writeback_interval / 100 so that the glimpse itself doesn't cause dirty data to be flushed from the client, but if the file is idle longer than what the VM is using to flush dirty pages then it will still cancel the lock. This could have some larger performance implications, since it means a client listing a directory wouldn't cache a file's attributes until it is at least 30s old. |
| Comment by Mikhail Pershin [ 06/Jan/18 ] |
|
So this is a question of balance between the ability to cache a file's attributes on clients and the flushing of dirty data on other clients. The best strategy depends on how the cluster is being used, and I think the only thing we can do here is expose this parameter via procfs and document the feature better somewhere, so administrators can tune it as needed. |
| Comment by Patrick Farrell (Inactive) [ 17/Apr/18 ] |
|
A further thought: This seems like a good optimization. Quite good, in fact. I like Andreas' suggestion to hook it to dirty_writeback_interval: 10 seconds vs. 30 seconds doesn't seem like a huge change (both seem like fairly large numbers for jobs that do "create in one place, open/stat in another" as sequential operations). Not sure about the /100 part...? |
| Comment by Andreas Dilger [ 19/Apr/18 ] |
|
Patrick, the / 100 is because that parameter is stored in centisecond units, not seconds (AFAIK). |
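For concreteness, a small sketch of the unit conversion under discussion, assuming a writeback interval of 3000 centiseconds; the helper name is illustrative:

    /* dirty_writeback_interval is stored in centiseconds (hundredths
     * of a second), so a value of 3000 corresponds to 30 seconds. */
    static unsigned int writeback_interval_to_seconds(unsigned int centisecs)
    {
            return centisecs / 100;   /* e.g. 3000 / 100 = 30 */
    }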
| Comment by Gerrit Updater [ 23/Apr/18 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/32113 |
| Comment by Gerrit Updater [ 06/May/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32113/ |
| Comment by Peter Jones [ 06/May/18 ] |
|
Landed for 2.12 |