[LU-10235] mkdir should check for directory existence on client before taking write lock Created: 13/Nov/17 Updated: 10/Jan/23 Resolved: 20/Feb/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.9 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | CEA | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | CEA | ||
| Description |
|
A "common" hang we have on our filesystems is when some clients experience temporary IB problems and other clients issue an occasional `mkdir -p` on a full tree. Taking an example `/mnt/lustre/path/to/dir`, basically all the active clients have a CR or PR lock on `/mnt/lustre`, but the client issuing `mkdir -p /mnt/lustre/path/to/dir` will try to get a CW lock on each intermediate directories, which makes the server recall all PR locks on very accessed base directories. Under bad weather, that recall can take time (as long as ldlm timeout), during which clients which did give the lock back will try to reestablish it and get blocked as well until that CW is granted and the no-op mkdir is done.. Which in turn can starve all the threads on the MDS in a situation with many clients and render the MDS unresponsive for a while even after resolving the IB issues. I think we would be much less likely to experience such hangs if mkdir() would, on the client side, first check with a read lock if the child directory exist before upgrading that to a write lock and sending the request to the server. If the directory already exists we can safely return EEXIST without sending mkdir to the server, in most other cases we will need to upgrade lock and send request as we currently do.
On the last occurrence here we had 324 mdt0[0-3]_[0-9]* threads stuck on this trace:
#0  schedule
#1  schedule_timeout
#2  ldlm_completion_ast
#3  ldlm_cli_enqueue_local
#4  mdt_object_lock0
#5  mdt_object_lock
#6  mdt_getattr_name_lock
#7  mdt_intent_getattr
#8  mdt_intent_policy
#9  ldlm_lock_enqueue
#10 ldlm_handle_enqueue0
#11 tgt_enqueue
#12 tgt_request_handle
#13 ptlrpc_main
#14 kthread
#15 kernel_thread
And 4 threads on this one:
#0  schedule
#1  ldlm_completion_ast
#2  ldlm_cli_enqueue_local
#3  mdt_object_lock0
#4  mdt_object_lock
#5  mdt_object_find_lock
#6  mdt_reint_create
#7  mdt_reint_rec
#8  mdt_reint_internal
#9  mdt_reint
#10 tgt_request_handle
#11 ptlrpc_main
#12 kthread
#13 kernel_thread
A finer examination lets us find a struct ldlm_resource in these threads, which points to a single directory (actually two in this case: we have a directory common to all users just below the root here, so it sees just as much contention as the filesystem root itself). The ldlm_resource lists (lr_granted, lr_waiting) show a few granted CR locks (not revoked) and a handful of granted PR locks that aren't being given back, plus many waiting PR locks and a handful of CW locks waiting for the last granted PR to go away.
I'm afraid I do not have exact traces of this anymore, as it was on a live system and we didn't crash the MDT (and it was on a black site anyway), but if there are other things to look at I'm sure it will happen again eventually, so please ask.
Thanks, Dominique Martinet |
| Comments |
| Comment by Peter Jones [ 13/Nov/17 ] | |||||||||||||||
|
Dominique, which Lustre version is in use on the affected system? Peter | |||||||||||||||
| Comment by Dominique Martinet (Inactive) [ 13/Nov/17 ] | |||||||||||||||
|
Sorry, forgot to specify. The specific traces I gave here were taken on lustre-fe 2.7.3 servers with some 2.5 clients (2.5.3.90, but it's Bull's versioning, so I'm not sure what it corresponds to); I'm pretty sure we've had something similar with 2.7 clients + 2.5 servers on another cluster, and probably other mixes. I had a quick glance and did not see anything going in that direction in newer versions, and I would need to spend some time trying to reproduce with a limited number of clients on VMs for more recent versions. Were there improvements to the clients in that regard since 2.7? The original problem really is some IB trouble anyway, so it would happen regardless of version, but I think this would make Lustre servers more tolerant of a few clients behaving badly.
Thanks, Dominique | |||||||||||||||
| Comment by John Hammond [ 14/Nov/17 ] | |||||||||||||||
|
Hi Dominique, I agree that this may have some benefit in the case that you describe, but I don't think it's the best policy in the general case. It violates the EAFP principle and increases complexity. We should assume that if the user calls mkdir() then the user really wants to create the directory. Also, the small overhead will be reported as a performance regression by other users. Could the uses of 'mkdir -p /mnt/lustre/path/to/dir' easily be switched to 'cd /mnt/lustre/path && mkdir -p to/dir'? | |||||||||||||||
| Comment by Dominique Martinet (Inactive) [ 15/Nov/17 ] | |||||||||||||||
|
Hi John, Answering in reverse order: I'm afraid we cannot do much about the uses of 'mkdir -p /mnt/lustre..' because they are not really under our control. If there is a job that does it in a tight loop or some overzealous user, we might be able to spot it and tell them to do something else, but users all have their own scripts and we can't really enforce anything like that. Going back to your previous point, I believe we'd need to benchmark it, but the overhead might be smaller than you think. In the general case the client is very likely to already have a read lock, so it will only be querying its cache. In the mkdir-of-existing-directory case it will actually be much faster, because it doesn't need to wait for the lock upgrade and for other clients to give the lock back. I would be happy with checking whether the directory exists only when we already have a read lock, because clients will really always have one for the directories I'm concerned about. I'll definitely agree it is more complex, though; I don't have much of an idea to help there. I actually have to admit I haven't spent enough time working out how the process currently works exactly: is the client actually sending two separate requests for mkdir, one to upgrade the lock and then the create, or is the lock upgrade done on the server on behalf of the client transparently? If it is the latter, I believe it's not too much added complexity, and it could actually be simple to just check before getting the write lock (no need to even bother with a read lock on the server side?). Thanks, | |||||||||||||||
| Comment by John Hammond [ 15/Nov/17 ] | |||||||||||||||
|
Well we would be happy to look at any patches that you send us. | |||||||||||||||
| Comment by Dominique Martinet (Inactive) [ 16/Nov/17 ] | |||||||||||||||
|
Ok, so there are two ways to go about this that I can see.
- The first is fairly straightforward: in mdt_create() (lustre/mdt/mdt_reint.c), right at the start, after we have found the parent but before taking its lock with MDS_INODELOCK_UPDATE, first try mdt_object_lock_try(...MDS_INODELOCK_LOOKUP) and, if that succeeds, do a first mdt_lookup_version_check(...) just as below and return EEXIST early if applicable. Then continue as we currently do (take the lock with the UPDATE bit, check again in case things changed, and proceed). I'm not 100% sure how the lock upgrade should work (should I willingly give back the LOOKUP bit first, or will the UPDATE request do that?), but I should be able to submit a first version to discuss further details on gerrit reasonably quickly if that's the way to go.
- A second approach is directly on the client, and is way more complex, but would let the client use its cache when it is populated. I'm not as clear on this one. I could probably check in ll_new_node() (lustre/llite/namei.c), but I can't find any way to check only in the cache (d_lookup maybe, after picking a dentry from the inode? But how to tell whether the cache is valid?). I've got to say I'm less confident there, so I would need some help.
The obvious pros and cons are that the first one is simpler, while the second would be cheaper if we restrict ourselves to already-in-cache-only lookups, which I am not actually sure we want to do. Should I go ahead with what I first described? (A rough mock of that control flow is sketched below.) Thanks! | |||||||||||||||
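A minimal, self-contained mock of that first approach, to make the intended control flow concrete. This is only a sketch under assumptions: `try_read_lookup_lock()`, `take_write_update_lock()`, `drop_lock()` and `name_exists()` are hypothetical stand-ins for the real `mdt_object_lock_try(... MDS_INODELOCK_LOOKUP)`, `mdt_object_lock(... MDS_INODELOCK_UPDATE)`, `mdt_object_unlock()` and `mdt_lookup_version_check()` calls, whose actual prototypes differ, so none of this is the real MDT code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the MDT helpers named above; they only model
 * the behaviour needed to show the control flow. */
static bool try_read_lookup_lock(const char *parent)   { (void)parent; return true; }
static int  take_write_update_lock(const char *parent) { (void)parent; return 0; }
static void drop_lock(const char *parent)              { (void)parent; }
static bool name_exists(const char *parent, const char *name)
{
	(void)parent;
	return strcmp(name, "existing") == 0;   /* pretend "existing" is already there */
}

/* Proposed mkdir handling on the MDT: check under a read lock first. */
static int mdt_create_flow(const char *parent, const char *name)
{
	int rc;

	/* Step 1 (the proposed addition): if a read-mode LOOKUP lock can be
	 * taken without blocking, look the name up and return -EEXIST early,
	 * before forcing every client to cancel its PR lock on the parent. */
	if (try_read_lookup_lock(parent)) {
		bool exists = name_exists(parent, name);

		drop_lock(parent);
		if (exists)
			return -EEXIST;
	}

	/* Step 2 (unchanged): take the write-mode UPDATE lock, re-check the
	 * name in case it appeared in the meantime, then do the create. */
	rc = take_write_update_lock(parent);
	if (rc != 0)
		return rc;
	if (name_exists(parent, name)) {
		drop_lock(parent);
		return -EEXIST;
	}
	/* ... the actual directory creation would happen here ... */
	drop_lock(parent);
	return 0;
}

int main(void)
{
	printf("mkdir of existing dir -> %d\n", mdt_create_flow("/mnt/lustre/testdir", "existing"));
	printf("mkdir of new dir      -> %d\n", mdt_create_flow("/mnt/lustre/testdir", "newdir"));
	return 0;
}
```

For the existing name this returns -EEXIST without ever reaching the write-lock step, which is exactly the path that avoids recalling the PR locks held on a contended parent.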
| Comment by Oleg Drokin [ 20/Nov/17 ] | |||||||||||||||
|
The lock upgrade is such a snake's nest that you absolutely do not want to go there. As such, dropping the lock, then re-obtaining it in write mode and then redoing your lookups is the way to go (also adding the UPDATE bit in the process - in fact there's no harm in getting the UPDATE bit in read mode either). The client approach should already mostly be there, handled by the VFS: if you have a valid dentry for a directory, Lustre would not even be called and you'd get your EEXIST (roughly as sketched below) - did you check this? This is actually a broader problem, also applicable to open-creates - I filed | |||||||||||||||
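To illustrate that VFS short-circuit, here is a toy model only, not kernel code: `lookup_cached_dentry()` is a hypothetical stand-in for the dcache lookup plus `->d_revalidate`, and `call_lustre_mkdir()` stands in for the filesystem's `->mkdir` (`ll_mkdir()` on a Lustre client); the real logic lives in the generic VFS (fs/namei.c).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in: "does the client hold a valid, positive dentry for this name?" */
static bool lookup_cached_dentry(const char *path)
{
	(void)path;
	return false;   /* as the follow-up comment notes, this hit is rarer than expected */
}

/* Hypothetical stand-in for the filesystem's ->mkdir (ll_mkdir on a Lustre client). */
static int call_lustre_mkdir(const char *path)
{
	printf("mkdir RPC (and CW lock enqueue) sent for %s\n", path);
	return 0;
}

/* Rough model of what the VFS does for mkdir(2) before calling the filesystem. */
static int vfs_mkdir_model(const char *path)
{
	/* With a valid positive dentry in the dcache, mkdir(2) fails with
	 * -EEXIST here and Lustre is never called: no lock upgrade, no RPC.
	 * When the dentry is missing or stale, the request falls through to
	 * the filesystem and hits the server as usual. */
	if (lookup_cached_dentry(path))
		return -EEXIST;
	return call_lustre_mkdir(path);
}

int main(void)
{
	return vfs_mkdir_model("/mnt/lustre/testdir/foo");
}
```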
| Comment by Dominique Martinet (Inactive) [ 22/Dec/17 ] | |||||||||||||||
|
I've been a bit busy so I didn't have as much time as I would have liked to work on this, but here are a few thoughts.
I can confirm that the client approach works "a bit"; I think the problem is that the client cache is not as populated as I thought it would be. For example, a client doing "mkdir -p /mnt/lustre/dir1/dir2/dir3" twice in a row after a fresh mount will send mkdirs the first time (obviously), but regardless of the result it will also send mkdirs the second time; only after a proper readdir like `find /mnt/lustre` will the client recognize that the directories exist, despite having just created them or entered them (`mkdir -p` does a chdir() after each component). The inodes must exist somewhere in the VFS, just not in the right place. This behavior explains why even very commonly accessed directories right at the root of the filesystem are not considered "in cache" from this point of view: these directories never change, they are hardcoded in user codes, and it is very likely that a readdir never happened. I'm not sure if we could improve that somehow?
Back to the actual patch, I'm a bit torn on the issue you point at, where all clients get the read lock, notice the absence of the directory, and then all get a write lock. As you say it is not such a big deal for mkdirs, but in general many opens will come right after an MPI barrier() or similar and will thus be very, very synchronous. This can be problematic for the open-create case mentioned above. For now I'll stick to the simple solution and not do anything about this for mkdir, but I think this needs a bit more thought in the long run. | |||||||||||||||
| Comment by Gerrit Updater [ 16/Jan/18 ] | |||||||||||||||
|
Dominique Martinet (dominique.martinet@cea.fr) uploaded a new patch: https://review.whamcloud.com/30880 | |||||||||||||||
| Comment by Dominique Martinet (Inactive) [ 16/Jan/18 ] | |||||||||||||||
|
I've tested the patch I submitted with my usual hanger:
time clush -bw vm[1-131] 'seq 0 1000 | xargs -P 7 -I{} sh -c "(({}%%3==0)) && mkdir /mnt/lustre/testdir/foo 2>/dev/null || stat /mnt/lustre/testdir > /dev/null"'
With 131 clients connecting to an @tcp llmount, this takes 54 seconds to run without the patch and 9 seconds with the patch, so it's somewhat working. I'm still not happy with the pathological case where clients all do the check at the same time and then all try to get the write lock afterwards. Obviously this won't be much worse now than it was before, but I don't think we should be happy with that for the open-create case. Thanks, | |||||||||||||||
| Comment by Dominique Martinet (Inactive) [ 16/Jan/18 ] | |||||||||||||||
|
mdtest results, with https://github.com/LLNL/mdtest , on two 16-core Haswell VMs, with an llmount server mounted over tcp on both nodes. mdtest was invoked on a single node with `mpirun -n 8 ./mdtest -D -i 8 -I 1000`
I have to say I have no idea what mdtest calls "dir creation" vs "tree creation"; looking with strace I only see one pattern of mkdir. If someone is more familiar with the tool I'd be happy to use other options. | |||||||||||||||
| Comment by Dominique Martinet (Inactive) [ 18/Jan/18 ] | |||||||||||||||
|
The patch makes two tests fail in sanityn (43a and 45a): basically the tests expect an unlink on one mount followed by a mkdir on the other mount to make the mkdir wait until the unlink is completely done; that waiting must have been happening implicitly through the CW lock. I'm not sure where that expectation comes from. More to the point, I don't see how to work around it: waiting for other "directory write" operations to finish is precisely what we do not want here. | |||||||||||||||
| Comment by Andreas Dilger [ 30/Jan/18 ] | |||||||||||||||
|
As mentioned in the patch review, there is a legitimate requirement that rmdir and mkdir operations submitted in chronological order should both succeed, even if run on different nodes. However, the rmdir operation must have completed before the mkdir is started, rather than just having been submitted, or the outcome is racy and either success or -EEXIST is an acceptable result. IMHO the tests as written artificially block the rmdir completion, so it isn't clear that their failure indicates a bug. The OBD_FAIL_MDS_PDO_LOCK check happens during locking, before the target file is unlinked, so it seems that these tests should be skipped for newer MDS versions that include the patch. | |||||||||||||||
| Comment by Mikhail Pershin [ 31/Jan/18 ] | |||||||||||||||
|
Andreas, yes, these tests check just PDO lock correctness; previously the -EEXIST check in the code was done after taking the PDO lock, so the tests ensure exactly that. The patch changes that to check EEXIST without the lock; the tests don't expect that and fail, assuming that the PDO lock doesn't work. The important case for the EEXIST check is concurrent creates, and that is OK with the patch because it checks EEXIST again after taking the PW lock. As for the rmdir vs mkdir case, it doesn't work now and the test is not valid with the patch applied. I think the patch may disable these tests for now. I will create a ticket to update the tests with a better conflict simulation. | |||||||||||||||
| Comment by Gerrit Updater [ 20/Feb/20 ] | |||||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/30880/ | |||||||||||||||
| Comment by Peter Jones [ 20/Feb/20 ] | |||||||||||||||
|
Landed for 2.14 | |||||||||||||||
| Comment by Gerrit Updater [ 15/Feb/21 ] | |||||||||||||||
|
Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41674 | |||||||||||||||
| Comment by Gerrit Updater [ 05/May/22 ] | |||||||||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41674/ |