[LU-6398] LDLM lock creation race condition Created: 24/Mar/15 Updated: 31/Jan/22 Resolved: 31/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | WC Triage |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
As promised in LU-6397, this ticket describes another race condition between new LDLM lock requests. After the kms_valid problem has been fixed for new objects (see LU-6397), a closely related race condition remains. Consider this sequence of events with two processes, P1 and P2, on the same client:

1. P1 issues a lock request for an extent and sends the enqueue RPC to the server.
2. Before the server's reply arrives, P2 starts an IO that the P1 lock would cover. Its call to ldlm_lock_match finds nothing, because the P1 lock is not yet in the granted or waiting queue.
3. P2 therefore issues its own lock request, even though it could simply have waited for the P1 lock.

The problem is this: locks are currently added to the waiting or granted queue (at which point they can be matched by other lock requests) by the processing policy, which is called from ldlm_lock_enqueue, which is in turn called from ldlm_cli_enqueue_fini. ldlm_cli_enqueue_fini is not called until a reply has been received from the server. So while P1 is waiting on a reply from the server, the P2 lock request (which would match the P1 lock if it were available for matching) can proceed. This was previously prevented by the kms_valid handling described in LU-6397.

I do not currently see a simple way to resolve this problem. It is made particularly difficult by async locks, such as lock ahead locks, which do not wait for a reply from the server. If we were concerned with only synchronous locks, we could either ignore this or conceivably hold a lock preventing a new lock request from calling into ldlm_lock_match until the other lock was issued. The problem with that idea is that it would prevent other IOs from using other existing locks.

I have one partial idea: make a new lock available for matching as soon as it has been created locally, rather than only after the server reply. This would only require preventing lock matching requests for the time it takes to allocate the ptlrpc request and create the lock. I am not sure how we would lock this, though; holding the resource spinlock for that long does not seem advisable.

This bug should be fairly low priority: it does not currently cause any actual failures, as valid locks are issued for the various IO requests involved; it is just inefficient. For asynchronous locks such as the proposed lock ahead locks, the problem is perhaps a bit worse, since a lock ahead lock can conceivably be cancelled by a normal IO request that was intended to fit inside it. Still, if there are multiple lock ahead requests on the server, the lock requested for the IO request will not be expanded, and as a result the server will only cancel one of the lock ahead locks. |
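As an illustration only, below is a minimal, self-contained toy model of the race in C (pthreads). The names are hypothetical stand-ins, not Lustre code: the toy lock only becomes matchable after the simulated server round-trip, so a second thread that could have reused it sends its own redundant request instead.

```c
/*
 * Toy model of the race (hypothetical names, not Lustre code).
 * Build: cc -pthread race_model.c -o race_model
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t res_lock = PTHREAD_MUTEX_INITIALIZER;
static bool lock_granted;       /* set once the "server reply" is processed */
static int  requests_sent;      /* how many enqueue "RPCs" went out */

/* Analogous to ldlm_lock_match(): only queued/granted locks can match. */
static bool toy_lock_match(void)
{
	bool match;

	pthread_mutex_lock(&res_lock);
	match = lock_granted;
	pthread_mutex_unlock(&res_lock);
	return match;
}

/* Analogous to the client enqueue path: the lock is not matchable until
 * the reply is processed (ldlm_cli_enqueue_fini -> ldlm_lock_enqueue). */
static void toy_enqueue(const char *who)
{
	pthread_mutex_lock(&res_lock);
	requests_sent++;
	pthread_mutex_unlock(&res_lock);
	printf("%s: sent enqueue RPC\n", who);

	usleep(100 * 1000);         /* simulated server round-trip */

	pthread_mutex_lock(&res_lock);
	lock_granted = true;        /* lock finally added to the granted queue */
	pthread_mutex_unlock(&res_lock);
	printf("%s: reply processed, lock now matchable\n", who);
}

static void *p1(void *arg)
{
	(void)arg;
	toy_enqueue("P1");
	return NULL;
}

static void *p2(void *arg)
{
	(void)arg;
	usleep(10 * 1000);          /* start while P1 still waits for its reply */
	if (toy_lock_match()) {
		printf("P2: matched P1's lock, no new request needed\n");
	} else {
		printf("P2: no matchable lock yet, issuing its own request\n");
		toy_enqueue("P2"); /* the second, redundant lock request */
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, p1, NULL);
	pthread_create(&t2, NULL, p2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("enqueue RPCs sent: %d (1 would have sufficed)\n", requests_sent);
	return 0;
}
```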
| Comments |
| Comment by Andreas Dilger [ 30/Jan/22 ] |
|
Patrick, is this still an issue? |
| Comment by Patrick Farrell [ 31/Jan/22 ] |
|
Yes, but it's pretty minor and I think not worth fixing. Basically it describes a situation where a client has a pending lock request (pending but not yet in the waiting queue, so basically no reply from the server yet), then another thread on the client generates a second lock request which could use that lock, but doesn't match it because it's not granted yet. The second lock request is actually made (when it could have just waited for the first lock to be available), and the client ends up self-conflicting and cancelling the first lock. But that all works just fine: no deadlock or other issue.

Anyway, this is A) rare (there's provision in the cl/OSC locking that mostly catches this), and B) harmless, just a very slight and probably transient performance cost (ie, after the first two compatible requests have fought it out, an LDLM lock is held in the normal way without further issue, so there's no ongoing cost paid).

Also, this is technically possible for normal I/O, but really it's something I saw with lockahead, when I'd have lockahead start an async request, then the thread would go on to do I/O right away and would send a new request before the first one had received a reply. This was solved in practice by creating a way for userspace to check if a lock existed (via the lockahead/ladvise interface), so the lockahead library could spool off a bunch of async requests all at once, but could then wait to start I/O until at least the first lock had been granted.

I'm going to close this one out. |
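For illustration, here is a minimal sketch of the library-side pattern described in the comment above: spool off async lockahead requests, then wait until at least the first lock exists before starting I/O. All helper names (lockahead_request_async, lockahead_lock_exists, do_write) are hypothetical stubs standing in for the real lockahead/ladvise interface, which is assumed rather than shown.

```c
/* Sketch of the lockahead library flow; the helpers below are placeholder
 * stubs, not the real lustreapi calls. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-ins for the lockahead/ladvise interface. */
static bool lockahead_request_async(int fd, long long start, long long end)
{
	printf("fd %d: async lockahead request [%lld, %lld]\n", fd, start, end);
	return true;
}

static bool lockahead_lock_exists(int fd, long long start, long long end)
{
	printf("fd %d: query lock [%lld, %lld]\n", fd, start, end);
	return true;    /* pretend the first lock has been granted */
}

static void do_write(int fd, long long start, long long end)
{
	printf("fd %d: write [%lld, %lld]\n", fd, start, end);
}

/* Issue a batch of async lockahead requests up front, then hold off on I/O
 * until at least the first lock exists, so the I/O does not race its own
 * pending lockahead request. */
static void write_with_lockahead(int fd, long long stripe, int nstripes)
{
	for (int i = 0; i < nstripes; i++)
		lockahead_request_async(fd, i * stripe, (i + 1) * stripe - 1);

	while (!lockahead_lock_exists(fd, 0, stripe - 1))
		usleep(1000);

	for (int i = 0; i < nstripes; i++)
		do_write(fd, i * stripe, (i + 1) * stripe - 1);
}

int main(void)
{
	write_with_lockahead(3, 1 << 20, 4);
	return 0;
}
```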