[LU-6397] LDLM lock creation race condition on new object creation - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.7.0
Labels: patch
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

LU-1669 has made possible a pair of race conditions when acquiring new LDLM locks.
This is the first of the two; I will report the other shortly.

This applies to newly created objects, for which kms_valid is not yet set.

If kms_valid is not set, osc_enqueue_base will not attempt to match existing LDLM locks:

         /*
          * kms is not valid when either object is completely fresh (so that no
          * locks are cached), or object was evicted. In the latter case cached
          * lock cannot be used, because it would prime inode state with
          * potentially stale LVB.
          */
         if (!kms_valid)
                 goto no_match;

kms_valid is read from osc->oo_oinfo->loi_kms_valid in the osc_object.
It is set by loi_kms_set, which is called from osc_attr_update as part of cl_object_attr_update. That in turn is called from osc_lock_lvb_update, which is called from osc_enqueue_fini.
osc_enqueue_fini is not called until a reply has been received from the server, either in osc_enqueue_base (regular locks) or osc_enqueue_interpret (async locks).
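
The window this leaves open can be illustrated with a small toy model. This is only a sketch, not Lustre code: toy_object, toy_enqueue and toy_reply_received are hypothetical names, and the model keeps nothing but the flag and two counters. It assumes that a lock becomes visible to the match step at the same moment kms_valid is set, that is, when the reply arrives.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the client-side object; not a Lustre type. */
struct toy_object {
	bool kms_valid;     /* plays the role of loi_kms_valid */
	int  cached_locks;  /* locks the match step could reuse */
	int  requests_sent; /* lock requests sent to the server */
};

enum toy_result { TOY_MATCHED, TOY_ENQUEUED };

/* Same control flow as the excerpt above: while kms_valid is unset,
 * the match step is skipped and a new request is always sent. */
static enum toy_result toy_enqueue(struct toy_object *obj)
{
	if (!obj->kms_valid)
		goto no_match;
	if (obj->cached_locks > 0)
		return TOY_MATCHED;
no_match:
	obj->requests_sent++;
	return TOY_ENQUEUED;
}

/* Stand-in for the reply path (osc_enqueue_fini down to loi_kms_set). */
static void toy_reply_received(struct toy_object *obj)
{
	obj->cached_locks++;
	obj->kms_valid = true;
}
```

Starting from a zero-initialised toy_object, every call to toy_enqueue made before toy_reply_received sends its own request; only calls made after the reply can match.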

This results in a race when two IO requests are going at the same time.
Consider:

P1 makes an IO request (for example, a write to the first page of the file)
P1 creates an LDLM lock request
P1 waits for a reply from the server
P2 makes an IO request (for example, a read from the second page of the file)
P2 creates an LDLM lock request
P2 does not check for existing LDLM locks (goto no_match in osc_enqueue_base, as described above)
P2 waits for a reply from the server
P1 receives its reply; the lock is granted
(The lock is expanded beyond the requested extent, so it covers the area P2 wants to read)
P2 receives its reply; its lock is blocked by the lock granted to P1
The lock granted to P1 is called back by the server, even though it matches the request from P2
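
The last three steps can be sketched with a toy extent-lock model. Everything here is an assumption made for illustration: the toy_* names are hypothetical, only two modes (read and write) are modelled, and the rule that a held write lock can satisfy a read request on the client is a simplification chosen for this sketch, not a statement of the real LDLM matching rules.

```c
#include <stdbool.h>
#include <stdint.h>

enum toy_mode { TOY_PR, TOY_PW };  /* toy read and write modes */

struct toy_lock {
	enum toy_mode mode;
	uint64_t start, end;  /* inclusive byte extent */
	bool granted;
	bool called_back;
};

static bool toy_overlap(const struct toy_lock *a, const struct toy_lock *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* In this toy, only two read locks are compatible with each other. */
static bool toy_compatible(enum toy_mode a, enum toy_mode b)
{
	return a == TOY_PR && b == TOY_PR;
}

/* Client-side view: could the held lock have served this request? */
static bool toy_covers(const struct toy_lock *held, const struct toy_lock *req)
{
	return held->granted &&
	       held->start <= req->start && held->end >= req->end &&
	       (held->mode == TOY_PW || held->mode == req->mode);
}

/* Server-side view: a separate request that conflicts with a granted
 * lock is not granted, and the granted lock is called back.
 * Returns true if the request is granted immediately. */
static bool toy_server_enqueue(struct toy_lock *held, struct toy_lock *req)
{
	if (held->granted && toy_overlap(held, req) &&
	    !toy_compatible(held->mode, req->mode)) {
		held->called_back = true;
		return false;
	}
	req->granted = true;
	return true;
}
```

With P1 holding a granted write lock expanded to the whole file and P2 requesting a read lock on the second page, toy_covers reports that P1's lock could have served P2, yet toy_server_enqueue blocks P2 and marks P1's lock as called back, because the server only sees two separate requests.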

This is easier to see with async lock requests, since they do not wait (and do not take the range lock, which would prevent this race for truly overlapping IOs), but it also applies to regular lock requests.

This can be solved by removing the usage of kms_valid in osc_enqueue_base.

Per the comment on that check, two cases must be handled before it can be removed:
newly created objects with no LDLM locks, and evicted objects. ("Evicted objects" refers to OSC objects removed from LRU due to memory pressure.)

For newly created objects: If the object is new and no locks exist, then it's safe to try to match.
The match attempt will simply fail, and a new lock will be requested.

For evicted objects, Jinshan suggested a solution:
"[...] we can change the code [in osc_object_prune] to get rid of all cached dlm locks when the [osc] object is being destroyed. After this is done, we don’t need to worry about kms_valid any more."
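
Both cases can be pictured with a toy version of the proposed flow. This is a sketch of the idea only and is not the patch: toy_obj, toy_enqueue_nogate and toy_prune are hypothetical names, and the lock cache is reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the client-side object; not a Lustre type. */
struct toy_obj {
	int cached_locks;   /* locks cached on the client for this object */
	int requests_sent;  /* lock requests sent to the server */
};

/* Proposed flow: always try to match, with no kms_valid gate.
 * Returns true if an existing lock was reused. */
static bool toy_enqueue_nogate(struct toy_obj *obj)
{
	if (obj->cached_locks > 0)
		return true;
	obj->requests_sent++;  /* nothing to match, request a new lock */
	return false;
}

/* Stand-in for the suggested osc_object_prune change: drop every
 * cached lock when the object is destroyed, so nothing stale can
 * be matched afterwards. */
static void toy_prune(struct toy_obj *obj)
{
	obj->cached_locks = 0;
}
```

In this model a fresh object has no cached locks, so the match attempt fails harmlessly and a request is sent. A pruned object looks the same as a fresh one, which is why the gate is no longer needed in the toy.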

Jinshan also provided the patch for this, which I've done basic testing on and will upload shortly.

Attachments

Activity

Gerrit Updater added a comment - 28/Apr/15 3:43 PM

Patrick Farrell (paf@cray.com) uploaded a new patch: http://review.whamcloud.com/14630
Subject: LU-6397 osc: Remove kms_valid check in osc_enqueue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 47a84ccb47b91b9c02d20136072d97024bde697c

Gerrit Updater added a comment - 24/Mar/15 9:52 PM

Patrick Farrell (paf@cray.com) uploaded a new patch: http://review.whamcloud.com/14167
Subject: LU-6397 osc: Remove kms_valid check in osc_enqueue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a2703428a62f531c64dbef6d1a7a8c548d0ca91f
