[LU-9964] > 1 group lock on same file (group lock lifecycle/cbpending problem) Created: 08/Sep/17 Updated: 03/Mar/23 Resolved: 25/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Description |
|
Sometimes, when many threads use group locks to write to one file, one of several assertions is hit. Note that all of this assumes the lock requests are cooperating and use the same GID.

From osc_cache_writeback_range:

And from osc_extent_merge:

Investigation of the dumps shows that in all of these cases, multiple group locks are granted on the same resource at the same time, and one of these locks has cbpending set. This is broadly similar to [...]

I believe there are actually two problems here, one in the request phase and one in the destruction phase.

It is possible for two threads (on the same client) to request a group lock from the server at the same time. If this happens, both group locks will be granted, because they are compatible with one another. This leaves two group locks granted at the same time on the same file. When one of them is eventually released, this can cause the crashes noted above, because two locks cover the same dirty pages.

Additionally, almost exactly the problem described in [...] After this point, new requests on the client will not match this lock any more. That can result in new group lock requests to the server, again creating the overlapping lock problem. This also results in the same crashes.

The solution comes in two parts: |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 08/Sep/17 ] |
|
The attached files together comprise a test for the "two group locks granted on same resource" case. They will NOT crash (because they do not write to the file); they simply exit and dump debug when the case is identified.

Compile the .c file (in a directory by itself) to a binary named a.out.

On a 4 CPU VM without my patch, I hit the problem in < 10 minutes. On a real system with 32 CPUs, I hit the problem in < 1 minute. |
| Comment by Patrick Farrell (Inactive) [ 08/Sep/17 ] |
|
Note that these problems exist for PW locks as well, but [...]

We could achieve this by checking the exports before [...] |
| Comment by Gerrit Updater [ 08/Sep/17 ] |
|
|
| Comment by Gerrit Updater [ 14/Aug/19 ] |
|
Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/35791 |
| Comment by Gerrit Updater [ 07/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35791/ |
| Comment by Joseph Gmitter (Inactive) [ 25/Nov/19 ] |
|
Patch landed to 2.13.0 |
| Comment by Gerrit Updater [ 27/Jan/21 ] |
|
|
| Comment by Gerrit Updater [ 03/Mar/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50198 |