[LU-17190] Client-side high priority I/O handling under lock blocking AST Created: 13/Oct/23 Updated: 10/Nov/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Qian Yingjin | Assignee: | Qian Yingjin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
We found a deadlock caused by parallel DIO:

T1: writer
    Obtains the DLM extent lock L1 = PW[0, EOF]

T2: DIO reader: 50M data, iosize=64M, max_pages_per_rpc=1024 (4M), max_rpcs_in_flight=8
    ll_direct_IO_impl() uses all available RPC slots; the number of read RPCs in flight is 9
    On the server side:
      ->tgt_brw_read()
        ->tgt_brw_lock()    # server-side locking
          -> tries to cancel the conflicting lock on the client: L1 = PW[0, EOF]

T3: reader
    Takes a DLM lock reference on L1 = PW[0, EOF]
    Prepares read-ahead pages; waits for RPC slots to send the read RPCs to the OST

Deadlock:
    T2 -> T3: T2 is waiting for T3 to release the DLM extent lock L1;
    T3 -> T2: T3 is waiting for T2 to finish and free RPC slots.
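For illustration, below is a minimal user-space model of the circular wait (hypothetical stand-in code, not Lustre source): a semaphore plays the role of the RPC slots and a mutex plays the role of L1. T2 drains every slot and then needs L1 to complete, while T3 holds L1 and waits for a slot, so the program hangs.

```c
/* Hypothetical user-space model of the T2/T3 cycle; not Lustre code. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_RPCS_IN_FLIGHT 8

static sem_t rpc_slots;         /* free RPC slots (max_rpcs_in_flight) */
static sem_t slots_drained;     /* signalled once T2 owns every slot */
static pthread_mutex_t l1 = PTHREAD_MUTEX_INITIALIZER; /* DLM lock L1 = PW[0, EOF] */

static void *t3_reader(void *arg)
{
	sem_wait(&slots_drained);       /* start after T2 has taken all slots */
	pthread_mutex_lock(&l1);        /* take a DLM lock reference on L1 */
	sem_wait(&rpc_slots);           /* wait for an RPC slot: blocks forever */
	sem_post(&rpc_slots);
	pthread_mutex_unlock(&l1);
	return NULL;
}

static void *t2_dio_reader(void *arg)
{
	int i;

	for (i = 0; i < MAX_RPCS_IN_FLIGHT; i++)
		sem_wait(&rpc_slots);   /* DIO consumes every available slot */
	sem_post(&slots_drained);
	sleep(1);                       /* let T3 grab L1 and queue for a slot */

	/*
	 * The server-side locking for the extra read needs L1 cancelled before
	 * it can proceed; modelled here as T2 having to acquire L1 to complete.
	 */
	pthread_mutex_lock(&l1);        /* blocks forever: T3 holds L1 */
	pthread_mutex_unlock(&l1);
	for (i = 0; i < MAX_RPCS_IN_FLIGHT; i++)
		sem_post(&rpc_slots);
	return NULL;
}

int main(void)
{
	pthread_t t2, t3;

	sem_init(&rpc_slots, 0, MAX_RPCS_IN_FLIGHT);
	sem_init(&slots_drained, 0, 0);
	pthread_create(&t2, NULL, t2_dio_reader, NULL);
	pthread_create(&t3, NULL, t3_reader, NULL);
	pthread_join(t2, NULL);         /* never returns: T2 -> T3 -> T2 cycle */
	pthread_join(t3, NULL);
	printf("no deadlock\n");
	return 0;
}
```

Built with gcc -pthread, this hangs at the first join, mirroring the T2 -> T3 -> T2 cycle above.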
To solve this problem, we propose client-side high-priority (HP) handling for I/O whose protecting extent lock is under a blocking AST. It works as follows: when the client receives a lock blocking AST and the lock is still in use (its reader/writer count is non-zero), it checks whether any I/O extent (osc_extent) protected by this lock is outstanding (i.e. waiting for an RPC slot). Such read/write I/O is marked high priority and placed on the HP list, and the client is forced to send the HP I/Os even when all available RPC slots are used up. This makes the I/O engine in the OSC layer more efficient: for normal urgent I/O, the client iterates over the object list and sends I/Os one by one, and the in-flight I/O count cannot exceed max_rpcs_in_flight, whereas high-priority I/Os on the client's HP list are handled more quickly. This avoids the possible deadlock caused by parallel DIO and lets the client respond to the lock blocking AST sooner.
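As a rough illustration of the proposed mechanism (the names below are stand-ins, not the identifiers used in the actual patch), the blocking AST could scan the object's pending extents, mark those covered by the lock being cancelled as high priority, and the send path could then let HP extents bypass the max_rpcs_in_flight limit:

```c
/*
 * Simplified user-space sketch of the high-priority I/O idea.
 * Hypothetical types and helpers only; not the actual Lustre/OSC code.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_RPCS_IN_FLIGHT 8

struct extent {
	unsigned long start, end;       /* byte range covered by the extent */
	bool hp;                        /* set when promoted by a blocking AST */
	struct extent *next;
};

struct osc_object {
	struct extent *pending;         /* extents waiting for an RPC slot */
	int rpcs_in_flight;
};

/* Blocking AST handler: promote every pending extent the cancelled lock covers. */
static void blocking_ast_promote(struct osc_object *obj,
				 unsigned long lock_start, unsigned long lock_end)
{
	struct extent *ext;

	for (ext = obj->pending; ext; ext = ext->next)
		if (ext->start <= lock_end && ext->end >= lock_start)
			ext->hp = true;
}

/* Decide whether an extent may be sent now: HP I/O ignores the slot limit. */
static bool can_send(struct osc_object *obj, struct extent *ext)
{
	if (ext->hp)
		return true;
	return obj->rpcs_in_flight < MAX_RPCS_IN_FLIGHT;
}

int main(void)
{
	struct extent e = { .start = 0, .end = 4 << 20, .hp = false, .next = NULL };
	struct osc_object obj = { .pending = &e, .rpcs_in_flight = MAX_RPCS_IN_FLIGHT };

	printf("before AST: can_send=%d\n", can_send(&obj, &e)); /* 0: slots exhausted */
	blocking_ast_promote(&obj, 0, ~0UL);                     /* cancel of PW[0, EOF] */
	printf("after AST:  can_send=%d\n", can_send(&obj, &e)); /* 1: forced out as HP */
	return 0;
}
```

In this model, can_send() returning true for an HP extent is what lets the client push out the conflicting I/O and answer the blocking AST even though DIO has exhausted the normal RPC slots. |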
| Comments |
| Comment by Gerrit Updater [ 16/Oct/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52711 |
| Comment by Andreas Dilger [ 16/Oct/23 ] |
|
There may be a couple of other ways to fix this issue:
|
| Comment by Qian Yingjin [ 18/Oct/23 ] |
|
Hi Andreas,
I have implemented two solutions for this bug:
The first patch puts all acquired DLM extent locks in a list and only releases them after all read-ahead I/Os have been submitted. This way, when a lock blocking AST arrives, all reading extents have already been submitted and added to the @oo_reading_exts list of the OSC object, so the blocking AST can check this list to find the conflicting outstanding extents when all I/O RPC slots are used up by direct I/O. If we instead keep the original flow (match the DLM lock, add the read-ahead pages to the queue list, release the matched DLM lock, repeat for each read-ahead range, and finally submit all I/Os via osc_io_submit), the conflicting lockdone extents may be added to @oo_reading_exts only after the blocking AST has already checked the list. The blocking AST triggered by the server-side locking for DIO will then try to lock the pages in these lockdone extents (all of their pages are PG_locked, and the extent may be waiting for RPC slots while all slots are used up by DIO), which may cause a deadlock.
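A minimal sketch of that ordering with illustrative stand-in names (not the real Lustre structures or functions): matched locks stay on a local list until every read-ahead extent has been submitted and is visible on the object's reading-extent list, and only then are they released.

```c
/* Illustrative stand-in code for the "hold locks until submission" ordering. */
#include <stdio.h>
#include <stdlib.h>

struct dlm_lock {
	struct dlm_lock *next;
	int id;
};

struct osc_obj {
	int oo_reading_exts;            /* count of submitted, outstanding extents */
};

/* Stub: match (take a reference on) the DLM lock covering one RA chunk. */
static struct dlm_lock *ra_lock_match(int id)
{
	struct dlm_lock *lk = malloc(sizeof(*lk));

	lk->id = id;
	lk->next = NULL;
	return lk;
}

/* Stub: submit one read-ahead extent; it joins the object's reading list. */
static void ra_submit_extent(struct osc_obj *obj)
{
	obj->oo_reading_exts++;
}

static void readahead(struct osc_obj *obj, int nr_chunks)
{
	struct dlm_lock *held = NULL, *lk;
	int i;

	/* 1. Match the locks for every chunk, keeping all references. */
	for (i = 0; i < nr_chunks; i++) {
		lk = ra_lock_match(i);
		lk->next = held;
		held = lk;
	}

	/* 2. Submit all read-ahead I/O while the locks are still held. */
	for (i = 0; i < nr_chunks; i++)
		ra_submit_extent(obj);

	/*
	 * 3. Release the locks last: a blocking AST now finds the conflicting
	 *    extents on the reading list instead of racing with read-ahead
	 *    pages that are locked but not yet submitted.
	 */
	while (held) {
		lk = held;
		held = lk->next;
		free(lk);               /* stands in for dropping the lock ref */
	}
}

int main(void)
{
	struct osc_obj obj = { 0 };

	readahead(&obj, 4);
	printf("outstanding reading extents: %d\n", obj.oo_reading_exts);
	return 0;
}
```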
The second patch takes another approach: for each matched read-ahead DLM extent lock, tag the last read-ahead page (osc_page) and increase a tagged count on the OSC object. After all I/O is submitted and the lockdone extents for the read-ahead pages are added to the @oo_reading_exts list, decrease the tagged count; once it reaches zero, wake up any waiters. In the lock blocking AST, we first wait until the tagged count becomes zero and then check the @oo_reading_exts list, which avoids the deadlock.
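And a minimal sketch of the tagged-count approach, again with illustrative names rather than the actual patch code: the blocking AST waits for the tagged count to drain before scanning the reading list.

```c
/* Illustrative stand-in code for the tagged-count idea; not the patch itself. */
#include <pthread.h>
#include <stdio.h>

struct osc_obj {
	pthread_mutex_t lock;
	pthread_cond_t  tag_waitq;
	int             tagged;         /* read-ahead ranges not yet submitted */
	int             reading_exts;   /* submitted, outstanding extents */
};

/* Lock matched for a read-ahead range: tag its last page, bump the count. */
static void ra_tag(struct osc_obj *obj)
{
	pthread_mutex_lock(&obj->lock);
	obj->tagged++;
	pthread_mutex_unlock(&obj->lock);
}

/* Pages submitted and extent added to the reading list: drop the count. */
static void ra_submit_and_untag(struct osc_obj *obj)
{
	pthread_mutex_lock(&obj->lock);
	obj->reading_exts++;
	if (--obj->tagged == 0)
		pthread_cond_broadcast(&obj->tag_waitq);
	pthread_mutex_unlock(&obj->lock);
}

/* Blocking AST: wait until no tagged ranges remain, then scan the list. */
static int blocking_ast_scan(struct osc_obj *obj)
{
	int outstanding;

	pthread_mutex_lock(&obj->lock);
	while (obj->tagged > 0)
		pthread_cond_wait(&obj->tag_waitq, &obj->lock);
	outstanding = obj->reading_exts;
	pthread_mutex_unlock(&obj->lock);
	return outstanding;
}

int main(void)
{
	struct osc_obj obj = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.tag_waitq = PTHREAD_COND_INITIALIZER,
	};

	ra_tag(&obj);                   /* lock matched, pages being prepared */
	ra_submit_and_untag(&obj);      /* pages submitted, extent listed */
	printf("extents seen by AST: %d\n", blocking_ast_scan(&obj));
	return 0;
}
```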
The first patch has passed the customized Maloo test (test_99b). The second patch has passed my local testing (I think it achieves the same effect as solution 1).
Could you please review these two solutions and advise which one is better?
(BTW, test_99a sometimes fails due to a PCC mmap problem; I will fix it later.)
Regards, Qian |
| Comment by Patrick Farrell [ 30/Oct/23 ] |
|
I definitely prefer the first approach - I think readahead locking needs to be much more 'normal', taking and holding DLM locks in a more regular fashion. So I strongly prefer the first option. The second one would work, but it feels 'clever' rather than 'right'. Yingjin, does that answer your questions? I know you had some stuff in Gerrit as well; are there other issues to consider? |
| Comment by Qian Yingjin [ 31/Oct/23 ] |
|
Yes, I will refine the patch of the first solution. Thanks! |
| Comment by Gerrit Updater [ 10/Nov/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53065 |
| Comment by Qian Yingjin [ 10/Nov/23 ] |
|
The test script sanity/test_441 (https://review.whamcloud.com/c/fs/lustre-release/+/53065) can easily reproduce the deadlock problem locally on the master branch without PCC-RO, so this is a general deadlock bug in both b_es6_0 and master. |