[LU-15994] Fio io_uring failed with error=interrupted system call on Ubuntu 2204 Created: 06/Jul/22  Updated: 01/Sep/22  Resolved: 01/Sep/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Qian Yingjin Assignee: Qian Yingjin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15399 Don't restart CLI IO for IOCB_NOWAIT ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On the new Ubuntu 22.04, fio with the io_uring engine fails with the following error:

io_uring err=4/file:io_u.c:1845, func=io_u error, error=interrupted system call

The test script to reproduce this failure is as follows:

# $tfile and error() are not defined here; they are expected to come from
# the Lustre test framework (test-framework.sh).
DIR="/mnt/lustre"
tdir="sanity-pcc.d102"
dir=$DIR/$tdir
file=$dir/$tfile
ioengine="io_uring"
numjobs=2
size=10M

# Sequential 4K direct-I/O write, then read back, with an I/O depth of 64 per job.
fio --name=seqwrite --ioengine=$ioengine \
        --bs=4K --direct=1 --numjobs=$numjobs \
        --iodepth=64 --size=$size --filename=$file --rw=write ||
        error "fio seqwrite $file failed"
fio --name=seqread --ioengine=$ioengine \
        --bs=4K --direct=1 --numjobs=$numjobs \
        --iodepth=64 --size=$size --filename=$file --rw=read ||
        error "fio seqread $file failed"

However, this failure does not occur on the older Ubuntu 20.04.
Only the newer kernel shipped with Ubuntu 22.04 fails this test with -EINTR.
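
To tell this specific failure apart from other fio errors, a wrapper could scan the captured fio output for the err=4 signature quoted in the description. The following is only a minimal sketch: the function name fio_hit_eintr is hypothetical and is not part of fio or the Lustre test framework, and the pattern is based solely on the single error line shown in this ticket.

```shell
# Hypothetical helper, not part of fio or the Lustre test framework.
# Succeeds (returns 0) if the fio output passed as $1 contains the
# EINTR signature reported in this ticket:
#   err=4/... error=interrupted system call
# The match is case-insensitive because the capitalization of the
# message may differ between fio builds (an assumption).
fio_hit_eintr() {
	printf '%s\n' "$1" |
		grep -qi 'err=4/.*error=interrupted system call'
}
```

A caller could capture the output with something like out=$(fio ... 2>&1) and then run fio_hit_eintr "$out" to decide whether the run hit this particular problem.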



 Comments   
Comment by Qian Yingjin [ 06/Jul/22 ]

Please note that this bug is either in the Lustre I/O path or related to the io_uring code in the new Linux kernel.
It can be reproduced without any PCC-RO configuration.

Comment by Qian Yingjin [ 07/Jul/22 ]

This bug may be related to the "task_work" mechanism used by kernels later than 5.4, according to https://issuehint.com/issue/axboe/liburing/504, which explains:

It's largely driven by the use of task_work to efficiently get a task to finish/issue a specific piece of work, a mechanism which relies on "fake" signals to ensure they get run.
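
Since the comment above names 5.4 as the threshold, a test script could check whether the running kernel is newer than that before expecting this behavior. The following is a rough sketch under stated assumptions: both helper names are hypothetical, GNU sort with the -V option is assumed to be available, and the 5.4 cutoff is taken from the comment above rather than verified against kernel history.

```shell
# Hypothetical helpers, not part of the Lustre test framework.

# Print the "major.minor" part of a kernel release string,
# e.g. "5.15.0-41-generic" -> "5.15".
kernel_major_minor() {
	printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed if version $1 is strictly newer than version $2.
# Relies on the -V (version sort) option of GNU sort.
version_newer() {
	[ "$1" != "$2" ] &&
		[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}
```

For example, version_newer "$(kernel_major_minor "$(uname -r)")" 5.4 would succeed on a 5.15 kernel and fail on a 5.4 kernel. A distribution release alone is not a reliable indicator, since an older release can run a newer kernel.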

Comment by Gerrit Updater [ 02/Aug/22 ]

"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48106
Subject: LU-15994 llite: use fatal_signal_pending in range_lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 91244566e9d2762c2f64a67e7d4fad8f301b556c

Comment by Nathan Dauchy [ 05/Aug/22 ]

I don't see any additions to the automated test cases in this patch.  Coverage of io_uring testing seems... sparse (since we have hit a few bugs in that area recently).  Does it make sense to add something to this patchset?

Comment by Gerrit Updater [ 08/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48106/
Subject: LU-15994 llite: use fatal_signal_pending in range_lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4c5b0b0967f052af33cc5cdb4e77d736f04bae56

Comment by Qian Yingjin [ 09/Aug/22 ]

I will add a simple test case for the io_uring I/O engine later.

Comment by Gerrit Updater [ 09/Aug/22 ]

"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48167
Subject: LU-15994 tests: add testing for io_uring via fio
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 023160bfe79583f3d11d98d89df33f88fe6ffd12

Comment by Gerrit Updater [ 01/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48167/
Subject: LU-15994 tests: add testing for io_uring via fio
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 162336079df221454a273e1c306bcb0531407a1b

Comment by Peter Jones [ 01/Sep/22 ]

Landed for 2.16
