[LU-10236] while running fio, file is getting corrupt under /mnt/lustre/xxx Created: 13/Nov/17 Updated: 15/Dec/19 Resolved: 15/Dec/19 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Saurabh Tandan (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: |
Kraken cluster, Lustre version 2.10.55 + DoM |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While running FIO on the setup described above with the command below, files in the directory /mnt/lustre/xxx became corrupted. With the parameter changed to nrfiles=256, the same run works fine.

```
fio --name=smallio --ioengine=posixaio --iodepth=32 --directory=/mnt/lustre/dom3 --nrfiles=512 --openfiles=10000 --numjobs=8 --filesize=64k --lockfile=readwrite --bs=4k --rw=randread --buffered=1 --bs_unaligned=1 --file_service_type=random --randrepeat=0 --norandommap --group_reporting=1 --loops=4
```

Afterwards the test directory cannot be removed:

```
[root@kapollo04 lustre]# rm -rf dom3
rm: cannot remove ‘dom3’: Directory not empty
```

Client dmesg:

```
[227470.685094] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.686839] LustreError: 15067:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.688803] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.690502] LustreError: 15070:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.692567] LustreError: 15068:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.694514] LustreError: 15067:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.696363] LustreError: 15070:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.698380] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.700589] LustreError: 15068:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.702449] LustreError: 15067:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.704257] LustreError: 15068:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.706338] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.708125] LustreError: 15067:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.710179] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.712075] LustreError: 15068:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227471.546843] LustreError: 12768:0:(mdc_request.c:944:mdc_getpage()) lustre-MDT0000-mdc-ffff88105e0f6800: too many resend retries: rc = -5
```

MDS dmesg:

```
[259415.913026] LustreError: 137-5: nvmefs-MDT0001_UUID: not available for connect from 192.168.213.233@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[259415.913667] LustreError: Skipped 71 previous similar messages
[259502.137146] LustreError: 20014:0:(ldlm_lib.c:3208:target_bulk_io()) @@@ timeout on bulk READ after 100+0s req@ffff881029e1f450 x1583747470242320/t0(0) o37->24b31bec-af52-1a41-a067-af1c7d84e837@192.168.213.218@o2ib:597/0 lens 568/440 e 3 to 0 dl 1510613657 ref 1 fl Interpret:/2/0 rc 0/0
[260015.863227] LustreError: 137-5: nvmefs-MDT0000_UUID: not available for connect from 192.168.213.233@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[260015.863971] LustreError: Skipped 71 previous similar messages
[260643.179888] LustreError: 137-5: nvmefs-MDT0000_UUID: not available for connect from 192.168.213.126@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[260643.180541] LustreError: Skipped 73 previous similar messages
```

Lustre version: 2.10.55 + DoM. This needs to be investigated. |
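For reference, a minimal reproduction sketch of the reported scenario. The fio options are taken verbatim from the description; the NRFILES variable and the cleanup check are added here for convenience and are otherwise hypothetical:

```
#!/bin/bash
# Reproduction sketch, assuming a Lustre client mounted at /mnt/lustre.
# nrfiles=512 was reported to corrupt the directory; 256 reportedly works.
NRFILES=${1:-512}
DIR=/mnt/lustre/dom3

mkdir -p "$DIR"
fio --name=smallio --ioengine=posixaio --iodepth=32 --directory="$DIR" \
    --nrfiles="$NRFILES" --openfiles=10000 --numjobs=8 --filesize=64k \
    --lockfile=readwrite --bs=4k --rw=randread --buffered=1 \
    --bs_unaligned=1 --file_service_type=random --randrepeat=0 \
    --norandommap --group_reporting=1 --loops=4

# The reported failure mode: the directory cannot be removed afterwards.
rm -rf "$DIR" && echo "cleanup OK" || echo "cleanup FAILED: directory not empty?"
```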
| Comments |
| Comment by Andreas Dilger [ 13/Nov/17 ] |
|
This looks to be caused by RPC timeouts, possibly due to overload of the MDS. Mike recently added a patch that may fix this. |
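If RPC timeouts from MDS overload are suspected, one way to inspect the relevant client-side settings is via lctl. A sketch using standard Lustre tunables; the output values will of course depend on the site configuration:

```
# Base RPC timeout and adaptive-timeout bounds on the client.
lctl get_param timeout
lctl get_param at_min at_max

# MDC import state, including connection and resend history against the MDT.
lctl get_param mdc.*.import
```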
| Comment by Mikhail Pershin [ 15/Nov/17 ] |
|
Andreas, that is strange, because I didn't see anything similar on the Onyx nodes; is Kraken less able to sustain load? I also ran tests with even higher load, e.g. a larger number of files. Also, as far as I know, that directory can't be removed even without any load. FIO creates about 512*8 files in it and the problem occurs; decreasing the number of files to 256 solves it. I have no good idea what this could be. |
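To put numbers on this: with numjobs=8, the failing run touches about 512*8 = 4096 files, versus about 2048 with nrfiles=256. A quick way to see what a failed run leaves behind in the test directory (a sketch; the path matches the report):

```
# Count the leftover files that prevent "rm -rf dom3" from succeeding.
ls /mnt/lustre/dom3 | wc -l

# Since this is a DoM setup, also check the directory's default layout.
lfs getstripe -d /mnt/lustre/dom3
```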
| Comment by Mikhail Pershin [ 15/Dec/19 ] |
|
The issue has not been seen since then. |