[LU-10236] While running fio, files become corrupted under /mnt/lustre/xxx Created: 13/Nov/17  Updated: 15/Dec/19  Resolved: 15/Dec/19

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Saurabh Tandan (Inactive) Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Kraken cluster,
2 OSS, 8 OSTs
2 MDS, 4 MDTs
1 client

lustre version - 2.10.55 + dom
branch: lustre-reviews
build - 52057


Issue Links:
Related
is related to LU-10180 DoM technical debts Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While running FIO on the setup described above, using the command below, files in the directory /mnt/lustre/xxx became corrupted. With nrfiles=256 instead of 512, the same run completes without issue.

fio --name=smallio --ioengine=posixaio --iodepth=32 --directory=/mnt/lustre/dom3 --nrfiles=512 --openfiles=10000 --numjobs=8 --filesize=64k --lockfile=readwrite --bs=4k --rw=randread --buffered=1 --bs_unaligned=1 --file_service_type=random --randrepeat=0 --norandommap --group_reporting=1 --loops=4
After the run, the directory cannot be removed:

[root@kapollo04 lustre]# rm -rf dom3
rm: cannot remove ‘dom3’: Directory not empty
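For comparison, here is the variant that the description reports as working, with only nrfiles reduced to 256 (all other parameters unchanged):

fio --name=smallio --ioengine=posixaio --iodepth=32 --directory=/mnt/lustre/dom3 --nrfiles=256 --openfiles=10000 --numjobs=8 --filesize=64k --lockfile=readwrite --bs=4k --rw=randread --buffered=1 --bs_unaligned=1 --file_service_type=random --randrepeat=0 --norandommap --group_reporting=1 --loops=4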

client dmesg

[227470.685094] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.686839] LustreError: 15067:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.688803] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.690502] LustreError: 15070:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.692567] LustreError: 15068:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.694514] LustreError: 15067:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.696363] LustreError: 15070:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.698380] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.700589] LustreError: 15068:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.702449] LustreError: 15067:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.704257] LustreError: 15068:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.706338] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.708125] LustreError: 15067:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.710179] LustreError: 15069:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227470.712075] LustreError: 15068:0:(events.c:199:client_bulk_callback()) event type 2, status -90, desc ffff880eaafd7c00
[227471.546843] LustreError: 12768:0:(mdc_request.c:944:mdc_getpage()) lustre-MDT0000-mdc-ffff88105e0f6800: too many resend retries: rc = -5
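For reference, the numeric codes in these messages are kernel errno values. A quick way to decode them on any node with python3 available (a diagnostic aside, not part of the original report):

python3 -c "import errno, os; print(errno.errorcode[90], os.strerror(90))"  # EMSGSIZE: Message too long (the status -90 above)
python3 -c "import errno, os; print(errno.errorcode[5], os.strerror(5))"   # EIO: Input/output error (the rc = -5 above)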

MDS dmesg

[259415.913026] LustreError: 137-5: nvmefs-MDT0001_UUID: not available for connect from 192.168.213.233@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[259415.913667] LustreError: Skipped 71 previous similar messages
[259502.137146] LustreError: 20014:0:(ldlm_lib.c:3208:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff881029e1f450 x1583747470242320/t0(0) o37->24b31bec-af52-1a41-a067-af1c7d84e837@192.168.213.218@o2ib:597/0 lens 568/440 e 3 to 0 dl 1510613657 ref 1 fl Interpret:/2/0 rc 0/0
[260015.863227] LustreError: 137-5: nvmefs-MDT0000_UUID: not available for connect from 192.168.213.233@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[260015.863971] LustreError: Skipped 71 previous similar messages
[260643.179888] LustreError: 137-5: nvmefs-MDT0000_UUID: not available for connect from 192.168.213.126@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[260643.180541] LustreError: Skipped 73 previous similar messages
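Note that the nvmefs-MDT* connect errors refer to a different filesystem than the lustre-* targets in the client log, so they may be unrelated noise. One way to confirm which targets are actually mounted on each server (a diagnostic sketch using standard Lustre tooling):

mount -t lustre   # list the Lustre targets mounted on this server
lctl dl           # list local OBD devices and their setup status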

The lustre-reviews build 52057 (2.10.55 + DoM) should be the same as lustre-master build 3671.

This needs to be investigated.



 Comments   
Comment by Andreas Dilger [ 13/Nov/17 ]

This looks to be caused by RPC timeouts, possibly due to MDS overload. Mike recently added a patch that may fix this:

https://review.whamcloud.com/29968

Comment by Mikhail Pershin [ 15/Nov/17 ]

Andreas, that is strange, because I didn't see anything similar on the onyx nodes; is kraken less able to sustain load? And I ran tests with even higher load, e.g. more files.

Also, as far as I know, that directory can't be removed even without any load. FIO creates about 512*8 files in it and the problem occurs; decreasing the number of files to 256 avoids it. I have no good idea what the cause could be.
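A minimal way to check what the failing run leaves behind, assuming the dom3 directory from the command above (512 files x 8 jobs gives up to 4096 files):

find /mnt/lustre/dom3 -type f | wc -l   # count the files fio created
ls -la /mnt/lustre/dom3 | head          # sample the entries that keep rm from succeeding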

Comment by Mikhail Pershin [ 15/Dec/19 ]

The issue has not been seen since then.
