Details
Type: Bug
Resolution: Unresolved
Priority: Critical
Description
While running IOR on a Lustre 2.17.0 client (lustre-2.17.0-RC3), we observed read failures.
The workload is SSF (Single Shared File) with Lustre striping enabled.
Below is a simplified reproducer using 32 clients, 4 OSS, and 8 OSTs.
# Server: 2.17.0-RC3
# Client: 2.17.0-RC3
mkdir /lustre/stripe
lfs setstripe -c -1 -S 16M /lustre/stripe/

# run IOR with 1MB SSF
salloc -N 32 -p src --ntasks-per-node=16
mpirun -mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 \
    -np 512 --allow-run-as-root /work/tools/bin/ior \
    -i 10 -w -r -b 16g -t 1m -C -Q 17 -e -vv -o /lustre/stirpe/file
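For reference, the stripe layout can be sanity-checked before and after the run; this is only a sketch, assuming the file system is mounted at /lustre as above (expected values in the comments are illustrative):

# confirm the directory default layout set above: stripe count -1 (all OSTs) and 16MiB stripe size
lfs getstripe -d /lustre/stripe

# after the write phase, inspect the file layout itself (path as typed in the ior command above);
# lmm_stripe_count should match the OST count (8 here) and lmm_stripe_size should be 16777216
lfs getstripe /lustre/stirpe/file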
In many cases, the IOR read phase fails.
Results:

WARNING: The file "/lustre/stirpe/file" exists already and will be deleted
Using Time Stamp 1766009857 (0x69432c01) for Data Signature

access    bw(MiB/s)  IOPS   Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------    ---------  ----   ----------  ----------  ---------  --------  --------  --------  --------  ----
Commencing write performance test: Thu Dec 18 07:18:16 2025
write     33315      33315  0.014452    16777216    1024.00    5.86      251.79    35.35     251.80    0
Commencing read performance test: Thu Dec 18 07:22:27 2025
WARNING: read(62, 0x7f9269c25000, 1048576) failed
Input/output error
ERROR: cannot read from file, (ior.c:1670)
WARNING: read(62, 0x7fd56b532000, 1048576) failed
Input/output error
One of the clients was evicted:
Dec 18 07:24:45 src09-c0-n0 kernel: Lustre: exafs-OST0000-osc-ff1bfe13a6e7b000: Connection to exafs-OST0000 (at 10.0.11.244@o2ib12) was lost; in progress operations using this service will wait for recovery to complete
Dec 18 07:24:45 src09-c0-n0 kernel: LustreError: exafs-OST0002-osc-ff1bfe13a6e7b000: This client was evicted by exafs-OST0002; in progress operations using this service will fail.
Dec 18 07:24:45 src09-c0-n0 kernel: Lustre: Skipped 1 previous similar message
Dec 18 07:24:45 src09-c0-n0 kernel: LustreError: exafs-OST0000-osc-ff1bfe13a6e7b000: This client was evicted by exafs-OST0000; in progress operations using this service will fail.
Here is the corresponding server-side log:
[205116.132741] LustreError: 521820:0:(ldlm_lockd.c:255:expired_lock_main()) ### lock callback timer expired after 103s: evicting client at 10.0.6.17@o2ib12 ns: filter-exafs-OST0000_UUID lock: 000000008c2158bc/0x391fa44341e54d0c lrc: 3/0,0 mode: PW/PW res: [0x280000402:0x6d:0x0].0x0 rrc: 31274 type: EXT [558379302912->558412857343] (req 558379302912->558380351487) gid 0 flags: 0x60000400000020 nid: 10.0.6.17@o2ib12 remote: 0x53abc08fd8dfe98e expref: 1036 pid: 523130 timeout: 205117 lvb_type: 0
[205116.139713] LustreError: 521820:0:(ldlm_lockd.c:255:expired_lock_main()) Skipped 4162 previous similar messages
[205116.141649] LustreError: 551531:0:(ldlm_lib.c:3556:target_bulk_io()) @@@ bulk READ failed: rc = -107 req@0000000027962180 x1851795285219456/t0(0) o3->719de6e9-9c0f-4feb-946e-44f30810543c@10.0.6.17@o2ib12:371/0 lens 488/440 e 0 to 0 dl 1766010301 ref 1 fl Interpret:/600/0 rc 0/0 job:'' uid:0 gid:0 projid:0
[205116.142855] LustreError: 521819:0:(client.c:1380:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@00000000fc641c21 x1851655987935360/t0(0) o105->exafs-OST0000@10.0.6.17@o2ib12:15/16 lens 392/224 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295 projid:4294967295
[205116.145937] Lustre: exafs-OST0000: Bulk IO read error with 719de6e9-9c0f-4feb-946e-44f30810543c (at 10.0.6.17@o2ib12), client will retry: rc -107
[205116.318948] LustreError: 528616:0:(ldlm_lockd.c:2563:ldlm_cancel_handler()) ldlm_cancel from 10.0.6.17@o2ib12 arrived at 1766010290 with bad export cookie 4116189193215222764
[205116.323339] LustreError: 528616:0:(ldlm_lockd.c:2563:ldlm_cancel_handler()) Skipped 18 previous similar messages
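If more data would help with triage, a debug capture along these lines could be collected on the evicted client while reproducing; this is only a sketch, and the debug flags and output path below are illustrative, not what was used in the run above:

# enable LDLM and RPC tracing on the client before starting IOR (illustrative mask)
lctl set_param debug=+dlmtrace
lctl set_param debug=+rpctrace

# check the OSC import state for the evicting OST after the read failures appear
lctl get_param osc.exafs-OST0000-osc-*.import

# dump the kernel debug buffer once the eviction is observed (output path is arbitrary)
lctl dk /tmp/lustre-client-debug.log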
When the Lustre client is downgraded to 2.16.1, the same workload completes successfully (i.e., a 2.16.1 client against a 2.17.0 server works fine).
Attachments
Issue Links
- is related to: LU-19427 sanity-pcc test_99b: invalidate_lock deadlock with mixed buffered and direct I/O workload (Reopened)