  Lustre / LU-19721

Lustre 2.17 client causes -EIO during SSF read


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: Lustre 2.18.0
    • Affects Version/s: None
    • Labels: None
    • Severity: 2

    Description

      While running IOR on a Lustre 2.17.0 client (lustre-2.17.0-RC3), we observed read failures.
      The workload is SSF (Single Shared File) with Lustre striping enabled.
      Below is a simplified reproducer using 32 client nodes, 4 OSS nodes, and 8 OSTs.

      # Server: 2.17.0-RC3
      # Client: 2.17.0-RC3
      
      mkdir /lustre/stripe
      lfs setstripe -c -1 -S 16M /lustre/stripe/
      
      # run IOR: SSF, 1 MiB transfer size (-t 1m), 16 GiB block per task (-b 16g)
      salloc -N 32 -p src --ntasks-per-node=16 mpirun -mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -np 512 --allow-run-as-root /work/tools/bin/ior -i 10 -w -r -b 16g -t 1m -C -Q 17 -e -vv -o /lustre/stirpe/file 
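
      For reference, the layout on the target directory can be verified before the run with
      lfs getstripe; the expected values below are assumptions based on the setstripe line above:

      # confirm the directory default layout: stripe_count -1 (all OSTs), stripe_size 16M
      lfs getstripe -d /lustre/stripe/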
      

      In many cases, the IOR read phase fails.

      Results: 
      WARNING: The file "/lustre/stirpe/file" exists already and will be deleted
      Using Time Stamp 1766009857 (0x69432c01) for Data Signature
      
      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      Commencing write performance test: Thu Dec 18 07:18:16 2025
      write     33315      33315      0.014452    16777216   1024.00    5.86       251.79     35.35      251.80     0   
      Commencing read performance test: Thu Dec 18 07:22:27 2025
      
      WARNING: read(62, 0x7f9269c25000, 1048576) failed Input/output error
      ERROR: cannot read from file, (ior.c:1670)
      WARNING: read(62, 0x7fd56b532000, 1048576) failed Input/output error
      

      One of the clients was evicted:

      Dec 18 07:24:45 src09-c0-n0 kernel: Lustre: exafs-OST0000-osc-ff1bfe13a6e7b000: Connection to exafs-OST0000 (at 10.0.11.244@o2ib12) was lost; in progress operations using this service will wait for recovery to complete
      Dec 18 07:24:45 src09-c0-n0 kernel: LustreError: exafs-OST0002-osc-ff1bfe13a6e7b000: This client was evicted by exafs-OST0002; in progress operations using this service will fail.
      Dec 18 07:24:45 src09-c0-n0 kernel: Lustre: Skipped 1 previous similar message
      Dec 18 07:24:45 src09-c0-n0 kernel: LustreError: exafs-OST0000-osc-ff1bfe13a6e7b000: This client was evicted by exafs-OST0000; in progress operations using this service will fail.
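
      To narrow down which imports on this client were evicted and when, the OSC import state
      can be dumped on the client; the parameter names below follow the device names in the
      messages above (osc.exafs-OST*), so adjust to the local fsname if different:

      # on the evicted client: connection state per OST import
      lctl get_param osc.exafs-OST*.import
      # recent import state history for each OSC (shows the EVICTED transition and its timestamp)
      lctl get_param osc.exafs-OST*.state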
      

      Here is the corresponding server-side (OSS) log:

      [205116.132741] LustreError: 521820:0:(ldlm_lockd.c:255:expired_lock_main()) ### lock callback timer expired after 103s: evicting client at 10.0.6.17@o2ib12  ns: filter-exafs-OST0000_UUID lock: 000000008c2158bc/0x391fa44341e54d0c lrc: 3/0,0 mode: PW/PW res: [0x280000402:0x6d:0x0].0x0 rrc: 31274 type: EXT [558379302912->558412857343] (req 558379302912->558380351487) gid 0 flags: 0x60000400000020 nid: 10.0.6.17@o2ib12 remote: 0x53abc08fd8dfe98e expref: 1036 pid: 523130 timeout: 205117 lvb_type: 0
      [205116.139713] LustreError: 521820:0:(ldlm_lockd.c:255:expired_lock_main()) Skipped 4162 previous similar messages
      [205116.141649] LustreError: 551531:0:(ldlm_lib.c:3556:target_bulk_io()) @@@ bulk READ failed: rc = -107  req@0000000027962180 x1851795285219456/t0(0) o3->719de6e9-9c0f-4feb-946e-44f30810543c@10.0.6.17@o2ib12:371/0 lens 488/440 e 0 to 0 dl 1766010301 ref 1 fl Interpret:/600/0 rc 0/0 job:'' uid:0 gid:0 projid:0
      [205116.142855] LustreError: 521819:0:(client.c:1380:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@00000000fc641c21 x1851655987935360/t0(0) o105->exafs-OST0000@10.0.6.17@o2ib12:15/16 lens 392/224 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295 projid:4294967295
      [205116.145937] Lustre: exafs-OST0000: Bulk IO read error with 719de6e9-9c0f-4feb-946e-44f30810543c (at 10.0.6.17@o2ib12), client will retry: rc -107
      [205116.318948] LustreError: 528616:0:(ldlm_lockd.c:2563:ldlm_cancel_handler()) ldlm_cancel from 10.0.6.17@o2ib12 arrived at 1766010290 with bad export cookie 4116189193215222764
      [205116.323339] LustreError: 528616:0:(ldlm_lockd.c:2563:ldlm_cancel_handler()) Skipped 18 previous similar messages
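
      The 103s callback timer in the first message is driven by the server-side adaptive timeout
      and LDLM settings; as a triage step it may help to capture those tunables and the extent
      lock load per OST on the OSS (standard parameters, values are site-specific):

      # on the OSS: timeout-related tunables in effect during the run
      lctl get_param timeout at_min at_max ldlm_timeout
      # locks held per OST namespace (rrc: 31274 above suggests heavy contention on one resource)
      lctl get_param ldlm.namespaces.filter-exafs-OST*.lock_count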
      

      When the Lustre client is downgraded to 2.16.1, the same workload completes successfully (i.e., 2.16.1 client against a 2.17.0 server works fine).
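
      For the interop comparison, the running client build in each case can be confirmed with the
      standard version query (in this report: 2.17.0-RC3 fails, 2.16.1 passes):

      # on the client: report the running Lustre version
      lctl get_param version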


            People

              Assignee: Qian Yingjin (qian_wc)
              Reporter: Shuichi Ihara (sihara)
              Votes: 0
              Watchers: 12
