[LU-6392] short read/write with stripe count > 1 Created: 20/Mar/15  Updated: 21/Oct/19  Resolved: 26/Mar/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Bruno Travouillon (Inactive) Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: p4b
Environment:

RHEL 6.5, kernel 2.6.32-431.29.2.el6.Bull.58.x86_64, Lustre 2.5.3 (w/ and w/o Bullpatches)
RHEL 6.6, kernel 2.6.32-504.3.3.el6.Bull.68.x86_64, Lustre 2.5.3 w/ Bullpatches


Issue Links:
Duplicate
duplicates LU-6389 read()/write() returning less than av... Resolved
Related
is related to LU-6389 read()/write() returning less than av... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are currently hitting an issue on several Lustre filesystems. When we do POSIX I/O to/from the Lustre FS, we see a lot of short writes/reads.

This is easily reproducible with IOR, with one client and one thread, using a transferSize > lustreStripeSize and a lustreStripeCount > 1. However, we were only able to reproduce it on 4 of our 7 Lustre filesystems.

$ mkdir run_s2
$ lfs setstripe -c 2 -s 1m run_s2/
$ mpirun -n 16  bin/ior  -F -i 4  -t 2m -b 2g  -k -K  -o run_s2/file_s2

IOR-3.0.1: MPI Coordinated Test of Parallel I/O
Began: Thu Mar 19 16:32:39 2015
Command line used: bin/ior -F -i 4 -t 2m -b 2g -k -K -o run_s2/file_s2
Machine: Linux airain1487
Test 0 started: Thu Mar 19 16:32:39 2015
Summary:
    api                = POSIX
    test filename      = run_s2/file_s2
    access             = file-per-process
    ordering in a file = sequential offsets
    ordering inter file= no tasks offsets
    clients            = 16 (16 per node)
    repetitions        = 10
    xfersize           = 2 MiB
    blocksize          = 2 GiB
    aggregate filesize = 32 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     1961.78    2097152    2048.00    0.017310   16.70      10.10      16.70      0   
read      21425      2097152    2048.00    0.009623   1.53       0.109189   1.53       0   
write     2718.68    2097152    2048.00    0.016170   12.05      4.25       12.05      1   
read      17611      2097152    2048.00    0.010240   1.85       1.45       1.86       1   
write     2973.64    2097152    2048.00    0.023313   11.02      4.05       11.02      2   
read      20168      2097152    2048.00    0.008476   1.62       0.072408   1.62       2   
WARNING: Task 13, partial write(), 1048576 of 2097152 bytes at offset 2038431744 ERRNO=Success
write     2112.16    2097152    2048.00    0.025717   15.51      6.66       15.51      3   
read      12864      2097152    2048.00    0.008304   2.54       2.07       2.55       3   
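As background on the WARNING line above: POSIX permits write() to return fewer bytes than requested, which is why IOR reports the partial transfer as a warning (with ERRNO=Success) rather than a hard error; applications that cannot tolerate short writes normally wrap write() in a retry loop. A minimal, generic sketch of such a loop (plain POSIX, not Lustre-specific; the helper name full_write is ours):

#include <unistd.h>
#include <errno.h>
#include <stddef.h>

/* Keep calling write() until all bytes are written or a real error occurs.
 * A short return (as in the IOR warning above) is simply retried. */
static ssize_t full_write(int fd, const char *buf, size_t count)
{
        size_t done = 0;

        while (done < count) {
                ssize_t rc = write(fd, buf + done, count - done);

                if (rc < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return -1;              /* real error */
                }
                if (rc == 0)
                        break;                  /* no progress; stop */
                done += rc;
        }
        return done;
}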

Cluster 1: 480 OST / default stripecount=2 stripesize=1M
Cluster 2: 144 OST / default stripecount=2 stripesize=1M

It looks really close to LU-6389. We tried its reproducer and it triggers the same problem.

I did some tests with the vfstrace, dlmtrace and inode debug flags. Logs will be added to this ticket later (the customer is a black site).

  • clients = 1 (1 per node)
  • xfersize = 2 MiB
  • blocksize = 2 GiB
  • aggregate filesize = 2 GiB
  • Lustre stripe size = 1 MiB
  • Lustre stripe count = 2

WARNING: Task 0, partial write(), 1048576 of 2097152 bytes at offset 987758592

The short write seems to occur when we lose the layout lock for the file while writing the first stripe. Afterwards, the I/O cannot continue with the second stripe and the write ends short.

->> seek to offset 987758592
->> io range 987758592, 989855744 = 2M (transfer size)
->> stripe 942, chunk 987758592, 988807168 = 1M (stripe size)
->> 942 * 1M = 987758592, everything is fine
->> vvp_io_write_start 987758592, 988807168
->> n calls to vvp_io_commit_write() commit the 4 KiB pages

During the commit, we can observe ldlm_cli_cancel_local(), followed by the message "vvp_conf_set() [0x298c28d7a:0xac:0x0]: losing layout lock". Then comes the next stripe.

->> stripe 943, chunk 988807168, 989855744 = 1M (stripe size)
->> no vvp_io_write_start() afterwards, because of the lock cancellation
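For reference, the stripe/chunk numbers in the trace above follow directly from the usual offset-to-stripe mapping: stripe index = offset / stripe size, with each chunk ending at the next stripe boundary (or at the end of the transfer). With a stripe count of 2, consecutive chunks land on alternating OST objects. A small standalone sketch (our own code, not Lustre internals) that reproduces the values from the trace:

#include <stdio.h>

int main(void)
{
        unsigned long long stripe_size = 1ULL << 20;    /* 1 MiB stripe size     */
        unsigned long long xfer        = 2ULL << 20;    /* 2 MiB transfer size   */
        unsigned long long start       = 987758592ULL;  /* offset from the trace */
        unsigned long long end         = start + xfer;
        unsigned long long off         = start;

        while (off < end) {
                unsigned long long stripe    = off / stripe_size;
                unsigned long long chunk_end = (stripe + 1) * stripe_size;

                if (chunk_end > end)
                        chunk_end = end;
                /* prints: stripe 942, chunk 987758592..988807168
                 *         stripe 943, chunk 988807168..989855744 */
                printf("stripe %llu, chunk %llu..%llu\n", stripe, off, chunk_end);
                off = chunk_end;
        }
        return 0;
}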



 Comments   
Comment by Bruno Travouillon (Inactive) [ 20/Mar/15 ]

This ticket is mainly created to follow the process and to attach files for our case. However, it really seems to be a duplicate of LU-6389.

Comment by Peter Jones [ 20/Mar/15 ]

Thanks Bruno. We'll obviously need to confirm that this is indeed (as it appears to be) a duplicate.

Comment by Bruno Travouillon (Inactive) [ 25/Mar/15 ]

This issue is of major importance for our customers. After reading the full thread of LU-6389, it is obvious that we are hitting the same issue: a loss of the layout lock between two stripes within the same I/O.

We are waiting for review of http://review.whamcloud.com/#/c/14160/.

Comment by Zhenyu Xu [ 26/Mar/15 ]

dup of LU-6389

Comment by Gerrit Updater [ 21/Oct/19 ]

Ben Evans (bevans@cray.com) uploaded a new patch: https://review.whamcloud.com/36526
Subject: LU-6392 hsm: perform mirror sync through HSM
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2e8191a5815864817ad01c8146f787af23ab5d42
