  Lustre / LU-13214

wrong data in file with unavailable grants

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.14.0

    Description

      We hit this issue on a KVM setup (2 clients, 1 MDS with 2 MDTs, and 1 OSS with 2 OSTs) where the MDT size is 200MB, but the problem is a general one.

      parallel-scale write_append_truncate test with a 64K DOM component.

      == parallel-scale test write_append_truncate: write_append_truncate ================================== 06:25:42 (1580365542)
      OPTIONS:
      clients=fre1139,fre1140
      write_REP=10000
      write_THREADS=8
      MACHINEFILE=/tmp/parallel-scale.machines
      fre1139
      fre1140
      striped dir -i1 -c2 /mnt/lustre/d0.write_append_truncate
      /mnt/lustre/d0.write_append_truncate
        lcm_layout_gen:    0
        lcm_mirror_count:  1
        lcm_entry_count:   2
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 0
          lcme_extent.e_end:   65536
            stripe_count:  0       stripe_size:   65536       pattern:       mdt       stripe_offset: -1
      
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 65536
          lcme_extent.e_end:   EOF
            stripe_count:  1       stripe_size:   65536       pattern:       raid0       stripe_offset: -1
      
      + write_append_truncate -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat
      + chmod 0777 /mnt/lustre
      drwxrwxrwx 3 root root 4096 Jan 30 06:25 /mnt/lustre
      + su mpiuser sh -c "/usr/bin/mpirun -wdir=/home/mpiuser  -machinefile /tmp/parallel-scale.machines -np 16 write_append_truncate -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat "
      r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1580365542: Success
      r= 0 l=0000: WR A 1234451/0x12d613, AP a  649673/0x09e9c9, TR@ 1315319/0x1411f7
      r= 0 l=0001: WRITE bad [0-396105]/[0-0x60b49] != B
      r= 0 l=0001: append-after-TRUNC bad [396106-434216]/[0x60b4a-0x6a028] != 0
      r= 0 l=0001: WR B  396106/0x060b4a, AP b   82334/0x01419e, TR@  434217/0x06a029
      000000   B   B   B   B   B   B   B   B   B   B   B   B   B   B   B   B
      *
      001000 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
      *
      010000   B   B   B   B   B   B   B   B   B   B   B   B   B   B   B   B
      *
      06a020   B   B   B   B   B   B   B   B   B   b   b   b   b   b   b   b
      06a030   b   b   b   b   b   b   b   b   b   b   b   b   b   b   b   b
      *
      07e1c0   b   b   b   b   b   b   b
      07e1c7
      application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      Fatal error in PMPI_Bcast: Other MPI error, error stack:
      
      MDT logs
      
      [ 7447.601622] Lustre: DEBUG MARKER: == parallel-scale test write_append_truncate: write_append_truncate ================================== 06:25:42 (1580365542)
      [ 7449.288681] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) lustre-MDT0000: cli 06debc7b-0afd-06d9-7775-8742fd8f72aa/ffff93ae54a5dc00 left 229588992 < tot_grant 231669760 unstable 0 pending 0 dirty 0
      [ 7449.299037] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 11 previous similar messages
      [ 7449.923674] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) lustre-MDT0000: cli 73f6b1fa-dcae-e04b-bd57-8198d25e98c8/ffff93ae54a5d800 left 229588992 < tot_grant 231669760 unstable 0 pending 90112 dirty 0
      [ 7449.930508] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 1 previous similar message
      [ 7450.701626] Lustre: DEBUG MARKER: parallel-scale test_write_append_truncate: @@@@@@ FAIL: write_append_truncate failed! 1
      

      During repeated writes to the DOM component by the write_append_truncate test, one client reserves all 200MB of grant from the MDT, so the next write from the other client works with limited grant.
      Occurrence
      The client tries to write 396106 bytes, and all pages are located in the cache. The client checks the first 4096-byte page, which needs 28672 bytes of grant; 49152 bytes are available, so the client reserves the grant and sends the page. For the next page the available grant (20480) is less than the 28672 needed, so the reservation fails and the client falls back to sync IO. The client creates one RPC with pages 1-15 (offsets 4096-65536) for the DOM part and one covering 65536 to EOF for the OST part, and sends the two RPCs in parallel. It then gets ENOSPC for the DOM part and success for the OST part. After that the inode i_size is updated to the full size 396106, but the write IO completes with 4096 bytes and no error. The write_append_truncate test therefore sees that only 4096 bytes were written and repeats the write from 4096 to 396106. Because the file was opened with the append flag, the retried write starts from the inode i_size, which leads to wrong data in the file.
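      As an editor's illustration (not the actual write_append_truncate source), the sketch below shows why the short write turns into corruption: with O_APPEND a retried write continues from the current i_size, and since i_size was already advanced to 396106 by the successful OST-part RPC, the retry continues from offset 396106 instead of 4096, leaving a hole of unwritten data in between. The file path and sizes are taken from the test output above.

      /* Minimal sketch (editor's illustration): how an append-mode retry of a
       * short write produces a hole once i_size has been advanced past the
       * data that actually reached disk.  Not the real test code. */
      #include <errno.h>
      #include <fcntl.h>
      #include <string.h>
      #include <unistd.h>

      static int append_all(int fd, const char *buf, size_t len)
      {
              size_t done = 0;

              while (done < len) {
                      /* With O_APPEND each write() starts at the current file
                       * size.  In the failure above the first call returns 4096
                       * with no error, but i_size is already 396106, so the
                       * retry appends at offset 396106 instead of 4096. */
                      ssize_t rc = write(fd, buf + done, len - done);

                      if (rc < 0)
                              return -errno;  /* e.g. -ENOSPC on a later retry */
                      done += rc;
              }
              return 0;
      }

      int main(void)
      {
              char buf[396106];
              int fd = open("/mnt/lustre/d0.write_append_truncate/f0.wat",
                            O_WRONLY | O_APPEND | O_CREAT, 0644);

              if (fd < 0)
                      return 1;
              memset(buf, 'B', sizeof(buf));
              append_all(fd, buf, sizeof(buf));
              close(fd);
              return 0;
      }

      The client-side debug log for the failing write follows.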

      00000080:00200000:0.0:1580365544.925336:0:24223:0:(rw26.c:868:ll_write_end()) queued page: 97.
      00000080:00200000:0.0:1580365544.925338:0:24223:0:(vvp_io.c:969:vvp_io_write_commit()) commit async pages: 97, from 0, to 2890
      00000008:00000020:0.0:1580365544.925357:0:24223:0:(osc_cache.c:1560:osc_enter_cache_try()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 49152, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 0, left: 16, waiters: 0 }need:28672
      00000008:00000020:0.0:1580365544.925359:0:24223:0:(osc_cache.c:1419:osc_consume_write_grant()) using 4096 grant credits for brw ffff9c6ca1397140 page ffffe4d2828de100
      ...
      00000008:00100000:0.0:1580365544.925497:0:9994:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b65200 - object 0x20000040f:61854 offset 0<>4096
      ...
      00000008:00000020:0.0:1580365544.947436:0:24223:0:(osc_cache.c:1560:osc_enter_cache_try()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 20480, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 1, left: 15, waiters: 0 }need:28672
       
      00000008:00000020:0.0:1580365544.947438:0:24223:0:(osc_cache.c:1692:osc_enter_cache()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 20480, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 1, left: 15, waiters: 0 }no grant space, fall back to sync i/o
      00000008:00000040:0.0:1580365544.947441:0:24223:0:(osc_io.c:328:osc_io_commit_async()) 15 -122
      ...
      00000008:00000020:0.0:1580365544.947456:0:24223:0:(osc_cache.c:1811:osc_update_pending()) obj ffff9c6bf82b5a00 ready 0|-|- wr 15|-|+ rd 0|- update pending cmd 2 delta 15.
      00000008:00000020:0.0:1580365544.947458:0:24223:0:(osc_cache.c:1770:osc_makes_rpc()) urgent request forcing RPC
      00000008:00000020:0.0:1580365544.947460:0:24223:0:(osc_cache.c:2360:osc_io_unplug0()) Queue writeback work for client ffff9c6c1fd604e0.
      00000100:00000040:0.0:1580365544.947462:0:24223:0:(ptlrpcd.c:299:ptlrpcd_add_req()) @@@ add req [ffff9c6cb6a36880] to pc [ptlrpcd_00_00:0]  req@ffff9c6cb6a36880 x1657125825208336/t0(0) o-1->lustre-MDT0000-mdc-ffff9c6cb69bb000@192.168.111.37@tcp:0/0 lens 0/0 e 0 to 0 dl 1580365564 ref 2 fl Interpret:N/ffffffff/ffffffff rc 0/-1 job:''
      00000008:00000040:0.0:1580365544.947469:0:24223:0:(osc_io.c:211:osc_io_submit()) 0/15 0
      00000008:00400020:0.0:1580365544.947473:0:24223:0:(osc_io.c:127:osc_io_submit()) 81 1
      00000008:00000020:0.0:1580365544.947486:0:24223:0:(osc_cache.c:1811:osc_update_pending()) obj ffff9c6bf82b58c0 ready 0|-|- wr 81|-|+ rd 0|- update pending cmd 2 delta 81.
      00000008:00000020:0.0:1580365544.947488:0:24223:0:(osc_cache.c:1770:osc_makes_rpc()) urgent request forcing RPC
      00000008:00000020:0.0:1580365544.947488:0:24223:0:(osc_cache.c:2360:osc_io_unplug0()) Queue writeback work for client ffff9c6cb6a284e0.
      ...
      00000008:00100000:0.0:1580365544.947546:0:9993:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b65f80 - object 0x20000040f:61854 offset 4096<>65536
      ....
      00000008:00100000:0.0:1580365544.947669:0:9993:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b66d00 - object 0x0:68 offset 65536<>396106
      ...
      00000008:00000002:0.0:1580365544.969377:0:9993:0:(osc_request.c:2032:brw_interpret()) request ffff9c6c17b65f80 aa ffff9c6c17b660f0 rc -28
      ...
      00000008:00000002:0.0:1580365544.969551:0:9993:0:(osc_request.c:2032:brw_interpret()) request ffff9c6c17b66d00 aa ffff9c6c17b66e70 rc 0
      00000008:00000020:0.0:1580365544.969553:0:9993:0:(osc_object.c:183:osc_attr_update()) set kms from 0to 396106
      ...
      00000080:00200000:0.0:1580365544.969753:0:24223:0:(file.c:1280:ll_merge_attr()) [0x20000040f:0xf19e:0x0] updating i_size 396106
      00000080:00200000:0.0:1580365544.969939:0:24223:0:(vvp_io.c:1122:vvp_io_write_start()) f0.wat: write nob 0, result: 4096
      00000080:00200000:0.0:1580365544.969980:0:24223:0:(vvp_io.c:314:vvp_io_fini()) [0x20000040f:0xf19e:0x0] ignore/verify layout 0/0, layout version 3 need write layout 0, restore needed 0
      00000080:00200000:0.0:1580365544.969981:0:24223:0:(file.c:1519:ll_file_io_generic()) f0.wat: 2 io complete with rc: 0, result: 4096, restart: 0
      00000080:00200000:0.0:1580365544.969984:0:24223:0:(file.c:1553:ll_file_io_generic()) iot: 2, result: 4096
      00000020:00001000:0.0:1580365544.969984:0:24223:0:(cl_object.c:832:cl_env_put()) 1@ffff9c6c20824cf0
      00000020:00001000:0.0:1580365544.970006:0:24223:0:(cl_object.c:751:cl_env_get()) 1@ffff9c6c20824cf0
      00000080:00200000:0.0:1580365544.970007:0:24223:0:(file.c:1399:ll_file_io_generic()) f0.wat: write ppos: 4096, count: 392010
      00000080:00200000:0.0:1580365544.970009:0:24223:0:(file.c:1318:ll_io_set_mirror()) f0.wat: desiginated mirror: 0
      00000020:00200000:0.0:1580365544.970012:0:24223:0:(cl_io.c:215:cl_io_rw_init()) header@ffff9c6bfa77ea50[0x4, 2, [0x20000040f:0xf19e:0x0] hash]
       
      00000020:00200000:0.0:1580365544.970013:0:24223:0:(cl_io.c:215:cl_io_rw_init()) io range: 2 [4096, 396106) 0 1
      

      The main problem is the situation where, during a single IO, stripes 1 and 3 are written but stripe 2 gets an error.
      The best way to prevent corrupted data in such cases is to write the data stripe by stripe, one after another. Then, when an error happens, we return the number of bytes actually written, and on the next retry return ENOSPC, so there are no gaps in the file data like we have right now.
      The downside would be lower performance at high rates, when a client holds the maximum grant (probably 2GB).
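      The sketch below (editor's illustration in plain C; submit_extent() is a hypothetical callback, not an existing Lustre function) shows the intended degraded path: when the client is out of grant, the extents are submitted sequentially and the loop stops at the first error, so the returned byte count matches what actually reached storage.

      /* Hypothetical sketch of the proposed "write stripes one after another"
       * fallback when the client is out of grant.  submit_extent() stands in
       * for sending one synchronous BRW RPC covering a single extent. */
      struct extent {
              unsigned long long start;
              unsigned long long end;
      };

      static long long write_degraded(struct extent *ext, int nr_ext,
                                      long long (*submit_extent)(const struct extent *))
      {
              long long written = 0;

              for (int i = 0; i < nr_ext; i++) {
                      /* Sequential submission: a failure on extent i means
                       * extents i+1.. were never sent, so no hole is left. */
                      long long rc = submit_extent(&ext[i]);

                      if (rc < 0)
                              /* Partial success: report the bytes already
                               * written; the next write() attempt returns the
                               * error (e.g. -ENOSPC), matching the POSIX
                               * short-write rules. */
                              return written ? written : rc;
                      written += rc;
              }
              return written;
      }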

          Activity

            [LU-13214] wrong data in file with unavailable grants

            adilger Andreas Dilger added a comment:

            > one client reserves all 200MB of grant from the MDT, so the next write from the other client works with limited grant.

            This issue seems to be as much a problem with the grant code as with the error handling. It doesn't make sense that a single client should be able to consume all of the grant from the MDT when there is only a limited amount of space left. It would be better to allow each client to hold at most 1/num_clients of the available grant space on each RPC. If it is the only client writing, it would still get some grant on each write, just not all of it.
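            A minimal sketch of that idea (editor's illustration only; the helper below is hypothetical and is not the logic currently in tgt_grant.c): cap the grant handed to any one client at a 1/num_clients share of the space the target still has free.

            /* Hypothetical helper illustrating the "1/num_clients of the
             * available grant" cap suggested above. */
            static unsigned long long grant_cap(unsigned long long requested,
                                                unsigned long long space_left,
                                                unsigned int num_clients)
            {
                    unsigned long long share;

                    if (num_clients == 0)
                            num_clients = 1;
                    /* Each client may hold at most its proportional share of
                     * what is left, so one client cannot drain a small MDT
                     * (e.g. 200MB) and force the others into sync IO. */
                    share = space_left / num_clients;
                    return requested < share ? requested : share;
            }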

            "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54144
            Subject: LU-13214 tests: simulate data-hole-data write
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: db625b76f97c068dc5cc3a7ab69dad34c94b7df0

            gerrit Gerrit Updater added a comment - "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54144 Subject: LU-13214 tests: simulate data-hole-data write Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: db625b76f97c068dc5cc3a7ab69dad34c94b7df0

            gerrit Gerrit Updater added a comment:

            Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/39732
            Subject: LU-13214 llite: pass error to upper layers
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3b1fa3046e947820ba3bc9bc35e02c38a7d21827

            zamcray Alexander Zarochentsev (Inactive) added a comment:

            > This same problem could happen with regular writes to OSTs. Is my understanding correct that the MDT is actually out of space, or is this just a grant accounting problem?

            Yes, the MDT is relatively small and is out of space. The client runs out of grants, switches to synchronous write, and gets ENOSPC.

            > I don't think it makes sense to hurt the performance of 99.99% of IO to fix a problem for the 0.01% of IO when the filesystem is full. This could even happen in a local filesystem if the writes are out of order and the end was written before the start/middle and it returned -ENOSPC. I think the right behavior is maintained if an error is returned to userspace.

            Well, it might not be clear from the explanation, but the idea is to detect the out-of-grant situation and only then switch to writing stripes one by one, being ready to get ENOSPC and to exit with a valid return code and file state. So it wouldn't hurt 99% of writes (as long as there is enough grant).

            > One option in this case is to truncate the file down to the end of the partial write? I guess that only works if the partial write is at the end of the file, but not if data is being overwritten in the middle of the file.

            Yes, it doesn't work for the general case. It actually makes sense to follow POSIX here; the alternatives are more complex to handle, either in the Lustre kernel code or in user application logic. Otherwise the result of a write syscall is indeterminate: the user application would have no idea at what offset the partial buffer was written.

            See https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html#tag_16_685_08

            << If a write() requests that more bytes be written than there is room for (for example, [XSI] the file size limit of the process or the physical end of a medium), only as many bytes as there is room for shall be written. For example, suppose there is space for 20 bytes more in a file before reaching a limit. A write of 512 bytes will return 20. The next write of a non-zero number of bytes would give a failure return (except as noted below). >>
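            For reference, a minimal userspace sketch of the POSIX behavior quoted above (editor's illustration; the file path is hypothetical): the first write() returns the short count, and only the following write() fails with ENOSPC, so the application always knows how much of its buffer was stored.

            /* Editor's illustration of the POSIX short-write contract: a write
             * near the space limit returns the number of bytes that fit, and
             * the *next* write fails (e.g. with ENOSPC). */
            #include <errno.h>
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>

            int main(void)
            {
                    char buf[512] = { 0 };
                    int fd = open("/mnt/lustre/smallfs/file", O_WRONLY | O_CREAT, 0644);
                    ssize_t rc;

                    if (fd < 0)
                            return 1;

                    rc = write(fd, buf, sizeof(buf));       /* may return e.g. 20 */
                    if (rc >= 0 && rc < (ssize_t)sizeof(buf)) {
                            printf("short write: %zd of %zu bytes stored\n",
                                   rc, sizeof(buf));
                            rc = write(fd, buf + rc, sizeof(buf) - rc);
                            if (rc < 0 && errno == ENOSPC)
                                    printf("next write failed with ENOSPC\n");
                    }
                    close(fd);
                    return 0;
            }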

            adilger Andreas Dilger added a comment:

            This same problem could happen with regular writes to OSTs. Is my understanding correct that the MDT is actually out of space, or is this just a grant accounting problem?

            I don't think it makes sense to hurt the performance of 99.99% of IO to fix a problem for the 0.01% of IO when the filesystem is full. This could even happen in a local filesystem if the writes are out of order and the end was written before the start/middle and it returned -ENOSPC. I think the right behavior is maintained if an error is returned to userspace.

            One option in this case is to truncate the file down to the end of the partial write? I guess that only works if the partial write is at the end of the file, but not if data is being overwritten in the middle of the file.

            People

              Assignee: aboyko Alexander Boyko
              Reporter: aboyko Alexander Boyko
              Votes: 0
              Watchers: 6
