[LU-13214] wrong data in a file when grants are unavailable Created: 07/Feb/20 Updated: 18/Oct/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Epic/Theme: | client |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We hit this issue on a KVM setup (2 clients, 1 MDS with 2 MDTs, and 1 OSS with 2 OSTs) where the MDT size is 200MB, but the problem is a general one. It reproduces with the parallel-scale write_append_truncate test and a 64K DOM component.
== parallel-scale test write_append_truncate: write_append_truncate ================================== 06:25:42 (1580365542)
OPTIONS:
clients=fre1139,fre1140
write_REP=10000
write_THREADS=8
MACHINEFILE=/tmp/parallel-scale.machines
fre1139
fre1140
striped dir -i1 -c2 /mnt/lustre/d0.write_append_truncate
/mnt/lustre/d0.write_append_truncate
lcm_layout_gen: 0
lcm_mirror_count: 1
lcm_entry_count: 2
lcme_id: N/A
lcme_mirror_id: N/A
lcme_flags: 0
lcme_extent.e_start: 0
lcme_extent.e_end: 65536
stripe_count: 0 stripe_size: 65536 pattern: mdt stripe_offset: -1
lcme_id: N/A
lcme_mirror_id: N/A
lcme_flags: 0
lcme_extent.e_start: 65536
lcme_extent.e_end: EOF
stripe_count: 1 stripe_size: 65536 pattern: raid0 stripe_offset: -1
+ write_append_truncate -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat
+ chmod 0777 /mnt/lustre
drwxrwxrwx 3 root root 4096 Jan 30 06:25 /mnt/lustre
+ su mpiuser sh -c "/usr/bin/mpirun -wdir=/home/mpiuser -machinefile /tmp/parallel-scale.machines -np 16 write_append_truncate -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat "
r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1580365542: Success
r= 0 l=0000: WR A 1234451/0x12d613, AP a 649673/0x09e9c9, TR@ 1315319/0x1411f7
r= 0 l=0001: WRITE bad [0-396105]/[0-0x60b49] != B
r= 0 l=0001: append-after-TRUNC bad [396106-434216]/[0x60b4a-0x6a028] != 0
r= 0 l=0001: WR B 396106/0x060b4a, AP b 82334/0x01419e, TR@ 434217/0x06a029
000000 B B B B B B B B B B B B B B B B
*
001000 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
010000 B B B B B B B B B B B B B B B B
*
06a020 B B B B B B B B B b b b b b b b
06a030 b b b b b b b b b b b b b b b b
*
07e1c0 b b b b b b b
07e1c7
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Fatal error in PMPI_Bcast: Other MPI error, error stack:
MDT logs
[ 7447.601622] Lustre: DEBUG MARKER: == parallel-scale test write_append_truncate: write_append_truncate ================================== 06:25:42 (1580365542)
[ 7449.288681] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) lustre-MDT0000: cli 06debc7b-0afd-06d9-7775-8742fd8f72aa/ffff93ae54a5dc00 left 229588992 < tot_grant 231669760 unstable 0 pending 0 dirty 0
[ 7449.299037] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 11 previous similar messages
[ 7449.923674] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) lustre-MDT0000: cli 73f6b1fa-dcae-e04b-bd57-8198d25e98c8/ffff93ae54a5d800 left 229588992 < tot_grant 231669760 unstable 0 pending 90112 dirty 0
[ 7449.930508] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 1 previous similar message
[ 7450.701626] Lustre: DEBUG MARKER: parallel-scale test_write_append_truncate: @@@@@@ FAIL: write_append_truncate failed! 1
During a repeated write to the DOM component by the write_append_truncate test, one client reserves all 200MB of grants from the MDT, so the next write from another client runs with limited grants:
00000080:00200000:0.0:1580365544.925336:0:24223:0:(rw26.c:868:ll_write_end()) queued page: 97.
00000080:00200000:0.0:1580365544.925338:0:24223:0:(vvp_io.c:969:vvp_io_write_commit()) commit async pages: 97, from 0, to 2890
00000008:00000020:0.0:1580365544.925357:0:24223:0:(osc_cache.c:1560:osc_enter_cache_try()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 49152, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 0, left: 16, waiters: 0 }need:28672
00000008:00000020:0.0:1580365544.925359:0:24223:0:(osc_cache.c:1419:osc_consume_write_grant()) using 4096 grant credits for brw ffff9c6ca1397140 page ffffe4d2828de100
...
00000008:00100000:0.0:1580365544.925497:0:9994:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b65200 - object 0x20000040f:61854 offset 0<>4096
...
00000008:00000020:0.0:1580365544.947436:0:24223:0:(osc_cache.c:1560:osc_enter_cache_try()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 20480, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 1, left: 15, waiters: 0 }need:28672
00000008:00000020:0.0:1580365544.947438:0:24223:0:(osc_cache.c:1692:osc_enter_cache()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 20480, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 1, left: 15, waiters: 0 }no grant space, fall back to sync i/o
00000008:00000040:0.0:1580365544.947441:0:24223:0:(osc_io.c:328:osc_io_commit_async()) 15 -122
...
00000008:00000020:0.0:1580365544.947456:0:24223:0:(osc_cache.c:1811:osc_update_pending()) obj ffff9c6bf82b5a00 ready 0|-|- wr 15|-|+ rd 0|- update pending cmd 2 delta 15.
00000008:00000020:0.0:1580365544.947458:0:24223:0:(osc_cache.c:1770:osc_makes_rpc()) urgent request forcing RPC
00000008:00000020:0.0:1580365544.947460:0:24223:0:(osc_cache.c:2360:osc_io_unplug0()) Queue writeback work for client ffff9c6c1fd604e0.
00000100:00000040:0.0:1580365544.947462:0:24223:0:(ptlrpcd.c:299:ptlrpcd_add_req()) @@@ add req [ffff9c6cb6a36880] to pc [ptlrpcd_00_00:0] req@ffff9c6cb6a36880 x1657125825208336/t0(0) o-1->lustre-MDT0000-mdc-ffff9c6cb69bb000@192.168.111.37@tcp:0/0 lens 0/0 e 0 to 0 dl 1580365564 ref 2 fl Interpret:N/ffffffff/ffffffff rc 0/-1 job:''
00000008:00000040:0.0:1580365544.947469:0:24223:0:(osc_io.c:211:osc_io_submit()) 0/15 0
00000008:00400020:0.0:1580365544.947473:0:24223:0:(osc_io.c:127:osc_io_submit()) 81 1
00000008:00000020:0.0:1580365544.947486:0:24223:0:(osc_cache.c:1811:osc_update_pending()) obj ffff9c6bf82b58c0 ready 0|-|- wr 81|-|+ rd 0|- update pending cmd 2 delta 81.
00000008:00000020:0.0:1580365544.947488:0:24223:0:(osc_cache.c:1770:osc_makes_rpc()) urgent request forcing RPC
00000008:00000020:0.0:1580365544.947488:0:24223:0:(osc_cache.c:2360:osc_io_unplug0()) Queue writeback work for client ffff9c6cb6a284e0.
...
00000008:00100000:0.0:1580365544.947546:0:9993:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b65f80 - object 0x20000040f:61854 offset 4096<>65536
....
00000008:00100000:0.0:1580365544.947669:0:9993:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b66d00 - object 0x0:68 offset 65536<>396106
...
00000008:00000002:0.0:1580365544.969377:0:9993:0:(osc_request.c:2032:brw_interpret()) request ffff9c6c17b65f80 aa ffff9c6c17b660f0 rc -28
...
00000008:00000002:0.0:1580365544.969551:0:9993:0:(osc_request.c:2032:brw_interpret()) request ffff9c6c17b66d00 aa ffff9c6c17b66e70 rc 0
00000008:00000020:0.0:1580365544.969553:0:9993:0:(osc_object.c:183:osc_attr_update()) set kms from 0to 396106
...
00000080:00200000:0.0:1580365544.969753:0:24223:0:(file.c:1280:ll_merge_attr()) [0x20000040f:0xf19e:0x0] updating i_size 396106
00000080:00200000:0.0:1580365544.969939:0:24223:0:(vvp_io.c:1122:vvp_io_write_start()) f0.wat: write nob 0, result: 4096
00000080:00200000:0.0:1580365544.969980:0:24223:0:(vvp_io.c:314:vvp_io_fini()) [0x20000040f:0xf19e:0x0] ignore/verify layout 0/0, layout version 3 need write layout 0, restore needed 0
00000080:00200000:0.0:1580365544.969981:0:24223:0:(file.c:1519:ll_file_io_generic()) f0.wat: 2 io complete with rc: 0, result: 4096, restart: 0
00000080:00200000:0.0:1580365544.969984:0:24223:0:(file.c:1553:ll_file_io_generic()) iot: 2, result: 4096
00000020:00001000:0.0:1580365544.969984:0:24223:0:(cl_object.c:832:cl_env_put()) 1@ffff9c6c20824cf0
00000020:00001000:0.0:1580365544.970006:0:24223:0:(cl_object.c:751:cl_env_get()) 1@ffff9c6c20824cf0
00000080:00200000:0.0:1580365544.970007:0:24223:0:(file.c:1399:ll_file_io_generic()) f0.wat: write ppos: 4096, count: 392010
00000080:00200000:0.0:1580365544.970009:0:24223:0:(file.c:1318:ll_io_set_mirror()) f0.wat: desiginated mirror: 0
00000020:00200000:0.0:1580365544.970012:0:24223:0:(cl_io.c:215:cl_io_rw_init()) header@ffff9c6bfa77ea50[0x4, 2, [0x20000040f:0xf19e:0x0] hash]
00000020:00200000:0.0:1580365544.970013:0:24223:0:(cl_io.c:215:cl_io_rw_init()) io range: 2 [4096, 396106) 0 1
The main problem is the situation where, during a single IO, stripes 1 and 3 were written successfully while stripe 2 got an error. |
| Comments |
| Comment by Andreas Dilger [ 09/Feb/20 ] |
|
This same problem could happen with regular writes to OSTs. Is my understanding correct that the MDT is actually out of space, or is this just a grant accounting problem? I don't think it makes sense to hurt performance for 99.99% of IO to fix the problem for the 0.01% of IO when the filesystem is full. This could even happen in a local filesystem if the writes are out of order and the end was written before the start/middle and it returned -ENOSPC. I think the right behavior is maintained if an error is returned to userspace. One option in this case is to truncate the file down to the end of the partial write? I guess that only works if the partial write is at the end of the file, but not if data is being overwritten in the middle of the file. |
| Comment by Alexander Zarochentsev (Inactive) [ 11/Feb/20 ] |
Yes, the MDT is relatively small and out of space. The client runs out of grants, switches to synchronous write, and gets ENOSPC.
It might not be clear from the explanation, but the idea is to detect the out-of-grants situation and only then switch to writing stripes one by one, being ready to get ENOSPC and exit with a valid return code and file state. So it wouldn't hurt the 99% of writes where there are enough grants.
Yes, it doesn't work for the general case. It actually makes sense to follow POSIX there; the alternatives are more complex to handle, either in Lustre kernel code or in user application logic, and the result of the write syscall would be undetermined: the user application would have no idea at what offset the partial buffer was written. See https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html#tag_16_685_08 << If a write() requests that more bytes be written than there is room for (for example, [XSI] [Option Start] the file size limit of the process or [Option End] the physical end of a medium), only as many bytes as there is room for shall be written. For example, suppose there is space for 20 bytes more in a file before reaching a limit. A write of 512 bytes will return 20. The next write of a non-zero number of bytes would give a failure return (except as noted below). >> |
| Comment by Gerrit Updater [ 26/Aug/20 ] |
|
Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/39732 |