Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: Lustre 2.14.0
Description
We hit this issue on a KVM setup (2 clients, 1 MDS with 2 MDTs, and 1 OSS with 2 OSTs) where the MDT size is 200MB, but the problem is general.
It reproduces with the parallel-scale write_append_truncate test and a 64K DOM component.
== parallel-scale test write_append_truncate: write_append_truncate ================================== 06:25:42 (1580365542)
OPTIONS: clients=fre1139,fre1140 write_REP=10000 write_THREADS=8 MACHINEFILE=/tmp/parallel-scale.machines
fre1139
fre1140
striped dir -i1 -c2 /mnt/lustre/d0.write_append_truncate
/mnt/lustre/d0.write_append_truncate
lcm_layout_gen: 0
lcm_mirror_count: 1
lcm_entry_count: 2
lcme_id: N/A
lcme_mirror_id: N/A
lcme_flags: 0
lcme_extent.e_start: 0
lcme_extent.e_end: 65536
stripe_count: 0 stripe_size: 65536 pattern: mdt stripe_offset: -1
lcme_id: N/A
lcme_mirror_id: N/A
lcme_flags: 0
lcme_extent.e_start: 65536
lcme_extent.e_end: EOF
stripe_count: 1 stripe_size: 65536 pattern: raid0 stripe_offset: -1
+ write_append_truncate -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat
+ chmod 0777 /mnt/lustre
drwxrwxrwx 3 root root 4096 Jan 30 06:25 /mnt/lustre
+ su mpiuser sh -c "/usr/bin/mpirun -wdir=/home/mpiuser -machinefile /tmp/parallel-scale.machines -np 16 write_append_truncate -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat "
r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1580365542: Success
r= 0 l=0000: WR A 1234451/0x12d613, AP a 649673/0x09e9c9, TR@ 1315319/0x1411f7
r= 0 l=0001: WRITE bad [0-396105]/[0-0x60b49] != B
r= 0 l=0001: append-after-TRUNC bad [396106-434216]/[0x60b4a-0x6a028] != 0
r= 0 l=0001: WR B 396106/0x060b4a, AP b 82334/0x01419e, TR@ 434217/0x06a029
000000 B B B B B B B B B B B B B B B B
*
001000 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
010000 B B B B B B B B B B B B B B B B
*
06a020 B B B B B B B B B b b b b b b b
06a030 b b b b b b b b b b b b b b b b
*
07e1c0 b b b b b b b
07e1c7
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Fatal error in PMPI_Bcast: Other MPI error, error stack:

MDT logs
[ 7447.601622] Lustre: DEBUG MARKER: == parallel-scale test write_append_truncate: write_append_truncate ================================== 06:25:42 (1580365542)
[ 7449.288681] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) lustre-MDT0000: cli 06debc7b-0afd-06d9-7775-8742fd8f72aa/ffff93ae54a5dc00 left 229588992 < tot_grant 231669760 unstable 0 pending 0 dirty 0
[ 7449.299037] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 11 previous similar messages
[ 7449.923674] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) lustre-MDT0000: cli 73f6b1fa-dcae-e04b-bd57-8198d25e98c8/ffff93ae54a5d800 left 229588992 < tot_grant 231669760 unstable 0 pending 90112 dirty 0
[ 7449.930508] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 1 previous similar message
[ 7450.701626] Lustre: DEBUG MARKER: parallel-scale test_write_append_truncate: @@@@@@ FAIL: write_append_truncate failed! 1
During repeated writes to the DOM component by the write_append_truncate test, one client reserves all 200MB of grant space from the MDT, so the next write from the other client runs with very limited grants.
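For illustration, here is a toy model of that grant interplay (a sketch under assumptions, not the tgt_grant.c or osc_cache.c implementation): the MDT can only hand out grants up to its ~200MB of free space, so once one client has accumulated the pool, the other client's per-page reservation fails and it has to fall back to sync IO.

/*
 * Toy model of grant exhaustion (illustration only; not the actual
 * tgt_grant.c / osc_cache.c code).
 */
#include <stdbool.h>
#include <stdio.h>

#define MDT_FREE_SPACE (200u << 20)   /* ~200MB MDT, as in the reproducer */

static unsigned int tot_granted;      /* server: grants handed to clients */

/* Server side: never grant more than the space that is still free. */
static unsigned int server_grant(unsigned int want)
{
        unsigned int left = MDT_FREE_SPACE - tot_granted;
        unsigned int give = want < left ? want : left;

        tot_granted += give;
        return give;
}

/* Client side: a cached (async) write needs enough local grant for the
 * page plus extent overhead; otherwise it must do sync IO. */
static bool enter_cache_try(unsigned int avail, unsigned int need)
{
        return avail >= need;
}

int main(void)
{
        /* Client A has been writing DOM data and holds nearly the whole pool. */
        unsigned int a = server_grant(MDT_FREE_SPACE - 20480);

        /* Client B asks for the 28672 bytes one cached page needs (numbers
         * taken from the debug log below) but only 20480 are left. */
        unsigned int b = server_grant(28672);

        printf("A holds %u, B holds %u\n", a, b);
        printf("B can cache the page: %s\n",
               enter_cache_try(b, 28672) ? "yes" : "no, fall back to sync IO");
        return 0;
}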
Occurrence
The client tries to write 396106 bytes, and all pages are placed in the page cache. For the first 4096-byte page the client needs 28672 bytes of grant; 49152 bytes are available, so it reserves the grant and queues the page. For the next page the remaining grant (20480 bytes) is less than the 28672 needed, so the reservation fails and the client falls back to sync IO. It builds one RPC with pages 1-15 for the DOM part and another with pages 15-EOF for the OST part, and sends both RPCs in parallel. The DOM RPC returns ENOSPC while the OST RPC succeeds. As a result the inode i_size is updated to the full size 396106, but the write IO completes with 4096 bytes and no error. The write_append_truncate test therefore sees that only 4096 bytes were written and retries the write from 4096 to 396106. Because the file was opened with the append flag, the retried write starts at the inode i_size (already 396106), which leaves wrong data in the file (the model after the debug logs below illustrates the resulting pattern).
00000080:00200000:0.0:1580365544.925336:0:24223:0:(rw26.c:868:ll_write_end()) queued page: 97.
00000080:00200000:0.0:1580365544.925338:0:24223:0:(vvp_io.c:969:vvp_io_write_commit()) commit async pages: 97, from 0, to 2890
00000008:00000020:0.0:1580365544.925357:0:24223:0:(osc_cache.c:1560:osc_enter_cache_try()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 49152, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 0, left: 16, waiters: 0 }need:28672
00000008:00000020:0.0:1580365544.925359:0:24223:0:(osc_cache.c:1419:osc_consume_write_grant()) using 4096 grant credits for brw ffff9c6ca1397140 page ffffe4d2828de100
...
00000008:00100000:0.0:1580365544.925497:0:9994:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b65200 - object 0x20000040f:61854 offset 0<>4096
...
00000008:00000020:0.0:1580365544.947436:0:24223:0:(osc_cache.c:1560:osc_enter_cache_try()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 20480, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 1, left: 15, waiters: 0 }need:28672
00000008:00000020:0.0:1580365544.947438:0:24223:0:(osc_cache.c:1692:osc_enter_cache()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 20480, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 1, left: 15, waiters: 0 }no grant space, fall back to sync i/o
00000008:00000040:0.0:1580365544.947441:0:24223:0:(osc_io.c:328:osc_io_commit_async()) 15 -122
...
00000008:00000020:0.0:1580365544.947456:0:24223:0:(osc_cache.c:1811:osc_update_pending()) obj ffff9c6bf82b5a00 ready 0|-|- wr 15|-|+ rd 0|- update pending cmd 2 delta 15.
00000008:00000020:0.0:1580365544.947458:0:24223:0:(osc_cache.c:1770:osc_makes_rpc()) urgent request forcing RPC
00000008:00000020:0.0:1580365544.947460:0:24223:0:(osc_cache.c:2360:osc_io_unplug0()) Queue writeback work for client ffff9c6c1fd604e0.
00000100:00000040:0.0:1580365544.947462:0:24223:0:(ptlrpcd.c:299:ptlrpcd_add_req()) @@@ add req [ffff9c6cb6a36880] to pc [ptlrpcd_00_00:0] req@ffff9c6cb6a36880 x1657125825208336/t0(0) o-1->lustre-MDT0000-mdc-ffff9c6cb69bb000@192.168.111.37@tcp:0/0 lens 0/0 e 0 to 0 dl 1580365564 ref 2 fl Interpret:N/ffffffff/ffffffff rc 0/-1 job:''
00000008:00000040:0.0:1580365544.947469:0:24223:0:(osc_io.c:211:osc_io_submit()) 0/15 0
00000008:00400020:0.0:1580365544.947473:0:24223:0:(osc_io.c:127:osc_io_submit()) 81 1
00000008:00000020:0.0:1580365544.947486:0:24223:0:(osc_cache.c:1811:osc_update_pending()) obj ffff9c6bf82b58c0 ready 0|-|- wr 81|-|+ rd 0|- update pending cmd 2 delta 81.
00000008:00000020:0.0:1580365544.947488:0:24223:0:(osc_cache.c:1770:osc_makes_rpc()) urgent request forcing RPC
00000008:00000020:0.0:1580365544.947488:0:24223:0:(osc_cache.c:2360:osc_io_unplug0()) Queue writeback work for client ffff9c6cb6a284e0.
...
00000008:00100000:0.0:1580365544.947546:0:9993:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b65f80 - object 0x20000040f:61854 offset 4096<>65536
....
00000008:00100000:0.0:1580365544.947669:0:9993:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b66d00 - object 0x0:68 offset 65536<>396106
...
00000008:00000002:0.0:1580365544.969377:0:9993:0:(osc_request.c:2032:brw_interpret()) request ffff9c6c17b65f80 aa ffff9c6c17b660f0 rc -28
...
00000008:00000002:0.0:1580365544.969551:0:9993:0:(osc_request.c:2032:brw_interpret()) request ffff9c6c17b66d00 aa ffff9c6c17b66e70 rc 0
00000008:00000020:0.0:1580365544.969553:0:9993:0:(osc_object.c:183:osc_attr_update()) set kms from 0to 396106
...
00000080:00200000:0.0:1580365544.969753:0:24223:0:(file.c:1280:ll_merge_attr()) [0x20000040f:0xf19e:0x0] updating i_size 396106
00000080:00200000:0.0:1580365544.969939:0:24223:0:(vvp_io.c:1122:vvp_io_write_start()) f0.wat: write nob 0, result: 4096
00000080:00200000:0.0:1580365544.969980:0:24223:0:(vvp_io.c:314:vvp_io_fini()) [0x20000040f:0xf19e:0x0] ignore/verify layout 0/0, layout version 3 need write layout 0, restore needed 0
00000080:00200000:0.0:1580365544.969981:0:24223:0:(file.c:1519:ll_file_io_generic()) f0.wat: 2 io complete with rc: 0, result: 4096, restart: 0
00000080:00200000:0.0:1580365544.969984:0:24223:0:(file.c:1553:ll_file_io_generic()) iot: 2, result: 4096
00000020:00001000:0.0:1580365544.969984:0:24223:0:(cl_object.c:832:cl_env_put()) 1@ffff9c6c20824cf0
00000020:00001000:0.0:1580365544.970006:0:24223:0:(cl_object.c:751:cl_env_get()) 1@ffff9c6c20824cf0
00000080:00200000:0.0:1580365544.970007:0:24223:0:(file.c:1399:ll_file_io_generic()) f0.wat: write ppos: 4096, count: 392010
00000080:00200000:0.0:1580365544.970009:0:24223:0:(file.c:1318:ll_io_set_mirror()) f0.wat: desiginated mirror: 0
00000020:00200000:0.0:1580365544.970012:0:24223:0:(cl_io.c:215:cl_io_rw_init()) header@ffff9c6bfa77ea50[0x4, 2, [0x20000040f:0xf19e:0x0] hash]
00000020:00200000:0.0:1580365544.970013:0:24223:0:(cl_io.c:215:cl_io_rw_init()) io range: 2 [4096, 396106) 0 1
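The user-visible corruption can be reproduced in isolation with a small user-space model (a sketch under assumptions, not the write_append_truncate source): a short write of the first page stands in for the DOM part that made it to storage, an ftruncate() stands in for the OST RPC that already pushed i_size to the full length, and the retry of the remainder goes through O_APPEND just like the test's append write.

/*
 * Model of the corruption pattern (illustration only; the partial
 * failure is emulated explicitly since a local filesystem won't fail
 * this way).
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FULL  396106   /* intended write size, as in the log */
#define SHORT   4096   /* what the write call reported       */

int main(void)
{
        static char buf[FULL];
        char c = 0;
        int fd = open("f0.wat.model",
                      O_RDWR | O_CREAT | O_TRUNC | O_APPEND, 0644);

        memset(buf, 'B', sizeof(buf));

        /* Only the first cached DOM page really reached storage ...   */
        write(fd, buf, SHORT);
        /* ... but the parallel OST RPC already moved i_size to FULL.  */
        ftruncate(fd, FULL);

        /* The application sees "4096 bytes written" and retries the
         * rest.  With O_APPEND the retry lands at the current i_size
         * (FULL), not at offset 4096, so [4096, 396106) stays zero. */
        write(fd, buf + SHORT, FULL - SHORT);

        pread(fd, &c, 1, 8192);
        printf("byte at 8192: 0x%02x (expected 0x42 'B')\n",
               (unsigned char)c);
        close(fd);
        return 0;
}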
The main problem is the situation where, within a single IO, stripes 1 and 3 are written successfully while stripe 2 gets an error.
The best way to prevent corrupted data in such cases is to write the data stripe by stripe, one after another. When an error happens we return the number of bytes written so far, and the second retry returns ENOSPC. Then there would be no gaps in the file data, as there are now.
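A minimal sketch of that idea with hypothetical helper names (submit_component() and write_components() are illustrations, not Lustre functions): components are submitted strictly in order, the loop stops at the first error, and the caller always gets back a true prefix of its data, so the retry hits the failed component first and sees ENOSPC instead of leaving a hole.

/*
 * Sketch of sequential per-component submission (hypothetical helpers,
 * not the Lustre client IO engine).
 */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical stand-in for sending one component's pages: here the
 * second component (index 1) runs out of space. */
static int submit_component(int idx, size_t *nob)
{
        if (idx == 1) {
                *nob = 0;
                return -ENOSPC;
        }
        *nob = 65536;           /* one full 64K component for the demo */
        return 0;
}

/* Submit components one after another starting at 'first'; stop at the
 * first failure so nothing beyond the failed component is written. */
static ssize_t write_components(int first, int count)
{
        size_t done = 0;

        for (int i = first; i < count; i++) {
                size_t nob = 0;
                int rc = submit_component(i, &nob);

                done += nob;
                if (rc < 0)
                        return done ? (ssize_t)done : rc;
        }
        return done;
}

int main(void)
{
        /* First attempt: the first component is written, then we stop. */
        printf("first attempt: %zd bytes\n", write_components(0, 2));

        /* The retry resumes at the failed component and gets a clean
         * -ENOSPC instead of silently leaving a hole in the file. */
        printf("retry        : %zd\n", write_components(1, 2));
        return 0;
}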
The downside would be slower performance at high IO rates, when a client probably holds the maximum grant of about 2GB.