Lustre / LU-13214

wrong data in file with unavailable grants


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.14.0
    • Severity: 3

    Description

      We hit this issue on a KVM setup (2 clients, 1 MDS with 2 MDTs, and 1 OSS with 2 OSTs) where the MDT size is 200MB, but the problem is a general one.

      The failure was hit by the parallel-scale write_append_truncate test with a 64K DOM layout.

      == parallel-scale test write_append_truncate: write_append_truncate ================================== 06:25:42 (1580365542)
      OPTIONS:
      clients=fre1139,fre1140
      write_REP=10000
      write_THREADS=8
      MACHINEFILE=/tmp/parallel-scale.machines
      fre1139
      fre1140
      striped dir -i1 -c2 /mnt/lustre/d0.write_append_truncate
      /mnt/lustre/d0.write_append_truncate
        lcm_layout_gen:    0
        lcm_mirror_count:  1
        lcm_entry_count:   2
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 0
          lcme_extent.e_end:   65536
            stripe_count:  0       stripe_size:   65536       pattern:       mdt       stripe_offset: -1
      
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 65536
          lcme_extent.e_end:   EOF
            stripe_count:  1       stripe_size:   65536       pattern:       raid0       stripe_offset: -1
      
      + write_append_truncate -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat
      + chmod 0777 /mnt/lustre
      drwxrwxrwx 3 root root 4096 Jan 30 06:25 /mnt/lustre
      + su mpiuser sh -c "/usr/bin/mpirun -wdir=/home/mpiuser  -machinefile /tmp/parallel-scale.machines -np 16 write_append_truncate -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat "
      r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1580365542: Success
      r= 0 l=0000: WR A 1234451/0x12d613, AP a  649673/0x09e9c9, TR@ 1315319/0x1411f7
      r= 0 l=0001: WRITE bad [0-396105]/[0-0x60b49] != B
      r= 0 l=0001: append-after-TRUNC bad [396106-434216]/[0x60b4a-0x6a028] != 0
      r= 0 l=0001: WR B  396106/0x060b4a, AP b   82334/0x01419e, TR@  434217/0x06a029
      000000   B   B   B   B   B   B   B   B   B   B   B   B   B   B   B   B
      *
      001000 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
      *
      010000   B   B   B   B   B   B   B   B   B   B   B   B   B   B   B   B
      *
      06a020   B   B   B   B   B   B   B   B   B   b   b   b   b   b   b   b
      06a030   b   b   b   b   b   b   b   b   b   b   b   b   b   b   b   b
      *
      07e1c0   b   b   b   b   b   b   b
      07e1c7
      application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      Fatal error in PMPI_Bcast: Other MPI error, error stack:
      
      MDT logs
      
      [ 7447.601622] Lustre: DEBUG MARKER: == parallel-scale test write_append_truncate: write_append_truncate ================================== 06:25:42 (1580365542)
      [ 7449.288681] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) lustre-MDT0000: cli 06debc7b-0afd-06d9-7775-8742fd8f72aa/ffff93ae54a5dc00 left 229588992 < tot_grant 231669760 unstable 0 pending 0 dirty 0
      [ 7449.299037] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 11 previous similar messages
      [ 7449.923674] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) lustre-MDT0000: cli 73f6b1fa-dcae-e04b-bd57-8198d25e98c8/ffff93ae54a5d800 left 229588992 < tot_grant 231669760 unstable 0 pending 90112 dirty 0
      [ 7449.930508] LustreError: 15827:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 1 previous similar message
      [ 7450.701626] Lustre: DEBUG MARKER: parallel-scale test_write_append_truncate: @@@@@@ FAIL: write_append_truncate failed! 1
      

      During repeated writes to the DOM component by the write_append_truncate test, one client reserves all 200MB of grant from the MDT, so the next write from the other client runs with limited grant.
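
      For illustration, here is a minimal C sketch of the decision the client makes before caching a dirty page. The real logic lives in osc_enter_cache_try()/osc_enter_cache() in lustre/osc/osc_cache.c; the structure and function names below are hypothetical and only mirror the grant numbers seen in the trace (avail: 49152 then 20480, need: 28672). A page is only queued for async writeback if enough grant is available, otherwise the write falls back to sync IO.

      /* Simplified model of the client-side grant check; names are
       * hypothetical, not the actual Lustre code. */
      #include <stdbool.h>
      #include <stddef.h>

      struct client_grant {
              size_t avail;    /* grant currently available from the server */
              size_t reserved; /* grant reserved for queued dirty pages */
      };

      /* Try to reserve 'need' bytes of grant for one buffered write page. */
      static bool grant_try_reserve(struct client_grant *cg, size_t need)
      {
              if (cg->avail < need)
                      return false;   /* no grant space, fall back to sync IO */
              cg->avail -= need;
              cg->reserved += need;
              return true;
      }

      With the values from the log, the first 4096-byte page reserves its grant (49152 >= 28672), but the check for the following pages fails (20480 < 28672) and the rest of the write goes out as sync IO.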
      Occurrence
      The client is trying to write 396106 bytes, and all of the pages are already in the page cache. For the first 4096-byte page the client needs 28672 bytes of grant; 49152 bytes are available, so it reserves the grant and queues the page. For the next pages there is no grant left, the reservation fails, and the client falls back to sync IO. It builds one RPC for pages 1-15 (the DOM part) and another for pages 15-EOF (the OST part) and sends the two RPCs in parallel. The DOM RPC returns ENOSPC while the OST RPC succeeds. After that the inode i_size is updated to the full size 396106, but the write ends with 4096 bytes and no error. The write_append_truncate test therefore sees that only 4096 bytes were written and retries the write from 4096 to 396106. Since the file was opened with the append flag, the retried write starts at the inode i_size, which leaves wrong data in the file.
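
      To make the corruption mechanism concrete, here is a hypothetical POSIX sketch of the sequence (this is not the write_append_truncate source, just the pattern visible in the trace below): the first write returns a short count while the file size has already advanced, and because the file is open with O_APPEND the retry lands at i_size rather than at the short-write offset, leaving a hole of zeroes from 4096 to 65536.

      #include <fcntl.h>
      #include <string.h>
      #include <sys/types.h>
      #include <unistd.h>

      int main(void)
      {
              static char buf[396106];
              memset(buf, 'B', sizeof(buf));

              int fd = open("f0.wat", O_WRONLY | O_CREAT | O_APPEND, 0644);
              if (fd < 0)
                      return 1;

              /* In the failure above this write returned 4096: the DOM RPC for
               * 4096..65536 got ENOSPC, but the OST RPC for 65536..396106 had
               * already landed and i_size became 396106. */
              ssize_t rc = write(fd, buf, sizeof(buf));

              /* The test retries the "remaining" bytes.  With O_APPEND the
               * retried write starts at the current i_size (396106), not at
               * offset 4096, so the range 4096..65536 stays a hole of zeroes,
               * which is the corruption seen in the hexdump above. */
              if (rc >= 0 && (size_t)rc < sizeof(buf))
                      write(fd, buf + rc, sizeof(buf) - rc);

              close(fd);
              return 0;
      }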

      00000080:00200000:0.0:1580365544.925336:0:24223:0:(rw26.c:868:ll_write_end()) queued page: 97.
      00000080:00200000:0.0:1580365544.925338:0:24223:0:(vvp_io.c:969:vvp_io_write_commit()) commit async pages: 97, from 0, to 2890
      00000008:00000020:0.0:1580365544.925357:0:24223:0:(osc_cache.c:1560:osc_enter_cache_try()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 49152, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 0, left: 16, waiters: 0 }need:28672
      00000008:00000020:0.0:1580365544.925359:0:24223:0:(osc_cache.c:1419:osc_consume_write_grant()) using 4096 grant credits for brw ffff9c6ca1397140 page ffffe4d2828de100
      ...
      00000008:00100000:0.0:1580365544.925497:0:9994:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b65200 - object 0x20000040f:61854 offset 0<>4096
      ...
      00000008:00000020:0.0:1580365544.947436:0:24223:0:(osc_cache.c:1560:osc_enter_cache_try()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 20480, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 1, left: 15, waiters: 0 }need:28672
       
      00000008:00000020:0.0:1580365544.947438:0:24223:0:(osc_cache.c:1692:osc_enter_cache()) lustre-MDT0000-mdc-ffff9c6cb69bb000: grant { dirty: 0/118272 dirty_pages: 0/472885 dropped: 0 avail: 20480, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 1, left: 15, waiters: 0 }no grant space, fall back to sync i/o
      00000008:00000040:0.0:1580365544.947441:0:24223:0:(osc_io.c:328:osc_io_commit_async()) 15 -122
      ...
      00000008:00000020:0.0:1580365544.947456:0:24223:0:(osc_cache.c:1811:osc_update_pending()) obj ffff9c6bf82b5a00 ready 0|-|- wr 15|-|+ rd 0|- update pending cmd 2 delta 15.
      00000008:00000020:0.0:1580365544.947458:0:24223:0:(osc_cache.c:1770:osc_makes_rpc()) urgent request forcing RPC
      00000008:00000020:0.0:1580365544.947460:0:24223:0:(osc_cache.c:2360:osc_io_unplug0()) Queue writeback work for client ffff9c6c1fd604e0.
      00000100:00000040:0.0:1580365544.947462:0:24223:0:(ptlrpcd.c:299:ptlrpcd_add_req()) @@@ add req [ffff9c6cb6a36880] to pc [ptlrpcd_00_00:0]  req@ffff9c6cb6a36880 x1657125825208336/t0(0) o-1->lustre-MDT0000-mdc-ffff9c6cb69bb000@192.168.111.37@tcp:0/0 lens 0/0 e 0 to 0 dl 1580365564 ref 2 fl Interpret:N/ffffffff/ffffffff rc 0/-1 job:''
      00000008:00000040:0.0:1580365544.947469:0:24223:0:(osc_io.c:211:osc_io_submit()) 0/15 0
      00000008:00400020:0.0:1580365544.947473:0:24223:0:(osc_io.c:127:osc_io_submit()) 81 1
      00000008:00000020:0.0:1580365544.947486:0:24223:0:(osc_cache.c:1811:osc_update_pending()) obj ffff9c6bf82b58c0 ready 0|-|- wr 81|-|+ rd 0|- update pending cmd 2 delta 81.
      00000008:00000020:0.0:1580365544.947488:0:24223:0:(osc_cache.c:1770:osc_makes_rpc()) urgent request forcing RPC
      00000008:00000020:0.0:1580365544.947488:0:24223:0:(osc_cache.c:2360:osc_io_unplug0()) Queue writeback work for client ffff9c6cb6a284e0.
      ...
      00000008:00100000:0.0:1580365544.947546:0:9993:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b65f80 - object 0x20000040f:61854 offset 4096<>65536
      ....
      00000008:00100000:0.0:1580365544.947669:0:9993:0:(osc_request.c:1557:osc_brw_prep_request()) brw rpc ffff9c6c17b66d00 - object 0x0:68 offset 65536<>396106
      ...
      00000008:00000002:0.0:1580365544.969377:0:9993:0:(osc_request.c:2032:brw_interpret()) request ffff9c6c17b65f80 aa ffff9c6c17b660f0 rc -28
      ...
      00000008:00000002:0.0:1580365544.969551:0:9993:0:(osc_request.c:2032:brw_interpret()) request ffff9c6c17b66d00 aa ffff9c6c17b66e70 rc 0
      00000008:00000020:0.0:1580365544.969553:0:9993:0:(osc_object.c:183:osc_attr_update()) set kms from 0to 396106
      ...
      00000080:00200000:0.0:1580365544.969753:0:24223:0:(file.c:1280:ll_merge_attr()) [0x20000040f:0xf19e:0x0] updating i_size 396106
      00000080:00200000:0.0:1580365544.969939:0:24223:0:(vvp_io.c:1122:vvp_io_write_start()) f0.wat: write nob 0, result: 4096
      00000080:00200000:0.0:1580365544.969980:0:24223:0:(vvp_io.c:314:vvp_io_fini()) [0x20000040f:0xf19e:0x0] ignore/verify layout 0/0, layout version 3 need write layout 0, restore needed 0
      00000080:00200000:0.0:1580365544.969981:0:24223:0:(file.c:1519:ll_file_io_generic()) f0.wat: 2 io complete with rc: 0, result: 4096, restart: 0
      00000080:00200000:0.0:1580365544.969984:0:24223:0:(file.c:1553:ll_file_io_generic()) iot: 2, result: 4096
      00000020:00001000:0.0:1580365544.969984:0:24223:0:(cl_object.c:832:cl_env_put()) 1@ffff9c6c20824cf0
      00000020:00001000:0.0:1580365544.970006:0:24223:0:(cl_object.c:751:cl_env_get()) 1@ffff9c6c20824cf0
      00000080:00200000:0.0:1580365544.970007:0:24223:0:(file.c:1399:ll_file_io_generic()) f0.wat: write ppos: 4096, count: 392010
      00000080:00200000:0.0:1580365544.970009:0:24223:0:(file.c:1318:ll_io_set_mirror()) f0.wat: desiginated mirror: 0
      00000020:00200000:0.0:1580365544.970012:0:24223:0:(cl_io.c:215:cl_io_rw_init()) header@ffff9c6bfa77ea50[0x4, 2, [0x20000040f:0xf19e:0x0] hash]
       
      00000020:00200000:0.0:1580365544.970013:0:24223:0:(cl_io.c:215:cl_io_rw_init()) io range: 2 [4096, 396106) 0 1
      

      The main problem is the situation where, during a single IO, parts 1 and 3 are written but part 2 gets an error.
      The best way to prevent corrupted data in such cases is to write the data stripe by stripe, one after another. When an error happens we return the amount already written, and a second retry returns ENOSPC, so there would be no gaps between written data in the file, as there are right now. A sketch of this approach follows below.
      The downside would be lower performance at high rates, when a client probably holds the maximum grant of about 2GB.
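
      A rough sketch of that idea, with a hypothetical submit_stripe_sync() helper standing in for whatever would send one stripe's pages and wait for the reply (it is not an existing Lustre function): stripes are submitted in order, the write returns the byte count up to the first failing stripe, and the retry of the remainder then hits ENOSPC immediately instead of leaving a gap.

      #include <stddef.h>
      #include <sys/types.h>

      struct stripe_io {
              size_t bytes;   /* payload size of this stripe */
      };

      /* Hypothetical helper: send one stripe's pages and wait for the reply.
       * Returns 0 on success or a negative errno (e.g. -ENOSPC). */
      extern int submit_stripe_sync(struct stripe_io *stripe);

      static ssize_t write_stripes_in_order(struct stripe_io *stripes, int count)
      {
              ssize_t written = 0;

              for (int i = 0; i < count; i++) {
                      int rc = submit_stripe_sync(&stripes[i]);

                      if (rc < 0)
                              /* Stop at the first failing stripe: return what
                               * was written so far, or the error if nothing
                               * was.  The retry of the rest then fails with
                               * ENOSPC right away and no hole is left. */
                              return written ? written : rc;

                      written += stripes[i].bytes;
              }
              return written;
      }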


          People

            Assignee: Alexander Boyko (aboyko)
            Reporter: Alexander Boyko (aboyko)
