LU-14066: dd oflag=direct writes hang when size is not a multiple of page size


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Component/s: None
    • Environment: 3.10.0-1062.9.1.el7.x86_64.debug v2_13_56-40-g0104258
    • Severity: 3
    • 9223372036854775807

    Description

      There is an issue when using direct I/O to overwrite a file with transfers that are not a multiple of the page size (a condensed reproducer script follows the transcript below):

      k:~# cd /mnt/lustre
      k:lustre# ls 
      k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1 
      1+0 records in
      1+0 records out
      4096 bytes (4.1 kB) copied, 0.0323935 s, 126 kB/s
      k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1 
      1+0 records in
      1+0 records out
      4096 bytes (4.1 kB) copied, 0.0207773 s, 197 kB/s
      k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1 
      1+0 records in
      1+0 records out
      4096 bytes (4.1 kB) copied, 0.0300933 s, 136 kB/s
      k:lustre# /bin/truncate --size=1397 f0
      k:lustre# dd if=f0 of=f1 oflag=direct bs=4K count=1 
      0+1 records in
      0+1 records out
      1397 bytes (1.4 kB) copied, 0.0510133 s, 27.4 kB/s
      k:lustre# dd if=f0 of=f1 oflag=direct bs=4K count=1 
      # Hangs here in a 1397-byte write to f1 and resumes as follows.
      dd: fsync failed for ‘f1’: Input/output error
      0+1 records in
      0+1 records out
      1397 bytes (1.4 kB) copied, 100.296 s, 0.0 kB/s
      

      I tried adding a sync before the overwrite, but it did not change anything.
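
      Condensed into a script, the reproducer looks roughly like this (a sketch of the steps above; the file names and mount point are taken from the transcript, and the placement of the sync is an assumption since the report does not say exactly where it was added):

      #!/bin/sh
      # Sketch of the reproducer from the transcript above.
      # Assumes a Lustre client mounted at /mnt/lustre.
      cd /mnt/lustre

      # Create f0 with a couple of page-sized direct writes.
      dd if=/dev/zero of=f0 oflag=direct bs=4K count=1
      dd if=/dev/zero of=f0 oflag=direct bs=4K count=1

      # Make the file size a non-multiple of the page size.
      /bin/truncate --size=1397 f0

      # First 1397-byte direct-I/O copy to f1 succeeds.
      dd if=f0 of=f1 oflag=direct bs=4K count=1

      # Adding a sync here reportedly did not change the behavior.
      sync

      # Second copy hangs in the 1397-byte write to f1 for ~100 s, then
      # fails with "fsync failed ... Input/output error" once the client
      # is evicted.
      dd if=f0 of=f1 oflag=direct bs=4K count=1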

      Here are the stack traces I collected:

      ('26269', 'ldlm_bl_02')
      [<ffffffffc0f2e328>] osc_extent_wait+0x528/0x750 [osc]
      [<ffffffffc0f3068d>] osc_cache_wait_range+0x2dd/0x940 [osc]
      [<ffffffffc0f3163e>] osc_cache_writeback_range+0x94e/0xfd0 [osc]
      [<ffffffffc0f1cc05>] osc_lock_flush+0x195/0x290 [osc]
      [<ffffffffc0f1d0e8>] osc_ldlm_blocking_ast+0x2f8/0x3e0 [osc]
      [<ffffffffc10ed051>] ldlm_cancel_callback+0x91/0x330 [ptlrpc]
      [<ffffffffc11054d1>] ldlm_cli_cancel_local+0xd1/0x420 [ptlrpc]
      [<ffffffffc110b81c>] ldlm_cli_cancel+0x10c/0x560 [ptlrpc]
      [<ffffffffc0f1cf6a>] osc_ldlm_blocking_ast+0x17a/0x3e0 [osc]
      [<ffffffffc1117b98>] ldlm_handle_bl_callback+0xc8/0x3e0 [ptlrpc]
      [<ffffffffc111846f>] ldlm_bl_thread_main+0x5bf/0xae0 [ptlrpc]
      [<ffffffff824e1c0f>] kthread+0xef/0x100
      [<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      ('28085', 'dd')
      [<ffffffffc0f2e328>] osc_extent_wait+0x528/0x750 [osc]
      [<ffffffffc0f3068d>] osc_cache_wait_range+0x2dd/0x940 [osc]
      [<ffffffffc0f3163e>] osc_cache_writeback_range+0x94e/0xfd0 [osc]
      [<ffffffffc0f1cc05>] osc_lock_flush+0x195/0x290 [osc]
      [<ffffffffc0f1cd3f>] osc_lock_lockless_cancel+0x3f/0xf0 [osc]
      [<ffffffffc0c29108>] cl_lock_cancel+0x78/0x160 [obdclass]
      [<ffffffffc0fa9af9>] lov_lock_cancel+0x99/0x190 [lov]
      [<ffffffffc0c29108>] cl_lock_cancel+0x78/0x160 [obdclass]
      [<ffffffffc0c29622>] cl_lock_release+0x52/0x140 [obdclass]
      [<ffffffffc0c2cfb9>] cl_io_unlock+0x139/0x290 [obdclass]
      [<ffffffffc0c2dfa8>] cl_io_loop+0xb8/0x200 [obdclass]
      [<ffffffffc1b273ed>] ll_file_io_generic+0x8ad/0xd70 [lustre]
      [<ffffffffc1b27d40>] ll_file_aio_write+0x490/0x780 [lustre]
      [<ffffffffc1b28130>] ll_file_write+0x100/0x1c0 [lustre]
      [<ffffffff826a6fbc>] vfs_write+0xdc/0x240
      [<ffffffff826a7e8a>] SyS_write+0x8a/0x100
      [<ffffffff82c87a9e>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      ('26540', 'll_ost_io00_001')
      [<ffffffffc1108987>] ldlm_completion_ast+0x787/0x9e0 [ptlrpc]
      [<ffffffffc110791f>] ldlm_cli_enqueue_local+0x25f/0x870 [ptlrpc]
      [<ffffffffc11a3aea>] tgt_extent_lock+0xea/0x290 [ptlrpc]
      [<ffffffffc11a5d2f>] tgt_brw_lock.isra.28+0x14f/0x340 [ptlrpc]
      [<ffffffffc11a7e27>] tgt_brw_write+0x797/0x1a10 [ptlrpc]
      [<ffffffffc11ab103>] tgt_request_handle+0x813/0x1780 [ptlrpc]
      [<ffffffffc1149e16>] ptlrpc_server_handle_request+0x266/0xb50 [ptlrpc]
      [<ffffffffc114ec51>] ptlrpc_main+0xca1/0x26b0 [ptlrpc]
      [<ffffffff824e1c0f>] kthread+0xef/0x100
      [<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
      [<ffffffffffffffff>] 0xffffffffffffffff
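
      For reference, per-thread kernel stacks like the ones above can be read from /proc/<pid>/stack (the exact collection method is not shown in the report); a minimal sketch, run as root on the client:

      # Dump the thread name and kernel stack for each pid of interest.
      for pid in 26269 28085 26540; do
          echo "($pid, $(cat /proc/$pid/comm))"
          cat /proc/$pid/stack
          echo
      done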
      

      And some dmesg logs:

      [202397.535229] Lustre: ll_ost_io00_001: service thread pid 26540 was inactive for 40.094 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [202397.543576] Pid: 26540, comm: ll_ost_io00_001 3.10.0-1062.9.1.el7.x86_64.debug #1 SMP Mon Mar 16 12:44:56 CDT 2020
      [202397.546304] Call Trace:
      [202397.546978]  [<ffffffffc1108987>] ldlm_completion_ast+0x787/0x9e0 [ptlrpc]
      [202397.548875]  [<ffffffffc110791f>] ldlm_cli_enqueue_local+0x25f/0x870 [ptlrpc]
      [202397.550795]  [<ffffffffc11a3aea>] tgt_extent_lock+0xea/0x290 [ptlrpc]
      [202397.552573]  [<ffffffffc11a5d2f>] tgt_brw_lock.isra.28+0x14f/0x340 [ptlrpc]
      [202397.554371]  [<ffffffffc11a7e27>] tgt_brw_write+0x797/0x1a10 [ptlrpc]
      [202397.556390]  [<ffffffffc11ab103>] tgt_request_handle+0x813/0x1780 [ptlrpc]
      [202397.558253]  [<ffffffffc1149e16>] ptlrpc_server_handle_request+0x266/0xb50 [ptlrpc]
      [202397.560299]  [<ffffffffc114ec51>] ptlrpc_main+0xca1/0x26b0 [ptlrpc]
      [202397.562022]  [<ffffffff824e1c0f>] kthread+0xef/0x100
      [202397.563260]  [<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
      [202397.565262]  [<ffffffffffffffff>] 0xffffffffffffffff
      [202457.567573] LustreError: 26270:0:(ldlm_lockd.c:260:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 0@lo  ns: filter-lustre-OST0002_UUID lock: ffff9333df8ac580/0x30c43bdabaa6c7b7 lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000480000020 nid: 0@lo remote: 0x30c43bdabaa6c7b0 expref: 7 pid: 26534 timeout: 202457 lvb_type: 0
      [202457.584810] LustreError: 26540:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0002: cli 47f1d083-bc11-4da7-b42f-9a3a45d5d410 claims 28672 GRANT, real grant 0
      [202457.607108] LustreError: 27832:0:(ldlm_lockd.c:2489:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1603402419 with bad export cookie 3513999419876230990
      [202457.611541] LustreError: 11-0: lustre-OST0002-osc-ffff93340109b000: operation ost_sync to node 0@lo failed: rc = -107
      [202457.611551] Lustre: lustre-OST0002-osc-ffff93340109b000: Connection to lustre-OST0002 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [202457.612031] Lustre: lustre-OST0002: Connection restored to 192.168.122.75@tcp (at 0@lo)
      [202457.612195] LustreError: 167-0: lustre-OST0002-osc-ffff93340109b000: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
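
      The eviction is also visible from the client side; a hedged sketch of how to confirm it (standard lctl parameter names, not taken from this report):

      # Check the OSC import state for the affected OST and scan the
      # console log for the eviction and the expired lock callback timer.
      lctl get_param osc.lustre-OST0002-osc-*.import | grep -E 'target|state'
      dmesg | grep -iE 'evict|callback timer expired'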
      

            People

              Assignee: WC Triage (wc-triage)
              Reporter: John Hammond (jhammond)
