Details
- Type: Bug
- Resolution: Duplicate
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.14.0
- Component/s: None
- Environment: 3.10.0-1062.9.1.el7.x86_64.debug v2_13_56-40-g0104258
- Severity: 3
Description
There is an issue when using direct I/O to overwrite a file with transfers that are not a multiple of the page size:
k:~# cd /mnt/lustre
k:lustre# ls
k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0323935 s, 126 kB/s
k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0207773 s, 197 kB/s
k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0300933 s, 136 kB/s
k:lustre# /bin/truncate --size=1397 f0
k:lustre# dd if=f0 of=f1 oflag=direct bs=4K count=1
0+1 records in
0+1 records out
1397 bytes (1.4 kB) copied, 0.0510133 s, 27.4 kB/s
k:lustre# dd if=f0 of=f1 oflag=direct bs=4K count=1
# Hangs here in a 1397 byte write to f1 and resumes as follows.
dd: fsync failed for ‘f1’: Input/output error
0+1 records in
0+1 records out
1397 bytes (1.4 kB) copied, 100.296 s, 0.0 kB/s
I tried adding a sync before overwriting but it did not change anything.
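For reference, the steps above can be collected into a small reproducer script. This is only a sketch: it assumes a Lustre client mounted at /mnt/lustre and reuses the file names and the 1397-byte size from the session above.

#!/bin/bash
# Reproducer sketch for the direct I/O overwrite hang described above.
# Assumes a Lustre client is mounted at /mnt/lustre; adjust as needed.
set -x
cd /mnt/lustre || exit 1

# Create f0 with a single 4K direct write, then shrink it so its size
# is not a multiple of the page size.
dd if=/dev/zero of=f0 oflag=direct bs=4K count=1
/bin/truncate --size=1397 f0

# The first direct I/O copy into f1 succeeds.
dd if=f0 of=f1 oflag=direct bs=4K count=1

# Overwriting f1 with the same 1397-byte transfer hangs for ~100s and
# then fails with "dd: fsync failed for 'f1': Input/output error".
dd if=f0 of=f1 oflag=direct bs=4K count=1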
Here are the stack traces I collected:
('26269', 'ldlm_bl_02')
[<ffffffffc0f2e328>] osc_extent_wait+0x528/0x750 [osc]
[<ffffffffc0f3068d>] osc_cache_wait_range+0x2dd/0x940 [osc]
[<ffffffffc0f3163e>] osc_cache_writeback_range+0x94e/0xfd0 [osc]
[<ffffffffc0f1cc05>] osc_lock_flush+0x195/0x290 [osc]
[<ffffffffc0f1d0e8>] osc_ldlm_blocking_ast+0x2f8/0x3e0 [osc]
[<ffffffffc10ed051>] ldlm_cancel_callback+0x91/0x330 [ptlrpc]
[<ffffffffc11054d1>] ldlm_cli_cancel_local+0xd1/0x420 [ptlrpc]
[<ffffffffc110b81c>] ldlm_cli_cancel+0x10c/0x560 [ptlrpc]
[<ffffffffc0f1cf6a>] osc_ldlm_blocking_ast+0x17a/0x3e0 [osc]
[<ffffffffc1117b98>] ldlm_handle_bl_callback+0xc8/0x3e0 [ptlrpc]
[<ffffffffc111846f>] ldlm_bl_thread_main+0x5bf/0xae0 [ptlrpc]
[<ffffffff824e1c0f>] kthread+0xef/0x100
[<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

('28085', 'dd')
[<ffffffffc0f2e328>] osc_extent_wait+0x528/0x750 [osc]
[<ffffffffc0f3068d>] osc_cache_wait_range+0x2dd/0x940 [osc]
[<ffffffffc0f3163e>] osc_cache_writeback_range+0x94e/0xfd0 [osc]
[<ffffffffc0f1cc05>] osc_lock_flush+0x195/0x290 [osc]
[<ffffffffc0f1cd3f>] osc_lock_lockless_cancel+0x3f/0xf0 [osc]
[<ffffffffc0c29108>] cl_lock_cancel+0x78/0x160 [obdclass]
[<ffffffffc0fa9af9>] lov_lock_cancel+0x99/0x190 [lov]
[<ffffffffc0c29108>] cl_lock_cancel+0x78/0x160 [obdclass]
[<ffffffffc0c29622>] cl_lock_release+0x52/0x140 [obdclass]
[<ffffffffc0c2cfb9>] cl_io_unlock+0x139/0x290 [obdclass]
[<ffffffffc0c2dfa8>] cl_io_loop+0xb8/0x200 [obdclass]
[<ffffffffc1b273ed>] ll_file_io_generic+0x8ad/0xd70 [lustre]
[<ffffffffc1b27d40>] ll_file_aio_write+0x490/0x780 [lustre]
[<ffffffffc1b28130>] ll_file_write+0x100/0x1c0 [lustre]
[<ffffffff826a6fbc>] vfs_write+0xdc/0x240
[<ffffffff826a7e8a>] SyS_write+0x8a/0x100
[<ffffffff82c87a9e>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

('26540', 'll_ost_io00_001')
[<ffffffffc1108987>] ldlm_completion_ast+0x787/0x9e0 [ptlrpc]
[<ffffffffc110791f>] ldlm_cli_enqueue_local+0x25f/0x870 [ptlrpc]
[<ffffffffc11a3aea>] tgt_extent_lock+0xea/0x290 [ptlrpc]
[<ffffffffc11a5d2f>] tgt_brw_lock.isra.28+0x14f/0x340 [ptlrpc]
[<ffffffffc11a7e27>] tgt_brw_write+0x797/0x1a10 [ptlrpc]
[<ffffffffc11ab103>] tgt_request_handle+0x813/0x1780 [ptlrpc]
[<ffffffffc1149e16>] ptlrpc_server_handle_request+0x266/0xb50 [ptlrpc]
[<ffffffffc114ec51>] ptlrpc_main+0xca1/0x26b0 [ptlrpc]
[<ffffffff824e1c0f>] kthread+0xef/0x100
[<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
And some dmesg logs:
[202397.535229] Lustre: ll_ost_io00_001: service thread pid 26540 was inactive for 40.094 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[202397.543576] Pid: 26540, comm: ll_ost_io00_001 3.10.0-1062.9.1.el7.x86_64.debug #1 SMP Mon Mar 16 12:44:56 CDT 2020
[202397.546304] Call Trace:
[202397.546978] [<ffffffffc1108987>] ldlm_completion_ast+0x787/0x9e0 [ptlrpc]
[202397.548875] [<ffffffffc110791f>] ldlm_cli_enqueue_local+0x25f/0x870 [ptlrpc]
[202397.550795] [<ffffffffc11a3aea>] tgt_extent_lock+0xea/0x290 [ptlrpc]
[202397.552573] [<ffffffffc11a5d2f>] tgt_brw_lock.isra.28+0x14f/0x340 [ptlrpc]
[202397.554371] [<ffffffffc11a7e27>] tgt_brw_write+0x797/0x1a10 [ptlrpc]
[202397.556390] [<ffffffffc11ab103>] tgt_request_handle+0x813/0x1780 [ptlrpc]
[202397.558253] [<ffffffffc1149e16>] ptlrpc_server_handle_request+0x266/0xb50 [ptlrpc]
[202397.560299] [<ffffffffc114ec51>] ptlrpc_main+0xca1/0x26b0 [ptlrpc]
[202397.562022] [<ffffffff824e1c0f>] kthread+0xef/0x100
[202397.563260] [<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
[202397.565262] [<ffffffffffffffff>] 0xffffffffffffffff
[202457.567573] LustreError: 26270:0:(ldlm_lockd.c:260:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 0@lo ns: filter-lustre-OST0002_UUID lock: ffff9333df8ac580/0x30c43bdabaa6c7b7 lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000480000020 nid: 0@lo remote: 0x30c43bdabaa6c7b0 expref: 7 pid: 26534 timeout: 202457 lvb_type: 0
[202457.584810] LustreError: 26540:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0002: cli 47f1d083-bc11-4da7-b42f-9a3a45d5d410 claims 28672 GRANT, real grant 0
[202457.607108] LustreError: 27832:0:(ldlm_lockd.c:2489:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1603402419 with bad export cookie 3513999419876230990
[202457.611541] LustreError: 11-0: lustre-OST0002-osc-ffff93340109b000: operation ost_sync to node 0@lo failed: rc = -107
[202457.611551] Lustre: lustre-OST0002-osc-ffff93340109b000: Connection to lustre-OST0002 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[202457.612031] Lustre: lustre-OST0002: Connection restored to 192.168.122.75@tcp (at 0@lo)
[202457.612195] LustreError: 167-0: lustre-OST0002-osc-ffff93340109b000: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
Issue Links
- duplicates LU-14072 client evictions with DIO (Resolved)