[LU-14066] dd oflag=direct writes hang when size is not a multiple of page size Created: 22/Oct/20  Updated: 24/Nov/20  Resolved: 24/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

3.10.0-1062.9.1.el7.x86_64.debug v2_13_56-40-g0104258


Issue Links:
Duplicate
duplicates LU-14072 client evictions with DIO Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There is an issue when direct I/O is used to overwrite a file with transfers that are not a multiple of the page size:

k:~# cd /mnt/lustre
k:lustre# ls 
k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1 
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0323935 s, 126 kB/s
k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1 
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0207773 s, 197 kB/s
k:lustre# dd if=/dev/zero of=f0 oflag=direct bs=4K count=1 
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0300933 s, 136 kB/s
k:lustre# /bin/truncate --size=1397 f0
k:lustre# dd if=f0 of=f1 oflag=direct bs=4K count=1 
0+1 records in
0+1 records out
1397 bytes (1.4 kB) copied, 0.0510133 s, 27.4 kB/s
k:lustre# dd if=f0 of=f1 oflag=direct bs=4K count=1 
# Hangs here in a 1397 byte write to f1 and resumes as follows.
dd: fsync failed for ‘f1’: Input/output error
0+1 records in
0+1 records out
1397 bytes (1.4 kB) copied, 100.296 s, 0.0 kB/s

I tried adding a sync before the overwrite, but it did not change anything.
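
For convenience, here is a condensed shell version of the reproducer above (a sketch only; the mount point /mnt/lustre and the file names f0/f1 are taken from the transcript, and the commented-out sync line marks the variant mentioned above):

#!/bin/sh
# Condensed reproducer; assumes a Lustre client mounted at /mnt/lustre.
set -x
cd /mnt/lustre

# Page-sized direct I/O overwrites of f0 succeed.
dd if=/dev/zero of=f0 oflag=direct bs=4K count=1
dd if=/dev/zero of=f0 oflag=direct bs=4K count=1
dd if=/dev/zero of=f0 oflag=direct bs=4K count=1

# Shrink f0 so its size (1397 bytes) is not a multiple of the page size.
/bin/truncate --size=1397 f0

# First copy with oflag=direct completes (0+1 records).
dd if=f0 of=f1 oflag=direct bs=4K count=1

# sync    # adding a sync here did not change anything

# Second copy hangs in the 1397-byte write to f1 and eventually fails with
# "fsync failed for 'f1': Input/output error" once the client is evicted.
dd if=f0 of=f1 oflag=direct bs=4K count=1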

Here are the stack traces I collected:

('26269', 'ldlm_bl_02')
[<ffffffffc0f2e328>] osc_extent_wait+0x528/0x750 [osc]
[<ffffffffc0f3068d>] osc_cache_wait_range+0x2dd/0x940 [osc]
[<ffffffffc0f3163e>] osc_cache_writeback_range+0x94e/0xfd0 [osc]
[<ffffffffc0f1cc05>] osc_lock_flush+0x195/0x290 [osc]
[<ffffffffc0f1d0e8>] osc_ldlm_blocking_ast+0x2f8/0x3e0 [osc]
[<ffffffffc10ed051>] ldlm_cancel_callback+0x91/0x330 [ptlrpc]
[<ffffffffc11054d1>] ldlm_cli_cancel_local+0xd1/0x420 [ptlrpc]
[<ffffffffc110b81c>] ldlm_cli_cancel+0x10c/0x560 [ptlrpc]
[<ffffffffc0f1cf6a>] osc_ldlm_blocking_ast+0x17a/0x3e0 [osc]
[<ffffffffc1117b98>] ldlm_handle_bl_callback+0xc8/0x3e0 [ptlrpc]
[<ffffffffc111846f>] ldlm_bl_thread_main+0x5bf/0xae0 [ptlrpc]
[<ffffffff824e1c0f>] kthread+0xef/0x100
[<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

('28085', 'dd')
[<ffffffffc0f2e328>] osc_extent_wait+0x528/0x750 [osc]
[<ffffffffc0f3068d>] osc_cache_wait_range+0x2dd/0x940 [osc]
[<ffffffffc0f3163e>] osc_cache_writeback_range+0x94e/0xfd0 [osc]
[<ffffffffc0f1cc05>] osc_lock_flush+0x195/0x290 [osc]
[<ffffffffc0f1cd3f>] osc_lock_lockless_cancel+0x3f/0xf0 [osc]
[<ffffffffc0c29108>] cl_lock_cancel+0x78/0x160 [obdclass]
[<ffffffffc0fa9af9>] lov_lock_cancel+0x99/0x190 [lov]
[<ffffffffc0c29108>] cl_lock_cancel+0x78/0x160 [obdclass]
[<ffffffffc0c29622>] cl_lock_release+0x52/0x140 [obdclass]
[<ffffffffc0c2cfb9>] cl_io_unlock+0x139/0x290 [obdclass]
[<ffffffffc0c2dfa8>] cl_io_loop+0xb8/0x200 [obdclass]
[<ffffffffc1b273ed>] ll_file_io_generic+0x8ad/0xd70 [lustre]
[<ffffffffc1b27d40>] ll_file_aio_write+0x490/0x780 [lustre]
[<ffffffffc1b28130>] ll_file_write+0x100/0x1c0 [lustre]
[<ffffffff826a6fbc>] vfs_write+0xdc/0x240
[<ffffffff826a7e8a>] SyS_write+0x8a/0x100
[<ffffffff82c87a9e>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

('26540', 'll_ost_io00_001')
[<ffffffffc1108987>] ldlm_completion_ast+0x787/0x9e0 [ptlrpc]
[<ffffffffc110791f>] ldlm_cli_enqueue_local+0x25f/0x870 [ptlrpc]
[<ffffffffc11a3aea>] tgt_extent_lock+0xea/0x290 [ptlrpc]
[<ffffffffc11a5d2f>] tgt_brw_lock.isra.28+0x14f/0x340 [ptlrpc]
[<ffffffffc11a7e27>] tgt_brw_write+0x797/0x1a10 [ptlrpc]
[<ffffffffc11ab103>] tgt_request_handle+0x813/0x1780 [ptlrpc]
[<ffffffffc1149e16>] ptlrpc_server_handle_request+0x266/0xb50 [ptlrpc]
[<ffffffffc114ec51>] ptlrpc_main+0xca1/0x26b0 [ptlrpc]
[<ffffffff824e1c0f>] kthread+0xef/0x100
[<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

And some dmesg logs:

[202397.535229] Lustre: ll_ost_io00_001: service thread pid 26540 was inactive for 40.094 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[202397.543576] Pid: 26540, comm: ll_ost_io00_001 3.10.0-1062.9.1.el7.x86_64.debug #1 SMP Mon Mar 16 12:44:56 CDT 2020
[202397.546304] Call Trace:
[202397.546978]  [<ffffffffc1108987>] ldlm_completion_ast+0x787/0x9e0 [ptlrpc]
[202397.548875]  [<ffffffffc110791f>] ldlm_cli_enqueue_local+0x25f/0x870 [ptlrpc]
[202397.550795]  [<ffffffffc11a3aea>] tgt_extent_lock+0xea/0x290 [ptlrpc]
[202397.552573]  [<ffffffffc11a5d2f>] tgt_brw_lock.isra.28+0x14f/0x340 [ptlrpc]
[202397.554371]  [<ffffffffc11a7e27>] tgt_brw_write+0x797/0x1a10 [ptlrpc]
[202397.556390]  [<ffffffffc11ab103>] tgt_request_handle+0x813/0x1780 [ptlrpc]
[202397.558253]  [<ffffffffc1149e16>] ptlrpc_server_handle_request+0x266/0xb50 [ptlrpc]
[202397.560299]  [<ffffffffc114ec51>] ptlrpc_main+0xca1/0x26b0 [ptlrpc]
[202397.562022]  [<ffffffff824e1c0f>] kthread+0xef/0x100
[202397.563260]  [<ffffffff82c878f7>] ret_from_fork_nospec_end+0x0/0x39
[202397.565262]  [<ffffffffffffffff>] 0xffffffffffffffff
[202457.567573] LustreError: 26270:0:(ldlm_lockd.c:260:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 0@lo  ns: filter-lustre-OST0002_UUID lock: ffff9333df8ac580/0x30c43bdabaa6c7b7 lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000480000020 nid: 0@lo remote: 0x30c43bdabaa6c7b0 expref: 7 pid: 26534 timeout: 202457 lvb_type: 0
[202457.584810] LustreError: 26540:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0002: cli 47f1d083-bc11-4da7-b42f-9a3a45d5d410 claims 28672 GRANT, real grant 0
[202457.607108] LustreError: 27832:0:(ldlm_lockd.c:2489:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1603402419 with bad export cookie 3513999419876230990
[202457.611541] LustreError: 11-0: lustre-OST0002-osc-ffff93340109b000: operation ost_sync to node 0@lo failed: rc = -107
[202457.611551] Lustre: lustre-OST0002-osc-ffff93340109b000: Connection to lustre-OST0002 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[202457.612031] Lustre: lustre-OST0002: Connection restored to 192.168.122.75@tcp (at 0@lo)
[202457.612195] LustreError: 167-0: lustre-OST0002-osc-ffff93340109b000: This client was evicted by lustre-OST0002; in progress operations using this service will fail.


 Comments   
Comment by Andreas Dilger [ 24/Nov/20 ]

Patch is in LU-14072.
