[LU-11597] sanityn test 16a failed with direct I/O Created: 01/Nov/18  Updated: 16/Dec/22  Resolved: 18/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: arm, ppc
Environment:

Lustre Build: https://build.whamcloud.com/job/lustre-master/3811
Distro/Arch: RHEL7.5/aarch64 (client), RHEL7.5/x86_64 (server)


Issue Links:
Related
is related to LU-11527 sanity test_270a failed with O_DIRECT... Resolved
is related to LU-10929 skip sanity/315 if IO accounting is n... Resolved
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
Severity: 3

 Description   

sanity test 241b failed with direct I/O:

== sanity test 241b: dio vs dio ====================================================================== 08:45:10 (1540543510)
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.00274397 s, 14.9 MB/s
-rw-r--r-- 1 root root 40960 Oct 26 08:45 /mnt/lustre/f241b.sanity
 sanity test_241b: @@@@@@ FAIL: test_241b failed with 1 

Maloo report: https://testing.whamcloud.com/test_sets/88bbf5c2-d9d0-11e8-b46b-52540065bddc

sanity tests 270a, 315, and sanityn test 16a also failed with the same issue:
https://testing.whamcloud.com/test_sets/88bbf5c2-d9d0-11e8-b46b-52540065bddc
https://testing.whamcloud.com/test_sets/8b2f4282-d9d0-11e8-b46b-52540065bddc



 Comments   
Comment by Jian Yu [ 06/Nov/18 ]

The page size on the ARM client is 64 kB:

# uname -a
Linux trevis-79vm36.trevis.whamcloud.com 4.14.0-49.13.1.el7a.aarch64 #1 SMP Thu Sep 27 14:45:52 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
# getconf PAGE_SIZE
65536

while it is 4 kB on the x86_64 server:

# uname -a
Linux trevis-58vm4.trevis.whamcloud.com 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Fri Oct 12 14:51:33 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# getconf PAGE_SIZE
4096
Comment by Jian Yu [ 06/Nov/18 ]

On an ARM client with x86_64 servers, writing fewer than 65536 bytes to a file in direct I/O mode fails with -EINVAL:

# yes | dd of=/mnt/lustre/f1 bs=65535 count=1 oflag=direct
dd: error writing ‘/mnt/lustre/f1’: Invalid argument
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00356834 s, 0.0 kB/s

The debug log on the client showed:

00000020:00000001:0.0:1541493387.509116:1776:18933:0:(cl_io.c:558:cl_io_start()) Process entered
00000080:00000001:0.0:1541493387.509118:1936:18933:0:(vvp_io.c:1036:vvp_io_write_start()) Process entered
00000080:00200000:0.0:1541493387.509121:1936:18933:0:(vvp_io.c:1061:vvp_io_write_start()) f1: write [0, 65535)
00000020:00000001:0.0:1541493387.509123:1984:18933:0:(cl_object.c:413:cl_object_maxbytes()) Process entered
00020000:00000002:0.0:1541493387.509124:2096:18933:0:(lov_object.c:1075:lov_conf_freeze()) To take share lov(ffff80002f4e0480) owner           (null)/ffff8000300f4400
00020000:00000002:0.0:1541493387.509127:2096:18933:0:(lov_object.c:2092:lov_lsm_addref()) lsm ffff8000329f9a80 addref 2/0 by ffff8000300f4400.
00020000:00000002:0.0:1541493387.509129:2096:18933:0:(lov_object.c:1083:lov_conf_thaw()) To release share lov(ffff80002f4e0480) owner           (null)/ffff8000300f4400
00000020:00000001:0.0:1541493387.509131:2016:18933:0:(cl_object.c:421:cl_object_maxbytes()) Process leaving (rc=17592186040320 : 17592186040320 : ffffffff000)
00000080:00000001:0.0:1541493387.509137:1968:18933:0:(vvp_io.c:1133:vvp_io_write_start()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
00000020:00000001:0.0:1541493387.509139:1808:18933:0:(cl_io.c:570:cl_io_start()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)

Looking into vvp_io_write_start()->__generic_file_write_iter()...

Comment by Jian Yu [ 07/Nov/18 ]

On an x86_64 client with a 4 kB page size, writing fewer than 4096 bytes to a file in direct I/O mode also fails with -EINVAL:

# yes | dd of=/mnt/lustre/f2 bs=4095 count=1 oflag=direct
dd: error writing ‘/mnt/lustre/f2’: Invalid argument
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0014026 s, 0.0 kB/s

So, in direct I/O mode, the number of bytes written at a time must not be less than one page. I'm creating a patch to update the test scripts.
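
A minimal sketch of how a test script could size its direct I/O writes for the client's page size. It rounds a byte count up to a whole number of pages, which satisfies the one-page minimum observed above; keeping the size page-aligned as well is an assumption here, not something this ticket states outright. The helper name round_up_to_page is hypothetical and is not taken from the Lustre test framework.

```shell
#!/bin/sh
# Hypothetical helper: round a byte count up to the next multiple of the
# page size. The second argument overrides the page size; by default the
# client's own value from `getconf PAGE_SIZE` is used.
round_up_to_page() {
    size=$1
    page=${2:-$(getconf PAGE_SIZE)}
    # Integer division truncates, so adding (page - 1) first rounds up.
    echo $(( (size + page - 1) / page * page ))
}

# The 40960-byte write from sanity test_241b is smaller than one 64 kB page,
# so on the ARM client it would be rounded up to 65536 bytes.
round_up_to_page 40960 65536
```

With such a helper, a test could run something like `dd if=/dev/zero of=/mnt/lustre/f1 bs=$(round_up_to_page 40960) count=1 oflag=direct`, so the same script works on both 4 kB and 64 kB clients.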

Comment by Gerrit Updater [ 09/Nov/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33636
Subject: LU-11597 tests: fix O_DIRECT test usage for ARM
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e26c7a9c6e747de442d5f37cf14b352ca3e4b365

Comment by Gerrit Updater [ 13/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33636/
Subject: LU-11597 tests: fix O_DIRECT test usage for ARM
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f602b5ec7f45713122abd615a97a13d7c97d460e

Comment by Peter Jones [ 13/Nov/18 ]

Landed for 2.12

Comment by James Nunez (Inactive) [ 15/Dec/18 ]

I'm reopening this ticket because sanityn test 16a, as mentioned in the description, is still failing for ARM.

Logs for recent test failures are at
https://testing.whamcloud.com/test_sets/f1121dd4-fdef-11e8-b837-52540065bddc
https://testing.whamcloud.com/test_sets/1de368ba-fa38-11e8-bb6b-52540065bddc

Comment by James Nunez (Inactive) [ 12/Feb/20 ]

Also, PPC client testing fails this test 100% of the time: https://testing.whamcloud.com/test_sets/f25e7616-4a6e-11ea-b69a-52540065bddc

Comment by Gerrit Updater [ 13/Feb/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37561
Subject: LU-11597 tests: skip sanityn tests for PPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bfe402a758e752124a0d081f5bb5bde4b95566bb

Comment by Gerrit Updater [ 15/Feb/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37589
Subject: LU-11597 test: fix sanityn 16a to align page size
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c23ab3e322cfc6b585b39c97d6bbec77127e6f2b

Comment by Gerrit Updater [ 20/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37561/
Subject: LU-11597 tests: skip sanityn tests for PPC
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c27e5fe50ca3de4c9d3dbb024a0704ee3cc4e15c

Comment by James Nunez (Inactive) [ 20/Feb/20 ]

The patch that landed, https://review.whamcloud.com/37561/, puts sanityn tests 16a and 71a on the ALWAYS_EXCEPT list for PPC client testing. This ticket should remain open until those tests are fixed and the tests are taken off the list.
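
For illustration only, an architecture-specific exclusion of this kind could be sketched as below. The helper name ppc_except_list and the ppc64* match pattern are assumptions; the actual condition used in the landed patch is not shown in this ticket.

```shell
#!/bin/sh
# Hypothetical sketch: append sanityn tests 16a and 71a to an
# ALWAYS_EXCEPT-style skip list when the client architecture is PPC.
ppc_except_list() {
    arch=$1
    list=$2
    case "$arch" in
        # Assumed pattern: matches ppc64 and ppc64le as reported by `uname -m`.
        ppc64*) list="$list 16a 71a" ;;
    esac
    echo "$list"
}

# Extend the existing list (empty if unset) for the local client.
ALWAYS_EXCEPT=$(ppc_except_list "$(uname -m)" "${ALWAYS_EXCEPT:-}")
```

Removing the two test numbers from the case branch would take the tests back off the list once they are fixed.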

Comment by Gerrit Updater [ 17/Nov/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40660
Subject: LU-11597 tests: skip sanityn tests for PPC
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7d20d3c6abfcac939ec9c7020484882d5e8067b7

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40660/
Subject: LU-11597 tests: skip sanityn tests for PPC
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: d0a53964f1bd7d8a21fc987df8d2f250b04f9ada

Comment by Xinliang Liu [ 29/Oct/21 ]

Only sanityn test 16a fails now, so I changed the title.

Comment by Gerrit Updater [ 18/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/37589/
Subject: LU-11597 test: Fix sanityn 16a failed on arm
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eb704aecfaad2a6256d1e2e48cdfadbabb07e5cb
