[LU-10437] sanity-pfl test_8: dbench failed Created: 26/Dec/17  Updated: 29/Aug/18  Resolved: 20/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Minor
Reporter: James Casper Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None
Environment:

onyx, full interop
servers: el7.4, ldiskfs, branch b2_10, v2.10.2, b52
clients: el7.4, branch master, v2.10.56, b3678


Issue Links:
Duplicate
is duplicated by LU-10439 sanity-pfl test_15: dd /mnt/lustre/d1... Resolved
Related
is related to LU-11291 recovering from LU-10437 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

session: https://testing.hpdd.intel.com/test_sessions/cb5f13e0-9177-4f4e-9a26-d197001db0c0
test set: https://testing.hpdd.intel.com/test_sets/5922f1c0-e4ef-11e7-8027-52540065bddc

From test_log:

copying /usr/share/dbench/client.txt to /mnt/lustre/d8.sanity-pfl/client.txt
cp: error writing '/mnt/lustre/d8.sanity-pfl/client.txt': Invalid argument
cp: failed to extend '/mnt/lustre/d8.sanity-pfl/client.txt': Invalid argument
  Trace dump:
  = rundbench:55:main()
sanity-pfl: FAIL: test-framework exiting on error
 sanity-pfl test_8: @@@@@@ FAIL: dbench failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5328:error()
  = /usr/lib64/lustre/tests/sanity-pfl.sh:333:test_8()
  = /usr/lib64/lustre/tests/test-framework.sh:5604:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5643:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5490:run_test()
  = /usr/lib64/lustre/tests/sanity-pfl.sh:337:main()


 Comments   
Comment by Jian Yu [ 08/Jan/18 ]

Dmesg log on client node:

LustreError: 514:0:(lov_object.c:1220:lov_layout_change()) lustre-clilov-ffff88005daa9000: cannot apply new layout on [0x200062e21:0x2735:0x0] : rc = -22
LustreError: 514:0:(vvp_io.c:1495:vvp_io_init()) lustre: refresh file layout [0x200062e21:0x2735:0x0] error -22.

Debug log on client node:

00000080:00200000:1.0:1513687522.574612:0:514:0:(vvp_io.c:312:vvp_io_fini()) [0x200062e21:0x2735:0x0] ignore/verify layout 1/0, layout version 0 need write layout 0, restore needed 0
00020000:00020000:1.0:1513687522.578220:0:514:0:(lov_object.c:1220:lov_layout_change()) lustre-clilov-ffff88005daa9000: cannot apply new layout on [0x200062e21:0x2735:0x0] : rc = -22
00010000:00010000:1.0:1513687522.579782:0:514:0:(ldlm_lock.c:800:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: ?? lock: ffff88005cf42240/0xed5adfd7af8ec0c4 lrc: 3/1,0 mode: CR/CR res: ?? rrc=?? type: ??? flags: 0x10000000000000 nid: local remote: 0x4869a6ac1a9ab3d4 expref: -99 pid: 514 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1513687522.579786:0:514:0:(ldlm_lock.c:873:ldlm_lock_decref_internal()) ### add lock into lru list ns: ?? lock: ffff88005cf42240/0xed5adfd7af8ec0c4 lrc: 2/0,0 mode: CR/CR res: ?? rrc=?? type: ??? flags: 0x10000000000000 nid: local remote: 0x4869a6ac1a9ab3d4 expref: -99 pid: 514 timeout: 0 lvb_type: 3
00000080:00020000:1.0:1513687522.579791:0:514:0:(vvp_io.c:1495:vvp_io_init()) lustre: refresh file layout [0x200062e21:0x2735:0x0] error -22.
00000080:00200000:1.0:1513687522.580937:0:514:0:(vvp_io.c:312:vvp_io_fini()) [0x200062e21:0x2735:0x0] ignore/verify layout 0/0, layout version -2 need write layout 0, restore needed 0
00000080:00200000:1.0:1513687522.580940:0:514:0:(file.c:1423:ll_file_io_generic()) client.txt: 2 io complete with rc: -22, result: 0, restart: 0
00000080:00200000:1.0:1513687522.580942:0:514:0:(file.c:1459:ll_file_io_generic()) client.txt: write *ppos: 16777216, pos: 16777216, ret: 0, rc: -22

Dmesg log on MDS:

LustreError: 18669:0:(mdt_lvb.c:163:mdt_lvbo_fill()) lustre-MDT0000: expected 368 actual 344.
Comment by Jian Yu [ 08/Jan/18 ]

sanity-pfl test 15 in the same interop test session hit the same failure:

== sanity-pfl test 15: Verify component options for lfs find ========================================= 12:46:26 (1513687586)
dd: error writing '/mnt/lustre/d15.sanity-pfl/f1': Invalid argument

Debug log on client node:

00000080:00200000:1.0:1513687587.114545:0:6996:0:(vvp_io.c:312:vvp_io_fini()) [0x200062e21:0x2743:0x0] ignore/verify layout 1/0, layout version 0 need write layout 0, restore needed 0
00020000:00020000:1.0:1513687587.114768:0:6996:0:(lov_object.c:1220:lov_layout_change()) lustre-clilov-ffff88005daa9000: cannot apply new layout on [0x200062e21:0x2743:0x0] : rc = -22
00010000:00010000:1.0:1513687587.116338:0:6996:0:(ldlm_lock.c:800:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: ?? lock: ffff88005cf43440/0xed5adfd7af8ec4ec lrc: 3/1,0 mode: CR/CR res: ?? rrc=?? type: ??? flags: 0x10000000000000 nid: local remote: 0x4869a6ac1a9aca2b expref: -99 pid: 6996 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1513687587.116341:0:6996:0:(ldlm_lock.c:873:ldlm_lock_decref_internal()) ### add lock into lru list ns: ?? lock: ffff88005cf43440/0xed5adfd7af8ec4ec lrc: 2/0,0 mode: CR/CR res: ?? rrc=?? type: ??? flags: 0x10000000000000 nid: local remote: 0x4869a6ac1a9aca2b expref: -99 pid: 6996 timeout: 0 lvb_type: 3
00000080:00020000:1.0:1513687587.116345:0:6996:0:(vvp_io.c:1495:vvp_io_init()) lustre: refresh file layout [0x200062e21:0x2743:0x0] error -22.
00000080:00200000:1.0:1513687587.117486:0:6996:0:(vvp_io.c:312:vvp_io_fini()) [0x200062e21:0x2743:0x0] ignore/verify layout 0/0, layout version -2 need write layout 0, restore needed 0
00000080:00200000:1.0:1513687587.117488:0:6996:0:(file.c:1423:ll_file_io_generic()) f1: 2 io complete with rc: -22, result: 0, restart: 0
00000080:00200000:1.0:1513687587.117489:0:6996:0:(file.c:1459:ll_file_io_generic()) f1: write *ppos: 1048576, pos: 1048576, ret: 0, rc: -22
Comment by Jinshan Xiong (Inactive) [ 08/Jan/18 ]

It turned out that the b2_10 branch does not clear the lcm_flags or lcm_padding fields when packing the layout on the server side. The corresponding code is in lod_generate_lovea():

        lcm = (struct lov_comp_md_v1 *)lmm;
        lcm->lcm_magic = cpu_to_le32(LOV_MAGIC_COMP_V1);
        lcm->lcm_entry_count = cpu_to_le16(comp_cnt);
        /* lcm_flags and the padding bytes are never zeroed here, so they
         * keep whatever stale data was left in the buffer */

        offset = sizeof(*lcm) + sizeof(*lcme) * comp_cnt;
        LASSERT(offset % sizeof(__u64) == 0);

This confuses b2_11 clients: lcm_flags and lcm_mirror_count end up holding random values, fail the client-side sanity check, and the layout is rejected with -EINVAL (rc = -22).

I think the best way to fix this problem is to land a b2_10 patch that clears the corresponding fields when the layout is generated.
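
For illustration only, a minimal sketch of that approach in lod_generate_lovea() would be to zero the composite header before filling in the known fields, so that lcm_flags and the padding (which 2.11 clients read as lcm_mirror_count) are guaranteed to be zero. This is an assumed sketch of the idea, not the contents of the patch that was actually landed:

        lcm = (struct lov_comp_md_v1 *)lmm;
        /* assumed fix sketch: zero the header so lcm_flags and the
         * padding/lcm_mirror_count bytes are deterministic */
        memset(lcm, 0, sizeof(*lcm));
        lcm->lcm_magic = cpu_to_le32(LOV_MAGIC_COMP_V1);
        lcm->lcm_entry_count = cpu_to_le16(comp_cnt);

        offset = sizeof(*lcm) + sizeof(*lcme) * comp_cnt;
        LASSERT(offset % sizeof(__u64) == 0);

Either zeroing the whole header or explicitly setting lcm_flags and the padding to 0 would have the same interop effect for 2.11 clients.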

Comment by Gerrit Updater [ 08/Jan/18 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30784
Subject: LU-10437 lod: clear layout header when generating layout
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 2eb12e91fea8f57b1c00dac7d756bedcda4aee1f

Comment by Gerrit Updater [ 08/Jan/18 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30785
Subject: LU-10437 lod: clear layout header when generating layout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b34f3ccad87cbcdd9b0c2bd4a84d6221735dc9dd

Comment by Gerrit Updater [ 20/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30785/
Subject: LU-10437 lod: clear layout header when generating layout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 47d6ce20cfd8b04f20f7fc7accc39b3902780900

Comment by Peter Jones [ 20/Jan/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 02/Feb/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30784/
Subject: LU-10437 lod: clear layout header when generating layout
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 0e38e97e2c4209ac31f3f6f9bc245da9a991006c
