[LU-17327] Write conf-sanity test case for online OST and MDT addition Created: 30/Nov/23  Updated: 03/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Jian Yu Assignee: Jian Yu
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17300 Avoid creating new dir/file/object on... Open
is related to LU-17334 Client should handle dir/file/object ... In Progress
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We need an automated test to exercise adding OSTs and MDTs online to a live filesystem that is under load.

Andreas provided this guidance:

Add a conf-sanity test to format MDTs and OSTs, mount the first half of them, start a workload (e.g. "rsync -a /etc /usr/lib $DIR/$tdir"), and then mount the second half of the MDTs and OSTs.

You can likely copy test_46a to test_46b and add in the rsync workload instead of waiting nicely for the second set of OSTs to be added.
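
The ordering described above can be sketched as follows. This is only an illustrative model in Python, not the conf-sanity shell test itself: `split_targets`, `online_addition`, and the `mount`, `start_workload` and `wait_workload` callbacks are hypothetical names, and giving the extra target to the first half when the count is odd is an assumption.

```python
from typing import Callable, List, Tuple


def split_targets(count: int) -> Tuple[List[int], List[int]]:
    """Split target indices 0..count-1 into a first and a second half.

    Assumption: with an odd count the extra target goes to the first half,
    so at least one target is mounted before the workload starts.
    """
    first = (count + 1) // 2
    return list(range(first)), list(range(first, count))


def online_addition(mdt_count: int, ost_count: int,
                    mount: Callable[[str, int], None],
                    start_workload: Callable[[], None],
                    wait_workload: Callable[[], int]) -> int:
    """Mount half of the targets, start the workload, then mount the rest."""
    mdt_first, mdt_late = split_targets(mdt_count)
    ost_first, ost_late = split_targets(ost_count)

    # Bring up the initial filesystem.
    for i in mdt_first:
        mount("mdt", i)
    for i in ost_first:
        mount("ost", i)

    # Start the load in the background (e.g. the rsync from the description).
    start_workload()

    # Add the remaining targets while the workload is still running.
    for i in mdt_late:
        mount("mdt", i)
    for i in ost_late:
        mount("ost", i)

    # Return the workload's exit status; non-zero means the test failed.
    return wait_workload()
```

The key point is that the second half is mounted without waiting for the workload to finish, which is what distinguishes this from waiting nicely for the new targets as in test_46a.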



 Comments   
Comment by Gerrit Updater [ 30/Nov/23 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53300
Subject: LU-17327 tests: add test case for online MDT and OST addition
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 61e7844a131c94ddb58f367ef134e5961c3b143e

Comment by Andreas Dilger [ 04/Dec/23 ]

It looks like the test case has exposed an expected failure case where the MDS created a file with an object on the newly-added OST but the client wasn't aware of the new OST yet:

[  183.117948] Lustre: DEBUG MARKER: == conf-sanity test 46b: online OST and MDT addition ===== 16:48:36 (1701380916)
[  206.830361] Lustre: Mounted lustre-client
[  214.067224] LustreError: 14230:0:(lov_ea.c:279:lsme_unpack()) lustre-clilov_UUID: OST index 1 more than OST count 1
[  214.070240] Lustre: 14230:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x2ab:1025, magic 0x0bd10bd0, pattern 0x1
[  214.072822] Lustre: 14230:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
[  214.075459] Lustre: 14230:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 1 subobj 0x2c0000401:2
[  214.078104] LustreError: 14230:0:(lcommon_cl.c:196:cl_file_inode_init()) lustre: failed to initialize cl_object [0x200000401:0x2ab:0x0]: rc = -22
[  214.081653] LustreError: 14230:0:(llite_lib.c:3613:ll_prep_inode()) new_inode -fatal: rc -22
[  460.900709] Lustre: DEBUG MARKER: conf-sanity test_46b: @@@@@@ FAIL: rsync failed

Issue LU-17334 is tracking the fix for the client to handle this case gracefully, while LU-17300 is tracking the fix to avoid creating such files in the first place. Both fixes are useful to implement for interop and reliability reasons.
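
To make the race concrete, here is a minimal Python model of the bounds check implied by the "OST index 1 more than OST count 1" message. It is not the actual lsme_unpack() code; `unpack_stripe` and its parameters are hypothetical names, and the only behavior modeled is that a stripe index must be below the OST count the client currently knows about.

```python
import errno


def unpack_stripe(ost_index: int, client_ost_count: int) -> int:
    """Return 0 if the client can map the stripe, else -EINVAL.

    Valid OST indices are 0..client_ost_count-1, so an index equal to the
    count refers to an OST the client has not learned about yet.
    """
    if ost_index >= client_ost_count:
        return -errno.EINVAL
    return 0


# The MDS already allocates on the new OST (index 1) while the client
# still only knows about one OST, matching the rc = -22 in the log above.
rc_before = unpack_stripe(ost_index=1, client_ost_count=1)
# Once the client has processed the new OST, the same layout is accepted.
rc_after = unpack_stripe(ost_index=1, client_ost_count=2)
```

In this model the layout is never wrong on the MDS side; the client simply sees it too early.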

Comment by Andreas Dilger [ 04/Dec/23 ]

I looked through the test results on Gerrit Janitor and 100% of the test runs for the new test_46b failed, but 40 of the 44 runs failed only because they ran out of space while copying the source trees into the test filesystem:

Started lustre-OST0001
waiting for rsync to finish
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/kbd/keymaps/xkb/.hr-alternatequotes.map.gz.QHFVy3" failed: No space left on device (28)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/kbd/keymaps/xkb/.hr-unicode.map.gz.lliNpP" failed: No space left on device (28)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/kbd/keymaps/xkb/.hr-unicodeus.map.gz.AAhphB" failed: No space left on device (28)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/kbd/keymaps/xkb/.hr-us.map.gz.ID9K9m" failed: No space left on device (28)
:

There were 4 cases that failed due to the MDS creating a file on a new OST that the client didn't know existed yet (with errors on the client console as in the previous comment):

waiting for rsync to finish
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/etc/.cron.deny.PDs0S0" failed: Invalid argument (22)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/etc/.crypttab.FFAEg4" failed: Invalid argument (22)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/etc/.csh.login.f7BhT7" failed: Invalid argument (22)
rsync: write failed on "/mnt/lustre/d46b.conf-sanity/lib/locale/locale-archive": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.2]
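
Separating the two failure modes by hand is tedious, so a small script along these lines could tally rsync errors by errno (28 = ENOSPC, 22 = EINVAL). This is a sketch only: `count_errnos` is a hypothetical helper, and the regular expression assumes the rsync message format shown in the excerpts above.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   rsync: mkstemp "<path>" failed: Invalid argument (22)
#   rsync: write failed on "<path>": No space left on device (28)
# but not the trailing "rsync error: ..." summary line.
RSYNC_ERR = re.compile(r'^rsync: .*: .+ \((\d+)\)$')


def count_errnos(lines: Iterable[str]) -> Counter:
    """Count rsync per-file failures keyed by the errno in parentheses."""
    counts: Counter = Counter()
    for line in lines:
        match = RSYNC_ERR.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

A run with any errno 22 entries hit the new-OST race, while a run with only errno 28 entries just ran out of space.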

Comment by Andreas Dilger [ 04/Dec/23 ]

The test runs on Autotest were much more likely to hit the object-on-new-OST creation race, with a long list of files being created on new OSTs:
https://testing.whamcloud.com/test_sets/be67a6fa-8bc9-4538-be28-dcf2688028a3
https://testing.whamcloud.com/test_sets/036c867c-6d9c-4c1c-9403-63b24195a873

rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/dracut/modules.d/99shutdown/.shutdown.sh.iiLyky" failed: Invalid argument (22)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/dracut/modules.d/99squash/.shchkdir.V60fMT" failed: Invalid argument (22)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/dracut/modules.d/99squash/.module-setup.sh.TDJhBN" failed: Invalid argument (22)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/firewalld/helpers/.RAS.xml.zYZx1S" failed: Invalid argument (22)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/firewalld/helpers/.amanda.xml.okGbNj" failed: Invalid argument (22)
rsync: mkstemp "/mnt/lustre/d46b.conf-sanity/lib/firewalld/helpers/.ftp.xml.Rsg5G5" failed: Invalid argument (22)
:

The client console logs show this error was hit for each new OST addition.

This is likely because Autotest has more OSTs to add (6) than Janitor (only 1):

:
[  605.944711] LustreError: 41994:0:(lcommon_cl.c:196:cl_file_inode_init()) lustre: failed to initialize cl_object [0x200000401:0x5bc:0x0]: rc = -22
[  605.946775] LustreError: 41994:0:(lcommon_cl.c:196:cl_file_inode_init()) Skipped 8 previous similar messages
[  605.948389] LustreError: 41994:0:(llite_lib.c:3613:ll_prep_inode()) new_inode -fatal: rc -22
[  605.949762] LustreError: 41994:0:(llite_lib.c:3613:ll_prep_inode()) Skipped 8 previous similar messages
[  606.954945] LustreError: 41994:0:(lov_ea.c:279:lsme_unpack()) lustre-clilov_UUID: OST index 2 more than OST count 2
[  606.956738] LustreError: 41994:0:(lov_ea.c:279:lsme_unpack()) Skipped 25 previous similar messages
[  606.958203] Lustre: 41994:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x663:1025, magic 0x0bd10bd0, pattern 0x1
[  606.959944] Lustre: 41994:0:(lov_pack.c:57:lov_dump_lmm_common()) Skipped 25 previous similar messages
[  606.961495] Lustre: 41994:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
[  606.963239] Lustre: 41994:0:(lov_pack.c:61:lov_dump_lmm_common()) Skipped 25 previous similar messages
[  606.964782] Lustre: 41994:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 2 subobj 0x2c0000400:37
[  606.966319] Lustre: 41994:0:(lov_pack.c:81:lov_dump_lmm_objects()) Skipped 25 previous similar messages
[  606.967869] LustreError: 41994:0:(lcommon_cl.c:196:cl_file_inode_init()) lustre: failed to initialize cl_object [0x200000401:0x663:0x0]: rc = -22
[  606.969972] LustreError: 41994:0:(lcommon_cl.c:196:cl_file_inode_init()) Skipped 25 previous similar messages
[  606.971572] LustreError: 41994:0:(llite_lib.c:3613:ll_prep_inode()) new_inode -fatal: rc -22
[  606.972935] LustreError: 41994:0:(llite_lib.c:3613:ll_prep_inode()) Skipped 25 previous similar messages
[  626.075632] LustreError: 41994:0:(lov_ea.c:279:lsme_unpack()) lustre-clilov_UUID: OST index 5 more than OST count 5
[  626.082235] LustreError: 41994:0:(lov_ea.c:279:lsme_unpack()) Skipped 5 previous similar messages
[  626.083709] Lustre: 41994:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x147a:1025, magic 0x0bd10bd0, pattern 0x1
[  626.085377] Lustre: 41994:0:(lov_pack.c:57:lov_dump_lmm_common()) Skipped 5 previous similar messages
[  626.086871] Lustre: 41994:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
[  626.088539] Lustre: 41994:0:(lov_pack.c:61:lov_dump_lmm_common()) Skipped 5 previous similar messages
[  626.090037] Lustre: 41994:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 5 subobj 0x380000400:2
[  626.091542] Lustre: 41994:0:(lov_pack.c:81:lov_dump_lmm_objects()) Skipped 5 previous similar messages
[  626.093115] LustreError: 41994:0:(lcommon_cl.c:196:cl_file_inode_init()) lustre: failed to initialize cl_object [0x200000401:0x147a:0x0]: rc = -22
[  626.095205] LustreError: 41994:0:(lcommon_cl.c:196:cl_file_inode_init()) Skipped 5 previous similar messages
[  626.096842] LustreError: 41994:0:(llite_lib.c:3613:ll_prep_inode()) new_inode -fatal: rc -22
[  626.098230] LustreError: 41994:0:(llite_lib.c:3613:ll_prep_inode()) Skipped 5 previous similar messages
[  632.765740] LustreError: 41994:0:(lov_ea.c:279:lsme_unpack()) lustre-clilov_UUID: OST index 6 more than OST count 6
[  632.767467] LustreError: 41994:0:(lov_ea.c:279:lsme_unpack()) Skipped 36 previous similar messages
[  632.768944] Lustre: 41994:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x16a9:1025, magic 0x0bd10bd0, pattern 0x1
[  632.770626] Lustre: 41994:0:(lov_pack.c:57:lov_dump_lmm_common()) Skipped 36 previous similar messages
[  632.772137] Lustre: 41994:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
[  632.773800] Lustre: 41994:0:(lov_pack.c:61:lov_dump_lmm_common()) Skipped 36 previous similar messages
[  632.775283] Lustre: 41994:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 6 subobj 0x3c0000400:2
[  632.776766] Lustre: 41994:0:(lov_pack.c:81:lov_dump_lmm_objects()) Skipped 36 previous similar messages
[  632.778285] LustreError: 41994:0:(lcommon_cl.c:196:cl_file_inode_init()) lustre: failed to initialize cl_object [0x200000401:0x16a9:0x0]: rc = -22
[  632.780422] LustreError: 41994:0:(lcommon_cl.c:196:cl_file_inode_init()) Skipped 36 previous similar messages
:
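
To confirm which OST additions hit the race, the console log can be summarized with a short script like the one below. It is a sketch under the assumption that the lsme_unpack() messages keep the format shown above; `summarize_lsme` is a hypothetical helper name, and the skipped counts are only totalled, since the log does not say which OST index the suppressed messages referred to.

```python
import re
from typing import Dict, Iterable, List, Union

# "... lsme_unpack()) lustre-clilov_UUID: OST index 2 more than OST count 2"
LSME_ERR = re.compile(
    r'lsme_unpack\(\)\) \S+: OST index (\d+) more than OST count (\d+)')
# "... lsme_unpack()) Skipped 25 previous similar messages"
LSME_SKIPPED = re.compile(
    r'lsme_unpack\(\)\) Skipped (\d+) previous similar messages')


def summarize_lsme(lines: Iterable[str]) -> Dict[str, Union[List[int], int]]:
    """Collect the OST indices reported as unknown and the message totals."""
    indices = set()
    printed = 0
    skipped = 0
    for line in lines:
        err = LSME_ERR.search(line)
        if err:
            indices.add(int(err.group(1)))
            printed += 1
            continue
        skip = LSME_SKIPPED.search(line)
        if skip:
            skipped += int(skip.group(1))
    return {"indices": sorted(indices), "printed": printed, "skipped": skipped}
```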

Comment by Gerrit Updater [ 06/Dec/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53343
Subject: LU-17327 tests: conf-sanity/46b to avoid QOS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 30534d63510a1cefe07f7a79934b6f1428d6aa18

Comment by Gerrit Updater [ 08/Dec/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53390
Subject: LU-17327 tests: conf-sanity/46c to avoid MDT balance
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1bfdb01a73c30f5ac5e32e1a10015b1f907019b3

Generated at Sat Feb 10 03:34:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.