[LU-11643] create disk images for Lustre 2.10 and 2.12 for ldiskfs Created: 08/Nov/18  Updated: 26/Oct/23  Resolved: 07/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.15.0

Type: Question/Request Priority: Minor
Reporter: Andreas Dilger Assignee: Sarah Liu
Resolution: Fixed Votes: 0
Labels: LTS12

Issue Links:
Duplicate
is duplicated by LU-10998 Add 2.10.0 zfs and ldiskfs images for... Resolved
Related
is related to LU-13514 conf-sanity test_32a: Timeout occurre... Resolved
is related to LU-9207 Create new conf-sanity test_32 disk i... Resolved
is related to LU-10215 disk2_5-ldiskfs.tar.bz2 is not packag... Open
is related to LU-15751 Interop conf-sanity test_32b: Fails l... In Progress
is related to LU-16082 old-style Lustre EA inodes support is... Resolved
is related to LU-11625 ls -l returns "ls: cannot access Inv... Resolved
is related to LU-15506 conf-sanity test_32b failed with 2: s... Resolved
is related to LU-14772 split conf-sanity into 2 or 3 parts In Progress
is related to LU-13604 rebase Lustre e2fsprogs onto 1.45.6 Resolved
is related to LU-14853 conf-sanity: create ldiskfs and zfs f... Open
is related to LU-15482 create zfs disk images for Lustre 2.1... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

We need to create disk images for Lustre 2.10.0 and 2.12.0 for conf-sanity test_32 upgrade regression testing, similar to the existing lustre/tests/disk*.tar.bz2 files.

These new test filesystems should include files using the new features for those releases, and tests that verify they are working properly:

  • PFL file layouts
  • project quotas
  • FLR mirrored files
  • DoM files on the MDTs

Also, some of the files need to be striped over multiple OSTs (e.g. "lfs setstripe -c 2"), but store data on only a single object (i.e. size = 4KB). This relates to an LFSCK issue that I saw with filter_fid and would be good to test.
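For illustration only, a minimal sketch of creating files that exercise those features with lfs (apart from pfl_dir/pfl_file and project_quota_dir, the paths, sizes, and IDs below are placeholders, not the exact commands used to build the images):

	# PFL: first 1MB on a single OST, the remainder striped over two
	lfs setstripe -E 1M -c 1 -E -1 -c 2 /mnt/lustre/pfl_dir/pfl_file

	# Project quota: set a project ID and inherit flag on a directory, then limits
	lfs project -sp 1000 /mnt/lustre/project_quota_dir
	lfs setquota -p 1000 -b 10M -B 20M -i 1000 -I 2000 /mnt/lustre

	# FLR: a file with two mirrors (not available in 2.10)
	lfs mirror create -N2 /mnt/lustre/flr_dir/flr_file

	# DoM: keep the first 1MB of the file on the MDT (not available in 2.10)
	lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/dom_dir/dom_file

	# Striped over two OSTs but with data in only a single 4KB object
	lfs setstripe -c 2 /mnt/lustre/wide_small_file
	dd if=/dev/zero of=/mnt/lustre/wide_small_file bs=4k count=1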



 Comments   
Comment by Gerrit Updater [ 03/Jun/19 ]

Wei Liu (sarah@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35049
Subject: LU-11643 tests: add new images for upgrade tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b053a3ca368455db8c253c16b957d4271977d669

Comment by Gerrit Updater [ 21/Feb/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37680
Subject: LU-11643 libext2fs: don't use O_DIRECT for files on tmpfs
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: ad26a8f7d33f4d88a66a32b6898a71b518d9c0eb

Comment by Gerrit Updater [ 23/Feb/20 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/37680/
Subject: LU-11643 libext2fs: don't use O_DIRECT for files on tmpfs
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: f94e4eef3f9d6a749798d4daddb6f67b708ea8c1

Comment by Gerrit Updater [ 14/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35049/
Subject: LU-11643 tests: add new images and tests for upgrade tests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6b979daaffc36aeef145316b41d0e2fe8abcf20f

Comment by Peter Jones [ 14/May/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 15/May/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38619
Subject: LU-11643 tests: revert new images and tests for upgrade patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 07dc46059ec40cfb1b18e7f40bc35c185fc5ea33

Comment by Gerrit Updater [ 17/May/20 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/38619/
Subject: LU-11643 tests: revert new images and tests for upgrade patch
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ebaf3b1b9980bcea6e404e03dfd2ac48f21cc3fc

Comment by Andreas Dilger [ 17/May/20 ]

Patch was reverted because of test failures.

Comment by Sarah Liu [ 22/Jan/21 ]

I have tried the disk2_12-ldiskfs image (2 MDTs, 2 OSTs) again and found that it hangs in test_32a when trying to mount the client without writeconf; the client side shows the following errors:

[207084.287987] Lustre: DEBUG MARKER: == conf-sanity testing /usr/lib64/lustre/tests/disk2_12-ldiskfs.tar.bz2 upgrade ====================== 03:36:58 (1611286618)
[207175.053537] Lustre: DEBUG MARKER: == conf-sanity nid = 10.9.6.32@tcp mount -t lustre 10.9.6.32@tcp:/t32fs /tmp/t32/mnt/lustre ========== 03:38:29 (1611286709)
[207232.567915] LustreError: 30549:0:(mgc_request.c:1578:mgc_apply_recover_logs()) mgc: cannot find UUID by nid '10.9.6.32@tcp': rc = -2
[207232.570266] Lustre: 30549:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.9.6.32@tcp: error processing recovery log t32fs-cliir: rc = -2
[207232.572846] Lustre: 30549:0:(mgc_request.c:2149:mgc_process_log()) MGC10.9.6.32@tcp: IR log t32fs-cliir failed, not fatal: rc = -2
[207256.403162] LustreError: 30549:0:(lmv_obd.c:1263:lmv_statfs()) t32fs-MDT0000-mdc-ffff91bf366cf800: can't stat MDS #0: rc = -11
[207256.424154] Lustre: Unmounted t32fs-client
[207256.425232] LustreError: 30549:0:(obd_mount.c:1680:lustre_fill_super()) Unable to mount  (-11)
[207312.563317] LustreError: 30588:0:(mgc_request.c:1578:mgc_apply_recover_logs()) mgc: cannot find UUID by nid '10.9.6.32@tcp': rc = -2
[207312.565523] LustreError: 30588:0:(mgc_request.c:1578:mgc_apply_recover_logs()) Skipped 1 previous similar message
[207312.567342] Lustre: 30588:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.9.6.32@tcp: error processing recovery log t32fs-cliir: rc = -2
[207312.569687] Lustre: 30588:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) Skipped 1 previous similar message
[207312.571612] Lustre: 30588:0:(mgc_request.c:2149:mgc_process_log()) MGC10.9.6.32@tcp: IR log t32fs-cliir failed, not fatal: rc = -2
[207312.573790] Lustre: 30588:0:(mgc_request.c:2149:mgc_process_log()) Skipped 1 previous similar message
[207322.606194] LustreError: 11-0: t32fs-MDT0000-mdc-ffff91bf793af000: operation mds_connect to node 10.9.6.32@tcp failed: rc = -11
[207402.734185] LustreError: 11-0: t32fs-MDT0000-mdc-ffff91bf793a8800: operation mds_connect to node 10.9.6.32@tcp failed: rc = -11
[207456.137501] LustreError: 30588:0:(lmv_obd.c:1263:lmv_statfs()) t32fs-MDT0000-mdc-ffff91bf793a8800: can't stat MDS #0: rc = -11
[207456.158467] Lustre: Unmounted t32fs-client
[207456.159584] LustreError: 30588:0:(obd_mount.c:1680:lustre_fill_super()) Unable to mount  (-11)

The client mount code was added by LU-12846, patch: https://review.whamcloud.com/#/c/37636/7

		$MOUNT_CMD $nid:/$fsname $tmp/mnt/lustre || {
			error_noexit "Mounting the client"
			return 1
		}

		[[ $(do_facet mds1 pgrep orph_.*-MDD | wc -l) == 0 ]] ||
			error "MDD orphan cleanup thread not quit"

		umount $tmp/mnt/lustre || {
			error_noexit "Unmounting the client"
			return 1
		}
Comment by Gerrit Updater [ 10/Dec/21 ]

"Wei Liu <sarah@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45827
Subject: LU-11643 tests: add new images and tests for upgrade tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 413c3b75ba5fdfdc7e741489c3b0ba23c46e31c4

Comment by Gerrit Updater [ 26/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45827/
Subject: LU-11643 tests: add new images and tests for upgrade tests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f2143c0790bb1cb802fad7e81bcb386dc0245b36

Comment by Peter Jones [ 26/Jan/22 ]

Landed for 2.15

Comment by Andreas Dilger [ 28/Jan/22 ]

It looks like the newly landed images for 2.10 and 2.12 fail conf-sanity test_32a whenever they are run. The patch passed testing because it used:

Test-Parameters: fstype=ldiskfs envdefinitions=ONLY="32f 32g" testlist=conf-sanity

which did not run test 32a-e on the new images. The test is consistently hanging at:

:
CMD: onyx-41vm7 /usr/sbin/lctl conf_param t32fs-OST0000.osc.max_dirty_mb=15
CMD: onyx-41vm7 /usr/sbin/lctl conf_param t32fs-MDT0000.mdc.max_rpcs_in_flight=9
CMD: onyx-41vm7 /usr/sbin/lctl conf_param t32fs-MDT0000.lov.stripesize=4M
CMD: onyx-41vm7 /usr/sbin/lctl conf_param t32fs-MDT0000.mdd.atime_diff=70
CMD: onyx-41vm7 /usr/sbin/lctl pool_new t32fs.interop
onyx-41vm7: Pool t32fs.interop created

The next line of output for a passing test is:

CMD: onyx-65vm11 pgrep orph_.*-MDD

There are still a number of patches currently passing this test because they have not yet been rebased to include this patch, or are themselves marked trivial and do not run this test; conversely, conf-sanity was failing 100% of the time on master-next.

I will push a patch to temporarily disable the new images from being used for this test.
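A minimal sketch of the kind of change, assuming the upgrade images are collected into a shell variable inside the test (the variable name and filtering below are illustrative, not the actual patch):

	# Illustrative only: leave the 2.10/2.12 ldiskfs images out of the
	# upgrade run until the hangs are understood.
	tarballs=$(echo "$tarballs" | grep -Ev 'disk2_1[02]-ldiskfs')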

Comment by Gerrit Updater [ 28/Jan/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46353
Subject: LU-11643 tests: skip some conf-sanity test_32 tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fcac417a66e49f99522e4d124783e43bb36f793b

Comment by Peter Jones [ 29/Jan/22 ]

Is this fixed by the landing of https://review.whamcloud.com/46354/?

Comment by Sarah Liu [ 01/Feb/22 ]

test_32a hanging with the new disk images is a known issue; I commented on it at https://jira.whamcloud.com/browse/LU-11643?focusedCommentId=290110&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-290110. That "mount client" code was added by LU-12846, and if I do a writeconf, the mount passes.
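For reference, a minimal sketch of what a writeconf of the image's targets would look like before mounting, assuming loop devices for the MDTs/OSTs (the device paths are assumptions, not taken from the test script):

	# all targets must be stopped before regenerating the config logs
	tunefs.lustre --writeconf /dev/loop0   # MDT0000 (also the MGS)
	tunefs.lustre --writeconf /dev/loop1   # MDT0001
	tunefs.lustre --writeconf /dev/loop2   # OST0000
	tunefs.lustre --writeconf /dev/loop3   # OST0001
	# then remount the MGS/MDTs first, the OSTs next, and finally the client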

Comment by Andreas Dilger [ 02/Feb/22 ]

Yes, the patch https://review.whamcloud.com/46354 "LU-13514 tests: replace nid in conf-sanity test_32" fixed the conf-sanity test_32a failure and allowed the mdt2 image to be mounted.

The current problem is that test_32b looks like the filesystem image is missing files in the ROOT directory that the test expects to see when there are multiple MDTs (LU-15506). It looks like the test is failing before it checks the striped_dir:

== checking sha1sums ==
CMD: onyx-71vm5 cat /tmp/t32/sha1sums
/tmp/t32/mnt/lustre
--- /tmp/t32/sha1sums.orig	2022-02-01 23:27:02.120240417 +0000
+++ /tmp/t32/sha1sums	2022-02-01 23:27:02.123240424 +0000
@@ -1,10 +0,0 @@
-59ced6686342e5fdff70a29277632622ad271168  ./init.d/functions
-ff4f8d1bcd9ab4a9edcf77496e23963e5c6f6a2c  ./init.d/lsvcgss
-f8f634b92b75af4112634a6f14464e562cd82454  ./init.d/lustre
-dff7d87de75271f0714c3b82921d40c96598f67a  ./init.d/netconsole
-21414c2b3c89f95d3eab00dafc954d3f6cf3ba9f  ./init.d/network
-f87a11aceaf7dc0e1614ea074fda14d6896ac66f  ./init.d/README
-92624163580750ca250a2c1cc8bd531d0609702a  ./init.d/rhnsd
-a17ecaeb91c0218092c8b01308a132698da9b81f  ./pfl_dir/pfl_file
-da39a3ee5e6b4b0d3255bfef95601890afd80709  ./project_quota_dir/pj_quota_file_old
-2c72448b440f16c9fae18e287ca827c25d29a7cb  ./rc.local
==** find returned files **==

My patch https://review.whamcloud.com/46404 "LU-15506 tests: improve conf-sanity test_32 messages" is adding some debugging to make it clearer which phase the test is running and where the files are missing. I haven't had time to actually mount the disk2_10 filesystem locally to check whether the files are there; I have been doing everything via autotest/Maloo.

The best solution would be fixing the test_32newtarball() function to properly populate the image with what the tests expect, if that is really the problem, rather than fixing the images by hand.

It also looks like the disk2_5-ldiskfs.tar.bz2 image has a similar problem, which is causing the Janitor test failures, but it has never been tested because it wasn't included in the lustre-tests RPM due to not being added to lustre/tests/Makefile.am. I'm not sure whether it is worthwhile to fix that image at this point, or maybe it should be removed from Git entirely (though we would "lose" some test coverage in this case).

Comment by Andreas Dilger [ 02/Feb/22 ]

Looking at the Janitor testing, test_32c also failed in the same way when checking the striped_dir: the files are missing, so it may not be an image problem. What is needed at this point is to check with master whether the mounted filesystem shows the files present in the root directory and striped_dir, and then check whether this matches 2.12, or whether there really is an upgrade problem.
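For anyone reproducing this locally, a hedged sketch of inspecting the image directly, assuming the tarball unpacks to per-target ldiskfs file images (the file name t32fs-mdt1 below is an assumption) and that the ldiskfs module is available:

	mkdir -p /tmp/t32 /mnt/mdt
	tar xjf lustre/tests/disk2_12-ldiskfs.tar.bz2 -C /tmp/t32
	mount -t ldiskfs -o loop,ro /tmp/t32/t32fs-mdt1 /mnt/mdt
	ls /mnt/mdt/ROOT/          # the files listed in sha1sums should appear here
	umount /mnt/mdt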

Comment by Sarah Liu [ 02/Feb/22 ]

For these two images, when doing the sha1sums check, the test needs to go into the "remote_dir" directory instead of ROOT or striped_dir. I think we need to set "pfl_upgrade=yes" (or the other corresponding flags) in the test_32x subtests that failed this part; see the sketch after the excerpt below.

	if $r test -f $tmp/sha1sums; then
		# LU-2393 - do both sorts on same node to ensure locale
		# is identical
		$r cat $tmp/sha1sums | sort -k 2 >$tmp/sha1sums.orig
		if [ "$dne_upgrade" != "no" ]; then
			pushd $tmp/mnt/lustre/striped_dir
		elif [ "$pfl_upgrade" != "no" ] ||
			[ "$flr_upgrade" != "no" ] ||
			[ "$dom_new_upgrade" != "no" ] ||
			[ "$project_quota_upgrade" != "no" ]; then
			pushd $tmp/mnt/lustre/remote_dir
		else
			pushd $tmp/mnt/lustre
		fi

I will try locally to verify this first.

Comment by Andreas Dilger [ 03/Feb/22 ]

It looks like the test should be checking both remote_dir and striped_dir_old. A few parts of the test check for dne_upgrade, but they should always run that part of the test if mds2_is_available is true and the directory exists (as sketched below). It may be that we don't need to regenerate the images at all.
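A minimal sketch of that idea, assuming mds2_is_available is a boolean shell variable inside t32_test() and that the image's pre-created striped directory is named striped_dir_old (both are assumptions, not verified against the script):

	# Illustrative only: check the pre-created directories whenever a
	# second MDT exists, independent of the dne_upgrade flag.
	if $mds2_is_available && [ -d $tmp/mnt/lustre/striped_dir_old ]; then
		pushd $tmp/mnt/lustre/striped_dir_old
		# ... run the same sha1sums comparison here ...
		popd
	fi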

Please make any changes starting with my patch https://review.whamcloud.com/46404 so that we keep the debug messages.

Comment by Peter Jones [ 07/Mar/22 ]

Fixed in 2.15

Comment by Alexander Zarochentsev [ 28/Aug/22 ]

sarah,
which code was used to create the disk2_12 image? I am trying to add one more upgrade test (LU-16082) but cannot recreate the image easily.

Comment by Sarah Liu [ 29/Aug/22 ]

Hello,
The test_32newtarball() function I used for 2.12 should be in the master code.
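For reference, a hedged sketch of regenerating such an image on a node installed with the target Lustre release, assuming the 32newtarball case can be selected with ONLY (it may need to be enabled with run_test by hand) and the usual test-framework setup:

	# Illustrative only: run on the release the image should represent
	# (e.g. 2.12.x), with target counts matching the desired layout.
	cd /usr/lib64/lustre/tests
	MDSCOUNT=2 OSTCOUNT=2 ONLY=32newtarball sh conf-sanity.sh
	# the resulting disk*.tar.bz2 is left under $TMP (location is an assumption)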
