
create disk images for Lustre 2.10 and 2.12 for ldiskfs

Details

    • Question/Request
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • Lustre 2.12.0

    Description

      We need to create disk images for Lustre 2.10.0 and 2.12.0 for conf-sanity test_32 upgrade regression testing, similar to the existing lustre/tests/disk*.tar.bz2 files.

      These new test filesystems should include files using the new features for those releases, and tests that verify they are working properly:

      • PFL file layouts
      • project quotas
      • FLR mirrored files
      • DoM files on the MDTs

      Also, some of the files need to be striped over multiple OSTs (e.g. "lfs setstripe -c 2"), but store data on only a single object (i.e. size = 4KB). This relates to an LFSCK issue that I saw with filter_fid and would be good to test.
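A minimal sketch of commands that could populate such a test filesystem, assuming a client whose lfs supports every layout used here (availability depends on the release used to build the image). The function name populate_image, the flr_dir and dom_dir directories, the striped_small file, the component sizes, and project ID 1000 are all illustrative assumptions. By default the function only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: RUN defaults to "echo", so commands are printed, not run.
# Set RUN= (empty) on a mounted Lustre client to execute them for real.
# Directory names, sizes, and the project ID are hypothetical.
populate_image() {
	local mnt=$1 run=${RUN-echo}

	$run mkdir -p "$mnt/pfl_dir" "$mnt/project_quota_dir" "$mnt/flr_dir" "$mnt/dom_dir"

	# PFL: first 1 MiB on one stripe, remainder striped over two OSTs
	$run lfs setstripe -E 1M -c 1 -E -1 -c 2 "$mnt/pfl_dir/pfl_file"

	# Project quota: assign a project ID and make new files inherit it
	$run lfs project -s -p 1000 "$mnt/project_quota_dir"

	# FLR: a file with two mirrors
	$run lfs mirror create -N2 "$mnt/flr_dir/flr_file"

	# DoM: first 1 MiB stored on the MDT, remainder on one OST
	$run lfs setstripe -E 1M -L mdt -E -1 -c 1 "$mnt/dom_dir/dom_file"

	# Striped over two OSTs but holding only 4 KiB, so the data lands
	# in a single object and the second object stays empty
	$run lfs setstripe -c 2 "$mnt/striped_small"
	$run dd if=/dev/zero of="$mnt/striped_small" bs=4k count=1
}
```

Running `populate_image /mnt/lustre` prints the seven commands for review before anything touches a real filesystem.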

          Activity

            [LU-11643] create disk images for Lustre 2.10 and 2.12 for ldiskfs
            sarah Sarah Liu added a comment - - edited

            For these two images, the sha1sums check needs to run in the "remote_dir" directory instead of ROOT or striped_dir. I think we need to set "pfl_upgrade=yes" (or one of the other flags) in the test_32x subtests that failed this part.

            if $r test -f $tmp/sha1sums; then
                    # LU-2393 - do both sorts on same node to ensure locale
                    # is identical
                    $r cat $tmp/sha1sums | sort -k 2 >$tmp/sha1sums.orig
                    if [ "$dne_upgrade" != "no" ]; then
                            pushd $tmp/mnt/lustre/striped_dir
                    elif [ "$pfl_upgrade" != "no" ] ||
                         [ "$flr_upgrade" != "no" ] ||
                         [ "$dom_new_upgrade" != "no" ] ||
                         [ "$project_quota_upgrade" != "no" ]; then
                            pushd $tmp/mnt/lustre/remote_dir
                    else
                            pushd $tmp/mnt/lustre
                    fi
            

            I will try this locally to verify first.
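The directory choice in that excerpt can be sketched as a standalone function. This is only an illustration of the branch order, not the code in conf-sanity.sh; the function name sha1sums_check_dir and its positional arguments are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch of the directory choice made before the sha1sums comparison.
# Any flag value other than "no" (including an empty string) counts as
# enabled, mirroring the != "no" tests in the excerpt.
# Arguments: mount point, then the dne, pfl, flr, dom and project quota flags.
sha1sums_check_dir() {
	local mnt=$1 dne=$2 pfl=$3 flr=$4 dom=$5 pquota=$6

	if [ "$dne" != "no" ]; then
		echo "$mnt/striped_dir"
	elif [ "$pfl" != "no" ] || [ "$flr" != "no" ] ||
	     [ "$dom" != "no" ] || [ "$pquota" != "no" ]; then
		echo "$mnt/remote_dir"
	else
		echo "$mnt"
	fi
}
```

For example, `sha1sums_check_dir /tmp/t32/mnt/lustre no yes no no no` selects the remote_dir path, while setting every flag to "no" falls back to the mount point itself. Note that the DNE flag takes precedence over all the others.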


            adilger Andreas Dilger added a comment -

            Looking at the Janitor testing, it also failed test_32c in the same way. That test checks the striped_dir, but the files are missing, so it may not be an image problem. What is needed at this point is to check with master whether the mounted filesystem shows the files to be present in the root directory and striped_dir, and then check whether this matches 2.12, or if there really is an upgrade problem.

            adilger Andreas Dilger added a comment -

            Yes, the patch https://review.whamcloud.com/46354 "LU-13514 tests: replace nid in conf-sanity test_32" fixed the conf-sanity test_33a failure and allowed the mdt2 image to be mounted.

            The current problem is that in test_33b the filesystem image appears to be missing files in the ROOT directory that the test expects to see when there are multiple MDTs (LU-15506). It looks like the test is failing before it checks the striped_dir:

            == checking sha1sums ==
            CMD: onyx-71vm5 cat /tmp/t32/sha1sums
            /tmp/t32/mnt/lustre
            --- /tmp/t32/sha1sums.orig	2022-02-01 23:27:02.120240417 +0000
            +++ /tmp/t32/sha1sums	2022-02-01 23:27:02.123240424 +0000
            @@ -1,10 +0,0 @@
            -59ced6686342e5fdff70a29277632622ad271168  ./init.d/functions
            -ff4f8d1bcd9ab4a9edcf77496e23963e5c6f6a2c  ./init.d/lsvcgss
            -f8f634b92b75af4112634a6f14464e562cd82454  ./init.d/lustre
            -dff7d87de75271f0714c3b82921d40c96598f67a  ./init.d/netconsole
            -21414c2b3c89f95d3eab00dafc954d3f6cf3ba9f  ./init.d/network
            -f87a11aceaf7dc0e1614ea074fda14d6896ac66f  ./init.d/README
            -92624163580750ca250a2c1cc8bd531d0609702a  ./init.d/rhnsd
            -a17ecaeb91c0218092c8b01308a132698da9b81f  ./pfl_dir/pfl_file
            -da39a3ee5e6b4b0d3255bfef95601890afd80709  ./project_quota_dir/pj_quota_file_old
            -2c72448b440f16c9fae18e287ca827c25d29a7cb  ./rc.local
            ==** find returned files **==

            My patch https://review.whamcloud.com/46404 "LU-15506 tests: improve conf-sanity test_32 messages" adds some debugging to make it clearer which phase the test is running, and where the files are missing. I haven't had time to actually mount the disk2_10 filesystem locally and check whether the files are there; I have been doing everything via autotest/Maloo.

            The best solution would be fixing the test32newtarball() function to properly fill the image for what the tests expect, if that is really the problem, rather than fixing the images by hand.

            It also looks like the disk2_5-ldiskfs.tar.bz2 image has a similar problem, which is causing the Janitor test failures, but it has never been tested because it wasn't included in the lustre-tests RPM due to not being added to lustre/tests/Makefile.am. I'm not sure whether it is worthwhile to fix that image at this point, or maybe it should be removed from Git entirely (though we would "lose" some test coverage in this case).
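In the diff output in the comment above, the hunk header "@@ -1,10 +0,0 @@" means that all ten expected entries are absent and the new listing is empty, so the find on the mounted client returned no files at all. (The da39a3ee... hash is the SHA-1 of empty input, so pj_quota_file_old was an empty file when the image was built.) A minimal sketch of this style of check, assuming GNU coreutils sha1sum plus find, sort, and diff; the function names are hypothetical and this is not the code from conf-sanity.sh:

```shell
#!/usr/bin/env bash
# Sketch: record checksums for a tree, then verify them later.

# Print "hash  ./path" for every regular file under $1, sorted by path.
# LC_ALL=C pins the collation so both listings sort identically.
make_sha1sums() {
	(cd "$1" && find . -type f -exec sha1sum {} + | LC_ALL=C sort -k 2)
}

# Compare a saved listing ($1) with the current contents of a tree ($2).
# Returns 0 when they match; otherwise prints a unified diff and returns 1.
check_sha1sums() {
	local saved=$1 tree=$2 current rc=0

	current=$(mktemp) || return 2
	make_sha1sums "$tree" >"$current"
	diff -u "$saved" "$current" || rc=1
	rm -f "$current"
	return $rc
}
```

Saving `make_sha1sums /some/tree` to a file before the upgrade and calling `check_sha1sums` on the same tree afterwards reports any missing or changed file as a "-" line, in the same shape as the output above.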
            sarah Sarah Liu added a comment - - edited

            The test_32a hang with the new disk images is a known issue (see my comment at https://jira.whamcloud.com/browse/LU-11643?focusedCommentId=290110&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-290110 ). That "mount client" code was added by LU-12846. If I do a writeconf first, the mount passes.

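For reference, a writeconf pass over a set of targets can be sketched as below. This assumes the targets are already unmounted and are passed with the MDTs first, then the OSTs; the function name and the device paths in the usage line are hypothetical placeholders. By default the function only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of regenerating the configuration logs with writeconf.
# RUN defaults to "echo", so commands are printed, not executed;
# set RUN= (empty) to run them for real on unmounted targets.
writeconf_all() {
	local run=${RUN-echo} dev

	# Expected argument order: MDT devices first, then OST devices.
	for dev in "$@"; do
		$run tunefs.lustre --writeconf "$dev"
	done
}
```

For example, `writeconf_all /tmp/t32/mdt /tmp/t32/ost` prints one tunefs.lustre command per target. After a real run, the targets would be remounted (MDT before OSTs) before mounting the client.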
            pjones Peter Jones added a comment -

            Is this fixed by the landing of https://review.whamcloud.com/46354/?

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46353
            Subject: LU-11643 tests: skip some conf-sanity test_32 tests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fcac417a66e49f99522e4d124783e43bb36f793b

            adilger Andreas Dilger added a comment - - edited

            It looks like the newly landed images for 2.10 and 2.12 fail conf-sanity test_32a whenever they are run. The patch passed testing because it used:

            Test-Parameters: fstype=ldiskfs envdefinitions=ONLY="32f 32g" testlist=conf-sanit
            

            which did not run test 32a-e on the new images. The test is consistently hanging at:

            :
            CMD: onyx-41vm7 /usr/sbin/lctl conf_param t32fs-OST0000.osc.max_dirty_mb=15
            CMD: onyx-41vm7 /usr/sbin/lctl conf_param t32fs-MDT0000.mdc.max_rpcs_in_flight=9
            CMD: onyx-41vm7 /usr/sbin/lctl conf_param t32fs-MDT0000.lov.stripesize=4M
            CMD: onyx-41vm7 /usr/sbin/lctl conf_param t32fs-MDT0000.mdd.atime_diff=70
            CMD: onyx-41vm7 /usr/sbin/lctl pool_new t32fs.interop
            onyx-41vm7: Pool t32fs.interop created
            

            The next line of output for a passing test is:

            CMD: onyx-65vm11 pgrep orph_.*-MDD
            

            A number of patches are still passing this test, either because they have not yet been updated to include this patch or because they are marked trivial and do not run this test. Conversely, conf-sanity was failing 100% of the time on master-next.

            I will push a patch to temporarily disable the new images from being used for this test.
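To illustrate why ONLY="32f 32g" hid the problem, here is a simplified sketch of how an ONLY list restricts which subtests run. It is not the test framework's own code; the function name in_only_list is an assumption for the example.

```shell
#!/usr/bin/env bash
# Simplified illustration of ONLY-style subtest filtering.
# Returns 0 if the subtest ($1) should run given the ONLY list ($2).
in_only_list() {
	local subtest=$1 only=$2 name

	# An empty list means nothing is filtered out.
	[ -z "$only" ] && return 0

	for name in $only; do
		[ "$name" = "$subtest" ] && return 0
	done
	return 1
}

# With the Test-Parameters from the patch, 32a through 32e never run.
for t in 32a 32b 32c 32d 32e 32f 32g; do
	if in_only_list "$t" "32f 32g"; then
		echo "run  test_$t"
	else
		echo "skip test_$t"
	fi
done
```

A patch that adds new images would need a list covering every subtest that consumes them before its test results say anything about those subtests.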

            pjones Peter Jones added a comment -

            Landed for 2.15

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45827/
            Subject: LU-11643 tests: add new images and tests for upgrade tests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f2143c0790bb1cb802fad7e81bcb386dc0245b36

            gerrit Gerrit Updater added a comment -

            "Wei Liu <sarah@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45827
            Subject: LU-11643 tests: add new images and tests for upgrade tests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 413c3b75ba5fdfdc7e741489c3b0ba23c46e31c4

            sarah Sarah Liu added a comment - - edited

            I tried the disk2_12-ldiskfs image (2 MDTs, 2 OSTs) again and found that it hangs in test_32a when trying to mount the client without writeconf. The client side shows the following errors:

            [207084.287987] Lustre: DEBUG MARKER: == conf-sanity testing /usr/lib64/lustre/tests/disk2_12-ldiskfs.tar.bz2 upgrade ====================== 03:36:58 (1611286618)
            [207175.053537] Lustre: DEBUG MARKER: == conf-sanity nid = 10.9.6.32@tcp mount -t lustre 10.9.6.32@tcp:/t32fs /tmp/t32/mnt/lustre ========== 03:38:29 (1611286709)
            [207232.567915] LustreError: 30549:0:(mgc_request.c:1578:mgc_apply_recover_logs()) mgc: cannot find UUID by nid '10.9.6.32@tcp': rc = -2
            [207232.570266] Lustre: 30549:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.9.6.32@tcp: error processing recovery log t32fs-cliir: rc = -2
            [207232.572846] Lustre: 30549:0:(mgc_request.c:2149:mgc_process_log()) MGC10.9.6.32@tcp: IR log t32fs-cliir failed, not fatal: rc = -2
            [207256.403162] LustreError: 30549:0:(lmv_obd.c:1263:lmv_statfs()) t32fs-MDT0000-mdc-ffff91bf366cf800: can't stat MDS #0: rc = -11
            [207256.424154] Lustre: Unmounted t32fs-client
            [207256.425232] LustreError: 30549:0:(obd_mount.c:1680:lustre_fill_super()) Unable to mount  (-11)
            [207312.563317] LustreError: 30588:0:(mgc_request.c:1578:mgc_apply_recover_logs()) mgc: cannot find UUID by nid '10.9.6.32@tcp': rc = -2
            [207312.565523] LustreError: 30588:0:(mgc_request.c:1578:mgc_apply_recover_logs()) Skipped 1 previous similar message
            [207312.567342] Lustre: 30588:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.9.6.32@tcp: error processing recovery log t32fs-cliir: rc = -2
            [207312.569687] Lustre: 30588:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) Skipped 1 previous similar message
            [207312.571612] Lustre: 30588:0:(mgc_request.c:2149:mgc_process_log()) MGC10.9.6.32@tcp: IR log t32fs-cliir failed, not fatal: rc = -2
            [207312.573790] Lustre: 30588:0:(mgc_request.c:2149:mgc_process_log()) Skipped 1 previous similar message
            [207322.606194] LustreError: 11-0: t32fs-MDT0000-mdc-ffff91bf793af000: operation mds_connect to node 10.9.6.32@tcp failed: rc = -11
            [207402.734185] LustreError: 11-0: t32fs-MDT0000-mdc-ffff91bf793a8800: operation mds_connect to node 10.9.6.32@tcp failed: rc = -11
            [207456.137501] LustreError: 30588:0:(lmv_obd.c:1263:lmv_statfs()) t32fs-MDT0000-mdc-ffff91bf793a8800: can't stat MDS #0: rc = -11
            [207456.158467] Lustre: Unmounted t32fs-client
            [207456.159584] LustreError: 30588:0:(obd_mount.c:1680:lustre_fill_super()) Unable to mount  (-11)
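The return codes in this log are negated Linux errno values: -2 is ENOENT (the NID lookup in the recovery log found nothing) and -11 is EAGAIN (the MDS connect and statfs attempts did not complete). A small sketch for tallying them from a saved console log, with hypothetical function names; only the two values seen here are mapped:

```shell
#!/usr/bin/env bash
# Map the negative return codes seen in the log to Linux errno names.
# Only the two values from this log are covered; others print "unknown".
errno_name() {
	case "${1#-}" in
	2)  echo "ENOENT" ;;
	11) echo "EAGAIN" ;;
	*)  echo "unknown" ;;
	esac
}

# Read log text on stdin, extract every "rc = N" value, and print one
# line per distinct code in the form "<rc> <name> x<count>".
summarize_rc() {
	grep -oE 'rc = -?[0-9]+' | awk '{print $3}' | sort -n | uniq -c |
	while read -r count rc; do
		printf '%s %s x%s\n' "$rc" "$(errno_name "$rc")" "$count"
	done
}
```

Feeding the client log through `summarize_rc` separates the non-fatal ENOENT messages from the EAGAIN failures that actually abort the mount.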
            

            The client mount code was added by LU-12846, patch: https://review.whamcloud.com/#/c/37636/7

            $MOUNT_CMD $nid:/$fsname $tmp/mnt/lustre || {
                    error_noexit "Mounting the client"
                    return 1
            }

            [[ $(do_facet mds1 pgrep orph_.*-MDD | wc -l) == 0 ]] ||
                    error "MDD orphan cleanup thread not quit"

            umount $tmp/mnt/lustre || {
                    error_noexit "Unmounting the client"
                    return 1
            }
            

            People

              sarah Sarah Liu
              adilger Andreas Dilger