
change test-framework to format and mount targets in parallel

    Description

      It would be useful for a number of reasons to change test-framework.sh to format and mount the MDTs and OSTs in parallel (if not both MDTs and OSTs at the same time, then at least in two sets).

      • this would reduce testing time significantly for tests that reformat the filesystem (e.g. conf-sanity)
      • this would improve testing of the MGS to handle registering multiple targets in parallel (there are at least some known issues with this that could be found and fixed)
      • this would improve test coverage since filesystems are often mounted in parallel in production, and this would better simulate the real world
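
      A minimal bash sketch of the "two sets" idea described above, not the actual test-framework.sh change. `format_target` and `run_parallel` are hypothetical names introduced for illustration, and the format step is stubbed with a short sleep rather than a real mkfs call:

```shell
#!/bin/bash
# Hypothetical stand-in for the real per-target format step; it only
# sleeps briefly and prints which target it would have formatted.
format_target() {
	local target=$1
	sleep 0.1
	echo "formatted $target"
}

# Run a command once per target, all in the background, then wait for
# every job and return non-zero if any single job failed.
run_parallel() {
	local cmd=$1; shift
	local pids=() pid target rc=0

	for target in "$@"; do
		"$cmd" "$target" &
		pids+=($!)
	done

	for pid in "${pids[@]}"; do
		wait "$pid" || rc=1
	done
	return $rc
}

# First set: MDTs. Second set: OSTs, started only if the first set succeeded.
run_parallel format_target mdt1 mdt2 &&
	run_parallel format_target ost1 ost2 ost3 ost4
```

      Waiting on each PID individually, rather than calling a bare `wait`, lets the caller see whether any one target failed, which matters if a failed format must abort the test run.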

          Activity

            [LU-17240] change test-framework to format and mount targets in parallel

            bzzz Alex Zhuravlev added a comment -

            I've modified the patch slightly so that it also runs locally: https://review.whamcloud.com/c/fs/lustre-release/+/59296
            and tested a few different combinations (ldiskfs/zfs; parallel OST, MDT+OST, none):
            https://testing.whamcloud.com/test_sessions/related?jobs=lustre-reviews&builds=113556#redirect
            The tests were limited to subtests 0-5 to get feedback quickly.
            For some reason I don't see a visible improvement, though the logs confirm that mkfs ran concurrently where it was requested.
            I will check the details a bit later.

            bzzz Alex Zhuravlev added a comment -

            At least locally, it breaks multiple tests right away. I will check the details.

            adilger Andreas Dilger added a comment -

            Tim already pushed a patch for this.

            gerrit Gerrit Updater added a comment -

            "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53518
            Subject: LU-17240 tests: format and mount targets in parallel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e2ea758f0a11cef20621759746750ac92de418af

            adilger Andreas Dilger added a comment -

            The other ones are patch https://review.whamcloud.com/45259 "LU-15112 mgc: do not ignore target registration failure" and maybe patch https://review.whamcloud.com/45871 "LU-15112 ptlrpc: make rq_replied flag always correct" included in 2.14.57.

            adilger Andreas Dilger added a comment -

            Tim, it looks like the patch I was thinking about is https://review.whamcloud.com/44594 "LU-14928 mgs: allow md target re-register", which has been landed since 2.14.55. That addresses at least part of the issue, though I thought there was at least one more patch in this area about re-registering targets.

            shadow Alexey Lyashkov added a comment -

            Andreas,

            I think it's Zam's patch. I have just moved all the handling into a single thread.
            As for this ticket, I don't think it's a large problem, since the test targets are small and formatting is rare, except in conf-sanity.

            adilger Andreas Dilger added a comment -

            Patrick,
            yes, parallel unmounting would also be useful. I think formatting and mounting in parallel would be a bigger win.

            Tim,
            I don't have any tickets with details on this, since most of the time this has happened in conjunction with some other issue that had a higher priority to fix. Basically, what I've seen involves mounting multiple targets in parallel that are registering with the MGS for the first time. If there are problems during registration (after reformat or writeconf), the MGS thinks that the target is registered but the OST does not, or similar. I think shadow has previously submitted a patch for this to allow the OST to retry the initial connection, but I couldn't find it.
            timday Tim Day added a comment -

            I've actually written some tests to do a bunch of parallel mounts, but that was client-side. It was to test out OBD device registration. I never got around to cleaning the test up and submitting it.

            While mounting targets in parallel would make testing faster, I'm not sure it would meaningfully improve test coverage. I haven't seen or heard of issues with mounting targets in parallel (even with 100s of OSS/OST). It would be useful if we could find a way to register a few hundred OSS/MDS in parallel. I think that would surface more bugs faster. I think it would go:

            1) Stop all clients, MDS, OSS

            2) Make a bunch of small temp disks in /tmp/ on each node

            3) Start a bunch of services using those disks, hope nothing explodes

            4) Clean up and restart services

            Andreas, could you link some of the known issues you mentioned (in the description) to this ticket? I'm curious what people have seen go wrong.

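
            Steps 2 and 4 of the outline in the comment above can be sketched with sparse backing files. This is only an illustration under stated assumptions: `make_temp_disks` and `cleanup_temp_disks` are hypothetical helper names, the count and size are arbitrary, and stopping, formatting, and starting the actual services (steps 1 and 3) are deliberately left out:

```shell
#!/bin/bash
# Create COUNT sparse backing files of SIZE each under DIR and print
# their paths. Sparse files use almost no real space until written to.
make_temp_disks() {
	local dir=$1 count=$2 size=$3 i

	mkdir -p "$dir" || return 1
	for ((i = 0; i < count; i++)); do
		truncate -s "$size" "$dir/target$i.img" || return 1
		echo "$dir/target$i.img"
	done
}

# Remove the directory holding the backing files (step 4 cleanup).
cleanup_temp_disks() {
	rm -rf -- "$1"
}

# Example: eight 64 MiB backing files in a fresh directory under /tmp.
workdir=$(mktemp -d /tmp/parallel-reg.XXXXXX)
make_temp_disks "$workdir" 8 64M
cleanup_temp_disks "$workdir"
```

            Each printed path could then be handed to whatever formats and starts a service on that node, one background job per file.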

            paf0186 Patrick Farrell added a comment -

            Andreas,

            Would it make sense to put unmounting/stopping targets in parallel under this same ticket? That's closely related and also takes a while.

            People

              timday Tim Day
              adilger Andreas Dilger