[LU-17240] change test-framework to format and mount targets in parallel Created: 28/Oct/23  Updated: 20/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: medium, test_script_improvements

Issue Links:
Related
is related to LU-14928 Allow MD target re-registered after w... Resolved
is related to LU-4966 handle server registration errors gra... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It would be useful for a number of reasons to change test-framework.sh to format and mount the MDTs and OSTs in parallel (if not both MDTs and OSTs at the same time, then at least in two sets).

  • this would reduce testing time significantly for tests that reformat the filesystem (e.g. conf-sanity)
  • this would improve testing of how the MGS handles multiple targets registering in parallel (there are at least some known issues with this that could be found and fixed)
  • this would improve test coverage since filesystems are often mounted in parallel in production, and this would better simulate the real world
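
A minimal sketch of the idea, assuming plain bash job control: each target is handled in a background job, and the caller waits for every job and counts failures. Both run_parallel and fake_format are hypothetical names for illustration, not existing test-framework.sh functions; a real change would call the framework's own format and mount helpers in place of the stand-in.

```shell
#!/bin/bash
# Hypothetical helper (not part of test-framework.sh): run "$cmd <target>"
# for every target as a background job, then wait for each job in turn.
# Returns 0 only if every job succeeded.
run_parallel() {
	local cmd=$1; shift
	local pids=() failed=0 target pid

	for target in "$@"; do
		"$cmd" "$target" &
		pids+=($!)
	done

	# Wait on each PID individually so every exit status is checked.
	for pid in "${pids[@]}"; do
		wait "$pid" || failed=$((failed + 1))
	done

	return $((failed > 0))
}

# Stand-in for a real format/mount step (assumption for illustration only):
# it sleeps briefly and fails for any target whose name starts with "bad".
fake_format() {
	sleep 0.1
	[[ $1 != bad* ]]
}

run_parallel fake_format mds1 mds2 ost1 ost2 && echo "all targets ready"
```

Waiting on each PID separately, rather than using a bare wait, keeps a single failed target from being silently ignored.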


 Comments   
Comment by Patrick Farrell [ 28/Oct/23 ]

timday - I'm just living in hope here, but maybe this would be of interest to you?  It would certainly make reloading a test node faster, which would be very nice.

Comment by Patrick Farrell [ 28/Oct/23 ]

Andreas,

Would it make sense to put "unmounting"/Stopping targets in parallel here under this same ticket?  That's closely related and also takes a while.

Comment by Tim Day [ 29/Oct/23 ]

I've actually written some tests to do a bunch of parallel mounts, but that was client-side. It was to test out OBD device registration. I never got around to cleaning the test up and submitting it.

While mounting targets in parallel would make testing faster, I'm not sure if it would meaningfully improve test coverage. I haven't seen/heard of issues with mounting targets in parallel (even with 100s of OSS/OST). It would be useful if we could find a way to register a few hundred OSS/MDS in parallel. I think that would surface more bugs faster. I think it would go:

1) Stop all clients, MDS, MDS, OSS

2) Make a bunch of small temp disks in /tmp/ on each node

3) Start a bunch of services using those disks, hope nothing explodes

4) Clean up and restart services
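
Step 2 above could be sketched roughly as follows, assuming GNU coreutils (mktemp, truncate) and sparse files as the backing "disks". The helper name make_temp_disks, the directory prefix, and the file naming are hypothetical; stopping and starting the Lustre services (steps 1, 3 and 4) is left out, since that depends on the framework's own helpers.

```shell
#!/bin/bash
# Hypothetical helper: create $1 sparse files of size $2 in a fresh
# directory under /tmp and print the directory name on stdout.
make_temp_disks() {
	local count=$1 size=$2 dir i

	dir=$(mktemp -d /tmp/lustre-disks.XXXXXX) || return 1
	for ((i = 0; i < count; i++)); do
		# truncate creates a sparse file, so no real space is used yet
		truncate -s "$size" "$dir/disk$i.img" || return 1
	done
	echo "$dir"
}

disk_dir=$(make_temp_disks 8 64M)
ls "$disk_dir"

# Cleanup of the backing files only (services would be stopped first).
rm -rf "$disk_dir"
```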

Andreas, could you link some of the known issues you mentioned (in the description) to this ticket? I'm curious what people have seen go wrong.

Comment by Andreas Dilger [ 30/Oct/23 ]

Patrick,
yes, parallel unmounting would also be useful, though I think formatting and mounting in parallel would be a bigger win.

Tim,
I don't have any tickets that have details on this, since most of the time this has happened in conjunction with some other issue that had a higher priority to fix. Basically, what I've seen is problems when mounting multiple targets in parallel and registering them with the MGS for the first time. If there are problems during registration (after reformat or writeconf), the MGS thinks that the OST is registered but the OST does not, or similar. I think shadow has previously submitted a patch for this to allow the OST to retry the initial connection, but I couldn't find it.

Comment by Alexey Lyashkov [ 30/Oct/23 ]

Andreas,

I think it's Zam's patch. I just moved all of the handling into a single thread.
As for this ticket:
I don't think it's a big problem, since test targets are small and formats are rare, except in conf-sanity.

Comment by Andreas Dilger [ 31/Oct/23 ]

Tim, it looks like the patch I was thinking about is https://review.whamcloud.com/44594 "LU-14928 mgs: allow md target re-register" and that has been landed since 2.14.55. That at least addresses part of the issue, though I thought there was at least one more patch in this area about re-registering targets.

Comment by Andreas Dilger [ 31/Oct/23 ]

The other ones are patch https://review.whamcloud.com/45259 "LU-15112 mgc: do not ignore target registration failure" and maybe patch https://review.whamcloud.com/45871 "LU-15112 ptlrpc: make rq_replied flag always correct" included in 2.14.57.

Comment by Gerrit Updater [ 20/Dec/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53518
Subject: LU-17240 tests: format and mount targets in parallel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e2ea758f0a11cef20621759746750ac92de418af

Generated at Sat Feb 10 03:33:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.