[LU-14773] reduce run_one() overhead Created: 18/Jun/21  Updated: 23/Jul/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14772 split conf-sanity into 2 or 3 parts In Progress
is related to LU-14936 sanity test_140 returned 1 Open
Rank (Obsolete): 9223372036854775807

 Description   

Some simple changes could reduce individual subtest time and unmount/mount/format times, which would help speed up every test session.

Individual sanity subtests that are not doing more than "touch file; check if file exists" currently take 6-7 seconds because they are doing a lot of different things in the background in run_one() with multiple "do_nodes" commands:

  • reset fail_loc
  • check if the network is working on every node (kind of pointless given that other commands are being run on the nodes before and after this check)
  • check grant correctness
  • check dmesg for VFS inodes busy
  • check for LBUG
  • check for multiop still running

When sanity was first written, these subtests took a fraction of a second each (i.e. they would scroll quickly up the screen). While I think the above checks are useful, the overhead could be reduced.

I think a large part of this slowness is that each of these checks runs as a separate ssh/mcmd command, to each remote VM in series, and each ssh invocation is relatively slow.

Speeding up the ssh invocation itself (via do_facet()/do_node()) would of course be desirable, but is not something I can control directly.

Running the per-node checks in parallel would be a win (e.g. use real "pdsh" or "clush"), as would combining all of the checks into a single command that is run with a single ssh invocation to each node. The latter is something that can be done directly in test-framework, and is the main target of this ticket.
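As a rough illustration of the "single command per node" idea, here is a minimal bash sketch. It is not the test-framework implementation: run_remote(), build_check_script() and run_checks_on_nodes() are hypothetical names invented for this example, and run_remote() is only a stand-in for whatever remote-execution helper is really used (it is not do_node()). The sketch joins several check snippets into one script string, sends that string to each node with one remote call, and runs the nodes in parallel.

```shell
#!/bin/bash
# Hypothetical sketch only: none of these functions exist in test-framework.sh.

# Stand-in for the framework's remote-execution helper (assumption: plain ssh).
run_remote() {
	local node=$1; shift
	ssh -o BatchMode=yes "$node" "$@"
}

# Join "label command" pairs into one script string.
# Each check runs in its own subshell so that a failing check does not
# prevent the later ones from running; failures are reported by label.
build_check_script() {
	local script="rc=0;"
	local label cmd

	while [ $# -ge 2 ]; do
		label=$1; cmd=$2; shift 2
		script+=" ( $cmd ) || { echo \"FAIL: $label\"; rc=1; };"
	done
	script+=" exit \$rc"
	echo "$script"
}

# Run the combined script on a comma-separated node list:
# one remote invocation per node, all nodes in parallel.
run_checks_on_nodes() {
	local nodes=$1 script=$2
	local node pid rc=0
	local pids=()

	for node in ${nodes//,/ }; do
		run_remote "$node" "$script" &
		pids+=($!)
	done
	for pid in "${pids[@]}"; do
		wait "$pid" || rc=1
	done
	return $rc
}

# Example use (illustrative checks, not the framework's actual ones):
#   script=$(build_check_script \
#       "reset fail_loc"  "lctl set_param -n fail_loc=0" \
#       "multiop running" "! pgrep -x multiop >/dev/null")
#   run_checks_on_nodes "mds1,oss1,client2" "$script"
```

With N checks and M nodes this replaces N x M serial remote invocations with M parallel ones, so the per-subtest overhead should be closer to the cost of a single ssh round trip. Output from different nodes may interleave; a real implementation would probably want to prefix each line with the node name.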



 Comments   
Comment by Gerrit Updater [ 18/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44033
Subject: LU-14773 tests: skip check_network() on working node
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d085468b6bc06c4bbdfcfbc26afa23d4b752aa64

Comment by Andreas Dilger [ 18/Jun/21 ]

Note that the 44033 patch is NOT the only thing that should be fixed, but is a simple patch that may produce immediate benefits (at a minimum it will avoid a lot of useless visual clutter in the subtest logs from the check_network() output).

Comment by Gerrit Updater [ 18/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44034
Subject: LU-14773 tests: quiet down some verbose messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f23adfd71ee46dfbbf18b8b544ad311c96468fd3

Comment by Gerrit Updater [ 18/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44033/
Subject: LU-14773 tests: skip check_network() on working node
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 67752f6db2c1a7062a73bd6674ee53ad670b392e

Comment by Gerrit Updater [ 25/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44034/
Subject: LU-14773 tests: quiet down some verbose messages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 86f16910645d9d9cad17c0f53ca1a375121e3f4c
