[LU-17373] Add check under init_param_vars() when update to mgc_requeue_timeout_min fails Created: 18/Dec/23  Updated: 22/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Arshad Hussain Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When running llmount.sh, the update to mgc_requeue_timeout_min can fail. One possible cause is that passwordless ssh to the local host has not been set up. The failure is not fatal: the server and client mounts still work, because they have already completed by that point. However, llmount.sh aborts shortly afterwards with the misleading error message below.

rocky9a: Host key verification failed.
pdsh@rocky9a: rocky9a: ssh exited with exit code 255
+ return 255
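
The kind of check the title asks for could look roughly like the sketch below. It is only an illustration: check_param_update is a hypothetical wrapper, not a function from test-framework.sh, and the real parameter update is replaced by a stand-in command that exits with status 255 as ssh did in the log above.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of test-framework.sh): run the command
# that updates a parameter and, on failure, print a message that names
# the parameter instead of leaving only a bare "return 255" in the trace.
check_param_update() {
	local param=$1
	shift
	local rc=0

	# Run the update command; remember its status without aborting here
	"$@" || rc=$?
	if [ "$rc" -ne 0 ]; then
		echo "error: updating $param failed with status $rc" >&2
		echo "hint: check that passwordless ssh to this host works" >&2
	fi
	# Hand the original status back so the caller decides what to do
	return "$rc"
}

# Stand-in for the real update: a command that fails the way ssh did.
check_param_update mgc_requeue_timeout_min sh -c 'exit 255' ||
	echo "update failed, mounts are already in place, continuing"
```

Returning the original status keeps the decision with the caller: a script that treats the update as optional can continue, while a stricter caller can still stop.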


 Comments   
Comment by Gerrit Updater [ 18/Dec/23 ]

"Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53487
Subject: LU-17373 tests: Add check after updating mgc_requeue_timeout_min
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a4415c860bd6399c97911eb30bbe0801769376ed

Comment by Andreas Dilger [ 21/Dec/23 ]

Is this something failing in Autotest or just your local test config? I think fixing this specific subtest is missing the higher-level issue.

The test-framework.sh code should not be trying to do "ssh" to the local node, but it seems possible if it is confused that the hostname does not match the "client" node name? Otherwise, there is a specific check in do_node() if the target node matches the local hostname and "no_dsh" is used to execute the command directly.

It seems that do_node() needs to get smarter and track all local interfaces to determine whether the client hostname resolves to a local interface and avoids "ssh" entirely, since it is always slower than executing the command directly.

That would solve this issue for all tests instead of just this one test.
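
A rough sketch of the locality test suggested above, for illustration only. The names is_local_host, local_addrs and resolve_host are hypothetical and are not taken from test-framework.sh; the sketch assumes a Linux node where the ip and getent commands are available, and it does not show how do_node() itself would use the result.

```shell
#!/bin/bash
# Hypothetical helpers, not functions from test-framework.sh.

# Print the addresses assigned to local interfaces, one per line.
# Assumes the iproute2 "ip" command; the prefix length is stripped.
local_addrs() {
	ip -o addr show 2>/dev/null | awk '{ sub(/\/.*/, "", $4); print $4 }'
}

# Print the addresses a host name resolves to, one per line.
resolve_host() {
	getent ahosts "$1" 2>/dev/null | awk '{ print $1 }' | sort -u
}

# Return 0 if host $1 is this node, either by name or because one of
# its addresses is assigned to a local interface; return 1 otherwise.
is_local_host() {
	local host=$1
	local addrs addr laddr

	# Cheap check first: the name matches this node's own name
	[ "$host" = "$(uname -n)" ] && return 0

	addrs=$(local_addrs)
	for addr in $(resolve_host "$host"); do
		for laddr in $addrs; do
			[ "$addr" = "$laddr" ] && return 0
		done
	done
	return 1
}

# Example: decide whether a command could run directly or needs ssh.
if is_local_host "$(uname -n)"; then
	echo "local node: run the command directly"
else
	echo "remote node: use the remote shell"
fi
```

Comparing resolved addresses rather than names is what would cover the case where the configured client name differs from the local hostname but still points at this machine.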

Comment by Arshad Hussain [ 22/Dec/23 ]

Is this something failing in Autotest or just your local test config? ...

This is a local test setup.

 

The test-framework.sh code should not be trying to do "ssh" to the local node, but it seems possible if it is confused that the hostname does not match the "client" node name? Otherwise, there is a specific check in do_node() if the target node matches the local hostname and "no_dsh" is used to execute the command directly.

It seems that do_node() needs to get smarter and track all local interfaces to determine whether the client hostname resolves to a local interface and avoids "ssh" entirely, since it is always slower than executing the command directly.

That would solve this issue for all tests instead of just this one test.

Thanks for the clarification. I will update the patch with your suggestion.

Generated at Sat Feb 10 03:34:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.