[LU-10651] Failed to provision nodes: No such process Created: 08/Feb/18  Updated: 09/Feb/18  Resolved: 09/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Charlie Olmstead
Resolution: Done Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Maloo failed to provision nodes, with error "Failed to provision nodes: No such process".  See https://testing.hpdd.intel.com/test_sets/c15c2304-0c7e-11e8-bd00-52540065bddc

Node provisioning log tail:

2018-02-08T02:53:42 Rebooting trevis-57vm1, trevis-57vm2, trevis-57vm3, trevis-57vm4, trevis-57vm5...
2018-02-08T02:56:09 trevis-57vm2 is reachable and ready
2018-02-08T02:57:49 trevis-57vm4 is reachable and ready
2018-02-08T02:58:20 trevis-57vm5 is reachable and ready
2018-02-08T02:58:24 trevis-57vm3 is reachable and ready
2018-02-08T02:58:40 trevis-57vm1 is reachable and ready
2018-02-08T02:58:40 All nodes rebooted successfully
2018-02-08T02:58:40 Creating partitions for OSSs...
2018-02-08T02:58:40 Creating lvm partitions
2018-02-08T02:58:44 trevis-57vm3 - lvm_size=96.5
2018-02-08T02:58:44 trevis-57vm3 - node.partition_size_gb=11.99
2018-02-08T02:59:23 Creating partitions for MDSs...
2018-02-08T02:59:23 Creating lvm partitions
2018-02-08T02:59:29 trevis-57vm4 - lvm_size=96.5
2018-02-08T02:59:29 trevis-57vm4 - node.partition_size_gb=2
2018-02-08T03:00:50 trevis-57vm5 - lvm_size=96.5
2018-02-08T03:00:51 trevis-57vm5 - node.partition_size_gb=2
2018-02-08T03:02:07 Creating partitions for servers complete!
2018-02-08T03:02:07 Rebooting nodes...
2018-02-08T03:02:07 Rebooting trevis-57vm3, trevis-57vm4, trevis-57vm5...
2018-02-08T03:13:14 trevis-57vm5 is reachable and ready
2018-02-08T03:13:15 Errno::ESRCH
No such process
/home/autotest2/autotest/lib/interruptable_process.rb:69:in `getpgid'
/home/autotest2/autotest/lib/interruptable_process.rb:69:in `cleanup_process'
/home/autotest2/autotest/lib/interruptable_process.rb:50:in `ensure in run3'
/home/autotest2/autotest/lib/interruptable_process.rb:50:in `run3'
/home/autotest2/autotest/lib/system_utils.rb:74:in `block (2 levels) in execute'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:91:in `block in timeout'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:33:in `block in catch'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:33:in `catch'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:33:in `catch'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:106:in `timeout'
/home/autotest2/autotest/lib/system_utils.rb:73:in `block in execute'
/home/autotest2/autotest/lib/retry_loop.rb:28:in `block (2 levels) in retry_loop'
/home/autotest2/autotest/lib/retry_loop.rb:27:in `upto'
/home/autotest2/autotest/lib/retry_loop.rb:27:in `block in retry_loop'
/home/autotest2/autotest/lib/retry_loop.rb:26:in `catch'
/home/autotest2/autotest/lib/retry_loop.rb:26:in `retry_loop'
/home/autotest2/autotest/lib/system_utils.rb:69:in `execute'
/home/autotest2/autotest/lib/system_utils.rb:108:in `block in rexec'
/home/autotest2/autotest/lib/system_utils.rb:297:in `via_nfs'
/home/autotest2/autotest/lib/system_utils.rb:104:in `rexec'
/home/autotest2/autotest/lib/system_utils.rb:116:in `rexec_no_retry'
/home/autotest2/autotest/lib/system_utils.rb:202:in `reachable?'
/home/autotest2/autotest/lib/configure_cluster.rb:280:in `block in reboot_nodes'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:484:in `call_with_index'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:342:in `block (2 levels) in work_in_threads'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:493:in `with_instrumentation'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:341:in `block in work_in_threads'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:206:in `block (2 levels) in in_threads'
2018-02-08T03:13:16 Getting console log trevis-57vm1.log
2018-02-08T03:13:17 Getting console log trevis-57vm2.log
2018-02-08T03:13:21 Getting console log trevis-57vm3.log
2018-02-08T03:13:28 Getting console log trevis-57vm4.log
2018-02-08T03:13:44 Getting console log trevis-57vm5.log


 Comments   
Comment by Peter Jones [ 09/Feb/18 ]

Lee

Do you consider this issue to be related to the test infrastructure rather than the Lustre code?

Peter

Comment by Charlie Olmstead [ 09/Feb/18 ]

From what I can tell, this appears to be an edge case. The library for handling external calls does not handle processes that die between the call timing out and the subsequent cleanup. I'll create an internal ticket to handle this case properly.
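For reference, the race described above can be made tolerant with a narrow rescue. The sketch below is hypothetical (not the actual autotest code; `cleanup_process` and its behavior are assumptions based on the stack trace): `Process.getpgid` raises `Errno::ESRCH` when the target process has already exited, so cleanup should treat that as "nothing left to do" rather than an error.

```ruby
# Hypothetical sketch of a tolerant cleanup, assuming the real code
# looks up the child's process group and signals it on timeout.
def cleanup_process(pid)
  pgid = Process.getpgid(pid)   # raises Errno::ESRCH if pid is already gone
  Process.kill("TERM", -pgid)   # signal the whole process group
  true
rescue Errno::ESRCH
  # Raced with process exit: the child died between the timeout
  # firing and this cleanup running. Nothing to kill.
  false
end
```

With this guard, a child that exits on its own just makes `cleanup_process` return `false` instead of propagating `Errno::ESRCH` up through `run3` as happened here.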

Comment by Lee Ochoa [ 09/Feb/18 ]

I guess moving a ticket doesn't add you to the watchers; I'm only now seeing the comments.

So yes, Peter, this was definitely a test infrastructure issue rather than a Lustre one. For future cases I think the best option would be to create an ATM ticket and link the two, so reporters don't lose visibility, and to update them accordingly. Sorry about the confusion.

Comment by Peter Jones [ 09/Feb/18 ]

Thanks Lee/Charlie. Olaf, are you ok to close this ticket out as a duplicate of the ticket opened against our test infrastructure? It seems not to be a bug from a Lustre point of view (though it does affect testing)...

Generated at Sat Feb 10 02:36:59 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.