[LU-10651] Failed to provision nodes: No such process
| Created: | 08/Feb/18 | Updated: | 09/Feb/18 | Resolved: | 09/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Charlie Olmstead |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Maloo failed to provision nodes, with error "Failed to provision nodes: No such process".

See https://testing.hpdd.intel.com/test_sets/c15c2304-0c7e-11e8-bd00-52540065bddc

Node provisioning log tail:

2018-02-08T02:53:42 Rebooting trevis-57vm1, trevis-57vm2, trevis-57vm3, trevis-57vm4, trevis-57vm5...
2018-02-08T02:56:09 trevis-57vm2 is reachable and ready
2018-02-08T02:57:49 trevis-57vm4 is reachable and ready
2018-02-08T02:58:20 trevis-57vm5 is reachable and ready
2018-02-08T02:58:24 trevis-57vm3 is reachable and ready
2018-02-08T02:58:40 trevis-57vm1 is reachable and ready
2018-02-08T02:58:40 All nodes rebooted successfully
2018-02-08T02:58:40 Creating partitions for OSSs...
2018-02-08T02:58:40 Creating lvm partitions
2018-02-08T02:58:44 trevis-57vm3 - lvm_size=96.5
2018-02-08T02:58:44 trevis-57vm3 - node.partition_size_gb=11.99
2018-02-08T02:59:23 Creating partitions for MDSs...
2018-02-08T02:59:23 Creating lvm partitions
2018-02-08T02:59:29 trevis-57vm4 - lvm_size=96.5
2018-02-08T02:59:29 trevis-57vm4 - node.partition_size_gb=2
2018-02-08T03:00:50 trevis-57vm5 - lvm_size=96.5
2018-02-08T03:00:51 trevis-57vm5 - node.partition_size_gb=2
2018-02-08T03:02:07 Creating partitions for servers complete!
2018-02-08T03:02:07 Rebooting nodes...
2018-02-08T03:02:07 Rebooting trevis-57vm3, trevis-57vm4, trevis-57vm5...
2018-02-08T03:13:14 trevis-57vm5 is reachable and ready
2018-02-08T03:13:15 Errno::ESRCH No such process
/home/autotest2/autotest/lib/interruptable_process.rb:69:in `getpgid'
/home/autotest2/autotest/lib/interruptable_process.rb:69:in `cleanup_process'
/home/autotest2/autotest/lib/interruptable_process.rb:50:in `ensure in run3'
/home/autotest2/autotest/lib/interruptable_process.rb:50:in `run3'
/home/autotest2/autotest/lib/system_utils.rb:74:in `block (2 levels) in execute'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:91:in `block in timeout'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:33:in `block in catch'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:33:in `catch'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:33:in `catch'
/usr/local/rbenv/versions/2.3.1/lib/ruby/2.3.0/timeout.rb:106:in `timeout'
/home/autotest2/autotest/lib/system_utils.rb:73:in `block in execute'
/home/autotest2/autotest/lib/retry_loop.rb:28:in `block (2 levels) in retry_loop'
/home/autotest2/autotest/lib/retry_loop.rb:27:in `upto'
/home/autotest2/autotest/lib/retry_loop.rb:27:in `block in retry_loop'
/home/autotest2/autotest/lib/retry_loop.rb:26:in `catch'
/home/autotest2/autotest/lib/retry_loop.rb:26:in `retry_loop'
/home/autotest2/autotest/lib/system_utils.rb:69:in `execute'
/home/autotest2/autotest/lib/system_utils.rb:108:in `block in rexec'
/home/autotest2/autotest/lib/system_utils.rb:297:in `via_nfs'
/home/autotest2/autotest/lib/system_utils.rb:104:in `rexec'
/home/autotest2/autotest/lib/system_utils.rb:116:in `rexec_no_retry'
/home/autotest2/autotest/lib/system_utils.rb:202:in `reachable?'
/home/autotest2/autotest/lib/configure_cluster.rb:280:in `block in reboot_nodes'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:484:in `call_with_index'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:342:in `block (2 levels) in work_in_threads'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:493:in `with_instrumentation'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:341:in `block in work_in_threads'
/home/autotest2/autotest/vendor/bundle/ruby/2.3.0/gems/parallel-1.11.1/lib/parallel.rb:206:in `block (2 levels) in in_threads'
2018-02-08T03:13:16 Getting console log trevis-57vm1.log
2018-02-08T03:13:17 Getting console log trevis-57vm2.log
2018-02-08T03:13:21 Getting console log trevis-57vm3.log
2018-02-08T03:13:28 Getting console log trevis-57vm4.log
2018-02-08T03:13:44 Getting console log trevis-57vm5.log
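The Errno::ESRCH above is raised by the `getpgid' call in cleanup_process; Ruby's Process.getpgid fails with "No such process" when the queried PID has already exited and been reaped. A minimal stand-alone illustration of that failure mode (not taken from the autotest code):

```ruby
# Querying the process group of a child that has already exited and been
# reaped raises Errno::ESRCH ("No such process"), as seen in the trace.
pid = Process.spawn("true")   # start a short-lived child
Process.wait(pid)             # child exits and is reaped here

begin
  Process.getpgid(pid)        # PID no longer exists
rescue Errno::ESRCH => e
  puts "Errno::ESRCH: #{e.message}"
end
```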
| Comments |
| Comment by Peter Jones [ 09/Feb/18 ] |
|
Lee, do you consider this issue to be related to the test infrastructure rather than the Lustre code? Peter |
| Comment by Charlie Olmstead [ 09/Feb/18 ] |
|
From what I can tell, this appears to be an edge case: the library that handles external calls does not account for a process that dies between the call timing out and the cleanup running. I'll create an internal ticket to handle this case properly. |
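A minimal sketch of the kind of guard that could cover this case, using a hypothetical cleanup_process helper rather than the actual interruptable_process.rb code (which is not shown here):

```ruby
# Hypothetical sketch: tolerate a child that exited between the timeout
# firing and the cleanup running, instead of letting Errno::ESRCH escape.
def cleanup_process(pid)
  pgid = Process.getpgid(pid)   # raises Errno::ESRCH if pid is already gone
  Process.kill("TERM", -pgid)   # signal the whole process group
rescue Errno::ESRCH
  # The process (group) already exited; nothing left to clean up.
end
```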
| Comment by Lee Ochoa [ 09/Feb/18 ] |
|
I guess moving a ticket doesn't add you to the watchers; I'm only now seeing the comments. So yes, Peter, this was definitely a test infrastructure issue rather than a Lustre one. For future cases I think the best option would be to create an ATM ticket and link the two, so reporters don't lose visibility, and update them accordingly. Sorry about the confusion. |
| Comment by Peter Jones [ 09/Feb/18 ] |
|
Thanks Lee/Charlie. Olaf, are you ok to close this ticket out as a duplicate of the ticket opened against our test infrastructure? It does not seem to be a bug from a Lustre point of view (though it does affect testing)... |