Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.12.0
-
None
-
3
Description
Several ha.sh defects were found during wide HA testing:
1. If the stop file is created before the "while" loop in ha_repeat_mpi_load() starts, nr_loops is still 0 and we get:
ha.sh: line 399: (1476438773 - start_time) / nr_loops: division by 0 (error token is "nr_loops")
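One way to avoid this class of failure is to check the loop counter before dividing. The following is only a minimal sketch: the helper name avg_loop_time and its arguments are hypothetical and are not taken from ha.sh, and the fix that actually landed may look different.

```shell
#!/bin/bash
# Hypothetical helper (not from ha.sh): print the average number of
# seconds per loop, or 0 when no loop has completed yet.
avg_loop_time() {
    local start_time=$1   # epoch seconds when the load started
    local now=$2          # epoch seconds at the time of reporting
    local nr_loops=$3     # number of completed loops, may be 0

    # Guard: bash arithmetic aborts with "division by 0" otherwise.
    if (( nr_loops == 0 )); then
        echo 0
        return 0
    fi

    echo $(( (now - start_time) / nr_loops ))
}
```

Called as `avg_loop_time "$start_time" "$(date +%s)" "$nr_loops"`, this reports 0 instead of aborting when the stop file appears before the first iteration.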
2. In hard failover mode (pm -0 <node>), a node may be down at the time ha.sh collects the Lustre logs.
In this case the test passes, but ha.sh returns 255 at the end:
stdout:
/usr/lib64/lustre/tests/ha.sh: 23:18:01 1476487081: ---------------8<---------------
/usr/lib64/lustre/tests/ha.sh: 23:18:01 1476487081: Summary:
/usr/lib64/lustre/tests/ha.sh: 23:18:01 1476487081: Duration: 44887s
/usr/lib64/lustre/tests/ha.sh: 23:18:01 1476487081: Loops: 20
stderr:
redpill00: failback: Operation performed successfully.
pdsh@redpill-client08: redpill16: ssh exited with exit code 255
/usr/lib64/lustre/tests/ha.sh: 23:18:05 1476487085: not all logs are dumped! Some nodes are unreachable.
pdsh@redpill-client08: redpill16: ssh exited with exit code 255
/usr/lib64/lustre/tests/ha.sh: 00:45:57 1476492357: Trap ERR triggered by:
/usr/lib64/lustre/tests/ha.sh: 00:45:57 1476492357: return $rc
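The general idea of tolerating unreachable nodes during log collection can be sketched as below. This is an illustration under assumptions, not the change that landed: collect_logs and the per-node dump command passed to it are hypothetical names, and the warning text is made up for the example. A failing command tested with `if !` does not fire an ERR trap, so an unreachable node is counted and reported without changing the exit status.

```shell
#!/bin/bash
# Hypothetical sketch (not from ha.sh): try to dump logs from every node,
# warn about the ones that fail, and never fail the run because of them.
collect_logs() {
    local dump_cmd=$1     # command that dumps logs from a single node
    shift
    local node
    local failed=0

    for node in "$@"; do
        # A failure tested by "if" does not trigger "trap ... ERR".
        if ! "$dump_cmd" "$node"; then
            failed=$((failed + 1))
        fi
    done

    if (( failed > 0 )); then
        echo "warning: logs not collected from $failed node(s)" >&2
    fi

    # Log collection is best effort; keep the test result intact.
    return 0
}
```

With this shape, the exit status of the run is decided by the test result alone, while the warning on stderr still records that some logs are missing.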
Landed for 2.13