Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
Description
There are many log messages like the following:
Mar 16 07:36:11 n007 kernel: LustreError: 11-0: an error occurred while communicating with 10.55.32.2@o2ib. The ost_connect operation failed with -16
10.55.32.2 is oss3-ib and is responsive to ping and ssh.
Also, oss4 rebooted last Wednesday evening and now its disks are mounted on oss3.
How do we make the disks for oss4 migrate from oss3 to oss4? Anything else I should check to debug this?
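As a starting point, here is a minimal sketch of checks and a manual failback, assuming the OSTs sit on shared storage and Heartbeat manages the mounts; the mount point and device path below are hypothetical:

# -16 is EBUSY: the OST is typically still in recovery or already
# mounted on the other node, so clients cannot reconnect yet.
lctl dl                                        # on a client: list devices and their states
lfs check servers                              # on a client: ping every OST/MDT

# On oss3: see whether the taken-over OSTs have finished recovery
lctl get_param obdfilter.*.recovery_status
# (on older Lustre releases: cat /proc/fs/lustre/obdfilter/*/recovery_status)

# Manual failback once oss4 is healthy: unmount on oss3, remount on oss4
umount /mnt/ost4                               # hypothetical mount point
mount -t lustre /dev/mapper/ost4 /mnt/ost4     # hypothetical device path

# If Heartbeat owns the mounts, hand the resources back instead,
# e.g. on oss3 (script location varies by Heartbeat version):
/usr/share/heartbeat/hb_standby foreign        # give up only the peer's resources

If auto_failback is off in ha.cf, Heartbeat will not move the resources back on its own, which would explain the disks staying on oss3 until someone intervenes.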
Customer tried rebooting oss4 and halting oss4, which didn't seem to help. The issue eventually cleared for unknown reasons and the filesystem became responsive again (with the oss4 disks still on oss3).
This happened again multiple times over the weekend. What happens is that one oss server declares the other dead (due to high load?) and takes over its disks (and reboots the other node, by design). The Lustre filesystem is then unavailable. The node that took over the disks ends up with an extremely high load. This happened to both pairs, oss1/oss2 and oss3/oss4, Thursday night/Friday morning. I just had to reset oss1 this morning so it would give up the disks it took over from oss2 on Saturday morning. The cluster is unusable when this situation happens.
Adjustments have been made that we hope will keep this issue from recurring so often. For the HA heartbeat we tripled the timeouts, so a late heartbeat now triggers a warning at 30 seconds and a node is declared dead after 90 seconds. For Lustre we enabled striping with a stripe count of 8 and a stripe size of 4 MB.
Do these values seem reasonable? Or should they, for example, increase the stripe count to 24 so files are striped across all of the OSTs?
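For reference, a rough sketch of where those settings live, assuming a Heartbeat v1-style ha.cf and that the default striping is set on the client mount point (/mnt/lustre below is hypothetical):

# /etc/ha.d/ha.cf (excerpt) -- the tripled timeouts described above
warntime 30        # warn when a heartbeat is 30 s late
deadtime 90        # declare the peer dead after 90 s of silence

# Default striping, set from a client (newer lfs uses -S for the
# stripe size; older releases use lowercase -s)
lfs setstripe -c 8 -S 4M /mnt/lustre
# A stripe count of -1 stripes across every available OST (here, all 24)
lfs setstripe -c -1 -S 4M /mnt/lustre

Striping every file across all 24 OSTs spreads single-file bandwidth as widely as possible, but it also puts every OSS in the I/O path of every file, so a count of 8 is often the more conservative default.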