[LU-1288] Filesystem hang on file operations (ls, df hang) on the headnode and compute nodes Created: 04/Apr/12 Updated: 10/May/12 Resolved: 10/May/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Archie Dizon | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6422 |
| Description |
|
There are many logs like the following:

Mar 16 07:36:11 n007 kernel: LustreError: 11-0: an error occurred while communicating with 10.55.32.2@o2ib. The ost_connect operation failed with -16

10.55.32.2 is oss3-ib and is responsive to ping and ssh. Also, oss4 rebooted last Wednesday evening and now its disks are mounted on oss3. How do we make the disks for oss4 migrate from oss3 back to oss4? Is there anything else I should check to debug this? The customer tried rebooting oss4 and halting oss4, which didn't seem to help. The issue eventually cleared for unknown reasons and the filesystem became responsive again (with the oss4 disks still on oss3).

This happened again over the weekend, multiple times. What happens is that one OSS server declares the other dead (due to high load?) and takes over its disks (and reboots the other node, by design). The Lustre filesystem is then unavailable. The node that took over the disks shows an extremely high load. This happened to both pairs, oss1/oss2 and oss3/oss4, Thursday night/Friday morning. I just had to reset oss1 this morning so it would give up the disks it took over from oss2 on Saturday morning. The cluster is unusable when this situation happens.

Adjustments have been made that we hope will help prevent this issue from reappearing so often. For the HA heartbeat we tripled the timeouts, so 30 seconds is now a warning and the dead time is 90 seconds. Do these values seem reasonable? Or should the customer, for example, increase the stripe count to 24 so files are striped across all the OSTs? |
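For reference, the Heartbeat v1 timers involved here are warntime/deadtime in /etc/ha.d/ha.cf. A minimal sketch assuming the tripled values described above; the keepalive and initdead values are placeholders, not taken from this cluster:

# /etc/ha.d/ha.cf (sketch)
keepalive 2      # seconds between heartbeat packets
warntime 30      # log a late-heartbeat warning after 30 seconds
deadtime 90      # declare the peer dead and trigger takeover after 90 seconds
initdead 180     # allowance at startup; usually at least 2 x deadtime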
| Comments |
| Comment by Archie Dizon [ 05/Apr/12 ] |
|
Customer would like a status update..... Thanks |
| Comment by Peter Jones [ 05/Apr/12 ] |
|
Cliff, could you please help with this one? Archie, what version of Lustre are you using? Thanks, Peter |
| Comment by Cliff White (Inactive) [ 05/Apr/12 ] |
|
Your failover setup should allow you to fail the disks back from the secondary (which I assume is oss3) to the primary; Heartbeat has a 'takeover' command for this purpose.

-16 is EBUSY, which typically happens when an OSS is in recovery. The OSS should complete recovery and then allow new connections, which may account for the impression that the filesystem suddenly 'fixed itself'. You need to monitor the OSS console and system logs and be certain the system has completed recovery before you get upset and start rebooting stuff (rebooting too soon will just make it worse, as the OSS will go into recovery again). There should be messages in the OSS system logs explaining the recovery state.

Proper HA timeout is really site-specific; 30/90 seconds is about average. If your previous timeout was 1/3 of this (10/30), then you would likely have false triggers, so backing off is a good idea.

stripe_count is completely unrelated to failover or to where the disks are mounted. Proper stripe count is application-dependent, so you must decide for yourselves what stripe is appropriate. Remember that the number of stripes roughly translates to the number of parallel IOs a client can issue at once, so for a wide stripe (all OSTs) be sure the clients are actually capable of generating the necessary load. Typically, you look at the amount of work generated by a client times the number of clients, and from that IO demand you size the stripe count as needed. I would also look at client time spent waiting for IO - if clients are spending time waiting on IO, increasing the stripe count may help. |
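A minimal sketch of the commands this implies, assuming a Heartbeat v1 install and a client mount point of /lustre (the script path, directory and stripe count are illustrative only):

# On the primary OSS (e.g. oss4), pull its resources back from the node that took them over
# (the script may live under /usr/lib/heartbeat/ or /usr/lib64/heartbeat/ depending on the package):
/usr/share/heartbeat/hb_takeover
# equivalently, run hb_standby on the node that should give the disks up

# On a client, widen the default stripe for a directory and verify it:
lfs setstripe -c 24 /lustre/somedir
lfs getstripe /lustre/somedir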
| Comment by Archie Dizon [ 09/Apr/12 ] |
|
Here is our customer's response: So it sounds like you agree that our change to the HA timeout sounds reasonable. This doesn't answer why the OSS nodes couldn't handle the additional load, though. We'll need to experiment with the striping parameters, it sounds like. |
| Comment by Cliff White (Inactive) [ 09/Apr/12 ] |
|
What is the load on the OSS prior to the failover? If the systems are already very busy, then a failover can result in the surviving server being very overloaded. |
| Comment by Archie Dizon [ 10/Apr/12 ] |
|
Loads are typically around 3.00 to 8.00 prior to issues. Here's some output from oss1:

[root@oss1 OSS]# uptime
[root@oss1 OSS]# pwd
[root@oss1 OSS]# find . -name threads_'*' -print -exec cat {} \;
[root@oss1 OSS]# cd ost |
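For context, on Lustre 1.8 the threads_* files that find is matching live under the OSS service directories; a sketch of reading them directly, with ost_io usually being the service of interest for IO load:

cat /proc/fs/lustre/ost/OSS/ost_io/threads_min
cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
cat /proc/fs/lustre/ost/OSS/ost_io/threads_max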
| Comment by Cliff White (Inactive) [ 10/Apr/12 ] |
|
You need to understand that Lustre will be in recovery after a failover; I would suggest reading the recovery chapter in the Lustre Manual. If this failover happens again, watch the server consoles/syslogs and monitor the state of the recovery. If clients are attempting IO during the recovery window, they will block, and this will likely increase the amount of work the OST must catch up on after recovery. To determine why the load spikes, examine performance metrics (sar, iostat and top can be useful) and the console/syslog output. If Lustre is having issues, there should be errors in the logs beyond the EBUSY and EAGAIN errors you are seeing. |
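A minimal monitoring sketch along those lines, assuming stock RHEL 5 tools and the Lustre 1.8 /proc layout:

# On the OSS holding the failed-over OSTs, watch recovery progress
# (status, time_remaining, connected_clients and completed_clients are the fields to watch):
watch -n 10 'cat /proc/fs/lustre/obdfilter/*/recovery_status'

# General load/IO picture while the load spike is happening:
iostat -x 10
sar -u 5 12
top -b -n 1 | head -30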
| Comment by Archie Dizon [ 11/Apr/12 ] |
|
The recovery chapter only serves to reinforce the problem: as designed, the OSSs simply do not have the "headroom" to handle failover. With ongoing loads of 7 - 8, e.g., oss3 is already over 50% load, which implies that the addition of oss4's OSTs will automatically push it over 100%.

The Lustre tuning document here: http://wiki.lustre.org/index.php/Lustre_Tuning as well as examination of the Lustre 1.8.x code clearly demonstrate that one way we can combat OSS load is to prevent too many service threads from starting only to sit waiting on i/o. With fewer OSS service threads, the rationale is that the "waiting" would be pushed onto the client nodes, since they will not be able to immediately find a free service thread to which to connect. This will obviously increase latency on the client side, but will keep the OSS from holding too many concurrent clients' state/data while waiting on i/o.

Again, the Lustre tuning document cited makes it seem like timeout problems are directly related to service thread counts, with the threads' acceptance of data severely outpacing the disks' ability to write that data.

Some lines from the logs re: timeouts and load:

Lots of these:

Plenty of these, too:

Particularly interesting given the claim that service threads are not likely of import:

Nothing containing "increas" though:

[root@oss1 ~]# grep -i in |
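If the customer does want to try capping OSS service threads, the 1.8-era mechanism I am aware of is the ost module option set at load time; this is a sketch only, and the value 128 is purely an example to be sized against the hardware:

# /etc/modprobe.conf on each OSS (takes effect the next time the ost module loads,
# i.e. after unmounting the OSTs and reloading modules, or after a reboot)
options ost oss_num_threads=128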
| Comment by Archie Dizon [ 11/Apr/12 ] |
|
Some information to add to the problems we're experiencing with Lustre: today we caught oss2 in the middle of failing. Load had risen to 120+ with the usual CPU usage levels for user/system/wait, and only one Lustre service thread appeared to be actively doing work. Failover had not yet occurred. From syslog, we observed the following:

Apr 11 09:14:03 oss2 kernel: Lustre: Service thread pid 8432 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

Several other service threads exhibited the same condition, in each case with a stack trace indicating the thread was in jbd2_log_wait_commit. The OSS does not recover from this state in a sensible amount of time, so after 5 minutes we were forced to reboot oss2 once again. FWIW, oss1 picked up the disks and we allowed recovery to complete on all 6 failed-over OSTs before starting heartbeat on the newly rebooted oss2.

Since Whamcloud's Lustre 1.8.7 sits atop ext4 filesystems, we're wondering if there are known issues with ext4 on the 2.6.18-238 kernel that would lead to the observed condition: inordinately high load with service threads "stuck" in jbd2_log_wait_commit? I note that the RHEL 5 kernel is at 2.6.18 release 308.1.1 now; here is "uname -a" for oss2:

Linux oss2.localdomain 2.6.18-238.12.1.el5_lustre.g266a955 #1 SMP Fri Jun 10 16:39:27 PDT 2011 x86_64 x86_64 x86_64 GNU/Linux |
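If this state recurs, one way to confirm where the service threads are blocked on a 2.6.18 kernel is a sysrq task dump plus a look at the journal path; a sketch, with the grep window sizes chosen arbitrarily:

# Allow and trigger a full task-stack dump to the kernel log (noisy on a busy OSS)
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger

# Then look for service threads parked in the journal commit path
dmesg | grep -B 2 -A 12 jbd2_log_wait_commit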
| Comment by Archie Dizon [ 12/Apr/12 ] |
|
Yet another Lustre failure to report this a.m., again with oss2. In doing a little web research, the conditions observed here match EXACTLY those reported in this Whamcloud bug report: http://jira.whamcloud.com/browse/LU-436 That issue was with 1.8.6 but is listed as unresolved and still open, so I doubt it was resolved in 1.8.7; perhaps the changes introduced in the 2.x releases skirt the issue (whatever it may be) entirely? Can this be confirmed? |
| Comment by Cliff White (Inactive) [ 17/Apr/12 ] |
|
Yes, tuning the threads for performance is important in the failover case. Yes, reducing threads may be necessary for some hardware. The issue in |
| Comment by Cliff White (Inactive) [ 19/Apr/12 ] |
|
How are you doing on this issue? Would a call or Skype conference help? |
| Comment by Cliff White (Inactive) [ 29/Apr/12 ] |
|
What is your current state? Do you still need help? |
| Comment by Cliff White (Inactive) [ 10/May/12 ] |
|
I am going to resolve this issue; please re-open if you have further information. |