[LU-1288] Filesystem hang on file operations (ls, df hang) on the headnode and compute nodes Created: 04/Apr/12  Updated: 10/May/12  Resolved: 10/May/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Archie Dizon Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6422

 Description   

There are many log messages like the following:

Mar 16 07:36:11 n007 kernel: LustreError: 11-0: an error occurred while communicating with 10.55.32.2@o2ib. The ost_connect operation failed with -16

10.55.32.2 is oss3-ib and is responsive to ping and ssh.

Also, oss4 rebooted last Wednesday evening and now its disks are mounted on
oss3.

How do we make the disks for oss4 migrate from oss3 to oss4? Anything else I should check to debug this?

Customer tried rebooting and halting oss4, which didn't seem to help. The issue eventually cleared for unknown reasons and the filesystem became responsive again (with the oss4 disks still on oss3).

This happened again multiple times over the weekend. What happens is that one OSS server declares the other dead (due to high load?) and takes over its disks (and reboots the other node, by design). The Lustre filesystem is then unavailable. The node that took over the disks has an extremely high load. This happened to both pairs, oss1/oss2 and oss3/oss4, Thursday night/Friday morning. I just had to reset oss1 this morning so it would give up the disks it took over from oss2 on Saturday morning. The cluster is unusable when this situation happens.

Adjustments have been made that we hope will help prevent this issue from reappearing so often. For the HA heartbeat we tripled the timeouts, so 30 seconds is now a warning and the time to declare a node dead is now 90 seconds. For Lustre we enabled striping with a stripe count of 8 and a stripe size of 4MB.

Do these values seem reasonable? Or should they, for example, increase the stripe count to 24 so it is across all the OSTs?
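For reference, the timing change described above would normally be expressed in Heartbeat's ha.cf. A minimal sketch, assuming a classic Heartbeat v1 configuration (the file path and node names are illustrative, not taken from this cluster):

# /etc/ha.d/ha.cf (sketch only; adjust node names and heartbeat media to the real setup)
keepalive 2          # seconds between heartbeat packets
warntime 30          # warn about a late heartbeat after 30 seconds
deadtime 90          # declare the peer dead after 90 seconds
initdead 180         # allow extra time at boot, commonly at least twice deadtime
node oss3 oss4       # illustrative node names

Setting initdead well above deadtime is the usual way to avoid declaring a peer dead while it is still booting.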



 Comments   
Comment by Archie Dizon [ 05/Apr/12 ]

Customer would like a status update..... Thanks

Comment by Peter Jones [ 05/Apr/12 ]

Cliff

Could you please help with this one?

Archie

What version of Lustre are you using?

Thanks

Peter

Comment by Cliff White (Inactive) [ 05/Apr/12 ]

Your failover setup should allow you to fail back the disks from the secondary (which I assume is oss3) to the primary. Heartbeat has a 'takeover' command for this purpose.

-16 is EBUSY, which typically happens when an OSS is in recovery. The OSS should complete recovery and allow new connections at that point, which may account for the impression that the filesystem suddenly 'fixed itself'. You need to monitor the OSS console and system logs and be certain the system has completed recovery before you start rebooting things; rebooting too soon will just make it worse, as the OSS will go into recovery again. There should be messages in the OSS system logs explaining the recovery state.

Proper HA timeouts are really site-specific; 30/90 seconds is about average. If your previous timeout was 1/3 of this (10/30), then you would likely have false triggers, so backing off is a good idea.
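As a rough illustration of the failback and recovery-monitoring steps described above (the helper script location and its arguments vary between Heartbeat packages, so treat this as a sketch rather than exact syntax):

# On the primary OSS (e.g. oss4), ask Heartbeat to take its own resources back
/usr/share/heartbeat/hb_takeover

# On whichever OSS currently serves the OSTs, check recovery state before and after the move
cat /proc/fs/lustre/obdfilter/*/recovery_status
# "status: COMPLETE" means recovery has finished; "RECOVERING" shows how many clients remain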

stripe_count is completely unrelated to failover or to where the disks are mounted. The proper stripe count is application-dependent, so you must decide for yourselves what striping is appropriate. Remember that the number of stripes roughly translates to the number of parallel I/O requests a client can issue at once, so for a wide stripe (all OSTs) be sure the clients are actually capable of generating the necessary load. Typically, you look at the amount of work generated by a client times the number of clients, and size the stripe from that aggregate I/O demand. I would also look at client time spent waiting for I/O; if clients are spending time waiting on I/O, increasing the stripe count may help.
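For the striping side, the values being discussed can be applied per directory and inspected with lfs. A minimal sketch, assuming Lustre 1.8 syntax and an illustrative mount point /lustre:

# 8-way stripe with a 4MB stripe size; new files created under the directory inherit it
lfs setstripe -c 8 -s 4M /lustre/scratch

# Stripe across every available OST instead (a stripe count of -1 means "all OSTs")
lfs setstripe -c -1 -s 4M /lustre/scratch/wide

# Verify what a file or directory actually uses
lfs getstripe /lustre/scratch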

Comment by Archie Dizon [ 09/Apr/12 ]

Here is our customer response:

So it sounds like you agree that our change to the HA timeout is reasonable. Hopefully this will stop the takeovers when OSS nodes aren't actually down. We'll assume that the takeovers (this happened four times) were because of this.

This doesn't answer why the OSS nodes couldn't handle the additional disks once the failover happened, though. In each case the Lustre filesystem was unavailable for an extended period (including most of last weekend). I believe in each case we ended up having to reboot the OSS that took over the disks to make the filesystem work. When an OSS takes over the disks from the other, the load on that OSS goes to over 200.

It sounds like we'll need to experiment with the striping parameters.

Comment by Cliff White (Inactive) [ 09/Apr/12 ]

What is the load on the OSS prior to the failover? If the systems are already very busy, then a failover can result in the surviving server being very overloaded.

Comment by Archie Dizon [ 10/Apr/12 ]

Loads are typically around 3.00 to 8.00 prior to issues. Here's some
more info:

[root@oss1 OSS]# uptime
09:31:21 up 1 day, 23:38, 1 user, load average: 8.29, 4.14, 3.51

[root@oss1 OSS]# pwd
/proc/fs/lustre/ost/OSS

[root@oss1 OSS]# find . -name threads_'*' -print -exec cat {} \;
./ost_io/threads_started
222
./ost_io/threads_max
512
./ost_io/threads_min
128
./ost_create/threads_started
10
./ost_create/threads_max
16
./ost_create/threads_min
2
./ost/threads_started
512
./ost/threads_max
512
./ost/threads_min
128

[root@oss1 OSS]# cd ost
[root@oss1 ost]# ls -l
:
-rw-r--r-- 1 root root 0 Apr 10 09:30 threads_max
-rw-r--r-- 1 root root 0 Apr 10 09:30 threads_min
-r--r--r-- 1 root root 0 Apr 10 09:30 threads_started

Comment by Cliff White (Inactive) [ 10/Apr/12 ]

You need to understand that Lustre will be in recovery after a failover. I would suggest looking at the recovery chapter in the Lustre Manual. If this failover happens again, you need to monitor the server consoles/syslogs and monitor the state of the recovery. If clients are attempting IO during the recovery time, they will block, and this will likely increase the amount of work the OST must catch up on after recovery. To determine why the load spikes, examine performance metrics (sar, iostat and top can be useful) and console/syslog output. If Lustre is having issues, there should be errors in the logs beyond the EBUSY and EAGAIN errors you are seeing.
Thread counts are likely not useful in this case. It would be helpful for you to examine timeout performance; there should be some indication in your system logs if Lustre is increasing timeouts.
The load on a failover OSS when supporting disks from its failover partner will depend a great deal on the amount of performance headroom available. If the OSS is already busy, the additional load from the extra LUNs may indeed overload the system to the point where performance degrades.
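As a sketch of the kind of data collection suggested here, using standard sysstat tools plus a syslog search (the log path /var/log/messages is an assumption):

# CPU utilization / iowait and run-queue history, 10-second samples
sar -u 10 30
sar -q 10 30

# Per-device utilization and service times for the OST block devices
iostat -x 10 30

# Recent timeout- and congestion-related Lustre messages
grep -iE "timed out|slow .* due to heavy IO load" /var/log/messages | tail -50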

Comment by Archie Dizon [ 11/Apr/12 ]

The recovery chapter only serves to reinforce the problem: as designed, the OSSes simply do not have the "headroom" to handle failover. With ongoing loads of 7-8, oss3, for example, is already over 50% loaded, which implies that the addition of oss4's OSTs will automatically push it over 100%. The LUSTRE tuning document here:

http://wiki.lustre.org/index.php/Lustre_Tuning

together with examination of the LUSTRE 1.8.x code, clearly demonstrates that one way we can combat OSS load is to prevent too many service threads from starting only to sit waiting on I/O. With fewer OSS service threads, the rationale is that the "waiting" would be pushed onto the client nodes, since they will not be able to immediately find a free service thread to which to connect. This will obviously increase latency on the client side, but will keep the OSS from holding too many concurrent clients' state/data while waiting on I/O.
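As a rough illustration of the runtime side of this, using the same /proc tunables shown earlier in this ticket (with the caveat, stated as an assumption, that lowering threads_max does not stop threads that are already running, so the full effect is only seen after the OSS is restarted):

# Current and maximum I/O service threads on the OSS
cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
cat /proc/fs/lustre/ost/OSS/ost_io/threads_max

# Cap the I/O service threads (example value only; choose one matched to the disk backend)
echo 128 > /proc/fs/lustre/ost/OSS/ost_io/threads_max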

>
> If this failover happens again, you need to monitor the server consoles/syslogs and monitor the state of the recovery.
> If clients are attempting IO during the recovery time, they will block and this will likely increase the amount of work the OST must catch up on after recovery. To determine why the load spikes, examine performance metrics (sar, iostat and top can be useful) and console/syslog output. If Lustre is having issues, there should be errors in the logs beyond the EBUSY and EAGAIN errors you are seeing.
> Thread counts are likely not useful in this case. It would be helpful to examine timeout performance; there should be some indication in the system logs if Lustre is increasing timeouts.

Again, the LUSTRE tuning document cited makes it seem like timeout performance is directly related to the service thread counts, and to their acceptance of data severely outpacing the disks' ability to write that data. Some lines from the logs re: timeouts and load:

Lots of these:
Apr 10 08:32:35 oss1 kernel: LustreError: 8998:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81063559c400 x1398692377549522/t0 o101->MGS@MGC10.55.31.1@o2ib_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Apr 10 08:32:35 oss1 kernel: LustreError: 8998:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 119 previous similar messages
Apr 10 08:35:16 oss1 kernel: Lustre: 8517:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1398692377549713 sent from MGC10.55.31.1@o2ib to NID 10.55.31.2@o2ib 5s ago has timed out (5s prior to deadline).

Plenty of these, too:
Apr 10 08:26:08 oss1 kernel: Lustre: lustre-OST0001: slow start_page_write 31s due to heavy IO load
Apr 10 12:51:19 oss1 kernel: Lustre: lustre-OST0005: slow start_page_write 36s due to heavy IO load

Particularly interesting given the claim that service threads are not likely of import:
Apr 10 11:06:40 oss1 kernel: Lustre: lustre-OST0005: slow start_page_read 593s due to heavy IO load
Apr 10 11:06:40 oss1 kernel: Lustre: Service thread pid 16202 completed after 593.92s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

Nothing containing "increas" though:

[root@oss1 ~]# grep -i in

Comment by Archie Dizon [ 11/Apr/12 ]

Some information to add to the problems we're experiencing with LUSTRE:

Today we caught oss2 in the middle of failing. Load had risen to 120+ with the usual CPU usage levels for user/system/wait, and only one LUSTRE service thread appeared to be doing any work. Failover had not yet occurred.

From syslog, we observed the following:

Apr 11 09:14:03 oss2 kernel: Lustre: Service thread pid 8432 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 11 09:14:03 oss2 kernel: Lustre: Skipped 4 previous similar messages
Apr 11 09:14:03 oss2 kernel: Pid: 8432, comm: ll_ost_io_184
Apr 11 09:14:03 oss2 kernel:
Apr 11 09:14:03 oss2 kernel: Call Trace:
Apr 11 09:14:03 oss2 kernel: [<ffffffff88c5c6a8>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
Apr 11 09:14:03 oss2 kernel: [<ffffffff800a28dd>] autoremove_wake_function+0x0/0x2e
Apr 11 09:14:03 oss2 kernel: [<ffffffff88d0792b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
Apr 11 09:14:03 oss2 kernel: [<ffffffff88d47cd4>] filter_commitrw_write+0x1c34/0x2be0 [obdfilter]
Apr 11 09:14:03 oss2 kernel: [<ffffffff88ce9d09>] ost_brw_write+0x1c99/0x2480 [ost]
Apr 11 09:14:03 oss2 kernel: [<ffffffff88911111>] LNetMDBind+0x301/0x450 [lnet]
Apr 11 09:14:03 oss2 kernel: [<ffffffff88a2dc65>] lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
Apr 11 09:14:03 oss2 kernel: [<ffffffff88a23ac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Apr 11 09:14:03 oss2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
Apr 11 09:14:03 oss2 kernel: [<ffffffff88a280a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Apr 11 09:14:03 oss2 kernel: [<ffffffff88ced09e>] ost_handle+0x2bae/0x55b0 [ost]
Apr 11 09:14:03 oss2 kernel: [<ffffffff80153e4d>] __next_cpu+0x19/0x28
Apr 11 09:14:03 oss2 kernel: [<ffffffff800774ed>] smp_send_reschedule+0x4e/0x53
Apr 11 09:14:03 oss2 kernel: [<ffffffff88a376d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Apr 11 09:14:03 oss2 kernel: [<ffffffff88a37e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Apr 11 09:14:03 oss2 kernel: [<ffffffff8008c85d>] __wake_up_common+0x3e/0x68
Apr 11 09:14:03 oss2 kernel: [<ffffffff88a38dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Apr 11 09:14:03 oss2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Apr 11 09:14:03 oss2 kernel: [<ffffffff88a37e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Apr 11 09:14:03 oss2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Several other service threads exhibited the same condition, in each case with a stack trace indicating the thread was in jbd2_log_wait_commit. The OSS did not recover from this state in a sensible amount of time, so after 5 minutes we were forced to reboot oss2 once again. FWIW, oss1 picked up the disks and we allowed recovery to complete on all 6 failed-over OSTs before starting heartbeat on the newly rebooted oss2.

Since Whamcloud's LUSTRE 1.8.7 sits atop EXT4 filesystems, we're wondering if there are known issues with EXT4 on the 2.6.18-238 kernel that would lead to the observed condition: inordinately high load with service threads "stuck" in jbd2_log_wait_commit? I note that the RHEL 5 kernel is at 2.6.18 release 308.1.1 now; here is "uname -a" for oss2:

Linux oss2.localdomain 2.6.18-238.12.1.el5_lustre.g266a955 #1 SMP Fri Jun 10 16:39:27 PDT 2011 x86_64 x86_64 x86_64 GNU/Linux

Comment by Archie Dizon [ 12/Apr/12 ]

Yet another LUSTRE failure to report this a.m., again with oss2. In doing a little web research, the conditions we observed match EXACTLY those reported in this Whamcloud bug report:

http://jira.whamcloud.com/browse/LU-436

That issue was with 1.8.6 but is listed as unresolved and still open, so I doubt it was resolved in 1.8.7; perhaps the changes introduced in the 2.x releases skirt the issue (whatever it may be) entirely? Can this be confirmed?

Comment by Cliff White (Inactive) [ 17/Apr/12 ]

Yes, tuning the threads for performance is important in the failover case. Yes, reducing threads may be necessary for some hardware. The issue in LU-436 was resolved by reducing the threads, as reported in that bug.
I think you have the same issue and should reduce threads.
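If a lower thread count helps, it can be made persistent with a module option so the limit applies from the next mount onward. A sketch, assuming the 1.8-era oss_num_threads parameter of the ost module (the value is an example, and the parameter name should be verified with modinfo against the installed modules):

# /etc/modprobe.conf on each OSS
options ost oss_num_threads=128

# Confirm the installed module recognizes the parameter
modinfo ost | grep -i num_threads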

Comment by Cliff White (Inactive) [ 19/Apr/12 ]

How are you doing on this issue? Would a call or Skype conference help?

Comment by Cliff White (Inactive) [ 29/Apr/12 ]

What is your current state? Do you still need help?

Comment by Cliff White (Inactive) [ 10/May/12 ]

I am going to resolve this issue, please re-open if you have further information.
