[LU-1504]  The /lustre filesystem was unusable for an extended period due to a single OST dropping out of service Created: 11/Jun/12  Updated: 15/Feb/13  Resolved: 15/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
Fix Version/s: None

Type: Task Priority: Major
Reporter: Archie Dizon Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: performance
Environment:

Clustering


Attachments: File headnode-messages.gz     File lustre-failure-120619-1.gz     File mds0a-messages.gz     File oss3-vmstat.log     File oss4-messages.gz    
Epic: hang
Rank (Obsolete): 4043

 Description   

Hello Support,

One of our customers, the University of Delaware, has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:

Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself, sometimes it does not and we need to reboot one or more OSS nodes.

"Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."

You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

"https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"

Full information for the drop-off:

Claim ID: vuAFoSBUoReVuaje
Claim Passcode: RfTmXJZFVdUGzbLk
Date of Drop-Off: 2012-06-11 12:23:20-0400

Please review the attached log files and advise us on the next course of action, since this is a very critical issue impacting their environment. Also, please let me know if you need any further info.

Thanks
Terry
Penguin Tech Support
Ph: 415-954-2833



 Comments   
Comment by Cliff White (Inactive) [ 11/Jun/12 ]

First, as explained in the Lustre Manual, we can't use the lustre-log information unless you pre-process it.
Use "# lctl df <infile> <outfile>", where <infile> is the log file and <outfile> is whatever name you choose.
Second, we really need complete system logs, not just the Lustre messages. A log from at least one client would also help. Please attach the logs, for the servers and a client, from June 11 only.

Comment by Archie Dizon [ 11/Jun/12 ]

Hello Cliff,

"Pre-processed lustre-log files, per request of WhamCloud."

You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

"https://pandora.nss.udel.edu//pickup.php?claimID=j47zLKRJwQMqRFkA&claimPasscode=iNY2YwUBujk2hGpY&emailAddr=tsingh%40penguincomputing.com"

Can you be more specific w.r.t. what system logs you need? I sent along the "/var/log/messages*" and "/var/log/dmesg" files, what additional ones do you need?

Thanks
Terry

Comment by Cliff White (Inactive) [ 11/Jun/12 ]

As I said, we need the complete /var/log/messages, just for the day of the incident. Your logs contained only Lustre messages; we really need to see the complete logs.

Comment by Cliff White (Inactive) [ 11/Jun/12 ]

Looking at these logs, I am quite confused. It looks like your system has not run cleanly for quite some time.
When was the last time the filesystem was restarted? As I look in the logs, I see quite a few network errors:

May 20 04:04:55 oss4 kernel: Lustre: 8468:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1401607688579645 sent from MGC10.55.31.1@o2ib to NID 10.55.31.2@o2ib 0s ago has failed due to network error (5s prior to deadline).
If I grep 'network error' from only the oss4 logs, I see 4217 such messages. That's rather a lot.
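(For reference, a count like that can be reproduced with something along these lines against the oss4 syslog; the exact log path is an assumption:)

# grep -c 'network error' /var/log/messages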
I also see other indications of things being overloaded:
May 20 04:04:55 oss4 kernel: Lustre: 8468:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1401607688579645 sent from MGC10.55.31.1@o2ib to NID 10.55.31.2@o2ib 0s ago has failed due to network error (5s prior to deadline).

I can't really tell where your problem starts; as near as I can tell, the system has been running in some degraded fashion at least since May 20th.
Are you seeing errors in logs now? Can you validate that your network is healthy? I would consider a restart of the filesystem, to get yourself into a known stable state.

Comment by Archie Dizon [ 11/Jun/12 ]

Hello Cliff,

I sent /var/log/messages from the Mills head node to you via Dropbox. The compute nodes do not have local syslog files, they output via rsyslogd to the head node, where it drops into /var/log/messages along with the syslog from the head node itself. So the file I sent essentially includes syslog data from all the LUSTRE clients over the last ~24 hours.

You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

"https://pandora.nss.udel.edu//pickup.php?claimID=46ce3AvsMgKuk0KQ&claimPasscode=Jv6u3TXkTXR14ctJ&emailAddr=tsingh%40penguincomputing.com"

Thanks
Terry

Comment by Cliff White (Inactive) [ 11/Jun/12 ]

In the future, please use our anonymous FTP site for uploads (ftp.whamcloud.com). Indicate the exact file name uploaded in the bug comment.

When was the filesystem last restarted?
Can you validate that your network is healthy?

Comment by Cliff White (Inactive) [ 11/Jun/12 ]

Also, are the system clocks on the servers and clients in sync?

Comment by Archie Dizon [ 12/Jun/12 ]

Hello Cliff,

Please find the customer's response in-line below. Also, would it be possible for you to work with the customer directly, to avoid any delays? If so, let me know and I'll provide you with the contact info.

> Looking at these logs, I am quite confused. It looks like your system has not run cleanly for quite some time.

That's an understatement. We've been rebooting OSS's and the MDS at least 3 times a week for the last 3 months. This system has never run in a way that we'd term "clean."

> When was the last time the filesystem was restarted?

Last week we did the disk and controller firmware updates – so that was the last reboot. The filesystem ran without major issue for a couple days before we again had to restart OSS's due to hung RPCs, etc.

> May 20 04:04:55 oss4 kernel: Lustre: 8468:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1401607688579645 sent from MGC10.55.31.1@o2ib to NID 10.55.31.2@o2ib 0s ago has failed due to network error (5s prior to deadline).
> If i grep 'network error' from only the oss4 logs, i see 4217 such messages. That's rather a lot.

Yes, I agree. But "network error" really doesn't say much; that's why we asked WhamCloud to analyze the situation.

10.55.31.2 is "mds0b-ib", the standby MDS server. 10.55.31.1 is "mds0a-ib", the active MDS server. So why is "mds0a" trying to talk to "mds0b"? Since "mds0b" is standby under the heartbeat service, there's no reason to expect that LUSTRE communications to it should succeed – there are no active LUSTRE services on it.

> I also see other indications of things being overloaded:
> May 20 04:04:55 oss4 kernel: Lustre: 8468:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1401607688579645 sent from MGC10.55.31.1@o2ib to NID 10.55.31.2@o2ib 0s ago has failed due to network error (5s prior to deadline).

So the same error message is indicative of both "network errors" as well as "overload"? Or did they mean to cite a different error message?

The last ticket we opened with WhamCloud was w.r.t. the "overload" we were observing under the default LUSTRE configuration as the system was built by Penguin. Originally our idea of lowering the max thread limits was deemed not likely to help anything; a few weeks later WhamCloud's solution was to lower the max thread limits, just as we had originally proposed. Obviously that hasn't helped matters.

> I can't really tell where you problem starts, as near as I can see the system was running in some degraded fashion at least since May 20th.

This is also very vague: "some degraded fashion" does not tell us how this cluster of services is degraded, that's the kind of info we were hoping WhamCloud could provide.

> Are you seeing errors in logs now? Can you validate that your network is healthy? I would consider a restart of the filesystem, to get yourself into a known stable state.

As you can see in the logs we sent, errors are a somewhat continuous affair on these systems.

As mentioned, we just did a full restart last week. Should we not expect LUSTRE 1.8.6 to stay stable for more than 3 or 4 days, rather than needing a full restart that often? A restart temporarily alleviates problems, but doesn't truly fix anything.

> Also, are the system clocks on the servers and clients in sync?

Yes, they are all sync'ed to the head node's NTP service.

[root@mills 1001]# for n in mds0a oss{1,2,3,4}; do ssh $n /usr/sbin/ntpq -c peer; done
remote refid st t when poll reach delay offset jitter
==============================================================================
*scyld.localdoma 128.175.60.175 2 u 608 1024 377 0.062 -0.078 0.074
remote refid st t when poll reach delay offset jitter
==============================================================================
*scyld.localdoma 128.175.60.175 2 u 425 1024 377 0.146 -0.365 0.052
remote refid st t when poll reach delay offset jitter
==============================================================================
*scyld.localdoma 128.175.60.175 2 u 665 1024 377 0.083 -0.195 0.062
remote

Thanks
Terry

Comment by Cliff White (Inactive) [ 12/Jun/12 ]

Please send me the contact information. We're always glad to help.

Comment by Archie Dizon [ 13/Jun/12 ]

Here is the customer's contact information.

Contact Name: Jeffrey Frey
Contact Phone: 302-831-6034
Email: frey@udel.edu

Comment by Archie Dizon [ 14/Jun/12 ]

Hello Cliff,

Were you able to contact Jeffrey Frey at UoD? Also, do you need any further info to resolve this issue? I don't see any update on the case.

Thanks
Terry

Comment by Peter Jones [ 14/Jun/12 ]

Archie

I have tried to reach Jeffrey by both email and phone, but with no luck so far. Please contact me directly at peter@whamcloud.com with any further questions of this nature.

Thanks

Peter

Comment by Archie Dizon [ 15/Jun/12 ]

Hi Peter,

I am following up with Jeff and his team at U Del. It was brought to my attention that you were attempting to get a hold of me, or possibly Terry? Please let me know if you need anything.
Also, I have provided your email address to Jeff.

Thanks
Archie


Archie Dizon
Technical Support

Penguin Computing Inc.
45800 Northport Loop West
Fremont, CA 94538
support@penguincomputing.com
Phone: 1-888-Penguin (736-4846)

Comment by Minh Diep [ 15/Jun/12 ]

Could you provide the following information?

1. modprobe.conf or lnet/lustre.conf
2. output of lctl list_nids from all servers and at least 1 client
3. do lctl ping between servers and 1 client
4. output of lctl show_route
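A rough sketch of how to gather items 2-4 on each node (the NID shown is just the head node's, as an example; adjust NIDs to your setup):

# lctl list_nids
# lctl ping 10.55.0.1@o2ib
# lctl show_route

For item 1, simply attach /etc/modprobe.conf (or the lnet/lustre .conf file under /etc/modprobe.d, wherever the options lines live).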

Thanks

Comment by Peter Jones [ 16/Jun/12 ]

From the customer site

"
1. modprobe.conf entries for each of the four OSS servers have the following:
options ost oss_num_threads=48
We have then been bumping the threads_max variable to 100 after booting.

2. lctl list_nids for each server and 1 client
mds0a: 10.55.31.1@o2ib
mds0b: opening /dev/lnet failed: No such device
hint: the kernel modules may not be loaded
IOC_LIBCFS_GET_NI error 19: No such device
oss1: 10.55.32.0@o2ib
oss2: 10.55.32.1@o2ib
oss3: 10.55.32.2@o2ib
oss4: 10.55.32.3@o2ib
headnode: 10.55.0.1@o2ib

3. lctl ping between servers and headnode
mds0a to headnode
12345-0@lo
12345-10.55.0.1@o2ib
mds0b to headnode
opening /dev/lnet failed: No such device
hint: the kernel modules may not be loaded
failed to ping 10.55.0.1@o2ib: No such device
Exit 1
oss1 to headnode
12345-0@lo
12345-10.55.0.1@o2ib
oss2 to headnode
12345-0@lo
12345-10.55.0.1@o2ib
oss3 to headnode
12345-0@lo
12345-10.55.0.1@o2ib
oss4 to headnode
12345-0@lo
12345-10.55.0.1@o2ib

headnode to 10.55.31.1@o2ib
12345-0@lo
12345-10.55.31.1@o2ib
headnode to 10.55.32.0@o2ib
12345-0@lo
12345-10.55.32.0@o2ib
headnode to 10.55.32.1@o2ib
12345-0@lo
12345-10.55.32.1@o2ib
headnode to 10.55.32.2@o2ib
12345-0@lo
12345-10.55.32.2@o2ib
headnode to 10.55.32.3@o2ib
12345-0@lo
12345-10.55.32.3@o2ib

4. output of lctl show_route
no output in any of the ways I ran it.

Also, we are using a Qlogic QDR Infiniband network if that wasn't mentioned before. "

Comment by Peter Jones [ 16/Jun/12 ]

More info from customer site

"I'll try to provide some context now if that will help. We have two MDS servers in an HA configuration. We also have four OSS servers (oss[1-4]) where oss1/2 in an HA config and oss3/4 are another pair. Each OSS has six OSTs. The initial problem we started seeing in February was that the load would spike on one OSS (as high as 500 at times) and the other OSS would take over. We lowered the thread count (max of 100) and we are still seeing the load spike as high as 99 at times and one OSS takes over for another. This is probably more an issue with the HA solution provided by Penguin than with Lustre per se though. We then started to have problems where disks would go offline in the DotHill enclosures. This seems to have been fixed finally with a recent Seagate firmware update.

We continue to see OSTs go offline, however. Sometimes it is temporary and will recover. Other times it is not, and our solution has been to reboot one or both OSSs of the pair involved. The recovery after a reboot is quick at times (around 5 minutes); other times it is more like an hour. A lot of the logs Jeff sent you involve the errors when these events happen. Can you tell from the logs what might be a cause?

We are running Lustre 1.8.6. Are there known bugs with this version that we are hitting? Would an upgrade (1.8.7 or 2.x) solve any problems we are seeing? We also had some questions earlier about what the stripe size should be and are thinking about that again now also.

We are also exploring providing a non-lustre solution to our users so they can get some productive work done. The many lustre outages have been affecting our users greatly. "

Comment by Cliff White (Inactive) [ 18/Jun/12 ]

When you say 'the system takes an hour to recover' after a server failover, what data are you using to measure this? Is this the server load factor, or some other measure? We need to understand how you define 'recover' in this case.
Can you give us more details on the client workload? What are the typical IO sizes from the clients?

We need to gather some more data on block IO performance. We can do this directly if you can give us login access to the system. Otherwise, from each server, please obtain and attach the following:
1. The output of 'ps ax' when the server load is spiking.
2. On each server there is, for each OST, a 'brw_stats' file located at /proc/fs/lustre/obdfilter/<OST ID>/brw_stats, where <OST ID> is of the format lustre-OST00XX. For each of your OSTs, from a login on the OSS, please issue the command:
'# cat /proc/fs/lustre/obdfilter/<OST ID>/brw_stats > <file>'
where <file> is the OST ID. Please do this when the server load is high.
3. The clients capture data on IO transactions, again per OST; the relevant client file is /proc/fs/lustre/osc/<OST ID>/rpc_stats, where <OST ID> should again be replaced by your OST IDs (lustre-OST0000). On several of the affected clients, for each OST under the osc/ directory, please run:
'# cat /proc/fs/lustre/osc/<OST ID>/rpc_stats > <file>'
and again please name the files so as to indicate which OST they refer to.
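A shell loop along these lines would capture every OST in one pass (a rough sketch; the /tmp output location is arbitrary and the wildcards assume the default proc layout):

On each OSS:
# for f in /proc/fs/lustre/obdfilter/*/brw_stats; do cat $f > /tmp/$(basename $(dirname $f)).brw_stats; done

On each sampled client:
# for f in /proc/fs/lustre/osc/*/rpc_stats; do cat $f > /tmp/$(basename $(dirname $f)).rpc_stats; done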

Comment by Ben Miller [ 19/Jun/12 ]

I define recovery time as the time from when I notice that /lustre is unavailable until it is available again after rebooting servers. Typically /lustre is unavailable for a few hours, then I'll reboot one OSS that has an OST unavailable. Sometimes the other in the HA pair also needs to be rebooted. Sometimes one of the OSSs needs to be rebooted twice. Sometimes the Lustre recovery time after a reboot is quick, and other times it takes quite a while for all clients to recover.
We suspect the client workload is high at the time of failure. IO sizes vary widely: many users are working with lots of small files, others are using larger files.

OSS4 took over for OSS3 last night and the load on OSS4 was at 100 this morning. I recovered by starting the heartbeat service on oss3, rebooting oss4, starting heartbeat on oss4, and then waiting. The OSS3 OSTs became available, but none of the OSS4 ones did after 15 minutes. So I rebooted oss4 again, restarted heartbeat, and waited another 15 minutes. None of the OSS4 OSTs are available yet. OSS4 has logs like this:

Jun 19 08:03:08 oss4 kernel: LustreError: 8147:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81061bf68000 x1405203837354187/t0 o101->MGS@MGC10.55.31.1@o2ib_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Jun 19 08:03:08 oss4 kernel: LustreError: 8147:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 23 previous similar messages
Jun 19 08:03:09 oss4 kernel: Lustre: 7826:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1405203837354199 sent from MGC10.55.31.1@o2ib to NID 10.55.31.2@o2ib 0s ago has failed due to network error (5s prior to deadline).
Jun 19 08:03:09 oss4 kernel: req@ffff810621afc400 x1405203837354199/t0 o250->MGS@MGC10.55.31.1@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1340107394 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 19 08:03:09 oss4 kernel: Lustre: 7826:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 42 previous similar messages
Jun 19 08:04:41 oss4 kernel: LustreError: 137-5: UUID 'lustre-OST000c_UUID' is not available for connect (no target)

OST000c is on OSS3 and is available to clients. OST00{12-17} are currently not available.

I have attached a gzipped log of the data you asked for below (captured prior to rebooting this morning). Would you like syslogs also?

thanks,
Ben

Comment by Ben Miller [ 19/Jun/12 ]

And then, right after I sent this email, the OSTs finally became available and /lustre is back online, after working on it for about 40 minutes.

Ben

Comment by Cliff White (Inactive) [ 19/Jun/12 ]

Please attach the system log for oss4, starting from a time prior to this morning's reboot. Also please attach the logs from the MDS and at least one client, for the same time period.

Comment by Ben Miller [ 19/Jun/12 ]

Logs are attached. The headnode-messages file has syslogs for the
headnode and all the compute nodes.

Ben

Comment by Cliff White (Inactive) [ 19/Jun/12 ]

Can you run "vmstat 5 100 > vmstat.log" on one of the busy OSTs and attach the output?

Comment by Ben Miller [ 19/Jun/12 ]

Here it is for oss3.

Ben

procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 3679452 2271020 16719004 0 0 12 60 49 37 0 1 99 0 0
1 0 0 3665612 2271220 16731472 0 0 12 2683 5492 22904 0 2 98 0 0
1 0 0 3586076 2270824 16823820 0 0 14 19930 6164 24593 0 8 92 0 0
1 0 0 3583992 2270992 16826752 0 0 14 767 6706 30467 0 10 90 0 0
2 0 0 3572604 2271160 16839324 0 0 2 2713 6588 29624 0 10 90 0 0
1 0 0 3553672 2271488 16857124 0 0 10 4235 5925 25023 0 10 90 0 0
1 0 0 3551728 2271080 16858028 0 0 7 106 6971 31704 0 10 90 0 0
1 0 0 3526424 2270872 16885768 0 0 2 5778 7662 33085 0 10 90 0 0
1 0 0 3519940 2270736 16893616 0 0 5 1845 4776 20677 0 8 92 0 0
1 0 0 3518544 2270376 16896276 0 0 12 539 5162 22691 0 10 90 0 0
1 0 0 3523576 2270200 16887492 0 0 9 304 6025 26623 0 10 90 0 0
2 0 0 3523612 2270296 16888852 0 0 7 405 7089 32613 0 10 90 0 0
2 0 0 3536784 2270416 16874592 0 0 11 530 5358 24228 0 10 90 0 0
1 0 0 3536212 2270536 16875516 0 0 16 387 5351 23709 0 3 97 0 0
1 0 0 3518632 2270568 16891360 0 0 16 3240 5485 24006 0 7 93 0 0
2 0 0 3504568 2270840 16906664 0 0 6 3297 7673 35050 0 10 90 0 0
2 0 0 3489320 2271104 16922432 0 0 13 3536 5865 25530 0 10 90 0 0
1 0 0 3487380 2271320 16925436 0 0 19 540 6345 29010 0 10 90 0 0
2 0 0 3477772 2271420 16931608 0 0 5 1420 7390 33680 0 10 90 0 0
2 0 0 3458112 2271916 16952220 0 0 50 4638 6207 26096 0 10 90 0 0
2 0 0 3318852 2273780 17087976 0 0 6 34498 6682 22626 0 10 90 0 0
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 3318732 2273556 17089888 0 0 0 253 5681 25024 0 10 90 0 0
2 0 0 3281832 2273992 17125828 0 0 4 7547 7094 30202 0 10 90 0 0
2 0 0 3272004 2274216 17136424 0 0 12 2268 6500 28415 0 10 90 0 0
2 0 0 3256964 2274112 17151664 0 0 4 3198 5621 23499 0 10 90 0 0
1 0 0 3246400 2273744 17162628 0 0 6 2677 5627 23582 0 10 90 0 0
1 0 0 3235624 2273240 17171196 0 0 6 2217 7079 32173 0 10 90 0 0
1 1 0 3245212 2273036 17163072 0 0 10 5155 6201 27244 0 9 91 0 0
1 0 0 3214192 2273040 17193544 0 0 22 6617 6234 26471 0 10 90 0 0
1 0 0 3191088 2272920 17217188 0 0 16 4946 7279 32868 0 10 90 0 0
1 0 0 3176340 2272944 17231044 0 0 4 2886 6947 31922 0 10 90 0 0
1 0 0 3173472 2273192 17234444 0 0 22 1059 4799 20868 0 10 90 0 0
1 0 0 3172304 2272816 17236180 0 0 14 400 5268 23325 0 10 90 0 0
2 0 0 3172920 2272408 17236632 0 0 11 138 5970 26801 0 10 90 0 0
2 0 0 3167624 2272056 17240424 0 0 13 825 7096 31001 0 10 90 0 0
1 0 0 3165220 2272152 17243068 0 0 7 682 5149 22487 0 10 90 0 0
2 0 0 3163628 2271960 17245264 0 0 1 490 5443 24123 0 10 90 0 0
2 0 0 3160612 2271760 17248636 0 0 2 838 5787 26236 0 10 90 0 0
1 0 0 3156912 2271872 17251540 0 0 7 907 7176 33221 0 10 90 0 0
1 0 0 3156260 2271960 17252712 0 0 12 266 5587 25427 0 10 90 0 0
1 0 0 3153276 2271872 17255972 0 0 12 823 6403 29209 0 10 90 0 0
2 0 0 3151996 2272048 17257488 0 0 21 393 7237 33723 0 10 90 0 0
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3147760 2272128 17259880 0 0 5 563 5479 24913 0 10 90 0 0
1 0 0 3146884 2272280 17262168 0 0 10 564 4963 21783 0 10 90 0 0
1 0 0 3146156 2272400 17261772 0 0 10 461 5735 25777 0 10 90 0 0
1 0 0 3101968 2272896 17305052 0 0 833 9286 7095 31214 0 10 90 0 0
1 0 0 3143424 2273096 17263004 0 0 22 717 6008 27080 0 10 90 0 0
1 0 0 3144020 2273172 17263596 0 0 0 234 5288 23557 0 10 90 0 0
1 0 0 3142292 2273340 17265644 0 0 6 547 5339 23535 0 10 90 0 0
2 0 0 3141508 2273484 17266848 0 0 11 305 7273 33868 0 10 90 0 0
1 0 0 3140376 2273508 17267744 0 0 6 202 5724 26090 0 10 90 0 0
1 0 0 3096288 2274368 17311792 0 0 229 19058 7161 27260 0 10 90 0 0
1 0 0 3091776 2274488 17314492 0 0 9 1066 6976 31663 0 7 93 0 0
2 0 0 3069832 2274664 17326844 0 0 2 2754 6882 30216 0 2 98 0 0
1 0 0 3056844 2274840 17331568 0 0 17 971 4705 20198 0 2 98 0 0
1 0 0 3109352 2274788 17294964 0 0 7 2655 5362 22722 0 2 98 0 0
3 0 0 3015684 2274916 17391628 0 0 10 20679 6667 27656 0 7 93 0 0
1 0 0 3014468 2274068 17392884 0 0 16 546 7143 32296 0 9 91 0 0
1 0 0 3050608 2273996 17354228 0 0 16 788 5179 22799 0 8 92 0 0
1 0 0 3127024 2273964 17273668 0 0 12 3244 5571 24207 0 9 91 0 0
2 0 0 3111308 2273820 17292936 0 0 20 4055 6203 25874 0 9 91 0 0
2 0 0 3090120 2273356 17315624 0 0 23 5059 7564 32830 0 10 90 0 0
2 0 0 3089884 2272900 17317324 0 0 2 462 5458 24031 0 8 91 0 0
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3088984 2272980 17319600 0 0 6 459 6475 28349 0 10 90 0 0
2 0 0 2995780 2273492 17412792 0 0 5 19180 8460 34220 0 11 89 0 0
3 0 0 2985708 2272916 17424072 0 0 8 2332 5648 24187 0 10 90 0 0
4 0 0 2965952 2272964 17443116 0 0 10 4574 5281 21170 0 10 90 0 0
1 0 0 2970440 2272636 17433200 0 0 8 2831 6166 26640 0 7 93 0 0
2 0 0 2961988 2272856 17443840 0 0 2 2190 6840 30713 0 10 90 0 0
1 0 0 2957244 2272332 17450308 0 0 2 2953 6066 26143 0 10 90 0 0
1 0 0 2952340 2272140 17456128 0 0 6 2278 5487 23197 0 6 94 0 0
0 0 0 2951920 2271764 17454436 0 0 15 228 5394 23413 0 8 92 0 0
2 0 0 2951156 2271348 17454896 0 0 8 146 7381 33632 0 8 92 0 0
2 0 0 2942704 2271380 17465200 0 0 25 4042 5934 25341 0 10 90 0 0
1 0 0 2938772 2271204 17469672 0 0 8 2420 6135 26676 0 10 90 0 0
2 0 0 2929464 2270848 17478452 0 0 17 4628 7567 32596 0 10 90 0 0
1 0 0 2922180 2270536 17486124 0 0 9 3868 7044 30435 0 10 90 0 0
2 0 0 2909516 2270640 17499912 0 0 3 4629 4907 20917 0 10 90 0 0
2 0 0 2900304 2270840 17509008 0 0 10 4393 5868 24716 0 10 90 0 0
2 0 0 2901412 2270936 17505024 0 0 4 1712 6456 28808 0 10 90 0 0
1 0 0 2891168 2270688 17516180 0 0 30 4203 7167 30894 0 10 90 0 0
2 0 0 2916592 2270792 17490216 0 0 10 82 5253 23086 0 10 90 0 0
2 0 0 2903988 2270696 17504200 0 0 826 2253 5760 24081 0 10 90 0 0
1 0 0 2893892 2270720 17512492 0 0 6 5798 6869 28962 0 10 90 0 0
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 2881068 2270600 17526988 0 0 2 3694 6850 30249 0 10 90 0 0
1 0 0 2878404 2270156 17530512 0 0 8 960 5641 25202 0 10 90 0 0
1 0 0 2868848 2269928 17540368 0 0 3 3522 7001 30809 0 8 92 0 0
2 0 0 2864540 2269460 17540948 0 0 9 2708 7399 33274 0 7 93 0 0
1 0 0 2863940 2269336 17543248 0 0 13 710 5043 22031 0 10 90 0 0
1 0 0 2866248 2268912 17541764 0 0 4 525 4962 21800 0 10 90 0 0
2 0 0 2864944 2268820 17544228 0 0 838 734 6114 27383 0 10 90 0 0
1 0 0 2853648 2268872 17552824 0 0 16 4321 7043 31961 0 10 90 0 0
1 0 0 2842860 2268960 17566128 0 0 8 4648 5628 24462 0 10 90 0 0
2 0 0 2834824 2268956 17574452 0 0 6 5086 5371 23136 0 10 90 0 0
2 0 0 2834492 2269184 17575464 0 0 17 670 6318 26323 0 10 90 0 0
3 0 0 2832544 2269092 17575608 0 0 10 406 8650 38921 0 10 90 0 0
1 0 0 2830952 2269732 17577252 0 0 10 2411 6211 27698 0 10 90 0 0
2 0 0 2827412 2269956 17582376 0 0 10 1793 6197 27999 0 10 90 0 0
1 0 0 2827604 2270180 17582404 0 0 9 1563 7256 32327 0 10 90 0 0
1 0 0 2824660 2270432 17582720 0 0 5 1230 6122 27703 0 10 90 0 0

Comment by Cliff White (Inactive) [ 19/Jun/12 ]

Thanks for providing the information so far. Analyzing the results has been useful in understanding what is going on.

From the IO statistics, we are seeing a great deal of small file IO (4k io size), and very little parallel IO (most of the time only 1 IO in flight). This is not an optimal IO model for Lustre - any steps you can take on the application side to increase IO size, eliminate excessive flush() or sync() calls, or otherwise allow the filesystem to aggregate larger IO will help to improve your performance.

Given this IO pattern the Lustre read cache – which is on by default - may be doing more harm than good. To turn it off please run the command "lctl set_param obdfilter.*.read_cache_enable=0" on all OSS nodes.
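To confirm the setting took effect, you could read it back, e.g. (a quick check; note that set_param values are not persistent, so this would need to be re-applied if a server is rebooted):

# lctl get_param obdfilter.*.read_cache_enable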

Finally, we recommend an immediate upgrade to Lustre 1.8.8-wc1 as this release contains optimizations for small file IO (see http://jira.whamcloud.com/browse/LU-983).  The Lustre 1.8.6-wc1 and 1.8.8-wc1 releases are completely compatible, so you can do a rolling upgrade of your systems without needing any downtime.

Please let me know if you have any further questions about any of the above and let us know whether this advice helps.

Comment by Ben Miller [ 19/Jun/12 ]

OK, thanks. I turned off the read cache. We'll let you know if it helps. We'll discuss upgrading internally soon.

Ben

Comment by Jeffrey Frey [ 20/Jun/12 ]

Cliff:

With respect to upgrading the LUSTRE release on these production servers:

(1) Since these systems have QLogic's own OFED installed on them, we would want to build the LUSTRE 1.8.8 server components from scratch, correct? It appears this is what Penguin did when the system was built:

[root@oss1 src]# cd /usr/src
[root@oss1 src]# ls
aacraid-1.1.7.28000 debug kernels ofa_kernel ofa_kernel-1.5.3 openib redhat
[root@oss1 src]# ls kernels/
2.6.18-238.12.1.el5_lustre.g266a955-x86_64 2.6.18-238.el5-x86_64

(2) For rolling upgrades, we're assuming the general order of operations would be:

  • upgrade spare MDS
  • failover MDS to upgraded spare, upgrade primary MDS
  • failover oss1 => oss2; upgrade oss1; failover oss2 => oss1; upgrade oss2
  • failover oss3 => oss4; upgrade oss3; failover oss4 => oss3; upgrade oss4

We'd appreciate citations of any important/helpful materials on the subject of LUSTRE rolling upgrades.

Thanks for any and all information/feedback you can provide.

::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer IV / Cluster Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
http://turin.nss.udel.edu/
::::::::::::::::::::::::::::::::::::::::::::::::::::::

Comment by Cliff White (Inactive) [ 20/Jun/12 ]

As far as we know, Qlogic cards are supported by the OFED supplied with the RedHat kernel - was there a reason given for using an external OFED?
Please give us the model numbers of the cards.
Your upgrade procedure is correct, and identical to that in the Lustre Manual. The change from 1.8.6 to 1.8.8 is a point release; there are absolutely no concerns about compatibility and no need for any special work. The clients may require a RedHat kernel update to match our client kernel.
There is, afaik, not much information beyond the Lustre Manual; the upgrade should not be difficult.
Please let us know when you are planning this, and we'd be glad to have a conference call or furnish other help.
Has disabling the read cache produced any change in the performance?

Comment by Jeffrey Frey [ 20/Jun/12 ]

QLogic's OFED stack is a superset of the stock OFED distribution. It includes value-added utilities/libraries and bug-fixes [that OFED hypothetically rolls back into the baseline distribution].

When we initially set up the head node with ScientificLinux 6.1, the stock RDMA kernel module kept the machine from booting (kernel panic every time). Wiping the RHEL IB stack and replacing it with the QLogic one for RHEL 6.1 fixed the issue.

As configured by Penguin, the LUSTRE servers are all running CentOS 5.x.

[root@oss1 ~]# ibstat
CA 'qib0'
CA type: InfiniPath_QLE7340
Number of ports: 1
Firmware version:
Hardware version: 2
Node GUID: 0x001175000077592a
System image GUID: 0x001175000077592a
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 219
LMC: 0
SM lid: 1
Capability mask: 0x07610868
Port GUID: 0x001175000077592a
Link layer: IB

So in your (Whamcloud's) experience there have never been any instances of data loss due to upgrading like this on-the-fly while the filesystem is still online?

We're using the loadable module variant of the client rather than building a LUSTRE kernel for the clients or using a Whamcloud kernel. Is there anything about the 1.8.8 server that demands 1.8.8 clients? Upgrading all clients = lots of downtime for our users and they've already experienced a LOT of downtime/lost time thanks to LUSTRE.

Comment by Cliff White (Inactive) [ 20/Jun/12 ]

We have never had data loss from a rolling upgrade, or from a point upgrade of this type. We routinely test these point release upgrade/downgrades as a part of the release process for each new Lustre release.

If there are other dependencies involved (such as Qlogic tools) then it would be advisable to involve Penguin in the decision-making process about how to proceed.

While upgrading your Lustre version provides the benefit of the other fixes included, it is quite possible to apply the exact fix you need to your existing version of Lustre, which may be less disruptive to other layers of the stack.

Our clients use a stock RedHat kernel and you can certainly do a gradual upgrade of the clients as your schedule permits; there is no problem running with a mix of client versions.

Before embarking on such an upgrade path you could even experiment with a single client to see whether the pay-off warrants the effort. To do this effectively you would need to identify an example of the kind of workload that triggers this issue so that you can measure the before and after effect of applying this fix. Do you have any idea as to a likely candidate for such a test?

Comment by Cliff White (Inactive) [ 27/Jun/12 ]

What is your current state? What else can we do to assist?

Comment by Cliff White (Inactive) [ 05/Jul/12 ]

Please update us as to your status. What else can we do to help on this issue? Should we close this bug?

Comment by Ben Miller [ 06/Jul/12 ]

We continue to have temporary OST unavailability with the read cache disabled. We haven't had a chance to upgrade to 1.8.8 yet. Sometimes the OST will recover within a few minutes; other times, after several minutes (or hours), we end up rebooting an OSS or two to get Lustre available again.

Ben

Comment by Jeffrey Frey [ 12/Jul/12 ]

Cliff:

Further research on your Jira site has shown that the issue we're seeing is EXACTLY the situation reported in LU-416:

http://jira.whamcloud.com/browse/LU-416

To what version of Lustre does that incident correspond – 1.8.6? – and is it resolved in 1.8.8?

Comment by Oleg Drokin [ 12/Jul/12 ]

I believe your problem does not look like LU-416, which was a client-side problem (and one specific to the 2.x releases only, at that).
You are clearly having a problem on the server side.
It looks like your disk devices are overloaded with small IO, and there is very little Lustre can do about it, because apparently this is what the application in question wants.

Comment by Jeffrey Frey [ 12/Jul/12 ]

Interesting that this is a 2.x problem; the fact remains, though, that the symptoms we observe on the client side are EXACTLY the same as the ticket I mentioned, so we were hoping that would provide you folks with a better idea of what's going on. The problem seems like it has to be partly client-oriented, since 185 of the 200 Lustre clients were using the OST without issue while 15 of the clients showed it as unavailable. On the cluster head node where we noticed this recent outage:

[root@mills lustre-OST0006-osc-ffff880d13d29000]# pwd
/proc/fs/lustre/ldlm/namespaces/lustre-OST0006-osc-ffff880d13d29000

[root@mills lustre-OST0006-osc-ffff880d13d29000]# more lock_*
::::::::::::::
lock_count
::::::::::::::
99554
::::::::::::::
lock_unused_count
::::::::::::::
0

Should we read this statement as "use Lustre for applications with a couple of large files to access"? There is no single application here, since this is a community cluster with many different applications, and Lustre was intended by us to be used as working space for users' jobs on the cluster (à la scratch).

Comment by Cliff White (Inactive) [ 12/Jul/12 ]

We have already recommended that you upgrade to 1.8.8 to improve small file IO. We understand that you run a mix of workloads; we believe that if you could identify which workload is doing the random 4k IO, changes to that workload would help the situation.

Comment by Jeffrey Frey [ 12/Jul/12 ]

Thanks, Cliff. We cannot "jump" into the 1.8.8 upgrade for two important reasons:

(1) We like to be very aware of what's changed in a software package before we install it on a production system – so we have some reading ahead of us to get familiar with the 1.8.6 to 1.8.8 roadmap

(2) This is a production system with users' computation work running out of the Lustre filesystem – any disruption is a major issue for our users, so we try to minimize those

In theory a rolling upgrade works flawlessly – but in theory the hardware we were sold behind this Lustre filesystem is enterprise-class, fast storage, too :-\ So our users have to be notified of the possibility of failure, which means that if there could be downtime involved we may as well do other patching at the same time. It just so happens QLogic released a firmware update for our switch recently. We plan to do the 1.8.8 upgrade within the next month; we just have some planning and scheduling to do first.

As for the random 4k i/o, for our sake can you show applicable portions of the logs/debug traces that led to that diagnosis? In context, is it all block-level i/o to/from files, or could it be standard filesystem ops (ls, cp, tar, etc.), e.g. also fetching size metadata from the OST(s)?

Comment by Cliff White (Inactive) [ 12/Jul/12 ]

The random 4k IO was identified in the brw_stats which you furnished to us - to quote that comment (18/Jun/12 11:18 AM):
"2. On each server there is, for each OST, a 'brw_stats' file located at /proc/fs/lustre/obdfilter/<OST ID>/brw_stats, where <OST ID> is of the format lustre-OST00XX. For each of your OSTs, from a login on the OSS, please issue the command:
'# cat /proc/fs/lustre/obdfilter/<OST ID>/brw_stats > <file>'
where <file> is the OST ID. Please do this when the server load is high."

You can examine these files using that method at any time to see how your IO is performing, and can clear the data by echoing 'clear' into the file.
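For instance, to reset the counters on an OSS before capturing a fresh sample window (the OST name here is only an example):

# echo clear > /proc/fs/lustre/obdfilter/lustre-OST0012/brw_stats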
In the data you previously furnished to us, we see this, as an example:

# cat lustre-OST0012
snapshot_time:         1340104119.496271 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
1:                  491976  73  73   | 408911  59  59
2:                    4588   0  74   | 16816   2  62
4:                    1431   0  74   | 10062   1  63
8:                    1156   0  74   | 19221   2  66
16:                   1077   0  74   | 7433   1  67
32:                   1731   0  74   | 13036   1  69
64:                   2898   0  75   | 8544   1  70
128:                  5149   0  76   | 12009   1  72
256:                160523  23 100   | 187956  27 100

A 'page' in this case is a 4k memory page. So we can see here that 73% of your read IO and 59% of your write IO was 4k in size. This is rather more than we would expect from the odd ls, etc., and likely indicates an application behavior. Again, this is why we advise the 1.8.8 upgrade. The 1.8.8 release has been out for quite some time, and is stable in numerous production deployments.
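(Working that out from the table above: 491,976 single-page read RPCs out of roughly 670,529 total read RPCs is about 73%, and 408,911 single-page write RPCs out of roughly 683,988 total write RPCs is just under 60%, matching the cum % columns.)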

Comment by Oleg Drokin [ 12/Jul/12 ]

I wonder about the "185 of 200 clients have no problem accessing the OST" case you mention. Could it be the case that only the 15 "problematic" clients are actually doing any IO on that OST, and the rest happen to do IO on other OSTs?
After all, an overloaded OST is an overloaded OST. Lustre cannot really improve disk controller properties, and if the controller hates lots of small IO (and frankly, a lot of them do), there is very little we can do about it.

People in big government labs have these problems too and the solutions range from changing hardware to limiting "bad-behaving" applications to only certain parts of the machine to actually rewriting the apps to form a better IO pattern.

The 100k lock count should not be affecting anything as long as you don't run out of memory on the OSS nodes, but looking at the vmstat output it's evident you don't have this problem.
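If you want to keep an eye on that, the per-OST lock counts can be read in one go from a client; a rough sketch (assuming lctl get_param is available there):

# lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lock_unused_count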

The brw_stats output shows a lot of 4k IO, and it only tracks data IO issued by applications.

Comment by Cliff White (Inactive) [ 15/Feb/13 ]

Please reopen this issue if you have further comments/questions.
