Details
- Type: Task
- Resolution: Fixed
- Priority: Major
- None
- Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
- Component/s: Clustering
- 4043
Description
Hello Support,
One of our customers, the University of Delaware, has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:
Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs
The hang was so bad in one instance (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself; sometimes it does not, and we need to reboot one or more OSS nodes.
"Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."
You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):
"https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"
Full information for the drop-off:
Claim ID: vuAFoSBUoReVuaje
Claim Passcode: RfTmXJZFVdUGzbLk
Date of Drop-Off: 2012-06-11 12:23:20-0400
Please review the attached log files and advise us on the next course of action, since this is a very critical issue impacting their environment. Also, please let me know if you need any further information.
Thanks
Terry
Penguin Tech Support
Ph: 415-954-2833
Attachments
- headnode-messages.gz (623 kB)
- oss3-vmstat.log (9 kB)
- oss4-messages.gz (52 kB)
Issue Links
- Trackbacks
- Lustre 1.8.x known issues tracker: "While testing against the Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla (https://bugzilla.lustre.org/). In order to move away from relying on Bugzilla, we would create a JIRA..."
Activity
We have never had data loss from a rolling upgrade, or from a point upgrade of this type. We routinely test these point release upgrade/downgrades as a part of the release process for each new Lustre release.
If there are other dependencies involved (such as QLogic tools), then it would be advisable to involve Penguin in the decision-making process about how to proceed.
While upgrading your Lustre version brings the benefit of the other fixes included in the release, it is quite possible to apply just the fix you need to your existing version of Lustre, which may be less disruptive to the other layers of the stack.
Our clients use a stock RedHat kernel, and you can certainly do a gradual upgrade of the clients as your schedule permits; there is no problem running with a mix of client versions.
Before embarking on such an upgrade path you could even experiment with a single client to see whether the pay-off warrants the effort. To do this effectively you would need to identify an example of the kind of workload that triggers this issue so that you can measure the before and after effect of applying this fix. Do you have any idea as to a likely candidate for such a test?
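For illustration, a before/after measurement on that single test client might look like the following sketch (the test directory and dataset are hypothetical placeholders for whatever workload you identify):
# Run once with the 1.8.6 client, then again after upgrading that client to 1.8.8:
cd /lustre/scratch/iotest                  # hypothetical test directory on the Lustre mount
time tar xf representative-job-output.tar  # hypothetical small-file-heavy dataset
time sync                                  # include the cost of flushing dirty pages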
QLogic's OFED stack is a superset of the stock OFED distribution. It includes value-added utilities/libraries and bug-fixes [that OFED hypothetically rolls back into the baseline distribution].
When we initially set up the head node with ScientificLinux 6.1, the stock RDMA kernel module kept the machine from booting (kernel panic every time). Wiping the RHEL IB stack and replacing it with the QLogic one for RHEL 6.1 fixed the issue.
As configured by Penguin, the Lustre servers are all running CentOS 5.x.
[root@oss1 ~]# ibstat
CA 'qib0'
CA type: InfiniPath_QLE7340
Number of ports: 1
Firmware version:
Hardware version: 2
Node GUID: 0x001175000077592a
System image GUID: 0x001175000077592a
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 219
LMC: 0
SM lid: 1
Capability mask: 0x07610868
Port GUID: 0x001175000077592a
Link layer: IB
So, in your (Whamcloud's) experience, there have never been any instances of data loss from upgrading like this on the fly while the filesystem is still online?
We're using the loadable-module variant of the client rather than building a Lustre kernel for the clients or using a Whamcloud kernel. Is there anything about the 1.8.8 server that demands 1.8.8 clients? Upgrading all clients would mean a lot of downtime for our users, and they've already experienced a LOT of downtime/lost time thanks to Lustre.
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer IV / Cluster Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
http://turin.nss.udel.edu/
::::::::::::::::::::::::::::::::::::::::::::::::::::::
As far as we know, QLogic cards are supported by the OFED supplied with the RedHat kernel - was there a reason given for using an external OFED?
Please give us the model numbers of the cards.
Your upgrade procedure is correct, and identical to that in the Lustre Manual. The change from 1.8.6 to 1.8.8 is a point release; there are absolutely no concerns about compatibility and no need for any special work. The clients may require a RedHat kernel update to match our client kernel.
As far as we know, there is not much information beyond the Lustre Manual; the upgrade should not be difficult.
Please let us know when you are planning this, and we'd be glad to have a conference call or furnish other help.
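For illustration only, a single OSS failover/upgrade step might look like the sketch below. It assumes manual failover with shared OST storage; the device and mount paths are hypothetical:
# On oss1: hand each OST it serves over to the failover partner
umount /mnt/lustre/ost0000
# On oss2: take over that OST from the shared storage
mount -t lustre /dev/mpath/lustre-ost0000 /mnt/lustre/ost0000
# On oss1: install the 1.8.8-wc1 kernel and Lustre packages, then reboot
rpm -Uvh kernel-*lustre.1.8.8*.rpm lustre-1.8.8*.rpm lustre-modules-1.8.8*.rpm
reboot
# Once oss1 is back up, reverse the umount/mount steps to fail the OST back.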
Has disabling the read cache produced any change in the performance?
Cliff:
With respect to upgrading the Lustre release on these production servers:
(1) Since these systems have QLogic's own OFED installed on them, we would want to build the Lustre 1.8.8 server components from scratch, correct? (See the build sketch after the list below.) It appears this is what Penguin did when the system was built:
[root@oss1 src]# cd /usr/src
[root@oss1 src]# ls
aacraid-1.1.7.28000 debug kernels ofa_kernel ofa_kernel-1.5.3 openib redhat
[root@oss1 src]# ls kernels/
2.6.18-238.12.1.el5_lustre.g266a955-x86_64 2.6.18-238.el5-x86_64
(2) For rolling upgrades, we're assuming the general order of operations would be:
- upgrade spare MDS
- failover MDS to upgraded spare, upgrade primary MDS
- failover oss1 => oss2; upgrade oss1; failover oss2 => oss1; upgrade oss2
- failover oss3 => oss4; upgrade oss3; failover oss4 => oss3; upgrade oss4
We'd appreciate citations of any important/helpful materials on the subject of Lustre rolling upgrades.
Thanks for any and all information/feedback you can provide.
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer IV / Cluster Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
http://turin.nss.udel.edu/
::::::::::::::::::::::::::::::::::::::::::::::::::::::
OK, thanks. I turned off the read cache. We'll let you know if it helps. We'll discuss upgrading internally soon.
Ben
Thanks for providing the information so far. Analyzing the results has been useful in understanding what is going on.
From the IO statistics, we are seeing a great deal of small-file IO (4k IO size) and very little parallel IO (most of the time only 1 IO in flight). This is not an optimal IO model for Lustre - any steps you can take on the application side to increase IO size, eliminate excessive flush() or sync() calls, or otherwise allow the filesystem to aggregate larger IO will help to improve your performance.
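(As a pointer, one place this pattern is visible is the per-OST bulk-IO histogram on each OSS; reading it is a one-liner:)
# Look at the "disk I/O size" histogram for every OST; a distribution
# dominated by 4K entries matches the small-IO pattern described above.
lctl get_param obdfilter.*.brw_stats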
Given this IO pattern, the Lustre read cache - which is on by default - may be doing more harm than good. To turn it off, please run "lctl set_param obdfilter.*.read_cache_enable=0" on all OSS nodes.
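For example, to apply and verify this across all OSS nodes in one step (using pdsh and the node names oss1-oss4 is an assumption; any remote shell will do):
pdsh -w oss[1-4] 'lctl set_param obdfilter.*.read_cache_enable=0'  # disable the read cache
pdsh -w oss[1-4] 'lctl get_param obdfilter.*.read_cache_enable'    # confirm each OST now reports 0
# Note: set_param changes are not persistent; they are lost when an OSS reboots
# or its OSTs are remounted.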
Finally, we recommend an immediate upgrade to Lustre 1.8.8-wc1, as this release contains optimizations for small-file IO (see http://jira.whamcloud.com/browse/LU-983). The Lustre 1.8.6-wc1 and 1.8.8-wc1 releases are completely compatible, so you can do a rolling upgrade of your systems without any downtime.
Please let me know if you have any further questions about the above, and let us know whether this advice helps.
Here it is for oss3.
Ben
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 3679452 2271020 16719004 0 0 12 60 49 37 0 1 99 0 0
1 0 0 3665612 2271220 16731472 0 0 12 2683 5492 22904 0 2 98 0 0
1 0 0 3586076 2270824 16823820 0 0 14 19930 6164 24593 0 8 92 0 0
1 0 0 3583992 2270992 16826752 0 0 14 767 6706 30467 0 10 90 0 0
2 0 0 3572604 2271160 16839324 0 0 2 2713 6588 29624 0 10 90 0 0
1 0 0 3553672 2271488 16857124 0 0 10 4235 5925 25023 0 10 90 0 0
1 0 0 3551728 2271080 16858028 0 0 7 106 6971 31704 0 10 90 0 0
1 0 0 3526424 2270872 16885768 0 0 2 5778 7662 33085 0 10 90 0 0
1 0 0 3519940 2270736 16893616 0 0 5 1845 4776 20677 0 8 92 0 0
1 0 0 3518544 2270376 16896276 0 0 12 539 5162 22691 0 10 90 0 0
1 0 0 3523576 2270200 16887492 0 0 9 304 6025 26623 0 10 90 0 0
2 0 0 3523612 2270296 16888852 0 0 7 405 7089 32613 0 10 90 0 0
2 0 0 3536784 2270416 16874592 0 0 11 530 5358 24228 0 10 90 0 0
1 0 0 3536212 2270536 16875516 0 0 16 387 5351 23709 0 3 97 0 0
1 0 0 3518632 2270568 16891360 0 0 16 3240 5485 24006 0 7 93 0 0
2 0 0 3504568 2270840 16906664 0 0 6 3297 7673 35050 0 10 90 0 0
2 0 0 3489320 2271104 16922432 0 0 13 3536 5865 25530 0 10 90 0 0
1 0 0 3487380 2271320 16925436 0 0 19 540 6345 29010 0 10 90 0 0
2 0 0 3477772 2271420 16931608 0 0 5 1420 7390 33680 0 10 90 0 0
2 0 0 3458112 2271916 16952220 0 0 50 4638 6207 26096 0 10 90 0 0
2 0 0 3318852 2273780 17087976 0 0 6 34498 6682 22626 0 10 90 0 0
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 3318732 2273556 17089888 0 0 0 253 5681 25024 0 10 90 0 0
2 0 0 3281832 2273992 17125828 0 0 4 7547 7094 30202 0 10 90 0 0
2 0 0 3272004 2274216 17136424 0 0 12 2268 6500 28415 0 10 90 0 0
2 0 0 3256964 2274112 17151664 0 0 4 3198 5621 23499 0 10 90 0 0
1 0 0 3246400 2273744 17162628 0 0 6 2677 5627 23582 0 10 90 0 0
1 0 0 3235624 2273240 17171196 0 0 6 2217 7079 32173 0 10 90 0 0
1 1 0 3245212 2273036 17163072 0 0 10 5155 6201 27244 0 9 91 0 0
1 0 0 3214192 2273040 17193544 0 0 22 6617 6234 26471 0 10 90 0 0
1 0 0 3191088 2272920 17217188 0 0 16 4946 7279 32868 0 10 90 0 0
1 0 0 3176340 2272944 17231044 0 0 4 2886 6947 31922 0 10 90 0 0
1 0 0 3173472 2273192 17234444 0 0 22 1059 4799 20868 0 10 90 0 0
1 0 0 3172304 2272816 17236180 0 0 14 400 5268 23325 0 10 90 0 0
2 0 0 3172920 2272408 17236632 0 0 11 138 5970 26801 0 10 90 0 0
2 0 0 3167624 2272056 17240424 0 0 13 825 7096 31001 0 10 90 0 0
1 0 0 3165220 2272152 17243068 0 0 7 682 5149 22487 0 10 90 0 0
2 0 0 3163628 2271960 17245264 0 0 1 490 5443 24123 0 10 90 0 0
2 0 0 3160612 2271760 17248636 0 0 2 838 5787 26236 0 10 90 0 0
1 0 0 3156912 2271872 17251540 0 0 7 907 7176 33221 0 10 90 0 0
1 0 0 3156260 2271960 17252712 0 0 12 266 5587 25427 0 10 90 0 0
1 0 0 3153276 2271872 17255972 0 0 12 823 6403 29209 0 10 90 0 0
2 0 0 3151996 2272048 17257488 0 0 21 393 7237 33723 0 10 90 0 0
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3147760 2272128 17259880 0 0 5 563 5479 24913 0 10 90 0 0
1 0 0 3146884 2272280 17262168 0 0 10 564 4963 21783 0 10 90 0 0
1 0 0 3146156 2272400 17261772 0 0 10 461 5735 25777 0 10 90 0 0
1 0 0 3101968 2272896 17305052 0 0 833 9286 7095 31214 0 10 90 0 0
1 0 0 3143424 2273096 17263004 0 0 22 717 6008 27080 0 10 90 0 0
1 0 0 3144020 2273172 17263596 0 0 0 234 5288 23557 0 10 90 0 0
1 0 0 3142292 2273340 17265644 0 0 6 547 5339 23535 0 10 90 0 0
2 0 0 3141508 2273484 17266848 0 0 11 305 7273 33868 0 10 90 0 0
1 0 0 3140376 2273508 17267744 0 0 6 202 5724 26090 0 10 90 0 0
1 0 0 3096288 2274368 17311792 0 0 229 19058 7161 27260 0 10 90 0 0
1 0 0 3091776 2274488 17314492 0 0 9 1066 6976 31663 0 7 93 0 0
2 0 0 3069832 2274664 17326844 0 0 2 2754 6882 30216 0 2 98 0 0
1 0 0 3056844 2274840 17331568 0 0 17 971 4705 20198 0 2 98 0 0
1 0 0 3109352 2274788 17294964 0 0 7 2655 5362 22722 0 2 98 0 0
3 0 0 3015684 2274916 17391628 0 0 10 20679 6667 27656 0 7 93 0 0
1 0 0 3014468 2274068 17392884 0 0 16 546 7143 32296 0 9 91 0 0
1 0 0 3050608 2273996 17354228 0 0 16 788 5179 22799 0 8 92 0 0
1 0 0 3127024 2273964 17273668 0 0 12 3244 5571 24207 0 9 91 0 0
2 0 0 3111308 2273820 17292936 0 0 20 4055 6203 25874 0 9 91 0 0
2 0 0 3090120 2273356 17315624 0 0 23 5059 7564 32830 0 10 90 0 0
2 0 0 3089884 2272900 17317324 0 0 2 462 5458 24031 0 8 91 0 0
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 3088984 2272980 17319600 0 0 6 459 6475 28349 0 10 90 0 0
2 0 0 2995780 2273492 17412792 0 0 5 19180 8460 34220 0 11 89 0 0
3 0 0 2985708 2272916 17424072 0 0 8 2332 5648 24187 0 10 90 0 0
4 0 0 2965952 2272964 17443116 0 0 10 4574 5281 21170 0 10 90 0 0
1 0 0 2970440 2272636 17433200 0 0 8 2831 6166 26640 0 7 93 0 0
2 0 0 2961988 2272856 17443840 0 0 2 2190 6840 30713 0 10 90 0 0
1 0 0 2957244 2272332 17450308 0 0 2 2953 6066 26143 0 10 90 0 0
1 0 0 2952340 2272140 17456128 0 0 6 2278 5487 23197 0 6 94 0 0
0 0 0 2951920 2271764 17454436 0 0 15 228 5394 23413 0 8 92 0 0
2 0 0 2951156 2271348 17454896 0 0 8 146 7381 33632 0 8 92 0 0
2 0 0 2942704 2271380 17465200 0 0 25 4042 5934 25341 0 10 90 0 0
1 0 0 2938772 2271204 17469672 0 0 8 2420 6135 26676 0 10 90 0 0
2 0 0 2929464 2270848 17478452 0 0 17 4628 7567 32596 0 10 90 0 0
1 0 0 2922180 2270536 17486124 0 0 9 3868 7044 30435 0 10 90 0 0
2 0 0 2909516 2270640 17499912 0 0 3 4629 4907 20917 0 10 90 0 0
2 0 0 2900304 2270840 17509008 0 0 10 4393 5868 24716 0 10 90 0 0
2 0 0 2901412 2270936 17505024 0 0 4 1712 6456 28808 0 10 90 0 0
1 0 0 2891168 2270688 17516180 0 0 30 4203 7167 30894 0 10 90 0 0
2 0 0 2916592 2270792 17490216 0 0 10 82 5253 23086 0 10 90 0 0
2 0 0 2903988 2270696 17504200 0 0 826 2253 5760 24081 0 10 90 0 0
1 0 0 2893892 2270720 17512492 0 0 6 5798 6869 28962 0 10 90 0 0
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 2881068 2270600 17526988 0 0 2 3694 6850 30249 0 10 90 0 0
1 0 0 2878404 2270156 17530512 0 0 8 960 5641 25202 0 10 90 0 0
1 0 0 2868848 2269928 17540368 0 0 3 3522 7001 30809 0 8 92 0 0
2 0 0 2864540 2269460 17540948 0 0 9 2708 7399 33274 0 7 93 0 0
1 0 0 2863940 2269336 17543248 0 0 13 710 5043 22031 0 10 90 0 0
1 0 0 2866248 2268912 17541764 0 0 4 525 4962 21800 0 10 90 0 0
2 0 0 2864944 2268820 17544228 0 0 838 734 6114 27383 0 10 90 0 0
1 0 0 2853648 2268872 17552824 0 0 16 4321 7043 31961 0 10 90 0 0
1 0 0 2842860 2268960 17566128 0 0 8 4648 5628 24462 0 10 90 0 0
2 0 0 2834824 2268956 17574452 0 0 6 5086 5371 23136 0 10 90 0 0
2 0 0 2834492 2269184 17575464 0 0 17 670 6318 26323 0 10 90 0 0
3 0 0 2832544 2269092 17575608 0 0 10 406 8650 38921 0 10 90 0 0
1 0 0 2830952 2269732 17577252 0 0 10 2411 6211 27698 0 10 90 0 0
2 0 0 2827412 2269956 17582376 0 0 10 1793 6197 27999 0 10 90 0 0
1 0 0 2827604 2270180 17582404 0 0 9 1563 7256 32327 0 10 90 0 0
1 0 0 2824660 2270432 17582720 0 0 5 1230 6122 27703 0 10 90 0 0
Can you run "vmstat 5 100 > vmstat.log" on one of the busy OSTs and attach the output?
Logs are attached. The headnode-messages file has syslogs for the headnode and all the compute nodes.
Ben
Please update us as to your status. What else can we do to help on this issue? Should we close this bug?