[LU-450] System unresponsive and hitting LBUG() after 1.6 => 1.8 upgrade Created: 22/Jun/11 Updated: 26/Feb/16 Resolved: 04/Jul/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | ucl | ||
| Environment: |
Longstanding 1.6 installation, RHEL5.3, ddn 9550, 48 OSTs, 4 OSS. 10g network, 800 clients. Exact version is |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6597 |
| Description |
|
After a site visit yesterday to upgrade from 1.6 to 1.8, the filesystem is now unstable: 'cat /proc/fs/lustre/health_check' on the OSSs takes up to 18 minutes to complete, system load on the OSSs is 200+, and this evening we hit several LBUG()s.
Yesterday we upgraded from 1.6.7.2 to 1.8.4.ddn3.1, configured quotas on the system, and fixed an issue with LAST_ID on ost_12 which was causing that OST to be set inactive at start.
It's possible that the upgrade is a red herring. We first had problems with heartbeat restarting the MDS last Thursday; it started taking too long to read health_check on the MDS around 3am last Tuesday morning. At that point I restarted all servers and it was OK again until Friday; over the weekend, however, it was restarting every half hour. We didn't do anything Monday because of the site shutdown and the upgrade scheduled for Tuesday.
Also, since the restart the OSTs have been filling up at an alarming rate; they've gone from ~70% to 100% used in some cases. I'm speaking to the customer to see if this is real data and whether they can stem the tide somehow. |
| Comments |
| Comment by Ashley Pittman (Inactive) [ 22/Jun/11 ] |
|
The LBUG data I have is below; it came out on syslog from a client I don't have access to, and it looks like Oracle bug 20278. I don't have access to the clients to know what version they are running; this has been requested from the customer. The system started experiencing load at lunchtime; I suspect this is a symptom rather than a cause.
https://bugzilla.lustre.org/show_bug.cgi?id=20278
Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...
Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...
Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...
Message from syslogd@ at Wed Jun 22 20:29:45 2011 ... |
| Comment by Peter Jones [ 22/Jun/11 ] |
|
Bobi, can you please help out with this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 22/Jun/11 ] |
|
Ashley, please upgrade those affected clients to 1.8.4 as well. |
| Comment by William Hay [ 23/Jun/11 ] |
|
Hi, Ashley/DDN's customer here. The Lustre version on the clients is currently:
[root@node-a01 ~]# rpm -qa|grep lustre
The particular node generating the LBUG isn't accessible to us either; it won't respond to ssh. The remote console was
At the moment we can do an ls (or even ls -l) of Lustre on the client, but running df on the Lustre |
| Comment by Ashley Pittman (Inactive) [ 23/Jun/11 ] |
|
William, last night I could mount Lustre on the spare MDS node. The mount would complete successfully, but df would then hang, while lfs df would still give meaningful output; in particular, most OSTs would report a % used and some OSTs would be listed as "inactive device". Checking these OSTs on the OSS nodes, it appears not all OSTs were running last night. I've now started all of them, which should at least allow df to complete.
[root@mds2 ~]# lfs df
filesystem summary: 169212268896 136582727936 24033718004 80% /lustre/lustre/client
Note that because of the age of this system it wasn't formatted with the --index option, so the translation from device to index is not constant. |
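A minimal sketch of the checks described above (the client mount point is the one quoted in this ticket; the commands are standard lfs/lctl usage rather than the exact ones that were run):
# from a client: per-OST usage; deactivated OSTs show up as "inactive device"
lfs df /lustre/lustre/client
# on each OSS: list configured devices and confirm the obdfilter targets are UP
lctl dl |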
| Comment by Ashley Pittman (Inactive) [ 23/Jun/11 ] |
|
Zhenyu: It appears only one client out of 800 hit the LBUG; do you still recommend upgrading all clients because of this? Ashley. |
| Comment by Zhenyu Xu [ 23/Jun/11 ] |
|
For the time being, it's better to upgrade just this single affected client. |
| Comment by Ashley Pittman (Inactive) [ 23/Jun/11 ] |
|
Zhenyu: regarding the client LBUG, is this likely to be because some of the OSTs are now full, or could it be a quota issue? Short-term updates to the clients are unlikely, so would disabling quotas on the servers avoid this LBUG recurring? Can you tell us what information you need in order to give further advice on how to stabilise the system? |
| Comment by Sven Trautmann (Inactive) [ 23/Jun/11 ] |
|
Some more tests trying to access /proc/fs/lustre/health_check on one of the OSS nodes: timing cat of the file, with date, in a loop. The results look like this:
Thu Jun 23 12:45:15 BST 2011
real 18m36.173s
real 49m49.152s
real 26m55.176s
real 12m38.171s
real 3m12.691s
At the moment it seems to work better; the last cat /proc/fs/lustre/health_check took only
real 0m0.089s |
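A minimal sketch of the kind of loop being described (only the proc file being read is taken from this ticket; the script itself and the sleep interval are assumptions):
#!/bin/bash
# log a timestamp, then how long one read of the OSS health file takes
while true; do
    date
    time cat /proc/fs/lustre/health_check > /dev/null
    sleep 60
done |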
| Comment by Brian Murrell (Inactive) [ 23/Jun/11 ] |
|
Can you attach the syslogs from the OSSes covering the last 36h? |
| Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ] |
|
We suspected
[root@mds1 ~]# pdsh -a uptime
real 0m0.219s
Our logs show that for a time after boot the system load was high and the system was unresponsive, to the point where we think heartbeat would object and kill the node. I'm getting the logs for this time currently. |
| Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ] |
|
If we take one node, oss4, it booted at "Jun 23 17:36:01". Our load monitoring script above wasn't started automatically; the first few entries from after it was started are below. Note that 45 minutes after boot it was still taking over 8 minutes to cat the health_check file. The OSTs were mounted at boot time. As I say, things have quietened down now and the system hasn't restarted since this time.
Thu Jun 23 18:15:42 BST 2011
real 8m28.119s
real 4m30.552s
real 1m5.965s
real 0m0.157s
real 0m0.077s
All further samples are below 1 second to read this file, with the exception of the three samples below.
Thu Jun 23 19:09:48 BST 2011
real 5m55.093s
and
Fri Jun 24 01:11:04 BST 2011
real 0m20.471s
real 0m34.715s |
| Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ] |
|
Logs from oss4 since the last boot. |
| Comment by Zhenyu Xu [ 24/Jun/11 ] |
|
At OST boot time, oss4 was recovering 887 clients, which generates heavy IO load:
Jun 23 17:49:01 oss4 kernel: Lustre: 19755:0:(filter.c:1001:filter_init_server_data()) RECOVERY: service lustre-OST0024, 887 recoverable clients, 0 delayed clients, last_rcvd 34359739610
Meanwhile, up until Jun 24 01:25:37 the log still shows heavy IO load on oss4. I don't know whether this is still handling of delayed recovery request IO or new IO from connected clients. |
| Comment by Zhenyu Xu [ 24/Jun/11 ] |
|
Are the OSTs on oss4 pretty full? |
| Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ] |
|
I understand there will be high load on the servers after a restart with that many active clients, but I wouldn't expect it to cause such delays. The heartbeat monitor timeout is set to ten minutes; any longer than this and the resource is assumed to have problems and the node is rebooted. We use this same timeout everywhere, and whilst asking the customer to increase it is an option, I'd rather avoid doing this. Yes, they are pretty full; we're working with the customer on this. There seems to be a usage spike in the last week whose cause we don't understand.
/dev/mapper/ost_lustre_36 |
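A minimal sketch of how the usage numbers quoted in this ticket can be gathered (the ost_lustre_NN device naming and client mount point are taken from the ticket; the commands themselves are standard rather than the exact ones run):
# on the OSS: block usage of each OST device
df -h /dev/mapper/ost_lustre_*
# from a client: per-OST block and inode usage
lfs df -h /lustre/lustre/client
lfs df -i /lustre/lustre/client |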
| Comment by Zhenyu Xu [ 24/Jun/11 ] |
|
Several questions we'd like answered:
1. Please check each OST's recovery status (e.g. cat /proc/fs/lustre/obdfilter/lustre-OSTxxxx/recovery_status; a loop over all OSTs is sketched below).
4. From the log, Lustre IO is using the tcp network; should it be using the IB network? Also, heartbeat should be configured to use a different network from the one Lustre uses, or they will compete and heavy IO can make heartbeat conclude that the node is dead. |
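A minimal sketch of checking recovery status across all OSTs on an OSS (the per-OST recovery_status proc file is the one named in point 1 above; the loop itself is an assumption, not a command from this ticket):
# on each OSS: print the recovery state (COMPLETE or RECOVERING) of every OST it serves
for f in /proc/fs/lustre/obdfilter/lustre-OST*/recovery_status; do
    echo "== $f"
    grep ^status "$f"
done |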
| Comment by Brian Murrell (Inactive) [ 24/Jun/11 ] |
|
Would I be correct in interpreting that your current pain point here is the time it's taking to read from the health_check proc file? |
| Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ] |
|
Brian, now that we've got the system stable and under control, I think that's a fair summary of the remaining concerns. The load is high and the node is somewhat unresponsive after startup, but we can live with that if we need to. One thing I did think of is that we could remove this check from the heartbeat monitor script; this would give us a get-out and something to deliver to the customer, but it doesn't resolve the real issue. What we notice during occasions like this is that the node is basically functional (you can log in, run top, ls and the like), but the load will be high, lots of ll_ost processes will be in D state, and certain commands, for example sync, will take a long time to complete. I'm assuming sync and reading the health_check file are waiting for the same locks. |
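A minimal sketch of how the D-state observation above could be captured for the logs (these are generic RHEL5-era commands, not ones taken from this ticket):
# list Lustre OST service threads stuck in uninterruptible sleep, with their wait channel
ps -eo state,pid,wchan:30,cmd | awk '$1 == "D"' | grep ll_ost
# optionally, if sysrq is enabled, dump all kernel stacks to syslog for later analysis (verbose)
echo t > /proc/sysrq-trigger |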
| Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ] |
|
Zhenyu: 1) below [root@oss4 ~]# lustre_recovery_status.sh -v |
| Comment by Brian Murrell (Inactive) [ 24/Jun/11 ] |
|
Ashley, there is an alternative, and it would be useful to try it even if just as a test. Reading from the health_check proc file causes a write to the OST during its operation, as a check that writing to the disk is healthy. This part of the health check can be suppressed by building Lustre with the --disable-health_write configure option. Of course, the result is a slightly less extensive health check, but if you could do that as a test to see whether it alleviates your health_check proc file reading latency, then we will at least know where the problem is. |
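A minimal sketch of a rebuild with that option (the configure flag is the one named above; the surrounding steps and the kernel path are assumptions about a typical RHEL5 server build, not the exact DDN packaging procedure):
# from an unpacked Lustre source tree on a build host with the server kernel source installed
./configure --disable-health_write --with-linux=/usr/src/kernels/<server-kernel>
make rpms     # produces server RPMs with the OST write part of the health check disabled |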
| Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ] |
|
That's good to know. I'll prepare a build with this option set so we have it ready. At this stage I'm not going to force a reboot on the customer but we'll discuss this with them Monday. |
| Comment by Brian Murrell (Inactive) [ 28/Jun/11 ] |
|
Is there any update on this issue? Were you able to perform the prescribed test? |
| Comment by Ashley Pittman (Inactive) [ 29/Jun/11 ] |
|
No update yet. We have new packages on-site and ready to deploy but need to liaise with the customer over the schedule for doing this. At this time we are assuming this to be a duplicate of
The filesystem is still unpleasantly full, with between 58% and 100% of blocks used per OST and 84% used on average. |
| Comment by Ashley Pittman (Inactive) [ 29/Jun/11 ] |
|
Update confirmed for tomorrow. Other than monitoring the load average and the time taken to cat the health_check file, is there anything you want us to check after boot? |
| Comment by Ashley Pittman (Inactive) [ 01/Jul/11 ] |
|
Upgrade performed, and the system is now running under the control of heartbeat again. The --disable-health-write option made all the difference; at no point during startup did reading this file take more than a second. The underlying issue remains, though, and through testing I've confirmed it's the same issue we've seen elsewhere: for some time after startup, any writes to an OST will stall for a considerable period of time. Using our ost-survey script I got the following results (measured shortly after recovery had finished):
OST 0: 1073741824 bytes (1.1 GB) copied, 118.534 seconds, 9.1 MB/s
This is an older system, so 100 MB/s is about right; what we see, though, is that some OSTs take considerably longer. From monitoring the file size as the above test is run, the file will increase in size up to a point and then remain that size for a considerable time; while the OST is accepting writes the performance is as expected, but it freezes for long periods. Rerunning the same test now gives good results for all OSTs. If the health_check was performing a write to every OST on a node, this explains why it was taking so long to read the file. |
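A minimal sketch of a per-OST write test in the spirit of the ost-survey run above (the client mount point and the 48-OST count come from this ticket; the file name, transfer size, and exact commands are assumptions, not the contents of the actual script):
# write 1 GiB to a file pinned to each OST in turn and report the dd rate
for idx in $(seq 0 47); do
    f=/lustre/lustre/client/ost_write_test.$idx
    lfs setstripe -c 1 -i $idx "$f"                        # single stripe on OST index $idx
    dd if=/dev/zero of="$f" bs=1M count=1024 conv=fsync 2>&1 | grep copied
    rm -f "$f"
done |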
| Comment by Zhenyu Xu [ 04/Jul/11 ] |
|
The --disable-health-write configuration helps the heartbeat health check; dup to |