[LU-450] System unresponsive and hitting LBUG() after 1.6 => 1.8 upgrade Created: 22/Jun/11  Updated: 26/Feb/16  Resolved: 04/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Shuichi Ihara (Inactive) Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: ucl
Environment:

Longstanding 1.6 installation, RHEL5.3, ddn 9550, 48 OSTs, 4 OSS. 10g network, 800 clients.

Exact version is
lustre: 1.8.4.ddn3.1
kernel: patchless_client
build: 1.8.4.ddn3.1-20110406235128-PRISTINE-2.6.18-194.32.1.el5_lustre.1.8.4.ddn3.1.20110406235217

https://fseng.ddn.com/es_browser/record?record=es_lustre_showall_2011-06-21_180652&site=UCL&system=lustre


Attachments: File oss4-messages-1.8.6    
Severity: 3
Rank (Obsolete): 6597

 Description   

After a site visit yesterday to upgrade from 1.6 to 1.8, the filesystem is now unstable: 'cat /proc/fs/lustre/health_check' on the OSSs takes up to 18 minutes to complete, the system load on the OSSs is 200+, and this evening there have been several LBUG()s.

Yesterday we upgraded from 1.6.7.2 to 1.8.4.ddn3.1, configured quotas on the system and fixed an issue with LAST_ID on ost_12 which was causing it to be set inactive at startup.

It's possible that the upgrade is a red herring. We first had problems with heartbeat restarting the MDS last Thursday; it started taking too long to read health_check on the MDS around 3am last Tuesday morning. At that point I restarted all servers and it was OK again until Friday; over the weekend, however, it was restarting every half hour. We didn't do anything Monday because of the site shutdown and the upgrade scheduled for Tuesday.

Also, since the restart the OSTs have been filling up at an alarming rate; they've gone from ~70% up to 100% full in some cases. I'm speaking to the customer to see if this is real data and whether they can stem the tide somehow.



 Comments   
Comment by Ashley Pittman (Inactive) [ 22/Jun/11 ]

The LBUG data I have is below; this came out on syslog from a client which I don't have access to, and it looks like Oracle bug 20278. I don't have access to the clients to know what version they are running; this has been requested from the customer. The system started experiencing load at lunchtime; I suspect this is a symptom rather than a cause.

https://bugzilla.lustre.org/show_bug.cgi?id=20278

Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...
10.143.17.85 kernel: LustreError: 4424:0:(osc_request.c:1041:osc_init_grant()) ASSERTION(cli->cl_avail_grant >= 0) failed

Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...
10.143.17.85 kernel: LustreError: 4424:0:(osc_request.c:1041:osc_init_grant()) LBUG

Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...
10.143.17.85 kernel: LustreError: 4423:0:(osc_request.c:761:osc_consume_write_grant()) ASSERTION(cli->cl_avail_grant >= 0) failed: invalid avail grant is -225280

Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...
10.143.17.85 kernel: LustreError: 4423:0:(osc_request.c:761:osc_consume_write_grant()) LBUG

Comment by Peter Jones [ 22/Jun/11 ]

Bobi

Can you please help out with this one?

Thanks

Peter

Comment by Zhenyu Xu [ 22/Jun/11 ]

Ashley, please upgrade those affected clients to 1.8.4 as well.

Comment by William Hay [ 23/Jun/11 ]

Hi Ashley/DDN's customer here:

The lustre version on the clients is currently:

[root@node-a01 ~]# rpm -qa|grep lustre
lustre-1.8.3.ddn3.3-2.6.18_194.11.4.el5_201009291220.x86_64
lustre-modules-1.8.3.ddn3.3-2.6.18_194.11.4.el5_201009291220.x86_64

The particular node generating the LBUG isn't accessible to us either; it won't respond to ssh. The remote console was
showing stack traces, of which we have video (more than a screenful, and going by too fast).

At the moment we can do an ls (or even ls -l) of lustre on the client, but running df on the lustre
file system hangs. We have quite a few hung df processes on the nodes from various health-checking scripts.

Comment by Ashley Pittman (Inactive) [ 23/Jun/11 ]

William.

Last night I could mount lustre on the spare MDS node. The mount would complete successfully, but df would then hang, while lfs df would still give meaningful output. In particular, most OSTs would report a % used, but some OSTs would be listed as "inactive device". Checking these OSTs on the OSS nodes, it appears not all OSTs were running last night; I've now started all of them, which should at least allow df to complete.

[root@mds2 ~]# lfs df
UUID 1K-blocks Used Available Use% Mounted on
lustre-MDT0000_UUID 1020122348 49082024 912742776 4% /lustre/lustre/client[MDT:0]
lustre-OST0000_UUID 3845733384 3122213084 528163112 81% /lustre/lustre/client[OST:0]
lustre-OST0001_UUID 3845733384 2791898800 858481064 72% /lustre/lustre/client[OST:1]
lustre-OST0002_UUID 3845733384 3609962120 40414452 93% /lustre/lustre/client[OST:2]
lustre-OST0003_UUID 3845733384 3648220176 2161080 94% /lustre/lustre/client[OST:3]
lustre-OST0004_UUID : inactive device
lustre-OST0005_UUID 3845733384 2302205248 1348167284 59% /lustre/lustre/client[OST:5]
lustre-OST0006_UUID 3845733384 3192676748 457694944 83% /lustre/lustre/client[OST:6]
lustre-OST0007_UUID 3845733384 3648241424 2124416 94% /lustre/lustre/client[OST:7]
lustre-OST0008_UUID 3845733384 2836017412 814359824 73% /lustre/lustre/client[OST:8]
lustre-OST0009_UUID 3845733384 2895836516 754539188 75% /lustre/lustre/client[OST:9]
lustre-OST000a_UUID 3845733384 3648183844 2197276 94% /lustre/lustre/client[OST:10]
lustre-OST000b_UUID 3845733384 2867535040 782837272 74% /lustre/lustre/client[OST:11]
lustre-OST000c_UUID : inactive device
lustre-OST000d_UUID 3845733384 2410525076 1239851436 62% /lustre/lustre/client[OST:13]
lustre-OST000e_UUID : inactive device
lustre-OST000f_UUID 3845733384 2685155768 965222468 69% /lustre/lustre/client[OST:15]
lustre-OST0010_UUID 3845733384 2596218232 1054154088 67% /lustre/lustre/client[OST:16]
lustre-OST0011_UUID : inactive device
lustre-OST0012_UUID 3845733384 3335791016 314580064 86% /lustre/lustre/client[OST:18]
lustre-OST0013_UUID 3845733384 2451923448 1198456676 63% /lustre/lustre/client[OST:19]
lustre-OST0014_UUID 3845733384 3619509900 30864044 94% /lustre/lustre/client[OST:20]
lustre-OST0015_UUID 3845733384 2509919836 1140451816 65% /lustre/lustre/client[OST:21]
lustre-OST0016_UUID 3845733384 3370192404 280186576 87% /lustre/lustre/client[OST:22]
lustre-OST0017_UUID 3845733384 3225873140 424505108 83% /lustre/lustre/client[OST:23]
lustre-OST0018_UUID 3845733384 2973690904 676670960 77% /lustre/lustre/client[OST:24]
lustre-OST0019_UUID 3845733384 2702246688 948124556 70% /lustre/lustre/client[OST:25]
lustre-OST001a_UUID 3845733384 2737444208 912908680 71% /lustre/lustre/client[OST:26]
lustre-OST001b_UUID 3845733384 2284905224 1365467664 59% /lustre/lustre/client[OST:27]
lustre-OST001c_UUID 3845733384 2265870020 1384505944 58% /lustre/lustre/client[OST:28]
lustre-OST001d_UUID 3845733384 3648118412 2251264 94% /lustre/lustre/client[OST:29]
lustre-OST001e_UUID 3845733384 2944649824 705727328 76% /lustre/lustre/client[OST:30]
lustre-OST001f_UUID 3845733384 2885398996 764982320 75% /lustre/lustre/client[OST:31]
lustre-OST0020_UUID 3845733384 3285192664 365172820 85% /lustre/lustre/client[OST:32]
lustre-OST0021_UUID 3845733384 3648099688 2278432 94% /lustre/lustre/client[OST:33]
lustre-OST0022_UUID 3845733384 3648147796 2232244 94% /lustre/lustre/client[OST:34]
lustre-OST0023_UUID 3845733384 3187297968 463083164 82% /lustre/lustre/client[OST:35]
lustre-OST0024_UUID 3845733384 3222652840 427724384 83% /lustre/lustre/client[OST:36]
lustre-OST0025_UUID 3845733384 3647790324 2574548 94% /lustre/lustre/client[OST:37]
lustre-OST0026_UUID 3845733384 3304111792 346267260 85% /lustre/lustre/client[OST:38]
lustre-OST0027_UUID 3845733384 2826439808 823933796 73% /lustre/lustre/client[OST:39]
lustre-OST0028_UUID 3845733384 3648074132 2291824 94% /lustre/lustre/client[OST:40]
lustre-OST0029_UUID 3845733384 3401982860 248394260 88% /lustre/lustre/client[OST:41]
lustre-OST002a_UUID 3845733384 2860903912 789477404 74% /lustre/lustre/client[OST:42]
lustre-OST002b_UUID 3845733384 3500709016 149660016 91% /lustre/lustre/client[OST:43]
lustre-OST002c_UUID 3845733384 3021170152 629187196 78% /lustre/lustre/client[OST:44]
lustre-OST002d_UUID 3845733384 3517798940 132562924 91% /lustre/lustre/client[OST:45]
lustre-OST002e_UUID 3845733384 3169784092 480593960 82% /lustre/lustre/client[OST:46]
lustre-OST002f_UUID 3845733384 3482148444 168232868 90% /lustre/lustre/client[OST:47]

filesystem summary: 169212268896 136582727936 24033718004 80% /lustre/lustre/client

Note that because of the age of this system it wasn't formatted with the --index option, so the translation from device to index is not constant.
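
For reference, on more recently formatted filesystems the OST index is pinned at format time with the --index option. A hypothetical example of such an invocation (the device name and MGS NID here are purely illustrative, and this was not run on this system):

# Hypothetical mkfs.lustre invocation pinning the OST index (illustrative only)
mkfs.lustre --ost --fsname=lustre --index=4 --mgsnode=mds1@tcp0 /dev/mapper/ost_lustre_4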

Comment by Ashley Pittman (Inactive) [ 23/Jun/11 ]

Zhenyu:

It appears only one client out of 800 hit the LBUG; do you still recommend upgrading all clients because of this?

Ashley.

Comment by Zhenyu Xu [ 23/Jun/11 ]

For the time being, it's better to upgrade just this single affected client.

Comment by Ashley Pittman (Inactive) [ 23/Jun/11 ]

Zhenyu:

Regarding the client LBUG, is this likely to be because some of the OSTs are now full, or could it be a quota issue? Short-term updates to the clients are unlikely, so would disabling quotas on the servers avoid hitting this LBUG again?

Can you tell us what information you need in order to give further advice on how to stabilise the system?

Comment by Sven Trautmann (Inactive) [ 23/Jun/11 ]

Here are some more tests trying to access /proc/fs/lustre/health_check on one of the OSS nodes.
The script is executing

date
time cat /proc/fs/lustre/health_check
date
uptime

in a loop. The results look like this:

Thu Jun 23 12:45:15 BST 2011
healthy

real 18m36.173s
user 0m0.000s
sys 0m0.000s
Thu Jun 23 13:03:51 BST 2011
13:03:51 up 21:48, 0 users, load average: 113.28, 109.27, 85.87
Thu Jun 23 13:03:51 BST 2011
Thu Jun 23 13:08:51 BST 2011
healthy

real 49m49.152s
user 0m0.000s
sys 0m0.002s
Thu Jun 23 13:58:40 BST 2011
13:58:40 up 22:43, 0 users, load average: 100.87, 102.32, 99.78
Thu Jun 23 13:58:40 BST 2011
Thu Jun 23 14:03:40 BST 2011
healthy

real 26m55.176s
user 0m0.000s
sys 0m0.002s
Thu Jun 23 14:30:35 BST 2011
14:30:35 up 23:15, 0 users, load average: 135.76, 129.90, 123.12
Thu Jun 23 14:30:35 BST 2011
Thu Jun 23 14:35:35 BST 2011
healthy

real 12m38.171s
user 0m0.000s
sys 0m0.004s
Thu Jun 23 14:48:13 BST 2011
14:48:13 up 23:32, 1 user, load average: 84.03, 119.54, 121.03
Thu Jun 23 14:48:13 BST 2011
Thu Jun 23 14:53:13 BST 2011
healthy

real 3m12.691s
user 0m0.000s
sys 0m0.000s
Thu Jun 23 14:56:26 BST 2011
14:56:26 up 23:40, 1 user, load average: 68.20, 74.51, 96.04
Thu Jun 23 14:56:26 BST 2011

At the moment it seems to work better; the last cat of /proc/fs/lustre/health_check took only

real 0m0.089s
user 0m0.000s
sys 0m0.000s
Thu Jun 23 15:01:26 BST 2011
15:01:26 up 23:45, 1 user, load average: 6.80, 31.34, 71.25
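
For reference, a minimal, runnable version of the monitoring loop described above might look like the following (a sketch only; the interval is inferred from the timestamps and the actual on-site script is not attached):

#!/bin/bash
# Sketch of the health_check monitoring loop (hypothetical, not the actual on-site script).
while true; do
    date
    time cat /proc/fs/lustre/health_check
    date
    uptime
    sleep 300    # the samples above are roughly five minutes apart
done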

Comment by Brian Murrell (Inactive) [ 23/Jun/11 ]

Can you attach the syslogs from the OSSes covering the last 36h?

Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ]

We suspected LU-15, so we have installed Whamcloud's 1.8.6-rc1 on all servers. The situation has improved greatly since then and the system is now stable and responsive; heartbeat is not running, however.

[root@mds1 ~]# pdsh -a uptime
mds2: 11:08:27 up 15:29, 0 users, load average: 0.07, 3.78, 6.27
oss2: 11:08:27 up 16:37, 0 users, load average: 2.22, 3.43, 5.50
oss1: 11:08:27 up 16:22, 0 users, load average: 4.19, 4.36, 4.76
oss3: 11:08:27 up 18:48, 0 users, load average: 5.28, 4.86, 4.80
oss4: 11:08:27 up 17:34, 1 user, load average: 2.74, 2.74, 3.09
mds1: 11:08:27 up 14:12, 2 users, load average: 0.00, 0.00, 0.00
[root@mds1 ~]# time pdsh -a cat /proc/fs/lustre/health_check
mds1: healthy
mds2: healthy
oss1: healthy
oss4: healthy
oss2: healthy
oss3: healthy

real 0m0.219s
user 0m0.055s
sys 0m0.023s

Our logs show that for a time after boot the system load was high and the system was unresponsive, to the point where we think heartbeat would have objected and killed the node. I'm getting the logs for this period currently.

Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ]

If we take one node, oss4, it booted at "Jun 23 17:36:01". Our load monitoring script above wasn't started automatically; the first few entries from after it was started are below. Note that 45 minutes after boot it was still taking over 8 minutes to cat the health_check file. The OSTs were mounted at boot time. As I say, things have quietened down now and the system hasn't restarted since then.

Thu Jun 23 18:15:42 BST 2011
healthy

real 8m28.119s
user 0m0.000s
sys 0m0.002s
Thu Jun 23 18:24:10 BST 2011
18:24:10 up 49 min, 0 users, load average: 45.37, 34.88, 29.60
Thu Jun 23 18:24:10 BST 2011
Thu Jun 23 18:29:10 BST 2011
healthy

real 4m30.552s
user 0m0.000s
sys 0m0.001s
Thu Jun 23 18:33:41 BST 2011
18:33:41 up 59 min, 0 users, load average: 100.18, 54.35, 35.92
Thu Jun 23 18:33:41 BST 2011
Thu Jun 23 18:38:41 BST 2011
healthy

real 1m5.965s
user 0m0.001s
sys 0m0.002s
Thu Jun 23 18:39:46 BST 2011
18:39:46 up 1:05, 0 users, load average: 54.92, 47.41, 37.87
Thu Jun 23 18:39:46 BST 2011
Thu Jun 23 18:44:46 BST 2011
healthy

real 0m0.157s
user 0m0.000s
sys 0m0.003s
Thu Jun 23 18:44:47 BST 2011
18:44:47 up 1:10, 0 users, load average: 13.53, 27.95, 32.20
Thu Jun 23 18:44:47 BST 2011
Thu Jun 23 18:49:47 BST 2011
healthy

real 0m0.077s

All further samples read this file in under 1 second, with the exception of the three samples below.

Thu Jun 23 19:09:48 BST 2011
healthy

real 5m55.093s
user 0m0.000s
sys 0m0.002s
Thu Jun 23 19:15:43 BST 2011

and

Fri Jun 24 01:11:04 BST 2011
healthy

real 0m20.471s
user 0m0.001s
sys 0m0.000s
Fri Jun 24 01:11:24 BST 2011
01:11:24 up 7:36, 0 users, load average: 9.84, 7.82, 7.35
Fri Jun 24 01:11:24 BST 2011
Fri Jun 24 01:16:24 BST 2011
healthy

real 0m34.715s
user 0m0.000s
sys 0m0.002s
Fri Jun 24 01:16:59 BST 2011

Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ]

Logs from oss4 since the last boot.

Comment by Zhenyu Xu [ 24/Jun/11 ]

At OST boot time, oss4 was recovering 887 clients, which generates heavy I/O load.

Jun 23 17:49:01 oss4 kernel: Lustre: 19755:0:(filter.c:1001:filter_init_server_data()) RECOVERY: service lustre-OST0024, 887 recoverable clients, 0 delayed clients, last_rcvd 34359739610

Meanwhile, the log still shows heavy I/O load on oss4 until Jun 24 01:25:37. I don't know whether this is still handling of delayed recovery requests or new I/O from connected clients.

Comment by Zhenyu Xu [ 24/Jun/11 ]

Are OSTs on oss4 pretty full?

Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ]

I understand there will be high load on the servers after a restart with that many active clients, but I wouldn't expect it to cause such delays. The heartbeat monitor timeout is set to ten minutes; any longer than this and the resource is assumed to have problems and the node is rebooted. We use this same timeout everywhere, and whilst asking the customer to increase it is an option, I'd rather avoid doing so.

Yes, they are pretty full; we're working with the customer on this. There seems to have been a usage spike in the last week whose cause we don't understand.

/dev/mapper/ost_lustre_36
3845733384 3252737736 397643584 90% /lustre/lustre/ost_36
/dev/mapper/ost_lustre_37
3845733384 3285744784 364636536 91% /lustre/lustre/ost_37
/dev/mapper/ost_lustre_38
3845733384 3528308788 122072532 97% /lustre/lustre/ost_38
/dev/mapper/ost_lustre_39
3845733384 2973617572 676763748 82% /lustre/lustre/ost_39
/dev/mapper/ost_lustre_40
3845733384 2992170588 658210732 82% /lustre/lustre/ost_40
/dev/mapper/ost_lustre_41
3845733384 3303287544 347093776 91% /lustre/lustre/ost_41
/dev/mapper/ost_lustre_42
3845733384 3561794272 88587048 98% /lustre/lustre/ost_42
/dev/mapper/ost_lustre_43
3845733384 3104758792 545622528 86% /lustre/lustre/ost_43
/dev/mapper/ost_lustre_44
3845733384 3516847956 133533364 97% /lustre/lustre/ost_44
/dev/mapper/ost_lustre_45
3845733384 3312432192 337949128 91% /lustre/lustre/ost_45
/dev/mapper/ost_lustre_46
3845733384 3445718020 204663300 95% /lustre/lustre/ost_46
/dev/mapper/ost_lustre_47
3845733384 3614267360 36113960 100% /lustre/lustre/ost_47

Comment by Zhenyu Xu [ 24/Jun/11 ]

There are several questions we'd like answered:

1. Please check the OSTs' recovery status (e.g. cat /proc/fs/lustre/obdfilter/lustre-OSTxxxx/recovery_status; example commands are sketched below)
2. What is the client applications' write mode? Do they usually create new files or write to existing files?
3. What is the striping setup?
If OST recovery has finished, and the directories are not pinned to particular OSTs, then newly created files will avoid the nearly full OSTs and be allocated on OSTs with relatively more spare capacity. Writes to existing files that already live on the full OSTs, however, cannot be redirected in this way.

4. From the log, Lustre I/O is using the tcp network; should it be using an IB network? Also, heartbeat should be configured on a different network from the one Lustre uses, otherwise they compete and heavy I/O can make heartbeat conclude that the node is dead.
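
For reference, the recovery state and striping defaults asked about above can be read with commands along these lines (a sketch; the OSS proc path matches the recovery_status output later in this ticket, and the client mount point is taken from the lfs df output):

# On each OSS: recovery state for every OST
grep -E 'status|completed_clients' /proc/fs/lustre/obdfilter/lustre-OST*/recovery_status

# On a client: default stripe layout at the filesystem root
lfs getstripe -d /lustre/lustre/client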

Comment by Brian Murrell (Inactive) [ 24/Jun/11 ]

Would I be correct in interpreting that your current pain point here is the time it's taking to read from the health_check proc file?

Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ]

Brian: now we've got the system stable and under control, I think that's a fair summary of the remaining concerns. The load is high and the node is somewhat unresponsive after startup, but we can live with that if we need to.

One thing I did think of is that we could remove this check from the heartbeat monitor script; this would give us a get-out and something to deliver to the customer, but it doesn't resolve the real issue.

What we notice on occasions like this is that the node is basically functional (you can log in, run top, ls and the like), but the load is high, lots of ll_ost processes are in D state, and certain commands, for example sync, take a long time to complete. I'm assuming sync and reading the health_check file are waiting on the same locks.

Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ]

Zhenyu:

1) below
2) I don't know; it's an 800-node university cluster so I expect it's a mixed workload. It's mainly used as scratch space for HPC workloads.
3) There is currently no striping anywhere on the filesystem. Some of the files are 20GB+ and I've recommended that these files, at least, are striped in future.
We don't know where the spike in usage has come from, but I'm assuming it's a small number of very large files, as otherwise the % usage of each OST would be closer together rather than the wide range we currently see.
4) There is no IB network. Heartbeat is running over ethernet, but I think a separate 1GbE link is used for this. The networking stack remains responsive and there is no indication of heartbeat communication timeouts; it's all resource timeouts from trying to read the health_check file.

[root@oss4 ~]# lustre_recovery_status.sh -v
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:recovery_start: 1308847756
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:recovery_duration: 114
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:last_transno: 34359739610
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:recovery_start: 1308847808
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:recovery_duration: 81
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:last_transno: 25769804303
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:recovery_start: 1308847760
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:recovery_duration: 107
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:last_transno: 30064789726
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:recovery_start: 1308847818
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:recovery_duration: 77
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:last_transno: 34359743820
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:recovery_start: 1308847769
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:recovery_duration: 102
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:last_transno: 34359738471
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:recovery_start: 1308847828
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:recovery_duration: 75
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:last_transno: 34359755743
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:recovery_start: 1308847774
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:recovery_duration: 99
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:last_transno: 34359740488
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:recovery_start: 1308847837
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:recovery_duration: 75
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:last_transno: 34359798280
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:recovery_start: 1308847788
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:recovery_duration: 101
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:last_transno: 30064773071
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:recovery_start: 1308847853
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:recovery_duration: 69
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:last_transno: 25769806937
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:recovery_start: 1308847799
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:recovery_duration: 90
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:last_transno: 30064772112
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:status: COMPLETE
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:recovery_start: 1308847855
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:recovery_duration: 75
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:delayed_clients: 0/887
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:completed_clients: 887/887
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:replayed_requests: 0
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:last_transno: 30064807203

Comment by Brian Murrell (Inactive) [ 24/Jun/11 ]

Ashley,

There is an alternative and it would be useful to try it even if just for a test.

Reading from the health_check proc file causes a write to the OST during its operation, as a check that writing to the disk is healthy.

This part of the health_check can be suppressed by building lustre with the --disable-health_write configure option.

Of course the result is a slightly less thorough health_check, but if you could do that, even just as a test to see whether it alleviates your health_check proc file read latency, then we would at least know where the problem is.
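
For reference, a rebuild with that option would look roughly like the following (a sketch assuming a standard source build; the source directory and kernel path are illustrative and the actual DDN/Whamcloud packaging steps may differ):

# Hypothetical rebuild with the health_check write test disabled
cd lustre-1.8.6
./configure --disable-health_write --with-linux=/usr/src/kernels/2.6.18-194.32.1.el5-lustre
make rpms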

Comment by Ashley Pittman (Inactive) [ 24/Jun/11 ]

That's good to know. I'll prepare a build with this option set so we have it ready.

At this stage I'm not going to force a reboot on the customer but we'll discuss this with them Monday.

Comment by Brian Murrell (Inactive) [ 28/Jun/11 ]

Is there any update on this issue? Were you able to perform the prescribed test?

Comment by Ashley Pittman (Inactive) [ 29/Jun/11 ]

No update yet. We have new packages on-site and ready to deploy, but we need to liaise with the customer over the schedule for doing this.

At this time we are assuming this to be a duplicate of LU-15.

The filesystem is still unpleasantly full, with between 58% and 100% of blocks used per OST and 84% used on average.

Comment by Ashley Pittman (Inactive) [ 29/Jun/11 ]

The update is confirmed for tomorrow. Other than monitoring the load average and the time taken to cat the health_check file, is there anything you want us to check after boot?

Comment by Ashley Pittman (Inactive) [ 01/Jul/11 ]

The upgrade has been performed and the system is now running under the control of heartbeat again. The --disable-health-write option made all the difference; at no point during startup did reading this file take more than a second.

The underlying issue remains, though, and through testing I've confirmed it's the same issue we've seen elsewhere: for some time after startup, any write to an OST can stall for a considerable period. Using our ost-survey script I got the following results (measured shortly after recovery had finished):

OST 0: 1073741824 bytes (1.1 GB) copied, 118.534 seconds, 9.1 MB/s
OST 1: 1073741824 bytes (1.1 GB) copied, 9.31611 seconds, 115 MB/s
OST 2: 1073741824 bytes (1.1 GB) copied, 9.21754 seconds, 116 MB/s
OST 3: 1073741824 bytes (1.1 GB) copied, 54.316 seconds, 19.8 MB/s
OST 4: 1073741824 bytes (1.1 GB) copied, 9.18021 seconds, 117 MB/s
OST 5: 1073741824 bytes (1.1 GB) copied, 12.2757 seconds, 87.5 MB/s
OST 6: 1073741824 bytes (1.1 GB) copied, 327.987 seconds, 3.3 MB/s
OST 7: 1073741824 bytes (1.1 GB) copied, 24.4431 seconds, 43.9 MB/s
OST 8: 1073741824 bytes (1.1 GB) copied, 10.2977 seconds, 104 MB/s
OST 9: 1073741824 bytes (1.1 GB) copied, 330.015 seconds, 3.3 MB/s
OST 10: 1073741824 bytes (1.1 GB) copied, 24.5828 seconds, 43.7 MB/s
OST 11: 1073741824 bytes (1.1 GB) copied, 339.283 seconds, 3.2 MB/s
OST 12: 1073741824 bytes (1.1 GB) copied, 10.2614 seconds, 105 MB/s
OST 13: 1073741824 bytes (1.1 GB) copied, 10.6387 seconds, 101 MB/s
OST 14: 1073741824 bytes (1.1 GB) copied, 9.18088 seconds, 117 MB/s
OST 15: 1073741824 bytes (1.1 GB) copied, 33.046 seconds, 32.5 MB/s
OST 16: 1073741824 bytes (1.1 GB) copied, 9.61687 seconds, 112 MB/s
OST 17: 1073741824 bytes (1.1 GB) copied, 12.578 seconds, 85.4 MB/s
OST 18: 1073741824 bytes (1.1 GB) copied, 333.312 seconds, 3.2 MB/s
OST 19: 1073741824 bytes (1.1 GB) copied, 16.8783 seconds, 63.6 MB/s
OST 20: 1073741824 bytes (1.1 GB) copied, 10.7011 seconds, 100 MB/s
OST 21: 1073741824 bytes (1.1 GB) copied, 9.51567 seconds, 113 MB/s
OST 22: 1073741824 bytes (1.1 GB) copied, 335.516 seconds, 3.2 MB/s
OST 23: 1073741824 bytes (1.1 GB) copied, 345.44 seconds, 3.1 MB/s
OST 24: 1073741824 bytes (1.1 GB) copied, 334.839 seconds, 3.2 MB/s
OST 25: 1073741824 bytes (1.1 GB) copied, 19.8422 seconds, 54.1 MB/s
OST 26: 1073741824 bytes (1.1 GB) copied, 18.0309 seconds, 59.5 MB/s
OST 27: 1073741824 bytes (1.1 GB) copied, 19.0357 seconds, 56.4 MB/s
OST 28: 1073741824 bytes (1.1 GB) copied, 17.4743 seconds, 61.4 MB/s
OST 29: 1073741824 bytes (1.1 GB) copied, 9.03472 seconds, 119 MB/s
OST 30: 1073741824 bytes (1.1 GB) copied, 9.09998 seconds, 118 MB/s
OST 31: 1073741824 bytes (1.1 GB) copied, 27.8013 seconds, 38.6 MB/s
OST 32: 1073741824 bytes (1.1 GB) copied, 9.90428 seconds, 108 MB/s
OST 33: 1073741824 bytes (1.1 GB) copied, 26.576 seconds, 40.4 MB/s
OST 34: 1073741824 bytes (1.1 GB) copied, 9.02994 seconds, 119 MB/s
OST 35: 1073741824 bytes (1.1 GB) copied, 318.633 seconds, 3.4 MB/s
OST 36: 1073741824 bytes (1.1 GB) copied, 33.0055 seconds, 32.5 MB/s
OST 37: 1073741824 bytes (1.1 GB) copied, 10.093 seconds, 106 MB/s
OST 38: 1073741824 bytes (1.1 GB) copied, 319.42 seconds, 3.4 MB/s
OST 39: 1073741824 bytes (1.1 GB) copied, 9.10729 seconds, 118 MB/s
OST 40: 1073741824 bytes (1.1 GB) copied, 20.39 seconds, 52.7 MB/s
OST 41: 1073741824 bytes (1.1 GB) copied, 236.857 seconds, 4.5 MB/s
OST 42: 1073741824 bytes (1.1 GB) copied, 8.89273 seconds, 121 MB/s
OST 43: 1073741824 bytes (1.1 GB) copied, 37.6976 seconds, 28.5 MB/s
OST 44: 1073741824 bytes (1.1 GB) copied, 131.119 seconds, 8.2 MB/s
OST 45: 1073741824 bytes (1.1 GB) copied, 9.81052 seconds, 109 MB/s
OST 46: 1073741824 bytes (1.1 GB) copied, 320.598 seconds, 3.3 MB/s
OST 47: 1073741824 bytes (1.1 GB) copied, 313.122 seconds, 3.4 MB/s

This is an older system, so 100MB/s is about right; what we see, though, is that some OSTs take considerably longer. Monitoring the file size as the above test runs, the file grows up to a point and then stays at that size for a considerable time: whilst the OST is accepting writes the performance is as expected, but it freezes for long periods. Rerunning the same test now gives good results for all OSTs. If the health_check read was performing a write to every OST on a node, this explains why it was taking so long to read the file.
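
For reference, a per-OST write test along the lines described (not the actual ost-survey script, which isn't attached here; the mount point, file size and dd options are assumptions) could be sketched as:

# Hypothetical per-OST write test: place one file on each OST and time a 1GB write
for idx in $(seq 0 47); do
    f=/lustre/lustre/client/ost_write_test.$idx
    lfs setstripe -c 1 -i $idx $f              # single stripe, pinned to OST index $idx
    echo -n "OST $idx: "
    dd if=/dev/zero of=$f bs=1M count=1024 2>&1 | tail -1
    rm -f $f
done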

Comment by Zhenyu Xu [ 04/Jul/11 ]

The --disable-health-write configure option helps the heartbeat health check. Marking this as a duplicate of LU-15 for the slow I/O issue on almost-full OSTs.
