<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:07:09 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-450] System unresponsive and hitting LBUG() after 1.6 =&gt; 1.8 upgrade</title>
                <link>https://jira.whamcloud.com/browse/LU-450</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After a site visit yesterday to upgrade from 1.6 to 1.8, the filesystem is now unstable, with &apos;cat /proc/fs/lustre/health_check&apos; on the OSSs taking up to 18 minutes to complete, a system load of 200+ on the OSSs, and this evening several LBUG()s.&lt;/p&gt;

&lt;p&gt;Yesterday we upgraded from 1.6.7.2 to 1.8.4.ddn3.1, configured quotas on the system, and fixed an issue with LAST_ID on ost_12 which was causing it to be set inactive at start.&lt;/p&gt;

&lt;p&gt;It&apos;s possible that the upgrade is a red herring. We first had problems with heartbeat restarting the MDS last Thursday; it started taking too long to read health_check on the MDS around 3am last Tuesday morning. At that time I restarted all servers and it was OK again until Friday, but it was restarting every half hour over the weekend.  We didn&apos;t do anything Monday because of the site shutdown and upgrade scheduled for Tuesday.&lt;/p&gt;

&lt;p&gt;Also, since the restart the OSTs have been filling up at an alarming rate; they&apos;ve gone from ~70% up to 100% in some cases. I&apos;m speaking to the customer to see if this is real data and whether they can stem the tide somehow.&lt;/p&gt;</description>
                <environment>Longstanding 1.6 installation, RHEL5.3, ddn 9550, 48 OSTs, 4 OSS. 10g network, 800 clients.&lt;br/&gt;
&lt;br/&gt;
Exact version is &lt;br/&gt;
lustre: 1.8.4.ddn3.1&lt;br/&gt;
kernel: patchless_client&lt;br/&gt;
build:  1.8.4.ddn3.1-20110406235128-PRISTINE-2.6.18-194.32.1.el5_lustre.1.8.4.ddn3.1.20110406235217&lt;br/&gt;
&lt;br/&gt;
&lt;a href=&quot;https://fseng.ddn.com/es_browser/record?record=es_lustre_showall_2011-06-21_180652&amp;site=UCL&amp;system=lustre&quot;&gt;https://fseng.ddn.com/es_browser/record?record=es_lustre_showall_2011-06-21_180652&amp;amp;site=UCL&amp;amp;system=lustre&lt;/a&gt;</environment>
        <key id="11221">LU-450</key>
            <summary>System unresponsive and hitting LBUG() after 1.6 =&gt; 1.8 upgrade</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="ihara">Shuichi Ihara</reporter>
                        <labels>
                            <label>ucl</label>
                    </labels>
                <created>Wed, 22 Jun 2011 17:37:44 +0000</created>
                <updated>Fri, 26 Feb 2016 15:10:48 +0000</updated>
                            <resolved>Mon, 4 Jul 2011 12:04:07 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="16811" author="apittman" created="Wed, 22 Jun 2011 17:51:13 +0000"  >&lt;p&gt;The LBUG data I have is below, this came out on syslog from a client which I don&apos;t have access to, it looks like Oracle bug 20278.  I don&apos;t have access to the clients to know what version they are running, this has been requested from the customer.  The system starting experiencing load at lunchtime, I suspect this is a symptom rather than a cause.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=20278&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/show_bug.cgi?id=20278&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...&lt;br/&gt;
10.143.17.85 kernel: LustreError: 4424:0:(osc_request.c:1041:osc_init_grant()) ASSERTION(cli-&amp;gt;cl_avail_grant &amp;gt;= 0) failed &lt;/p&gt;

&lt;p&gt;Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...&lt;br/&gt;
10.143.17.85 kernel: LustreError: 4424:0:(osc_request.c:1041:osc_init_grant()) LBUG &lt;/p&gt;

&lt;p&gt;Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...&lt;br/&gt;
10.143.17.85 kernel: LustreError: 4423:0:(osc_request.c:761:osc_consume_write_grant()) ASSERTION(cli-&amp;gt;cl_avail_grant &amp;gt;= 0) failed: invalid avail grant is -225280  &lt;/p&gt;

&lt;p&gt;Message from syslogd@ at Wed Jun 22 20:29:45 2011 ...&lt;br/&gt;
10.143.17.85 kernel: LustreError: 4423:0:(osc_request.c:761:osc_consume_write_grant()) LBUG&lt;/p&gt;</comment>
                            <comment id="16812" author="pjones" created="Wed, 22 Jun 2011 18:50:32 +0000"  >&lt;p&gt;Bobi&lt;/p&gt;

&lt;p&gt;Can you please help out with this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="16818" author="bobijam" created="Wed, 22 Jun 2011 22:29:34 +0000"  >&lt;p&gt;Ashley, please upgrade those affected clients to 1.8.4 as well.&lt;/p&gt;</comment>
                            <comment id="16823" author="whay" created="Thu, 23 Jun 2011 03:47:10 +0000"  >&lt;p&gt;Hi Ashley/DDN&apos;s customer here: &lt;/p&gt;

&lt;p&gt;The lustre version on the clients is currently:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@node-a01 ~&amp;#93;&lt;/span&gt;# rpm -qa|grep lustre&lt;br/&gt;
lustre-1.8.3.ddn3.3-2.6.18_194.11.4.el5_201009291220.x86_64&lt;br/&gt;
lustre-modules-1.8.3.ddn3.3-2.6.18_194.11.4.el5_201009291220.x86_64&lt;/p&gt;

&lt;p&gt;The particular node generating the LBUG isn&apos;t accessible to us either; it won&apos;t respond to ssh.  The remote console was&lt;br/&gt;
showing stack traces, of which we have video (more than a screenful and going by too fast).&lt;/p&gt;

&lt;p&gt;At the moment we can do an ls (or even ls -l) of lustre on the client, but running df on the lustre&lt;br/&gt;
file system hangs.  We have quite a few hung df processes on the nodes from various health-checking scripts.&lt;/p&gt;
</comment>
                            <comment id="16824" author="apittman" created="Thu, 23 Jun 2011 04:40:05 +0000"  >&lt;p&gt;William.&lt;/p&gt;

&lt;p&gt;Last night I could mount lustre on the spare MDS node; the mount would happen successfully, however df would then hang, while lfs df would still give meaningful output.  In particular most OSTs would report a % used, and some OSTs would be listed as &quot;inactive device&quot;.  Checking these OSTs on the OSS nodes, it appears not all OSTs were running last night; I&apos;ve now started all of them, which should at least allow df to complete.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@mds2 ~&amp;#93;&lt;/span&gt;# lfs df&lt;br/&gt;
UUID                   1K-blocks        Used   Available Use% Mounted on&lt;br/&gt;
lustre-MDT0000_UUID   1020122348    49082024   912742776   4% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;MDT:0&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0000_UUID   3845733384  3122213084   528163112  81% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:0&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0001_UUID   3845733384  2791898800   858481064  72% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:1&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0002_UUID   3845733384  3609962120    40414452  93% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:2&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0003_UUID   3845733384  3648220176     2161080  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:3&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0004_UUID : inactive device&lt;br/&gt;
lustre-OST0005_UUID   3845733384  2302205248  1348167284  59% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:5&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0006_UUID   3845733384  3192676748   457694944  83% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:6&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0007_UUID   3845733384  3648241424     2124416  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:7&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0008_UUID   3845733384  2836017412   814359824  73% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:8&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0009_UUID   3845733384  2895836516   754539188  75% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:9&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST000a_UUID   3845733384  3648183844     2197276  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:10&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST000b_UUID   3845733384  2867535040   782837272  74% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:11&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST000c_UUID : inactive device&lt;br/&gt;
lustre-OST000d_UUID   3845733384  2410525076  1239851436  62% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:13&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST000e_UUID : inactive device&lt;br/&gt;
lustre-OST000f_UUID   3845733384  2685155768   965222468  69% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:15&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0010_UUID   3845733384  2596218232  1054154088  67% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:16&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0011_UUID : inactive device&lt;br/&gt;
lustre-OST0012_UUID   3845733384  3335791016   314580064  86% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:18&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0013_UUID   3845733384  2451923448  1198456676  63% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:19&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0014_UUID   3845733384  3619509900    30864044  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:20&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0015_UUID   3845733384  2509919836  1140451816  65% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:21&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0016_UUID   3845733384  3370192404   280186576  87% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:22&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0017_UUID   3845733384  3225873140   424505108  83% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:23&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0018_UUID   3845733384  2973690904   676670960  77% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:24&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0019_UUID   3845733384  2702246688   948124556  70% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:25&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST001a_UUID   3845733384  2737444208   912908680  71% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:26&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST001b_UUID   3845733384  2284905224  1365467664  59% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:27&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST001c_UUID   3845733384  2265870020  1384505944  58% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:28&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST001d_UUID   3845733384  3648118412     2251264  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:29&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST001e_UUID   3845733384  2944649824   705727328  76% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:30&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST001f_UUID   3845733384  2885398996   764982320  75% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:31&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0020_UUID   3845733384  3285192664   365172820  85% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:32&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0021_UUID   3845733384  3648099688     2278432  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:33&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0022_UUID   3845733384  3648147796     2232244  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:34&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0023_UUID   3845733384  3187297968   463083164  82% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:35&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0024_UUID   3845733384  3222652840   427724384  83% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:36&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0025_UUID   3845733384  3647790324     2574548  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:37&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0026_UUID   3845733384  3304111792   346267260  85% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:38&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0027_UUID   3845733384  2826439808   823933796  73% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:39&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0028_UUID   3845733384  3648074132     2291824  94% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:40&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0029_UUID   3845733384  3401982860   248394260  88% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:41&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST002a_UUID   3845733384  2860903912   789477404  74% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:42&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST002b_UUID   3845733384  3500709016   149660016  91% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:43&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST002c_UUID   3845733384  3021170152   629187196  78% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:44&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST002d_UUID   3845733384  3517798940   132562924  91% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:45&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST002e_UUID   3845733384  3169784092   480593960  82% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:46&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST002f_UUID   3845733384  3482148444   168232868  90% /lustre/lustre/client&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:47&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;filesystem summary:  169212268896 136582727936 24033718004  80% /lustre/lustre/client&lt;/p&gt;

&lt;p&gt;Note that because of the age of this system it wasn&apos;t formatted with the --index option so the translation from device to index is not constant.&lt;/p&gt;</comment>
                            <comment id="16825" author="apittman" created="Thu, 23 Jun 2011 04:41:35 +0000"  >&lt;p&gt;Zhenyu:&lt;/p&gt;

&lt;p&gt;It appears only one client out of 800 LBUGed; do you still recommend upgrading all clients because of this?&lt;/p&gt;

&lt;p&gt;Ashley.&lt;/p&gt;</comment>
                            <comment id="16827" author="bobijam" created="Thu, 23 Jun 2011 04:58:19 +0000"  >&lt;p&gt;for the time being, it&apos;s better upgrading this single affected client.&lt;/p&gt;</comment>
                            <comment id="16831" author="apittman" created="Thu, 23 Jun 2011 05:40:47 +0000"  >&lt;p&gt;Zhenyu:&lt;/p&gt;

&lt;p&gt;Regarding the client LBUG, is this likely to be because some of the OSTs are now full, or could it be a quota issue?  Short-term updates to the clients are unlikely, so would disabling quotas on the servers avoid this LBUG again?&lt;/p&gt;

&lt;p&gt;Can you tell us what information you need to be able to give further advice on how to stabilise the system?&lt;/p&gt;</comment>
                            <comment id="16834" author="svtr" created="Thu, 23 Jun 2011 10:10:16 +0000"  >&lt;p&gt;some more tests trying to access /proc/fs/lustre/health_check on one of the OSS nodes.&lt;br/&gt;
The script is executing&lt;/p&gt;

&lt;p&gt;date&lt;br/&gt;
time cat /proc/fs/lustre/health_check&lt;br/&gt;
date&lt;br/&gt;
uptime&lt;/p&gt;

&lt;p&gt;in a loop. The results look like this:&lt;/p&gt;

&lt;p&gt;Thu Jun 23 12:45:15 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    18m36.173s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.000s&lt;br/&gt;
Thu Jun 23 13:03:51 BST 2011&lt;br/&gt;
 13:03:51 up 21:48,  0 users,  load average: 113.28, 109.27, 85.87&lt;br/&gt;
Thu Jun 23 13:03:51 BST 2011&lt;br/&gt;
Thu Jun 23 13:08:51 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    49m49.152s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.002s&lt;br/&gt;
Thu Jun 23 13:58:40 BST 2011&lt;br/&gt;
 13:58:40 up 22:43,  0 users,  load average: 100.87, 102.32, 99.78&lt;br/&gt;
Thu Jun 23 13:58:40 BST 2011&lt;br/&gt;
Thu Jun 23 14:03:40 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    26m55.176s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.002s&lt;br/&gt;
Thu Jun 23 14:30:35 BST 2011&lt;br/&gt;
 14:30:35 up 23:15,  0 users,  load average: 135.76, 129.90, 123.12&lt;br/&gt;
Thu Jun 23 14:30:35 BST 2011&lt;br/&gt;
Thu Jun 23 14:35:35 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    12m38.171s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.004s&lt;br/&gt;
Thu Jun 23 14:48:13 BST 2011&lt;br/&gt;
 14:48:13 up 23:32,  1 user,  load average: 84.03, 119.54, 121.03&lt;br/&gt;
Thu Jun 23 14:48:13 BST 2011&lt;br/&gt;
Thu Jun 23 14:53:13 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    3m12.691s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.000s&lt;br/&gt;
Thu Jun 23 14:56:26 BST 2011&lt;br/&gt;
 14:56:26 up 23:40,  1 user,  load average: 68.20, 74.51, 96.04&lt;br/&gt;
Thu Jun 23 14:56:26 BST 2011&lt;/p&gt;

&lt;p&gt;At the moment it seems to work better; the last cat /proc/fs/lustre/health_check took only&lt;/p&gt;

&lt;p&gt;real    0m0.089s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.000s&lt;br/&gt;
Thu Jun 23 15:01:26 BST 2011&lt;br/&gt;
 15:01:26 up 23:45,  1 user,  load average: 6.80, 31.34, 71.25&lt;/p&gt;

</comment>
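<!--
A minimal sketch of the health_check polling loop described in the preceding comment
(assuming bash; the exact on-site script is not attached, and the interval is inferred
from the roughly five-minute gaps between timestamps above):

    #!/bin/bash
    # Poll the OSS health file, recording how long each read takes plus the load average.
    while true; do
        date
        time cat /proc/fs/lustre/health_check
        date
        uptime
        sleep 300    # the timestamps above suggest roughly a 5-minute gap between samples
    done
-->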
                            <comment id="16835" author="brian" created="Thu, 23 Jun 2011 11:17:55 +0000"  >&lt;p&gt;Can you attach the syslogs from the OSSes covering the last 36h?&lt;/p&gt;</comment>
                            <comment id="16937" author="apittman" created="Fri, 24 Jun 2011 06:11:09 +0000"  >&lt;p&gt;We suspected &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15&quot; title=&quot;strange slow IO messages and bad performance &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15&quot;&gt;&lt;del&gt;LU-15&lt;/del&gt;&lt;/a&gt; so have installed Whamclouds 1.8.6-rc1 on all servers, the situation has improved greatly since and the system is now stable and responsive, heartbeat is not running however.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@mds1 ~&amp;#93;&lt;/span&gt;# pdsh -a uptime&lt;br/&gt;
mds2:  11:08:27 up 15:29,  0 users,  load average: 0.07, 3.78, 6.27&lt;br/&gt;
oss2:  11:08:27 up 16:37,  0 users,  load average: 2.22, 3.43, 5.50&lt;br/&gt;
oss1:  11:08:27 up 16:22,  0 users,  load average: 4.19, 4.36, 4.76&lt;br/&gt;
oss3:  11:08:27 up 18:48,  0 users,  load average: 5.28, 4.86, 4.80&lt;br/&gt;
oss4:  11:08:27 up 17:34,  1 user,  load average: 2.74, 2.74, 3.09&lt;br/&gt;
mds1:  11:08:27 up 14:12,  2 users,  load average: 0.00, 0.00, 0.00&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@mds1 ~&amp;#93;&lt;/span&gt;# time pdsh -a cat /proc/fs/lustre/health_check &lt;br/&gt;
mds1: healthy&lt;br/&gt;
mds2: healthy&lt;br/&gt;
oss1: healthy&lt;br/&gt;
oss4: healthy&lt;br/&gt;
oss2: healthy&lt;br/&gt;
oss3: healthy&lt;/p&gt;

&lt;p&gt;real    0m0.219s&lt;br/&gt;
user    0m0.055s&lt;br/&gt;
sys     0m0.023s&lt;/p&gt;

&lt;p&gt;Our logs show that for a time after boot the system load was high and the system was unresponsive, to the point where we think heartbeat would object and kill the node; I&apos;m getting the logs for this time currently.&lt;/p&gt;</comment>
                            <comment id="16938" author="apittman" created="Fri, 24 Jun 2011 06:18:54 +0000"  >&lt;p&gt;If we take one node, oss4 it booted at &quot;Jun 23 17:36:01&quot;.  Our load monitoring script above wasn&apos;t started automatically, the first few entries from after it was started are below.  Note that 45 minutes after boot it was still taking over 8 minutes to cat the health_check file.  The OSTs were mounted at boot time.  As I say things have quietened down now and the system hasn&apos;t restarted since this time.&lt;/p&gt;

&lt;p&gt;Thu Jun 23 18:15:42 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    8m28.119s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.002s&lt;br/&gt;
Thu Jun 23 18:24:10 BST 2011&lt;br/&gt;
 18:24:10 up 49 min,  0 users,  load average: 45.37, 34.88, 29.60&lt;br/&gt;
Thu Jun 23 18:24:10 BST 2011&lt;br/&gt;
Thu Jun 23 18:29:10 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    4m30.552s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.001s&lt;br/&gt;
Thu Jun 23 18:33:41 BST 2011&lt;br/&gt;
 18:33:41 up 59 min,  0 users,  load average: 100.18, 54.35, 35.92&lt;br/&gt;
Thu Jun 23 18:33:41 BST 2011&lt;br/&gt;
Thu Jun 23 18:38:41 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    1m5.965s&lt;br/&gt;
user    0m0.001s&lt;br/&gt;
sys     0m0.002s&lt;br/&gt;
Thu Jun 23 18:39:46 BST 2011&lt;br/&gt;
 18:39:46 up  1:05,  0 users,  load average: 54.92, 47.41, 37.87&lt;br/&gt;
Thu Jun 23 18:39:46 BST 2011&lt;br/&gt;
Thu Jun 23 18:44:46 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    0m0.157s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.003s&lt;br/&gt;
Thu Jun 23 18:44:47 BST 2011&lt;br/&gt;
 18:44:47 up  1:10,  0 users,  load average: 13.53, 27.95, 32.20&lt;br/&gt;
Thu Jun 23 18:44:47 BST 2011&lt;br/&gt;
Thu Jun 23 18:49:47 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    0m0.077s&lt;/p&gt;

&lt;p&gt;All further samples are below 1 second to read this file with the exception of the three samples below.&lt;/p&gt;

&lt;p&gt;Thu Jun 23 19:09:48 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    5m55.093s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.002s&lt;br/&gt;
Thu Jun 23 19:15:43 BST 2011&lt;/p&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;p&gt;Fri Jun 24 01:11:04 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    0m20.471s&lt;br/&gt;
user    0m0.001s&lt;br/&gt;
sys     0m0.000s&lt;br/&gt;
Fri Jun 24 01:11:24 BST 2011&lt;br/&gt;
 01:11:24 up  7:36,  0 users,  load average: 9.84, 7.82, 7.35&lt;br/&gt;
Fri Jun 24 01:11:24 BST 2011&lt;br/&gt;
Fri Jun 24 01:16:24 BST 2011&lt;br/&gt;
healthy&lt;/p&gt;

&lt;p&gt;real    0m34.715s&lt;br/&gt;
user    0m0.000s&lt;br/&gt;
sys     0m0.002s&lt;br/&gt;
Fri Jun 24 01:16:59 BST 2011&lt;/p&gt;</comment>
                            <comment id="16939" author="apittman" created="Fri, 24 Jun 2011 06:27:49 +0000"  >&lt;p&gt;Logs from oss4 since the last boot.&lt;/p&gt;</comment>
                            <comment id="16942" author="bobijam" created="Fri, 24 Jun 2011 07:33:16 +0000"  >&lt;p&gt;at the OST boot time, oss4 was recovering 887 clients, which generate heavy IO load.&lt;/p&gt;

&lt;p&gt;Jun 23 17:49:01 oss4 kernel: Lustre: 19755:0:(filter.c:1001:filter_init_server_data()) RECOVERY: service lustre-OST0024, 887 recoverable clients, 0 delayed clients, last_rcvd 34359739610&lt;/p&gt;

&lt;p&gt;while until Jun 24 01:25:37 there still shows heavy IO load on oss4. Don&apos;t know whether they are still delayed recovery requests IO handling or new ones from connected clients.&lt;/p&gt;</comment>
                            <comment id="16943" author="bobijam" created="Fri, 24 Jun 2011 07:40:14 +0000"  >&lt;p&gt;Are OSTs on oss4 pretty full?&lt;/p&gt;</comment>
                            <comment id="16945" author="apittman" created="Fri, 24 Jun 2011 07:50:09 +0000"  >&lt;p&gt;I understand there will be high load on servers after restart with that many active clients I wouldn&apos;t expect it to cause such delays though.  The heartbeat monitor timeout is set to ten minutes, any longer than this and the resource is assumed to have problems and the node is rebooted.  We use this same timeout everywhere and whilst asking the customer to increase it is an option I&apos;d rather avoid doing this.&lt;/p&gt;

&lt;p&gt;Yes they are pretty full, we&apos;re working with the customer on this.  There seems to be a usage spike in the last week which we don&apos;t understand the cause of.&lt;/p&gt;

&lt;p&gt;/dev/mapper/ost_lustre_36&lt;br/&gt;
                     3845733384 3252737736 397643584  90% /lustre/lustre/ost_36&lt;br/&gt;
/dev/mapper/ost_lustre_37&lt;br/&gt;
                     3845733384 3285744784 364636536  91% /lustre/lustre/ost_37&lt;br/&gt;
/dev/mapper/ost_lustre_38&lt;br/&gt;
                     3845733384 3528308788 122072532  97% /lustre/lustre/ost_38&lt;br/&gt;
/dev/mapper/ost_lustre_39&lt;br/&gt;
                     3845733384 2973617572 676763748  82% /lustre/lustre/ost_39&lt;br/&gt;
/dev/mapper/ost_lustre_40&lt;br/&gt;
                     3845733384 2992170588 658210732  82% /lustre/lustre/ost_40&lt;br/&gt;
/dev/mapper/ost_lustre_41&lt;br/&gt;
                     3845733384 3303287544 347093776  91% /lustre/lustre/ost_41&lt;br/&gt;
/dev/mapper/ost_lustre_42&lt;br/&gt;
                     3845733384 3561794272  88587048  98% /lustre/lustre/ost_42&lt;br/&gt;
/dev/mapper/ost_lustre_43&lt;br/&gt;
                     3845733384 3104758792 545622528  86% /lustre/lustre/ost_43&lt;br/&gt;
/dev/mapper/ost_lustre_44&lt;br/&gt;
                     3845733384 3516847956 133533364  97% /lustre/lustre/ost_44&lt;br/&gt;
/dev/mapper/ost_lustre_45&lt;br/&gt;
                     3845733384 3312432192 337949128  91% /lustre/lustre/ost_45&lt;br/&gt;
/dev/mapper/ost_lustre_46&lt;br/&gt;
                     3845733384 3445718020 204663300  95% /lustre/lustre/ost_46&lt;br/&gt;
/dev/mapper/ost_lustre_47&lt;br/&gt;
                     3845733384 3614267360  36113960 100% /lustre/lustre/ost_47&lt;/p&gt;</comment>
                            <comment id="16946" author="bobijam" created="Fri, 24 Jun 2011 08:21:36 +0000"  >&lt;p&gt;Several questions we&apos;d like to know:&lt;/p&gt;

&lt;p&gt;1. please check OST&apos;s recover status (like: cat /proc/obdfilter/lustre-OSTxxxxx/recovery_status)&lt;br/&gt;
2. what&apos;s the client apps write mode? Are they usually create new files or write existing old files?&lt;br/&gt;
3. what&apos;s the stripe mode setup?&lt;br/&gt;
   If OST&apos;s recovery finishes, and if the dirs are not dedicated to allocate on certain OSTs, when the OSTs getting almost full, new created files would avoid them finding other OSTs with relatively spare capacity. But writing to old files already on the full OSTs does not fit this mode.&lt;/p&gt;

&lt;p&gt;4. From the log, lustre IO is using tcp network, shall it use ib network? And heartbeat should be configured using network different from what lustre uses, or they&apos;d compete and heavy IO would make heartbeat illude that the ndoe is dead.&lt;/p&gt;
</comment>
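<!--
A brief sketch of how the stripe questions above can be answered from a client using the
standard lfs tool (a hedged example: the directory paths are illustrative, and exact option
spellings can vary slightly between lfs versions):

    # Show the stripe count/size and the OST objects backing an existing file or directory:
    lfs getstripe /lustre/lustre/client/some/dir

    # Stripe newly created files in a directory across several OSTs (here 4), so that very
    # large files are not confined to a single, possibly nearly-full, OST:
    lfs setstripe -c 4 /lustre/lustre/client/some/dir
-->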
                            <comment id="16947" author="brian" created="Fri, 24 Jun 2011 09:16:38 +0000"  >&lt;p&gt;Would I be correct in interpreting that your current pain point here is the time it&apos;s taking to read from the health_check proc file?&lt;/p&gt;</comment>
                            <comment id="16950" author="apittman" created="Fri, 24 Jun 2011 11:54:41 +0000"  >&lt;p&gt;Brian.  Now we&apos;ve got the system stable and under control I think that&apos;s a fair summary of the remaining concerns.  The load is high and the node is somewhat unresponsive after startup but we can live with that if we need to.&lt;/p&gt;

&lt;p&gt;One thing I did think of is we could remove this check from the heart beat monitor script, this would give us a get-out and something to deliver to the customer but doesn&apos;t resolve the real issue.&lt;/p&gt;

&lt;p&gt;What we notice during occasions like this is the node is basically functional, you can login, run top, ls and the like but the load will be high, lots of ll_ost processes will be in D state and certain commands, for example sync will take a long time to complete.  I&apos;m assuming sync and reading the health_check file are waiting for the same locks.&lt;/p&gt;</comment>
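<!--
When a node is in the state described above (high load, ll_ost threads stuck in D state), one
generic way to see where those threads are blocked is the kernel SysRq facility; a hedged
suggestion assuming sysrq is enabled, not something recorded as run on this ticket:

    # List uninterruptible (D-state) processes:
    ps -eo state,pid,comm | awk '$1 == "D"'

    # Dump the kernel stacks of all blocked tasks to syslog for later inspection:
    echo w > /proc/sysrq-trigger
-->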
                            <comment id="16951" author="apittman" created="Fri, 24 Jun 2011 12:01:02 +0000"  >&lt;p&gt;Zhenyu:&lt;/p&gt;

&lt;p&gt;1) below&lt;br/&gt;
2) I don&apos;t know; it&apos;s an 800-node university cluster so I expect it&apos;s a mixed workload.  It&apos;s mainly used as scratch for HPC workloads.&lt;br/&gt;
3) There is currently no striping anywhere on the filesystem.  Some of the files are 20GB+ and I&apos;ve recommended that these files at least are striped in future.&lt;br/&gt;
We don&apos;t know where the spike in usage has come from, but I&apos;m assuming it&apos;s a small number of very large files, as otherwise the % usage of each OST would be closer together rather than the current wide range we see.&lt;br/&gt;
4) There is no ib network.  heartbeat is running over ethernet, but I think it&apos;s a separate 1g link used for this.  The networking stack remains responsive and there is no indication of heartbeat communication timeouts; it&apos;s all resource timeouts from trying to read the health_check file.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss4 ~&amp;#93;&lt;/span&gt;# lustre_recovery_status.sh -v&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:recovery_start: 1308847756&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:recovery_duration: 114&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0024/recovery_status:last_transno: 34359739610&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:recovery_start: 1308847808&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:recovery_duration: 81&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0025/recovery_status:last_transno: 25769804303&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:recovery_start: 1308847760&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:recovery_duration: 107&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0026/recovery_status:last_transno: 30064789726&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:recovery_start: 1308847818&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:recovery_duration: 77&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0027/recovery_status:last_transno: 34359743820&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:recovery_start: 1308847769&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:recovery_duration: 102&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0028/recovery_status:last_transno: 34359738471&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:recovery_start: 1308847828&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:recovery_duration: 75&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST0029/recovery_status:last_transno: 34359755743&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:recovery_start: 1308847774&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:recovery_duration: 99&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002a/recovery_status:last_transno: 34359740488&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:recovery_start: 1308847837&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:recovery_duration: 75&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002b/recovery_status:last_transno: 34359798280&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:recovery_start: 1308847788&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:recovery_duration: 101&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002c/recovery_status:last_transno: 30064773071&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:recovery_start: 1308847853&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:recovery_duration: 69&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002d/recovery_status:last_transno: 25769806937&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:recovery_start: 1308847799&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:recovery_duration: 90&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002e/recovery_status:last_transno: 30064772112&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:status: COMPLETE&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:recovery_start: 1308847855&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:recovery_duration: 75&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:delayed_clients: 0/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:completed_clients: 887/887&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:replayed_requests: 0&lt;br/&gt;
/proc/fs/lustre/obdfilter/lustre-OST002f/recovery_status:last_transno: 30064807203&lt;/p&gt;</comment>
                            <comment id="16952" author="brian" created="Fri, 24 Jun 2011 12:03:36 +0000"  >&lt;p&gt;Ashley,&lt;/p&gt;

&lt;p&gt;There is an alternative and it would be useful to try it even if just for a test.&lt;/p&gt;

&lt;p&gt;Reading from the health_check proc file causes a write to the OST during its operation, as a check of the health of writing to the disk.&lt;/p&gt;

&lt;p&gt;This part of the health_check can be suppressed by building lustre with the --disable-health_write configure option.&lt;/p&gt;

&lt;p&gt;Of course the result is a slightly less extensive health_check, but if you could do that as a test to see if it alleviates your health_check proc file reading latency, then we will at least know where the problem is.&lt;/p&gt;</comment>
                            <comment id="16957" author="apittman" created="Fri, 24 Jun 2011 13:13:41 +0000"  >&lt;p&gt;That&apos;s good to know.  I&apos;ll prepare a build with this option set so we have it ready.&lt;/p&gt;

&lt;p&gt;At this stage I&apos;m not going to force a reboot on the customer but we&apos;ll discuss this with them Monday.&lt;/p&gt;</comment>
                            <comment id="17086" author="brian" created="Tue, 28 Jun 2011 12:22:36 +0000"  >&lt;p&gt;Is there any update on this issue.  Were you able to perform the prescribed test?&lt;/p&gt;</comment>
                            <comment id="17161" author="apittman" created="Wed, 29 Jun 2011 10:58:12 +0000"  >&lt;p&gt;No update yet.  We have new packages on-site and ready to deploy but need to liase with customer over schedule for doing this.&lt;/p&gt;

&lt;p&gt;At this time we are assuming this to be a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15&quot; title=&quot;strange slow IO messages and bad performance &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15&quot;&gt;&lt;del&gt;LU-15&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The filesystem is still unpleasantly full with between 58% and 100% of blocks used per OST with 84% used on average.&lt;/p&gt;</comment>
                            <comment id="17162" author="apittman" created="Wed, 29 Jun 2011 13:09:57 +0000"  >&lt;p&gt;Update confirmed for tomorrow, other than monitoring load average and time taken to cat the health_check file is there anything you want us to check after boot?&lt;/p&gt;</comment>
                            <comment id="17207" author="apittman" created="Fri, 1 Jul 2011 12:37:24 +0000"  >&lt;p&gt;Upgrade performed and the system is now running under the control of heartbeat again, the --disable-health-write option made all the difference, at no point during startup did reading this file take more than a second.&lt;/p&gt;

&lt;p&gt;The underlying issue remains though and through testing I&apos;ve confirmed it&apos;s the same issue we&apos;ve seen elsewhere, basically for some time after startup any writes to an OST will stall for a considerable period of time.  Using our ost-survey script I got the following results (measured shortly after recovery had finished):&lt;/p&gt;

&lt;p&gt;OST 0: 1073741824 bytes (1.1 GB) copied, 118.534 seconds, 9.1 MB/s&lt;br/&gt;
OST 1: 1073741824 bytes (1.1 GB) copied, 9.31611 seconds, 115 MB/s&lt;br/&gt;
OST 2: 1073741824 bytes (1.1 GB) copied, 9.21754 seconds, 116 MB/s&lt;br/&gt;
OST 3: 1073741824 bytes (1.1 GB) copied, 54.316 seconds, 19.8 MB/s&lt;br/&gt;
OST 4: 1073741824 bytes (1.1 GB) copied, 9.18021 seconds, 117 MB/s&lt;br/&gt;
OST 5: 1073741824 bytes (1.1 GB) copied, 12.2757 seconds, 87.5 MB/s&lt;br/&gt;
OST 6: 1073741824 bytes (1.1 GB) copied, 327.987 seconds, 3.3 MB/s&lt;br/&gt;
OST 7: 1073741824 bytes (1.1 GB) copied, 24.4431 seconds, 43.9 MB/s&lt;br/&gt;
OST 8: 1073741824 bytes (1.1 GB) copied, 10.2977 seconds, 104 MB/s&lt;br/&gt;
OST 9: 1073741824 bytes (1.1 GB) copied, 330.015 seconds, 3.3 MB/s&lt;br/&gt;
OST 10: 1073741824 bytes (1.1 GB) copied, 24.5828 seconds, 43.7 MB/s&lt;br/&gt;
OST 11: 1073741824 bytes (1.1 GB) copied, 339.283 seconds, 3.2 MB/s&lt;br/&gt;
OST 12: 1073741824 bytes (1.1 GB) copied, 10.2614 seconds, 105 MB/s&lt;br/&gt;
OST 13: 1073741824 bytes (1.1 GB) copied, 10.6387 seconds, 101 MB/s&lt;br/&gt;
OST 14: 1073741824 bytes (1.1 GB) copied, 9.18088 seconds, 117 MB/s&lt;br/&gt;
OST 15: 1073741824 bytes (1.1 GB) copied, 33.046 seconds, 32.5 MB/s&lt;br/&gt;
OST 16: 1073741824 bytes (1.1 GB) copied, 9.61687 seconds, 112 MB/s&lt;br/&gt;
OST 17: 1073741824 bytes (1.1 GB) copied, 12.578 seconds, 85.4 MB/s&lt;br/&gt;
OST 18: 1073741824 bytes (1.1 GB) copied, 333.312 seconds, 3.2 MB/s&lt;br/&gt;
OST 19: 1073741824 bytes (1.1 GB) copied, 16.8783 seconds, 63.6 MB/s&lt;br/&gt;
OST 20: 1073741824 bytes (1.1 GB) copied, 10.7011 seconds, 100 MB/s&lt;br/&gt;
OST 21: 1073741824 bytes (1.1 GB) copied, 9.51567 seconds, 113 MB/s&lt;br/&gt;
OST 22: 1073741824 bytes (1.1 GB) copied, 335.516 seconds, 3.2 MB/s&lt;br/&gt;
OST 23: 1073741824 bytes (1.1 GB) copied, 345.44 seconds, 3.1 MB/s&lt;br/&gt;
OST 24: 1073741824 bytes (1.1 GB) copied, 334.839 seconds, 3.2 MB/s&lt;br/&gt;
OST 25: 1073741824 bytes (1.1 GB) copied, 19.8422 seconds, 54.1 MB/s&lt;br/&gt;
OST 26: 1073741824 bytes (1.1 GB) copied, 18.0309 seconds, 59.5 MB/s&lt;br/&gt;
OST 27: 1073741824 bytes (1.1 GB) copied, 19.0357 seconds, 56.4 MB/s&lt;br/&gt;
OST 28: 1073741824 bytes (1.1 GB) copied, 17.4743 seconds, 61.4 MB/s&lt;br/&gt;
OST 29: 1073741824 bytes (1.1 GB) copied, 9.03472 seconds, 119 MB/s&lt;br/&gt;
OST 30: 1073741824 bytes (1.1 GB) copied, 9.09998 seconds, 118 MB/s&lt;br/&gt;
OST 31: 1073741824 bytes (1.1 GB) copied, 27.8013 seconds, 38.6 MB/s&lt;br/&gt;
OST 32: 1073741824 bytes (1.1 GB) copied, 9.90428 seconds, 108 MB/s&lt;br/&gt;
OST 33: 1073741824 bytes (1.1 GB) copied, 26.576 seconds, 40.4 MB/s&lt;br/&gt;
OST 34: 1073741824 bytes (1.1 GB) copied, 9.02994 seconds, 119 MB/s&lt;br/&gt;
OST 35: 1073741824 bytes (1.1 GB) copied, 318.633 seconds, 3.4 MB/s&lt;br/&gt;
OST 36: 1073741824 bytes (1.1 GB) copied, 33.0055 seconds, 32.5 MB/s&lt;br/&gt;
OST 37: 1073741824 bytes (1.1 GB) copied, 10.093 seconds, 106 MB/s&lt;br/&gt;
OST 38: 1073741824 bytes (1.1 GB) copied, 319.42 seconds, 3.4 MB/s&lt;br/&gt;
OST 39: 1073741824 bytes (1.1 GB) copied, 9.10729 seconds, 118 MB/s&lt;br/&gt;
OST 40: 1073741824 bytes (1.1 GB) copied, 20.39 seconds, 52.7 MB/s&lt;br/&gt;
OST 41: 1073741824 bytes (1.1 GB) copied, 236.857 seconds, 4.5 MB/s&lt;br/&gt;
OST 42: 1073741824 bytes (1.1 GB) copied, 8.89273 seconds, 121 MB/s&lt;br/&gt;
OST 43: 1073741824 bytes (1.1 GB) copied, 37.6976 seconds, 28.5 MB/s&lt;br/&gt;
OST 44: 1073741824 bytes (1.1 GB) copied, 131.119 seconds, 8.2 MB/s&lt;br/&gt;
OST 45: 1073741824 bytes (1.1 GB) copied, 9.81052 seconds, 109 MB/s&lt;br/&gt;
OST 46: 1073741824 bytes (1.1 GB) copied, 320.598 seconds, 3.3 MB/s&lt;br/&gt;
OST 47: 1073741824 bytes (1.1 GB) copied, 313.122 seconds, 3.4 MB/s&lt;/p&gt;

&lt;p&gt;This is an older system, so 100MB/s is about right; what we see, though, is that some OSTs take considerably longer.  From monitoring the file size as the above test is run, the file will increase in size up to a point and then remain that size for a considerable time; whilst the OST is accepting writes the performance is as expected, but it freezes for long periods.  Rerunning the same test now gives good results for all OSTs.  If the health_check was performing a write to every OST on a node, this explains why it was taking so long to read the file.&lt;/p&gt;</comment>
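<!--
One way to reproduce the per-OST timings quoted above from a client, by pinning a 1 GB write to
each OST index in turn (a hedged sketch: the site's actual ost-survey script is not attached,
the test directory path is illustrative, and option spellings can vary by lfs version):

    #!/bin/bash
    mkdir -p /lustre/lustre/client/ost-test
    for i in $(seq 0 47); do
        f=/lustre/lustre/client/ost-test/file.$i
        lfs setstripe -c 1 -i $i "$f"            # create the file on OST index $i only
        echo -n "OST $i: "
        dd if=/dev/zero of="$f" bs=1M count=1024 2>&1 | tail -1   # dd's summary line gives time and MB/s
    done
-->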
                            <comment id="17227" author="bobijam" created="Mon, 4 Jul 2011 12:04:07 +0000"  >&lt;p&gt;--disable-health-write configuration helps heartbeat health check, dup to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15&quot; title=&quot;strange slow IO messages and bad performance &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15&quot;&gt;&lt;del&gt;LU-15&lt;/del&gt;&lt;/a&gt; for slow IO issue on almost full OST.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="10284" name="oss4-messages-1.8.6" size="155584" author="apittman" created="Fri, 24 Jun 2011 06:27:48 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvi73:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6597</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>