<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:54:26 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5778] MDS not creating files on OSTs properly</title>
                <link>https://jira.whamcloud.com/browse/LU-5778</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;One of our Stampede filesystems running Lustre 2.5.2 has an OST offline due to a different problem described in another ticket.   Since the OST has been offline, the MDS server crashed with an LBUG and was restarted last Friday.  After the restart, the MDS server no longer automatically creates files on any OSTs after the offline OST.  In our case, OST0010 is offline, so now the MDS will only create files on the first 16 OSTs unless we manually specify the stripeoffset in lfs setstripe.   This is overloading the servers with these OSTs while the others are doing nothing.   If we deactivate the first 16 OSTs on the MDS, then all files are created with the first stripe on the lowest numbered active OST.  &lt;/p&gt;

&lt;p&gt;Can you suggest any way to force the MDS to use all the other OSTs through any lctl set_param options?  Getting the offline OST back online is not currently an option due to corruption and an ongoing e2fsck; it can&apos;t be mounted.  Manually setting the stripe is also not an option; we need it to work automatically like it should.  Could we set some qos options to try and have it balance the OST file creation?&lt;/p&gt;</description>
                <environment>CentOS 6.5, kernel 2.6.32-431.17.1.el6_lustre.x86_64</environment>
        <key id="27128">LU-5778</key>
            <summary>MDS not creating files on OSTs properly</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="minyard">Tommy Minyard</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Tue, 21 Oct 2014 14:41:40 +0000</created>
                <updated>Mon, 25 Dec 2017 07:52:47 +0000</updated>
                            <resolved>Thu, 4 Dec 2014 22:34:59 +0000</resolved>
                                    <version>Lustre 2.5.2</version>
                                    <fixVersion>Lustre 2.7.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>15</watches>
                                                                            <comments>
                            <comment id="96864" author="adilger" created="Tue, 21 Oct 2014 17:40:01 +0000"  >&lt;p&gt;Does the MDS think all the OSTs are online?  What do &lt;tt&gt;lctl get_param lod.&amp;#42;.target_obd&lt;/tt&gt; and &lt;tt&gt;lctl get_param osp.&amp;#42;.state&lt;/tt&gt; return on the MDS?  Do the OSCs have enough precreated objects &lt;tt&gt;lctl get_param osp.&amp;#42;.prealloc&amp;#42;&lt;/tt&gt; on the MDS?&lt;/p&gt;

&lt;p&gt;If the OSTs are space imbalanced and going into QOS mode, &quot;lfs df&quot; would report significantly different space usage, but this may be a symptom and not a cause.  QOS mode shouldn&apos;t really prevent the other OSTs beyond the offline one from being used, and in fact should start avoiding the full OSTs.  You might try setting on the MDS &lt;tt&gt;lctl set_param lod.&amp;#42;.qos_threshold_rr=95&lt;/tt&gt; to set the threshold between &quot;least full&quot; and &quot;most full&quot; OSTs to 95% of the filesystem size and essentially disable the QOS mode.&lt;/p&gt;

&lt;p&gt;Have you restarted the MDS since the initial failure?&lt;/p&gt;</comment>
                            <comment id="96897" author="minyard" created="Tue, 21 Oct 2014 19:25:21 +0000"  >&lt;p&gt;Andreas,&lt;/p&gt;

&lt;p&gt;The MDS does think all the OSTs are online and active, and the state of all the OSTs is FULL, except for the OST that is offline, OST0010 (the offline OST described in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5780&quot; title=&quot;Corrupted OST and very long running e2fsck&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5780&quot;&gt;&lt;del&gt;LU-5780&lt;/del&gt;&lt;/a&gt;).   Attached is the output from the two commands.   I tried the prealloc command but that option is not listed; prealloc_status and prealloc_reserved are reporting 0 for all active OSTs.&lt;/p&gt;

&lt;p&gt;We are already starting to see a substantial difference in usage: the first 16 OSTs are approaching 90% while the remaining OSTs are at 75% of space used.&lt;/p&gt;

&lt;p&gt;The MDS was restarted a few minutes ago and it still refuses to create any files on any OST after OST0010 when the stripeoffset is -1.   I&apos;m pretty sure the problem here is that the MDS won&apos;t go above the offline OST index when creating new files, due to the OST being offline and inactive and never checking in with the MDS.  My suspicion is that the obdindex numbers it chooses to use when stripeoffset=-1 get incremented as OSTs check in, but since OST0010 never reconnects to the MDS, it won&apos;t go above that one.&lt;/p&gt;</comment>
                            <comment id="96899" author="minyard" created="Tue, 21 Oct 2014 19:27:11 +0000"  >&lt;p&gt;These files contain the output from the commands Andreas asked us to run&lt;/p&gt;</comment>
                            <comment id="96935" author="minyard" created="Tue, 21 Oct 2014 21:32:17 +0000"  >&lt;p&gt;So we tried adjusting the qos_threshold_rr to 95 and 100%, but it had no effect on the distribution of files on OSTs; they always get created on the first 16 OSTs.   We also tried reducing the qos_threshold_rr to 5% and still saw no change in the OST file creation distribution.   There is something in the MDS file creation algorithm that is preventing it from using any OSTs greater than the offline OST unless the stripeoffset is set manually.&lt;/p&gt;

&lt;p&gt;THIS IS NOW URGENT and BLOCKING, we cannot put the system back into production in this state and we need some way to ensure that files get distributed across all the OSTs in the filesystem.   The first 16 are above 90% full and still growing while most of the rest are below 72% and not changing.&lt;/p&gt;</comment>
                            <comment id="96990" author="niu" created="Wed, 22 Oct 2014 08:15:15 +0000"  >&lt;p&gt;Could you try to capture a debug log on the MDT with D_TRACE enabled when creating files? I think that might be useful for me to see why the MDT only picks the first 16 OSTs. (You&apos;d better add a marker before and after the test with &quot;lctl mark&quot;.)&lt;/p&gt;

&lt;p&gt;BTW: what commands did you use to offline the OST? I tried to reproduce it locally, but failed. (I deactivated an OST and checked whether the MDT would create objects on all other active OSTs.)&lt;/p&gt;</comment>
                            <comment id="97004" author="green" created="Wed, 22 Oct 2014 13:30:22 +0000"  >&lt;p&gt;hm, stripe offset at -1 means that qos should not really be in play here - as it will try to allocate everywhere still due to requested stripe count.&lt;/p&gt;

&lt;p&gt;so if you disable, say, the OST with index 3, does your setstripe -1 then create files only on the first three OSTs? (this is incorrect behavior that we cannot observe ourselves, btw).&lt;/p&gt;

&lt;p&gt;Do you have any patches applied on top of lustre 2.5.2?&lt;/p&gt;</comment>
                            <comment id="97015" author="minyard" created="Wed, 22 Oct 2014 14:57:55 +0000"  >&lt;p&gt;Niu,&lt;br/&gt;
The sequence of events to reproduce the problem is as follows: deactivate the OST in the MDS, then unmount the deactivated OST on the OSS (ours can&apos;t be mounted due to corruption), then restart the MDS.  We are currently in that state; the OST is unmounted and offline, so it can&apos;t check in with the MDS when the MDS is restarted.   It was only after the MDS restart that we started to see file creates only on the first 16 OSTs.&lt;/p&gt;</comment>
                            <comment id="97016" author="minyard" created="Wed, 22 Oct 2014 15:06:30 +0000"  >&lt;p&gt;Oleg,&lt;br/&gt;
We have one cherry-picked patch applied to resolve crashes we were experiencing shortly after our upgrade to 2.5.2.   We used patch 0020cc44.diff from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5040&quot; title=&quot;kernel BUG at fs/jbd2/transaction.c:1033&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5040&quot;&gt;&lt;del&gt;LU-5040&lt;/del&gt;&lt;/a&gt; to resolve those crashes.  &lt;/p&gt;

&lt;p&gt;In the current situation, the MDS will create files on any of the first 16 OSTs as expected with stripe_offset of -1 when the OSTs are active in the MDS.   If we deactivate those OSTs in the MDS, then the first active OST index is used for &lt;em&gt;all&lt;/em&gt; file creates, if we then deactivate that one, it then moves to the next one in the index.&lt;/p&gt;</comment>
                            <comment id="97052" author="adilger" created="Wed, 22 Oct 2014 20:05:30 +0000"  >&lt;p&gt;I don&apos;t want to restate the obvious, but just to be sure that we don&apos;t have a simple workaround here, have you actually deactivated the failed OST on the MDS?  Something like the following:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mds# lctl --device  %scratch-OST0010-osc-MDT0000 deactivate
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This should produce a message in the MDS console logs like:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: setting import scratch-OST0010_UUID INACTIVE by administrator request
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;ve done local testing with master and am unable to reproduce this (&lt;tt&gt;lfs setstripe -c -1&lt;/tt&gt; creates objects on all available stripes when one in the middle is deactivated).  I&apos;m going to build and run with 2.5.2 + patch to see if that shows a similar problem (maybe it has already been fixed in later releases).&lt;/p&gt;</comment>
                            <comment id="97053" author="minyard" created="Wed, 22 Oct 2014 20:25:58 +0000"  >&lt;p&gt;Andreas,&lt;br/&gt;
The device is definitely deactivated on the MDS, since the OST is offline and the MDS has been restarted, it could never activate anyway but I have deactivated it again for good measure.  The MDS is still choosing to create files on only the first 16 OSTs.&lt;/p&gt;

&lt;p&gt;Also, it is not the stripe_count that is the problem, I have been saying it is the stripe_offset set to -1, where it should choose to create a file on the OSTs in a semi-random fashion.  Can you try your test again with -c set to 2 (that is our default stripe), -s set to 1MB and -i set to -1?&lt;/p&gt;</comment>
                            <comment id="97059" author="green" created="Wed, 22 Oct 2014 21:35:30 +0000"  >&lt;p&gt;It should be noted that the deactivated state as indicated by Andreas is different from disconnected - the state the system would enter if it could not connect on its own to an OST (and it&apos;ll retry all the time too).&lt;/p&gt;

&lt;p&gt;It&apos;s interesting what Andreas&apos; testing of 2.5.2 will show, I guess.&lt;/p&gt;</comment>
                            <comment id="97060" author="adilger" created="Wed, 22 Oct 2014 21:48:20 +0000"  >&lt;p&gt;I&apos;m unable to reproduce the problem with 2.5.2 using &quot;-c 2 -i -1&quot;.  It does imbalance the object allocations somewhat - with OST0002 disabled on the MDS, OST0000 and OST0003 seem to get chosen as the starting OST index much less often than others (of 1000 files, 2000 objects), but it still chooses OSTs beyond the deactivated OST, and the total number of objects allocated on each OST isn&apos;t as imbalanced:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;OST_idx     #start   #total
   0           36       251
   1          283       319
   3           39       322
   4          175       214
   5          147       322
   6          105       252
   7          215       320
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Tommy, can you please enable full debugging via &lt;tt&gt;lctl set_param debug=-1 debug_mb=512&lt;/tt&gt; on the MDS, and then create maybe 50 files (enough that some of them should be beyond OST0010), then dump the debug log with &lt;tt&gt;lctl dk /tmp/LU-5779.debug; bzip2 -9 /tmp/LU-5779.debug&lt;/tt&gt; and attach that log to the ticket here.  Getting the &lt;tt&gt;lctl get_param osp.&amp;#42;.prealloc&amp;#42;&lt;/tt&gt; info would also be useful (sorry, the Jira markup turned my &amp;#42; into &lt;b&gt;bold&lt;/b&gt; in my first comment).&lt;/p&gt;</comment>
                            <comment id="97074" author="minyard" created="Thu, 23 Oct 2014 00:09:55 +0000"  >&lt;p&gt;Andreas,&lt;br/&gt;
I also tried to reproduce the problem on some test hardware by creating a filesystem with the exact same 2.5.2 version of Lustre installed on our /scratch filesystem, and I was unable to reproduce it as well.   There must be something else going on with our /scratch filesystem, either due to its large scale with 348 OSTs or the upgrade from the 2.1.5 version we were running, so I&apos;m going to compare the setup of the filesystems and see if I can find any differences.&lt;/p&gt;

&lt;p&gt;In regard to the debug output, we could not wait to put the system back into production, so we developed a manual process to distribute files by setting the stripe offset to a random OST for active user directories.  We are cycling the first two active OSTs so that files created in directories where the stripe_offset is still set to -1 get distributed as well.  Not efficient or as good for performance, but at least it lets us run jobs for users and distribute files across all the OSTs.    I can certainly generate the debug output, but I&apos;m afraid it would be polluted with activity from all the users.   That, and we had to deactivate the first 16 OSTs since they reached &amp;gt; 93% capacity.  We have a maintenance window scheduled for next Tuesday and can collect data on a quiet system then.  I&apos;ve included the output from the prealloc information in case it might be useful.   I noticed two OSTs had -5 as the prealloc_status; those OSTs are in the list of inactive OSTs, which is in the attached file as well.   Note that in looking through the prealloc output, I found these three sets of messages corresponding to those OSTs:&lt;/p&gt;

&lt;p&gt;Oct 22 00:43:29 mds5 kernel: Lustre: setting import scratch-OST001d_UUID INACTIVE by administrator request&lt;br/&gt;
Oct 22 00:43:29 mds5 kernel: Lustre: Skipped 8 previous similar messages&lt;br/&gt;
Oct 22 00:43:29 mds5 kernel: LustreError: 22062:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST001d-osc-MDT0000: can&apos;t precreate: rc = -5&lt;br/&gt;
Oct 22 00:43:29 mds5 kernel: LustreError: 22062:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST001d-osc-MDT0000: cannot precreate objects: rc = -5&lt;/p&gt;

&lt;p&gt;Oct 22 01:04:06 mds5 kernel: Lustre: setting import scratch-OST0021_UUID INACTIVE by administrator request&lt;br/&gt;
Oct 22 01:04:06 mds5 kernel: LustreError: 22070:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST0021-osc-MDT0000: can&apos;t precreate: rc = -5&lt;br/&gt;
Oct 22 01:04:06 mds5 kernel: LustreError: 22070:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST0021-osc-MDT0000: cannot precreate objects: rc = -5&lt;/p&gt;

&lt;p&gt;Oct 22 15:07:21 mds5 kernel: Lustre: setting import scratch-OST0024_UUID INACTIVE by administrator request&lt;br/&gt;
Oct 22 15:07:21 mds5 kernel: Lustre: Skipped 5 previous similar messages&lt;br/&gt;
Oct 22 15:07:21 mds5 kernel: LustreError: 22084:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST0026-osc-MDT0000: can&apos;t precreate: rc = -5&lt;br/&gt;
Oct 22 15:07:21 mds5 kernel: LustreError: 22084:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST0026-osc-MDT0000: cannot precreate objects: rc = -5&lt;/p&gt;</comment>
                            <comment id="97348" author="green" created="Thu, 23 Oct 2014 23:17:53 +0000"  >&lt;p&gt;Just a quick note.&lt;br/&gt;
Even if there&apos;s a lot of noise from other users, you can grab some traces, and since users are likely to allocate files in that time,&lt;br/&gt;
we can still gather some useful info.&lt;br/&gt;
Sadly, it looks like the QOS_DEBUG stuff is compiled out by default, but perhaps you can still limit the scope of traces to just lod with&lt;br/&gt;
echo &quot;lod osp&quot; &amp;gt;/proc/sys/lnet/debug_subsystem&lt;br/&gt;
echo -1 &amp;gt;/proc/sys/lnet/debug&lt;br/&gt;
lctl dk &amp;gt;/dev/null #to clear the log&lt;/p&gt;

&lt;p&gt;on your MDS (please cat those files first and remember the values and restore the content after the gathering of info).&lt;/p&gt;

&lt;p&gt;Let it run for some brief time in normal offset -1 mode, and definitely do a couple of creations manually too that expose the problem, then gather the log with lctl dk &amp;gt;/tmp/lustre.log. (just run it long enough for your creates to run their course)&lt;/p&gt;

&lt;p&gt;Also, what&apos;s your default striping like wrt default stripe count? How about OST pools, do you use those?&lt;/p&gt;</comment>
                            <comment id="97395" author="minyard" created="Fri, 24 Oct 2014 14:47:04 +0000"  >&lt;p&gt;Thanks Oleg, we&apos;re going to try and quiet down the system a bit (the system is a little drained waiting to schedule a 32K core job) and collect the MDS trace with all the OSTs active again to see if that can provide some more debug information.&lt;/p&gt;

&lt;p&gt;We compared our test filesystem with the current /scratch filesystem and did find one significant difference: the test filesystem has an lwp service running on it that /scratch does not (there are additional lwp entries in lctl dl on the MDS and OSSs).   The test filesystem went through the same 2.1.5 -&amp;gt; 2.5.2 upgrade process; however, we also ran tunefs.lustre --writeconf on the test filesystem, and we &lt;em&gt;did not&lt;/em&gt; run tunefs.lustre --writeconf on /scratch in case we encountered a major issue and needed to roll back to the previous version.   That appears to be the only difference in the upgrade process between our test filesystem and /scratch.   I didn&apos;t find much information about what lwp does from a quick search and some googling.   Not sure if having this lwp service running would impact file layout and creation on the OSTs.&lt;/p&gt;

&lt;p&gt;In answer to your question, we use a default stripe count of 2, a size of 1MB and offset -1.   We have not enabled ost pools.&lt;/p&gt;</comment>
                            <comment id="97404" author="green" created="Fri, 24 Oct 2014 15:18:30 +0000"  >&lt;p&gt;Hm, thanks for this extra bit of info.&lt;br/&gt;
lwp really should only be used for quota and some fld stuff that should not really impact allocations, certainly not on specific OSTs. The lwp code does live in the OSP codebase, so it should be caught with the osp mask, except I just checked and it has somehow registered itself on the ost mask instead, weird.&lt;br/&gt;
You might wish to revise that echo debug subsystem line to echo &quot;osp ost lod&quot; &amp;gt; /..../debug_subsystem&lt;/p&gt;

&lt;p&gt;Old servers did not really have an LWP config record, but I think we tried to connect anyway (there was even a compat issue about that in the past that we have since fixed, but we&apos;ll need to go back and check how this was implemented).&lt;/p&gt;

&lt;p&gt;I think catching that bit of debug should be useful just in case.&lt;/p&gt;</comment>
                            <comment id="97411" author="minyard" created="Fri, 24 Oct 2014 16:06:05 +0000"  >&lt;p&gt;Attached is the debug log with the data filtered for lov and ost subsystems.  There is no osp subsystem on our current /scratch mds.   I cleared the logs prior to running the test, set a mark and then 500 files were created on the filesystem, hopefully you can find it in the attached log.&lt;/p&gt;</comment>
                            <comment id="97513" author="green" created="Sun, 26 Oct 2014 03:14:37 +0000"  >&lt;p&gt;Looking at the log:&lt;br/&gt;
It contains 157 create requests (I would have thought it had reached the internal log sizing limit, but with the default at 1M per CPU and your total log size at 1.1M, that&apos;s certainly not the case).&lt;/p&gt;

&lt;p&gt;Other observations: the log says it has 348 OSTs, of which 248 are not really connected (check status returns -107).&lt;br/&gt;
The first 47 OSTs in the list are connected (with only one in the middle of these 47 being down, namely OST 16).&lt;br/&gt;
OSTs 47 to 292 are down, then OSTs 293, 341, and 342 are also down.&lt;/p&gt;

&lt;p&gt;Of the 157 creates, only 16 went into the alloc_qos path and had any chance of random allocation. The other 141 took the alloc_specific path, which is chosen when the starting offset is below the number of OSTs in the system (in other words, it was not set to -1).&lt;br/&gt;
Did you have other processes in the system creating files with forced striping while this was run?&lt;br/&gt;
Sadly, our level of debug does not allow us to see which OSTs were chosen in the end in the alloc_qos case, but if you have retained those 500 files, it might be interesting to see the lfs getstripe output on them.&lt;/p&gt;</comment>
                            <comment id="97520" author="niu" created="Mon, 27 Oct 2014 02:25:20 +0000"  >&lt;p&gt;I&apos;m wondering why the QOS_DEBUG() messages in lod_alloc_qos() weren&apos;t printed?&lt;/p&gt;</comment>
                            <comment id="97525" author="green" created="Mon, 27 Oct 2014 04:28:31 +0000"  >&lt;p&gt;QOS_DEBUG is compiled out by default which is really unfortunate:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
#if 0
#define QOS_DEBUG(fmt, ...)     CDEBUG(D_OTHER, fmt, ## __VA_ARGS__)
#define QOS_CONSOLE(fmt, ...)   LCONSOLE(D_OTHER, fmt, ## __VA_ARGS__)
#else
#define QOS_DEBUG(fmt, ...)
#define QOS_CONSOLE(fmt, ...)
#endif
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Whoever thought that was a great idea was wrong.&lt;/p&gt;

&lt;p&gt;Niu, can you please open a separate ticket for this and perhaps add a patch?&lt;br/&gt;
Also, D_OTHER might not be a great mask for it either, so please see if something else would be better here.&lt;/p&gt;</comment>
                            <comment id="97579" author="minyard" created="Mon, 27 Oct 2014 15:44:25 +0000"  >&lt;p&gt;Oleg,&lt;br/&gt;
In regard to the file creation, we did have some directories set with a manual stripe offset and not -1, so stripe creation was being forced onto specific OSTs.   I ran lfs getstripe on the 500 created files and they all landed on an obdidx less than 16.  I&apos;ve attached the output from that file create.  As you noted, a large number of OSTs were inactive due to us trying to distribute files across the other OSTs, but in this case, all files still had their stripes on the first 16 OSTs.&lt;/p&gt;

&lt;p&gt;This ticket is no longer critical for us since we were able to recreate OST0010 over the weekend; once it was mounted and had checked in with the MDS, file creation started across all the active OSTs once again like it should.  This specific issue must have been due to the OST being offline and unable to check in with the MDS when the MDS had to be restarted after an LBUG crash.   We certainly do not get into this state often, as usually all OSTs are available and not in a state where they can&apos;t be mounted.&lt;/p&gt;</comment>
                            <comment id="98521" author="adegremont" created="Thu, 6 Nov 2014 15:46:23 +0000"  >
&lt;p&gt;We faced the same issue at CEA. We analyzed the problem and tracked it down to &lt;tt&gt;lod_qos_statfs_update()&lt;/tt&gt;. Indeed, in this function each OST is checked successively. If an OST has active=0, &lt;tt&gt;lod_statfs_and_check()&lt;/tt&gt; will return ENOTCONN and the for-loop will break. The following OSTs won&apos;t be checked. Their statfs data will not be updated (it stays zero) and they will not be used when allocating file objects.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; (i = 0; i &amp;lt; osts-&amp;gt;op_count; i++) {
                idx = osts-&amp;gt;op_array[i];
                avail = OST_TGT(lod,idx)-&amp;gt;ltd_statfs.os_bavail;
                rc = lod_statfs_and_check(env, lod, idx,
                                          &amp;amp;OST_TGT(lod,idx)-&amp;gt;ltd_statfs);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc)
                        &lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (OST_TGT(lod,idx)-&amp;gt;ltd_statfs.os_bavail != avail)
                        &lt;span class=&quot;code-comment&quot;&gt;/* recalculate weigths */&lt;/span&gt;
                        lod-&amp;gt;lod_qos.lq_dirty = 1;
        }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A simple workaround for this bug is to disable QOS allocation by setting &lt;tt&gt;qos_threshold_rr&lt;/tt&gt; to 100, falling back to a simple round-robin.&lt;/p&gt;

&lt;p&gt;A fix seems to be simply replacing the &lt;tt&gt;break&lt;/tt&gt; with &lt;tt&gt;continue&lt;/tt&gt;.&lt;/p&gt;</comment>
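Editor's note: the break-versus-continue effect described in the comment above can be sketched with a toy model. This is plain Python for illustration only, not Lustre code; the function name, list shape, and the -5 return value are stand-ins for the real lod_qos_statfs_update() logic.

```python
# Toy model of the statfs refresh loop in lod_qos_statfs_update():
# OSTs are scanned in index order, and a non-zero rc from a dead
# import aborts the scan with `break`, so every OST after the dead
# one keeps stale (zero) availability and is never chosen for new
# objects -- exactly the "only the first 16 OSTs" symptom here.

def refresh(osts, stop_on_error):
    """osts is a list of (index, alive) pairs; returns the indices
    whose statfs data was actually refreshed."""
    refreshed = []
    for idx, alive in osts:
        rc = 0 if alive else -5  # stand-in for ENOTCONN from a dead OST
        if rc:
            if stop_on_error:
                break        # buggy behavior: abort the whole scan
            continue         # the fix: skip only the dead OST
        refreshed.append(idx)
    return refreshed

# OST index 16 (OST0010) is offline, as in this ticket
osts = [(i, i != 16) for i in range(32)]

print(refresh(osts, stop_on_error=True))   # only indices before 16
print(refresh(osts, stop_on_error=False))  # all indices except 16
```

With `break`, only indices 0..15 are refreshed; with `continue`, every OST except the dead one is.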
                            <comment id="98537" author="minyard" created="Thu, 6 Nov 2014 17:31:21 +0000"  >&lt;p&gt;Thanks for the workaround, Aurelien; we&apos;ll keep this in mind if we encounter this situation again.   Do you know if this problem is still in the main branch of the source tree, or is it just in the 2.5.x versions like we are running?   If so, it should be a simple fix, as you noted.&lt;/p&gt;

&lt;p&gt;Niu, can you take a look and confirm?&lt;/p&gt;</comment>
                            <comment id="98541" author="bzzz" created="Thu, 6 Nov 2014 17:38:45 +0000"  >&lt;p&gt;iirc, originally osp_statfs() wasn&apos;t returning an error; instead it claimed an &quot;empty&quot; OST if the connection was down.&lt;/p&gt;</comment>
                            <comment id="98542" author="adegremont" created="Thu, 6 Nov 2014 17:40:48 +0000"  >&lt;p&gt;As far as I can see in current master branch, the problem is still there.&lt;/p&gt;</comment>
                            <comment id="98626" author="niu" created="Fri, 7 Nov 2014 01:35:33 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Niu, can you take a look and confirm?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I agree that the &apos;break&apos; here is inappropriate, but I don&apos;t see why it can lead to the result of always allocating objects on the OSTs before the deactivated one.&lt;/p&gt;

&lt;p&gt;In lod_alloc_qos(), statfs will be performed on all OSPs again:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        &lt;span class=&quot;code-comment&quot;&gt;/* Find all the OSTs that are valid stripe candidates */&lt;/span&gt;
        &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; (i = 0; i &amp;lt; osts-&amp;gt;op_count; i++) {
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!cfs_bitmap_check(m-&amp;gt;lod_ost_bitmap, osts-&amp;gt;op_array[i]))
                        &lt;span class=&quot;code-keyword&quot;&gt;continue&lt;/span&gt;;

                rc = lod_statfs_and_check(env, m, osts-&amp;gt;op_array[i], sfs);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc) {
                        &lt;span class=&quot;code-comment&quot;&gt;/* &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; OSP doesn&apos;t feel well */&lt;/span&gt;
                        &lt;span class=&quot;code-keyword&quot;&gt;continue&lt;/span&gt;;
                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="98642" author="apercher" created="Fri, 7 Nov 2014 08:53:59 +0000"  >&lt;p&gt;Because in this case lod_statfs_and_check() puts the result in sfs, which is &amp;amp;lod_env_info(env)-&amp;gt;lti_osfs,&lt;br/&gt;
and not in &amp;amp;OST_TGT(lod,idx)-&amp;gt;ltd_statfs.&lt;br/&gt;
The question could be: why do we have two structures holding the same OST counters?&lt;/p&gt;</comment>
                            <comment id="98648" author="niu" created="Fri, 7 Nov 2014 10:23:24 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Because in this case lod_statfs_and_check() puts the result in sfs, which is &amp;amp;lod_env_info(env)-&amp;gt;lti_osfs,&lt;br/&gt;
and not in &amp;amp;OST_TGT(lod,idx)-&amp;gt;ltd_statfs.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Indeed, I think you are right.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The question could be: why do we have two structures holding the same OST counters?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I think it&apos;s because we don&apos;t want to take lq_rw_sem in lod_alloc_qos(); that would introduce lots of contention.&lt;/p&gt;

&lt;p&gt;Hi Aurelien,&lt;br/&gt;
Would you mind posting a patch to fix this? (Replace the &apos;break&apos; with &apos;continue&apos; in lod_qos_statfs_update().) Thanks.&lt;/p&gt;</comment>
                            <comment id="98649" author="adegremont" created="Fri, 7 Nov 2014 10:44:46 +0000"  >&lt;p&gt;I will try to do that&lt;/p&gt;</comment>
                            <comment id="98670" author="adegremont" created="Fri, 7 Nov 2014 16:55:51 +0000"  >&lt;p&gt;I&apos;ve pushed: &lt;a href=&quot;http://review.whamcloud.com/12617&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12617&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="98959" author="adegremont" created="Wed, 12 Nov 2014 13:26:22 +0000"  >&lt;p&gt;Patch for b2_5: &lt;a href=&quot;http://review.whamcloud.com/12685&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12685&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="99225" author="adegremont" created="Fri, 14 Nov 2014 21:24:06 +0000"  >&lt;p&gt;I&apos;ve finally found time for a reproducer.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;cd /usr/lib64/lustre/tests
OSTCOUNT=5 MGSDEV=/tmp/lustre-mgs ./llmount.sh

# Fill the first OST to unbalance disk space and activate qos algorithm
lfs setstripe -c1 -i0 /mnt/lustre/fill.0
dd if=/dev/zero of=/mnt/lustre/fill.0 bs=1M count=100

sync
lfs df -h /mnt/lustre

# Stop OST #2 (mounted at /mnt/ost3) so the MDS cannot get stats from it
umount /mnt/ost3
# Deactivate this OST
lctl conf_param lustre-OST0002.osc.active=0

# Re-start MDT to clear cached data
umount /mnt/mds1
mount -t lustre /tmp/lustre-mdt1 /mnt/mds1 -o loop

# Now create a lot of files and check how they are striped
for i in {1..50}; do lfs setstripe -c1 /mnt/lustre/file.$i; done
lfs getstripe /mnt/lustre/file.* | awk &apos;/0x/ {print $1}&apos; | sort | uniq -c
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Files are striped only on OSTs #0 and #1, with nothing on #3 and #4, even though #0 is almost full and #1, #3 and #4 are empty. (#2 is deactivated and stopped)&lt;/p&gt;

&lt;p&gt;Maybe I will find time to create a test case after SC14.&lt;/p&gt;</comment>
                            <comment id="100745" author="gerrit" created="Thu, 4 Dec 2014 20:26:56 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/12685/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12685/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5778&quot; title=&quot;MDS not creating files on OSTs properly&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5778&quot;&gt;&lt;del&gt;LU-5778&lt;/del&gt;&lt;/a&gt; lod: Fix lod_qos_statfs_update()&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_5&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: cebb8fd03635f2f4e8f17c3a902eeba8008b07c4&lt;/p&gt;</comment>
                            <comment id="100767" author="pjones" created="Thu, 4 Dec 2014 22:34:59 +0000"  >&lt;p&gt;Fix landed for 2.5.4 and 2.7&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="49940">LU-10414</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="27131">LU-5780</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="27317">LU-5807</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="16215" name="LU-5778.debug_filtered.bz2" size="30470" author="minyard" created="Fri, 24 Oct 2014 16:06:05 +0000"/>
                            <attachment id="16230" name="LU-5778_file_create_getstripe.out.gz" size="12053" author="minyard" created="Mon, 27 Oct 2014 15:44:25 +0000"/>
                            <attachment id="16035" name="lctl_state.out" size="44949" author="minyard" created="Tue, 21 Oct 2014 19:27:11 +0000"/>
                            <attachment id="16036" name="lctl_target_obd.out" size="11415" author="minyard" created="Tue, 21 Oct 2014 19:27:11 +0000"/>
                            <attachment id="16068" name="mds5_prealloc.out" size="130636" author="minyard" created="Thu, 23 Oct 2014 00:09:55 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwz3b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16216</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>