<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:49:26 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5204] 2.6 DNE stress testing: EINVAL when attempting to delete file</title>
                <link>https://jira.whamcloud.com/browse/LU-5204</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After our stress testing this weekend, we are unable to delete some (perhaps any?) of the files on a particular OST (OST 38).  All of them give EINVAL.&lt;/p&gt;

&lt;p&gt;For example:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@galaxy-esf-mds008 tmp&amp;#93;&lt;/span&gt;# rm -f posix_shm_open &lt;br/&gt;
rm: cannot remove `posix_shm_open&apos;: Invalid argument&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@galaxy-esf-mds008 tmp&amp;#93;&lt;/span&gt;# lfs getstripe posix_shm_open &lt;br/&gt;
posix_shm_open&lt;br/&gt;
lmm_stripe_count:   1&lt;br/&gt;
lmm_stripe_size:    1048576&lt;br/&gt;
lmm_pattern:        1&lt;br/&gt;
lmm_layout_gen:     0&lt;br/&gt;
lmm_stripe_offset:  38&lt;br/&gt;
	obdidx		 objid		 objid		 group&lt;br/&gt;
	    38	        907263	      0xdd7ff	             0&lt;/p&gt;

&lt;p&gt;However, OST 38 (OST0027) is showing up in lctl dl, and as far as I know, there are no issues with it.  (The dk logs on the OSS don&apos;t show any issues.)&lt;/p&gt;

&lt;p&gt;Here&apos;s the relevant part of the log from MDT0000:&lt;br/&gt;
00000004:00020000:2.0:1402947131.685511:0:25039:0:(lod_lov.c:695:validate_lod_and_idx()) esfprod-MDT0000-mdtlov: bad idx: 38 of 64&lt;br/&gt;
00000004:00000001:2.0:1402947131.685513:0:25039:0:(lod_lov.c:757:lod_initialize_objects()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)&lt;br/&gt;
00000004:00000010:2.0:1402947131.685515:0:25039:0:(lod_lov.c:782:lod_initialize_objects()) kfreed &apos;stripe&apos;: 8 at ffff8807fc208a00.&lt;br/&gt;
00000004:00000001:2.0:1402947131.685516:0:25039:0:(lod_lov.c:788:lod_initialize_objects()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)&lt;br/&gt;
00000004:00000001:2.0:1402947131.685519:0:25039:0:(lod_lov.c:839:lod_parse_striping()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)&lt;br/&gt;
00000004:00000001:2.0:1402947131.685520:0:25039:0:(lod_lov.c:885:lod_load_striping_locked()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)&lt;br/&gt;
00000004:00000001:2.0:1402947131.685522:0:25039:0:(lod_object.c:2754:lod_declare_object_destroy()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)&lt;br/&gt;
00000004:00000001:2.0:1402947131.685524:0:25039:0:(mdd_dir.c:1586:mdd_unlink()) Process leaving via stop (rc=18446744073709551594 : -22 : 0xffffffffffffffea)&lt;/p&gt;


&lt;p&gt;I don&apos;t know for certain if this is related to DNE2 or not, but this is not an error I&apos;ve seen before.  The file system and objects are still around, so I can provide further data if needed.&lt;/p&gt;

&lt;p&gt;Any thoughts?&lt;/p&gt;</description>
                <environment></environment>
        <key id="25170">LU-5204</key>
            <summary>2.6 DNE stress testing: EINVAL when attempting to delete file</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="emoly.liu">Emoly Liu</assignee>
                                    <reporter username="paf">Patrick Farrell</reporter>
                        <labels>
                            <label>dne2</label>
                    </labels>
                <created>Mon, 16 Jun 2014 19:37:45 +0000</created>
                <updated>Wed, 18 Feb 2015 21:09:47 +0000</updated>
                            <resolved>Thu, 6 Nov 2014 18:56:17 +0000</resolved>
                                    <version>Lustre 2.6.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="86730" author="paf" created="Mon, 16 Jun 2014 19:41:56 +0000"  >&lt;p&gt;dk logs (-1) from the client (also MDS008, serving mdt0007) and MDS1/MDT0, which gave the -EINVAL back to the client.&lt;/p&gt;

&lt;p&gt;There is also an lctl dl from another client showing OST38/OST0027 as available, and an lctl dl from MDS1/MDT0000 showing it as available as well.&lt;/p&gt;</comment>
                            <comment id="86820" author="pjones" created="Tue, 17 Jun 2014 17:12:18 +0000"  >&lt;p&gt;Di&lt;/p&gt;

&lt;p&gt;Could you please comment?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="86824" author="di.wang" created="Tue, 17 Jun 2014 17:30:29 +0000"  >&lt;p&gt;According to the console log here, it seems OST38 is not being registered correctly on MDT0000.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static int validate_lod_and_idx(struct lod_device *md, int idx)
{
        if (unlikely(idx &amp;gt;= md-&amp;gt;lod_ost_descs.ltd_tgts_size ||
                     !cfs_bitmap_check(md-&amp;gt;lod_ost_bitmap, idx))) {
                CERROR(&quot;%s: bad idx: %d of %d\n&quot;, lod2obd(md)-&amp;gt;obd_name, idx,
                       md-&amp;gt;lod_ost_descs.ltd_tgts_size);
                return -EINVAL;
        }
..........
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Patrick, could you please run lctl get_param lod.$your_fsname-MDT0000-mdtlov.target_obd and post the result here? Thanks.&lt;/p&gt;</comment>
                            <comment id="86828" author="paf" created="Tue, 17 Jun 2014 17:55:19 +0000"  >&lt;p&gt;Looks like you&apos;re right, Di...  Here&apos;s from MDS0:&lt;/p&gt;

&lt;p&gt;0: esfprod-OST0000_UUID ACTIVE&lt;br/&gt;
1: esfprod-OST0001_UUID ACTIVE&lt;br/&gt;
2: esfprod-OST0002_UUID ACTIVE&lt;br/&gt;
3: esfprod-OST0003_UUID ACTIVE&lt;br/&gt;
4: esfprod-OST0004_UUID ACTIVE&lt;br/&gt;
5: esfprod-OST0005_UUID ACTIVE&lt;br/&gt;
6: esfprod-OST0006_UUID ACTIVE&lt;br/&gt;
7: esfprod-OST0007_UUID ACTIVE&lt;br/&gt;
8: esfprod-OST0008_UUID ACTIVE&lt;br/&gt;
9: esfprod-OST0009_UUID ACTIVE&lt;br/&gt;
10: esfprod-OST000a_UUID ACTIVE&lt;br/&gt;
11: esfprod-OST000b_UUID ACTIVE&lt;br/&gt;
12: esfprod-OST000c_UUID ACTIVE&lt;br/&gt;
13: esfprod-OST000d_UUID ACTIVE&lt;br/&gt;
14: esfprod-OST000e_UUID ACTIVE&lt;br/&gt;
15: esfprod-OST000f_UUID ACTIVE&lt;br/&gt;
16: esfprod-OST0010_UUID ACTIVE&lt;br/&gt;
17: esfprod-OST0011_UUID ACTIVE&lt;br/&gt;
18: esfprod-OST0012_UUID ACTIVE&lt;br/&gt;
19: esfprod-OST0013_UUID ACTIVE&lt;br/&gt;
20: esfprod-OST0014_UUID ACTIVE&lt;br/&gt;
21: esfprod-OST0015_UUID ACTIVE&lt;br/&gt;
22: esfprod-OST0016_UUID ACTIVE&lt;br/&gt;
23: esfprod-OST0017_UUID ACTIVE&lt;br/&gt;
24: esfprod-OST0018_UUID ACTIVE&lt;br/&gt;
25: esfprod-OST0019_UUID ACTIVE&lt;br/&gt;
26: esfprod-OST001a_UUID ACTIVE&lt;br/&gt;
27: esfprod-OST001b_UUID ACTIVE&lt;br/&gt;
28: esfprod-OST001c_UUID ACTIVE&lt;br/&gt;
29: esfprod-OST001d_UUID ACTIVE&lt;br/&gt;
30: esfprod-OST001e_UUID ACTIVE&lt;br/&gt;
31: esfprod-OST001f_UUID ACTIVE&lt;br/&gt;
32: esfprod-OST0020_UUID ACTIVE&lt;br/&gt;
33: esfprod-OST0021_UUID ACTIVE&lt;br/&gt;
34: esfprod-OST0022_UUID ACTIVE&lt;br/&gt;
35: esfprod-OST0023_UUID ACTIVE&lt;br/&gt;
36: esfprod-OST0024_UUID ACTIVE&lt;br/&gt;
37: esfprod-OST0025_UUID ACTIVE&lt;br/&gt;
39: esfprod-OST0027_UUID ACTIVE&lt;/p&gt;


&lt;p&gt;And here&apos;s from MDS008:&lt;br/&gt;
0: esfprod-OST0000_UUID ACTIVE&lt;br/&gt;
1: esfprod-OST0001_UUID ACTIVE&lt;br/&gt;
2: esfprod-OST0002_UUID ACTIVE&lt;br/&gt;
3: esfprod-OST0003_UUID ACTIVE&lt;br/&gt;
4: esfprod-OST0004_UUID ACTIVE&lt;br/&gt;
5: esfprod-OST0005_UUID ACTIVE&lt;br/&gt;
6: esfprod-OST0006_UUID ACTIVE&lt;br/&gt;
7: esfprod-OST0007_UUID ACTIVE&lt;br/&gt;
8: esfprod-OST0008_UUID ACTIVE&lt;br/&gt;
9: esfprod-OST0009_UUID ACTIVE&lt;br/&gt;
10: esfprod-OST000a_UUID ACTIVE&lt;br/&gt;
11: esfprod-OST000b_UUID ACTIVE&lt;br/&gt;
12: esfprod-OST000c_UUID ACTIVE&lt;br/&gt;
13: esfprod-OST000d_UUID ACTIVE&lt;br/&gt;
14: esfprod-OST000e_UUID ACTIVE&lt;br/&gt;
15: esfprod-OST000f_UUID ACTIVE&lt;br/&gt;
16: esfprod-OST0010_UUID ACTIVE&lt;br/&gt;
17: esfprod-OST0011_UUID ACTIVE&lt;br/&gt;
18: esfprod-OST0012_UUID ACTIVE&lt;br/&gt;
19: esfprod-OST0013_UUID ACTIVE&lt;br/&gt;
20: esfprod-OST0014_UUID ACTIVE&lt;br/&gt;
21: esfprod-OST0015_UUID ACTIVE&lt;br/&gt;
22: esfprod-OST0016_UUID ACTIVE&lt;br/&gt;
23: esfprod-OST0017_UUID ACTIVE&lt;br/&gt;
24: esfprod-OST0018_UUID ACTIVE&lt;br/&gt;
25: esfprod-OST0019_UUID ACTIVE&lt;br/&gt;
26: esfprod-OST001a_UUID ACTIVE&lt;br/&gt;
27: esfprod-OST001b_UUID ACTIVE&lt;br/&gt;
28: esfprod-OST001c_UUID ACTIVE&lt;br/&gt;
29: esfprod-OST001d_UUID ACTIVE&lt;br/&gt;
30: esfprod-OST001e_UUID ACTIVE&lt;br/&gt;
31: esfprod-OST001f_UUID ACTIVE&lt;br/&gt;
32: esfprod-OST0020_UUID ACTIVE&lt;br/&gt;
33: esfprod-OST0021_UUID ACTIVE&lt;br/&gt;
34: esfprod-OST0022_UUID ACTIVE&lt;br/&gt;
35: esfprod-OST0023_UUID ACTIVE&lt;br/&gt;
36: esfprod-OST0024_UUID ACTIVE&lt;br/&gt;
37: esfprod-OST0025_UUID ACTIVE&lt;br/&gt;
38: esfprod-OST0026_UUID ACTIVE&lt;br/&gt;
39: esfprod-OST0027_UUID ACTIVE&lt;/p&gt;


&lt;p&gt;Any troubleshooting tips for this?  Should I just try stopping and starting the file system?  (I believe that&apos;s been done, but we could do it again.)&lt;/p&gt;</comment>
                            <comment id="86834" author="di.wang" created="Tue, 17 Jun 2014 18:27:53 +0000"  >&lt;p&gt;Yes, please. And if you can provide a -1 level debug log on MDS0 (especially when you mount OST38), that would be great. Also, could you please tell me how you restart the FS, i.e. the restart order of the nodes: MDTs first, then OSTs, or mixed? Thanks.&lt;/p&gt;</comment>
                            <comment id="86839" author="paf" created="Tue, 17 Jun 2014 19:09:37 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;I will do that as soon as I can get the system cleared.  (The file system is connected to one of our development machines, so it has some actual users at the moment.  I should be able to get them cleared out soon.)&lt;/p&gt;

&lt;p&gt;We usually do OSTs -&amp;gt; MDTs.  With larger DNE systems, this has sometimes been problematic, so we have tried starting MDTs -&amp;gt; OSTs.  It&apos;s not mixed: it&apos;s all of one kind, then it moves on to the other.&lt;/p&gt;

&lt;p&gt;Neither order has been 100% reliable, to be honest.  Generally if I do one, then the other, I&apos;ve been able to get the system to start.  &lt;br/&gt;
Is there a particular order you recommend?&lt;/p&gt;</comment>
                            <comment id="86844" author="paf" created="Tue, 17 Jun 2014 19:48:02 +0000"  >&lt;p&gt;After restart (order was OSTs -&amp;gt; MDTs), same problem on MDS0:&lt;/p&gt;

&lt;p&gt;0: esfprod-OST0000_UUID ACTIVE&lt;br/&gt;
1: esfprod-OST0001_UUID ACTIVE&lt;br/&gt;
2: esfprod-OST0002_UUID ACTIVE&lt;br/&gt;
3: esfprod-OST0003_UUID ACTIVE&lt;br/&gt;
4: esfprod-OST0004_UUID ACTIVE&lt;br/&gt;
5: esfprod-OST0005_UUID ACTIVE&lt;br/&gt;
6: esfprod-OST0006_UUID ACTIVE&lt;br/&gt;
7: esfprod-OST0007_UUID ACTIVE&lt;br/&gt;
8: esfprod-OST0008_UUID ACTIVE&lt;br/&gt;
9: esfprod-OST0009_UUID ACTIVE&lt;br/&gt;
10: esfprod-OST000a_UUID ACTIVE&lt;br/&gt;
11: esfprod-OST000b_UUID ACTIVE&lt;br/&gt;
12: esfprod-OST000c_UUID ACTIVE&lt;br/&gt;
13: esfprod-OST000d_UUID ACTIVE&lt;br/&gt;
14: esfprod-OST000e_UUID ACTIVE&lt;br/&gt;
15: esfprod-OST000f_UUID ACTIVE&lt;br/&gt;
16: esfprod-OST0010_UUID ACTIVE&lt;br/&gt;
17: esfprod-OST0011_UUID ACTIVE&lt;br/&gt;
18: esfprod-OST0012_UUID ACTIVE&lt;br/&gt;
19: esfprod-OST0013_UUID ACTIVE&lt;br/&gt;
20: esfprod-OST0014_UUID ACTIVE&lt;br/&gt;
21: esfprod-OST0015_UUID ACTIVE&lt;br/&gt;
22: esfprod-OST0016_UUID ACTIVE&lt;br/&gt;
23: esfprod-OST0017_UUID ACTIVE&lt;br/&gt;
24: esfprod-OST0018_UUID ACTIVE&lt;br/&gt;
25: esfprod-OST0019_UUID ACTIVE&lt;br/&gt;
26: esfprod-OST001a_UUID ACTIVE&lt;br/&gt;
27: esfprod-OST001b_UUID ACTIVE&lt;br/&gt;
28: esfprod-OST001c_UUID ACTIVE&lt;br/&gt;
29: esfprod-OST001d_UUID ACTIVE&lt;br/&gt;
30: esfprod-OST001e_UUID ACTIVE&lt;br/&gt;
31: esfprod-OST001f_UUID ACTIVE&lt;br/&gt;
32: esfprod-OST0020_UUID ACTIVE&lt;br/&gt;
33: esfprod-OST0021_UUID ACTIVE&lt;br/&gt;
34: esfprod-OST0022_UUID ACTIVE&lt;br/&gt;
35: esfprod-OST0023_UUID ACTIVE&lt;br/&gt;
36: esfprod-OST0024_UUID ACTIVE&lt;br/&gt;
37: esfprod-OST0025_UUID ACTIVE&lt;br/&gt;
39: esfprod-OST0027_UUID ACTIVE&lt;/p&gt;


&lt;p&gt;Attaching start log with debug=-1 from MDS1 as requested...&lt;/p&gt;</comment>
                            <comment id="86845" author="paf" created="Tue, 17 Jun 2014 19:48:40 +0000"  >&lt;p&gt;Start log of MDS1/MDT0 with OST not registering.&lt;/p&gt;</comment>
                            <comment id="86852" author="paf" created="Tue, 17 Jun 2014 20:10:32 +0000"  >&lt;p&gt;Start logs of both mds001 and oss003, which is presenting the affected OST.&lt;/p&gt;</comment>
                            <comment id="86855" author="di.wang" created="Tue, 17 Jun 2014 20:43:05 +0000"  >&lt;p&gt;Usually we prefer to start MDT0 first, then the other targets, whether MDTs or OSTs. Btw: does this FS have a separate MGS? Unfortunately, the debug log does not include the information I need; it seems debug=-1 was set at a later time, rather than at the initial mount.&lt;/p&gt;

&lt;p&gt;And could you please dump the config log and post it here? If you cannot umount the MGT or MDT0, you can do it on the MGS like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@mds tests]# debugfs /dev/loop0   #your MGT device
debugfs 1.42.3.wc3 (15-Aug-2012)
debugfs:  ls
 2  (12) .    2  (12) ..    11  (20) lost+found    25001  (16) CONFIGS   
 25002  (12) O    25003  (28) REMOTE_PARENT_DIR    13  (16) OI_scrub   
 14  (16) oi.16.0    15  (16) oi.16.1    16  (16) oi.16.2    17  (16) oi.16.3   
 18  (16) oi.16.4    19  (16) oi.16.5    20  (16) oi.16.6    21  (16) oi.16.7   
 22  (16) oi.16.8    23  (16) oi.16.9    24  (16) oi.16.10   
 25  (16) oi.16.11    26  (16) oi.16.12    27  (16) oi.16.13   
 28  (16) oi.16.14    29  (16) oi.16.15    30  (16) oi.16.16   
 31  (16) oi.16.17    32  (16) oi.16.18    33  (16) oi.16.19   
 34  (16) oi.16.20    35  (16) oi.16.21    36  (16) oi.16.22   
 37  (16) oi.16.23    38  (16) oi.16.24    39  (16) oi.16.25   
 40  (16) oi.16.26    41  (16) oi.16.27    42  (16) oi.16.28   
 43  (16) oi.16.29    44  (16) oi.16.30    45  (16) oi.16.31   
 46  (16) oi.16.32    47  (16) oi.16.33    48  (16) oi.16.34   
 49  (16) oi.16.35    50  (16) oi.16.36    51  (16) oi.16.37   
 52  (16) oi.16.38    53  (16) oi.16.39    54  (16) oi.16.40   
 55  (16) oi.16.41    56  (16) oi.16.42    57  (16) oi.16.43   
 58  (16) oi.16.44    59  (16) oi.16.45    60  (16) oi.16.46   
 61  (16) oi.16.47    62  (16) oi.16.48    63  (16) oi.16.49   
 64  (16) oi.16.50    65  (16) oi.16.51    66  (16) oi.16.52   
 67  (16) oi.16.53    68  (16) oi.16.54    69  (16) oi.16.55   
 70  (16) oi.16.56    71  (16) oi.16.57    72  (16) oi.16.58   
 73  (16) oi.16.59    74  (16) oi.16.60    75  (16) oi.16.61   
 76  (16) oi.16.62    77  (16) oi.16.63    25026  (24) NIDTBL_VERSIONS   
 85  (12) fld    86  (16) seq_ctl    87  (16) seq_srv    88  (20) last_rcvd   
 50039  (20) quota_master    50042  (20) quota_slave    50043  (12) ROOT   
 75022  (16) PENDING    98  (28) changelog_catalog   
 99  (24) changelog_users    100  (20) hsm_actions   
 101  (24) lfsck_bookmark    102  (24) lfsck_namespace   
 103  (20) lfsck_layout    109  (20) SLAVE_LOG    116  (20) lov_objid   
 117  (20) lov_objseq    118  (2600) CATALOGS   
debugfs:  ls CONFIGS
 25001  (12) .    2  (12) ..    12  (20) mountdata    81  (24) params-client   
 82  (16) params    83  (24) lustre-client    84  (24) lustre-MDT0000   
 104  (24) lustre-MDT0001    105  (24) lustre-MDT0002   
 106  (24) lustre-MDT0003    107  (24) lustre-OST0000   
 108  (3868) lustre-OST0001   
debugfs:  dump_inode -p CONFIGS/lustre-MDT0000 /tmp/config.log
debugfs:  quite
debugfs: Unknown request &quot;quite&quot;.  Type &quot;?&quot; for a request list.
debugfs:  q
[root@mds tests]# ../utils/llog_reader /tmp/config.log 
Bit 48 of 69 not set
Header size : 8192
Time : Mon Jun 16 18:11:52 2014
Number of records: 69
Target uuid : config_uuid 
-----------------------
.....................
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="86856" author="di.wang" created="Tue, 17 Jun 2014 20:47:19 +0000"  >&lt;p&gt;Btw: I do not think this is related with DNE2, probably some config log problem, but let&apos;s see after I get the config log. &lt;/p&gt;</comment>
                            <comment id="86857" author="paf" created="Tue, 17 Jun 2014 21:13:40 +0000"  >&lt;p&gt;Weird about the debug logs.  I modprobe&apos;d Lustre, then set debug=-1, then started the fs, so the logs should&apos;ve been taken before the targets were mounted...&lt;/p&gt;

&lt;p&gt;Anyway:  Yes, the MGT and MDT0 are separate devices.  We tend to do that so it&apos;s easier for backup/restore, etc.&lt;br/&gt;
I&apos;m attaching the config log you requested in a second here.&lt;/p&gt;</comment>
                            <comment id="86858" author="paf" created="Tue, 17 Jun 2014 21:14:24 +0000"  >&lt;p&gt;MDT0 config log as requested&lt;/p&gt;</comment>
                            <comment id="86863" author="di.wang" created="Tue, 17 Jun 2014 21:24:54 +0000"  >&lt;p&gt;Interesting: you can see OST0026 is skipped in the config log; that is why OST38 is not registered on MDT0000.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;.....
#532 (224)SKIP START marker 1284 (flags=0x05, v2.5.59.0) esfprod         &apos;sys.timeout&apos; Wed Jun 11 16:05:24 2014-Wed Jun 11 16:08:29 2014
#533 (080)SKIP set_timeout=300 
#534 (224)SKIP END   marker 1284 (flags=0x06, v2.5.59.0) esfprod         &apos;sys.timeout&apos; Wed Jun 11 16:05:24 2014-Wed Jun 11 16:08:29 2014
#535 (224)SKIP START marker 1288 (flags=0x05, v2.5.59.0) esfprod-OST0026 &apos;add osc&apos; Wed Jun 11 16:08:29 2014-
#536 (088)SKIP add_uuid  nid=10.151.10.11@o2ib8(0x500080a970a0b)  0:  1:10.151.10.11@o2ib8  
#537 (088)SKIP add_uuid  nid=10.151.10.11@o2ib8002(0x51f420a970a0b)  0:  1:10.151.10.11@o2ib8  
#538 (144)SKIP attach    0:esfprod-OST0026-osc-MDT0000  1:osc  2:esfprod-MDT0000-mdtlov_UUID  
#539 (152)SKIP setup     0:esfprod-OST0026-osc-MDT0000  1:esfprod-OST0026_UUID  2:10.151.10.11@o2ib8  
#540 (088)SKIP add_uuid  nid=10.150.10.12@o2ib8(0x500080a960a0c)  0:  1:10.150.10.12@o2ib8  
#541 (120)SKIP add_conn  0:esfprod-OST0026-osc-MDT0000  1:10.150.10.12@o2ib8  
#542 (136)SKIP lov_modify_tgts add 0:esfprod-MDT0000-mdtlov  1:esfprod-OST0026_UUID  2:38  3:1  
#543 (224)SKIP END   marker 1288 (flags=0x06, v2.5.59.0) esfprod-OST0026 &apos;add osc&apos; Wed Jun 11 16:08:29 2014-
#544 (224)SKIP START marker 1335 (flags=0x05, v2.5.59.0) esfprod         &apos;sys.timeout&apos; Wed Jun 11 16:08:29 2014-Wed Jun 11 16:08:32 2014
#545 (080)SKIP set_timeout=300 
#547 (224)SKIP END   marker 1335 (flags=0x06, v2.5.59.0) esfprod         &apos;sys.timeout&apos; Wed Jun 11 16:08:29 2014-Wed Jun 11 16:08:32 2014
......
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Not sure how this happened. Did you ever tweak the config log with tunefs or set_param?&lt;/p&gt;</comment>
                            <comment id="86864" author="paf" created="Tue, 17 Jun 2014 21:28:29 +0000"  >&lt;p&gt;No, definitely not.  We did a stress run of 2.6 with DNE2 (2.6 clients as well), and when it was over and the system had been rebooted, we were in this state where some of the files created during that stress run could not be deleted.  We didn&apos;t deliberately touch the config at any point in there.&lt;/p&gt;</comment>
                            <comment id="86874" author="di.wang" created="Tue, 17 Jun 2014 23:00:55 +0000"  >&lt;p&gt;Patrick, was this FS reformatted before this test? Btw, you can always erase the config log with tunefs --writeconf and remount the FS to fix the config log issue, but we still need to understand what happened here.&lt;/p&gt;</comment>
                            <comment id="86880" author="di.wang" created="Tue, 17 Jun 2014 23:54:06 +0000"  >&lt;p&gt;Patrick, please also provide mkfs.lustre command line you use to create the filesystem. I checked the master code and did not find any issue there. &lt;/p&gt;</comment>
                            <comment id="86881" author="paf" created="Wed, 18 Jun 2014 00:12:58 +0000"  >&lt;p&gt;Di,&lt;/p&gt;

&lt;p&gt;Yes, now that we know it&apos;s a config log issue, I figured we could fix it with a writeconf operation.  But like you said, we&apos;d like to understand the issue.&lt;/p&gt;

&lt;p&gt;It was not reformatted before the test.  It WAS upgraded from 2.5, which required a writeconf operation at that time to get it to start.&lt;br/&gt;
So it was originally formatted with Lustre 2.5.1, then upgraded to master.&lt;/p&gt;

&lt;p&gt;For the mkfs.lustre command for the MDT (I don&apos;t have the device name, but these are the options that were used):&lt;br/&gt;
mkfs.lustre --reformat --mdt --fsname=esfprod --mgsnode=galaxy-esf-mds001 --index=0 --quiet --backfstype=ldiskfs --param sys.timeout=300 --param lov.stripesize=1048576 --param lov.stripecount=1 --mkfsoptions=&quot;-J size=400&quot; &lt;span class=&quot;error&quot;&gt;&amp;#91;MDT device name&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;For the MGT:&lt;br/&gt;
Command: mkfs.lustre --reformat --mgs --quiet --backfstype=ldiskfs --param sys.timeout=300 &lt;span class=&quot;error&quot;&gt;&amp;#91;MGT device name&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;For one of the OSTs:&lt;br/&gt;
mkfs.lustre --reformat --ost --fsname=esfprod --mgsnode=galaxy-esf-mds001 --index=1 --quiet --backfstype=ldiskfs --param sys.timeout=300 --mkfsoptions=&quot;-J size=400&quot; --mountfsoptions=&quot;errors=remount-ro,extents,mballoc&quot; &lt;span class=&quot;error&quot;&gt;&amp;#91;OST device name&amp;#93;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="86892" author="pjones" created="Wed, 18 Jun 2014 03:41:33 +0000"  >&lt;p&gt;Emoly&lt;/p&gt;

&lt;p&gt;Could you please try reproducing this issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="86999" author="emoly.liu" created="Thu, 19 Jun 2014 01:44:02 +0000"  >&lt;p&gt;Patrick,&lt;/p&gt;

&lt;p&gt;I will try to upgrade a Lustre file system from 2.5.1 to 2.6 to reproduce this problem.  Could you please suggest how many OSTs and MDTs are enough for this test? Also, I know the MGS and MDS should be separate in this test; is there anything else I should pay attention to?&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="87002" author="paf" created="Thu, 19 Jun 2014 03:07:45 +0000"  >&lt;p&gt;Emoly,&lt;/p&gt;

&lt;p&gt;Unfortunately, I don&apos;t really know how many is enough.  We have 8 MDSes and 8 MDTs, and 4 OSSes and 40 OSTs.  It&apos;s a test bed system for DNE, which is why it&apos;s such a weird configuration.&lt;/p&gt;

&lt;p&gt;We do have separate MGT and MDT.&lt;/p&gt;

&lt;p&gt;As for other details: we ran a bunch of different IO tests, like IOR and a large number of tests from the Linux Test Project in various configurations, all with mkdir replaced by a script which would randomly create striped or remote directories (and sometimes normal directories).&lt;/p&gt;

&lt;p&gt;We did that last weekend, and had this problem on Monday.  No idea what was running when it started.&lt;/p&gt;

&lt;p&gt;Sorry for not having many specifics on testing, it&apos;s a large test suite.&lt;/p&gt;

&lt;p&gt;We&apos;re probably going to fix the system soon by doing a writeconf, so we can continue stress testing DNE2.  Let me know if there&apos;s anything else I can give you first.&lt;/p&gt;</comment>
                            <comment id="87065" author="adilger" created="Thu, 19 Jun 2014 17:52:32 +0000"  >&lt;p&gt;Is it possible that OST0026 was ever deactivated during testing (e.g. &lt;tt&gt;lctl conf_param esfprod-OST0026.osc.active=0&lt;/tt&gt; or similar)?  That would permanently disable the OST in the config log and seems to me to be the most likely cause of this problem.&lt;/p&gt;</comment>
                            <comment id="87071" author="paf" created="Thu, 19 Jun 2014 18:10:14 +0000"  >&lt;p&gt;Andreas,&lt;/p&gt;

&lt;p&gt;It&apos;s really unlikely.  No one should have been mucking with the system.  I can&apos;t say it&apos;s impossible, but...&lt;/p&gt;

&lt;p&gt;Now that we&apos;ve tracked it down to such a strange error, I&apos;m planning to go ahead and fix it and not worry unless it occurs again in further stress testing. I&apos;ll do exactly that unless someone has further information they&apos;d like from the system. (Speak up soon; I&apos;m going to fix it for our stress testing slot tonight.)&lt;/p&gt;

&lt;p&gt;I&apos;ve also (in further testing) hit an MDS0 crash bug that could possibly be related to this one; I&apos;m going to open a ticket for it shortly and will reference that LU here once I&apos;ve got it open.&lt;/p&gt;</comment>
                            <comment id="87102" author="paf" created="Thu, 19 Jun 2014 22:05:16 +0000"  >&lt;p&gt;Opened &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5233&quot; title=&quot;2.6 DNE stress testing: (lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo-&amp;gt;ldo_stripe ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5233&quot;&gt;&lt;del&gt;LU-5233&lt;/del&gt;&lt;/a&gt; for the MDS1 LBUG I mentioned above.&lt;/p&gt;</comment>
                            <comment id="87192" author="adilger" created="Fri, 20 Jun 2014 18:04:02 +0000"  >&lt;p&gt;The one obvious problem that I see is that it should ALWAYS be possible to delete a file, even if the OST is unavailable, or configured out of the system.  Regardless of what the root cause of the problem is, there needs to be a patch to allow the file to be deleted.&lt;/p&gt;</comment>
                            <comment id="97358" author="emoly.liu" created="Fri, 24 Oct 2014 02:53:37 +0000"  >&lt;p&gt;Sorry for my late update. I can&apos;t reproduce this issue in my testing environment.&lt;/p&gt;</comment>
                            <comment id="97361" author="di.wang" created="Fri, 24 Oct 2014 02:59:57 +0000"  >&lt;p&gt;Since we cannot reproduce the problem locally, I cannot figure out why the config log is &quot;corrupted&quot;. If it happens again in DNE testing, please note the steps that led up to it; we will probably have more ideas then.&lt;/p&gt;</comment>
                            <comment id="98573" author="adilger" created="Thu, 6 Nov 2014 18:56:17 +0000"  >&lt;p&gt;We were unable to figure out what the problem is; please reopen if it is hit again.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="25227">LU-5233</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="15178" name="LU-5204_mds0_start_log.tar.gz" size="253" author="paf" created="Tue, 17 Jun 2014 19:48:40 +0000"/>
                            <attachment id="15179" name="LU-5204_start_log_with_oss.tar.gz" size="261" author="paf" created="Tue, 17 Jun 2014 20:10:32 +0000"/>
                            <attachment id="15155" name="invalid_object_client_mdt0007" size="517949" author="paf" created="Mon, 16 Jun 2014 19:41:56 +0000"/>
                            <attachment id="15156" name="invalid_object_mds_mdt0000" size="136256" author="paf" created="Mon, 16 Jun 2014 19:41:56 +0000"/>
                            <attachment id="15157" name="lctl_dl_from_client" size="4415" author="paf" created="Mon, 16 Jun 2014 19:41:56 +0000"/>
                            <attachment id="15158" name="lctl_dl_from_mds001_mdt0000" size="3693" author="paf" created="Mon, 16 Jun 2014 19:41:56 +0000"/>
                            <attachment id="15180" name="mdt0.config.log" size="58165" author="paf" created="Tue, 17 Jun 2014 21:14:24 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwp5b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>14529</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>