<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:18:21 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8527] Lustre 2.8 server crashed on bring up during large scale test shot</title>
                <link>https://jira.whamcloud.com/browse/LU-8527</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;In our first attempt to test Lustre 2.8 servers at scale, the MDS crashed at bring-up. The backtrace is:&lt;/p&gt;

&lt;p&gt;&amp;lt;2&amp;gt;[ 8011.527071] kernel BUG at fs/jbd2/transaction.c:1028!&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.532938] invalid opcode: 0000 &lt;a href=&quot;#1&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;1&lt;/a&gt; SMP&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.537761] last sysfs file: /sys/devices/virtual/block/dm-7/uevent&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.544986] CPU 5&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.547040] Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) mbcache jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ptlrpc(U) obdclass(U) ko2iblnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) mpt2sas mptctl mptbase dell_rbu autofs4 8021q garp stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_sa ib_mad ib_core ib_addr dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_rdac dm_multipath dm_mod sg ipmi_devintf sd_mod crc_t10dif iTCO_wdt iTCO_vendor_support microcode wmi power_meter acpi_ipmi ipmi_si ipmi_msghandler dcdbas mpt3sas scsi_transport_sas raid_class shpchp sb_edac edac_core lpc_ich mfd_core ahci ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc tg3 mlx4_en ptp pps_core mlx4_core &lt;span class=&quot;error&quot;&gt;&amp;#91;last unloaded: scsi_wait_scan&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.646238]&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.648140] Pid: 30211, comm: mdt01_503 Not tainted 2.6.32-573.26.1.el6.dne2.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.660578] RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0c9d92d&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0c9d92d&amp;gt;&amp;#93;&lt;/span&gt; jbd2_journal_dirty_metadata+0x10d/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.672316] RSP: 0018:ffff883e6751f580  EFLAGS: 00010246&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.678473] RAX: ffff883e3a8be380 RBX: ffff883ce650e3d8 RCX: ffff883e9a7656e0&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.686676] RDX: 0000000000000000 RSI: ffff883e9a7656e0 RDI: ffff883ce650e3d8&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.694873] RBP: ffff883e6751f5a0 R08: c010000000000000 R09: 0000000000000000&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.703072] R10: 0000000000000001 R11: 0000000000000008 R12: ffff883df92c8ba8&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.716541] R13: ffff883e9a7656e0 R14: ffff884023ca1800 R15: 0000000000000058&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.724738] FS:  0000000000000000(0000) GS:ffff88018fca0000(0000) knlGS:0000000000000000&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.734220] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.740863] CR2: 00007fc59a79b380 CR3: 0000000001a8d000 CR4: 00000000001407e0&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.749059] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.757263] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.765460] Process mdt01_503 (pid: 30211, threadinfo ffff883e6751c000, task ffff883e6750c040)&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.775518] Stack:&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.777983]  ffff883ce650e3d8 ffffffffa0d827b0 ffff883e9a7656e0 0000000000000000&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.786313] &amp;lt;d&amp;gt; ffff883e6751f5e0 ffffffffa0cc0fab ffff883e6751f5e0 ffffffffa0ccf968&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.795363] &amp;lt;d&amp;gt; 0000000000000000 0000000000000058 0000000000000058 ffff883e9a7656e0&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.804839] Call Trace:&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.807798]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0cc0fab&amp;gt;&amp;#93;&lt;/span&gt; __ldiskfs_handle_dirty_metadata+0x7b/0x100 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.817097]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0ccf968&amp;gt;&amp;#93;&lt;/span&gt; ? ldiskfs_bread+0x18/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.824517]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d6635c&amp;gt;&amp;#93;&lt;/span&gt; osd_ldiskfs_write_record+0xec/0x340 &lt;span class=&quot;error&quot;&gt;&amp;#91;osd_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.833519]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d690a3&amp;gt;&amp;#93;&lt;/span&gt; osd_write+0x183/0x5b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osd_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.840965]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05f8b4d&amp;gt;&amp;#93;&lt;/span&gt; dt_record_write+0x3d/0x130 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.848680]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05bb44f&amp;gt;&amp;#93;&lt;/span&gt; llog_osd_write_rec+0xb6f/0x1ad0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.856787]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d7ba0b&amp;gt;&amp;#93;&lt;/span&gt; ? dynlock_unlock+0x16b/0x1d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osd_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.864894]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05a9426&amp;gt;&amp;#93;&lt;/span&gt; llog_write_rec+0xb6/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.872419]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05b2893&amp;gt;&amp;#93;&lt;/span&gt; llog_cat_add_rec+0x1c3/0x7b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.880234]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05a9239&amp;gt;&amp;#93;&lt;/span&gt; llog_add+0x89/0x1c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.887176]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa108641d&amp;gt;&amp;#93;&lt;/span&gt; osp_sync_add_rec+0x26d/0x9b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.894501]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa1086c07&amp;gt;&amp;#93;&lt;/span&gt; osp_sync_add+0x77/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.901248]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0fb502e&amp;gt;&amp;#93;&lt;/span&gt; ? lod_sub_get_thandle+0x24e/0x3c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lod&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.909060]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa1077823&amp;gt;&amp;#93;&lt;/span&gt; osp_object_destroy+0x173/0x230 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
&amp;lt;4&amp;gt;[ 8011.916582]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0fb85ad&amp;gt;&amp;#93;&lt;/span&gt; lod_sub_object_destroy+0x1fd/0x440 &lt;span class=&quot;error&quot;&gt;&amp;#91;lod&amp;#93;&lt;/span&gt;&lt;/p&gt;</description>
                <environment>RHEL 6.7 servers running the latest Lustre 2.8, with the latest Lustre 2.8 clients.</environment>
        <key id="39059">LU-8527</key>
            <summary>Lustre 2.8 server crashed on bring up during large scale test shot</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="simmonsja">James A Simmons</reporter>
                        <labels>
                    </labels>
                <created>Tue, 23 Aug 2016 15:21:52 +0000</created>
                <updated>Tue, 18 Oct 2016 21:53:20 +0000</updated>
                            <resolved>Sat, 8 Oct 2016 19:04:09 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.9.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="162844" author="simmonsja" created="Tue, 23 Aug 2016 15:59:12 +0000"  >&lt;p&gt;Small detail: this crash happened during an IOR run. Also, it doesn&apos;t seem repeatable.&lt;/p&gt;</comment>
                            <comment id="162851" author="simmonsja" created="Tue, 23 Aug 2016 16:32:40 +0000"  >&lt;p&gt;Here is the dmesg from the crash&lt;/p&gt;</comment>
                            <comment id="162875" author="yujian" created="Tue, 23 Aug 2016 17:33:06 +0000"  >&lt;p&gt;Hi Bobi,&lt;/p&gt;

&lt;p&gt;I saw a similar stack trace reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3102&quot; title=&quot;kernel BUG at fs/jbd2/transaction.c:1033&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3102&quot;&gt;&lt;del&gt;LU-3102&lt;/del&gt;&lt;/a&gt; before. Could you please look into this failure?&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="162900" author="bobijam" created="Tue, 23 Aug 2016 20:47:58 +0000"  >&lt;p&gt;Is this FE 2.8? If so, I think &lt;a href=&quot;http://review.whamcloud.com/19732&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19732&lt;/a&gt; would be a promising patch.&lt;/p&gt;</comment>
                            <comment id="162906" author="bobijam" created="Tue, 23 Aug 2016 21:11:22 +0000"  >&lt;p&gt;A port of it is at &lt;a href=&quot;http://review.whamcloud.com/22058&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/22058&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="162926" author="simmonsja" created="Tue, 23 Aug 2016 21:46:38 +0000"  >&lt;p&gt;Okay I&apos;m building it now. Will let you know the results.&lt;/p&gt;</comment>
                            <comment id="162946" author="simmonsja" created="Tue, 23 Aug 2016 22:50:39 +0000"  >&lt;p&gt;No, patch 22058 doesn&apos;t fix it for us &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; We are in a state of constant crashing when the MDS finishes recovery.&lt;/p&gt;</comment>
                            <comment id="162978" author="bzzz" created="Wed, 24 Aug 2016 10:48:42 +0000"  >&lt;p&gt;LOD itself doesn&apos;t declare credits, but it asks OSD to declare operations. Is it possible to load the osd-ldiskfs module with ldiskfs_track_declares_assert=1 and attach dmesg output starting a bit earlier? (Hopefully we&apos;ll hit our assert with additional information.) Thanks in advance.&lt;/p&gt;</comment>
                            <comment id="163168" author="dustb100" created="Thu, 25 Aug 2016 19:03:29 +0000"  >&lt;p&gt;I cut the log down to show times from 7:11PM (recovery &lt;br/&gt;
started and /proc/fs/lustre/osd-ldiskfs enabled) to 7:18PM when the &lt;br/&gt;
crashdump kernel took over. Please see output below: &lt;/p&gt;

&lt;p&gt;Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: Kernel logging (proc) stopped.&lt;br/&gt;
Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov kernel: imklog 5.8.10, log source = /proc/kmsg started.&lt;br/&gt;
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [  138.744691] LNet: HW CPU cores: 8, npartitions: 2&lt;br/&gt;
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [  138.757447] alg: No test for adler32 (adler32-zlib)&lt;br/&gt;
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [  138.763082] alg: No test for crc32 (crc32-table)&lt;br/&gt;
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [  138.768422] alg: No test for crc32 (crc32-pclmul)&lt;br/&gt;
Aug 23 19:05:55 atlas1-mds1.ccs.ornl.gov kernel: [  146.861939] Lustre: Lustre: Build Version: 2.8.0-g242e67d-CHANGED-2.6.32-573.26.1.el6.dne2.x86_64&lt;br/&gt;
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [  147.181854] LNet: Added LNI 10.36.226.117@o2ib &lt;span class=&quot;error&quot;&gt;&amp;#91;63/2560/0/180&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [  147.452347] LNetError: 2837:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can&apos;t accept conn from 10.36.230.233@o2ib200 on NA (ib0:1:10.36.226.117): bad dst nid 10.36.226.117@o2ib200&lt;br/&gt;
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [  147.589168] LNet: Added LNI 10.36.226.117@o2ib200 &lt;span class=&quot;error&quot;&gt;&amp;#91;63/2560/0/180&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [  419.034586] LDISKFS-fs (dm-7): warning: mounting fs with errors, running e2fsck is recommended&lt;br/&gt;
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [  419.050076] LDISKFS-fs (dm-7): recovery complete&lt;br/&gt;
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [  419.065246] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. quota=on. Opts: &lt;br/&gt;
Aug 23 19:10:29 atlas1-mds1.ccs.ornl.gov kernel: [  420.483435] Lustre: atlas1-MDT0000: Not available for connect from 17469@gni100 (not set up)&lt;br/&gt;
Aug 23 19:10:31 atlas1-mds1.ccs.ornl.gov kernel: [  422.098522] Lustre: atlas1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 1800-5400 down to 900-2700&lt;br/&gt;
Aug 23 19:10:32 atlas1-mds1.ccs.ornl.gov kernel: [  423.169261] Lustre: atlas1-MDT0000: Will be in recovery for at least 15:00, or until 18437 clients reconnect&lt;br/&gt;
Aug 23 19:10:32 atlas1-mds1.ccs.ornl.gov kernel: [  423.180828] Lustre: atlas1-MDT0000: Connection restored to 7e9ab0dc-95d7-895f-67d0-472e66bcea23 (at 232@gni100)&lt;br/&gt;
Aug 23 19:10:33 atlas1-mds1.ccs.ornl.gov kernel: [  424.194505] Lustre: atlas1-MDT0000: Connection restored to b8f051f0-79b3-229d-0cab-f02d9100888a (at 18990@gni100)&lt;br/&gt;
Aug 23 19:10:34 atlas1-mds1.ccs.ornl.gov kernel: [  425.308076] Lustre: atlas1-MDT0000: Connection restored to 6954d66d-f1d9-df22-81b6-a2a84a767326 (at 17860@gni100)&lt;br/&gt;
Aug 23 19:10:34 atlas1-mds1.ccs.ornl.gov kernel: [  425.320064] Lustre: Skipped 5 previous similar messages&lt;br/&gt;
Aug 23 19:10:36 atlas1-mds1.ccs.ornl.gov kernel: [  427.309298] Lustre: atlas1-MDT0000: Connection restored to 97ba02a4-3cbf-2339-af5c-bc672e524fcf (at 15033@gni100)&lt;br/&gt;
Aug 23 19:10:36 atlas1-mds1.ccs.ornl.gov kernel: [  427.321286] Lustre: Skipped 249 previous similar messages&lt;br/&gt;
Aug 23 19:10:40 atlas1-mds1.ccs.ornl.gov kernel: [  431.312192] Lustre: atlas1-MDT0000: Connection restored to 32e83f22-2827-c76a-be5a-444297abac0f (at 1031@gni100)&lt;br/&gt;
Aug 23 19:10:40 atlas1-mds1.ccs.ornl.gov kernel: [  431.324082] Lustre: Skipped 11156 previous similar messages&lt;br/&gt;
Aug 23 19:12:02 atlas1-mds1.ccs.ornl.gov kernel: [  513.742286] Lustre: atlas1-MDT0000: Connection restored to 68400d40-3736-e9e3-b4eb-7d2f1a9c2a9b (at 10.36.225.10@o2ib)&lt;br/&gt;
Aug 23 19:12:02 atlas1-mds1.ccs.ornl.gov kernel: [  513.754757] Lustre: Skipped 2849 previous similar messages&lt;br/&gt;
Aug 23 19:12:41 atlas1-mds1.ccs.ornl.gov kernel: [  552.533598] Lustre: atlas1-MDT0000: Connection restored to d6edd526-b008-01f3-beb9-910b627acce0 (at 10.36.205.218@o2ib)&lt;br/&gt;
Aug 23 19:13:32 atlas1-mds1.ccs.ornl.gov kernel: [  603.859284] Lustre: atlas1-MDT0000: Connection restored to fdf3ca98-024f-2d65-83bc-b10712d608f5 (at 10.36.205.209@o2ib)&lt;br/&gt;
Aug 23 19:13:32 atlas1-mds1.ccs.ornl.gov kernel: [  603.871786] Lustre: Skipped 1 previous similar message&lt;br/&gt;
Aug 23 19:14:51 atlas1-mds1.ccs.ornl.gov kernel: [  682.177088] Lustre: atlas1-MDT0000: Connection restored to c281fcc2-9ed2-632b-dfca-a063cbb468fd (at 16884@gni100)&lt;br/&gt;
Aug 23 19:14:51 atlas1-mds1.ccs.ornl.gov kernel: [  682.189010] Lustre: Skipped 1677 previous similar messages&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.646938] Lustre: 18235:0:(osd_handler.c:1265:osd_trans_dump_creds())   create: 1008/8064/0, destroy: 1/4/1&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.658473] Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds())   attr_set: 2/2/1, xattr_set: 1011/92/0&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.670198] Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds())   write: 5043/43356/0, punch: 0/0/0, quota 4/52/0&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.682891] Lustre: 18235:0:(osd_handler.c:1289:osd_trans_dump_creds())   insert: 1009/17152/0, delete: 2/9/1&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.694423] Lustre: 18235:0:(osd_handler.c:1296:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/1&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.703059] Lustre: atlas1-MDT0000: Recovery over after 4:34, of 18437 clients 18437 recovered and 0 were evicted.&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.717547] LustreError: 18235:0:(osd_internal.h:1073:osd_trans_exec_op()) atlas1-MDT0000-osd: op = 7, rb = 7&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.729087] LustreError: 18235:0:(osd_internal.h:1081:osd_trans_exec_op()) LBUG&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.737712] Pid: 18235, comm: mdt01_399&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.742233] &lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.742233] Call Trace:&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.747087]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0423875&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x55/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.755118]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0423e77&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x47/0xb0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.762280]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0c4b340&amp;gt;&amp;#93;&lt;/span&gt; osd_write+0x420/0x5b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osd_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.769753]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0569b4d&amp;gt;&amp;#93;&lt;/span&gt; dt_record_write+0x3d/0x130 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.777402]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa052c44f&amp;gt;&amp;#93;&lt;/span&gt; llog_osd_write_rec+0xb6f/0x1ad0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.785527]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa051a426&amp;gt;&amp;#93;&lt;/span&gt; llog_write_rec+0xb6/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.793080]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0523893&amp;gt;&amp;#93;&lt;/span&gt; llog_cat_add_rec+0x1c3/0x7b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.800936]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa051a239&amp;gt;&amp;#93;&lt;/span&gt; llog_add+0x89/0x1c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.807904]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0f1e41d&amp;gt;&amp;#93;&lt;/span&gt; osp_sync_add_rec+0x26d/0x9b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.815261]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0f1ec07&amp;gt;&amp;#93;&lt;/span&gt; osp_sync_add+0x77/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.822015]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0f0f823&amp;gt;&amp;#93;&lt;/span&gt; osp_object_destroy+0x173/0x230 &lt;span class=&quot;error&quot;&gt;&amp;#91;osp&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.829581]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0e635ad&amp;gt;&amp;#93;&lt;/span&gt; lod_sub_object_destroy+0x1fd/0x440 &lt;span class=&quot;error&quot;&gt;&amp;#91;lod&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.837516]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0e5745b&amp;gt;&amp;#93;&lt;/span&gt; lod_object_destroy+0x36b/0x770 &lt;span class=&quot;error&quot;&gt;&amp;#91;lod&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.845067]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa042ed81&amp;gt;&amp;#93;&lt;/span&gt; ? libcfs_debug_msg+0x41/0x50 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.852726]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0ebde6b&amp;gt;&amp;#93;&lt;/span&gt; mdd_finish_unlink+0x28b/0x3d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdd&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.860164]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa042c1d8&amp;gt;&amp;#93;&lt;/span&gt; ? libcfs_log_return+0x28/0x40 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.867922]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0ec2845&amp;gt;&amp;#93;&lt;/span&gt; mdd_unlink+0xab5/0xf70 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdd&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.874700]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa042ed81&amp;gt;&amp;#93;&lt;/span&gt; ? libcfs_debug_msg+0x41/0x50 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.882383]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d800c8&amp;gt;&amp;#93;&lt;/span&gt; mdo_unlink+0x18/0x50 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdt&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.888962]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d88f70&amp;gt;&amp;#93;&lt;/span&gt; mdt_reint_unlink+0xbb0/0x1060 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdt&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.896416]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d8015d&amp;gt;&amp;#93;&lt;/span&gt; mdt_reint_rec+0x5d/0x200 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdt&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.903394]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d6bddb&amp;gt;&amp;#93;&lt;/span&gt; mdt_reint_internal+0x62b/0x9f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdt&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.910960]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d6c63b&amp;gt;&amp;#93;&lt;/span&gt; mdt_reint+0x6b/0x120 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdt&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.917602]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07e79cc&amp;gt;&amp;#93;&lt;/span&gt; tgt_request_handle+0x8ec/0x1440 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.925557]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0794a91&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_main+0xd21/0x1800 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.932824]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81539b0e&amp;gt;&amp;#93;&lt;/span&gt; ? thread_return+0x4e/0x7d0&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.939414]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0793d70&amp;gt;&amp;#93;&lt;/span&gt; ? ptlrpc_main+0x0/0x1800 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.946690]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810a138e&amp;gt;&amp;#93;&lt;/span&gt; kthread+0x9e/0xc0&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.952407]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c28a&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0xa/0x20&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.958206]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810a12f0&amp;gt;&amp;#93;&lt;/span&gt; ? kthread+0x0/0xc0&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.964016]  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c280&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0x0/0x20&lt;br/&gt;
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [  697.970033] &lt;br/&gt;
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: imklog 5.8.10, log source = /proc/kmsg started.&lt;br/&gt;
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [    0.000000] Initializing cgroup subsys cpuset&lt;br/&gt;
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [    0.000000] Initializing cgroup subsys cpu&lt;br/&gt;
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [    0.000000] Linux version 2.6.32-573.26.1.el6.dne2.x86_64 (jsimmons@lgmgmt1) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Jul 21 13:22:19 EDT 2016&lt;br/&gt;
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [    0.000000] Command line: initrd=initrd-2.6.32-573.26.1.el6.dne2.x86_64-gedi selinux=0 audit=0 panic=10 console=tty0 console=ttyS1,115200n8 console=ttyS2,115200n8 crashkernel=192M init=/gedi-preinit BOOT_IMAGE=vmlinuz-2.6.32-573.26.1.el6.dne2.x86_64 BOOTIF=01-20-47-47-7e-c5-e0 &lt;/p&gt;</comment>
                            <comment id="163179" author="yujian" created="Thu, 25 Aug 2016 19:58:47 +0000"  >&lt;p&gt;Thank you Dustin for the logs.&lt;/p&gt;

&lt;p&gt;Hi Alex,&lt;/p&gt;

&lt;p&gt;Could you please look into the logs? And although patch &lt;a href=&quot;http://review.whamcloud.com/22058&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/22058&lt;/a&gt; doesn&apos;t fix the issue in this ticket, do you think it is still worth having for Lustre b2_8_fe servers?&lt;/p&gt;</comment>
                            <comment id="163293" author="adilger" created="Fri, 26 Aug 2016 17:48:18 +0000"  >&lt;p&gt;It seems possible that this is related to creating or deleting a wide-striped file. I found &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3102&quot; title=&quot;kernel BUG at fs/jbd2/transaction.c:1033&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3102&quot;&gt;&lt;del&gt;LU-3102&lt;/del&gt;&lt;/a&gt;, which may be related.&lt;/p&gt;</comment>
                            <comment id="164637" author="simmonsja" created="Thu, 1 Sep 2016 16:38:59 +0000"  >&lt;p&gt;The first crash happened while running IOR on a large-stripe single shared file. After the first crash, this bug came back at random times.&lt;/p&gt;</comment>
                            <comment id="164781" author="bzzz" created="Fri, 2 Sep 2016 05:26:37 +0000"  >&lt;p&gt;Please tell us the latest commit on the branch you&apos;re using.&lt;/p&gt;</comment>
                            <comment id="164797" author="yujian" created="Fri, 2 Sep 2016 13:10:41 +0000"  >&lt;p&gt;Hi Alex,&lt;/p&gt;

&lt;p&gt;It&apos;s Lustre b2_8_fe branch with commit 71badf2 on the tip.&lt;/p&gt;</comment>
                            <comment id="165187" author="simmonsja" created="Wed, 7 Sep 2016 19:46:37 +0000"  >&lt;p&gt;Any updates?&lt;/p&gt;</comment>
                            <comment id="165317" author="bzzz" created="Thu, 8 Sep 2016 14:30:29 +0000"  >&lt;p&gt;I&apos;m trying to come up with a debugging patch.&lt;/p&gt;</comment>
                            <comment id="166004" author="bzzz" created="Wed, 14 Sep 2016 14:00:11 +0000"  >&lt;p&gt;Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0&lt;br/&gt;
Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0&lt;br/&gt;
Lustre: 18235:0:(osd_handler.c:1289:osd_trans_dump_creds()) insert: 1009/17152/0, delete: 2/9/1&lt;br/&gt;
Lustre: 18235:0:(osd_handler.c:1296:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/1&lt;/p&gt;

&lt;p&gt;The transaction declared at least 2+92+43356+52+17152+9+2+2=60667 credits, while it consumed only 1+1+1=3&lt;br/&gt;
(well, probably a bit more, as we don&apos;t track everything, but still far less than 60667).&lt;br/&gt;
It looks like a false assertion, though I don&apos;t understand the root cause yet.&lt;/p&gt;</comment>
                            <comment id="166084" author="adilger" created="Wed, 14 Sep 2016 21:56:51 +0000"  >&lt;p&gt;My first guess would be to take a look at the ldiskfs xattr inode (wide striping) patch.  It may be that it is declaring too many credits in some cases.&lt;/p&gt;</comment>
                            <comment id="167001" author="yong.fan" created="Fri, 23 Sep 2016 02:43:50 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;The &quot;xattr_set&quot; numbers seem abnormal; in which case can oti_declare_ops&lt;span class=&quot;error&quot;&gt;&amp;#91;xattr_set&amp;#93;&lt;/span&gt; be so much larger than oti_declare_ops_cred&lt;span class=&quot;error&quot;&gt;&amp;#91;xattr_set&amp;#93;&lt;/span&gt;?&lt;/p&gt;</comment>
                            <comment id="167175" author="yong.fan" created="Mon, 26 Sep 2016 08:16:40 +0000"  >&lt;p&gt;It is suspected that concurrent llog append operations, such as unlinking large-striped (large EA) files concurrently, can fill up a llog file in a very short time. Because the llog may be shared at the same time by many concurrent modifications, under some extreme cases an unlink of a large-striped file may not have declared credits to create a llog file; then, during the real llog append that records the striped OST-objects, it finds the current llog is full and needs to create a new llog file, but because the related credits were not declared, the lower layer may hit trouble.&lt;/p&gt;

&lt;p&gt;I am not sure whether it is exactly such a case or not, but since it is reproducible at the customer site, I will make a patch for verification.&lt;/p&gt;</comment>
                            <comment id="167176" author="gerrit" created="Mon, 26 Sep 2016 08:17:41 +0000"  >&lt;p&gt;Fan Yong (fan.yong@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/22726&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/22726&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8527&quot; title=&quot;Lustre 2.8 server crashed on bring up during large scale test shot&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8527&quot;&gt;&lt;del&gt;LU-8527&lt;/del&gt;&lt;/a&gt; obdclass: declare more credits for llog append&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: c38efe8101d4bc977ceb9f07b041ac56795dd90d&lt;/p&gt;</comment>
                            <comment id="167179" author="bzzz" created="Mon, 26 Sep 2016 09:15:21 +0000"  >&lt;p&gt;Fan Yong, there was no problem with write credits in the customer&apos;s case:&lt;br/&gt;
Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0&lt;br/&gt;
not a single credit was used for writes.&lt;/p&gt;

&lt;p&gt;We do declare an additional create:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-style: solid;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;border-bottom-style: solid;&quot;&gt;&lt;b&gt;Bar.java&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;	&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!IS_ERR_OR_NULL(next)) {
		&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!llog_exist(next)) {
			&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (dt_object_remote(cathandle-&amp;gt;lgh_obj)) {
                        } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; {
                                 rc = llog_declare_create(env, next, th);
                                 llog_declare_write_rec(env, cathandle, &amp;amp;lirec-&amp;gt;lid_hdr, -1, th);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you think it&apos;s possible to hit a race where we fail to declare the additional llog (say, the next one was just created), then it makes sense to add an assertion and hit it.&lt;/p&gt;</comment>
                            <comment id="167180" author="yong.fan" created="Mon, 26 Sep 2016 09:49:31 +0000"  >&lt;p&gt;The case may be that when the unlink RPC service thread declares, the next llog already exists, so it does not declare credits for creating a new llog file; but after the unlink transaction has started, the related llog file may have been filled up by others (plus itself). It will then try to create a new llog file using the credits declared for the subsequent write, which can cause transaction trouble.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I am not sure whether this is the same as the original trouble. In our current implementation there are some nested transactions, meaning one transaction is started during another transaction&apos;s declaration, which may make the OSD layer&apos;s credit information incorrect. That is another issue to be fixed.&lt;/p&gt;</comment>
                            <comment id="167181" author="bzzz" created="Mon, 26 Sep 2016 09:57:40 +0000"  >&lt;p&gt;Check llog_cat_current_log(): when the current log is full, the next one is taken and the pointer is set to NULL. In any case, there is no point in declaring yet another &quot;next&quot;; we just need to make sure that we&apos;ve declared at least one. For example, if we find &quot;next&quot; already existing, we can repeat from the beginning.&lt;/p&gt;</comment>
                            <comment id="167183" author="yong.fan" created="Mon, 26 Sep 2016 10:10:26 +0000"  >&lt;p&gt;As you can see:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;int llog_cat_declare_add_rec(const struct lu_env *env,
                             struct llog_handle *cathandle,
                             struct llog_rec_hdr *rec, struct thandle *th)
{
...
        if (!llog_exist(cathandle-&amp;gt;u.chd.chd_current_log)) {
...
        next = cathandle-&amp;gt;u.chd.chd_next_log;
        if (next) {
(1)==&amp;gt;                if (!llog_exist(next)) {
                                 /* declare create */
...
}

static struct llog_handle *llog_cat_current_log(struct llog_handle *cathandle,
                                                struct thandle *th)
{
...
        loghandle = cathandle-&amp;gt;u.chd.chd_next_log;
        cathandle-&amp;gt;u.chd.chd_current_log = loghandle;
(2)==&amp;gt;        cathandle-&amp;gt;u.chd.chd_next_log = NULL;
...
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;During the above checking we do NOT take &quot;cathandle-&amp;gt;lgh_lock&quot;, so it is possible (at least in theory) that thread1 has already reached line (1) when thread2 sets &quot;chd_current_log&quot; to &quot;chd_next_log&quot;, sets &quot;chd_next_log&quot; to NULL at line (2), and then creates the related llog object via llog_cat_new_log(). Thread1 will then find that the &quot;next&quot; exists and will not declare its creation. Such a case should exist; otherwise the check at line (1) would be redundant and could be replaced with &quot;LASSERT(!llog_exist(next))&quot;.&lt;/p&gt;</comment>
                            <comment id="167185" author="bzzz" created="Mon, 26 Sep 2016 10:21:12 +0000"  >&lt;p&gt;Yes, this is possible, but we usually declare a &lt;em&gt;few&lt;/em&gt; creations because of striping (those are independent from the llog point of view). I think it&apos;s unlikely to hit this race often, since most of the time the new llog declarations are not used.&lt;br/&gt;
Also, we do have per-operation accounting in OSD, and I can&apos;t remember any report of a lack of credits at a create operation. Notice that OSD checks credits before and after each operation; in all the cases it&apos;s the write which fails. This is why I suggested enabling the assertion in the declaration checks; hopefully we&apos;ll see something useful.&lt;br/&gt;
My current suspicion is the write itself; probably it&apos;s too optimistic.&lt;/p&gt;</comment>
                            <comment id="167187" author="yong.fan" created="Mon, 26 Sep 2016 10:34:18 +0000"  >&lt;p&gt;Yes, it is unlikely. One important reason is that we seldom test large-striped files during our daily Maloo tests, which may hide some corner cases in llog handle switching. In this case, large striping is the factor most different from our daily testing, so I suspect that point.&lt;/p&gt;

&lt;p&gt;As for the write itself being too optimistic, I have checked the related code. The &quot;offset&quot; is the most important parameter for deciding how to optimise, but in the llog case this parameter for write is either &quot;0&quot; (to modify the header, including the bitmap) or &quot;-1&quot; (to append a record). For these two cases the OSD layer does very limited optimisation, so I cannot find any suspect points.&lt;/p&gt;</comment>
                            <comment id="167190" author="bzzz" created="Mon, 26 Sep 2016 11:20:25 +0000"  >&lt;p&gt;If create were missing a declaration or credits, then we&apos;d get a warning or an assertion (when enabled). I just checked with a skipped declaration and low credits.&lt;/p&gt;</comment>
                            <comment id="167191" author="yong.fan" created="Mon, 26 Sep 2016 11:56:21 +0000"  >&lt;p&gt;Another suspect point: &quot;osd_thandle::ot_credits&quot; is defined as &quot;unsigned short&quot;. Are two bytes large enough for the credit declarations of large-striping operations? Possible overflow?&lt;/p&gt;</comment>
                            <comment id="167192" author="bzzz" created="Mon, 26 Sep 2016 12:00:49 +0000"  >&lt;p&gt;I was looking at this as well. It can potentially explain our assertion, but not the BUG() in JBD2.&lt;/p&gt;</comment>
                            <comment id="167570" author="gerrit" created="Wed, 28 Sep 2016 12:02:44 +0000"  >&lt;p&gt;Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/22782&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/22782&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8527&quot; title=&quot;Lustre 2.8 server crashed on bring up during large scale test shot&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8527&quot;&gt;&lt;del&gt;LU-8527&lt;/del&gt;&lt;/a&gt; osd: ot_credits must be 32bit&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: c6ed09256d2d76e255a3decd1d54b756b6360977&lt;/p&gt;</comment>
                            <comment id="168802" author="gerrit" created="Sat, 8 Oct 2016 16:37:59 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/22782/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/22782/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8527&quot; title=&quot;Lustre 2.8 server crashed on bring up during large scale test shot&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8527&quot;&gt;&lt;del&gt;LU-8527&lt;/del&gt;&lt;/a&gt; osd: ot_credits must be 32bit&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 9edc89c11eb673b0c0da08381a6a779e40ff8ac2&lt;/p&gt;</comment>
                            <comment id="168819" author="pjones" created="Sat, 8 Oct 2016 19:04:09 +0000"  >&lt;p&gt;Landed for 2.9&lt;/p&gt;</comment>
                            <comment id="170248" author="simmonsja" created="Tue, 18 Oct 2016 21:53:20 +0000"  >&lt;p&gt;We were able to reproduce the problem at smaller scale, and after we applied the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8527&quot; title=&quot;Lustre 2.8 server crashed on bring up during large scale test shot&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8527&quot;&gt;&lt;del&gt;LU-8527&lt;/del&gt;&lt;/a&gt; patch the MDS no longer crashed. Thank you.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="18236">LU-3102</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="37556">LU-8267</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="22726" name="dmesg.txt" size="128825" author="simmonsja" created="Tue, 23 Aug 2016 16:32:40 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzylt3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>