<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:01:34 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6596] GPF: RIP [&lt;ffffffffa076924b&gt;] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]</title>
                <link>https://jira.whamcloud.com/browse/LU-6596</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Since its update from 2.5.3 to 2.5.3.90, one of our customers has been hitting the following GPF on client nodes while the MDT is in recovery.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&amp;lt;4&amp;gt;general protection fault: 0000 [0000001] SMP
&amp;lt;4&amp;gt;last sysfs file: /sys/module/ipv6/initstate
&amp;lt;4&amp;gt;CPU 21
&amp;lt;4&amp;gt;Modules linked in: iptable_mangle iptable_filter lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_sa(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) mic(U) uinput ipmi_si ipmi_msghandler sg compat(U) lpc_ich mfd_core ioatdma myri10ge igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache ahci sd_mod crc_t10dif dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: scsi_wait_scan]
&amp;lt;4&amp;gt;
&amp;lt;4&amp;gt;Pid: 10457, comm: ptlrpcd_rcv Not tainted 2.6.32-504.12.2.el6.Bull.72.x86_64 0000001 BULL bullx &lt;span class=&quot;code-keyword&quot;&gt;super&lt;/span&gt;-node
&amp;lt;4&amp;gt;RIP: 0010:[&amp;lt;ffffffffa076924b&amp;gt;] [&amp;lt;ffffffffa076924b&amp;gt;] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]
&amp;lt;4&amp;gt;RSP: 0018:ffff88086d375bb0 EFLAGS: 00010296
&amp;lt;4&amp;gt;RAX: 5a5a5a5a5a5a5a5a RBX: ffff88107ad2d800 RCX: ffff8806e2f65d10
&amp;lt;4&amp;gt;RDX: ffff88107ad2d8b0 RSI: ffff88086d375c1c RDI: ffff88107ad2d800
&amp;lt;4&amp;gt;RBP: ffff88086d375be0 R08: 0000000000000000 R09: 0000000000000000
&amp;lt;4&amp;gt;R10: ffff88087ca2fe50 R11: 0000000000000000 R12: 0000000000000000
&amp;lt;4&amp;gt;R13: ffff88107ad2da90 R14: ffff88086d375c1c R15: ffff881e7f308000
&amp;lt;4&amp;gt;FS: 0000000000000000(0000) GS:ffff88089c540000(0000) knlGS:0000000000000000
&amp;lt;4&amp;gt;CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
&amp;lt;4&amp;gt;CR2: 00002ba018b2c000 CR3: 0000000001a85000 CR4: 00000000000007e0
&amp;lt;4&amp;gt;DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
&amp;lt;4&amp;gt;DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
&amp;lt;4&amp;gt;&lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; ptlrpcd_rcv (pid: 10457, threadinfo ffff88086d374000, task ffff88087b2b6040)
&amp;lt;4&amp;gt;Stack:
&amp;lt;4&amp;gt; 0000000000000000 ffff88107ad2d800 ffff880917d81800 ffff88107ad2da90
&amp;lt;4&amp;gt;&amp;lt;d&amp;gt; 0000000000000000 ffff880d22697cc0 ffff88086d375c40 ffffffffa078e570
&amp;lt;4&amp;gt;&amp;lt;d&amp;gt; 0000000000000000 ffff88107b316078 0000000000000000 ffff880d22697cc0
&amp;lt;4&amp;gt;Call Trace:
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa078e570&amp;gt;] ptlrpc_import_recovery_state_machine+0x360/0xc30 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa078fc69&amp;gt;] ptlrpc_connect_interpret+0x779/0x21d0 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0784d6b&amp;gt;] ? ptlrpc_pinger_commit_expected+0x1b/0x90 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa076605d&amp;gt;] ptlrpc_check_set+0x31d/0x1c20 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81087fdb&amp;gt;] ? try_to_del_timer_sync+0x7b/0xe0
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0792613&amp;gt;] ptlrpcd_check+0x533/0x550 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0792b2b&amp;gt;] ptlrpcd+0x20b/0x370 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81064b90&amp;gt;] ? default_wake_function+0x0/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0792920&amp;gt;] ? ptlrpcd+0x0/0x370 [ptlrpc]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109e66e&amp;gt;] kthread+0x9e/0xc0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c20a&amp;gt;] child_rip+0xa/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109e5d0&amp;gt;] ? kthread+0x0/0xc0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c200&amp;gt;] ? child_rip+0x0/0x20
&amp;lt;4&amp;gt;Code: c0 00 00 00 48 8b 00 48 39 c2 48 89 83 c0 00 00 00 75 18 eb 23 0f 1f 00 48 8b 00 48 39 c2 48 89 83 c0 00 00 00 0f 84 8c 00 00 00 &amp;lt;4c&amp;gt; 3b 60 f0 4c 8d b8 f0 fe ff ff 73 e0 4d 85 ff 74 7a f6 83 95
&amp;lt;1&amp;gt;RIP [&amp;lt;ffffffffa076924b&amp;gt;] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]
&amp;lt;4&amp;gt; RSP &amp;lt;ffff88086d375bb0&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Our MDS is an active/passive HA cluster. This GPF can occur during failover or failback of the MDT.&lt;/p&gt;

&lt;p&gt;Occurred on 05/04, 05/07 and 05/12. During the last occurrence, we lost 200 compute nodes and 2 login nodes.&lt;/p&gt;

&lt;p&gt;The stack looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6022&quot; title=&quot;replay-single test 73c hung: RIP: ptlrpc_replay_next+0xdb/0x380 [ptlrpc]&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6022&quot;&gt;LU-6022&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Could you help us on this one?&lt;/p&gt;</description>
                <environment>kernel 2.6.32-504.12.2.el6&lt;br/&gt;
lustre-2.5.3.90 w/ some bullpatches on clients and servers</environment>
        <key id="30059">LU-6596</key>
            <summary>GPF: RIP [&lt;ffffffffa076924b&gt;] ptlrpc_replay_next+0xdb/0x380 [ptlrpc]</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="bruno.travouillon">Bruno Travouillon</reporter>
                        <labels>
                    </labels>
                <created>Wed, 13 May 2015 15:30:34 +0000</created>
                <updated>Mon, 2 Nov 2015 17:02:37 +0000</updated>
                            <resolved>Mon, 2 Nov 2015 17:02:37 +0000</resolved>
                                    <version>Lustre 2.5.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>12</watches>
                                                                            <comments>
                            <comment id="115196" author="bruno.travouillon" created="Wed, 13 May 2015 15:40:41 +0000"  >&lt;p&gt;FTR, our build includes the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt; patch &quot;ptlrpc: prevent request timeout grow due to recovery&quot;&lt;/p&gt;</comment>
                            <comment id="115228" author="pjones" created="Wed, 13 May 2015 18:30:00 +0000"  >&lt;p&gt;Bruno F&lt;/p&gt;

&lt;p&gt;Could you please advise on this one?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="115636" author="bfaccini" created="Mon, 18 May 2015 09:31:50 +0000"  >&lt;p&gt;Hello Bruno,&lt;br/&gt;
Is it possible to get access to one of these clients&apos; crash-dumps?&lt;br/&gt;
Also, since you have indicated that this can also occur during MDS fail-back, could you enable full debug on a set (or all?) of the client nodes, in order to catch as much info/traces as possible at the time of the next occurrence?&lt;br/&gt;
Thanks for highlighting that the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt; has already been integrated into your distro; can you also detail the full list of patches applied on top of 2.5.3.90, on both the server and client sides?&lt;/p&gt;
</comment>
                            <comment id="115979" author="sebastien.buisson" created="Wed, 20 May 2015 07:18:17 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;As for the patches that we apply on top of 2.5.3.90, here is the list you requested:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5740&quot; title=&quot;Kernel upgrade [RHEL6.6 2.6.32-504.el6]&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5740&quot;&gt;&lt;del&gt;LU-5740&lt;/del&gt;&lt;/a&gt;: a 2.5 specific version of patch &lt;a href=&quot;http://review.whamcloud.com/12609&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12609&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4582&quot; title=&quot;After failing over Lustre MGS node to the secondary, client mount fails with -5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4582&quot;&gt;&lt;del&gt;LU-4582&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/9217&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9217&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5678&quot; title=&quot;kernel crash due to NULL pointer dereference in kiblnd_pool_alloc_node()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5678&quot;&gt;&lt;del&gt;LU-5678&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/12852&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12852&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5393&quot; title=&quot;LBUG: (ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5393&quot;&gt;&lt;del&gt;LU-5393&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/13707&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13707&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3727&quot; title=&quot;LBUG (llite_nfs.c:281:ll_get_parent()) ASSERTION(body-&amp;gt;valid &amp;amp; OBD_MD_FLID) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3727&quot;&gt;&lt;del&gt;LU-3727&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/13270&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13270&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/11751&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11751&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5522&quot; title=&quot;ofd_prolong_extent_locks()) ASSERTION( lock-&amp;gt;l_flags &amp;amp; 0x0000000000000020ULL ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5522&quot;&gt;&lt;del&gt;LU-5522&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/11634&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11634&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5264&quot; title=&quot;ASSERTION( info-&amp;gt;oti_r_locks == 0 ) at OST umount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5264&quot;&gt;&lt;del&gt;LU-5264&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/13103&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13103&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6049&quot; title=&quot;General Protection Fault at echo_session_key_fini+0xa9&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6049&quot;&gt;&lt;del&gt;LU-6049&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/13164&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13164&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6084&quot; title=&quot;Tests are failed due to &amp;#39;recovery is aborted by hard timeout&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6084&quot;&gt;&lt;del&gt;LU-6084&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/13685&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13685&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5764&quot; title=&quot;Crash of MDS on &amp;quot;apparent buffer overflow&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5764&quot;&gt;&lt;del&gt;LU-5764&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;http://review.whamcloud.com/13413&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13413&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Cheers,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="116058" author="bruno.travouillon" created="Wed, 20 May 2015 20:18:55 +0000"  >&lt;p&gt;Hello Bruno,&lt;/p&gt;

&lt;p&gt;The customer is a black site. We can schedule a meeting to work on the crash-dump on-site, if you want.&lt;br/&gt;
You are right; I will instruct the local support team to enable full debug in case of a failback. We will set debug=ALL and slightly increase the debug_size.&lt;/p&gt;</comment>
                            <comment id="122095" author="green" created="Fri, 24 Jul 2015 04:41:31 +0000"  >&lt;p&gt;I just hit this (or so I think) during my regular testing of master. I have a crashdump, so that should give us quite a bit of extra info.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;1&amp;gt;[26215.496409] BUG: unable to handle kernel paging request at ffffffffffffffe8
&amp;lt;1&amp;gt;[26215.496730] IP: [&amp;lt;ffffffffa16cf30b&amp;gt;] ptlrpc_replay_next+0xdb/0x370 [ptlrpc]
&amp;lt;4&amp;gt;[26215.497079] PGD 1a27067 PUD 1a28067 PMD 0 
&amp;lt;4&amp;gt;[26215.497358] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
&amp;lt;4&amp;gt;[26215.497650] last sysfs file: /sys/devices/system/cpu/possible
&amp;lt;4&amp;gt;[26215.497954] CPU 7 
&amp;lt;4&amp;gt;[26215.497996] Modules linked in: lustre ofd osp lod ost mdt mdd mgs osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate exportfs jbd sha512_generic sha256_generic ext4 jbd2 mbcache virtio_console virtio_balloon i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
&amp;lt;4&amp;gt;[26215.500111] 
&amp;lt;4&amp;gt;[26215.500111] Pid: 14755, comm: ptlrpcd_rcv Tainted: P           ---------------    2.6.32-rhe6.6-debug #1 Red Hat KVM
&amp;lt;4&amp;gt;[26215.500111] RIP: 0010:[&amp;lt;ffffffffa16cf30b&amp;gt;]  [&amp;lt;ffffffffa16cf30b&amp;gt;] ptlrpc_replay_next+0xdb/0x370 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111] RSP: 0018:ffff880098b55b10  EFLAGS: 00010286
&amp;lt;4&amp;gt;[26215.500111] RAX: 0000000000000000 RBX: ffff8800789617f0 RCX: 0000000000000000
&amp;lt;4&amp;gt;[26215.500111] RDX: ffff8800789618b8 RSI: 0000000000000000 RDI: ffff8800b051ef70
&amp;lt;4&amp;gt;[26215.500111] RBP: ffff880098b55b40 R08: 00000000fffffffb R09: 00000000fffffffe
&amp;lt;4&amp;gt;[26215.500111] R10: 0000000000000000 R11: 000000000000005a R12: 0000000000000000
&amp;lt;4&amp;gt;[26215.500111] R13: ffff880078961ad0 R14: ffff880098b55b7c R15: ffff880075689ce8
&amp;lt;4&amp;gt;[26215.500111] FS:  0000000000000000(0000) GS:ffff8800063c0000(0000) knlGS:0000000000000000
&amp;lt;4&amp;gt;[26215.500111] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
&amp;lt;4&amp;gt;[26215.500111] CR2: ffffffffffffffe8 CR3: 000000008a94b000 CR4: 00000000000006e0
&amp;lt;4&amp;gt;[26215.500111] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
&amp;lt;4&amp;gt;[26215.500111] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
&amp;lt;4&amp;gt;[26215.500111] Process ptlrpcd_rcv (pid: 14755, threadinfo ffff880098b54000, task ffff8800b43d02c0)
&amp;lt;4&amp;gt;[26215.500111] Stack:
&amp;lt;4&amp;gt;[26215.500111]  0000000000000000 ffff8800789617f0 0000000000000000 ffffffffa177a633
&amp;lt;4&amp;gt;[26215.500111] &amp;lt;d&amp;gt; ffff8800b03b9ce8 ffff880078961ad0 ffff880098b55ba0 ffffffffa16f3830
&amp;lt;4&amp;gt;[26215.500111] &amp;lt;d&amp;gt; 0000000000000010 ffff880098b55bb0 ffff880098b55b70 ffff8800789617f0
&amp;lt;4&amp;gt;[26215.500111] Call Trace:
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffffa16f3830&amp;gt;] ptlrpc_import_recovery_state_machine+0x360/0xc20 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffffa16f6bac&amp;gt;] ptlrpc_connect_interpret+0xc4c/0x2570 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffffa16ec59a&amp;gt;] ? ptlrpc_update_next_ping+0x4a/0xd0 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffffa16cc2c3&amp;gt;] ptlrpc_check_set+0x613/0x1bf0 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffff81522574&amp;gt;] ? _spin_lock_irqsave+0x24/0x30
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffffa16fa2d3&amp;gt;] ptlrpcd_check+0x3e3/0x630 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffffa16fa83b&amp;gt;] ptlrpcd+0x31b/0x500 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffff81061630&amp;gt;] ? default_wake_function+0x0/0x20
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffffa16fa520&amp;gt;] ? ptlrpcd+0x0/0x500 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffff8109ce4e&amp;gt;] kthread+0x9e/0xc0
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffff8100c24a&amp;gt;] child_rip+0xa/0x20
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffff8109cdb0&amp;gt;] ? kthread+0x0/0xc0
&amp;lt;4&amp;gt;[26215.500111]  [&amp;lt;ffffffff8100c240&amp;gt;] ? child_rip+0x0/0x20
&amp;lt;4&amp;gt;[26215.500111] Code: 48 8b 00 48 39 c2 48 89 83 d8 00 00 00 75 1c eb 24 0f 1f 80 00 00 00 00 48 8b 00 48 39 c2 48 89 83 d8 00 00 00 0f 84 7c 00 00 00 &amp;lt;4c&amp;gt; 3b 60 e8 4c 8d 78 80 73 e3 4d 85 ff 74 6d f6 83 f9 02 00 00 
&amp;lt;1&amp;gt;[26215.500111] RIP  [&amp;lt;ffffffffa16cf30b&amp;gt;] ptlrpc_replay_next+0xdb/0x370 [ptlrpc]
&amp;lt;4&amp;gt;[26215.500111]  RSP &amp;lt;ffff880098b55b10&amp;gt;
&amp;lt;4&amp;gt;[26215.500111] CR2: ffffffffffffffe8
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;(gdb) l *(ptlrpc_replay_next+0xdb)
0x4a33b is in ptlrpc_replay_next (/home/green/git/lustre-release/lustre/ptlrpc/recover.c:129).
124				&lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (imp-&amp;gt;imp_replay_cursor !=
125				       &amp;amp;imp-&amp;gt;imp_committed_list) {
126					req = list_entry(imp-&amp;gt;imp_replay_cursor,
127							     struct ptlrpc_request,
128							     rq_replay_list);
129					&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (req-&amp;gt;rq_transno &amp;gt; last_transno)
130						&lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
131	
132					req = NULL;
133					imp-&amp;gt;imp_replay_cursor =
(gdb) p (&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;)0xffffffffffffffe8
$1 = -24
(gdb) p &amp;amp;((struct ptlrpc_request *) 0x0)-&amp;gt;rq_transno
$1 = (__u64 *) 0x68
(gdb) p (&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;)0xffffffffffffffe8 - 0x68
$2 = -128
(gdb) quit
[green@intelbox lustre-release]$ grep 128 ~/bk/linux-2.6.32-504.3.3.el6-debug/include/asm-&lt;span class=&quot;code-keyword&quot;&gt;generic&lt;/span&gt;/errno.h 
#define	EKEYREVOKED	128	&lt;span class=&quot;code-comment&quot;&gt;/* Key has been revoked */&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Of course, this -128 might be total garbage from somewhere, and some differences have accumulated in the code since, so this crash might or might not be related.&lt;/p&gt;

&lt;p&gt;The crashdump is on my box in /exports/crashdumps/192.168.10.221-2015-07-23-18\:24\:10; the tag in my tree is master-20150723.&lt;/p&gt;</comment>
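The arithmetic in the gdb session above can be checked in a few lines. This is a sketch for illustration only (Python rather than C, so it stands alone); the 0x68 offset of rq_transno is the value gdb printed for this particular build, not a general constant:

```python
# Reproduce the pointer arithmetic from the gdb session above.
# Assumption: rq_transno sits at offset 0x68 in struct ptlrpc_request,
# as gdb printed for this particular build.
FAULT_ADDR = 0xFFFFFFFFFFFFFFE8   # CR2 from the oops, i.e. the bad read
RQ_TRANSNO_OFF = 0x68             # offsetof(struct ptlrpc_request, rq_transno)

def signed64(value):
    """Interpret a 64-bit word as a signed (two's-complement) integer."""
    return value - 2**64 if value >= 2**63 else value

fault = signed64(FAULT_ADDR)      # -24: the address the kernel tried to read
req = fault - RQ_TRANSNO_OFF      # the bogus req pointer implied by the load
print(fault, req)                 # prints: -24 -128
```

Running it prints -24 and -128; grepping errno.h for 128 gives EKEYREVOKED, which is how the -EKEYREVOKED guess above was obtained.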
                            <comment id="124858" author="bfaccini" created="Sun, 23 Aug 2015 08:28:27 +0000"  >&lt;p&gt;Hmm, according to the crash-dump info and the related source/assembly code, it seems the problem occurs in the following piece of code in ptlrpc_replay_next():&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;119                         /* Since the imp_committed_list is immutable before
120                          * all of it&apos;s requests being replayed, it&apos;s safe to
121                          * use a cursor to accelerate the search */
122                         imp-&amp;gt;imp_replay_cursor = imp-&amp;gt;imp_replay_cursor-&amp;gt;next;
123 
124                         while (imp-&amp;gt;imp_replay_cursor !=
125                                &amp;amp;imp-&amp;gt;imp_committed_list) {
126                                 req = list_entry(imp-&amp;gt;imp_replay_cursor,
127                                                      struct ptlrpc_request,
128                                                      rq_replay_list);
129                                 if (req-&amp;gt;rq_transno &amp;gt; last_transno)
130                                         break;
131 
132                                 req = NULL;
133                                 imp-&amp;gt;imp_replay_cursor =
134                                         imp-&amp;gt;imp_replay_cursor-&amp;gt;next;
135                         }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;because imp-&amp;gt;imp_replay_cursor has become corrupted, likely due to a race. We may need to add some mutual-exclusion protection to make this code safe.&lt;br/&gt;
Just to confirm, Bruno, can you check the imp_replay_cursor field of the obd_import struct at address 0xffff88107ad2d800 in your crash-dump? If you could also provide the full obd_import struct, that would be very helpful too.&lt;/p&gt;</comment>
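The cursor walk quoted above, plus the mutual-exclusion protection suggested in that comment, can be modelled in user space. This is a hedged sketch (Python rather than the kernel C; the lock is the hypothetical protection under discussion, not an existing Lustre primitive), using a list index in place of the imp_replay_cursor pointer so a stale cursor cannot dangle:

```python
# User-space model of the cursor walk in ptlrpc_replay_next() (recover.c:122-135).
# This is an illustrative sketch, not the kernel code: the list holds bare
# transaction numbers instead of ptlrpc_request structs, the cursor is a list
# index instead of a list_head pointer, and the lock is the hypothetical
# mutual-exclusion protection discussed in this ticket.
import threading

class Import:
    """Stand-in for the parts of struct obd_import used by the walk."""
    def __init__(self):
        self.committed_list = []      # imp_committed_list (transnos only)
        self.replay_cursor = 0        # imp_replay_cursor
        self.lock = threading.Lock()  # proposed protection (hypothetical)

def replay_next(imp, last_transno):
    """Return the first committed transno past last_transno, or None."""
    with imp.lock:  # without this, a concurrent reset/free of the list
                    # can leave the cursor dangling, which is the GPF here
        while imp.replay_cursor != len(imp.committed_list):
            transno = imp.committed_list[imp.replay_cursor]
            if transno > last_transno:
                return transno
            imp.replay_cursor += 1   # cursor accelerates later searches
        return None
```

The cursor persists across calls, which is exactly the acceleration the kernel comment describes, and also why a cursor left pointing at freed memory is fatal.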
                            <comment id="124868" author="bfaccini" created="Mon, 24 Aug 2015 08:22:51 +0000"  >&lt;p&gt;To be complete, the GPF occurs at line:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;129                                 if (req-&amp;gt;rq_transno &amp;gt; last_transno)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;when dereferencing req, which comes directly from imp-&amp;gt;imp_replay_cursor having been corrupted, either at the loop initialization line:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;122                         imp-&amp;gt;imp_replay_cursor = imp-&amp;gt;imp_replay_cursor-&amp;gt;next;&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;or inside the while loop, at lines:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;133                                 imp-&amp;gt;imp_replay_cursor =
134                                         imp-&amp;gt;imp_replay_cursor-&amp;gt;next;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;with an LP_POISON value (Bull&apos;s case) when the freed memory has not yet been re-used or overwritten, or with 0s (Oleg&apos;s case) when it has already been re-allocated ...&lt;/p&gt;

&lt;p&gt;The previous imp-&amp;gt;imp_replay_cursor pointer value cannot be retrieved from the crash-dumps, according to the routine&apos;s assembly code, so for now we can only reason about the different possible corruption scenarios.&lt;br/&gt;
And since a race does not seem to have occurred, at least based on my analysis of the Lustre debug log extracted from the crash-dump in Oleg&apos;s case, I have been thinking about another possibility: the original imp-&amp;gt;imp_replay_cursor value, when starting the current replay/recovery round, could have been out-dated and pointing to an old req ...&lt;br/&gt;
If this is the case, the recent patches for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6802&quot; title=&quot;sanity test_208 fail: &#8220;lease not broken over recovery&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6802&quot;&gt;&lt;del&gt;LU-6802&lt;/del&gt;&lt;/a&gt; may avoid this problem, because they change the recovery mechanism to always reset imp-&amp;gt;imp_replay_cursor to the beginning of the imp-&amp;gt;imp_committed_list.&lt;/p&gt;

&lt;p&gt;Bruno, I have other questions/requests ... Do you still frequently experience this problem during MDT fail-over/fail-back? If so, would it be possible for you to try integrating the two patches for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6802&quot; title=&quot;sanity test_208 fail: &#8220;lease not broken over recovery&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6802&quot;&gt;&lt;del&gt;LU-6802&lt;/del&gt;&lt;/a&gt;? Also, can you provide the full dmesg (from the &quot;log&quot; crash sub-command output) from your crash-dump?&lt;/p&gt;</comment>
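The two corruption signatures described above (poison fill vs. zeros) can be told apart mechanically when triaging crash-dumps. A minimal sketch, assuming the 0x5a5a... fill visible in RAX in the Bull oops is this build's LP_POISON pattern (an assumption about this build, not a general constant):

```python
# Quick classifier for the bogus imp_replay_cursor values seen in these
# crash-dumps. LP_POISON here is the 0x5a byte-fill pattern visible in RAX
# in the Bull oops above (assumed to match this build's poison value).
LP_POISON = 0x5A5A5A5A5A5A5A5A

def classify(ptr):
    """Guess what a corrupted 64-bit cursor value used to be."""
    if ptr == 0:
        return "zeroed (freed and re-allocated, Oleg's case)"
    if ptr == LP_POISON:
        return "poisoned (freed, not yet re-used, Bull's case)"
    if ptr >= 0xFFFFFFFFFFFFF000:
        return "small negative value, possibly an errno"
    return "unknown"
```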
                            <comment id="125496" author="bfaccini" created="Fri, 28 Aug 2015 08:20:26 +0000"  >&lt;p&gt;I have spent some time on-site looking at a bunch of Lustre client crash dumps taken during recovery after MDS/MDT failure/restart, and here are more facts/details:&lt;br/&gt;
                        _ the problem can occur during MDS fail-over or fail-back operations, but also during an MDS reboot in a non-HA configuration, or an MDT target re-start/mount.&lt;br/&gt;
                        _ during a single occurrence, one or multiple (sometimes &amp;gt;200!) Lustre clients can be affected/crash, but there have also been occurrences without any client crash.&lt;br/&gt;
                        _ client crashes are very often GPFs of the same kind as described in this ticket, with multiple different invalid values of imp-&amp;gt;imp_replay_cursor; but there are also a few cases of LBUGs or kernel BUG()s that I analyzed on-site, and they too come only from imp-&amp;gt;imp_replay_cursor having been corrupted with a value that is wrong (i.e., not pointing to a ptlrpc_request struct as expected) but valid (from a kernel virtual-memory addressing point of view).&lt;br/&gt;
                        _ the clients&apos; configured Lustre debug mask is only WARN+EMERG+ERROR+CONSOLE, so the Lustre debug log extracted from each crash-dump does not help identify whether the imp-&amp;gt;imp_replay_cursor corruption comes from a race during imp_committed_list access or from use of an out-dated value. Instructions are in place to ensure that upon a new MDS/MDT re-start occurrence, the full/-1 debug mask will be enabled in order to catch the full recovery process.&lt;/p&gt;

&lt;p&gt;Based on all of this, I still feel pretty confident that, as I already indicated, the two patches for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6802&quot; title=&quot;sanity test_208 fail: &#8220;lease not broken over recovery&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6802&quot;&gt;&lt;del&gt;LU-6802&lt;/del&gt;&lt;/a&gt; are likely the fix for this ticket.&lt;/p&gt;</comment>
                            <comment id="130177" author="adilger" created="Tue, 13 Oct 2015 05:30:29 +0000"  >&lt;p&gt;Should this be closed as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6802&quot; title=&quot;sanity test_208 fail: &#8220;lease not broken over recovery&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6802&quot;&gt;&lt;del&gt;LU-6802&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="132366" author="jay" created="Mon, 2 Nov 2015 17:02:37 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6802&quot; title=&quot;sanity test_208 fail: &#8220;lease not broken over recovery&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6802&quot;&gt;&lt;del&gt;LU-6802&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="30952">LU-6802</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="27888">LU-6022</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxd6v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>