<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:27:30 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2704] GPF in __d_lookup called from ll_statahead_one</title>
                <link>https://jira.whamcloud.com/browse/LU-2704</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We recently had a report of a General Protection Fault on a client node running a purge workload.&lt;/p&gt;

&lt;p&gt;The console message:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu11/cache/index2/shared_cpu_map
CPU 0 
Modules linked in: cpufreq_ondemand nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput ahci ib_qib(U) ib_mad ib_core dcdbas microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp xt_owner ipt_LOG xt_multiport ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded: cpufreq_ondemand]

Pid: 3895, comm: ll_sa_25915 Tainted: G        W  ----------------   2.6.32-220.23.1.1chaos.ch5.x86_64 #1 Dell       XS23-TY35       /0GW08P
RIP: 0010:[&amp;lt;ffffffff8118fc8c&amp;gt;]  [&amp;lt;ffffffff8118fc8c&amp;gt;] __d_lookup+0x8c/0x150
RSP: 0018:ffff88049ad3dcc0  EFLAGS: 00010202
RAX: 000000000000000f RBX: 2e342036343a3732 RCX: 0000000000000016
RDX: 018721e08df08940 RSI: ffff88049ad3ddc0 RDI: ffff880421557300
RBP: ffff88049ad3dd10 R08: ffff880589fdad30 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: 2e342036343a371a
R13: ffff880421557300 R14: 000000009e75e374 R15: ffff8803a22822b8
FS:  00002aaaab05db20(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffffffc010 CR3: 0000000297c13000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Process ll_sa_25915 (pid: 3895, threadinfo ffff88049ad3c000, task ffff8804f49a0aa0)
Stack:
 ffff8802fc86b3b8 0000000f00000246 000000000000000f ffff88049ad3ddc0
&amp;lt;0&amp;gt; ffff880028215fc0 0000000002170c3c ffff88049ad3ddc0 ffff880421557300
&amp;lt;0&amp;gt; ffff880421557300 ffff8803a22822b8 ffff88049ad3dd40 ffffffff811908fc
Call Trace:
 [&amp;lt;ffffffff811908fc&amp;gt;] d_lookup+0x3c/0x60
 [&amp;lt;ffffffffa09b672c&amp;gt;] ll_statahead_one+0x1ec/0x14a0 [lustre]
 [&amp;lt;ffffffff81051ba3&amp;gt;] ? __wake_up+0x53/0x70
 [&amp;lt;ffffffff8109144c&amp;gt;] ? remove_wait_queue+0x3c/0x50
 [&amp;lt;ffffffffa09b7c98&amp;gt;] ll_statahead_thread+0x2b8/0x890 [lustre]
 [&amp;lt;ffffffff8105ea30&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa09b79e0&amp;gt;] ? ll_statahead_thread+0x0/0x890 [lustre]
 [&amp;lt;ffffffff8100c14a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffffa09b79e0&amp;gt;] ? ll_statahead_thread+0x0/0x890 [lustre]
 [&amp;lt;ffffffffa09b79e0&amp;gt;] ? ll_statahead_thread+0x0/0x890 [lustre]
 [&amp;lt;ffffffff8100c140&amp;gt;] ? child_rip+0x0/0x20
Code: 48 03 05 88 4b a7 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 &amp;lt;45&amp;gt; 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89 
RIP  [&amp;lt;ffffffff8118fc8c&amp;gt;] __d_lookup+0x8c/0x150
 RSP &amp;lt;ffff88049ad3dcc0&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The stack reported by crash:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; bt
PID: 3895   TASK: ffff8804f49a0aa0  CPU: 0   COMMAND: &quot;ll_sa_25915&quot;
 #0 [ffff88049ad3da50] machine_kexec at ffffffff8103216b
 #1 [ffff88049ad3dab0] crash_kexec at ffffffff810b8d12
 #2 [ffff88049ad3db80] oops_end at ffffffff814f2b00
 #3 [ffff88049ad3dbb0] die at ffffffff8100f26b
 #4 [ffff88049ad3dbe0] do_general_protection at ffffffff814f2692
 #5 [ffff88049ad3dc10] general_protection at ffffffff814f1e65
    [exception RIP: __d_lookup+140]
    RIP: ffffffff8118fc8c  RSP: ffff88049ad3dcc0  RFLAGS: 00010202
    RAX: 000000000000000f  RBX: 2e342036343a3732  RCX: 0000000000000016
    RDX: 018721e08df08940  RSI: ffff88049ad3ddc0  RDI: ffff880421557300
    RBP: ffff88049ad3dd10   R8: ffff880589fdad30   R9: 00000000ffffffff
    R10: 0000000000000000  R11: 0000000000000000  R12: 2e342036343a371a
    R13: ffff880421557300  R14: 000000009e75e374  R15: ffff8803a22822b8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff88049ad3dd18] d_lookup at ffffffff811908fc
 #7 [ffff88049ad3dd48] ll_statahead_one at ffffffffa09b672c [lustre]
 #8 [ffff88049ad3de18] ll_statahead_thread at ffffffffa09b7c98 [lustre]
 #9 [ffff88049ad3df48] kernel_thread at ffffffff8100c14a
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And source level information from GDB:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(gdb) l *__d_lookup+140
0xffffffff8118fc8c is in __d_lookup (fs/dcache.c:1409).
1404            rcu_read_lock();
1405
1406            hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
1407                    struct qstr *qstr;
1408
1409                    if (dentry-&amp;gt;d_name.hash != hash)
1410                            continue;
1411                    if (dentry-&amp;gt;d_parent != parent)
1412                            continue;
1413
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here&apos;s the assembly of &lt;tt&gt;__d_lookup&lt;/tt&gt; around the &lt;tt&gt;RIP&lt;/tt&gt; of the failure:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; x/80i __d_lookup
...
   0xffffffff8118fc86 &amp;lt;__d_lookup+134&amp;gt;: je     0xffffffff8118fced &amp;lt;__d_lookup+237&amp;gt;
   0xffffffff8118fc88 &amp;lt;__d_lookup+136&amp;gt;: lea    -0x18(%rbx),%r12
   0xffffffff8118fc8c &amp;lt;__d_lookup+140&amp;gt;: cmp    %r14d,0x30(%r12)
   0xffffffff8118fc91 &amp;lt;__d_lookup+145&amp;gt;: jne    0xffffffff8118fc80 &amp;lt;__d_lookup+128&amp;gt;
   0xffffffff8118fc93 &amp;lt;__d_lookup+147&amp;gt;: cmp    %r13,0x28(%r12)
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The backtrace above from crash shows the register contents when the failure occurred. &lt;tt&gt;%r14&lt;/tt&gt; appears to contain the hash value for the &lt;tt&gt;qstr&lt;/tt&gt; we are looking up (i.e. &lt;tt&gt;hash&lt;/tt&gt; in the source code). I tend to trust this because I believe I found that &lt;tt&gt;qstr&lt;/tt&gt; structure on the stack here:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; p -x *(struct qstr *)0xffff88049ad3ddc0
$30 = {
  hash = 0x9e75e374, 
  len = 0xf, 
  name = 0xffff8802fc86b3b8 &quot;MultiFab_D_07813&quot;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
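
&lt;p&gt;For what it&apos;s worth, the faulting operands can be replayed outside of crash. The sketch below copies the register values from the oops and assumes the 2.6.32 &lt;tt&gt;struct dentry&lt;/tt&gt; layout (&lt;tt&gt;d_hash&lt;/tt&gt; at offset 0x18, &lt;tt&gt;d_name.hash&lt;/tt&gt; at 0x30); interestingly, the bytes of &lt;tt&gt;%rbx&lt;/tt&gt; decode as printable ASCII rather than a kernel pointer:&lt;/p&gt;

```python
# Register values copied from the oops above.
rbx = 0x2e342036343a3732   # hlist node pointer used by the lookup loop
r14 = 0x000000009e75e374   # hash being looked up (matches the qstr dump)

# "lea -0x18(%rbx),%r12" is container_of(node, struct dentry, d_hash),
# assuming d_hash sits at offset 0x18 in the 2.6.32 struct dentry.
r12 = rbx - 0x18
print(hex(r12))            # 0x2e342036343a371a, as in the oops

# "cmp %r14d,0x30(%r12)" then reads d_name.hash at offset 0x30 of that
# dentry -- but the node "pointer" itself decodes as printable ASCII,
# suggesting the hash chain entry was overwritten with file data:
print(rbx.to_bytes(8, "little").decode("ascii"))
```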

&lt;p&gt;The value in &lt;tt&gt;%r14&lt;/tt&gt; and the hash of the dumped &lt;tt&gt;qstr&lt;/tt&gt; match up. I&apos;m not sure what the value in &lt;tt&gt;%r12&lt;/tt&gt; represents, though, and my assembly is a bit rusty, so I&apos;m not entirely sure how to decode the &lt;tt&gt;cmp&lt;/tt&gt; instruction we fail on. My best guess is that we&apos;re dereferencing &lt;tt&gt;%r12&lt;/tt&gt;, which isn&apos;t a valid pointer value.&lt;/p&gt;</description>
                <environment></environment>
        <key id="17343">LU-2704</key>
            <summary>GPF in __d_lookup called from ll_statahead_one</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="prakash">Prakash Surya</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Tue, 29 Jan 2013 12:32:47 +0000</created>
                <updated>Sun, 3 Apr 2016 03:03:41 +0000</updated>
                            <resolved>Sun, 3 Apr 2016 03:03:41 +0000</resolved>
                                    <version>Lustre 2.1.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="51421" author="bfaccini" created="Tue, 29 Jan 2013 16:00:40 +0000"  >&lt;p&gt;Yes, but %r12 can either be inherited directly from a wrong &quot;parent&quot; 1st parameter if we had just entered __d_lookup(), or be the result of a corrupted dentry on the hash list if we were already looping ...&lt;/p&gt;

&lt;p&gt;So the next step is to check the parent dentry pointer value at the call of __d_lookup().&lt;/p&gt;</comment>
                            <comment id="51865" author="bfaccini" created="Wed, 6 Feb 2013 09:45:39 +0000"  >&lt;p&gt;Hello Prakash,&lt;/p&gt;

&lt;p&gt;Have you been able to find the dentry pointer value passed as the &quot;parent&quot; parameter in the call to __d_lookup() from ll_statahead_one() (with d_lookup() inlined)?&lt;/p&gt;

&lt;p&gt;Is the crash dump available?&lt;/p&gt;
</comment>
                            <comment id="51872" author="prakash" created="Wed, 6 Feb 2013 12:20:57 +0000"  >&lt;p&gt;I did find an address pointing to a dentry structure (0xffff880421557300), but it&apos;s hard to say whether that was the parent or just an entry visited during list traversal. I do have a crash dump; how do I post it to your FTP server?&lt;/p&gt;</comment>
                            <comment id="51987" author="prakash" created="Thu, 7 Feb 2013 13:58:33 +0000"  >&lt;p&gt;I uploaded a tarball named &lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2704&quot; title=&quot;GPF in __d_lookup called from ll_statahead_one&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2704&quot;&gt;&lt;del&gt;LU-2704&lt;/del&gt;&lt;/a&gt;.tar&lt;/tt&gt; in that directory. It has the vmcore, vmlinux, and Lustre modules. I believe you should be able to inspect it like so:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ crash LU-2704/vmcore-ansel12-2013-01-26-09\:07\:05 LU-2704/vmlinux
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If you need anything else, let me know.&lt;/p&gt;</comment>
                            <comment id="52279" author="bfaccini" created="Wed, 13 Feb 2013 12:11:06 +0000"  >&lt;p&gt;Sorry for the lack of update, I saved your tarball and currently xfer it to continue crash-dump analysis, will keep you updated.&lt;/p&gt;</comment>
                            <comment id="52475" author="bfaccini" created="Fri, 15 Feb 2013 15:08:50 +0000"  >&lt;p&gt;OK, I have been able to dig into the crash dump, and I am a bit confused by what I found:&lt;/p&gt;

&lt;p&gt;        _ to answer one of the questions I had before, the &quot;parent&quot;/1st parameter of &lt;span class=&quot;error&quot;&gt;&amp;#91;__&amp;#93;&lt;/span&gt;d_lookup() is valid and points to a dentry named &quot;Level_0&quot;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$3 = {
  d_count = {
    counter = 0x3
  },
  d_flags = 0x0,
  d_lock = {
    raw_lock = {
      slock = 0x78b078b
    }
  },
  d_mounted = 0x0,
  d_inode = 0xffff8803a22822b8,
  d_hash = {
    next = 0x0,
    pprev = 0xffffc90001489a50
  },
  d_parent = 0xffff8804215a2980,
  d_name = {
    hash = 0x4816d9c5,
    len = 0x7,
    name = 0xffff8804215573a0
  },
  d_lru = {
    next = 0xffff880421557340,
    prev = 0xffff880421557340
  },
  d_u = {
    d_child = {
      next = 0xffff880421576910,
      prev = 0xffff8804215a29e0
    },
    d_rcu = {
      next = 0xffff880421576910,
      func = 0xffff8804215a29e0
    }
  },
  d_subdirs = {
    next = 0xffff880421557360,
    prev = 0xffff880421557360
  },
  d_alias = {
    next = 0xffff8803a22822e8,
    prev = 0xffff8803a22822e8
  },
  d_time = 0x0,
  d_op = 0xffffffffa09d3280,
  d_sb = 0xffff880622524000,
  d_fsdata = 0xffff8804170c1440,
  d_iname = {0x4c, 0x65, 0x76, 0x65, 0x6c, 0x5f, 0x30, 0x0, 0x5f, 0x4d, 0x46, 0x5f, 0x44, 0x5f, 0x30, 0x37, 0x38, 0x32, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
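
&lt;p&gt;The &lt;tt&gt;d_iname&lt;/tt&gt; bytes above can be decoded to confirm the name; a small sketch (the bytes after the NUL look like a leftover of a previous, longer name, though that is only a guess):&lt;/p&gt;

```python
# d_iname bytes copied from the parent dentry dump above (the trailing
# zero padding is omitted); d_iname is NUL-terminated short-name storage.
d_iname = bytes([0x4c, 0x65, 0x76, 0x65, 0x6c, 0x5f, 0x30, 0x00,
                 0x5f, 0x4d, 0x46, 0x5f, 0x44, 0x5f, 0x30, 0x37,
                 0x38, 0x32, 0x00])

name = d_iname.split(b"\x00")[0].decode("ascii")
print(name)        # Level_0

# bytes past the terminating NUL, possibly a remnant of an earlier name:
print(d_iname.split(b"\x00")[1].decode("ascii"))   # _MF_D_0782
```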

&lt;p&gt;        _ as you found too, the &quot;name&quot;/2nd parameter of &lt;span class=&quot;error&quot;&gt;&amp;#91;__&amp;#93;&lt;/span&gt;d_lookup() is also valid and points to this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$7 = {
  hash = 0x9e75e374,
  len = 0xf,
  name = 0xffff8802fc86b3b8 --&amp;gt;&amp;gt; &quot;MultiFab_D_07813&quot;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;         _ due to optimisation there are no intermediate registers/results to track/check some of the inlined d_hash() computations, so I had to replay it by hand (and verified the same method works with dentry-&amp;gt;d_parent and its own hash!) with the help of the __d_lookup() disassembly and source code. What I found is that the dentry_hashtable[] entry contains what looks like a valid pointer to the 1st element/dentry linked on its d_hash list. But when trying to resolve/dereference it, crash complains that it is not present/dumped, and moreover the page this address belongs to is currently assigned to a Lustre inode mapping!&lt;/p&gt;
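
&lt;p&gt;For reference, the bucket computation being replayed can be sketched roughly as follows. This is only a model: &lt;tt&gt;d_hash_index&lt;/tt&gt; is a made-up name, the constants mirror the 2.6.32 &lt;tt&gt;d_hash()&lt;/tt&gt; inline and &lt;tt&gt;include/linux/hash.h&lt;/tt&gt; of that era, and the real &lt;tt&gt;d_hash_shift&lt;/tt&gt;/&lt;tt&gt;d_hash_mask&lt;/tt&gt; are sized at boot, so the table size used here is an assumption:&lt;/p&gt;

```python
# Rough model of the 2.6.32 d_hash() inline used by [__]d_lookup().
GOLDEN_RATIO_PRIME = 0x9e37fffffffc0001   # 64-bit value from hash.h
L1_CACHE_BYTES = 64
WORD = 2 ** 64                            # unsigned long wraparound

def d_hash_index(parent, name_hash, d_hash_shift):
    # hash += ((unsigned long)parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
    h = (name_hash + (parent ^ GOLDEN_RATIO_PRIME) // L1_CACHE_BYTES) % WORD
    # hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
    h = h ^ ((h ^ GOLDEN_RATIO_PRIME) >> d_hash_shift)
    # index into dentry_hashtable[]: hash masked to the table size
    # (modulo of a power of two stands in for the D_HASHMASK AND)
    return h % (2 ** d_hash_shift)

# e.g. the parent dentry address and qstr hash from this dump, with an
# assumed 2**18-entry table:
print(d_hash_index(0xffff880421557300, 0x9e75e374, 18))
```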

&lt;p&gt;         _ this explains why the page was not dumped, and likely the kind of data that was accessed to trigger the GPF!&lt;/p&gt;

&lt;p&gt;         _ there are 2 other current threads holding old references to this inode in their stacks, &quot;ptlrpcd-brw&quot; and &quot;flush-lustre-2&quot;, but they have not been scheduled recently (within the last second, I mean!).&lt;/p&gt;

&lt;p&gt;         _ more interesting, there is another thread, &quot;lpurge&quot;, running on another core, which also triggered a GPF in the dentry layer! I will dig into it now.&lt;/p&gt;

&lt;p&gt;         _ last, earlier OOM situations were logged, and these messages were found in the kernel log:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 29314:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 29314:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 29314:0:(file.c:158:ll_close_inode_openhandle()) inode 5053885323872960658 mdc close failed: rc = -108
LustreError: 29314:0:(file.c:158:ll_close_inode_openhandle()) inode 5044781017488031745 mdc close failed: rc = -108
LustreError: 29314:0:(dir.c:477:ll_get_dir_page()) read cache page: [0x24d18001:0x3db440f0:0x0] at 3512391207215765018: rc -108
LustreError: 29314:0:(dir.c:648:ll_readdir()) error reading dir [0x24d18001:0x3db440f0:0x0] at 3512391207215765018: rc -108
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 11-0: lsc-MDT0000-mdc-ffff880622524000: Communicating with 172.19.1.100@o2ib100, operation mds_getattr_lock failed with -107.
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection to lsc-MDT0000 (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: This client was evicted by lsc-MDT0000; in progress operations using this service will fail.
LustreError: 25915:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 4613:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-MDT0000-mdc-ffff880622524000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 4613:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88012b155800 (2500481170767/301/0/0) (rc: 1)
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 25915:0:(mdc_locks.c:723:mdc_enqueue()) Skipped 1001 previous similar messages
LustreError: 11-0: lsc-MDT0000-mdc-ffff880622524000: Communicating with 172.19.1.100@o2ib100, operation mds_getattr_lock failed with -107.
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection to lsc-MDT0000 (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: This client was evicted by lsc-MDT0000; in progress operations using this service will fail.
LustreError: 9659:0:(mdc_locks.c:723:mdc_enqueue()) ldlm_cli_enqueue: -4
LustreError: 9681:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-MDT0000-mdc-ffff880622524000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 9681:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880135b3ba80 (2957967665419/1/0/0) (rc: 1)
Lustre: lsc-MDT0000-mdc-ffff880622524000: Connection restored to lsc-MDT0000 (at 172.19.1.100@o2ib100)
LustreError: 9659:0:(mdc_locks.c:723:mdc_enqueue()) Skipped 4 previous similar messages
LustreError: 9659:0:(file.c:2196:ll_inode_revalidate_fini()) failure -4 inode 8764462226200395777
general protection fault: 0000 [#1] SMP
.....
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which may well be related ...&lt;/p&gt;

&lt;p&gt;So more to come, but it seems to me that we may have crashed due to an orphan reference left in dentry_hashtable[], possibly created under some specific eviction conditions ....&lt;/p&gt;</comment>
                            <comment id="52862" author="bfaccini" created="Fri, 22 Feb 2013 06:25:05 +0000"  >&lt;p&gt;The 2nd thread, &quot;lpurge&quot;, also handling a GPF, is in fact working on the same parent directory and hash value, and thus also hits the same corrupted dentry_hashtable[] entry. This is consistent, since the 1st thread in GPF is an ll_sa_&amp;lt;PID&amp;gt; companion thread of the 2nd!&lt;/p&gt;

&lt;p&gt;On the other hand, the Lustre inode owning the mapping of the page that the corrupted reference points into is one of the current file descriptors of this 2nd thread, &quot;lpurge&quot; ...&lt;/p&gt;

&lt;p&gt;I am not familiar with the &quot;lpurge&quot; command, but does it do anything tricky from a Lustre point of view?&lt;/p&gt;
</comment>
                            <comment id="52896" author="morrone" created="Fri, 22 Feb 2013 14:29:13 +0000"  >&lt;p&gt;lpurge is part of the &lt;a href=&quot;https://github.com/chaos/lustre-tools-llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;lustre-tools-llnl&lt;/a&gt; package.  You can see the code here:&lt;/p&gt;

&lt;p&gt;  &lt;a href=&quot;https://github.com/chaos/lustre-tools-llnl/blob/master/src/lpurge.c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/chaos/lustre-tools-llnl/blob/master/src/lpurge.c&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;lpurge tries to constrain itself to only talking to the MDS when possible, so it uses the IOC_MDC_GETFILEINFO ioctl, for instance.&lt;/p&gt;

&lt;p&gt;Other than that, nothing too tricky that I can remember.  It is not even multi-threaded.  If there were two lpurge threads running, that is likely because the sysadmin ran lpurge more than once.  But I would expect them to run each copy in a different subdirectory (likely different users&apos; directories in lustre).  It is possible that someone made a mistake though.&lt;/p&gt;

&lt;p&gt;I would not be surprised if they share some top-level directory if no mistake was made in running the command.&lt;/p&gt;</comment>
                            <comment id="53025" author="bfaccini" created="Tue, 26 Feb 2013 04:11:59 +0000"  >&lt;p&gt;Thanks for this info Chris, but I never have enough, sorry ... Can you confirm whether lpurge was run with -R (i.e., removing selected files) at the time of the crash? I cannot verify this myself in the crash dump, since by default user-context pages are not dumped.&lt;/p&gt;

&lt;p&gt;I think I may have found a path in the code that could allow/explain the dentry_hashtable[] corruption I found in your crash dump. I will give it a try and expose it to our regression tests.&lt;/p&gt;
</comment>
                            <comment id="53034" author="bfaccini" created="Tue, 26 Feb 2013 04:46:57 +0000"  >&lt;p&gt;Patch against b2_1 at &lt;a href=&quot;http://review.whamcloud.com/5533&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/5533&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="53126" author="prakash" created="Wed, 27 Feb 2013 13:20:32 +0000"  >&lt;p&gt;Bruno, lpurge was run with the &quot;-R&quot; option to remove the files.&lt;/p&gt;</comment>
                            <comment id="53414" author="bfaccini" created="Wed, 6 Mar 2013 02:28:05 +0000"  >&lt;p&gt;I realize that I forgot to ask, but I presume this has been a single crash occurrence until now?&lt;/p&gt;

&lt;p&gt;BTW, the way it can happen is still unexplained; at the moment I only have a debug patch ready to go, in order to detect unsafe dentry_hashtable[] accesses ...&lt;/p&gt;</comment>
                            <comment id="53485" author="prakash" created="Wed, 6 Mar 2013 19:20:47 +0000"  >&lt;p&gt;AFAIK, it&apos;s only been hit once. But I&apos;ve heard vague reports from our admins about crashes when running lpurge before, so there&apos;s a chance of unreported incidents of this crash.&lt;/p&gt;</comment>
                            <comment id="53779" author="bfaccini" created="Tue, 12 Mar 2013 06:00:14 +0000"  >&lt;p&gt;Exposure of the patch to the full test suite ran into a lot of external conditions (mainly the one tracked in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2458&quot; title=&quot;Device MGC already exists, won&amp;#39;t add&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2458&quot;&gt;&lt;del&gt;LU-2458&lt;/del&gt;&lt;/a&gt;) causing some tests to fail randomly. Since no side effects from the patch were detected, I re-submitted it with no extra test parameters.&lt;/p&gt;</comment>
                            <comment id="57224" author="bfaccini" created="Mon, 29 Apr 2013 09:52:58 +0000"  >&lt;p&gt;Prakash, Chris,&lt;br/&gt;
Have you tried integrating my debug patch into your Lustre builds? Also, have you seen any new occurrence of this crash when running lpurge?&lt;/p&gt;</comment>
                            <comment id="89890" author="spiechurski" created="Wed, 23 Jul 2014 21:32:19 +0000"  >&lt;p&gt;I am not sure this will be of any help, but I wanted to mention that we had several occurrences of this issue at one of our customer sites.&lt;br/&gt;
I have a crash dump, in case it holds any hints as to what happened, but I guess all the interesting traces were already gone by the time of the crash.&lt;br/&gt;
If you want to take a look anyway, just tell me.&lt;/p&gt;</comment>
                            <comment id="89903" author="bfaccini" created="Wed, 23 Jul 2014 23:01:49 +0000"  >&lt;p&gt;Hello Seb,&lt;br/&gt;
Very late/night update!&lt;br/&gt;
Does this customer also run 2.1.2, or a later version?&lt;br/&gt;
Yes, I can have a look at the crash dump to confirm whether or not it is the same scenario as the original. But if you have already had several occurrences, you may want to integrate my debug patch into your distro; it did not introduce any regression during intensive tests and will help to identify the root cause.&lt;/p&gt;</comment>
                            <comment id="89934" author="spiechurski" created="Thu, 24 Jul 2014 08:45:55 +0000"  >&lt;p&gt;The customer is running our version, which is based on 2.1.6. Your patch is not part of it, so I will check whether the customer still sees occurrences, and try to put the patch there if that&apos;s the case.&lt;/p&gt;</comment>
                            <comment id="147677" author="adilger" created="Sun, 3 Apr 2016 03:03:41 +0000"  >&lt;p&gt;Have not had any feedback on this 2.1 bug in 2 years.  Closing.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="35774">LU-7973</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvgcv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6298</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>