<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:27:09 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2665] LBUG while unmounting client</title>
                <link>https://jira.whamcloud.com/browse/LU-2665</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;When trying to unmount a Lustre client, we got the following problem:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: DEBUG MARKER: Wed Nov 21 06:25:01 2012

LustreError: 11559:0:(ldlm_lock.c:1697:ldlm_lock_cancel()) ### lock still has references ns:
ptmp-MDT0000-mdc-ffff88030871bc00 lock: ffff88060dbd2d80/0x4618f3ec8d79d8be lrc: 4/0,1 mode: PW/PW res: 8590405073/266
rrc: 2 type: FLK pid: 4414 [0-&amp;gt;551] flags: 0x22002890 remote: 0xc8980c051f8f6afd expref: -99 pid: 4414 timeout: 0
LustreError: 11559:0:(ldlm_lock.c:1698:ldlm_lock_cancel()) LBUG
Pid: 11559, comm: umount

Call Trace:
 [&amp;lt;ffffffffa040d7f5&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [&amp;lt;ffffffffa040de07&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
 [&amp;lt;ffffffffa063343d&amp;gt;] ldlm_lock_cancel+0x1ad/0x1b0 [ptlrpc]
 [&amp;lt;ffffffffa064d245&amp;gt;] ldlm_cli_cancel_local+0xb5/0x380 [ptlrpc]
 [&amp;lt;ffffffffa06510b8&amp;gt;] ldlm_cli_cancel+0x58/0x3b0 [ptlrpc]
 [&amp;lt;ffffffffa063ae18&amp;gt;] cleanup_resource+0x168/0x300 [ptlrpc]
 [&amp;lt;ffffffffa063afda&amp;gt;] ldlm_resource_clean+0x2a/0x50 [ptlrpc]
 [&amp;lt;ffffffffa041e28f&amp;gt;] cfs_hash_for_each_relax+0x17f/0x380 [libcfs]
 [&amp;lt;ffffffffa063afb0&amp;gt;] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
 [&amp;lt;ffffffffa063afb0&amp;gt;] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
 [&amp;lt;ffffffffa041fcaf&amp;gt;] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
 [&amp;lt;ffffffffa0637a69&amp;gt;] ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
 [&amp;lt;ffffffffa0638adb&amp;gt;] __ldlm_namespace_free+0x4b/0x540 [ptlrpc]
 [&amp;lt;ffffffffa06502d0&amp;gt;] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
 [&amp;lt;ffffffffa06502d0&amp;gt;] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
 [&amp;lt;ffffffffa06502d0&amp;gt;] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
 [&amp;lt;ffffffffa041fcb7&amp;gt;] ? cfs_hash_for_each_nolock+0x87/0x1c0 [libcfs]
 [&amp;lt;ffffffffa063903f&amp;gt;] ldlm_namespace_free_prior+0x6f/0x230 [ptlrpc]
 [&amp;lt;ffffffffa063fc4c&amp;gt;] client_disconnect_export+0x23c/0x460 [ptlrpc]
 [&amp;lt;ffffffffa0b42a44&amp;gt;] lmv_disconnect+0x644/0xc70 [lmv]
 [&amp;lt;ffffffffa0a470bc&amp;gt;] client_common_put_super+0x46c/0xe80 [lustre]
 [&amp;lt;ffffffffa0a47ba0&amp;gt;] ll_put_super+0xd0/0x360 [lustre]
 [&amp;lt;ffffffff8117e01c&amp;gt;] ? dispose_list+0x11c/0x140
 [&amp;lt;ffffffff8117e4a8&amp;gt;] ? invalidate_inodes+0x158/0x1a0
 [&amp;lt;ffffffff8116542b&amp;gt;] generic_shutdown_super+0x5b/0x110
 [&amp;lt;ffffffff81165546&amp;gt;] kill_anon_super+0x16/0x60
 [&amp;lt;ffffffffa050897a&amp;gt;] lustre_kill_super+0x4a/0x60 [obdclass]
 [&amp;lt;ffffffff811664e0&amp;gt;] deactivate_super+0x70/0x90
 [&amp;lt;ffffffff811826bf&amp;gt;] mntput_no_expire+0xbf/0x110
 [&amp;lt;ffffffff81183188&amp;gt;] sys_umount+0x78/0x3c0
 [&amp;lt;ffffffff810030f2&amp;gt;] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 11559, comm: umount Not tainted 2.6.32-220.23.1.bl6.Bull.28.8.x86_64 #1
Call Trace:
 [&amp;lt;ffffffff81484650&amp;gt;] ? panic+0x78/0x143
 [&amp;lt;ffffffffa040de5b&amp;gt;] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [&amp;lt;ffffffffa063343d&amp;gt;] ? ldlm_lock_cancel+0x1ad/0x1b0 [ptlrpc]
 [&amp;lt;ffffffffa064d245&amp;gt;] ? ldlm_cli_cancel_local+0xb5/0x380 [ptlrpc]
 [&amp;lt;ffffffffa06510b8&amp;gt;] ? ldlm_cli_cancel+0x58/0x3b0 [ptlrpc]
 [&amp;lt;ffffffffa063ae18&amp;gt;] ? cleanup_resource+0x168/0x300 [ptlrpc]
 [&amp;lt;ffffffffa063afda&amp;gt;] ? ldlm_resource_clean+0x2a/0x50 [ptlrpc]
 [&amp;lt;ffffffffa041e28f&amp;gt;] ? cfs_hash_for_each_relax+0x17f/0x380 [libcfs]
 [&amp;lt;ffffffffa063afb0&amp;gt;] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
 [&amp;lt;ffffffffa063afb0&amp;gt;] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
 [&amp;lt;ffffffffa041fcaf&amp;gt;] ? cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
 [&amp;lt;ffffffffa0637a69&amp;gt;] ? ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
 [&amp;lt;ffffffffa0638adb&amp;gt;] ? __ldlm_namespace_free+0x4b/0x540 [ptlrpc]
 [&amp;lt;ffffffffa06502d0&amp;gt;] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
 [&amp;lt;ffffffffa06502d0&amp;gt;] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
 [&amp;lt;ffffffffa06502d0&amp;gt;] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
 [&amp;lt;ffffffffa041fcb7&amp;gt;] ? cfs_hash_for_each_nolock+0x87/0x1c0 [libcfs]
 [&amp;lt;ffffffffa063903f&amp;gt;] ? ldlm_namespace_free_prior+0x6f/0x230 [ptlrpc]
 [&amp;lt;ffffffffa063fc4c&amp;gt;] ? client_disconnect_export+0x23c/0x460 [ptlrpc]
 [&amp;lt;ffffffffa0b42a44&amp;gt;] ? lmv_disconnect+0x644/0xc70 [lmv]
 [&amp;lt;ffffffffa0a470bc&amp;gt;] ? client_common_put_super+0x46c/0xe80 [lustre]
 [&amp;lt;ffffffffa0a47ba0&amp;gt;] ? ll_put_super+0xd0/0x360 [lustre]
 [&amp;lt;ffffffff8117e01c&amp;gt;] ? dispose_list+0x11c/0x140
 [&amp;lt;ffffffff8117e4a8&amp;gt;] ? invalidate_inodes+0x158/0x1a0
 [&amp;lt;ffffffff8116542b&amp;gt;] ? generic_shutdown_super+0x5b/0x110
 [&amp;lt;ffffffff81165546&amp;gt;] ? kill_anon_super+0x16/0x60
 [&amp;lt;ffffffffa050897a&amp;gt;] ? lustre_kill_super+0x4a/0x60 [obdclass]
 [&amp;lt;ffffffff811664e0&amp;gt;] ? deactivate_super+0x70/0x90
 [&amp;lt;ffffffff811826bf&amp;gt;] ? mntput_no_expire+0xbf/0x110
 [&amp;lt;ffffffff81183188&amp;gt;] ? sys_umount+0x78/0x3c0
 [&amp;lt;ffffffff810030f2&amp;gt;] ? system_call_fastpath+0x16/0x1b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;This issue is exactly the same as the one described in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1429&quot; title=&quot;LBUG while unmounting client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1429&quot;&gt;&lt;del&gt;LU-1429&lt;/del&gt;&lt;/a&gt;, which is a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1328&quot; title=&quot;Failing customer&amp;#39;s file creation test&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1328&quot;&gt;&lt;del&gt;LU-1328&lt;/del&gt;&lt;/a&gt;, which itself seems to be related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1421&quot; title=&quot;Client LBUG in ll_file_write after filesystem expansion&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1421&quot;&gt;&lt;del&gt;LU-1421&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
The issue seems to be resolved, but it is very unclear to me which patches are needed in order to completely fix the issue.&lt;br/&gt;
I should add that we need a fix for b2_1.&lt;/p&gt;

&lt;p&gt;Can you please advise?&lt;/p&gt;

&lt;p&gt;TIA,&lt;br/&gt;
Sebastien.&lt;/p&gt;</description>
                <environment></environment>
        <key id="17261">LU-2665</key>
            <summary>LBUG while unmounting client</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="sebastien.buisson">Sebastien Buisson</reporter>
                        <labels>
                            <label>mn1</label>
                            <label>ptr</label>
                    </labels>
                <created>Tue, 22 Jan 2013 11:29:16 +0000</created>
                <updated>Fri, 22 Nov 2013 08:21:38 +0000</updated>
                            <resolved>Wed, 18 Sep 2013 10:20:57 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                    <version>Lustre 2.1.3</version>
                                    <fixVersion>Lustre 2.5.0</fixVersion>
                    <fixVersion>Lustre 2.4.2</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>12</watches>
                                                                            <comments>
                            <comment id="51025" author="bfaccini" created="Wed, 23 Jan 2013 09:36:47 +0000"  >&lt;p&gt;Hello Seb !!&lt;br/&gt;
Is there a crash-dump available ?? If Yes, can you get us with a &quot;foreach bt&quot; output and the ldlm_lock content ??&lt;br/&gt;
On the other hand, I will try to understand the link between this-one/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1429&quot; title=&quot;LBUG while unmounting client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1429&quot;&gt;&lt;del&gt;LU-1429&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1328&quot; title=&quot;Failing customer&amp;#39;s file creation test&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1328&quot;&gt;&lt;del&gt;LU-1328&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1421&quot; title=&quot;Client LBUG in ll_file_write after filesystem expansion&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1421&quot;&gt;&lt;del&gt;LU-1421&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks in advance.&lt;/p&gt;</comment>
                            <comment id="51047" author="adilger" created="Wed, 23 Jan 2013 14:00:25 +0000"  >&lt;p&gt;Looks like this is an flock lock, and there were several patches landed to master about fixing those.  Maybe one needs to be back-ported?&lt;/p&gt;</comment>
                            <comment id="51086" author="sebastien.buisson" created="Thu, 24 Jan 2013 03:16:23 +0000"  >&lt;p&gt;Hi Bruno!!&lt;/p&gt;

&lt;p&gt;Antoine already sent us the content of the ldlm_lock structure:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; p *(struct ldlm_lock *)0xffff88060dbd2d80
$3 = {
  l_handle = {
    h_link = {
      next = 0xffffc90016ae91d8,
      prev = 0xffffc90016ae91d8
    },
    h_cookie = 5051055179407415486,
    h_addref = 0xffffffffa062f130 &amp;lt;lock_handle_addref&amp;gt;,
    h_lock = {
      raw_lock = {
        slock = 196611
      }
    },
    h_ptr = 0x0,
    h_free_cb = 0,
    h_rcu = {
      next = 0x0,
      func = 0
    },
    h_size = 0,
    h_in = 1 &apos;\001&apos;,
    h_unused = &quot;\000\000&quot;
  },
  l_refc = {
    counter = 4
  },
  l_lock = {
    raw_lock = {
      slock = 851980
    }
  },
  l_resource = 0xffff8802f8a14300,
  l_lru = {
    next = 0xffff88060dbd2de0,
    prev = 0xffff88060dbd2de0
  },
  l_res_link = {
    next = 0xffff8802f8a14320,
    prev = 0xffff8802f8a14320
  },
  l_tree_node = 0x0,
  l_exp_hash = {
    next = 0x0,
    pprev = 0x0
  },
  l_req_mode = LCK_PW,
  l_granted_mode = LCK_PW,
  l_completion_ast = 0xffffffffa065ec20 &amp;lt;ldlm_flock_completion_ast&amp;gt;,
  l_blocking_ast = 0,
  l_glimpse_ast = 0,
  l_weigh_ast = 0,
  l_export = 0x0,
  l_conn_export = 0xffff88030871b800,
  l_remote_handle = {
    cookie = 14454316220189469437
  },
  l_policy_data = {
    l_extent = {
      start = 0,
      end = 551,
      gid = 18446612146206140608
    },
    l_flock = {
      start = 0,
      end = 551,
      owner = 18446612146206140608,
      blocking_owner = 0,
      blocking_export = 0x0,
      pid = 4414
    },
    l_inodebits = {
      bits = 0
    }
  },
  l_flags = 570435728,
  l_readers = 0,
  l_writers = 1,
  l_destroyed = 0 &apos;\000&apos;,
  l_ns_srv = 0 &apos;\000&apos;,
  l_waitq = {
    lock = {
      raw_lock = {
        slock = 196611
      }
    },
    task_list = {
      next = 0xffff88060dbd2ea8,
      prev = 0xffff88060dbd2ea8
    }
  },
  l_last_activity = 1353415544,
  l_last_used = 0,
  l_req_extent = {
    start = 0,
    end = 0,
    gid = 0
  },
  l_lvb_len = 0,
  l_lvb_data = 0x0,
  l_ast_data = 0xffff8805e2161c38,
  l_client_cookie = 0,
  l_pending_chain = {
    next = 0xffff88060dbd2f00,
    prev = 0xffff88060dbd2f00
  },
  l_callback_timeout = 0,
  l_pid = 4414,
  l_bl_ast = {
    next = 0xffff88060dbd2f20,
    prev = 0xffff88060dbd2f20
  },
  l_cp_ast = {
    next = 0xffff88060dbd2f30,
    prev = 0xffff88060dbd2f30
  },
  l_rk_ast = {
    next = 0xffff88060dbd2f40,
    prev = 0xffff88060dbd2f40
  },
  l_blocking_lock = 0x0,
  l_bl_ast_run = 0,
  l_sl_mode = {
    next = 0xffff88060dbd2f60,
    prev = 0xffff88060dbd2f60
  },
  l_sl_policy = {
    next = 0xffff88060dbd2f70,
    prev = 0xffff88060dbd2f70
  },
  l_reference = {&amp;lt;No data fields&amp;gt;},
  l_exp_refs_nr = 0,
  l_exp_refs_link = {
    next = 0xffff88060dbd2f88,
    prev = 0xffff88060dbd2f88
  },
  l_exp_refs_target = 0x0,
  l_exp_list = {
    next = 0xffff88060dbd2fa0,
    prev = 0xffff88060dbd2fa0
  }
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I will ask him for the &apos;foreach bt&apos; output.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="51107" author="bfaccini" created="Thu, 24 Jan 2013 10:50:33 +0000"  >&lt;p&gt;Thank&apos;s Sebastien already.&lt;/p&gt;

&lt;p&gt;BTW, the fact that a flock is involved seems to indicate to Andreas and other teammates that this could, after all, be a different situation from the one in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1429&quot; title=&quot;LBUG while unmounting client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1429&quot;&gt;&lt;del&gt;LU-1429&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, do we know what has been running on this particular node? I mean, is it a compute node or a service (login/batch/...) node?&lt;/p&gt;</comment>
                            <comment id="52749" author="bfaccini" created="Wed, 20 Feb 2013 11:36:18 +0000"  >&lt;p&gt;Sebastien, still no news about the &quot;foreach bt&quot; required infos ??&lt;br/&gt;
We need this to find who is 18446612146206140608/0xFFFF88033C05E8C0 and where it could be stuck.&lt;/p&gt;</comment>
                            <comment id="52799" author="sebastien.buisson" created="Thu, 21 Feb 2013 08:18:12 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;Still no news, I will ping the Support team!&lt;/p&gt;

&lt;p&gt;Sebastien.&lt;/p&gt;</comment>
                            <comment id="53528" author="sebastien.buisson" created="Thu, 7 Mar 2013 07:39:50 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Here is the requested information. Inside the tarball, you have:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;M1013_bt.txt, containing the &apos;foreach bt&apos; output;&lt;/li&gt;
	&lt;li&gt;M1013_lctl_debug_file_allcpu_sort_1133.txt, for the last Lustre debug traces.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Sebastien.&lt;/p&gt;</comment>
                            <comment id="54430" author="bfaccini" created="Tue, 19 Mar 2013 21:08:04 +0000"  >&lt;p&gt;Sorry for the delay Seb.&lt;br/&gt;
Requestor of the concerned FLock was no longer here at the time of the crash.&lt;/p&gt;

&lt;p&gt;As the last infos I may need for this, can you or somebody from the site (I am sure Antoine kept this ...) double-check in node&apos;s syslog/dmesg if there was any previous messages with references to the lock/0xffff88060dbd2d80, the PID/4414 or the owner/task/0xffff88033c05e8c0 ??&lt;/p&gt;</comment>
                            <comment id="54741" author="bfaccini" created="Mon, 25 Mar 2013 09:27:12 +0000"  >&lt;p&gt;Talking directly with Antoine, as I expected there may be earlier related msgs in the log he will provide/comment soon. Using these new infos we may go faster to the root cause now.&lt;/p&gt;

&lt;p&gt;Also he confirmed that there has been a few more occurrences since January with available crash-dumps.&lt;/p&gt;
</comment>
                            <comment id="54992" author="apercher" created="Thu, 28 Mar 2013 09:51:34 +0000"  >&lt;p&gt;Yes unfortunatly, i have lost the crash dump who produce the LBUG so&lt;br/&gt;
I search new occurence and I have found 2 others crash with the same&lt;br/&gt;
LBUG and on one I found maybe interesting messages :&lt;/p&gt;

&lt;p&gt;LustreError: 28052:0:(file.c:158:ll_close_inode_openhandle()) inode 147840184874990214 mdc close failed: rc = -4&lt;/p&gt;

&lt;p&gt;I can&apos;t link the inode to the LBUG, but that is the PID that has the lock issue:&lt;/p&gt;

&lt;p&gt;LustreError: 12577:0:(ldlm_lock.c:1697:ldlm_lock_cancel()) ### lock still has references ns: ptmp2-MDT0000-mdc-ffff8804739f5c00 lock: ffff88024b7496c0/0x141b572a020ffcdd lrc: 4/0,1 mode: PW/PW res: 8811961703/31366 rrc: 2 type: FLK pid: 28052 &lt;span class=&quot;error&quot;&gt;&amp;#91;0-&amp;gt;9223372036854775807&amp;#93;&lt;/span&gt; flags: 0x22002890 remote: 0xdf8e5c6f462dccb6 expref: -99 pid: 28052 timeout: 0&lt;/p&gt;

&lt;p&gt;So maybe Lustre fails to release the lock when there is an issue during&lt;br/&gt;
the close (due to an oops on the network in this case) ...&lt;/p&gt;

&lt;p&gt;I have attached the whole trace as an attachment ... and sorry for the delay.&lt;/p&gt;</comment>
                            <comment id="56093" author="sebastien.buisson" created="Thu, 11 Apr 2013 15:15:06 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;Now that Antoine has provided you with additional information, are you able to investigate this issue?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="56386" author="bfaccini" created="Tue, 16 Apr 2013 12:16:55 +0000"  >&lt;p&gt;Yes and my current focus is to try determine how an FLock reference from a dead process can remain ...&lt;/p&gt;

&lt;p&gt;There are known quick and dirty fix like the very simple one from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-736&quot; title=&quot;LBUG and kernel panic on client unmount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-736&quot;&gt;&lt;del&gt;LU-736&lt;/del&gt;&lt;/a&gt; but I will try to find the root cause of this situation ...&lt;/p&gt;

&lt;p&gt;I will call and work from remote with Antoine in order to identify how we can fall in this situation.&lt;/p&gt;</comment>
                            <comment id="56416" author="bfaccini" created="Tue, 16 Apr 2013 18:16:26 +0000"  >&lt;p&gt;I setup what I think is a reproducer. &lt;br/&gt;
It is a version of flocks_test that takes a lock and then sleeps, after this I force evict of client and kills process ...&lt;br/&gt;
It does not fail under 2.4 but shows the same Antoine identified. I am setting up a 2.1 test platform and will run it there. I may also call the site tomorrow and ask for a local test too ...&lt;/p&gt;
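
&lt;p&gt;A minimal lock-and-sleep sketch of this kind (illustrative only, not the actual flocks_test modification; the file path and byte range are arbitrary) could look like the following:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/* Lock-and-sleep sketch: take a POSIX write lock on a byte range of a file
 * on a Lustre mount, then sleep so the lock is still granted when the client
 * is evicted (or the MDS crashes) and the process is killed. */
#include &amp;lt;fcntl.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

int main(int argc, char **argv)
{
        const char *path = argc &amp;gt; 1 ? argv[1] : &quot;/mnt/lustre/flock_test_file&quot;;
        struct flock fl = {
                .l_type   = F_WRLCK,   /* write (PW-like) lock, as in the trace above */
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 552,       /* arbitrary byte range, mimics [0-&amp;gt;551] */
        };
        int fd = open(path, O_RDWR | O_CREAT, 0644);

        if (fd &amp;lt; 0 || fcntl(fd, F_SETLKW, &amp;amp;fl) &amp;lt; 0) {
                perror(&quot;open/fcntl&quot;);
                return EXIT_FAILURE;
        }
        printf(&quot;lock granted by pid %d, sleeping ...\n&quot;, getpid());
        pause();                       /* now evict the client or crash the MDS, then kill -9 this pid */
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;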

&lt;p&gt;But in the case of an evict event, ldlm_namespace_cleanup()/ldlm_resource_clean()/cleanup_resource() are called with LDLM_FL_LOCAL_ONLY unconditionally, and the locks are then treated as local, which means ldlm_cli_cancel()/ldlm_cli_cancel_local()/ldlm_lock_cancel() are not called to let the Server know of the cancel, since it is assumed the Server did it on its own.&lt;/p&gt;

&lt;p&gt;This is also what happens in the case of &quot;umount -f&quot;, which could already be used as a work-around on-site! Or when the localflock mount option (instead of flock) is used.&lt;/p&gt;
</comment>
                            <comment id="56453" author="bfaccini" created="Wed, 17 Apr 2013 14:00:55 +0000"  >&lt;p&gt;From the crash-dumps available, all l_flags of involved ldlm_lock are set to 0x22002890 which means LDLM_FL_CBPENDING/LDLM_FL_CANCEL/LDLM_FL_FAILED/LDLM_FL_CANCELING/LDLM_FL_CLEANED/LDLM_FL_BL_DONE. All these flags should have been set as part of current cleanup_resource()/ldlm_cli_cancel&lt;span class=&quot;error&quot;&gt;&amp;#91;_local&amp;#93;&lt;/span&gt;() call sequence.&lt;/p&gt;

&lt;p&gt;The l_last_activity time-stamp is a long time ago, but only a few minutes before the ll_close_inode_openhandle() error message for the pid owning the lock (pointed out by Antoine), when present.&lt;/p&gt;

&lt;p&gt;Working further with Antoine to understand whether something could have been wrong on the Server/MDS side, we found that, at least for the last 2 occurrences of the problem, the MDS had crashed/rebooted in the same period as the l_last_activity time-stamp and the ll_close_inode_openhandle() error message.&lt;/p&gt;

&lt;p&gt;Thus, the problem can only be caused by wrong handling of the MDS reboot condition during FLock unlock/clean-up.&lt;/p&gt;

&lt;p&gt;I will now try to reproduce it locally with an MDS reboot and full traces enabled.&lt;/p&gt;</comment>
                            <comment id="57035" author="bfaccini" created="Thu, 25 Apr 2013 13:23:33 +0000"  >&lt;p&gt;Antoine just found a new occurrence that happen during March. But for this new case, MDS crash/reboot did not happen around l_last_activity timing, but much more later (in fact this Client was not unmounted/stopped before Servers during a full Cluster shutdown ...). And the concerned FLock was from an application that has been killed -9 ...&lt;/p&gt;

&lt;p&gt;Thus I requested that they check if the FLock clean-up really happen during Application exit/kill, by writing a small reproducer to grant a FLock and then sleep, to kill/signal it and with full Lustre debug traces enabled, see if the automatic Unlock occurs.&lt;/p&gt;</comment>
                            <comment id="57523" author="bfaccini" created="Thu, 2 May 2013 14:19:34 +0000"  >&lt;p&gt;Good news, I am able to get the crash with a reproducer that mimics what we identified to be the possible scenario in our last findings/comments :&lt;/p&gt;

&lt;p&gt;         _ run a cmd/bin doing/granting flock and then going to sleep for ever. I modified flocks_test for this.&lt;/p&gt;

&lt;p&gt;         _ force MDS to crash.&lt;/p&gt;

&lt;p&gt;         _ wait a bit and then kill/^C to abort cmd. We get the &quot;(file.c:158:ll_close_inode_openhandle()) inode &amp;lt;INO&amp;gt; mdc close failed: rc = -4&quot; msg in Client log.&lt;/p&gt;

&lt;p&gt;         _ reboot MDS, re-mount MDT.&lt;/p&gt;

&lt;p&gt;         _ umount lustre-FS on Client. --&amp;gt;&amp;gt; LBUG!&lt;/p&gt;

&lt;p&gt;more to come.&lt;/p&gt;
</comment>
                            <comment id="57628" author="bfaccini" created="Fri, 3 May 2013 13:40:41 +0000"  >
&lt;p&gt;Hmm, I took a full Lustre trace during a new reproducer run, and here is what I discovered:&lt;/p&gt;

&lt;p&gt;        _ when the application is killed/signaled and exits, the Kernel tries to do some automatic cleanup/close of&lt;br/&gt;
the opened file descriptors, and for any FLock still around this is done via locks_remove_flock().&lt;/p&gt;

&lt;p&gt;        _ to terminate any current FLock, locks_remove_flock() will then force-create a request for an unlock of the whole&lt;br/&gt;
file, and calls the FS-specific routine to do so, in our/Lustre case ll_file_flock() (see the sketch after this list).&lt;/p&gt;

&lt;p&gt;        _ unfortunately, if the MDS is down at this time, the unlock request will fail and leave the currently granted&lt;br/&gt;
FLocks as orphans.&lt;/p&gt;

&lt;p&gt;        _ they will even be replayed when the MDS/MDT is back.&lt;/p&gt;

&lt;p&gt;        _ and even at a much later umount, they can still be found with references and cause the LBUG.&lt;/p&gt;
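
&lt;p&gt;At the fcntl() level, the whole-file unlock that the Kernel force-creates at exit corresponds roughly to the sketch below (illustrative only; the helper name is made up, and it is shown here as an explicit user-space call, whereas the Kernel issues it internally on process exit):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/* Illustrative helper: the explicit equivalent of the exit-time cleanup,
 * dropping every POSIX lock this process holds on fd over the whole file.
 * When the MDS is down, it is the Lustre unlock request generated for this
 * F_UNLCK that fails/gets trashed, leaving the granted FLock orphaned. */
#include &amp;lt;fcntl.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

static int unlock_whole_file(int fd)
{
        struct flock fl = {
                .l_type   = F_UNLCK,
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,         /* 0 =&amp;gt; up to EOF, i.e. the whole file */
        };

        if (fcntl(fd, F_SETLK, &amp;amp;fl) &amp;lt; 0) {
                perror(&quot;fcntl(F_UNLCK)&quot;);
                return -1;
        }
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;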

&lt;p&gt;So, how can we fix this to handle the case of an MDS crash in between? Anyway, the fix has to be as generic as possible, and it may not be obvious.&lt;/p&gt;</comment>
                            <comment id="58521" author="bfaccini" created="Tue, 14 May 2013 23:16:01 +0000"  >&lt;p&gt;Having a better/detailed look of the full lustre debug traces I may have found the reason why the UNLCK request is failed+thrown, it is because it is generated automatically by the Kernel during a fatal signal processing but then, if any problem occurs during communications with MDS, Lustre interrupts and trashes RPCs/requests when it detects a signal is pending/current in ptlrpc_set_wait()/ptlrpc_check_set() !!....&lt;/p&gt;

&lt;p&gt;Problem is that this behavior is incompatible in case of an FLock/F_UNLCK because, in any case (MDS crash/reboot + replay, or temporary comms/LNet problem) since Server/MDS will never know that covered locks can be released and also we finally end-up with the LBUG during a later umount due to these orphan granted Lustre-FLocks left.&lt;/p&gt;

&lt;p&gt;Thus, we need to find a way (keep track or retry) of FLock requests (at least F_UNLCKs) in such cases.&lt;/p&gt;</comment>
                            <comment id="59040" author="bfaccini" created="Wed, 22 May 2013 09:15:18 +0000"  >&lt;p&gt;I have a patch that fixes LBUG vs reproducer (crash of MDS when FLock set, kill/exit in-between, re-start MDT, umount on Client) and also passes auto-tests.&lt;/p&gt;

&lt;p&gt;Since problem is also in master (I ran reproducer against latest master build, ie 2.4.50 !), I will submit the same change to master and see how it runs against auto-tests full set.&lt;/p&gt;</comment>
                            <comment id="59153" author="bfaccini" created="Thu, 23 May 2013 09:12:44 +0000"  >&lt;p&gt;Ok, patch for b2_1 is at &lt;a href=&quot;http://review.whamcloud.com/6407&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6407&lt;/a&gt;. Master version is at &lt;a href=&quot;http://review.whamcloud.com/6415&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6415&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="61753" author="spitzcor" created="Wed, 3 Jul 2013 14:50:37 +0000"  >&lt;p&gt;What&apos;s the status of #6415?  Can it land?&lt;/p&gt;</comment>
                            <comment id="62280" author="bfaccini" created="Mon, 15 Jul 2013 14:11:41 +0000"  >&lt;p&gt;Hello Cory,&lt;br/&gt;
Change #6415 has just landed. But did you also experience this issue?&lt;/p&gt;</comment>
                            <comment id="62288" author="spitzcor" created="Mon, 15 Jul 2013 14:34:48 +0000"  >&lt;p&gt;Yes, we did.  We&apos;ve adopted change #6415 and it appears to be the solution to our problem.&lt;/p&gt;</comment>
                            <comment id="62292" author="bfaccini" created="Mon, 15 Jul 2013 15:06:39 +0000"  >&lt;p&gt;Thank&apos;s for your feedback !! Nothing can be better for a patch than real &quot;life&quot;/production exposure.&lt;/p&gt;</comment>
                            <comment id="62477" author="pjones" created="Wed, 17 Jul 2013 13:29:10 +0000"  >&lt;p&gt;Sebastien&lt;/p&gt;

&lt;p&gt;Any word on the b2_1 version of the patch? Does this address your original concern? Can we close the ticket?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="62983" author="sebastien.buisson" created="Thu, 25 Jul 2013 15:28:10 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;We have retrieved the patch for b2_1. It will be rolled out at CEA&apos;s next maintenance, planned for early September.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="65264" author="spitzcor" created="Wed, 28 Aug 2013 16:04:40 +0000"  >&lt;p&gt;Related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3701&quot; title=&quot;Failure on test suite posix subtest test_1: fcntl.18/fcntl.35 Unresolved&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3701&quot;&gt;&lt;del&gt;LU-3701&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;http://review.whamcloud.com/#/c/7453&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7453&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="65297" author="askulysh" created="Wed, 28 Aug 2013 19:04:52 +0000"  >&lt;p&gt;I don&apos;t like an idea of resending flocks. There is no guarantee that request will succeed.&lt;br/&gt;
Client has cleanup mechanism already. &lt;br/&gt;
ldlm_flock_completion_ast() should call ldlm_lock_decref_internal() correctly.&lt;br/&gt;
The patch should be combined with the similar patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2177&quot; title=&quot;ldlm_flock_completion_ast causes LBUG because of a race&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2177&quot;&gt;&lt;del&gt;LU-2177&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="65345" author="bfaccini" created="Thu, 29 Aug 2013 08:31:25 +0000"  >&lt;p&gt;Cory: Thanks to add a reference to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3701&quot; title=&quot;Failure on test suite posix subtest test_1: fcntl.18/fcntl.35 Unresolved&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3701&quot;&gt;&lt;del&gt;LU-3701&lt;/del&gt;&lt;/a&gt;. This is the follow-on to this ticket as its change introduced some regressions vs the POSIX test suite.&lt;/p&gt;

&lt;p&gt;Andriy: If you carefully read the history/comments of this/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2665&quot; title=&quot;LBUG while unmounting client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2665&quot;&gt;&lt;del&gt;LU-2665&lt;/del&gt;&lt;/a&gt; ticket, you will find it addresses a very particular scenario, which can be summarized as &quot;FLock/F_UNLCK requests can be trashed upon an MDS crash or communications problem, leaving orphaned FLocks&quot;. This is definitely not the race problem described in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2177&quot; title=&quot;ldlm_flock_completion_ast causes LBUG because of a race&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2177&quot;&gt;&lt;del&gt;LU-2177&lt;/del&gt;&lt;/a&gt;. And concerning the &quot;retry&quot; mechanism, in the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3701&quot; title=&quot;Failure on test suite posix subtest test_1: fcntl.18/fcntl.35 Unresolved&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3701&quot;&gt;&lt;del&gt;LU-3701&lt;/del&gt;&lt;/a&gt; change it becomes fully limited to FLock/F_UNLCK requests.&lt;/p&gt;</comment>
                            <comment id="65350" author="askulysh" created="Thu, 29 Aug 2013 10:28:26 +0000"  >&lt;p&gt;But &quot;retry&quot; mechanism doesn&apos;t guarantee that lock will reach MDS. cleanup_resource() should deal with orphaned locks also.&lt;/p&gt;</comment>
                            <comment id="65373" author="bfaccini" created="Thu, 29 Aug 2013 15:41:48 +0000"  >
&lt;p&gt;&amp;gt; But &quot;retry&quot; mechanism doesn&apos;t guarantee that lock will reach MDS&lt;br/&gt;
Yes, but we need to do our best! Particularly for FLock/F_UNLCKs (where the Server must know, otherwise other Clients/processes will stay stuck forever), which cannot be trashed. At least with retries we now also cover MDS crashes and communications problems, provided a reboot/restart/failover/fix/... eventually occurs; and if not, Lustre is dead on this Client anyway and who cares that we retry forever?&lt;/p&gt;

&lt;p&gt;&amp;gt; cleanup_resource() should deal with orphaned locks also&lt;br/&gt;
Only during evict or &quot;umount -f&quot;.&lt;/p&gt;
</comment>
                            <comment id="66847" author="adilger" created="Tue, 17 Sep 2013 16:53:33 +0000"  >&lt;p&gt;Patches are landed to b2_1 and master for 2.5.0.  Can this bug be closed?&lt;/p&gt;</comment>
                            <comment id="66901" author="sebastien.buisson" created="Wed, 18 Sep 2013 10:11:50 +0000"  >&lt;p&gt;Sure, this bug can be closed, as additional work was carried out by Bruno in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3701&quot; title=&quot;Failure on test suite posix subtest test_1: fcntl.18/fcntl.35 Unresolved&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3701&quot;&gt;&lt;del&gt;LU-3701&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="66902" author="pjones" created="Wed, 18 Sep 2013 10:20:57 +0000"  >&lt;p&gt;ok - thanks Sebastien!&lt;/p&gt;</comment>
                            <comment id="72112" author="yujian" created="Fri, 22 Nov 2013 08:21:38 +0000"  >&lt;p&gt;Patch &lt;a href=&quot;http://review.whamcloud.com/6415&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6415&lt;/a&gt; was cherry-picked to Lustre b2_4 branch.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="20191">LU-3701</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="12436" name="LU-2665-2.tgz" size="1354487" author="apercher" created="Thu, 28 Mar 2013 09:50:56 +0000"/>
                            <attachment id="12285" name="LU-2665-trace.tgz" size="62939" author="sebastien.buisson" created="Thu, 7 Mar 2013 07:39:50 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvfvb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6217</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>