<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:34:29 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3504] MDS: All cores spinning on ldlm lock in lock_res_and_lock</title>
                <link>https://jira.whamcloud.com/browse/LU-3504</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Grove&apos;s MDS ran into trouble yesterday when all available CPUs ended up spinning in lock_res_and_lock, presumably on the same ldlm_lock. The system was crashed to gather a core dump. Upon Lustre starting back up, it went through recovery and then got back into the same state, with all cores spinning in lock_res_and_lock. Rebooting again and bringing Lustre up with the &quot;abort_recov&quot; option got the system back into a usable state.&lt;/p&gt;

&lt;p&gt;The captured core dump shows all CPUs spinning with one of the following back traces:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 25091 TASK: ... CPU: 0 COMMAND: &quot;mdt00_033&quot;                                
...                                                                                
--- &amp;lt;IRQ STACK&amp;gt; ---                                                                
#17 [...] ret_from_intr                                                            
    [exception RIP: _spin_lock+30]                                                 
#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]                                   
#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]                               
#20 [...] mdt_enqueue+0x46 at ... [mdt]                                            
#21 [...] mdt_handle_common+0x648 at ... [mdt]                                     
#22 [...] mds_regular_handle+0x15 at ... [mdt]                                     
#23 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
#24 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
#25 [...] child_rip+0xa                                                            
                                                                                   
PID: 25291 TASK: ... CPU: 1 COMMAND: &quot;mdt00_088&quot;
...                                                                                
--- &amp;lt;NMI exception stack&amp;gt; ---                                                      
#6  [...] _spin_lock+0x1e                                                          
#7  [...] lock_res_and_lock+0x30 at ... [ptlrpc]                                   
#8  [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]                                  
#9  [...] ldlm_handle_enqueue+0x4ef at ... [ptlrpc]                                
#10 [...] mdt_enqueue+0x46 at ... [mdt]                                            
#11 [...] mdt_handle_common+0x648 at ... [mdt]                                     
#12 [...] mds_regular_handle+0x15 at ... [mdt]                                     
#13 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
#14 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
#15 [...] child_rip+0xa
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="19552">LU-3504</key>
            <summary>MDS: All cores spinning on ldlm lock in lock_res_and_lock</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="prakash">Prakash Surya</reporter>
                        <labels>
                            <label>sequoia</label>
                            <label>zfs</label>
                    </labels>
                <created>Tue, 25 Jun 2013 18:27:19 +0000</created>
                <updated>Fri, 21 Mar 2014 19:52:54 +0000</updated>
                            <resolved>Thu, 29 Aug 2013 13:49:40 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>14</watches>
                                                                            <comments>
                            <comment id="61354" author="bfaccini" created="Wed, 26 Jun 2013 10:34:23 +0000"  >&lt;p&gt;Was there also any backtrace dump in the syslog/console before the forced crash-dump ??&lt;br/&gt;
Is it possible to get the syslog (and console/conman log if available) for the time period covering the day of the crash ?? Also a &quot;foreach bt&quot; output from crash tool running over the crash-dump could be useful.&lt;/p&gt;</comment>
                            <comment id="61397" author="prakash" created="Wed, 26 Jun 2013 17:48:05 +0000"  >&lt;p&gt;Bruno, this system is on the SCF so I&apos;m not able to copy anything directly to you. If you have any specific things you&apos;d like me to look at, let me know; but any logs, backtraces, etc. have to be copied over by hand, so bulk requests (i.e. &quot;foreach bt&quot; from crash) aren&apos;t an option.&lt;/p&gt;

&lt;p&gt;I don&apos;t see any stack traces in the console logs other than the &quot;CPU soft lockup&quot; messages for the threads spinning on the lock that I mentioned in the description of this ticket.&lt;/p&gt;

&lt;p&gt;I&apos;m looking through the back traces and I do see one thread that looks like it might be sleeping while holding the lock the other threads are spinning on. I need to walk the code to verify this thread is holding the lock, but here is its back trace:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                         
PID: 24974 TASK: ... CPU: 19 COMMAND: &quot;mdt02_010&quot;                                  
 #0 [...] schedule                                                                 
 #1 [...] __cond_resched+0x2a at ...                                               
 #2 [...] _cond_resched+0x30 at ...                                                
 #3 [...] __kmalloc+0x130 at ...                                                   
 #4 [...] cfs_alloc+0x30 at ... [libcfs]                                           
 #5 [...] cfs_hash_create+0x151 at ... [libcfs]                                    
 #6 [...] ldlm_init_flock_export+0x54 at ... [ptlrpc]                              
 #7 [...] ldlm_process_flock_lock+0x1432 at ... [ptlrpc]                           
 #8 [...] ldlm_lock_enqueue+0x405 at ... [ptlrpc]                                  
 #9 [...] ldlm_handle_enqueue0+0x4ef at ... [ptlrpc]                               
#10 [...] mdt_enqueue+0x46 at ... [mdt]                                            
#11 [...] mdt_handle_common+0x648 at ... [mdt]                                     
#12 [...] mdt_regular_handle+0x15 at ... [mdt]                                     
#13 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
#14 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
#15 [...] child_rip+0xa at ...                                                     
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;  </comment>
                            <comment id="61398" author="bfaccini" created="Wed, 26 Jun 2013 17:57:27 +0000"  >&lt;p&gt;Sure, this one looks guilty !! BTW, can you check whether it has been in this state for a long time ? We can get an idea about that if you compare its &quot;ps -l 24974&quot; output against the most recent ones (&quot;ps -l | head&quot;).&lt;/p&gt;</comment>
                            <comment id="61401" author="prakash" created="Wed, 26 Jun 2013 18:11:29 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; ps -l 24974
[1016382628073301] [RU] PID: 24974 TASK: ... CPU: 19 COMMAND: &quot;mdt02_010&quot;
crash&amp;gt; ps -l | head -n1
[1016382635871136] [RU] PID: 25198 TASK: ... CPU: 6  COMMAND: &quot;mdt02_058&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="61403" author="prakash" created="Wed, 26 Jun 2013 18:38:29 +0000"  >&lt;p&gt;It looks like we don&apos;t have &lt;tt&gt;CONFIG_DEBUG_SPINLOCK_SLEEP&lt;/tt&gt; enabled on this system, so the kernel *&lt;b&gt;would not&lt;/b&gt;* have complained if the thread did in fact sleep while holding the lock.&lt;/p&gt;</comment>
                            <comment id="61411" author="prakash" created="Wed, 26 Jun 2013 20:17:38 +0000"  >&lt;p&gt;I think this change may have introduced this issue.. &lt;a href=&quot;http://review.whamcloud.com/2240&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2240&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="61417" author="bfaccini" created="Wed, 26 Jun 2013 21:39:56 +0000"  >&lt;p&gt;Anyway, thread &quot;mdt02_010&quot; was last scheduled less than a second before the crash, so it may not be responsible for the suspected dead-lock. Unless the code it runs only forces re-sched()s with the lock held ... I need to investigate in this direction.&lt;/p&gt;

&lt;p&gt;Also, what makes you think change #2240 could be the reason ?&lt;/p&gt;</comment>
                            <comment id="61421" author="prakash" created="Wed, 26 Jun 2013 22:01:56 +0000"  >&lt;p&gt;I have proposed a fix here: &lt;a href=&quot;http://review.whamcloud.com/6790&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6790&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="61422" author="prakash" created="Wed, 26 Jun 2013 22:07:15 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Also, what make you think change #2240 could be the reason ?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Because of this change:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;-static inline void ldlm_flock_blocking_link(struct ldlm_lock *req,
-                                            struct ldlm_lock *lock)
+static inline int ldlm_flock_blocking_link(struct ldlm_lock *req,
+                                          struct ldlm_lock *lock)
 {
+       int rc = 0;
+
         /* For server only */
         if (req-&amp;gt;l_export == NULL)
-                return;
+               return 0;
+
+       if (unlikely(req-&amp;gt;l_export-&amp;gt;exp_flock_hash == NULL)) {
+               rc = ldlm_init_flock_export(req-&amp;gt;l_export);
+               if (rc)
+                       goto error;
+       }
 
-        LASSERT(cfs_list_empty(&amp;amp;req-&amp;gt;l_flock_waitq));
-        cfs_write_lock(&amp;amp;req-&amp;gt;l_export-&amp;gt;exp_flock_wait_lock);
+       LASSERT(cfs_hlist_unhashed(&amp;amp;req-&amp;gt;l_exp_flock_hash));
 
         req-&amp;gt;l_policy_data.l_flock.blocking_owner =
                 lock-&amp;gt;l_policy_data.l_flock.owner;
         req-&amp;gt;l_policy_data.l_flock.blocking_export =
-                class_export_get(lock-&amp;gt;l_export);
-
-        cfs_list_add_tail(&amp;amp;req-&amp;gt;l_flock_waitq,
-                          &amp;amp;req-&amp;gt;l_export-&amp;gt;exp_flock_wait_list);
-        cfs_write_unlock(&amp;amp;req-&amp;gt;l_export-&amp;gt;exp_flock_wait_lock);
+               lock-&amp;gt;l_export;
+       req-&amp;gt;l_policy_data.l_flock.blocking_refs = 0;
+
+       cfs_hash_add(req-&amp;gt;l_export-&amp;gt;exp_flock_hash,
+                    &amp;amp;req-&amp;gt;l_policy_data.l_flock.owner,
+                    &amp;amp;req-&amp;gt;l_exp_flock_hash);
+error:
+       return rc;
 }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Prior to &lt;a href=&quot;http://review.whamcloud.com/2240&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2240&lt;/a&gt;, ldlm_flock_blocking_link() didn&apos;t make any allocations; now it might, via ldlm_init_flock_export(). If the ldlm lock is held at this point, as I &lt;em&gt;think&lt;/em&gt; it is, that is bad news. I&apos;m also a little wary of the added call to cfs_hash_add, but I haven&apos;t checked the code to determine whether that might schedule.&lt;/p&gt;</comment>
                            <comment id="61453" author="bfaccini" created="Thu, 27 Jun 2013 16:37:29 +0000"  >&lt;p&gt;Prakash,&lt;br/&gt;
I agree about this possible dead-lock scenario, but did you check that all spinning threads are spinning on the same resource spin-lock, the one currently owned by PID 24974 ?&lt;br/&gt;
Also, I put a comment on your patch because I think the FLock hashing setup needs to occur only for MDT exports that need it.&lt;/p&gt;</comment>
                            <comment id="61455" author="prakash" created="Thu, 27 Jun 2013 17:04:44 +0000"  >&lt;blockquote&gt;
&lt;p&gt;I agree about this possible dead-lock scenario, but did you check that all spinning threads are spinning on the same resource spin-lock, the one currently owned by PID 24974 ?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;No, I&apos;m not certain of this, but circumstantial evidence supports the claim. I spent a bit of time trying to find the ldlm lock in the stack of one of the spinning threads, but was unsuccessful. I could look again if you think it would be time well spent, but even if I found the lock for each spinning thread, I&apos;m doubtful it would give us useful information without CONFIG_DEBUG_SPINLOCK enabled. If I did find the lock for each spinning thread, what information would you like me to gather from it?&lt;/p&gt;</comment>
                            <comment id="61456" author="bfaccini" created="Thu, 27 Jun 2013 18:00:20 +0000"  >&lt;p&gt;Is the MDS node x86 ? If yes, the RDI register of the NMI&apos;ed spinning threads should be &amp;amp;((struct ldlm_resource *)-&amp;gt;lr_lock). Then a quick&amp;amp;dirty way to find out whether it is locked by PID 24974 is a bit tricky: first find what pointer/address is stored at &amp;#91;the stack address where the (ldlm_handle_enqueue0+0x4ef) return address is stored, minus 0x8&amp;#93;, subtract 0x40 from it, and then dump that new stack address/memory location to get the ldlm_lock address and see if its l_resource field matches ...&lt;/p&gt;

&lt;p&gt;If it is not an x86 arch, I need to work on it !!...&lt;/p&gt;</comment>
                            <comment id="61461" author="prakash" created="Thu, 27 Jun 2013 19:21:44 +0000"  >&lt;p&gt;It is x86_64, I&apos;ll see what I can come up with.&lt;/p&gt;</comment>
                            <comment id="61462" author="jhammond" created="Thu, 27 Jun 2013 19:49:35 +0000"  >&lt;p&gt;Prakash, I worked on getting backtraces out of crash dumps a bit this year. If you are interested in trying a crash module to do that then you can find it at &lt;a href=&quot;https://github.com/jhammond/xbt&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/jhammond/xbt&lt;/a&gt;. If it works for you then great, if not then we don&apos;t officially support it and I never commented on this issue.&lt;/p&gt;</comment>
                            <comment id="61463" author="prakash" created="Thu, 27 Jun 2013 21:13:34 +0000"  >&lt;p&gt;John, Thanks for the pointer. I&apos;ll have to give that a look when I get some time. Some of the other projects on your page look interesting as well..&lt;/p&gt;

&lt;p&gt;Here&apos;s what I got out of the crash dump so far..&lt;/p&gt;

&lt;p&gt;All of the spinning threads have this address in the RDI register &lt;tt&gt;0xffff880c74a0bad8&lt;/tt&gt;&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                                                                         
crash&amp;gt; p -x *(spinlock_t *)0xffff880c74a0bad8                                      
$3 = {                                                                             
  raw_lock = {                                                                     
    slock = 0x5065504c                                                             
  }                                                                                
}                                                                                  
                                                                                   
# This is interesting. The x86_64 implementation uses ticket spin locks            
# such that the spin lock is split into two halves, &quot;next&quot; and &quot;owner&quot;.            
# See: http://lwn.net/Articles/267968                                              
# Let&apos;s subtract the two to determine how many threads are waiting on              
# the lock:                                                                        
                                                                                   
crash&amp;gt; p 0x5065-0x504c                                                             
$20 = 25                                                                           
                                                                                   
# This aligns well: we have 24 threads spinning on this lock (all 24
# cores of the system) and one thread holding it.
                                                                                   
crash&amp;gt; struct -xo ldlm_resource                                                    
struct ldlm_resource {                                                             
...                                                                                
    [0x18] spinlock_t lr_lock; # offset is 0x18 bytes into ldlm_resource           
...                                                                                
}                                                                                  
                                                                                   
crash&amp;gt; p -x *((struct ldlm_resource *)(0xffff880c74a0bad8-0x18))                   
... # looks like a valid structure, good :)                                        
                                                                                   
crash&amp;gt; p -x 0xffff880c74a0bad8-0x18                                                
$30 = 0xffff880c74a0bac0 # The address of the ldlm_resource
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;                                                                         

&lt;p&gt;I still need to examine the thread making the kmalloc call and check whether the lock it holds matches (or not) this ldlm_resource.&lt;/p&gt;</comment>
                            <comment id="61493" author="behlendorf" created="Fri, 28 Jun 2013 17:06:29 +0000"  >&lt;p&gt;After looking at the offending code I began wondering why these are spin locks in the first place.  Typically, spin locks are used to protect tiny performance critical sections of the code.  Using them more broadly as this code does can lead to issues exactly like this where a function is used improperly because the executing context is unclear.  So my question is why aren&apos;t these locks a mutex?  Is this just historical?&lt;/p&gt;</comment>
                            <comment id="61813" author="bfaccini" created="Thu, 4 Jul 2013 08:39:22 +0000"  >&lt;p&gt;Hello Brian and all,&lt;br/&gt;
I agree with you: the fact that a spin-lock is used instead of a mutex/semaphore is questionable, but that is more a design subject that must be addressed/evaluated in a separate ticket/task.&lt;br/&gt;
Concerning the fix for this ticket&apos;s problem, I made a negative comment on Prakash&apos;s patch (&lt;a href=&quot;http://review.whamcloud.com/6790&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6790&lt;/a&gt;) because I think it will move the FLock hashing setup (introduced in &lt;a href=&quot;http://review.whamcloud.com/2240&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2240&lt;/a&gt;) to a place where the allocation will occur unnecessarily (on OSS/MGS, and for MDS/MDT exports that never use FLocks), so I pushed 2 patches because I think there could be 2 ways to fix it :&lt;/p&gt;

&lt;p&gt;         _ &lt;a href=&quot;http://review.whamcloud.com/6843&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6843&lt;/a&gt; goes in the same direction as Prakash&apos;s but tries to ensure that the FLock hashing setup only occurs when necessary, and with no resource-&amp;gt;lr_lock held for sure !&lt;/p&gt;

&lt;p&gt;         _ &lt;a href=&quot;http://review.whamcloud.com/6844&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6844&lt;/a&gt; does not change the current FLock algorithms but just ensures that the hashing kmem allocation occurs atomically.&lt;/p&gt;

&lt;p&gt;But actually both patches hit strange auto-test failures (like in sanity-hsm !) that I need to investigate ...&lt;/p&gt;</comment>
                            <comment id="61966" author="bfaccini" created="Tue, 9 Jul 2013 19:35:00 +0000"  >&lt;p&gt;Both patches finally passed the tests !!... Now I need reviewers to give their feedback. Even though the two patches are different ways to fix the original problem in this ticket, we may also decide that they are complementary: one ensures that the FLock hashing setup occurs without holding lr_lock, which is not bad, and the other ensures that all layers allocating hashing data structures do it atomically, which is also a good idea !!&lt;/p&gt;</comment>
                            <comment id="62072" author="green" created="Thu, 11 Jul 2013 05:40:45 +0000"  >&lt;p&gt;Apparently there is an alternative fix from Xyratex for this issue: &lt;a href=&quot;http://review.whamcloud.com/#/c/5471/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/5471/&lt;/a&gt; I wonder if it&apos;s better?&lt;/p&gt;</comment>
                            <comment id="62078" author="bfaccini" created="Thu, 11 Jul 2013 08:38:16 +0000"  >&lt;p&gt;As I already commented negatively on Prakash&apos;s change #6790 for this same &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3504&quot; title=&quot;MDS: All cores spinning on ldlm lock in lock_res_and_lock&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3504&quot;&gt;&lt;del&gt;LU-3504&lt;/del&gt;&lt;/a&gt;, doing the FLock hash allocation in ldlm_init_export() will cause it to be allocated for all types of exports when it is only required for MDTs, and even then not for all clients/exports. That is why I decided to do it on the first FLock on an export.&lt;/p&gt;

&lt;p&gt;If you agree with this, maybe I/you should also comment on it in Change #5471/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2835&quot; title=&quot;mds crash, cfs_hash_bd_del_locked()) ASSERTION( bd-&amp;gt;bd_bucket-&amp;gt;hsb_count &amp;gt; 0 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2835&quot;&gt;&lt;del&gt;LU-2835&lt;/del&gt;&lt;/a&gt; ??&lt;/p&gt;</comment>
                            <comment id="62080" author="bfaccini" created="Thu, 11 Jul 2013 08:56:19 +0000"  >&lt;p&gt;Liang negatively commented in &lt;a href=&quot;http://review.whamcloud.com/6844&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6844&lt;/a&gt; because as he wrote &quot;Please don&apos;t force cfs_hash to use atomic allocator, lustre can have very large hash table and can have hundreds of thousands cfs_hash instances.&quot;. So if, according to his experience with hashing, #6844 is not a good idea finally, I need to succeed with #6843 ...&lt;/p&gt;</comment>
                            <comment id="62090" author="vitaly_fertman" created="Thu, 11 Jul 2013 10:48:13 +0000"  >&lt;p&gt;resolved by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2835&quot; title=&quot;mds crash, cfs_hash_bd_del_locked()) ASSERTION( bd-&amp;gt;bd_bucket-&amp;gt;hsb_count &amp;gt; 0 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2835&quot;&gt;&lt;del&gt;LU-2835&lt;/del&gt;&lt;/a&gt;. export is checked in ldlm_init_flock_export().&lt;/p&gt;</comment>
                            <comment id="62117" author="bfaccini" created="Thu, 11 Jul 2013 15:30:00 +0000"  >&lt;p&gt;That is true; I read change #5471 too quickly and missed that only MDT exports are checked. But again, don&apos;t you think the allocation will still occur even for exports/clients that will never use FLocks ?&lt;/p&gt;

&lt;p&gt;In fact we have the choice between an unconditional, but maybe unnecessary, allocation in change #5471 and an allocation on first need in #6843 ... each having respectively a memory and a cpu/test cost.&lt;/p&gt;</comment>
                            <comment id="62616" author="bfaccini" created="Fri, 19 Jul 2013 15:33:40 +0000"  >&lt;p&gt;Hello Prakash,&lt;br/&gt;
It seems that after Liang&apos;s feedback/veto on my 2nd fix proposal (change #6844) I will have to abandon it, but what do you think about your original patch proposal (change #6790) ??&lt;/p&gt;</comment>
                            <comment id="62713" author="prakash" created="Mon, 22 Jul 2013 16:41:11 +0000"  >&lt;p&gt;After Liang&apos;s feedback, I agree, #6844 is unnecessary and incorrect. I think, in order to fix this issue, we just need to land one of either: #5471, #6790, or #6843. All three of those patches address the same issue in slightly different ways, so as long as one of those is landed, I believe this will be fixed.&lt;/p&gt;</comment>
                            <comment id="63133" author="jlevi" created="Mon, 29 Jul 2013 13:23:18 +0000"  >&lt;p&gt;Change #5471 has landed to master. Should this ticket now be closed and the other patches abandoned? Or are the other patches also still needed?&lt;/p&gt;</comment>
                            <comment id="63158" author="prakash" created="Mon, 29 Jul 2013 16:13:13 +0000"  >&lt;p&gt;AFAIK, this issue is fixed as of &lt;a href=&quot;http://review.whamcloud.com/#/c/5471/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/5471/&lt;/a&gt; landing (no other patches needed).&lt;/p&gt;</comment>
                            <comment id="63159" author="prakash" created="Mon, 29 Jul 2013 16:16:42 +0000"  >&lt;p&gt;Actually, I&apos;m reopening this. I see it&apos;s marked to be fixed for 2.5.0 and 2.4.1. The patch that landed is for master which should resolve 2.5.0, but we still need to backport the patch to fix 2.4.1.&lt;/p&gt;</comment>
                            <comment id="63179" author="jlevi" created="Mon, 29 Jul 2013 18:28:18 +0000"  >&lt;p&gt;Reducing from 2.5 blocker as the master patch landed, but leaving open until it is ported to 2.4.1.&lt;/p&gt;</comment>
                            <comment id="63353" author="bfaccini" created="Wed, 31 Jul 2013 07:33:30 +0000"  >&lt;p&gt;Patch &lt;a href=&quot;http://review.whamcloud.com/6843&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6843&lt;/a&gt; has just been abandoned in favor of change &lt;a href=&quot;http://review.whamcloud.com/#/c/5471/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/5471/&lt;/a&gt; from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2835&quot; title=&quot;mds crash, cfs_hash_bd_del_locked()) ASSERTION( bd-&amp;gt;bd_bucket-&amp;gt;hsb_count &amp;gt; 0 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2835&quot;&gt;&lt;del&gt;LU-2835&lt;/del&gt;&lt;/a&gt;, as per the reviewers&apos;/gatekeeper&apos;s decision.&lt;/p&gt;</comment>
                            <comment id="64052" author="pjones" created="Sun, 11 Aug 2013 15:45:04 +0000"  >&lt;p&gt;So can we close this issue as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2835&quot; title=&quot;mds crash, cfs_hash_bd_del_locked()) ASSERTION( bd-&amp;gt;bd_bucket-&amp;gt;hsb_count &amp;gt; 0 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2835&quot;&gt;&lt;del&gt;LU-2835&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="64452" author="bfaccini" created="Mon, 19 Aug 2013 08:40:47 +0000"  >&lt;p&gt;I also think there is nothing else to do with this ticket, but Prakash, do you also agree with Peter&apos;s suggestion ?&lt;/p&gt;</comment>
                            <comment id="64693" author="prakash" created="Tue, 20 Aug 2013 23:31:15 +0000"  >&lt;p&gt;If this fix doesn&apos;t need to be backported to 2.4, then yes, it can be closed.&lt;/p&gt;</comment>
                            <comment id="65262" author="bfaccini" created="Wed, 28 Aug 2013 15:39:08 +0000"  >&lt;p&gt;I verified that the &lt;a href=&quot;http://review.whamcloud.com/#/c/5471/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/5471/&lt;/a&gt; patch/change for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2835&quot; title=&quot;mds crash, cfs_hash_bd_del_locked()) ASSERTION( bd-&amp;gt;bd_bucket-&amp;gt;hsb_count &amp;gt; 0 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2835&quot;&gt;&lt;del&gt;LU-2835&lt;/del&gt;&lt;/a&gt; is in origin/b2_4 branch already.&lt;br/&gt;
So can we close ?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="17629">LU-2835</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="23830">LU-4801</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvtz3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>8822</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>