[LU-3504] MDS: All cores spinning on ldlm lock in lock_res_and_lock Created: 25/Jun/13  Updated: 21/Mar/14  Resolved: 29/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Prakash Surya (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: sequoia, zfs

Issue Links:
Related
is related to LU-2835 mds crash, cfs_hash_bd_del_locked()) ... Resolved
is related to LU-4801 spin lock contention in lock_res_and_... Resolved
Severity: 3
Rank (Obsolete): 8822

 Description   

Grove's MDS had some trouble yesterday when it got into a situation where all available CPUs were spinning in lock_res_and_lock, presumably on the same ldlm_lock. The system was crashed to gather a core dump. Upon Lustre starting back up, it went through recovery and then got back into the same state where all cores were spinning in lock_res_and_lock. Rebooting again and bringing Lustre up with the "abort_recov" option got the system back into a usable state.

The captured core dump shows all CPUs spinning with one of the following backtraces:

PID: 25091 TASK: ... CPU: 0 COMMAND: "mdt00_033"                                
...                                                                                
--- <IRQ STACK> ---                                                                
#17 [...] ret_from_intr                                                            
    [exception RIP: _spin_lock+30]                                                 
#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]                                   
#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]                               
#20 [...] mdt_enqueue+0x46 at ... [mdt]                                            
#21 [...] mdt_handle_common+0x648 at ... [mdt]                                     
#22 [...] mds_regular_handle+0x15 at ... [mdt]                                     
#23 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
#24 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
#25 [...] child_rip+0xa                                                            
                                                                                   
PID: 25291 TASK: ... CPU:1 COMMAND: "mdt00_088"                                    
...                                                                                
--- <NMI exception stack> ---                                                      
#6  [...] _spin_lock+0x1e                                                          
#7  [...] lock_res_and_lock+0x30 at ... [ptlrpc]                                   
#8  [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]                                  
#9  [...] ldlm_handle_enqueue+0x4ef at ... [ptlrpc]                                
#10 [...] mdt_enqueue+0x46 at ... [mdt]                                            
#11 [...] mdt_handle_common+0x648 at ... [mdt]                                     
#12 [...] mds_regular_handle+0x15 at ... [mdt]                                     
#13 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
#14 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
#15 [...] child_rip+0xa
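
As an aside, here is a minimal userspace simulation (pthreads; hypothetical code, not Lustre) of the symptom described above: one thread blocks while holding a spin lock, and every other thread burns a CPU spinning on the same lock.

/* spin_holder_sim.c - hedged illustration only, not Lustre code.
 * One thread grabs a spin lock and then sleeps while holding it; the
 * remaining threads spin at 100% CPU inside pthread_spin_lock(), which is
 * roughly the state the MDS cores were found in.
 * Build: gcc -pthread spin_holder_sim.c -o spin_holder_sim */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t res_lock;     /* stands in for a resource lr_lock */

static void *holder(void *arg)
{
        pthread_spin_lock(&res_lock);
        sleep(5);                       /* "sleeping while holding the lock" */
        pthread_spin_unlock(&res_lock);
        return NULL;
}

static void *waiter(void *arg)
{
        pthread_spin_lock(&res_lock);   /* spins here the whole time */
        pthread_spin_unlock(&res_lock);
        return NULL;
}

int main(void)
{
        pthread_t t[8];
        int i;

        pthread_spin_init(&res_lock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&t[0], NULL, holder, NULL);
        usleep(1000);                   /* let the holder take the lock first */
        for (i = 1; i < 8; i++)
                pthread_create(&t[i], NULL, waiter, NULL);
        for (i = 0; i < 8; i++)
                pthread_join(t[i], NULL);
        printf("all threads done\n");
        return 0;
}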


 Comments   
Comment by Bruno Faccini (Inactive) [ 26/Jun/13 ]

Was there also any backtrace dumped in the syslog/console before the forced crash dump?
Is it possible to get the syslog (and console/conman log, if available) for the time period covering the day of the crash? Also, a "foreach bt" output from the crash tool run over the crash dump could be useful.

Comment by Prakash Surya (Inactive) [ 26/Jun/13 ]

Bruno, this system is on the SCF so I'm not able to copy anything directly to you. If you have any specific things you'd like me to look at, let me know; but any logs, backtraces, etc. have to be copied over by hand, so that's not an option for bulk requests (e.g. "foreach bt" from crash).

I don't see any stack traces in the console logs other than the "CPU soft lockup" messages for the threads spinning on the lock that I mentioned in the description of this ticket.

I'm looking through the backtraces and I do see one thread that looks like it might be sleeping while holding the lock the other threads are spinning on. I need to walk the code to verify this thread is holding the lock, but here is its backtrace:

                                                                         
PID: 24974 TASK: ... CPU: 19 COMMAND: "mdt02_010"                                  
 #0 [...] schedule                                                                 
 #1 [...] __cond_resched+0x2a at ...                                               
 #2 [...] _cond_resched+0x30 at ...                                                
 #3 [...] __kmalloc+0x130 at ...                                                   
 #4 [...] cfs_alloc+0x30 at ... [libcfs]                                           
 #5 [...] cfs_hash_create+0x151 at ... [libcfs]                                    
 #6 [...] ldlm_init_flock_export+0x54 at ... [ptlrpc]                              
 #7 [...] ldlm_process_flock_lock+0x1432 at ... [ptlrpc]                           
 #8 [...] ldlm_lock_enqueue+0x405 at ... [ptlrpc]                                  
 #9 [...] ldlm_handle_enqueue0+0x4ef at ... [ptlrpc]                               
#10 [...] mdt_enqueue+0x46 at ... [mdt]                                            
#11 [...] mdt_handle_common+0x648 at ... [mdt]                                     
#12 [...] mdt_regular_handle+0x15 at ... [mdt]                                     
#13 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
#14 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
#15 [...] child_rip+0xa at ...                                                     
Comment by Bruno Faccini (Inactive) [ 26/Jun/13 ]

Sure, this one looks guilty! BTW, can you check whether it has been in this state for a long time? We can get an idea about that by comparing its "ps -l 24974" output against the latest ones ("ps -l | head").

Comment by Prakash Surya (Inactive) [ 26/Jun/13 ]
crash> ps -l 24974
[1016382628073301] [RU] PID: 24974 TASK: ... CPU: 19 COMMAND: "mdt02_010"
crash> ps -l | head -n1
[1016382635871136] [RU] PID: 25198 TASK: ... CPU: 6  COMMAND: "mdt02_058"
Comment by Prakash Surya (Inactive) [ 26/Jun/13 ]

It looks like we don't have CONFIG_DEBUG_SPINLOCK_SLEEP enabled on this system, so the kernel *would not* have complained if the thread did in fact sleep while holding the lock.
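
For reference, a minimal sketch (hypothetical names, not the actual Lustre code) of the pattern CONFIG_DEBUG_SPINLOCK_SLEEP is designed to catch: a GFP_KERNEL allocation under a spinlock can call into the scheduler, and with that option enabled the allocator's might_sleep() check would print a "sleeping function called from invalid context" warning; without it, as here, the kernel stays silent.

/* Hypothetical sketch, not Lustre code: sleeping in atomic context.
 * With CONFIG_DEBUG_SPINLOCK_SLEEP enabled, the might_sleep() check in
 * the allocator warns about this; without it the bug goes unreported
 * and the other CPUs just spin on the lock. */
#include <linux/spinlock.h>
#include <linux/slab.h>

static DEFINE_SPINLOCK(example_lock);

static void *bad_alloc_under_spinlock(size_t size)
{
        void *buf;

        spin_lock(&example_lock);
        /* GFP_KERNEL may block (reclaim, cond_resched); in atomic
         * context GFP_ATOMIC would have to be used instead. */
        buf = kmalloc(size, GFP_KERNEL);
        spin_unlock(&example_lock);
        return buf;
}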

Comment by Prakash Surya (Inactive) [ 26/Jun/13 ]

I think this change may have introduced this issue: http://review.whamcloud.com/2240

Comment by Bruno Faccini (Inactive) [ 26/Jun/13 ]

Anyway, thread "mdt02_010" was last scheduled less than a second before the crash, so it may not be responsible for the suspected deadlock. Unless the code it runs only forces resched()s with the lock held ... I need to investigate in this direction.

Also, what makes you think change #2240 could be the reason?

Comment by Prakash Surya (Inactive) [ 26/Jun/13 ]

I have proposed a fix here: http://review.whamcloud.com/6790

Comment by Prakash Surya (Inactive) [ 26/Jun/13 ]

Also, what makes you think change #2240 could be the reason?

Because of this change:

-static inline void ldlm_flock_blocking_link(struct ldlm_lock *req,
-                                            struct ldlm_lock *lock)
+static inline int ldlm_flock_blocking_link(struct ldlm_lock *req,
+                                          struct ldlm_lock *lock)
 {
+       int rc = 0;
+
         /* For server only */
         if (req->l_export == NULL)
-                return;
+               return 0;
+
+       if (unlikely(req->l_export->exp_flock_hash == NULL)) {
+               rc = ldlm_init_flock_export(req->l_export);
+               if (rc)
+                       goto error;
+       }
 
-        LASSERT(cfs_list_empty(&req->l_flock_waitq));
-        cfs_write_lock(&req->l_export->exp_flock_wait_lock);
+       LASSERT(cfs_hlist_unhashed(&req->l_exp_flock_hash));
 
         req->l_policy_data.l_flock.blocking_owner =
                 lock->l_policy_data.l_flock.owner;
         req->l_policy_data.l_flock.blocking_export =
-                class_export_get(lock->l_export);
-
-        cfs_list_add_tail(&req->l_flock_waitq,
-                          &req->l_export->exp_flock_wait_list);
-        cfs_write_unlock(&req->l_export->exp_flock_wait_lock);
+               lock->l_export;
+       req->l_policy_data.l_flock.blocking_refs = 0;
+
+       cfs_hash_add(req->l_export->exp_flock_hash,
+                    &req->l_policy_data.l_flock.owner,
+                    &req->l_exp_flock_hash);
+error:
+       return rc;
 }

Prior to http://review.whamcloud.com/2240, ldlm_flock_blocking_link() didn't make any allocations; now it might, via ldlm_init_flock_export(). If the ldlm lock is held at this point, as I think it is, that is bad news. I'm also a little wary of the added call to cfs_hash_add, but I haven't checked the code to determine whether that might schedule.
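
Schematically, the suspected path looks like the following (reconstructed from the PID 24974 backtrace above; the exact call sites are an assumption, not copied from the source):

/* Schematic reconstruction of the suspected deadlock path, based on the
 * PID 24974 backtrace above:
 *
 *   ldlm_lock_enqueue()
 *     lock_res_and_lock(lock);            // takes res->lr_lock (a spinlock)
 *     ldlm_process_flock_lock()
 *       ldlm_flock_blocking_link()
 *         ldlm_init_flock_export()        // first flock on this export
 *           cfs_hash_create()
 *             cfs_alloc() / __kmalloc()   // sleeping allocation: may cond_resched()
 *     unlock_res_and_lock(lock);
 *
 * If __kmalloc() schedules, the thread goes to sleep still owning lr_lock,
 * and every other service thread that enqueues on the same resource spins
 * in lock_res_and_lock() until the sleeper runs again.
 */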

Comment by Bruno Faccini (Inactive) [ 27/Jun/13 ]

Prakash,
I agree about this possible deadlock scenario, but did you check that all spinning threads are spinning on the same resource spin-lock, the one currently owned by PID 24974?
Also, I put a comment on your patch because I think the FLock hashing stuff needs to occur only for the MDT exports that need it.

Comment by Prakash Surya (Inactive) [ 27/Jun/13 ]

I agree about this possible deadlock scenario, but did you check that all spinning threads are spinning on the same resource spin-lock, the one currently owned by PID 24974?

No, I'm not certain of this, but circumstantial evidence supports the claim. I took a bit of time trying to find the ldlm lock from the stack of one of the spinning threads, but was unsuccessful. I could look again if you think it would be time well spent, but even if I found the lock for each spinning thread I'm doubtful it would give us useful information without CONFIG_DEBUG_SPINLOCK enabled. If I did find the lock for each spinning thread, what information would you like me to gather from it?

Comment by Bruno Faccini (Inactive) [ 27/Jun/13 ]

Is the MDS node arch x86? If yes, the RDI register of the NMI'ed spinning threads should be &((struct ldlm_resource *)->lr_lock). Then a quick-and-dirty way to find out whether it is used/locked by PID 24974 is a bit tricky: you first need to find what pointer/address is stored at [the stack address where the (ldlm_handle_enqueue0+0x4ef) return address is stored - 0x8], subtract 0x40 from it, and then dump this new stack address/memory location to get the ldlm_lock address and see whether its l_resource field matches ...

If it is not an x86 arch, I need to work on it!

Comment by Prakash Surya (Inactive) [ 27/Jun/13 ]

It is x86_64, I'll see what I can come up with.

Comment by John Hammond [ 27/Jun/13 ]

Prakash, I worked on getting backtraces out of crash dumps a bit this year. If you are interested in trying a crash module to do that then you can find it at https://github.com/jhammond/xbt. If it works for you then great, if not then we don't officially support it and I never commented on this issue.

Comment by Prakash Surya (Inactive) [ 27/Jun/13 ]

John, Thanks for the pointer. I'll have to give that a look when I get some time. Some of the other projects on your page look interesting as well..

Here's what I got out of the crash dump so far:

All of the spinning threads have this address in the RDI register: 0xffff880c74a0bad8

                                                                         
crash> p -x *(spinlock_t *)0xffff880c74a0bad8                                      
$3 = {                                                                             
  raw_lock = {                                                                     
    slock = 0x5065504c                                                             
  }                                                                                
}                                                                                  
                                                                                   
# This is interesting. The x86_64 implementation uses ticket spin locks            
# such that the spin lock is split into two halves, "next" and "owner".            
# See: http://lwn.net/Articles/267968                                              
# Let's subtract the two to determine how many threads are waiting on              
# the lock:                                                                        
                                                                                   
crash> p 0x5065-0x504c                                                             
$20 = 25                                                                           
                                                                                   
# This aligns well: we have 24 threads spinning on this lock (all 24
# cores of the system) and one thread holding the lock.
                                                                                   
crash> struct -xo ldlm_resource                                                    
struct ldlm_resource {                                                             
...                                                                                
    [0x18] spinlock_t lr_lock; # offset is 0x18 bytes into ldlm_resource           
...                                                                                
}                                                                                  
                                                                                   
crash> p -x *((struct ldlm_resource *)(0xffff880c74a0bad8-0x18))                   
... # looks like a valid structure, good :)                                        
                                                                                   
crash> p -x 0xffff880c74a0bad8-0x18                                                
$30 = 0xffff880c74a0bac0 # The address of the ldlm_resource

I still need to examine the thread making the kmalloc call and match it up (or not) with this spinning thread.
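
As a sanity check on the arithmetic above, here is a small standalone sketch (userspace C, illustration only; it assumes the 2.6.32-era x86_64 ticket-lock layout with a 16-bit "next" half above a 16-bit "owner" half) that decodes the raw slock value from the dump:

/* ticket_decode.c - illustration only; assumes the old x86_64 layout
 * where the 32-bit slock word holds a 16-bit "next" ticket counter in
 * the high half and the 16-bit "owner" (currently served) ticket in
 * the low half. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t slock = 0x5065504c;        /* value read from the dump */
        uint16_t next  = slock >> 16;       /* 0x5065: next ticket to hand out */
        uint16_t owner = slock & 0xffff;    /* 0x504c: ticket being served */

        /* tickets outstanding = holder + spinners */
        printf("outstanding = %d\n", next - owner);   /* prints 25 */
        return 0;
}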

Comment by Brian Behlendorf [ 28/Jun/13 ]

After looking at the offending code I began wondering why these are spin locks in the first place. Typically, spin locks are used to protect tiny, performance-critical sections of code. Using them more broadly, as this code does, can lead to issues exactly like this one, where a function is used improperly because the executing context is unclear. So my question is: why aren't these locks a mutex? Is this just historical?
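
For comparison, a hedged sketch (hypothetical names, not a proposal for the actual ldlm code) of what a mutex buys here: the holder is allowed to sleep, so a blocking allocation under the lock puts waiters to sleep instead of burning every core.

/* Hypothetical sketch: the same critical section protected by a mutex.
 * mutex_lock() may sleep, so waiters block (schedule out) rather than
 * spin, and the holder may legally call a sleeping allocator. */
#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(example_mutex);

static void *alloc_under_mutex(size_t size)
{
        void *buf;

        mutex_lock(&example_mutex);
        buf = kmalloc(size, GFP_KERNEL);   /* legal: we may sleep here */
        mutex_unlock(&example_mutex);
        return buf;
}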

Comment by Bruno Faccini (Inactive) [ 04/Jul/13 ]

Hello Brian and all,
I agree with you: the fact that a spin-lock is used instead of a mutex/semaphore is questionable, but that is more of a design subject that should be addressed/evaluated in a separate ticket/task.
Concerning the fix for this ticket's problem, I made a negative comment on Prakash's patch (http://review.whamcloud.com/6790) because I think it moves the FLock hashing stuff (introduced in http://review.whamcloud.com/2240) to a place where the allocation will occur unnecessarily (OSS/MGS, and MDS/MDT exports where there is no FLock), so I pushed 2 patches because I think there are 2 ways to fix it:

_ http://review.whamcloud.com/6843 goes in the same direction as Prakash's patch but tries to ensure that the FLock hashing setup only occurs when necessary, and with no resource->lr_lock held for sure!

_ http://review.whamcloud.com/6844 does not change the current FLock algorithms but just ensures that the hashing kmem allocation occurs atomically (both directions are sketched below).

But actually both patches are hitting strange auto-test failures (like in sanity-hsm!) that I need to investigate ...
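
For illustration, a hedged sketch of the two directions (hypothetical names and types, not the actual patches):

/* Hypothetical sketch of the two directions, not the actual patches. */
#include <linux/slab.h>
#include <linux/errno.h>

struct example_export {                 /* hypothetical stand-in for an export */
        void *flock_hash;
};

static void *example_hash_create(gfp_t gfp)
{
        return kzalloc(256, gfp);       /* stand-in for the real hash allocation */
}

/* Direction 1 (the #6843 idea): perform the sleeping hash allocation
 * before the resource spinlock is taken, so nothing executed under
 * lr_lock can schedule. */
static int prepare_flock_hash(struct example_export *exp)
{
        /* may sleep: called with no spinlock held */
        if (exp->flock_hash == NULL)
                exp->flock_hash = example_hash_create(GFP_KERNEL);
        return exp->flock_hash ? 0 : -ENOMEM;
}

/* Direction 2 (the #6844 idea): keep the existing call site under the
 * spinlock, but allocate with GFP_ATOMIC so the allocation cannot sleep,
 * at the price of being more likely to fail under memory pressure since
 * atomic allocations cannot wait for reclaim. */
static int init_flock_hash_atomic(struct example_export *exp)
{
        exp->flock_hash = example_hash_create(GFP_ATOMIC);
        return exp->flock_hash ? 0 : -ENOMEM;
}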

Comment by Bruno Faccini (Inactive) [ 09/Jul/13 ]

Both patches finally passed the tests! Now I need reviewers to give their feedback. Even though the two patches are different ways to fix the original problem of this ticket, we may also decide that they are complementary, since one ensures that the FLock hashing setup occurs without holding lr_lock, which is not bad, and the other ensures that all layers allocating hashing data structures do so atomically, which is also a good idea!

Comment by Oleg Drokin [ 11/Jul/13 ]

Apparently there is an alternative fix from Xyratex for this issue: http://review.whamcloud.com/#/c/5471/ . I wonder if it's better?

Comment by Bruno Faccini (Inactive) [ 11/Jul/13 ]

Because, as I already commented negatively on Prakash's change #6790 for this same LU-3504, doing the FLock-related hashing allocation in ldlm_init_export() will cause it to be allocated for all types of exports when it is only required for MDTs, and even there not for all clients/exports. That is why I decided to do it on the first FLock on an export.

If you agree on this, maybe I/you should also comment on this in change #5471/LU-2835?

Comment by Bruno Faccini (Inactive) [ 11/Jul/13 ]

Liang commented negatively on http://review.whamcloud.com/6844 because, as he wrote, "Please don't force cfs_hash to use atomic allocator, lustre can have very large hash table and can have hundreds of thousands cfs_hash instances." So if, according to his experience with hashing, #6844 is not a good idea after all, I need to succeed with #6843 ...

Comment by Vitaly Fertman [ 11/Jul/13 ]

Resolved by LU-2835. The export is checked in ldlm_init_flock_export().

Comment by Bruno Faccini (Inactive) [ 11/Jul/13 ]

That is true; I read change #5471 too quickly and missed that only MDT exports are checked. But again, don't you think the allocation will still occur even for exports/clients that will never use FLocks?

In fact we have the choice between an unconditional, but maybe unnecessary, allocation in change #5471 and allocation on first need in #6843 ... each having, respectively, a memory cost and a cpu/test cost.

Comment by Bruno Faccini (Inactive) [ 19/Jul/13 ]

Hello Prakash,
It seems that after Liang's feedback/veto on my 2nd fix proposal (change #6844) I will have to abandon it, but what do you think about your original patch proposal (change #6790)?

Comment by Prakash Surya (Inactive) [ 22/Jul/13 ]

After Liang's feedback, I agree: #6844 is unnecessary and incorrect. I think that, in order to fix this issue, we just need to land one of #5471, #6790, or #6843. All three of those patches address the same issue in slightly different ways, so as long as one of them lands, I believe this will be fixed.

Comment by Jodi Levi (Inactive) [ 29/Jul/13 ]

Change #5471 has landed to master. Should this ticket now be closed and the other patches abandoned? Or are the other patches also still needed?

Comment by Prakash Surya (Inactive) [ 29/Jul/13 ]

AFAIK, this issue is fixed as of http://review.whamcloud.com/#/c/5471/ landing (no other patches needed).

Comment by Prakash Surya (Inactive) [ 29/Jul/13 ]

Actually, I'm reopening this. I see it's marked to be fixed for 2.5.0 and 2.4.1. The patch that landed is for master, which should resolve 2.5.0, but we still need to backport the patch to fix 2.4.1.

Comment by Jodi Levi (Inactive) [ 29/Jul/13 ]

Reducing from 2.5 blocker as the master patch landed, but leaving open until it is ported to 2.4.1.

Comment by Bruno Faccini (Inactive) [ 31/Jul/13 ]

Patch http://review.whamcloud.com/6843 has just been abandoned in favor of change http://review.whamcloud.com/#/c/5471/ from LU-2835, as per the reviewers'/gatekeeper's decision.

Comment by Peter Jones [ 11/Aug/13 ]

So can we close this issue as a duplicate of LU-2835?

Comment by Bruno Faccini (Inactive) [ 19/Aug/13 ]

I also think there is nothing else to do with this ticket, but Prakash, do you agree with Peter's suggestion?

Comment by Prakash Surya (Inactive) [ 20/Aug/13 ]

If this fix doesn't need to be backported to 2.4, then yes, it can be closed.

Comment by Bruno Faccini (Inactive) [ 28/Aug/13 ]

I verified that the http://review.whamcloud.com/#/c/5471/ patch/change for LU-2835 is already in the origin/b2_4 branch.
So can we close?
