Lustre / LU-2597

Soft lockup messages on Sequoia in ptlrpc_check_set under ll_fsync


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.4.0
    • Severity: 3
    • Rank: 6051

    Description

      This looks similar to LU-2263, but looking at the stacks and function address listings, I'm not yet certain they are the same issue.

      When running a large IOR using our 2.3.58-3chaos tag, we saw many CPU soft lockup messages on the consoles of many clients. All of the stacks I have been able to uncover look like this:

      NIP [c000000000439724] ._spin_lock+0x2c/0x44                                       
      LR [8000000003a8e81c] .ptlrpc_check_set+0x43cc/0x5120 [ptlrpc]                     
      Call Trace:                                                                        
      [c0000003ec7bf600] [8000000003a8e75c] .ptlrpc_check_set+0x430c/0x5120 [ptlrpc] (unreliable)
      [c0000003ec7bf7b0] [8000000003a8fa5c] .ptlrpc_set_wait+0x4ec/0xcb0 [ptlrpc]        
      [c0000003ec7bf920] [8000000003a90974] .ptlrpc_queue_wait+0xd4/0x380 [ptlrpc]       
      [c0000003ec7bf9e0] [80000000059c3564] .mdc_sync+0x104/0x340 [mdc]                  
      [c0000003ec7bfa90] [800000000709bf08] .lmv_sync+0x2c8/0x820 [lmv]                  
      [c0000003ec7bfb80] [80000000068232dc] .ll_fsync+0x23c/0xc50 [lustre]               
      [c0000003ec7bfc80] [c000000000100420] .vfs_fsync_range+0xb0/0x104                  
      [c0000003ec7bfd30] [c000000000100518] .do_fsync+0x3c/0x6c                          
      [c0000003ec7bfdc0] [c000000000100588] .SyS_fsync+0x18/0x28                         
      [c0000003ec7bfe30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Searching through all of the console logs for that day, I found three unique addresses within ptlrpc_check_set listed:

      $ grep -E "LR.*ptlrpc_check_set" R* | grep 2013-01-08 | \
        sed 's/.*\(ptlrpc_check_set+[a-z0-9]\+\).*/\1/' | \
        sort | uniq
      ptlrpc_check_set+0x43cc
      ptlrpc_check_set+0x50c
      ptlrpc_check_set+0xeec
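
      (For reference, the listings below can be reproduced by feeding each offset to gdb against the ptlrpc module's debuginfo; the .ko.debug path here is just a guess at the lustre-debuginfo layout, adjust as needed.)

      $ # resolve each LR offset to a file:line using the ptlrpc debuginfo
      $ for off in 0x50c 0xeec 0x43cc; do
          gdb -batch -ex "list *ptlrpc_check_set+$off" \
            /usr/lib/debug/lib/modules/$(uname -r)/extra/lustre/ptlrpc.ko.debug
        done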
      

      Listing these addresses is a bit confusing to me; they aren't near any spin_lock calls, as I had expected. Rather, they seem to correlate with debug print statements.

      ptlrpc_check_set+0x43cc
      (gdb) l *ptlrpc_check_set+0x43cc
      0x4e81c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1113).
      
      1109 static int ptlrpc_check_status(struct ptlrpc_request *req)                      
      1110 {                                                                               
      1111         int err;                                                                
      1112         ENTRY;                                                                  
      1113                                                                                 
      1114         err = lustre_msg_get_status(req->rq_repmsg);                            
      1115         if (lustre_msg_get_type(req->rq_repmsg) == PTL_RPC_MSG_ERR) {           
      1116                 struct obd_import *imp = req->rq_import;                        
      1117                 __u32 opc = lustre_msg_get_opc(req->rq_reqmsg);
      
      ptlrpc_check_set+0x50c
      (gdb) l *ptlrpc_check_set+0x50c
      0x4a95c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1774).
      
      1771                 ptlrpc_rqphase_move(req, RQ_PHASE_COMPLETE);                    
      1772                                                                                 
      1773                 CDEBUG(req->rq_reqmsg != NULL ? D_RPCTRACE : 0,                 
      1774                         "Completed RPC pname:cluuid:pid:xid:nid:"               
      1775                         "opc %s:%s:%d:"LPU64":%s:%d\n",                         
      1776                         cfs_curproc_comm(), imp->imp_obd->obd_uuid.uuid,        
      1777                         lustre_msg_get_status(req->rq_reqmsg), req->rq_xid,
      

      Any ideas?

      I'm working on getting CONFIG_DEBUG_SPINLOCK and CONFIG_LOCK_STAT enabled on the IONs. I'm hopeful that will work and will allow me to determine which spin lock the threads are contending on.
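
      For reference, the rough plan once that kernel is built, assuming the ION kernel exposes the stock lock_stat interface from Documentation/lockstat.txt (if the kernel.lock_stat sysctl isn't present on this kernel, collection is simply always on while CONFIG_LOCK_STAT is enabled):

      # on the ION, as root:
      echo 0 > /proc/lock_stat              # clear any existing counters
      echo 1 > /proc/sys/kernel/lock_stat   # start collecting
      # ... re-run the IOR job and reproduce the fsync soft lockups ...
      echo 0 > /proc/sys/kernel/lock_stat   # stop collecting
      less /proc/lock_stat                  # the contentions/waittime columns should point at the hot lock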

      Attachments

        Issue Links

          Activity

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: