Lustre / LU-2597

Soft lockup messages on Sequoia in ptlrpc_check_set under ll_fsync

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.4.0
    • Severity: 3
    • 6051

    Description

      This looks similar to LU-2263, but looking at the stacks and function address listings I'm not certain they are the same just yet.

      When running a large IOR using our 2.3.58-3chaos tag, we saw many CPU soft lockup messages on the consoles of many clients. All of the stacks I have been able to uncover look like this:

      NIP [c000000000439724] ._spin_lock+0x2c/0x44                                       
      LR [8000000003a8e81c] .ptlrpc_check_set+0x43cc/0x5120 [ptlrpc]                     
      Call Trace:                                                                        
      [c0000003ec7bf600] [8000000003a8e75c] .ptlrpc_check_set+0x430c/0x5120 [ptlrpc] (unreliable)
      [c0000003ec7bf7b0] [8000000003a8fa5c] .ptlrpc_set_wait+0x4ec/0xcb0 [ptlrpc]        
      [c0000003ec7bf920] [8000000003a90974] .ptlrpc_queue_wait+0xd4/0x380 [ptlrpc]       
      [c0000003ec7bf9e0] [80000000059c3564] .mdc_sync+0x104/0x340 [mdc]                  
      [c0000003ec7bfa90] [800000000709bf08] .lmv_sync+0x2c8/0x820 [lmv]                  
      [c0000003ec7bfb80] [80000000068232dc] .ll_fsync+0x23c/0xc50 [lustre]               
      [c0000003ec7bfc80] [c000000000100420] .vfs_fsync_range+0xb0/0x104                  
      [c0000003ec7bfd30] [c000000000100518] .do_fsync+0x3c/0x6c                          
      [c0000003ec7bfdc0] [c000000000100588] .SyS_fsync+0x18/0x28                         
      [c0000003ec7bfe30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Searching through all of the console logs for that day, I found three unique addresses listed within ptlrpc_check_set:

      $ grep -E "LR.*ptlrpc_check_set" R* | grep 2013-01-08 | \
        sed 's/.*\(ptlrpc_check_set+[a-z0-9]\+\).*/\1/' | \
        sort | uniq
      ptlrpc_check_set+0x43cc
      ptlrpc_check_set+0x50c
      ptlrpc_check_set+0xeec
      

      Listing these addresses is a bit confusing to me; they aren't near any spin_lock calls as I expected. Rather, they seem to correspond to debug print statements.

      ptlrpc_check_set+0x43cc
      (gdb) l *ptlrpc_check_set+0x43cc
      0x4e81c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1113).
      
      1109 static int ptlrpc_check_status(struct ptlrpc_request *req)                      
      1110 {                                                                               
      1111         int err;                                                                
      1112         ENTRY;                                                                  
      1113                                                                                 
      1114         err = lustre_msg_get_status(req->rq_repmsg);                            
      1115         if (lustre_msg_get_type(req->rq_repmsg) == PTL_RPC_MSG_ERR) {           
      1116                 struct obd_import *imp = req->rq_import;                        
      1117                 __u32 opc = lustre_msg_get_opc(req->rq_reqmsg);
      
      ptlrpc_check_set+0x50c
      (gdb) l *ptlrpc_check_set+0x50c
      0x4a95c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1774).
      
      1771                 ptlrpc_rqphase_move(req, RQ_PHASE_COMPLETE);                    
      1772                                                                                 
      1773                 CDEBUG(req->rq_reqmsg != NULL ? D_RPCTRACE : 0,                 
      1774                         "Completed RPC pname:cluuid:pid:xid:nid:"               
      1775                         "opc %s:%s:%d:"LPU64":%s:%d\n",                         
      1776                         cfs_curproc_comm(), imp->imp_obd->obd_uuid.uuid,        
      1777                         lustre_msg_get_status(req->rq_reqmsg), req->rq_xid,
      

      Any ideas?

      I'm working on getting CONFIG_DEBUG_SPINLOCK and CONFIG_LOCK_STAT enabled on the IONs. I'm hopeful that will allow me to determine which spin lock the threads are contending on.
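
      For reference, once those configs are in place, the plan is basically the stock lock_stat workflow on an ION while reproducing the IOR run (nothing Lustre-specific here, just the standard /proc interface):

      # clear any old counters, then enable collection (as root on the ION)
      echo 0 > /proc/lock_stat
      echo 1 > /proc/sys/kernel/lock_stat

      # ... reproduce the IOR workload / soft lockups ...

      # stop collection and look at the most contended lock classes
      echo 0 > /proc/sys/kernel/lock_stat
      less /proc/lock_stat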

Attachments

Issue Links

Activity

            [LU-2597] Soft lockup messages on Sequoia in ptlrpc_check_set under ll_fsync
            jlevi Jodi Levi (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            yong.fan nasf (Inactive) made changes -
            Status Original: Open [ 1 ] New: In Progress [ 3 ]
            prakash Prakash Surya (Inactive) made changes -
            Assignee Original: Prakash Surya [ prakash ] New: nasf [ yong.fan ]
            prakash Prakash Surya (Inactive) made changes -
            Assignee Original: nasf [ yong.fan ] New: Prakash Surya [ prakash ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-2263 [ LU-2263 ]
            adilger Andreas Dilger made changes -
            Assignee Original: Alex Zhuravlev [ bzzz ] New: nasf [ yong.fan ]
            jlevi Jodi Levi (Inactive) made changes -
            Labels Original: sequoia topsequoia New: MB sequoia topsequoia
            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Alex Zhuravlev [ bzzz ]
            prakash Prakash Surya (Inactive) created issue -

            People

              Assignee: nasf (Inactive) [ yong.fan ]
              Reporter: Prakash Surya (Inactive) [ prakash ]
              Votes: 0
              Watchers: 7
