Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.4.0
    • 2.3.56-2chaos (github.com/chaos/lustre)
    • 3
    • 5629

    Description

      During the fsync that follows the write phase of a file-per-process IOR run, we're seeing many Lustre clients on Sequoia getting stuck in soft lockups.

      See attached console log for one of the nodes in file RB5-ID-J02.log.

      This may be a similar problem to LU-2332, but under a user-space process, not under ptlrpcd.

          Activity

            [LU-2366] Soft lockups under ptlrpc_check_set
            pjones Peter Jones added a comment -

            As per LLNL, close and reopen if this reoccurs

            jay Jinshan Xiong (Inactive) added a comment -

            I think this is the same problem as LU-2263, and the root cause is high contention on the import lock. Just in case, I'll leave this ticket open but lower the priority.

            prakash Prakash Surya (Inactive) added a comment -

            This looks similar to what we saw in yesterday's testing, which I commented on in LU-2263.

            "can you please show me the source code at ptlrpc_check_set+0x4f4/0x4e80?"

            This was taken from modules and code built from our 2.3.56-3chaos tag (the log is running 2.3.56-2chaos, but I don't think ptlrpc_check_set has changed):

            (gdb) l *ptlrpc_check_set+0x4f4
            0x46ed4 is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c:1852).
            1847    /builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c: No such file or directory.
                    in /builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c

            1849                         libcfs_nid2str(imp->imp_connection->c_peer.nid),
            1850                         lustre_msg_get_opc(req->rq_reqmsg));
            1851
            1852                 cfs_spin_lock(&imp->imp_lock);
            1853                 /* Request already may be not on sending or delaying list. This
            1854                  * may happen in the case of marking it erroneous for the case
            1855                  * ptlrpc_import_delay_req(req, status) find it impossible to
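For readers following the exchange above: in a kernel backtrace, a frame such as ptlrpc_check_set+0x4f4/0x4e80 is conventionally read as "offset 0x4f4 into a function that is 0x4e80 bytes long", and the gdb command shown in the comment uses only the function name and offset. The snippet below is a minimal sketch of that translation, not part of the ticket; the helper names `parse_frame` and `gdb_list_command` are hypothetical, and it assumes the frame is already isolated from the rest of the console log line.

```python
import re

# Matches the "symbol+offset/size" form commonly seen in kernel backtraces.
_FRAME_RE = re.compile(
    r"^(?P<func>[A-Za-z_][\w.]*)"
    r"\+(?P<offset>0x[0-9a-fA-F]+)"
    r"/(?P<size>0x[0-9a-fA-F]+)$"
)


def parse_frame(frame):
    """Split 'func+0xOFF/0xSIZE' into (func, offset, size) with integer values."""
    match = _FRAME_RE.match(frame.strip())
    if match is None:
        raise ValueError("not a symbol+offset/size frame: %r" % frame)
    return (
        match.group("func"),
        int(match.group("offset"), 16),
        int(match.group("size"), 16),
    )


def gdb_list_command(frame):
    """Build the gdb 'l *func+0xOFF' command for a backtrace frame."""
    func, offset, size = parse_frame(frame)
    if offset >= size:
        # An offset beyond the function size suggests a mangled frame.
        raise ValueError("offset lies outside the function: %r" % frame)
    return "l *%s+%#x" % (func, offset)


if __name__ == "__main__":
    print(gdb_list_command("ptlrpc_check_set+0x4f4/0x4e80"))
```

The resulting command only resolves to a source line when gdb is run against a module built with debug information that matches the running kernel module, which is why the comment notes the difference between the 2.3.56-2chaos and 2.3.56-3chaos builds.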

            jay Jinshan Xiong (Inactive) added a comment -

            Can you please show me the source code at ptlrpc_check_set+0x4f4/0x4e80?

            Thanks.

            bzzz Alex Zhuravlev added a comment -

            Please have a look.

            pjones Peter Jones added a comment -

            Alex will triage

            People

              jay Jinshan Xiong (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 7
