[LU-2366] Soft lockups under ptlrpc_check_set Created: 20/Nov/12  Updated: 03/Dec/12  Resolved: 03/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: sequoia
Environment:

2.3.56-2chaos (github.com/chaos/lustre)


Attachments: File RB5-ID-J02.log    
Issue Links:
Related
is related to LU-2263 CPU Soft Lockups due to many threads ... Resolved
Severity: 3
Rank (Obsolete): 5629

 Description   

During the fsync that follows the write phase of a file-per-process IOR run, we're seeing many Lustre clients on Sequoia getting stuck in soft lockups.

See attached console log for one of the nodes in file RB5-ID-J02.log.

This may be a similar problem to LU-2332, but under a user-space process, not under ptlrpcd.



 Comments   
Comment by Peter Jones [ 20/Nov/12 ]

Alex will triage

Comment by Alex Zhuravlev [ 21/Nov/12 ]

please have a look

Comment by Jinshan Xiong (Inactive) [ 28/Nov/12 ]

can you please show me the source code at ptlrpc_check_set+0x4f4/0x4e80?

Thanks.

Comment by Prakash Surya (Inactive) [ 29/Nov/12 ]

This looks similar to what we saw in yesterday's testing, which I commented on in LU-2263.

"can you please show me the source code at ptlrpc_check_set+0x4f4/0x4e80?"

This was taken from modules and code built from our 2.3.56-3chaos tag (the log is from a node running 2.3.56-2chaos, but I don't think ptlrpc_check_set has changed between the two):

(gdb) l *ptlrpc_check_set+0x4f4
0x46ed4 is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c:1852).
1847    /builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c: No such file or directory.
        in /builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c
1849                         libcfs_nid2str(imp->imp_connection->c_peer.nid),        
1850                         lustre_msg_get_opc(req->rq_reqmsg));                    
1851                                                                                 
1852                 cfs_spin_lock(&imp->imp_lock);                                  
1853                 /* Request already may be not on sending or delaying list. This 
1854                  * may happen in the case of marking it erroneous for the case  
1855                  * ptlrpc_import_delay_req(req, status) find it impossible to

Comment by Jinshan Xiong (Inactive) [ 30/Nov/12 ]

I think this is the same problem as LU-2263, and the root cause is high contention on the import lock. Just in case, I'll leave this ticket open but lower the priority.
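
As a rough user-space illustration of that kind of contention (not Lustre code), the sketch below has several threads repeatedly taking one shared spinlock, much as many callers of ptlrpc_check_set would all take the same imp_lock. Every name in it (fake_import, fake_spin_lock, run_contended) is hypothetical, and a C11 atomic_flag stands in for the kernel spinlock. The spins counter records failed acquisition attempts, which is the time a real CPU would burn while the soft lockup watchdog is counting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 64

/* Hypothetical stand-in for an import: one lock guards all shared state. */
struct fake_import {
    atomic_flag lock;     /* plays the role of imp_lock */
    long        counter;  /* shared state, only touched with lock held */
    atomic_long spins;    /* number of failed lock attempts (busy-wait turns) */
};

struct worker_arg {
    struct fake_import *imp;
    long                iters;
};

/* Busy-wait until the flag is acquired, counting every failed attempt. */
static void fake_spin_lock(struct fake_import *imp)
{
    while (atomic_flag_test_and_set_explicit(&imp->lock, memory_order_acquire))
        atomic_fetch_add_explicit(&imp->spins, 1, memory_order_relaxed);
}

static void fake_spin_unlock(struct fake_import *imp)
{
    atomic_flag_clear_explicit(&imp->lock, memory_order_release);
}

/* Each worker takes the shared lock once per iteration. */
static void *worker(void *p)
{
    struct worker_arg *arg = p;

    for (long i = 0; i < arg->iters; i++) {
        fake_spin_lock(arg->imp);
        arg->imp->counter++;
        fake_spin_unlock(arg->imp);
    }
    return NULL;
}

/*
 * Run nthreads workers against a single lock. Returns the final counter
 * (nthreads * iters when all threads start) and stores the number of
 * wasted spins in *spins_out.
 */
static long run_contended(int nthreads, long iters, long *spins_out)
{
    struct fake_import imp = { .lock = ATOMIC_FLAG_INIT, .counter = 0 };
    struct worker_arg arg = { .imp = &imp, .iters = iters };
    pthread_t tids[MAX_THREADS];
    int started = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    atomic_init(&imp.spins, 0);

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, worker, &arg) != 0)
            break;  /* keep going with the threads that did start */
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    *spins_out = atomic_load(&imp.spins);
    return imp.counter;
}
```

Calling run_contended with a growing thread count on a multi-core machine would be expected to show the spin count rising much faster than the useful work, although the exact numbers depend on the scheduler and hardware.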

Comment by Peter Jones [ 03/Dec/12 ]

As per LLNL, closing; reopen if this recurs.
