[LU-2366] Soft lockups under ptlrpc_check_set Created: 20/Nov/12 Updated: 03/Dec/12 Resolved: 03/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | sequoia |
| Environment: | 2.3.56-2chaos (github.com/chaos/lustre) |
| Attachments: | RB5-ID-J02.log (console log; see Description) |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 5629 |
| Description |
|
During the fsync after the write phase of a file-per-process IOR run, we're seeing many Lustre clients on Sequoia getting stuck in soft lockups. See the attached console log from one of the nodes, in file RB5-ID-J02.log. This may be a similar problem to |
| Comments |
| Comment by Peter Jones [ 20/Nov/12 ] |
|
Alex will triage |
| Comment by Alex Zhuravlev [ 21/Nov/12 ] |
|
Please have a look. |
| Comment by Jinshan Xiong (Inactive) [ 28/Nov/12 ] |
|
Can you please show me the source code at ptlrpc_check_set+0x4f4/0x4e80? Thanks. |
| Comment by Prakash Surya (Inactive) [ 29/Nov/12 ] |
|
This looks similar to what we saw in yesterday's testing, which I commented on in
This was taken from modules and code built from our 2.3.56-3chaos tag (the log is from a node running 2.3.56-2chaos, but I don't think ptlrpc_check_set has changed), so:

(gdb) l *ptlrpc_check_set+0x4f4
0x46ed4 is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c:1852).
1847    /builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c: No such file or directory.
        in /builddir/build/BUILD/lustre-2.3.56/lustre/ptlrpc/client.c
1849                           libcfs_nid2str(imp->imp_connection->c_peer.nid),
1850                           lustre_msg_get_opc(req->rq_reqmsg));
1851
1852            cfs_spin_lock(&imp->imp_lock);
1853            /* Request already may be not on sending or delaying list. This
1854             * may happen in the case of marking it erroneous for the case
1855             * ptlrpc_import_delay_req(req, status) find it impossible to |
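The listing above places the reported offset right at the cfs_spin_lock(&imp->imp_lock) call in ptlrpc_check_set(), i.e. the CPU was spinning on the per-import lock when the watchdog fired. Below is a minimal, hypothetical C sketch of that locking pattern; struct import, struct request and request_unlink here are simplified placeholders, not the real Lustre definitions. It only illustrates how heavy contention on one per-import spinlock, for example when a very large number of requests complete at once during the fsync after a file-per-process IOR write, could keep CPUs spinning long enough to trip the soft-lockup watchdog.

/* Hypothetical sketch only -- simplified names, not the real Lustre code. */
#include <linux/spinlock.h>
#include <linux/list.h>

struct import {
        spinlock_t       imp_lock;          /* guards the request lists below */
        struct list_head imp_sending_list;
        struct list_head imp_delayed_list;
};

struct request {
        struct list_head rq_list;           /* linkage on a per-import list */
};

/* Every request completion serializes on the same per-import spinlock.
 * If many requests finish at once, waiters spin here; a slow or very busy
 * lock holder can then surface as a soft lockup in the caller, which is
 * what the ptlrpc_check_set traces above look like. */
static void request_unlink(struct import *imp, struct request *req)
{
        spin_lock(&imp->imp_lock);          /* client.c:1852 in the listing */
        if (!list_empty(&req->rq_list))
                list_del_init(&req->rq_list);
        spin_unlock(&imp->imp_lock);
}

In the real code the lock is taken via cfs_spin_lock(&imp->imp_lock) and the surrounding logic in ptlrpc_check_set() is considerably more involved; the sketch is only meant to show why the watchdog report lands on that line.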
| Comment by Jinshan Xiong (Inactive) [ 30/Nov/12 ] |
|
I think this is the same problem as |
| Comment by Peter Jones [ 03/Dec/12 ] |
|
As per LLNL, closing; reopen if this recurs. |