[LU-1949] SWL - mds wedges 'still busy with 1 RPC' Created: 16/Sep/12 Updated: 02/Jul/15 Resolved: 02/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | SWL Hyperion LLNL |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 10169 |
| Description |
|
Running SWL, the MDS gradually goes into a wedged state: clients get -EBUSY and the MDS never clears the stuck RPC. Rebooted the MDS to recover.

Sep 15 15:23:34 hyperion770 kernel: Lustre: 8865:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347747046/real 1347747046] req@ffff880176204800 x1413191623533468/t0(0) o101->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1:12/10 lens 592/1136 e 3 to 1 dl 1347747806 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
Sep 15 15:21:09 hyperion-rst6 kernel: req@ffff880254359050 x1413191623533468/t0(0) o101->d17e0f27-22a5-38fb-14c0-313655de63cd@192.168.117.51@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747674 ref 2 fl Interpret:/0/0 rc 0/0
----------- |
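Both log lines in the description carry the same request id (the `x1413191623533468` token), which is what ties the client-side timeout to the request the MDS still holds. As a rough illustration, the sketch below pulls the fields out of such a line and compares the two. The regular expressions and the `parse_request` helper are hypothetical, written only against the two lines quoted in this ticket; the reading of `to` as a timed-out flag and `dl` as a deadline in epoch seconds is an assumption, not something stated in the ticket.

```python
import re

# Sample lines copied from the description (syslog prefix trimmed).
CLIENT_LINE = (
    "[sent 1347747046/real 1347747046] req@ffff880176204800 "
    "x1413191623533468/t0(0) "
    "o101->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1:12/10 "
    "lens 592/1136 e 3 to 1 dl 1347747806 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1"
)
SERVER_LINE = (
    "req@ffff880254359050 x1413191623533468/t0(0) "
    "o101->d17e0f27-22a5-38fb-14c0-313655de63cd@192.168.117.51@o2ib1:0/0 "
    "lens 592/1152 e 3 to 0 dl 1347747674 ref 2 fl Interpret:/0/0 rc 0/0"
)

# Hypothetical pattern, fitted to the two lines above only.
REQ_RE = re.compile(
    r"req@(?P<addr>[0-9a-f]+)\s+"
    r"x(?P<xid>\d+)/t(?P<transno>\d+)\(\d+\)\s+"
    r"o(?P<opc>\d+)->(?P<target>\S+)\s+"
    r"lens (?P<reqlen>\d+)/(?P<replen>\d+)\s+"
    r"e (?P<early>\d+)\s+to (?P<to>\d+)\s+"
    r"dl (?P<dl>\d+)"
)
SENT_RE = re.compile(r"\[sent (?P<sent>\d+)/real (?P<real>\d+)\]")


def parse_request(line):
    """Return a dict of request fields, or None if the line has no req@ dump."""
    m = REQ_RE.search(line)
    if m is None:
        return None
    sent = SENT_RE.search(line)
    return {
        "xid": int(m.group("xid")),
        "opcode": int(m.group("opc")),
        "target": m.group("target"),
        "timed_out": m.group("to") == "1",   # assumed meaning of "to"
        "deadline": int(m.group("dl")),      # assumed epoch seconds
        "sent": int(sent.group("sent")) if sent else None,
    }


client = parse_request(CLIENT_LINE)
server = parse_request(SERVER_LINE)
print(client["xid"] == server["xid"])        # True
print(client["deadline"] - client["sent"])   # 760
```

Under those assumptions, the client-side deadline sits 760 seconds after the send time, and the MDS-side line refers to the same xid.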
| Comments |
| Comment by Peter Jones [ 16/Sep/12 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 17/Sep/12 ] |
|
I think the MDS log starts a little too late for this scenario; I cannot tell why the MDS was stuck on the RPC. Can you try to grab MDS logs while it is handling the request that is about to time out? In this case, capture what the MDS did with the client request "req@ffff880176204800 x1413191623533468/t0(0) o101->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1" |
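One way to narrow a captured log down to the request in question is to filter on its xid. The sketch below is a minimal illustration: `lines_for_xid` is a hypothetical helper, and it assumes the log prints the request in the same `x<xid>/t...` form seen in the description, so messages that spell the xid differently would be missed. The third sample line is made up purely to show a non-matching entry.

```python
import re


def lines_for_xid(lines, xid):
    """Yield log lines that contain the token 'x<xid>/' for the given xid."""
    pattern = re.compile(r"\bx%d/" % xid)
    for line in lines:
        if pattern.search(line):
            yield line


# First two lines are abbreviated from the ticket; the third is hypothetical.
sample = [
    "hyperion770 kernel: req@ffff880176204800 x1413191623533468/t0(0) o101",
    "hyperion-rst6 kernel: req@ffff880254359050 x1413191623533468/t0(0) o101",
    "example-node kernel: req@0000000000000000 x1413191623533469/t0(0) o400",
]

matches = list(lines_for_xid(sample, 1413191623533468))
for line in matches:
    print(line)
```

In practice the `lines` argument could be an open file object for the dumped debug log, since the helper only iterates over it line by line.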
| Comment by Peter Jones [ 21/Sep/12 ] |
|
Dropping priority as unable to reproduce |
| Comment by nasf (Inactive) [ 21/Sep/12 ] |
|
There are some unfinished RPCs on the export, which prevented the client from reconnecting, but I cannot find the related RPC processing in the lustre-debug log. If there is a "ps" log showing what the RPCs were, or a stack trace showing what the RPC service threads were doing, that would be very helpful. Anyway, it does not seem to be a duplicate of |
| Comment by Cliff White (Inactive) [ 29/Sep/12 ] |
|
vmcore is at ~cliffw/lu1948/erofs on brent. |
| Comment by Andreas Dilger [ 02/Jul/15 ] |
|
Closing old bug. |