[LU-4206] Sanity test_120e fails with 1 blocking RPC occured. Created: 04/Nov/13 Updated: 28/Apr/16 Resolved: 08/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11434 |
| Description |
|
Sanity test_120e is failing intermittently in some test runs.
Lustre: lustre-MDT0000: Not available for connect from 10.10.16.108@tcp (stopping)
It might be that the MDT is being unmounted while clients are communicating with it.
On Client |
| Comments |
| Comment by Andreas Dilger [ 27/Nov/13 ] |
|
Only found 2 failures in the past few weeks. |
| Comment by John Hammond [ 07/May/14 ] |
|
I see this occasionally from autotest and also when running locally. IIUC, when I see it, the blocking callback is from the OST (handling OST_DESTROY) for the rename onto victim. AFAICT we don't do ELC for objects of rename victims. Is that correct?
00000100:00100000:0.0:1399475307.568943:0:21604:0:service.c:2090:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_004:lustre-MDT0000-mdtlov_UUID+4:11360:x1467451757057912:12345-0@lo:6
...
00010000:00010000:0.0:1399475307.569026:0:21604:0:(ldlm_lock.c:715:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 2/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x40000000000000 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 0 lvb_type: 0
...
00010000:00010000:0.0:1399475307.569074:0:21604:0:(ldlm_lockd.c:848:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 3/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x50000000010020 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 0 lvb_type: 0
...
00010000:00010000:0.0:1399475307.569081:0:21604:0:(ldlm_lockd.c:459:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 150, AT: on) ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 4/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x70000000010020 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 4451652756 lvb_type: 0
...
00000100:00000040:0.0:1399475307.569088:0:21604:0:(lustre_net.h:3296:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc" req@ffff88012c7032f0 x1467451757057916/t0(0) o104->lustre-OST0000@0@lo:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
...
00000100:00100000:0.0:1399475307.569097:0:21604:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_ost00_004:lustre-OST0000_UUID:21604:1467451757057916:0@lo:104 |
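When reading a trace like the one above, it can help to pull out just the RPC opcodes. The following is a minimal sketch, not a Lustre tool: it assumes the "Handling RPC" / "Sending RPC" lines always end with the `pname:...:nid:opc` field shown in the excerpt, and the helper name `rpc_opcode` is hypothetical. The opcode names are taken from the reading in this comment (6 is OST_DESTROY, 104 is the LDLM blocking callback).

```python
import re

# Opcodes seen in the excerpt; names follow the interpretation in this ticket.
OPCODE_NAMES = {6: "OST_DESTROY", 104: "LDLM_BL_CALLBACK"}

# Matches e.g. "Sending RPC pname:cluuid:pid:xid:nid:opc <values>".
# Group 2 is the colon-separated value field that follows the field legend.
RPC_RE = re.compile(r"(Handling|Sending) RPC \S+ (\S+)")


def rpc_opcode(line):
    """Return (direction, opcode) for an RPC log line, or None otherwise."""
    match = RPC_RE.search(line)
    if match is None:
        return None
    direction, fields = match.groups()
    # The opcode is assumed to be the last colon-separated value.
    opcode_text = fields.rpartition(":")[2]
    if not opcode_text.isdigit():
        return None
    return direction, int(opcode_text)


# Two lines copied from the trace above.
SAMPLE_LINES = [
    "00000100:00100000:0.0:1399475307.568943:0:21604:0:service.c:2090:"
    "ptlrpc_server_handle_request()) Handling RPC "
    "pname:cluuid+ref:pid:xid:nid:opc ll_ost00_004:"
    "lustre-MDT0000-mdtlov_UUID+4:11360:x1467451757057912:12345-0@lo:6",
    "00000100:00100000:0.0:1399475307.569097:0:21604:0:(client.c:1480:"
    "ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc "
    "ll_ost00_004:lustre-OST0000_UUID:21604:1467451757057916:0@lo:104",
]

if __name__ == "__main__":
    for sample in SAMPLE_LINES:
        direction, opcode = rpc_opcode(sample)
        print(direction, opcode, OPCODE_NAMES.get(opcode, "unknown"))
```

Read this way, the excerpt shows the OST service thread handling an OST_DESTROY that came from the MDT and, as a consequence, sending a blocking callback.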
| Comment by Di Wang [ 07/May/14 ] |
|
As I understand it, we actually do ELC on the OSC as well. But after 2.4, it is the MDT that sends the destroy request to the OST; the OST then revokes the lock in the client cache and destroys the object. So the client does not have a chance to do ELC here, and we might see some blocking RPCs. IIRC, test_120 is only supposed to check ELC for metadata objects, so we probably need to fix the test script here. |
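The difference between the two destroy paths can be illustrated with a toy model. This is only a sketch of the reasoning in this comment, not Lustre code: the `Client` class and the function names are hypothetical, and locking is reduced to a set of object ids.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    # Object ids for which the client holds a cached OST extent lock.
    cached_locks: set = field(default_factory=set)
    blocking_asts_received: int = 0

    def cancel_early(self, obj):
        """ELC: drop the cached lock voluntarily, before any server asks."""
        self.cached_locks.discard(obj)

    def blocking_ast(self, obj):
        """Server-initiated revocation; this is the RPC the test counts."""
        self.blocking_asts_received += 1
        self.cached_locks.discard(obj)


def ost_destroy(client, obj):
    """The OST must revoke a conflicting client lock before destroying."""
    if obj in client.cached_locks:
        client.blocking_ast(obj)


def client_driven_destroy(client, obj):
    """Client sends the destroy itself, so it can cancel its lock first."""
    client.cancel_early(obj)
    ost_destroy(client, obj)


def mdt_driven_destroy(client, obj):
    """MDT sends the destroy; the client gets no chance to cancel first."""
    ost_destroy(client, obj)


if __name__ == "__main__":
    a = Client({"victim"})
    client_driven_destroy(a, "victim")
    b = Client({"victim"})
    mdt_driven_destroy(b, "victim")
    print(a.blocking_asts_received, b.blocking_asts_received)
```

In this model the MDT-driven path always produces one blocking callback when the client still caches the lock, which matches the "1 blocking RPC" reported by the test.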
| Comment by John Hammond [ 07/May/14 ] |
|
Yes, from osc_destroy() but this is not until after md_rename() has returned. |
| Comment by Andreas Dilger [ 07/May/14 ] |
|
The correct solution is to have the client cancel the OST locks for the file's objects if it thinks this is the last unlink of the file and it is not open on the client. That should avoid the blocking RPC from the OST, and also allow the page cleanup and OST RPCs to overlap with the MDT processing of the unlink. |
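The condition described here can be written down as a small sketch. It is an illustration only, with hypothetical names (`can_cancel_ost_locks_early`, `plan_unlink`); it assumes the client judges "last unlink" from its own cached link count and "not open" from a local open count, and it does not correspond to any existing llite interface.

```python
def can_cancel_ost_locks_early(nlink, open_count):
    """True if this unlink removes the last name of a file not open locally.

    nlink: link count as the client currently believes it to be.
    open_count: number of open handles for the file on this client.
    """
    return nlink == 1 and open_count == 0


def plan_unlink(nlink, open_count):
    """Return the ordered client-side steps for an unlink in this model."""
    steps = []
    if can_cancel_ost_locks_early(nlink, open_count):
        # Cancelling first avoids a later blocking callback from the OST and
        # lets page cleanup overlap with the MDT's processing of the unlink.
        steps.append("cancel cached OST locks for the file's objects")
    steps.append("send unlink to the MDT")
    return steps


if __name__ == "__main__":
    for nlink, open_count in [(1, 0), (2, 0), (1, 1)]:
        print(nlink, open_count, plan_unlink(nlink, open_count))
```

Only the first case cancels early; a file with other links or with local open handles keeps its OST locks, since its objects are not about to be destroyed.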
| Comment by Di Wang [ 07/May/14 ] |
|
Andreas: I just talked to Jinshan; there is no proper interface for us to implement this (cancelling OST locks directly from llite). According to Jinshan, there might be a proper interface for this after the CLIO cleanup project, so should we temporarily fix this in the test script for now? |
| Comment by Jodi Levi (Inactive) [ 08/May/14 ] |
|
Patch landed to Master. |
| Comment by Jian Yu [ 28/Oct/14 ] |
|
The failure also occurred on Lustre b2_5 branch: |