  Lustre / LU-4206

Sanity test_120e fails with 1 blocking RPC occured.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.6.0
    • Affects Version: Lustre 2.5.0
    • None
    • 3
    • 11434

    Description

      Sanity test_120e seems to be failing intermittently on some test runs.
      The following is seen on the MDT:

      Lustre: lustre-MDT0000: Not available for connect from 10.10.16.108@tcp (stopping)
      Lustre: lustre-MDT0000: Not available for connect from 10.10.16.108@tcp (stopping)
      Lustre: Skipped 5 previous similar messages
      Lustre: lustre-MDT0000: Not available for connect from 10.10.16.109@tcp (stopping)
      LustreError: 2966:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8800686f7c00 x1450597855266416/t0(0) o13->lustre-OST0002-osc-MDT0000@10.10.16.108@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      LustreError: 2966:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
      Lustre: lustre-MDT0000: Not available for connect from 10.10.16.110@tcp (stopping)
      Lustre: 10030:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1383398053/real 1383398053] req@ffff880068291c00 x1450597855266444/t0(0) o251->MGC10.10.16.107@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1383398059 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: server umount lustre-MDT0000 complete

      It might be that the MDT is being unmounted while clients are still communicating with it.

      On the client:
      LustreError: 11-0: lustre-MDT0000-mdc-ffff880037d5a400: Communicating with 10.10.16.107@tcp, operation obd_ping failed with -107.
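
      For context, the "blocking RPC occured." failure comes from sampling an ldlm
      blocking-callback counter before and after the operation and failing if it moved.
      A simplified sketch of that style of check (counter names assume the standard ldlm
      service stats, and $DIR/$tdir are the usual test-framework variables; this is not
      the verbatim sanity.sh code):

      # Sample the blocking-callback counter around the operation under test.
      blk1=$(lctl get_param -n ldlm.services.ldlm_cbd.stats |
              awk '/ldlm_bl_callback/ {print $2}')
      rm -f $DIR/$tdir/f1                     # the operation under test
      blk2=$(lctl get_param -n ldlm.services.ldlm_cbd.stats |
              awk '/ldlm_bl_callback/ {print $2}')
      # Any increase means a blocking AST was sent while the test ran.
      [ "${blk2:-0}" -eq "${blk1:-0}" ] ||
              echo "error: $(( ${blk2:-0} - ${blk1:-0} )) blocking RPC occured."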


          Activity

            yujian Jian Yu added a comment -
            The failure also occurred on the Lustre b2_5 branch: https://testing.hpdd.intel.com/test_sets/720ecb84-5e77-11e4-bd01-5254006e85c2

            jlevi Jodi Levi (Inactive) added a comment -
            Patch landed to Master.

            di.wang Di Wang (Inactive) added a comment -
            Andreas: I just talked to Jinshan; there is no proper interface for us to implement this (cancelling OST locks directly from llite). According to Jinshan, a proper interface may become available after the CLIO cleanup project, so perhaps we should fix this in the test script for now as a temporary measure?
            di.wang Di Wang (Inactive) added a comment -
            http://review.whamcloud.com/10250

            adilger Andreas Dilger added a comment -
            The correct solution is to have the client cancel the OST locks for the file's objects if it thinks this is the last unlink of the file and the file is not open on the client. That should avoid the blocking RPC from the OST, and also allow the page cleanup and OST RPCs to overlap with the MDT's processing of the unlink.
            jhammond John Hammond added a comment -
            Yes, from osc_destroy(), but this is not until after md_rename() has returned.

            di.wang Di Wang (Inactive) added a comment -
            As I understand it, we actually do ELC on the OSC as well. But after 2.4 it is the MDT that sends the destroy request to the OST; the OST then revokes the lock covering the client's cache and destroys the object. So the client has no chance to do ELC here, and we may see some blocking RPCs. IIRC, test_120 is only supposed to check ELC for the metadata object, so we probably need to fix the test script here.
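
            If the test script is relaxed rather than the client changed, one possible shape of
            such a tolerance is sketched below (this is only an illustration, not the actual
            change in http://review.whamcloud.com/10250; counter names assume the standard ldlm
            service stats and $DIR/$tdir come from the test framework):

            # Hypothetical helper: read the blocking-callback counter, defaulting to 0.
            get_blk_count() {
                    local cnt
                    cnt=$(lctl get_param -n ldlm.services.ldlm_cbd.stats 2>/dev/null |
                            awk '/ldlm_bl_callback/ {print $2}')
                    echo ${cnt:-0}
            }

            blk1=$(get_blk_count)
            mv -f $DIR/$tdir/f1 $DIR/$tdir/f2   # rename onto an existing victim
            blk2=$(get_blk_count)

            # Tolerate the single OST-side callback triggered by destroying the victim's
            # objects, but still flag anything beyond that.
            [ $((blk2 - blk1)) -le 1 ] ||
                    echo "error: $((blk2 - blk1)) blocking RPCs occured."
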
            jhammond John Hammond added a comment -

            I see this occasionally from autotest and also when running locally. IIUC, when I see it, the blocking callback is from the OST (while handling OST_DESTROY) for the rename-onto victim. AFAICT we don't do ELC for the objects of rename victims. Is that correct?

            00000100:00100000:0.0:1399475307.568943:0:21604:0:service.c:2090:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_004:lustre-MDT0000-mdtlov_UUID+4:11360:x1467451757057912:12345-0@lo:6
            ...
            00010000:00010000:0.0:1399475307.569026:0:21604:0:(ldlm_lock.c:715:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 2/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x40000000000000 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 0 lvb_type: 0
            ...
            00010000:00010000:0.0:1399475307.569074:0:21604:0:(ldlm_lockd.c:848:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 3/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x50000000010020 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 0 lvb_type: 0
            ...
            00010000:00010000:0.0:1399475307.569081:0:21604:0:(ldlm_lockd.c:459:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 150, AT: on) ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 4/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x70000000010020 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 4451652756 lvb_type: 0
            ...
            00000100:00000040:0.0:1399475307.569088:0:21604:0:(lustre_net.h:3296:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff88012c7032f0 x1467451757057916/t0(0) o104->lustre-OST0000@0@lo:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
            ...
            00000100:00100000:0.0:1399475307.569097:0:21604:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_ost00_004:lustre-OST0000_UUID:21604:1467451757057916:0@lo:104
            
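            A trace like the one above can be captured around a local run with the standard lctl
            debug commands (the way of invoking the single test is only illustrative):

            lctl set_param debug=+dlmtrace       # enable DLM tracing
            lctl clear                           # start with an empty debug buffer
            ONLY=120e bash sanity.sh             # illustrative: run just the failing test
            lctl dk /tmp/lustre-debug.log        # dump the kernel debug buffer to a file
            grep -E 'ldlm_server_blocking_ast|ldlm_add_waiting_lock' /tmp/lustre-debug.log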

            adilger Andreas Dilger added a comment -
            Only found 2 failures in the past few weeks.

            People

              Assignee: di.wang Di Wang (Inactive)
              Reporter: ashehata Amir Shehata (Inactive)
              Votes: 0
              Watchers: 8
