[LU-7812] sanity test_120e: 1 blocking RPC occured Created: 25/Feb/16  Updated: 14/Mar/17  Resolved: 17/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-4206 Sanity test_120e fails with 1 blockin... Resolved
is duplicated by LU-7889 sanity test_120f: 1 blocking RPC occured Resolved
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangshilong <wshilong@ddn.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6ce8d47a-db1b-11e5-877a-5254006e85c2.

The sub-test test_120e failed with the following error:

1 blocking RPC occured.

Please provide additional information about the failure here.

Info required for matching: sanity 120e



 Comments   
Comment by Bob Glossman (Inactive) [ 03/Mar/16 ]

another seen on master:
https://testing.hpdd.intel.com/test_sets/a6f909f2-e17c-11e5-937c-5254006e85c2

Comment by Gerrit Updater [ 09/Mar/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/18826
Subject: LU-7812 tests: print out more stats for test failures
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 85bf5b4bcf327276be4b1764b482e739cb452746

Comment by Gerrit Updater [ 23/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18826/
Subject: LU-7812 tests: print more stats for sanity 120e failure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7cc486b0a9e5000f8a4a130ce0ba7288cde585d7

Comment by Andreas Dilger [ 28/Apr/16 ]

Looking at the debug logs of a recent failure (https://testing.hpdd.intel.com/test_sets/26e8e22a-0cd1-11e6-9e5d-5254006e85c2), it appears that an LDLM_BL_CALLBACK RPC is arriving at the client, even though the client is supposed to have dropped all of its locks:

00000100:00100000:0.0:1461790257.944465:0:4445:0:(service.c:1922:ptlrpc_server_handle_req_in()) got req x1532794272717024
00000100:00100000:0.0:1461790257.944494:0:4445:0:(service.c:2070:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_001:LOV_OSC_UUID+4:1261:x1532794272717024:12345-10.9.4.226@tcp:104
00000100:00100000:0.0:1461790257.944583:0:4445:0:(service.c:2120:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_001:LOV_OSC_UUID+4:1261:x1532794272717024:12345-10.9.4.226@tcp:104 Request procesed in 88us (150us total) trans 0 rc 0/0
00010000:00010000:0.0:1461790257.944606:0:12644:0:(ldlm_lockd.c:1764:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 2/0,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x420000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790257.944627:0:12644:0:(ldlm_lockd.c:1777:ldlm_handle_bl_callback()) Lock ffff880063943d80 already unused, calling callback (ffffffffa0af2c30)
00010000:00010000:0.0:1461790257.944629:0:12644:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 3/0,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x428400000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790257.944687:0:12644:0:(ldlm_request.c:1175:ldlm_cancel_pack()) ### packing ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 2/0,0 mode: --/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x4c69400000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790257.944694:0:12644:0:(ldlm_lockd.c:1786:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 1/0,0 mode: --/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x4c69400000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790257.944698:0:12644:0:(ldlm_lock.c:200:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 0/0,0 mode: --/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x4c69400000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1

It appears that the lock is the LCK_PR lock obtained from dd if=$DIR/$tdir/f1 of=/dev/null, since the LCK_PW lock on the [0x931:0x0:0x0] resource was already dropped after the first cancel_lru_locks osc, which also caused the dirty data to be written to the OST:

00010000:00010000:0.0:1461790256.415781:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0000-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.415805:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0001-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.415824:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0002-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.415843:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0003-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.415851:0:12645:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943580/0xc744c7e3377cb731 lrc: 3/0,0 mode: PW/PW res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x28400000000 nid: local remote: 0x38471ca6aa1b9ff4 expref: -99 pid: 18609 timeout: 0 lvb_type: 1
00000008:00100000:0.0:1461790256.415914:0:12645:0:(osc_request.c:1315:osc_brw_prep_request()) brw rpc ffff8800597b4080 - object 0x0:2353 offset 0<>512
00000100:00100000:0.0:1461790256.415932:0:12729:0:(client.c:1589:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_00_00:1765eedc-c6c6-d3ef-2f08-6efd98a5aa0c:12729:1532794264066032:10.9.4.226@tcp:4
00000008:00010000:0.0:1461790256.418958:0:12729:0:(osc_cache.c:857:osc_extent_finish()) ### extent: ffff88005ae1f3d8
 ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943580/0xc744c7e3377cb731 lrc: 3/0,0 mode: PW/PW res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x29400000000 nid: local remote: 0x38471ca6aa1b9ff4 expref: -99 pid: 18609 timeout: 0 lvb_type: 1
00000008:00010000:0.0:1461790256.418973:0:12729:0:(osc_cache.c:489:__osc_extent_remove()) ### extent: ffff88005ae1f3d8
 ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943580/0xc744c7e3377cb731 lrc: 3/0,0 mode: PW/PW res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x29400000000 nid: local remote: 0x38471ca6aa1b9ff4 expref: -99 pid: 18609 timeout: 0 lvb_type: 1
00000100:00100000:0.0:1461790256.418984:0:12729:0:(client.c:1997:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_00_00:1765eedc-c6c6-d3ef-2f08-6efd98a5aa0c:12729:1532794264066032:10.9.4.226@tcp:4
00000100:00100000:0.0:1461790256.419023:0:12645:0:(client.c:1589:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_26:1765eedc-c6c6-d3ef-2f08-6efd98a5aa0c:12645:1532794264066048:10.9.4.226@tcp:103
00000100:00100000:0.0:1461790256.420085:0:12645:0:(client.c:1997:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_26:1765eedc-c6c6-d3ef-2f08-6efd98a5aa0c:12645:1532794264066048:10.9.4.226@tcp:103
00010000:00010000:0.0:1461790256.420092:0:12645:0:(ldlm_lock.c:200:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943580/0xc744c7e3377cb731 lrc: 0/0,0 mode: --/PW res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x4869400000000 nid: local remote: 0x38471ca6aa1b9ff4 expref: -99 pid: 18609 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790256.420142:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0004-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.420163:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0005-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.420181:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0006-osc-ffff8800591dbc00

The LCK_PR lock is fetched during the read and added to the LRU and then later matched for the stat $DIR/$tdir/f1 to get the file size and returned to the LRU:

00010000:00010000:0.0:1461790256.432444:0:18616:0:(ldlm_lock.c:876:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 3/0,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
:
:
00000001:00010000:0.0:1461790256.439644:0:18617:0:(glimpse.c:96:cl_glimpse_lock()) Glimpsing inode [0x200001b71:0xabe:0x0]
00000008:00010000:0.0:1461790256.439654:0:18617:0:(osc_lock.c:1201:osc_lock_init()) ### lock ffff880051d6e470, osc lock ffff880064b45070, flags 0x1002
00010000:00010000:0.0:1461790256.439660:0:18617:0:(ldlm_lock.c:749:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 2/1,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790256.439666:0:18617:0:(ldlm_lock.c:1474:ldlm_lock_match()) ### matched (0 18446744073709551615) ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 2/1,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790256.439677:0:18617:0:(ldlm_lock.c:803:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 5/1,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790256.439689:0:18617:0:(ldlm_lock.c:876:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 4/0,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00000080:00200000:0.0:1461790256.439697:0:18617:0:(file.c:971:ll_merge_attr()) [0x200001b71:0xabe:0x0] updating i_size 512

However, the lock seems to have been skipped during the immediately following cancel_lru_locks osc call:

00010000:00010000:0.0:1461790256.442518:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0000-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442558:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0001-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442612:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0002-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442644:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0003-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442675:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0004-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442706:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0005-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442735:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0006-osc-ffff8800591dbc00

and the lock is only called back from the OST when the file unlink triggers the object destroy on the OST, as shown above.

We need to find out why cancel_lru_locks didn't actually drop this lock. Having the client explicitly cancel the OST locks is not necessarily more efficient than just waiting for the OST blocking callback, since the client could be doing something else in the meantime that would allow the OST locks to be piggy-backed onto another OST RPC via early lock cancellation (ELC). It might make sense to move the OST locks to the end of the LRU so that they are sure to be cancelled with the next RPC to this OST.
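The sequence under discussion can be sketched as follows. This is a simplified reconstruction from the commands quoted in this analysis, not the actual sanity.sh test_120e code; it runs against an ordinary directory, with `cancel_lru_locks` stubbed out since the real helper is a Lustre test-framework function (the debug logs above show it writing to the per-OSC lru_size proc files):

```shell
# Simplified reconstruction of the test_120e sequence analyzed above.
# On a real Lustre client, cancel_lru_locks drops all unused DLM locks in
# the named namespaces; here it is a stub so the sketch runs anywhere.
cancel_lru_locks() { :; }   # stub for the Lustre test-framework helper

DIR=$(mktemp -d)
mkdir -p "$DIR/d120e"
dd if=/dev/zero of="$DIR/d120e/f1" bs=512 count=1 2>/dev/null
cancel_lru_locks osc        # flushes dirty data, drops the LCK_PW lock
dd if="$DIR/d120e/f1" of=/dev/null 2>/dev/null  # read acquires the LCK_PR lock
stat "$DIR/d120e" "$DIR/d120e/f1" > /dev/null   # glimpse re-matches the PR lock
cancel_lru_locks osc        # expected to drop the PR lock, but races (see below)
rm -f "$DIR/d120e/f1"       # unlink -> object destroy -> blocking AST if lock remains
rmdir "$DIR/d120e"
rmdir "$DIR"
```

The failure counted by the test is the blocking AST triggered at the final `rm`, which only happens when the preceding `cancel_lru_locks` pass left the PR lock behind.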

Comment by Andreas Dilger [ 28/Apr/16 ]

LU-4206 only worked around the test failure, it didn't really fix it.

Comment by Bob Glossman (Inactive) [ 11/Aug/16 ]

another on master, sles11sp4 client and server:
https://testing.hpdd.intel.com/test_sets/1d257222-5fdc-11e6-906c-5254006e85c2

Comment by Bob Glossman (Inactive) [ 28/Nov/16 ]

more on master, sles11sp4 client and server:
https://testing.hpdd.intel.com/test_sets/fe524e90-b5af-11e6-a223-5254006e85c2
https://testing.hpdd.intel.com/test_sets/48ea6166-b5c1-11e6-a223-5254006e85c2

I think this may be a 100% failure on sles11sp4.
Raising this ticket to Blocker.

Comment by Bob Glossman (Inactive) [ 30/Nov/16 ]

more on master, sles12sp1
https://testing.hpdd.intel.com/test_sets/ad4ab7c4-b6d8-11e6-a559-5254006e85c2
https://testing.hpdd.intel.com/test_sets/c864a7d0-b729-11e6-a559-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a0975c86-b759-11e6-a559-5254006e85c2

this may be a 100% fail on sles12 too

Comment by Bob Glossman (Inactive) [ 20/Dec/16 ]

more on master:
https://testing.hpdd.intel.com/test_sets/59612658-c668-11e6-8cb7-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7457d9e4-c897-11e6-8a5b-5254006e85c2

Comment by Bob Glossman (Inactive) [ 30/Dec/16 ]

more on master:
https://testing.hpdd.intel.com/test_sets/114e1984-cece-11e6-af6a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/edf384e2-d0b2-11e6-bbdd-5254006e85c2
https://testing.hpdd.intel.com/test_sets/93fba206-d163-11e6-bbdd-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9c3a97e2-d69f-11e6-b630-5254006e85c2
https://testing.hpdd.intel.com/test_sets/187943b6-d6bf-11e6-bb30-5254006e85c2
https://testing.hpdd.intel.com/test_sets/8a1d30ea-d773-11e6-923b-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 10/Jan/17 ]

We need to find out why cancel_lru_locks didn't actually drop this lock. ...

cancel_lru_locks osc cancels unused locks, i.e. locks with zero l_readers and l_writers. There is a race window in which stat $DIR/$tdir $DIR/$tdir/f1 > /dev/null has completed but the ldlm callback thread is still holding the glimpse locks when cancel_lru_locks osc is called; the glimpse locks are therefore not 'unused' and miss the cancellation.

This problem can be fixed by adding a short delay between stat $DIR/$tdir $DIR/$tdir/f1 > /dev/null and the following cancel_lru_locks osc.
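The race described here can be illustrated with a plain-shell analogy (no Lustre involved; the `cancel_unused` helper and the refcount file are invented for this sketch): a lock whose reference count is still nonzero when the cancel pass runs is skipped, and only a delay before a later pass allows it to be cancelled.

```shell
# Hypothetical analogy for the test_120e race (not Lustre code).
# A "lock" is a file; a nonzero refcount stands for l_readers/l_writers > 0.
lockdir=$(mktemp -d)
echo 1 > "$lockdir/lock_refcount"   # callback thread still holds a reference

# Cancels (removes) the lock only if its refcount is zero, mirroring how
# cancel_lru_locks only cancels unused locks.
cancel_unused() {
    if [ "$(cat "$lockdir/lock_refcount" 2>/dev/null || echo 1)" -eq 0 ]; then
        rm -f "$lockdir/lock_refcount"
    fi
}

# Background "callback thread" drops its reference shortly after we race past it.
( sleep 1; echo 0 > "$lockdir/lock_refcount" ) &

cancel_unused                       # first pass: lock still referenced, skipped
if [ -f "$lockdir/lock_refcount" ]; then
    echo "first pass missed the lock (race)"
fi

sleep 2                             # the delay the patch inserts before cancelling
cancel_unused                       # second pass: refcount is now 0, lock cancelled
[ -f "$lockdir/lock_refcount" ] || echo "lock cancelled after delay"
wait
rm -rf "$lockdir"
```

The first `cancel_unused` pass reliably misses the lock because the background job has not yet dropped its reference, which is the same window the test hits between stat and cancel_lru_locks.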

Comment by Gerrit Updater [ 11/Jan/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/24811
Subject: LU-7812 tests: add a race condition for sanity:120e
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f4a1c85cb6353e057ea562a1094781b1924a9050

Comment by Bob Glossman (Inactive) [ 11/Jan/17 ]

There is a very similar-looking failure in test_120f; see LU-7889. Does it need a similar test fix?

Comment by Jinshan Xiong (Inactive) [ 11/Jan/17 ]

Yes, I think so. We can merge these two tickets.

Comment by Gerrit Updater [ 17/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24811/
Subject: LU-7812 tests: address race condition for sanity:120{e,f}

Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e240fb5099af8e62c532d314317095800ebb6864

Comment by Peter Jones [ 17/Jan/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:12:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.