[LU-7812] sanity test_120e: 1 blocking RPC occured Created: 25/Feb/16 Updated: 14/Mar/17 Resolved: 17/Jan/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for wangshilong <wshilong@ddn.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6ce8d47a-db1b-11e5-877a-5254006e85c2.

The sub-test test_120e failed with the following error:

1 blocking RPC occured

Please provide additional information about the failure here.

Info required for matching: sanity 120e |
| Comments |
| Comment by Bob Glossman (Inactive) [ 03/Mar/16 ] |
|
another seen on master: |
| Comment by Gerrit Updater [ 09/Mar/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/18826 |
| Comment by Gerrit Updater [ 23/Mar/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18826/ |
| Comment by Andreas Dilger [ 28/Apr/16 ] |
|
Looking at the debug logs of a recent failure (https://testing.hpdd.intel.com/test_sets/26e8e22a-0cd1-11e6-9e5d-5254006e85c2), it appears that an LDLM_BL_CALLBACK RPC is arriving at the client, even though the client is supposed to have dropped all of its locks:

00000100:00100000:0.0:1461790257.944465:0:4445:0:(service.c:1922:ptlrpc_server_handle_req_in()) got req x1532794272717024
00000100:00100000:0.0:1461790257.944494:0:4445:0:(service.c:2070:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_001:LOV_OSC_UUID+4:1261:x1532794272717024:12345-10.9.4.226@tcp:104
00000100:00100000:0.0:1461790257.944583:0:4445:0:(service.c:2120:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_001:LOV_OSC_UUID+4:1261:x1532794272717024:12345-10.9.4.226@tcp:104 Request procesed in 88us (150us total) trans 0 rc 0/0
00010000:00010000:0.0:1461790257.944606:0:12644:0:(ldlm_lockd.c:1764:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 2/0,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x420000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790257.944627:0:12644:0:(ldlm_lockd.c:1777:ldlm_handle_bl_callback()) Lock ffff880063943d80 already unused, calling callback (ffffffffa0af2c30)
00010000:00010000:0.0:1461790257.944629:0:12644:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 3/0,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x428400000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790257.944687:0:12644:0:(ldlm_request.c:1175:ldlm_cancel_pack()) ### packing ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 2/0,0 mode: --/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x4c69400000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790257.944694:0:12644:0:(ldlm_lockd.c:1786:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 1/0,0 mode: --/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x4c69400000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790257.944698:0:12644:0:(ldlm_lock.c:200:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 0/0,0 mode: --/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x4c69400000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1

It appears that the lock is the LCK_PR lock gotten from dd if=$DIR/$tdir/f1 of=/dev/null, since the LCK_PW lock on the [0x931:0x0:0x0] resource was dropped after the first cancel_lru_locks osc that also caused the dirty data to be written to the OST:

00010000:00010000:0.0:1461790256.415781:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0000-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.415805:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0001-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.415824:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0002-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.415843:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0003-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.415851:0:12645:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943580/0xc744c7e3377cb731 lrc: 3/0,0 mode: PW/PW res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x28400000000 nid: local remote: 0x38471ca6aa1b9ff4 expref: -99 pid: 18609 timeout: 0 lvb_type: 1
00000008:00100000:0.0:1461790256.415914:0:12645:0:(osc_request.c:1315:osc_brw_prep_request()) brw rpc ffff8800597b4080 - object 0x0:2353 offset 0<>512
00000100:00100000:0.0:1461790256.415932:0:12729:0:(client.c:1589:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_00_00:1765eedc-c6c6-d3ef-2f08-6efd98a5aa0c:12729:1532794264066032:10.9.4.226@tcp:4
00000008:00010000:0.0:1461790256.418958:0:12729:0:(osc_cache.c:857:osc_extent_finish()) ### extent: ffff88005ae1f3d8 ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943580/0xc744c7e3377cb731 lrc: 3/0,0 mode: PW/PW res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x29400000000 nid: local remote: 0x38471ca6aa1b9ff4 expref: -99 pid: 18609 timeout: 0 lvb_type: 1
00000008:00010000:0.0:1461790256.418973:0:12729:0:(osc_cache.c:489:__osc_extent_remove()) ### extent: ffff88005ae1f3d8 ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943580/0xc744c7e3377cb731 lrc: 3/0,0 mode: PW/PW res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x29400000000 nid: local remote: 0x38471ca6aa1b9ff4 expref: -99 pid: 18609 timeout: 0 lvb_type: 1
00000100:00100000:0.0:1461790256.418984:0:12729:0:(client.c:1997:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_00_00:1765eedc-c6c6-d3ef-2f08-6efd98a5aa0c:12729:1532794264066032:10.9.4.226@tcp:4
00000100:00100000:0.0:1461790256.419023:0:12645:0:(client.c:1589:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_26:1765eedc-c6c6-d3ef-2f08-6efd98a5aa0c:12645:1532794264066048:10.9.4.226@tcp:103
00000100:00100000:0.0:1461790256.420085:0:12645:0:(client.c:1997:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_26:1765eedc-c6c6-d3ef-2f08-6efd98a5aa0c:12645:1532794264066048:10.9.4.226@tcp:103
00010000:00010000:0.0:1461790256.420092:0:12645:0:(ldlm_lock.c:200:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943580/0xc744c7e3377cb731 lrc: 0/0,0 mode: --/PW res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x4869400000000 nid: local remote: 0x38471ca6aa1b9ff4 expref: -99 pid: 18609 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790256.420142:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0004-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.420163:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0005-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.420181:0:18613:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0006-osc-ffff8800591dbc00

The LCK_PR lock is fetched during the read and added to the LRU and then later matched for the stat $DIR/$tdir/f1 to get the file size and returned to the LRU:

00010000:00010000:0.0:1461790256.432444:0:18616:0:(ldlm_lock.c:876:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 3/0,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
:
:
00000001:00010000:0.0:1461790256.439644:0:18617:0:(glimpse.c:96:cl_glimpse_lock()) Glimpsing inode [0x200001b71:0xabe:0x0]
00000008:00010000:0.0:1461790256.439654:0:18617:0:(osc_lock.c:1201:osc_lock_init()) ### lock ffff880051d6e470, osc lock ffff880064b45070, flags 0x1002
00010000:00010000:0.0:1461790256.439660:0:18617:0:(ldlm_lock.c:749:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 2/1,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790256.439666:0:18617:0:(ldlm_lock.c:1474:ldlm_lock_match()) ### matched (0 18446744073709551615) ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 2/1,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790256.439677:0:18617:0:(ldlm_lock.c:803:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 5/1,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1461790256.439689:0:18617:0:(ldlm_lock.c:876:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-OST0003-osc-ffff8800591dbc00 lock: ffff880063943d80/0xc744c7e3377cb74d lrc: 4/0,0 mode: PR/PR res: [0x931:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020000000000 nid: local remote: 0x38471ca6aa1b9ffb expref: -99 pid: 18616 timeout: 0 lvb_type: 1
00000080:00200000:0.0:1461790256.439697:0:18617:0:(file.c:971:ll_merge_attr()) [0x200001b71:0xabe:0x0] updating i_size 512

However, the lock seems to have been skipped during the immediately following cancel_lru_locks osc call:

00010000:00010000:0.0:1461790256.442518:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0000-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442558:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0001-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442612:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0002-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442644:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0003-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442675:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0004-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442706:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0005-osc-ffff8800591dbc00
00010000:00010000:0.0:1461790256.442735:0:18618:0:(ldlm_resource.c:322:lprocfs_lru_size_seq_write()) dropping all unused locks from namespace lustre-OST0006-osc-ffff8800591dbc00

and the lock is only called back from the OST when the file unlink triggers the object destroy on the OST, as shown above. We need to find out why cancel_lru_locks didn't actually drop this lock.

I don't think it is necessarily more efficient to have the client explicitly cancel the OST locks compared to just waiting for the OST blocking callback, since the client could be doing something else in the meantime that would allow the OST locks to be piggy-backed onto another OST RPC via ELC. It might make sense to move the OST locks to the end of the LRU so that they are sure to be cancelled with the next RPC to this OST. |
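For reference, a rough bash sketch of the command sequence the analysis above walks through (an approximation of the sanity.sh test_120e steps, not the verbatim test; the lctl stats parameter, the awk field, and the error helper are assumptions):

  # Approximate reconstruction of the test_120e sequence (illustrative only).
  dd if=/dev/zero of=$DIR/$tdir/f1 bs=4k count=1   # write takes an LCK_PW extent lock
  cancel_lru_locks osc                             # flushes the dirty data and cancels the PW lock
  # Assumed counter location: blocking-callback count from the client's ldlm callback service stats.
  blk1=$($LCTL get_param -n ldlm.services.ldlm_cbd.stats | awk '/ldlm_bl_callback/ {print $2}')
  dd if=$DIR/$tdir/f1 of=/dev/null                 # read takes the LCK_PR extent lock seen above
  stat $DIR/$tdir $DIR/$tdir/f1 > /dev/null        # glimpse matches the cached PR lock for the size
  cancel_lru_locks osc                             # should drop the PR lock, but it was skipped here
  rm -f $DIR/$tdir/f1                              # object destroy on the OST sends the LDLM_BL_CALLBACK
  blk2=$($LCTL get_param -n ldlm.services.ldlm_cbd.stats | awk '/ldlm_bl_callback/ {print $2}')
  [ $blk1 -eq $blk2 ] || error "$((blk2 - blk1)) blocking RPC occured"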
| Comment by Andreas Dilger [ 28/Apr/16 ] |
|
|
| Comment by Bob Glossman (Inactive) [ 11/Aug/16 ] |
|
another on master, sles11sp4 client and server: |
| Comment by Bob Glossman (Inactive) [ 28/Nov/16 ] |
|
more on master, sles11sp4 client and server: I think this may be a 100% fail on sles11sp4. |
| Comment by Bob Glossman (Inactive) [ 30/Nov/16 ] |
|
more on master, sles12sp1. This may be a 100% fail on sles12 too. |
| Comment by Bob Glossman (Inactive) [ 20/Dec/16 ] |
|
more on master: |
| Comment by Bob Glossman (Inactive) [ 30/Dec/16 ] |
|
more on master: |
| Comment by Jinshan Xiong (Inactive) [ 10/Jan/17 ] |
cancel_lru_locks osc cancels unused locks, i.e. locks with zero l_readers and l_writers. There is a race window where stat $DIR/$tdir $DIR/$tdir/f1 > /dev/null has completed but the ldlm callback thread still holds the glimpse locks when cancel_lru_locks osc is called, so the glimpse locks do not yet count as 'unused' and miss the cancellation. This problem can be fixed by adding some delay between stat $DIR/$tdir $DIR/$tdir/f1 > /dev/null and the following cancel_lru_locks osc. |
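A minimal bash sketch of that fix idea (illustrative only; the actual change in https://review.whamcloud.com/24811 may be structured differently): give the ldlm callback thread time to drop its references so the glimpse lock is genuinely unused before the LRU is cleared.

  stat $DIR/$tdir $DIR/$tdir/f1 > /dev/null   # the glimpse may still be in flight on the ldlm thread
  sleep 1                                     # assumed delay so l_readers/l_writers fall back to zero
  cancel_lru_locks osc                        # the PR glimpse lock is now unused and gets cancelled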
| Comment by Gerrit Updater [ 11/Jan/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/24811 |
| Comment by Bob Glossman (Inactive) [ 11/Jan/17 ] |
|
There is a very similar-looking failure in test_120f; see |
| Comment by Jinshan Xiong (Inactive) [ 11/Jan/17 ] |
|
Yes, I think so. We can merge these two tickets. |
| Comment by Gerrit Updater [ 17/Jan/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24811/
Project: fs/lustre-release |
| Comment by Peter Jones [ 17/Jan/17 ] |
|
Landed for 2.10 |