[LU-8347] granting conflicting locks Created: 29/Jun/16  Updated: 29/Oct/16  Resolved: 29/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Andriy Skulysh Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-8306 lost BL AST during failover Resolved
Severity: 3

 Description   

2016-04-14T13:52:45.530794+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460641949/real 1460641949] req@ffff881fe6245700 x1530142443559856/t0(0) o101->snx11209-OST0007-osc-ffff880fe6fbbc00@10.149.209.11@o2ib1302:28/4 lens 328/400 e 1 to 1 dl 1460641965 ref 1 fl Rpc:XU/40/ffffffff rc 0/-1
2016-04-14T13:52:45.530830+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) Skipped 224 previous similar messages
2016-04-14T13:53:17.536750+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) already connecting
2016-04-14T13:53:17.561952+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) Skipped 175 previous similar messages
2016-04-14T13:55:25.623058+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460642109/real 1460642109] req@ffff881fe6103180 x1530142443588744/t0(0) o101->snx11209-OST0007-osc-ffff880fe6fbbc00@10.149.209.11@o2ib1302:28/4 lens 328/400 e 1 to 1 dl 1460642125 ref 1 fl Rpc:XU/40/ffffffff rc 0/-1
2016-04-14T13:55:25.623115+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) Skipped 449 previous similar messages
2016-04-14T13:55:25.648246+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) already connecting
2016-04-14T13:55:25.648306+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) Skipped 351 previous similar messages
2016-04-14T13:57:45.501013+00:00 c3-2c1s12n0 Lustre: snx11209-OST0007-osc-ffff880fe6fbbc00: Connection restored to snx11209-OST0007 (at 10.149.209.11@o2ib1302)
2016-04-14T13:57:54.435474+00:00 c3-2c1s12n0 LustreError: 22135:0:(osc_cache.c:3107:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
2016-04-14T13:57:54.460638+00:00 c3-2c1s12n0 LustreError: 22135:0:(osc_cache.c:3107:discard_cb()) LBUG
2016-04-14T13:57:54.460793+00:00 c3-2c1s12n0 Pid: 22135, comm: ldlm_bl_05
2016-04-14T13:57:54.460821+00:00 c3-2c1s12n0 Call Trace:
2016-04-14T13:57:54.460842+00:00 c3-2c1s12n0 [<ffffffff81006109>] try_stack_unwind+0x169/0x1b0
2016-04-14T13:57:54.485882+00:00 c3-2c1s12n0 [<ffffffff81004b99>] dump_trace+0x89/0x440
2016-04-14T13:57:54.485909+00:00 c3-2c1s12n0 [<ffffffffa02108c7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2016-04-14T13:57:54.511077+00:00 c3-2c1s12n0 [<ffffffffa0210e27>] lbug_with_loc+0x47/0xc0 [libcfs]
2016-04-14T13:57:54.511124+00:00 c3-2c1s12n0 [<ffffffffa06f264a>] discard_cb+0x19a/0x1d0 [osc]
2016-04-14T13:57:54.511142+00:00 c3-2c1s12n0 [<ffffffffa06f29c8>] osc_page_gang_lookup+0x1b8/0x330 [osc]
2016-04-14T13:57:54.536311+00:00 c3-2c1s12n0 [<ffffffffa06f2c6b>] osc_lock_discard_pages+0x12b/0x220 [osc]
2016-04-14T13:57:54.536403+00:00 c3-2c1s12n0 [<ffffffffa06e9408>] osc_lock_flush+0xf8/0x260 [osc]
2016-04-14T13:57:54.536433+00:00 c3-2c1s12n0 [<ffffffffa06e9651>] osc_lock_cancel+0xe1/0x1c0 [osc]
2016-04-14T13:57:54.561488+00:00 c3-2c1s12n0 [<ffffffffa035e115>] cl_lock_cancel0+0x75/0x160 [obdclass]
2016-04-14T13:57:54.561551+00:00 c3-2c1s12n0 [<ffffffffa035ee6b>] cl_lock_cancel+0x13b/0x140 [obdclass]
2016-04-14T13:57:54.586775+00:00 c3-2c1s12n0 [<ffffffffa06eb3cc>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
2016-04-14T13:57:54.586835+00:00 c3-2c1s12n0 [<ffffffffa046e3f4>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
2016-04-14T13:57:54.586915+00:00 c3-2c1s12n0 [<ffffffffa046e964>] ldlm_bl_thread_main+0x214/0x460 [ptlrpc]
2016-04-14T13:57:54.611996+00:00 c3-2c1s12n0 [<ffffffff8107374e>] kthread+0x9e/0xb0
2016-04-14T13:57:54.612081+00:00 c3-2c1s12n0 [<ffffffff81427bb4>] kernel_thread_helper+0x4/0x10
2016-04-14T13:57:54.612204+00:00 c3-2c1s12n0 Kernel panic - not syncing: LBUG
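
The failed assertion encodes an invariant of the page discard path: a cacheable client page must not still be dirty by the time discard_cb() reaches it, since discarding a dirty page throws away data. A simplified restatement in plain C (illustrative types only, not the actual osc_cache.c code):

    #include <assert.h>
    #include <stdbool.h>

    /* Illustrative restatement of the LASSERT in discard_cb(); the enum
     * and the dirty flag are stand-ins, not the Lustre cl_page API. */
    enum cl_page_type { CPT_CACHEABLE, CPT_TRANSIENT };

    struct toy_page {
            enum cl_page_type cp_type;
            bool dirty;   /* stand-in for PageDirty(cl_page_vmpage(page)) */
    };

    static void assert_discardable(const struct toy_page *page)
    {
            /* A cacheable page must have been written back (or never
             * dirtied) before the covering lock discards it. */
            assert(page->cp_type != CPT_CACHEABLE || !page->dirty);
    }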



 Comments   
Comment by Gerrit Updater [ 29/Jun/16 ]

Andriy Skulysh (andriy.skulysh@seagate.com) uploaded a new patch: http://review.whamcloud.com/21059
Subject: LU-8347 ldlm: granting conflicting locks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2abc35352d2a1e3be9ee5dd8b2979ccda115c2e1

Comment by John Hammond [ 01/Jul/16 ]

This assertion was removed by http://review.whamcloud.com/#/c/14989/ ("LU-6271 osc: handle osc eviction correctly").

Comment by Jinshan Xiong (Inactive) [ 05/Jul/16 ]

Hi Andriy, can you please describe the root cause of this problem in detail? The assert was removed because this condition usually occurs when an OSC import is being evicted, so the error is not severe enough to justify stopping the client.

Comment by Andriy Skulysh [ 06/Jul/16 ]

The replay of a waiting lock can reach the server first and be granted immediately, since no conflict exists there yet; the subsequent replay of the already granted lock is then simply placed on the granted list, so conflicting locks can both end up granted.
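
A toy model of that ordering (illustrative C only, not Lustre code; the modes, lists, and helpers are simplified stand-ins for the ldlm structures):

    #include <stdbool.h>
    #include <stdio.h>

    enum mode { MODE_PR, MODE_PW };           /* read vs. write, simplified */

    struct lock { enum mode mode; bool granted; };

    static bool conflicts(enum mode a, enum mode b)
    {
            return a == MODE_PW || b == MODE_PW;  /* PW conflicts with all */
    }

    /* Resend of a lock the server had already granted before failover:
     * it goes straight back onto the granted list, no conflict check. */
    static void resend_granted(struct lock *granted[], int *ng, struct lock *lk)
    {
            lk->granted = true;
            granted[(*ng)++] = lk;
    }

    /* Resend of a lock that was still waiting: normal enqueue policy
     * runs, and the lock is granted if the currently visible granted
     * list holds no conflict. */
    static void resend_waiting(struct lock *granted[], int *ng, struct lock *lk)
    {
            for (int i = 0; i < *ng; i++)
                    if (conflicts(granted[i]->mode, lk->mode))
                            return;           /* stays waiting */
            lk->granted = true;
            granted[(*ng)++] = lk;
    }

    int main(void)
    {
            struct lock *granted[2];
            int ng = 0;
            struct lock pw = { MODE_PW, false };  /* granted before failover */
            struct lock pr = { MODE_PR, false };  /* waiting before failover */

            /* The waiting lock's resend arrives first, sees an empty
             * granted list, and is granted... */
            resend_waiting(granted, &ng, &pr);
            /* ...then the previously granted PW lock is replanted directly. */
            resend_granted(granted, &ng, &pw);

            if (pr.granted && pw.granted)
                    printf("conflicting PR and PW locks are both granted\n");
            return 0;
    }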

Comment by Jinshan Xiong (Inactive) [ 06/Jul/16 ]

As far as I know, replayed locks are added to the lists directly; they are not processed with the lock enqueue policy at replay time.

Do you have a reproducer?

Comment by Jinshan Xiong (Inactive) [ 07/Jul/16 ]

I realize you must be talking about resent locks. However, I don't think this patch fixes the problem.

Comment by Andriy Skulysh [ 12/Jul/16 ]

Yes, I have a test, but it depends on another ticket.
The patch disables reprocessing during lock resend, so a waiting lock cannot be granted before the already granted lock has been resent.
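
A minimal sketch of that guard (toy C, written from the patch description above; the real change is in the ldlm server enqueue paths and differs in detail):

    #include <stdbool.h>

    /* Illustrative stand-in for an ldlm lock; not the real structure. */
    struct toy_lock {
            bool was_granted;   /* state on the server before failover */
            bool granted;
    };

    static bool no_conflicts(const struct toy_lock *lk)
    {
            (void)lk;
            return true;        /* placeholder for the real policy check */
    }

    static void toy_enqueue(struct toy_lock *lk, bool is_resend)
    {
            if (is_resend) {
                    /* Restore the lock exactly where it was: granted
                     * locks back to the granted list, waiting locks back
                     * to the waiting list. No reprocessing runs, so a
                     * resent waiting lock can no longer be granted ahead
                     * of a granted lock whose resend has not arrived. */
                    lk->granted = lk->was_granted;
                    return;
            }
            /* Normal enqueue: run policy and grant only on no conflict. */
            if (no_conflicts(lk))
                    lk->granted = true;
    }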

Comment by John Hammond [ 26/Jul/16 ]

Hi Andriy,

Could you point us to that test or upload it to gerrit?

Comment by Patrick Farrell (Inactive) [ 26/Jul/16 ]

It's not integrated into the test framework, but the test I attached to LU-8202 (duplicated to this ticket) does a great job of reproducing this problem during failover (hit rate of >50% in my testing). The LU-8347 patch (I believe together with LU-8175) makes that test pass consistently; without it, we see data corruption in that test.

https://jira.hpdd.intel.com/secure/attachment/21600/mpi_test.c

Running the test case is a little complicated; the setup, and exactly what the test does, are described in this comment:
https://jira.hpdd.intel.com/browse/LU-8202?focusedCommentId=153406&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-153406

Comment by John Hammond [ 12/Sep/16 ]

Andriy, can you add a test to http://review.whamcloud.com/#/c/21059/?

Comment by Andriy Skulysh [ 14/Sep/16 ]

I've added the test to the patch.

Comment by Gerrit Updater [ 28/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21059/
Subject: LU-8347 ldlm: granting conflicting locks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a42a91c783903ea15ad902032166c7e312dad7ee

Comment by Peter Jones [ 29/Oct/16 ]

Landed for 2.9
