[LU-8175] conflicting PW & PR extent locks on a client Created: 20/May/16  Updated: 01/Mar/18  Resolved: 02/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Andriy Skulysh Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
Related
Severity: 3

 Description   

> [5034040.035051] Lustre: 16432:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1453393018/real 1453393018] req@ffff881f9d653c00 x1518811430048732/t0(0) o3->snx11091-OST0028-osc-ffff881fe6574800@172.17.47.209@o2ib1013:6/4 lens 488/432 e 0 to 1 dl 1453393778 ref 2 fl Rpc:XU/2/ffffffff rc -11/-1
> [5034040.035057] Lustre: 16432:0:(client.c:1910:ptlrpc_expire_one_request()) Skipped 32 previous similar messages
> [5034482.398979] Lustre: snx11091-OST000b-osc-ffff881fe6574800: Connection to snx11091-OST000b (at 172.17.47.201@o2ib1013) was lost; in progress operations using this service will wait for recovery to complete
> [5034482.398984] Lustre: Skipped 7 previous similar messages
> [5034482.399254] Lustre: snx11091-OST000b-osc-ffff881fe6574800: Connection restored to snx11091-OST000b (at 172.17.47.201@o2ib1013)
> [5034482.399257] Lustre: Skipped 7 previous similar messages
> [5034798.943798] Lustre: 16422:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1453393778/real 1453393778] req@ffff881fe4cc9000 x1518811430052084/t0(0) o4->snx11091-OST0028-osc-ffff881fe6574800@172.17.47.209@o2ib1013:6/4 lens 488/448 e 0 to 1 dl 1453394538 ref 2 fl Rpc:XU/2/ffffffff rc -11/-1
> [5034798.943805] Lustre: 16442:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1453393778/real 1453393778] req@ffff881fe4cc9400 x1518811430052092/t0(0) o4->snx11091-OST0028-osc-ffff881fe6574800@172.17.47.209@o2ib1013:6/4 lens 488/448 e 0 to 1 dl 1453394538 ref 2 fl Rpc:XU/2/ffffffff rc 0/-1
> [5034798.943811] Lustre: 16442:0:(client.c:1910:ptlrpc_expire_one_request()) Skipped 30 previous similar messages
> [5035427.382998] Lustre: snx11091-OST002a-osc-ffff881fe6574800: Connection restored to snx11091-OST002a (at 172.17.47.210@o2ib1012)
> [5035427.383001] Lustre: Skipped 7 previous similar messages
> [5035429.345176] LustreError: 16406:0:(osc_cache.c:2421:osc_teardown_async_page()) extent ffff88071aac01e0@{[0 -> 255/255], [3|0|+|cache|wihuY|ffff880877eec198], [1048576|256|+|+|ffff880e6beab738|256| (null)]} trunc at 0.
> [5035429.345183] LustreError: 16406:0:(osc_page.c:333:osc_page_delete()) page@ffff880973c33000[3 ffff88037416ae18 4 0 1 (null) (null) 0x0]
> [5035429.345188] LustreError: 16406:0:(osc_page.c:333:osc_page_delete()) vvp-page@ffff880973c330a0(0:0:0) vm@ffffea0006449948 20000000001079 2:0 ffff880973c33000 0 lru
> [5035429.345191] LustreError: 16406:0:(osc_page.c:333:osc_page_delete()) lov-page@ffff880973c330f8, raid0
> [5035429.345198] LustreError: 16406:0:(osc_page.c:333:osc_page_delete()) osc-page@ffff880973c33160 0: 1< 0x845fed 2 0 + - > 2< 0 0 4096 0x0 0x420 | (null) ffff881fe6ae0620 ffff880877eec198 > 3< + ffff880768e26380 0 0 0 > 4< 0 9 8 0 - | + - + + > 5< + - + - | 0 - | 948 - +>
> [5035429.345202] LustreError: 16406:0:(osc_page.c:333:osc_page_delete()) end page@ffff880973c33000
> [5035429.345204] LustreError: 16406:0:(osc_page.c:333:osc_page_delete()) Trying to teardown failed: -16
> [5035429.345206] LustreError: 16406:0:(osc_page.c:334:osc_page_delete()) ASSERTION( 0 ) failed:
> [5035429.353732] LustreError: 16406:0:(osc_page.c:334:osc_page_delete()) LBUG
> [5035429.360601] Pid: 16406, comm: ptlrpcd_3
> [5035429.360602]
> [5035429.360603] Call Trace:
> [5035429.360612] [<ffffffff81004b95>] dump_trace+0x75/0x300
> [5035429.360636] [<ffffffffa089c82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
> [5035429.360664] [<ffffffffa089cd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
> [5035429.360678] [<ffffffffa1d35103>] osc_page_delete+0x393/0x3d0 [osc]
> [5035429.360722] [<ffffffffa09f43fd>] cl_page_delete0+0x6d/0x200 [obdclass]
> [5035429.360765] [<ffffffffa09f45c5>] cl_page_delete+0x35/0x120 [obdclass]
> [5035429.360817] [<ffffffffa1e695c6>] ll_invalidatepage+0x96/0x160 [lustre]
> [5035429.360850] [<ffffffffa1e7b45c>] vvp_page_discard+0xcc/0x170 [lustre]
> [5035429.360887] [<ffffffffa09f2ce8>] cl_page_invoid+0x58/0x150 [obdclass]
> [5035429.360918] [<ffffffffa1d4193e>] check_and_discard_cb+0x13e/0x190 [osc]
> [5035429.360934] [<ffffffffa1d41b4d>] osc_page_gang_lookup+0x1bd/0x340 [osc]
> [5035429.360951] [<ffffffffa1d41e0b>] osc_lock_discard_pages+0x13b/0x240 [osc]
> [5035429.360966] [<ffffffffa1d37993>] osc_lock_flush+0xf3/0x270 [osc]
> [5035429.360979] [<ffffffffa1d37c09>] osc_lock_cancel+0xf9/0x1e0 [osc]
> [5035429.361005] [<ffffffffa09f6bc5>] cl_lock_cancel0+0x65/0x150 [obdclass]
> [5035429.361050] [<ffffffffa09f9f76>] cl_lock_hold_release+0x1e6/0x2c0 [obdclass]
> [5035429.361081] [<ffffffffa1d3a613>] osc_lock_upcall+0x223/0x460 [osc]
> [5035429.361093] [<ffffffffa1d1b82d>] osc_enqueue_fini+0x9d/0x270 [osc]
> [5035429.361102] [<ffffffffa1d1e883>] osc_enqueue_interpret+0xe3/0x1e0 [osc]
> [5035429.361136] [<ffffffffa1c00152>] ptlrpc_check_set+0x562/0x1b60 [ptlrpc]
> [5035429.361174] [<ffffffffa1c2bd5b>] ptlrpcd_check+0x52b/0x550 [ptlrpc]
> [5035429.361219] [<ffffffffa1c2c39b>] ptlrpcd+0x32b/0x410 [ptlrpc]
> [5035429.361244] [<ffffffff81083f16>] kthread+0x96/0xa0
> [5035429.361249] [<ffffffff8146d964>] kernel_thread_helper+0x4/0x10
> [5035429.361252]
> [5035429.361378] Kernel panic - not syncing: LBUG



 Comments   
Comment by Gerrit Updater [ 20/May/16 ]

Andriy Skulysh (andriy.skulysh@seagate.com) uploaded a new patch: http://review.whamcloud.com/20345
Subject: LU-8175 ldlm: conflicting PW & PR extent locks on a client
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a105cb6160629cd848e5d6a45a3a11028559f39

Comment by Joseph Gmitter (Inactive) [ 23/May/16 ]

Hi Jinshan,

Can you please review the patch?

Thanks.
Joe

Comment by Patrick Farrell (Inactive) [ 23/May/16 ]

Hi,

Just a general note: the underlying problem (two overlapping extent locks granted during failover) isn't limited to a single client as in this case (i.e., it can also happen with locks held by two different clients) and can lead to data corruption. We'll try to open an LU ticket about that soon.
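
For readers unfamiliar with the conflict described above, here is a minimal sketch of what "conflicting" means for a PW (protected write) and a PR (protected read) extent lock: the two byte ranges overlap and the modes are not compatible. This is not the Lustre ldlm code and not the patch for this ticket. The type and function names (`extent_lock`, `modes_compatible`, `extents_overlap`, `locks_conflict`) are hypothetical, and the compatibility rule is simplified to just the two modes named in the ticket title.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified lock modes: only the two named in this ticket. */
enum lock_mode { MODE_PR, MODE_PW };

/* Hypothetical extent lock: a mode plus an inclusive byte range. */
struct extent_lock {
    enum lock_mode mode;
    uint64_t start; /* first byte covered, inclusive */
    uint64_t end;   /* last byte covered, inclusive */
};

/* Assumed rule for this sketch: two readers may coexist;
 * a writer is incompatible with both readers and other writers. */
static bool modes_compatible(enum lock_mode a, enum lock_mode b)
{
    return a == MODE_PR && b == MODE_PR;
}

/* Two inclusive ranges overlap when each starts before the other ends. */
static bool extents_overlap(const struct extent_lock *a,
                            const struct extent_lock *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Two locks conflict when their ranges overlap and their modes
 * cannot be held at the same time. */
static bool locks_conflict(const struct extent_lock *a,
                           const struct extent_lock *b)
{
    return extents_overlap(a, b) && !modes_compatible(a->mode, b->mode);
}
```

Under these assumptions, a PW lock on bytes 0 to 1048575 and a PR lock on bytes 4096 to 8191 conflict, so both being granted at once is the invalid state this ticket describes. Two overlapping PR locks, or a PW and a PR lock on disjoint ranges, do not conflict.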

Comment by parinay v kondekar (Inactive) [ 23/Aug/16 ]

LU-8388 looks to be a duplicate of this issue.

Comment by Gerrit Updater [ 02/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20345/
Subject: LU-8175 ldlm: conflicting PW & PR extent locks on a client
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 80a818b80373bebd1438a74aeebda102b4885e53

Comment by Peter Jones [ 02/Sep/16 ]

Landed for 2.9
