[LU-16596] Lustre client crashed with ASSERTION( ldlm_is_granted(lock) ) failed Created: 27/Feb/23  Updated: 30/Nov/23  Resolved: 30/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Dominika Wanat Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Rocky Linux 8.7, Kernel 4.18.0-425.10.1.el8_7.x86_64, Lustre Client 2.15.2 (compiled from b2_15 branch).


Issue Links:
Related
is related to LU-17278 ldlm_cli_inodebits_convert() should n... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Yesterday one of our Lustre clients built on top of the 2.15.2 release (from the b2_15 branch) crashed with the following LBUG:

[1361556.476660] LustreError: 882654:0:(ldlm_lock.c:1097:ldlm_grant_lock_with_skiplist()) ASSERTION( ldlm_is_granted(lock) ) failed: 
[1361556.478018] LustreError: 882654:0:(ldlm_lock.c:1097:ldlm_grant_lock_with_skiplist()) LBUG
[1361556.478681] Pid: 882654, comm: ldlm_bl_47 4.18.0-425.10.1.el8_7.x86_64 #1 SMP Thu Jan 12 16:32:13 UTC 2023
[1361556.479401] Call Trace TBD:
[1361556.479664] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[1361556.480109] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[1361556.480552] [<0>] ldlm_grant_lock_with_skiplist+0x642/0x780 [ptlrpc]
[1361556.481236] [<0>] ldlm_inodebits_drop+0xba/0x160 [ptlrpc]
[1361556.481706] [<0>] ldlm_cli_inodebits_convert+0x426/0x6c0 [ptlrpc]
[1361556.482242] [<0>] ldlm_cli_convert+0x68/0x2a0 [ptlrpc]
[1361556.482673] [<0>] ll_md_blocking_ast+0x131/0x2f0 [lustre]
[1361556.483137] [<0>] ldlm_handle_bl_callback+0xbc/0x3f0 [ptlrpc]
[1361556.483646] [<0>] ldlm_bl_thread_main+0x633/0x930 [ptlrpc]
[1361556.484103] [<0>] kthread+0x10b/0x130
[1361556.484440] [<0>] ret_from_fork+0x1f/0x40
[1361556.484767] Kernel panic - not syncing: LBUG
[1361556.485112] CPU: 9 PID: 882654 Comm: ldlm_bl_47 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.10.1.el8_7.x86_64 #1
[1361556.486068] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module_el8.3.0+555+a55c8938 04/01/2014
[1361556.486785] Call Trace:
[1361556.487006]  dump_stack+0x41/0x60
[1361556.487297]  panic+0xe7/0x2ac
[1361556.487542]  ? ret_from_fork+0x1f/0x40
[1361556.487849]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[1361556.488254]  ldlm_grant_lock_with_skiplist+0x642/0x780 [ptlrpc]
[1361556.488751]  ldlm_inodebits_drop+0xba/0x160 [ptlrpc]
[1361556.489183]  ldlm_cli_inodebits_convert+0x426/0x6c0 [ptlrpc]
[1361556.489679]  ? ll_have_md_lock+0x169/0x3f0 [lustre]
[1361556.490087]  ldlm_cli_convert+0x68/0x2a0 [ptlrpc]
[1361556.490492]  ll_md_blocking_ast+0x131/0x2f0 [lustre]
[1361556.490906]  ? obd_stale_export_get+0x75/0x190 [obdclass]
[1361556.491438]  ldlm_handle_bl_callback+0xbc/0x3f0 [ptlrpc]
[1361556.491950]  ldlm_bl_thread_main+0x633/0x930 [ptlrpc]
[1361556.492385]  ? finish_wait+0x80/0x80
[1361556.492675]  ? ldlm_handle_bl_callback+0x3f0/0x3f0 [ptlrpc]
[1361556.493150]  kthread+0x10b/0x130
[1361556.493413]  ? set_kthread_struct+0x50/0x50
[1361556.493803]  ret_from_fork+0x1f/0x40

I found a similar ticket, https://jira.whamcloud.com/browse/LU-13927, but it was reported against Lustre 2.12.5 on the MDS, and the patches linked in that ticket are already included in our build.



 Comments   
Comment by Etienne Aujames [ 01/Mar/23 ]

The patch https://review.whamcloud.com/39854 is not included in 2.12.5:

$ git log --oneline -1 dcbb023c2f57fff8c856cb5c777855266b7f7b6c
dcbb023 LU-11276 ldlm: fix lock convert races
$ git tag --contains=dcbb023c2f57fff8c856cb5c777855266b7f7b6c
2.12.6
2.12.6-RC1
2.12.6-RC2
2.12.7
2.12.7-RC1
2.12.8
2.12.9
2.12.9-RC1
v2_12_6
v2_12_6-RC1
v2_12_6-RC2
v2_12_7
v2_12_7-RC1
v2_12_8
v2_12_9
v2_12_9-RC1
Comment by Dominika Wanat [ 06/Mar/23 ]

Hello Etienne, thank you for your help. We have our own branch named "ares-client-2.15-rocky", and this commit seems to have a different commit id in our case (probably because this branch was merged from master some time ago and has nothing in common with b2_12). This patch was cherry-picked to master, where it has commit id 6c0b676e41245c2f74bcf7f3f1ac9fcb0fd6c319, and you can find it here: https://review.whamcloud.com/c/fs/lustre-release/+/36466

$ git log --oneline -1 dcbb023c2f57fff8c856cb5c777855266b7f7b6c
dcbb023c2f LU-11276 ldlm: fix lock convert races 
$ git log --oneline -1 6c0b676e41245c2f74bcf7f3f1ac9fcb0fd6c319
6c0b676e41 LU-11276 ldlm: fix lock convert races

The second commit is present in our branch (and in master):

$ git branch --contains=6c0b676e41245c2f74bcf7f3f1ac9fcb0fd6c319 
* ares-client-2.15-rocky
(...) 
master

The commit id you cited belongs to different branches, the 2.12.x ones based on b2_12 (this is clearly visible in Gerrit):

$  git branch --contains dcbb023c2f57fff8c856cb5c777855266b7f7b6c
  prometheus-client-2.12
  prometheus-client-2.12-sysofed 

So I think we can assume that this patch is already present in our build.
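
Because a cherry-picked patch gets a new commit id on each branch, a check by subject line can complement the hash-based checks above. The following is only a minimal sketch, assuming a POSIX shell inside a lustre-release checkout; branch_has_subject is a hypothetical helper name, not something provided by git or Lustre. A matching subject does not prove that the content is identical, so it is a quick sanity check rather than a guarantee.

```shell
# Hypothetical helper (not part of git or Lustre): succeeds when BRANCH
# contains at least one commit whose subject line is exactly SUBJECT.
# Useful when the same patch has different commit ids on different branches.
branch_has_subject() {
    branch=$1
    subject=$2
    # --grep narrows the history to commits mentioning SUBJECT anywhere in
    # the message; grep -Fx then requires an exact match on the subject line.
    git log --format=%s --fixed-strings --grep="$subject" "$branch" -- |
        grep -Fx -- "$subject" >/dev/null
}

# Example call (branch name and subject taken from this ticket):
# branch_has_subject ares-client-2.15-rocky \
#     'LU-11276 ldlm: fix lock convert races' && echo "patch present"
```

The helper returns a non-zero status when no commit with that subject is reachable from the branch, so it can be used directly in an if statement or with && and ||.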

 

Comment by Andreas Dilger [ 30/Nov/23 ]

This should be fixed by the patch in LU-17278.
