[LU-11786] 6752:0:(osd_handler.c:2308:osd_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed: Created: 15/Dec/18  Updated: 09/Jun/20  Resolved: 09/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Yang Sheng
Resolution: Duplicate Votes: 0
Labels: None
Environment:

kernel: 3.10.0-693.21.1.el7
Lustre 2.10.5
SRC: https://github.com/jlan/lustre-nas/tree/nas-2.10.5


Issue Links:
Duplicate
duplicates LU-10678 LBUG: osd_handler.c:2353:osd_read_loc... Resolved
Related
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

MDS hit an LBUG. Possibly a duplicate of LU-10678?

[2470859.924802] LustreError: 6752:0:(osd_handler.c:2308:osd_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed: 
[2470859.957599] LustreError: 6752:0:(osd_handler.c:2308:osd_read_lock()) LBUG
[2470859.978428] Pid: 6752, comm: mdt00_055 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
[2470859.978429] Call Trace:
[2470859.978443]  [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
[2470859.996970]  [<ffffffffa08a27cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[2470859.996975]  [<ffffffffa08a287c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[2470859.996987]  [<ffffffffa1040a2a>] osd_read_lock+0xda/0xe0 [osd_ldiskfs]
[2470859.996997]  [<ffffffffa12d61ca>] lod_read_lock+0x3a/0xd0 [lod]
[2470859.997006]  [<ffffffffa134e8ea>] mdd_read_lock+0x3a/0xd0 [mdd]
[2470859.997016]  [<ffffffffa135183c>] mdd_xattr_get+0x6c/0x390 [mdd]
[2470859.997032]  [<ffffffffa11f23fd>] mdt_stripe_get+0xcd/0x3c0 [mdt]
[2470859.997043]  [<ffffffffa11f29e9>] mdt_attr_get_complex+0x2f9/0xb10 [mdt]
[2470859.997055]  [<ffffffffa1219e74>] mdt_reint_open+0x1374/0x3190 [mdt]
[2470859.997067]  [<ffffffffa120faf3>] mdt_reint_rec+0x83/0x210 [mdt]
[2470859.997076]  [<ffffffffa11f133b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[2470859.997084]  [<ffffffffa11f1862>] mdt_intent_reint+0x162/0x430 [mdt]
[2470859.997093]  [<ffffffffa11fc631>] mdt_intent_policy+0x441/0xc70 [mdt]
[2470859.997138]  [<ffffffffa0bab2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[2470859.997170]  [<ffffffffa0bd4b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[2470859.997214]  [<ffffffffa0c5a262>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2470859.997253]  [<ffffffffa0c5deca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[2470859.997286]  [<ffffffffa0c064bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[2470859.997318]  [<ffffffffa0c0a4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[2470859.997324]  [<ffffffff810b1131>] kthread+0xd1/0xe0
[2470859.997327]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
[2470859.997345]  [<ffffffffffffffff>] 0xffffffffffffffff

We have the crash dump if needed.
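
For reference, a minimal paraphrase of the locking path from the trace, based on osd_read_lock() in lustre/osd-ldiskfs/osd_handler.c (the cited line number varies across the builds in this ticket: 2294, 2308, 2309; consult the exact 2.10.5 source for the authoritative version):

/* Sketch of osd_read_lock(), paraphrased rather than verbatim.
 * oo_sem is a rw_semaphore; oo_owner records the lu_env of an
 * exclusive (write) locker so recursive locking by the same
 * thread can be detected. */
static void osd_read_lock(const struct lu_env *env, struct dt_object *dt,
                          unsigned role)
{
        struct osd_object *obj = osd_dt_obj(dt);
        struct osd_thread_info *oti = osd_oti_get(env);

        LASSERT(obj->oo_owner != env);        /* not write-locked by us */
        down_read_nested(&obj->oo_sem, role); /* take the shared lock */

        /* A reader holding oo_sem should never observe a write owner;
         * this is the LASSERT that fires in the trace above. */
        LASSERT(obj->oo_owner == NULL);
        oti->oti_r_locks++;
}

Since oo_sem itself should exclude readers from any writer, a reader observing a non-NULL oo_owner points at the semaphore rather than at the Lustre code above it.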



 Comments   
Comment by Peter Jones [ 16/Dec/18 ]

Yang Sheng

Is this a duplicate of LU-10678?

Peter

Comment by Yang Sheng [ 17/Dec/18 ]

Yes, it is exactly the same issue.

Comment by Mahmoud Hanafi [ 17/Dec/18 ]

We just hit the same LBUG on a different filesystem. This is becoming a critical issue for us.

Comment by Yang Sheng [ 17/Dec/18 ]

Hi, Mahmoud,

Could you help by running a debug patch to track down this issue?

Thanks,
YangSheng

Comment by Mahmoud Hanafi [ 17/Dec/18 ]

We don't have a specific reproducer but are willing to install a debug patch.

Do you think there is any info in the crash dump that can help?

This was an older filesystem (2.4 format) that was upgraded to 2.7.3 and is now on 2.10.5.

Comment by Gerrit Updater [ 18/Dec/18 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33885
Subject: LU-11786 osd: debug patch
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: dda92128b6b7db3544cf4f9252738a7596f95eb2

Comment by Yang Sheng [ 18/Dec/18 ]

Hi, Mahmoud,

Could you please apply this patch and provide a vmcore after you next hit this issue? Does the issue appear frequently? Was it first hit after updating to the 693 kernel?

Thanks,
YangSheng

Comment by Mahmoud Hanafi [ 18/Dec/18 ]

We have only seen it twice on our large filesystem since updating to CentOS 7/Lustre 2.10.5 at the end of October. We will try out the debug patch.

Comment by Mahmoud Hanafi [ 13/Feb/19 ]

We finally got a crash from this bug again. I can upload the vmcore, but it would have to be viewed by US citizens only.
Here is the debug message.

[1594255.296920] LustreError: 12134:0:(osd_handler.c:2309:osd_read_lock()) ASSERTION( obj->oo_owner == NULL ) failed: owner=ffff883eb1f787c0, obj=ffff8811b9838b00
[1594255.296922] LustreError: 10199:0:(osd_handler.c:2309:osd_read_lock()) ASSERTION( obj->oo_owner == NULL ) failed: owner=ffff883eb1f787c0, obj=ffff8811b9838b00
[1594255.296926] LustreError: 10199:0:(osd_handler.c:2309:osd_read_lock()) LBUG

crash> struct osd_object ffff8811b9838b00
struct osd_object {
  oo_dt = {
    do_lu = {
      lo_header = 0xffff880d8d14e288,
      lo_dev = 0xffff881dce2e0000,
      lo_ops = 0xffffffffa1047060 <osd_lu_obj_ops>,
      lo_linkage = { next = 0xffff880d8d14e2c8, prev = 0xffff880f21d937d0 },
      lo_dev_ref = {<No data fields>}
    },
    do_ops = 0xffffffffa1046f60 <osd_obj_ops>,
    do_body_ops = 0xffffffffa1048b20 <osd_body_ops>,
    do_index_ops = 0x0
  },
  oo_inode = 0xffff880db0057708,
  oo_hl_head = 0xffff881429684000,
  oo_ext_idx_sem = {
    {
      count = { counter = 0 },
      __UNIQUE_ID_rh_kabi_hide2 = { count = 0 },
      {<No data fields>}
    },
    wait_lock = { raw_lock = { val = { counter = 0 } } },
    osq = { tail = { counter = 0 } },
    wait_list = { next = 0xffff8811b9838b60 },
    owner = 0x0
  },
  oo_sem = {
    {
      count = { counter = -4294967294 },
      __UNIQUE_ID_rh_kabi_hide2 = { count = -4294967294 },
      {<No data fields>}
    },
    wait_lock = { raw_lock = { val = { counter = 0 } } },
    osq = { tail = { counter = 0 } },
    wait_list = { next = 0xffff883ec50bf7f0 },
    owner = 0x1
  },
  oo_dir = 0x0,
  oo_guard = { rlock = { raw_lock = { val = { counter = 0 } } } },
  oo_destroyed = 0,
  oo_lma_flags = 0,
  oo_compat_dot_created = 1,
  oo_compat_dotdot_created = 1,
  oo_owner = 0x0,
  oo_xattr_list = { next = 0xffff8811b9839c00, prev = 0xffff8811b9839c00 }
}
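
Two details stand out in this dump. First, oo_owner is 0x0 here even though the assertion messages reported owner=ffff883eb1f787c0 for the same object, so the owner was cleared between the failed check and the dump. Second, oo_sem.count = -4294967294 is 0xffffffff00000002, which under the stock x86_64 rwsem-xadd bias constants in the RHEL7 3.10 tree (an assumption; verify against the running kernel's headers) decodes to two active readers plus a queued-waiter bias, with oo_sem.owner = 0x1 plausibly being the "reader-owned" marker used by later RHEL7 rwsem code. A standalone decoder sketch under those assumptions:

/* decode_rwsem.c -- hypothetical helper, not Lustre code.
 * Assumes x86_64 rwsem-xadd constants from the RHEL7 3.10 tree:
 * RWSEM_ACTIVE_BIAS = 1, RWSEM_WAITING_BIAS = -2^32.
 * Build and run: cc decode_rwsem.c -o decode_rwsem && ./decode_rwsem
 */
#include <stdio.h>
#include <stdint.h>

#define RWSEM_ACTIVE_MASK 0xffffffffLL

int main(void)
{
        int64_t count  = -4294967294LL;               /* oo_sem.count above */
        int64_t active = count & RWSEM_ACTIVE_MASK;   /* low 32 bits */
        int64_t bias   = (count - active) / (RWSEM_ACTIVE_MASK + 1);

        printf("count  = %#llx\n", (unsigned long long)count); /* 0xffffffff00000002 */
        printf("active = %lld lockers\n", (long long)active);  /* 2 */
        printf("bias   = %lld (negative => waiters queued)\n", (long long)bias); /* -1 */
        return 0;
}

Reader-owned with waiters queued is a legal rwsem state; the anomaly is the non-NULL oo_owner that the readers observed, which fits a semaphore/ordering bug in the kernel rather than corruption of the osd_object itself.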


Comment by Peter Jones [ 14/Feb/19 ]

Mahmoud

We actually got a crash dump from a site without restrictions around citizenship a couple of days ago and so there is no immediate need to investigate your crash dump. We're working out next steps ATM.

Peter

Comment by Yang Sheng [ 15/Feb/19 ]

Hi, Mahmoud,

Could you please tell me your CPU type (e.g. Haswell)? Please ignore this if it is sensitive information.

Thanks,
YangSheng

Comment by Mahmoud Hanafi [ 20/Feb/19 ]

The servers are E5-2670 (Sandy Bridge).

Comment by Taizeng Wu [ 12/Apr/19 ]

We have the same problem.

Environment:

  • lustre-2.10.3
  • kernel-3.10.0-693.11.6.el7_lustre
  • CPU: Gold 6148

Dmesg:

[394034.327556] LustreError: 251105:0:(osd_handler.c:2294:osd_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed:
[394034.327561] LustreError: 251081:0:(osd_handler.c:2294:osd_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed:
[394034.327566] LustreError: 355561:0:(osd_handler.c:2294:osd_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed:
[394034.327571] LustreError: 251061:0:(osd_handler.c:2294:osd_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed:
[394034.327575] LustreError: 251081:0:(osd_handler.c:2294:osd_read_lock()) LBUG
[394034.327579] LustreError: 355561:0:(osd_handler.c:2294:osd_read_lock()) LBUG
[394034.327587] LustreError: 251061:0:(osd_handler.c:2294:osd_read_lock()) LBUG
[394034.327589] Pid: 251081, comm: mdt01_027
[394034.327591] Pid: 355561, comm: mdt01_066
[394034.327593] Pid: 251061, comm: mdt01_018
[394034.327594] Call Trace:
[394034.327595] Call Trace:
[394034.327596] Call Trace:
[394034.327629] [<ffffffffc091e7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[394034.327631] [<ffffffffc091e7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[394034.327632] [<ffffffffc091e7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[394034.327643] [<ffffffffc091e83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[394034.327644] [<ffffffffc091e83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[394034.327651] [<ffffffffc091e83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[394034.327664] [<ffffffffc1933700>] ? mdd_xattr_get+0x0/0x390 [mdd]
[394034.327665] [<ffffffffc1933700>] ? mdd_xattr_get+0x0/0x390 [mdd]
[394034.327673] [<ffffffffc1933700>] ? mdd_xattr_get+0x0/0x390 [mdd]
[394034.327680] [<ffffffffc16249ca>] osd_read_lock+0xda/0xe0 [osd_ldiskfs]
[394034.327685] [<ffffffffc16249ca>] osd_read_lock+0xda/0xe0 [osd_ldiskfs]
[394034.327694] [<ffffffffc16249ca>] osd_read_lock+0xda/0xe0 [osd_ldiskfs]
[394034.327700] [<ffffffffc18b80c7>] lod_read_lock+0x37/0xd0 [lod]
[394034.327712] [<ffffffffc18b80c7>] lod_read_lock+0x37/0xd0 [lod]
[394034.327715] [<ffffffffc1930157>] mdd_read_lock+0x37/0xd0 [mdd]
[394034.327718] [<ffffffffc18b80c7>] lod_read_lock+0x37/0xd0 [lod]
[394034.327730] [<ffffffffc193376c>] mdd_xattr_get+0x6c/0x390 [mdd]
[394034.327732] [<ffffffffc1930157>] mdd_read_lock+0x37/0xd0 [mdd]
[394034.327738] [<ffffffffc1930157>] mdd_read_lock+0x37/0xd0 [mdd]
[394034.327752] [<ffffffffc193376c>] mdd_xattr_get+0x6c/0x390 [mdd]
[394034.327756] [<ffffffffc193376c>] mdd_xattr_get+0x6c/0x390 [mdd]
[394034.327759] [<ffffffffc17d53bb>] mdt_stripe_get+0xcb/0x3c0 [mdt]
[394034.327776] [<ffffffffc17d59a9>] mdt_attr_get_complex+0x2f9/0xb00 [mdt]
[394034.327783] [<ffffffffc17d53bb>] mdt_stripe_get+0xcb/0x3c0 [mdt]
[394034.327785] [<ffffffffc17d53bb>] mdt_stripe_get+0xcb/0x3c0 [mdt]
[394034.327803] [<ffffffffc17fcbb7>] mdt_reint_open+0x1387/0x31a0 [mdt]
[394034.327812] [<ffffffffc17d59a9>] mdt_attr_get_complex+0x2f9/0xb00 [mdt]
[394034.327813] [<ffffffffc17d59a9>] mdt_attr_get_complex+0x2f9/0xb00 [mdt]
[394034.327845] [<ffffffffc17fcbb7>] mdt_reint_open+0x1387/0x31a0 [mdt]
[394034.327847] [<ffffffffc17fcbb7>] mdt_reint_open+0x1387/0x31a0 [mdt]
[394034.327865] [<ffffffffc0c624ce>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
[394034.327904] [<ffffffffc0c673ce>] ? lu_ucred+0x1e/0x30 [obdclass]
[394034.327909] [<ffffffffc0c624ce>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
[394034.327911] [<ffffffffc0c624ce>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
[394034.327925] [<ffffffffc17e2925>] ? mdt_ucred+0x15/0x20 [mdt]
[394034.327944] [<ffffffffc17e31f1>] ? mdt_root_squash+0x21/0x430 [mdt]
[394034.327953] [<ffffffffc0c673ce>] ? lu_ucred+0x1e/0x30 [obdclass]
[394034.327964] [<ffffffffc0c673ce>] ? lu_ucred+0x1e/0x30 [obdclass]
[394034.327967] [<ffffffffc17f28a0>] mdt_reint_rec+0x80/0x210 [mdt]
[394034.327973] [<ffffffffc17e2925>] ? mdt_ucred+0x15/0x20 [mdt]
[394034.327984] [<ffffffffc17d430b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[394034.327993] [<ffffffffc17e31f1>] ? mdt_root_squash+0x21/0x430 [mdt]
[394034.327994] [<ffffffffc17e2925>] ? mdt_ucred+0x15/0x20 [mdt]
[394034.328001] [<ffffffffc17d4832>] mdt_intent_reint+0x162/0x430 [mdt]
[394034.328016] [<ffffffffc17f28a0>] mdt_reint_rec+0x80/0x210 [mdt]
[394034.328020] [<ffffffffc17df59e>] mdt_intent_policy+0x43e/0xc70 [mdt]
[394034.328024] [<ffffffffc17e31f1>] ? mdt_root_squash+0x21/0x430 [mdt]
[394034.328033] [<ffffffffc17d430b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[394034.328050] [<ffffffffc17d4832>] mdt_intent_reint+0x162/0x430 [mdt]
[394034.328054] [<ffffffffc17f28a0>] mdt_reint_rec+0x80/0x210 [mdt]
[394034.328068] [<ffffffffc17df59e>] mdt_intent_policy+0x43e/0xc70 [mdt]
[394034.328082] [<ffffffffc17d430b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[394034.328097] [<ffffffffc120012f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]

Comment by Peter Jones [ 11/Jun/19 ]

Red Hat issue

Comment by Mahmoud Hanafi [ 26/Jul/19 ]

We hit this issue with kernel-3.10.0-957.21.3.el7.x86_64, so it doesn't appear to be fixed in the latest CentOS 7.6 kernel.

Comment by Mahmoud Hanafi [ 18/Feb/20 ]

We hit this again with 3.10.0-957.21.3.el7 and Lustre 2.12.2.

Comment by Xiao Zhenggang [ 02/Apr/20 ]

We hit this with 3.10.0-957.10.1.el7_lustre.x86_64 and Lustre 2.12.2 last night.

Comment by Andreas Dilger [ 09/Jun/20 ]

This is fixed in RHEL7.7 and later kernels, see LU-12508 for details.
