[LU-10678] LBUG: osd_handler.c:2353:osd_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed: Created: 16/Feb/18  Updated: 27/Oct/21  Resolved: 11/Jun/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Yang Sheng
Resolution: Not a Bug Votes: 1
Labels: soak
Environment:

Soak stress cluster - Lustre version=2.10.57_58_gf24340c.


Issue Links:
Duplicate
is duplicated by LU-11786 6752:0:(osd_handler.c:2308:osd_read_l... Resolved
is duplicated by LU-12508 (llite_mmap.c:71:our_vma()) ASSERTION... Closed
Related
is related to LU-15156 Back port upstream patch for rwsem issue Resolved

 Description   

The soak MDT was in normal operation when it hit a sudden LBUG:

Feb 16 09:28:39 soak-8 kernel: LustreError: 2688:0:(osd_handler.c:2353:osd_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed:
Feb 16 09:28:39 soak-8 kernel: LustreError: 2688:0:(osd_handler.c:2353:osd_read_lock()) LBUG
Feb 16 09:28:39 soak-8 kernel: Pid: 2688, comm: mdt00_028
Feb 16 09:28:39 soak-8 kernel: Call Trace:
Feb 16 09:28:39 soak-8 kernel: [<ffffffffc0dbc7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Feb 16 09:28:39 soak-8 kernel: [<ffffffffc0dbc83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Feb 16 09:28:39 soak-8 kernel: [<ffffffffc140599a>] osd_read_lock+0xda/0xe0 [osd_ldiskfs]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc1691287>] lod_read_lock+0x37/0xd0 [lod]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc17125c7>] mdd_read_lock+0x37/0xd0 [mdd]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc1715dcc>] mdd_xattr_get+0x6c/0x390 [mdd]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc1593c3f>] mdt_pack_acl2body+0x1af/0x800 [mdt]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc15beaf9>] mdt_finish_open+0x289/0x690 [mdt]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc15c120b>] mdt_reint_open+0x230b/0x3260 [mdt]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc0f27d2e>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc15a4b43>] ? ucred_set_jobid+0x53/0x70 [mdt]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc15b5400>] mdt_reint_rec+0x80/0x210 [mdt]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc1594f8b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc15a1437>] mdt_intent_reint+0x157/0x420 [mdt]
Feb 16 09:28:40 soak-8 kernel: [<ffffffffc15980b2>] mdt_intent_opc+0x442/0xad0 [mdt]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc1144470>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc159fc63>] mdt_intent_policy+0x1a3/0x360 [mdt]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc10f4202>] ldlm_lock_enqueue+0x382/0x8f0 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc111c753>] ldlm_handle_enqueue0+0x8f3/0x13e0 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc11444f0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc11a2202>] tgt_enqueue+0x62/0x210 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc11aa405>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc114e58e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc114b448>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffff810c6440>] ? default_wake_function+0x0/0x20
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc1151d42>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffffc11512b0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
Feb 16 09:28:41 soak-8 kernel: [<ffffffff810b252f>] kthread+0xcf/0xe0
Feb 16 09:28:42 soak-8 kernel: [<ffffffff810b2460>] ? kthread+0x0/0xe0
Feb 16 09:28:42 soak-8 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
Feb 16 09:28:42 soak-8 kernel: [<ffffffff810b2460>] ? kthread+0x0/0xe0
Feb 16 09:28:42 soak-8 kernel:
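For context, the assertion fires in the OSD object locking path of osd-ldiskfs. A condensed sketch of the read/write lock pair, loosely following osd_handler.c of that era (LINVRNT checks and per-thread lock counters omitted; exact line numbers differ between the versions reported below), shows why oo_owner must be NULL whenever a reader gets in:

    /* Sketch, not verbatim osd_handler.c: oo_owner records the environment
     * that holds oo_sem for write; readers must never see it set. */
    static void osd_write_lock(const struct lu_env *env, struct dt_object *dt,
                               unsigned role)
    {
            struct osd_object *obj = osd_dt_obj(dt);

            down_write_nested(&obj->oo_sem, role);
            LASSERT(obj->oo_owner == NULL);
            obj->oo_owner = env;            /* mark the write-lock holder */
    }

    static void osd_read_lock(const struct lu_env *env, struct dt_object *dt,
                              unsigned role)
    {
            struct osd_object *obj = osd_dt_obj(dt);

            down_read_nested(&obj->oo_sem, role);
            /* A writer clears oo_owner before up_write(), so with a correct
             * rwsem this can only fail if down_read() returned while a
             * writer still held oo_sem. This is the LASSERT that LBUGs. */
            LASSERT(obj->oo_owner == NULL);
    }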


 Comments   
Comment by Peter Jones [ 16/Feb/18 ]

Yang Sheng

Can you please advise?

Peter

Comment by Sebastien Piechurski [ 22/Feb/18 ]

Hi,

We have seen this a couple of times on 2.7.21.2 recently.

Crash dump collection failed on the first occurrence, and I am waiting for confirmation about the second occurrence.

Would you be interested in a dump if we get one?

Comment by Yang Sheng [ 22/Feb/18 ]

Hi, Sebastien,

It would be very helpful if you can get a crash dump. TIA.

Thanks,
YangSheng

Comment by Sebastien Piechurski [ 23/Feb/18 ]

Unfortunately, the dump collection failed because the crashkernel=auto parameter does not reserve enough memory for our configuration. I have requested that this be adjusted. Let's hope we can get a dump at the next crash.

Regards,

Sebastien.

Comment by Johann Peyrard (Inactive) [ 22/Oct/18 ]

Hi,

I have hit this LBUG on a server running lustre-el7.3-2.7.21.3-255.ddn20.g10dd357.el7.x86_64:

[75345.251740] LustreError: 7714:0:(osd_handler.c:1751:osd_object_read_lock()) ASSERTION( obj->oo_owner == ((void *)0) ) failed: 
[75345.265763] LustreError: 7714:0:(osd_handler.c:1751:osd_object_read_lock()) LBUG 

Do I need to open a new Jira ticket for this one, or can we use this one? It seems similar, but I prefer to ask.

I will try to get the crash file and the full dmesg this week.

Regards,

Johann

Comment by Lixin Liu [ 17/Nov/18 ]

We have a few similar crashes on version 2.10.1, kernel 3.10.0-693.2.2.el7_lustre.x86_64.

We have an incomplete kernel dump, which I uploaded to ftp.whamcloud.com in the /uploads/LU-10678 directory. Not sure if this helps.

Thanks.

Lixin Liu
Simon Fraser University

Comment by Sebastien Piechurski [ 19/Nov/18 ]

We have one complete vmcore from an MDS running kernel 3.10.0-693.11.1.el7 and lustre 2.7.21.2.

I have uploaded it to ftp.whamcloud.com/uploads/LU-10678/vmcore-3.10.0-693.11.1.el7.x86_64_lustre-2.7.21.2

Comment by Yang Sheng [ 19/Nov/18 ]

Hi, Sebastien,

It looks like you are using a non-standard combination of Lustre & kernel? 2.7.21.2 should be paired with a 3.10.0-514.xx kernel. Can you provide the debuginfo RPMs?

Thanks,
YangSheng

Comment by Sebastien Piechurski [ 20/Nov/18 ]

Hi Yang Sheng,

I have just uploaded the corresponding Lustre and kernel debuginfo packages to the same directory.

Regards,

Sebastien.

Comment by rajgautam [ 27/Apr/19 ]

Also seen on a server running Lustre version 2.11.0.201 and kernel version 3.10.0-693.21.1.x3.1.11.x86_64:

Apr 25 11:49:02 hostname-n03 kernel: Pid: 26722, comm: mdt03_004
Apr 25 11:49:02 hostname-n03 kernel: IEC: 026000003: LASSERT: { "pid": "26722", "ext_pid": "0", "filename": "osd_handler.c", "line": "2382", "func_name": "osd_read_lock", "assert_info": "( obj->oo_owner == ((void *)0) ) failed: " }
Apr 25 11:49:02 hostname-n03 kernel: IEC: 026000004: LBUG: { "pid": "26722", "ext_pid": "0", "filename": "osd_handler.c", "line": "2382", "func_name": "osd_read_lock" }
Apr 25 11:49:02 hostname-n03 kernel:
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc0a407ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc0a4083c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc12d4800>] ? mdd_xattr_get+0x0/0x5c0 [mdd]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc15569ca>] osd_read_lock+0xda/0xe0 [osd_ldiskfs]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc141256a>] lod_read_lock+0x3a/0xd0 [lod]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc12ce82a>] mdd_read_lock+0x3a/0xd0 [mdd]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc12d4870>] mdd_xattr_get+0x70/0x5c0 [mdd]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc1413b0b>] ? lod_attr_get+0xab/0x130 [lod]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc1370490>] mdt_get_som+0x90/0x210 [mdt]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc133ea95>] mdt_attr_get_complex+0x955/0xb10 [mdt]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc1368248>] mdt_reint_open+0x898/0x3190 [mdt]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc0c41261>] ? upcall_cache_get_entry+0x211/0x8d0 [obdclass]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc0c46f0e>] ? lu_ucred+0x1e/0x30 [obdclass]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc134ccc5>] ? mdt_ucred+0x15/0x20 [mdt]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc134d551>] ? mdt_root_squash+0x21/0x430 [mdt]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc135dd93>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc133d1bb>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc1349737>] mdt_intent_reint+0x157/0x420 [mdt]
Apr 25 11:49:02 hostname-n03 kernel: [<ffffffffc1340315>] mdt_intent_opc+0x455/0xae0 [mdt]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0e69d10>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc1347f63>] mdt_intent_policy+0x1a3/0x360 [mdt]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0e1af9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0a5674e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0e43843>] ldlm_handle_enqueue0+0x8f3/0x13e0 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0e69d90>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0ec8572>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0ece8ba>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0e73f13>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0e71aa5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffff810c7c92>] ? default_wake_function+0x12/0x20
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0e77862>] ptlrpc_main+0xab2/0x1f70 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffffc0e76db0>] ? ptlrpc_main+0x0/0x1f70 [ptlrpc]
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffff816c155d>] ret_from_fork+0x5d/0xb0
Apr 25 11:49:03 hostname-n03 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
Apr 25 11:49:03 hostname-n03 kernel:
Apr 25 11:49:03 hostname-n03 kernel: Kernel panic - not syncing: LBUG

Comment by Andrew Perepechko [ 29/May/19 ]

We, at Cray, encountered a number of similar crashes, which we associated with the broken rwsem implementation in certain RHEL7 kernels.

https://access.redhat.com/solutions/3393611

Resolution (from the Red Hat article):
- Red Hat Enterprise Linux 7.6: fixed in kernel-3.10.0-957.12.1.el7 (Errata RHSA-2019:0818)
- Red Hat Enterprise Linux 7.4.z (EUS): fixed in kernel-3.10.0-693.47.2.el7 (Errata RHSA-2019:1170)
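To make the invariant concrete: oo_owner is only ever non-NULL while oo_sem is held for write, so a reader that has successfully taken the read lock must always see it as NULL. Below is a minimal userspace analogue of what osd_read_lock() asserts (hypothetical names, POSIX rwlocks standing in for the kernel rwsem):

    /* Userspace analogue of the osd_read_lock() invariant.
     * Build with: cc -pthread invariant.c */
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t oo_sem = PTHREAD_RWLOCK_INITIALIZER;
    static void *oo_owner;          /* write-lock holder, NULL otherwise */

    static void *writer(void *arg)
    {
            for (int i = 0; i < 100000; i++) {
                    pthread_rwlock_wrlock(&oo_sem);
                    oo_owner = arg;         /* record ownership */
                    oo_owner = NULL;        /* must clear before unlock */
                    pthread_rwlock_unlock(&oo_sem);
            }
            return NULL;
    }

    static void *reader(void *arg)
    {
            (void)arg;
            for (int i = 0; i < 100000; i++) {
                    pthread_rwlock_rdlock(&oo_sem);
                    /* With a correct rwlock this can never fire; the RHEL
                     * rwsem bug let the kernel-side equivalent fire. */
                    assert(oo_owner == NULL);
                    pthread_rwlock_unlock(&oo_sem);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t w, r;

            pthread_create(&w, NULL, writer, (void *)1);
            pthread_create(&r, NULL, reader, NULL);
            pthread_join(w, NULL);
            pthread_join(r, NULL);
            puts("invariant held");
            return 0;
    }

With a correct lock implementation the assert() cannot fire; the RHEL rwsem regression allowed down_read() to return while a writer still held the semaphore, which is exactly the state the LBUG reports.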

Comment by Yang Sheng [ 29/May/19 ]

Hi, Andrew,

Thanks for the info. It is really a tricky one.

Thanks,
YangSheng

Comment by Peter Jones [ 11/Jun/19 ]

This is a Red Hat kernel bug, not a Lustre bug.
