[LU-10949] LBUG lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed Created: 24/Apr/18  Updated: 13/Apr/21  Resolved: 21/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.4
Fix Version/s: Lustre 2.13.0, Lustre 2.12.6

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: DNE, LTS12, soak
Environment:

Soak cluster - lustre-master-ib build 81


Attachments: Text File vmcore-dmesg.txt    
Issue Links:
Duplicate
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This also occurred under 2.10.4, but I missed it due to some syslog issues.
System failed over soak-8 (MDS) to soak-9, upon failback, soak-9 crashed.
vmcore available.

[67387.761484] LustreError: 167-0: soaked-MDT0000-osp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
[67387.810434] LustreError: 8279:0:(mdt_reint.c:2181:mdt_reint_rename_or_migrate()) soaked-MDT0001: can't lock FS for rename: rc  = -5
[67387.810699] LNet: Service thread pid 8242 completed after 336.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[67387.810700] LNet: Service thread pid 8615 completed after 335.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[67387.811812] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed:
[67387.811814] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) LBUG
[67387.811815] Pid: 8615, comm: mdt01_014
[67387.811816]
Call Trace:
[67387.811831]  [<ffffffffc0a567ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[67387.811837]  [<ffffffffc0a5683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[67387.811847] LustreError: 8692:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed:
[67387.811850] LustreError: 8692:0:(lu_object.h:688:lu_object_get()) LBUG
[67387.811851] Pid: 8692, comm: mdt01_017
[67387.811851]
Call Trace:
[67387.811860]  [<ffffffffc12527c9>] err_serious.part.38+0x0/0x36 [mdt]
[67387.811861]  [<ffffffffc0a567ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[67387.811867]  [<ffffffffc0a5683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[67387.811873]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
[67387.811884]  [<ffffffffc12527c9>] err_serious.part.38+0x0/0x36 [mdt]
[67387.811895]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
[67387.811901]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
[67387.811912]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[67387.811918]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
[67387.811927]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
[67387.811928]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[67387.811940]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
[67387.811949]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
[67387.811965]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
[67387.811968]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
[67387.811980]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
[67387.811984]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
[67387.811990]  [<ffffffffc12105c1>] ? mdt_root_squash+0x21/0x430 [mdt]
[67387.811995]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
[67387.812002]  [<ffffffffc12208e3>] mdt_reint_rec+0x83/0x210 [mdt]  
[67387.812006]  [<ffffffffc12105c1>] ? mdt_root_squash+0x21/0x430 [mdt]
[67387.812012]  [<ffffffffc120020b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[67387.812017]  [<ffffffffc12208e3>] mdt_reint_rec+0x83/0x210 [mdt]
[67387.812022]  [<ffffffffc120c787>] mdt_intent_reint+0x157/0x420 [mdt]
[67387.812027]  [<ffffffffc120020b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[67387.812032]  [<ffffffffc1203365>] mdt_intent_opc+0x455/0xae0 [mdt]
[67387.812036]  [<ffffffffc120c787>] mdt_intent_reint+0x157/0x420 [mdt]
[67387.812046]  [<ffffffffc1203365>] mdt_intent_opc+0x455/0xae0 [mdt]
[67387.812090]  [<ffffffffc0dd3920>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[67387.812099]  [<ffffffffc0dd3920>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[67387.812102]  [<ffffffffc120afb3>] mdt_intent_policy+0x1a3/0x360 [mdt]
[67387.812111]  [<ffffffffc120afb3>] mdt_intent_policy+0x1a3/0x360 [mdt]
[67387.812127]  [<ffffffffc0d83f1e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[67387.812136]  [<ffffffffc0a6a63e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[67387.812137]  [<ffffffffc0d83f1e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[67387.812145]  [<ffffffffc0a6a63e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[67387.812164]  [<ffffffffc0dabb13>] ldlm_handle_enqueue0+0x8f3/0x1410 [ptlrpc]
[67387.812173]  [<ffffffffc0dabb13>] ldlm_handle_enqueue0+0x8f3/0x1410 [ptlrpc]
[67387.812198]  [<ffffffffc0dd39a0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[67387.812204]  [<ffffffffc0dd39a0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[67387.812238]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
[67387.812238]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
[67387.812245]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
[67387.812288]  [<ffffffffc0e3994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[67387.812294]  [<ffffffffc0e3994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[67387.812322]  [<ffffffffc0ddda53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[67387.812330]  [<ffffffffc0ddda53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[67387.812355]  [<ffffffffc0ddb5d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[67387.812361]  [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
[67387.812364]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[67387.812366]  [<ffffffffc0ddb5d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[67387.812370]  [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
[67387.812372]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[67387.812397]  [<ffffffffc0de1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[67387.812405]  [<ffffffffc0de1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[67387.812429]  [<ffffffffc0de0770>] ? ptlrpc_main+0x0/0x1f40 [ptlrpc]
[67387.812433]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[67387.812435]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
[67387.812439]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[67387.812440]  [<ffffffffc0de0770>] ? ptlrpc_main+0x0/0x1f40 [ptlrpc]
[67387.812441]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
[67387.812442]
[67387.812443]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[67387.812443] Kernel panic - not syncing: LBUG
[67387.812446] CPU: 25 PID: 8615 Comm: mdt01_014 Tainted: P        W  OE  ------------   3.10.0-693.21.1.el7_lustre.x86_64 #1
[67387.812447]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
[67387.812448] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[67387.812449] Call Trace:
[67387.812450]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[67387.812468]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[67387.812469]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
[67387.812471]  [<ffffffff816a8634>] panic+0xe8/0x21f
[67387.812472]
[67387.812478]  [<ffffffffc0a56854>] lbug_with_loc+0x64/0xb0 [libcfs]
[67387.812492]  [<ffffffffc12527c9>] lu_object_get.isra.33.part.34+0x36/0x36 [mdt]
[67387.812502]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
[67387.812521]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
[67387.812531]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[67387.812543]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
[67387.812563]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
[67387.812582]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
[67387.812592]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
.....

Full vmcore at /scratch/dumps/soak-9.spirit.hpdd.intel.com/10.10.1.109-2018-04-24-18:20:36



 Comments   
Comment by Cliff White (Inactive) [ 25/Apr/18 ]

We hit this repeatedly on master. About every two hours.

Comment by James Nunez (Inactive) [ 27/Apr/18 ]

Lai - Would you please review this issue and comment?

Thanks,
James

Comment by Lai Siyao [ 28/Apr/18 ]

This asserts on mdt_lock_root_xattr() -> mdt_remote_object_lock(), and the message shows it was evicted:

[67387.761484] LustreError: 167-0: soaked-MDT0000-osp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
[67387.810434] LustreError: 8279:0:(mdt_reint.c:2181:mdt_reint_rename_or_migrate()) soaked-MDT0001: can't lock FS for rename: rc  = -5
[67387.810699] LNet: Service thread pid 8242 completed after 336.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[67387.810700] LNet: Service thread pid 8615 completed after 335.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[67387.811812] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 

Since root XATTR lock holds a refcount of root, there should be race here.

Comment by Gerrit Updater [ 02/May/18 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/32244
Subject: LU-10949 debug: assert object refcount > 1 in lock cancel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7ec5d9141c1eed09397b2e501494e6213e2a56a4

Comment by Lai Siyao [ 02/May/18 ]

Hi Cliff, I just uploaded a debug patch, would you test with that build?

Comment by Cliff White (Inactive) [ 10/May/18 ]

Tested for 24 hours with the debug patch, the error is not repeating. Any idea what is going on?

Comment by Lai Siyao [ 11/May/18 ]

I'm sure the debug patch doesn't change any logic, so I'll push to land the debug patch, so in the future it can help debug.

Comment by Gerrit Updater [ 05/Feb/19 ]

Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/34181
Subject: LU-10949 mdt: lost reference on mdt_md_root
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1cf41c8ad2ec38b0697b9cc029c016ec49607bfc

Comment by Gerrit Updater [ 21/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34181/
Subject: LU-10949 mdt: lost reference on mdt_md_root
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b2368774a01eb89981e2ceb92be9673e4b403d62

Comment by Peter Jones [ 21/Mar/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 18/Sep/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39976
Subject: LU-10949 mdt: lost reference on mdt_md_root
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 46bd7408f6e0f06cd9af2848660800cd90e5c945

Comment by Gerrit Updater [ 22/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39976/
Subject: LU-10949 mdt: lost reference on mdt_md_root
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 2040144a54fbd0da81f3e1fdacb6b35f473689e6

Generated at Sat Feb 10 02:39:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.