[LU-12485] (osd_handler.c:2146:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) faile Created: 28/Jun/19  Updated: 24/Sep/20  Resolved: 15/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lustre-master tag-2.12.54


Issue Links:
Duplicate
is duplicated by LU-11578 ldiskfs_map_blocks: comm mdt00_100: l... Resolved
Related
is related to LU-12360 Can't restart filesystem (2.12) even ... Reopened
is related to LU-13980 Kernel panic on OST after removing fi... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The MDS hit an LBUG the first time routers were included in the configuration.
To see whether the problem could be reproduced, I cleaned the update log on the MDS and tried again; soak then ran normally.

[365459.643165] Lustre: 58072:0:(ldlm_lib.c:1777:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1
[365459.657843] Lustre: 58072:0:(ldlm_lib.c:1777:extend_recovery_timer()) Skipped 2 previous similar messages
[365485.628240] LNet: 57059:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 7 seconds
[365513.040207] Lustre: 58072:0:(ldlm_lib.c:1777:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1
[365513.054890] Lustre: 58072:0:(ldlm_lib.c:1777:extend_recovery_timer()) Skipped 2 previous similar messages
[365554.587216] Lustre: MGS: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib)
[365555.266913] Lustre: 58072:0:(ldlm_lib.c:1777:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1
[365555.281597] Lustre: 58072:0:(ldlm_lib.c:1777:extend_recovery_timer()) Skipped 3 previous similar messages
[365579.714847] Lustre: 58072:0:(ldlm_lib.c:1777:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1
[365579.729526] Lustre: 58072:0:(ldlm_lib.c:1777:extend_recovery_timer()) Skipped 3 previous similar messages
[365586.653587] Lustre: soaked-MDT0000: recovery is timed out, evict stale exports
[365586.664897] Lustre: soaked-MDT0000: Recovery over after 6:14, of 3 clients 3 recovered and 0 were evicted.
[365587.461671] Lustre: soaked-MDT0000: Connection restored to 192.168.1.107@o2ib (at 192.168.1.107@o2ib)
[365587.472122] Lustre: Skipped 20 previous similar messages
[366665.834308] Lustre: MGS: Connection restored to 3e1153ef-cd3c-4 (at 172.16.1.36@o2ib1)
[366665.843291] Lustre: Skipped 8 previous similar messages
[366714.243215] Lustre: 57971:0:(mdd_device.c:1811:mdd_changelog_clear()) soaked-MDD0000: No entry for user 1
[366796.507210] Lustre: MGS: Connection restored to a82300e0-23e7-4 (at 172.16.1.40@o2ib1)
[366796.516214] Lustre: Skipped 7 previous similar messages
[366812.601393] Lustre: MGS: Connection restored to f35a7a93-e477-4 (at 172.16.1.23@o2ib1)
[366812.610382] Lustre: Skipped 13 previous similar messages
[366894.739493] Lustre: MGS: Connection restored to 205e25f3-b54e-4 (at 172.16.1.17@o2ib1)
[366894.748458] Lustre: Skipped 19 previous similar messages
[367624.449318] LustreError: 58673:0:(osd_handler.c:2146:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) failed:
[367624.465857] LustreError: 58673:0:(osd_handler.c:2146:osd_object_release()) LBUG
[367624.474135] Pid: 58673, comm: mdt_out01_002 3.10.0-957.12.2.el7_lustre.x86_64 #1 SMP Wed Jun 5 07:00:13 UTC 2019
[367624.485598] Call Trace:
[367624.488439]  [<ffffffffc0a017cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[367624.495875]  [<ffffffffc0a0187c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[367624.502903]  [<ffffffffc132268c>] osd_object_release+0x7c/0x80 [osd_ldiskfs]
[367624.510909]  [<ffffffffc0bf7430>] lu_object_put+0x190/0x3d0 [obdclass]
[367624.518362]  [<ffffffffc0f400ec>] out_tx_end+0x1ec/0x5c0 [ptlrpc]
[367624.525380]  [<ffffffffc0f442b2>] out_handle+0x1452/0x1bc0 [ptlrpc]
[367624.532547]  [<ffffffffc0f3a6da>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[367624.540382]  [<ffffffffc0ede7ee>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[367624.549104]  [<ffffffffc0ee22dc>] ptlrpc_main+0xbac/0x1560 [ptlrpc]
[367624.556257]  [<ffffffff818c1d21>] kthread+0xd1/0xe0
[367624.561844]  [<ffffffff81f75c37>] ret_from_fork_nospec_end+0x0/0x39
[367624.568961]  [<ffffffffffffffff>] 0xffffffffffffffff
[367624.574638] Kernel panic - not syncing: LBUG
[367624.579502] CPU: 25 PID: 58673 Comm: mdt_out01_002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.12.2.el7_lustre.x86_64 #1
[367624.593766] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[367624.606390] Call Trace:
[367624.609223]  [<ffffffff81f63041>] dump_stack+0x19/0x1b
[367624.615061]  [<ffffffff81f5c750>] panic+0xe8/0x21f
[367624.620514]  [<ffffffffc0a018cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[367624.627513]  [<ffffffffc132268c>] osd_object_release+0x7c/0x80 [osd_ldiskfs]
[367624.635501]  [<ffffffffc0bf7430>] lu_object_put+0x190/0x3d0 [obdclass]
[367624.642919]  [<ffffffffc0f400ec>] out_tx_end+0x1ec/0x5c0 [ptlrpc]
[367624.649853]  [<ffffffffc0f442b2>] out_handle+0x1452/0x1bc0 [ptlrpc]
[367624.656967]  [<ffffffffc0e8a650>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[367624.665156]  [<ffffffffc0f3a6da>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[367624.672955]  [<ffffffffc0f143e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[367624.681496]  [<ffffffffc0a01bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[367624.689487]  [<ffffffffc0ede7ee>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[367624.698137]  [<ffffffff818ced54>] ? __wake_up+0x44/0x50
[367624.704092]  [<ffffffffc0ee22dc>] ptlrpc_main+0xbac/0x1560 [ptlrpc]
[367624.711209]  [<ffffffffc0ee1730>] ? ptlrpc_register_service+0xfa0/0xfa0 [ptlrpc]
[367624.719560]  [<ffffffff818c1d21>] kthread+0xd1/0xe0
[367624.725099]  [<ffffffff818c1c50>] ? insert_kthread_work+0x40/0x40
[367624.731997]  [<ffffffff81f75c37>] ret_from_fork_nospec_begin+0x21/0x21
[367624.739378]  [<ffffffff818c1c50>] ? insert_kthread_work+0x40/0x40
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-957.12.2.el7_lustr


 Comments   
Comment by Patrick Farrell (Inactive) [ 28/Jun/19 ]

laisiyao this is the issue you submitted a patch for against LU-12360, but since LU-12360 covers the larger abort_recovery issue, let's use this one for the crash.

Comment by Lai Siyao [ 29/Jun/19 ]

Hi Patrick, okay, but one thing to note: although both assert at the same place in osd_object_release(), LU-12360 hit it on an OST, while this one is on an MDT. MDT code may change a directory's i_nlink in several places, but OST code only calls drop_nlink() on object destroy, so I suspect they may not be the same issue.
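
For readers following the assertion text, the condition it forbids can be restated as a small standalone predicate. This is only a sketch and not the Lustre source: the struct names inode_sketch and osd_object_sketch and the helper release_would_assert() are hypothetical stand-ins that model just the three fields named in the ASSERTION line (oo_destroyed, oo_inode, i_nlink).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. Only the
 * fields that appear in the assertion text are modelled here. */
struct inode_sketch {
	unsigned int i_nlink;          /* link count of the backing inode */
};

struct osd_object_sketch {
	int oo_destroyed;              /* non-zero once the object was destroyed */
	struct inode_sketch *oo_inode; /* may be NULL if no inode is attached */
};

/* Returns true for the state the assertion rejects: the object is not
 * marked destroyed, it still has an inode, and that inode's link count
 * has already dropped to zero. */
static bool release_would_assert(const struct osd_object_sketch *o)
{
	return o->oo_destroyed == 0 &&
	       o->oo_inode != NULL &&
	       o->oo_inode->i_nlink == 0;
}
```

Under this model, an object with i_nlink == 0 passes the check only if it was also marked destroyed, or if it has no inode at all. A code path that lowers the link count to zero without marking the object destroyed would leave it in the rejected state by the time it is released.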

Comment by Patrick Farrell (Inactive) [ 29/Jun/19 ]

Ah, oops - I did not realize that, sorry.

It sounds like we need a third LU, then?  adilger ?

Comment by Andreas Dilger [ 29/Jun/19 ]

I guess...

Comment by Lai Siyao [ 17/Jul/19 ]

https://review.whamcloud.com/#/c/35360/

Mmm, this should be the same issue as LU-12360.

Comment by Gerrit Updater [ 15/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35360/
Subject: LU-12485 obdclass: 0-nlink race in lu_object_find_at()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2ff420913b9718ee8d80ae51fddc6e5df4a3148a

Comment by Peter Jones [ 15/Aug/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 19/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35834
Subject: LU-12485 obdclass: 0-nlink race in lu_object_find_at()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4820f89a4bfffcedcbe9f562f43a23e2fc1d0f4a

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35834/
Subject: LU-12485 obdclass: 0-nlink race in lu_object_find_at()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c4a91e08b1e1452e950037c135dfe9f6cf7a7c30

Comment by Gerrit Updater [ 08/Jul/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39308
Subject: LU-12485 obdclass: 0-nlink race in lu_object_find_at()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 8a122663ebe98fccc2e39795dcaf0f7addaa2052

Generated at Sat Feb 10 02:53:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.