[LU-9806] tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed Created: 29/Jul/17  Updated: 19/Jul/23  Resolved: 19/Jul/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-11232 replay-ost-single test_0b: BUG: unabl... Open
Related
Severity: 3

 Description   

This seems to be a return of LU-7430 and a few other similar bugs, but happening on current master.

[291606.098200] Lustre: DEBUG MARKER: == replay-ost-single test 7: Fail OST before obd_destroy ============================================= 23:53:41 (1501300421)
[291616.783248] Lustre: DEBUG MARKER: before: 623720 after_dd: 618600 took 1 seconds
[291617.134646] LustreError: 28072:0:(osd_handler.c:2184:osd_ro()) *** setting lustre-OST0000 read-only ***
[291617.152901] Turning device loop1 (0x700001) read-only
[291617.224927] Lustre: DEBUG MARKER: ost1 REPLAY BARRIER on lustre-OST0000
[291617.277436] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-OST0000
[291617.590847] Lustre: Failing over lustre-OST0000
[291617.601802] LustreError: 22375:0:(tgt_lastrcvd.c:440:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed: 
[291617.602975] LustreError: 22375:0:(tgt_lastrcvd.c:440:tgt_client_free()) LBUG
[291617.603578] Pid: 22375, comm: obd_zombid
[291617.604096] 
Call Trace:
[291617.606669]  [<ffffffffa02857ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[291617.607349]  [<ffffffffa028585c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[291617.608122]  [<ffffffffa05ddde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
[291617.608814]  [<ffffffffa0db5b12>] ofd_destroy_export+0x62/0x180 [ofd]
[291617.609551]  [<ffffffffa0389239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
[291617.622563]  [<ffffffffa038967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[291617.628967]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[291617.629676]  [<ffffffffa0389610>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[291617.631230]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[291617.631906]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[291617.632572]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[291617.633236]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[291617.639601] 
[291617.640036] Kernel panic - not syncing: LBUG
[291617.640462] CPU: 4 PID: 22375 Comm: obd_zombid Tainted: P           OE  ------------   3.10.0-debug #2
[291617.641354] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[291617.641830]  ffffffffa02a4ed2 0000000025d32961 ffff8800a16b3cc0 ffffffff816fd3e4
[291617.642712]  ffff8800a16b3d40 ffffffff816f8c34 ffffffff00000008 ffff8800a16b3d50
[291617.644582]  ffff8800a16b3cf0 0000000025d32961 0000000025d32961 ffff88033e48d948
[291617.645811] Call Trace:
[291617.646408]  [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[291617.647142]  [<ffffffff816f8c34>] panic+0xd8/0x1e7
[291617.647765]  [<ffffffffa0285874>] lbug_with_loc+0x64/0xb0 [libcfs]
[291617.648540]  [<ffffffffa05ddde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
[291617.649224]  [<ffffffffa0db5b12>] ofd_destroy_export+0x62/0x180 [ofd]
[291617.649911]  [<ffffffffa0389239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
[291617.651165]  [<ffffffffa038967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[291617.652377]  [<ffffffff810b7cc0>] ? wake_up_state+0x20/0x20
[291617.653065]  [<ffffffffa0389610>] ? obd_zombie_impexp_cull+0x920/0x920 [obdclass]
[291617.654285]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[291617.654920]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[291617.655610]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[291617.656262]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140

Crashdump on onyx-68 in /exports/crashdumps/192.168.123.181-2017-07-28-23:53:59
Modules also there.



 Comments   
Comment by Oleg Drokin [ 30/Jul/17 ]

Just had another one

[11716.272157] Lustre: DEBUG MARKER: == recovery-small test 29b: error adding new clients doesn't cause LBUG (bug 22273) ================== 23:21:29 (1501384889)
[11716.438161] Lustre: Failing over lustre-OST0000
[11716.527043] LustreError: 9005:0:(tgt_lastrcvd.c:440:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed: 
[11716.528524] LustreError: 9005:0:(tgt_lastrcvd.c:440:tgt_client_free()) LBUG
[11716.529497] Pid: 9005, comm: obd_zombid
[11716.530209] 
Call Trace:
[11716.532127]  [<ffffffffa02c57ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[11716.534315]  [<ffffffffa02c585c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[11716.535401]  [<ffffffffa061dde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
[11716.536214]  [<ffffffffa1412b12>] ofd_destroy_export+0x62/0x180 [ofd]
[11716.537110]  [<ffffffffa03c9239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
[11716.551808]  [<ffffffffa03c967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[11716.553655]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[11716.554770]  [<ffffffffa03c9610>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[11716.556332]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[11716.557232]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11716.558361]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[11716.564715]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11716.567093] 
[11716.568045] Kernel panic - not syncing: LBUG
[11716.568703] CPU: 4 PID: 9005 Comm: obd_zombid Tainted: P           OE  ------------   3.10.0-debug #2
[11716.570244] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[11716.570937]  ffffffffa02e4ed2 00000000809dbba9 ffff8800b7697cc0 ffffffff816fd3e4
[11716.572370]  ffff8800b7697d40 ffffffff816f8c34 ffffffff00000008 ffff8800b7697d50
[11716.573539]  ffff8800b7697cf0 00000000809dbba9 00000000809dbba9 ffff88033e48d948
[11716.574459] Call Trace:
[11716.574892]  [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[11716.575383]  [<ffffffff816f8c34>] panic+0xd8/0x1e7
[11716.575862]  [<ffffffffa02c5874>] lbug_with_loc+0x64/0xb0 [libcfs]
[11716.576514]  [<ffffffffa061dde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
[11716.577065]  [<ffffffffa1412b12>] ofd_destroy_export+0x62/0x180 [ofd]
[11716.577577]  [<ffffffffa03c9239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
[11716.578500]  [<ffffffffa03c967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[11716.579429]  [<ffffffff810b7cc0>] ? wake_up_state+0x20/0x20
[11716.579917]  [<ffffffffa03c9610>] ? obd_zombie_impexp_cull+0x920/0x920 [obdclass]
[11716.580827]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[11716.581306]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[11716.581801]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[11716.582316]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140

crashdump is in 192.168.123.146-2017-07-29-23:21:* on onyx-68

Comment by Oleg Drokin [ 25/Feb/19 ]

this still seems to be regularly triggering in my testing

Comment by Alex Zhuravlev [ 17/Nov/19 ]

Lustre: DEBUG MARKER: == recovery-small test 60: Add Changelog entries during MDS failover ================================= 04:12:39 (1573945959)
Lustre: lustre-MDD0000: changelog on
Lustre: lustre-MDT0001: haven't heard from client 128ea591-f299-4 (at 192.168.122.22@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp 000000007725ad20, cur 1573945996 expire 1573945966 last 1573945948
Lustre: lustre-OST0000: haven't heard from client 128ea591-f299-4 (at 192.168.122.22@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp 00000000a202a5e3, cur 1573945996 expire 1573945966 last 1573945948
LustreError: 19:0:(tgt_lastrcvd.c:451:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed: 
LustreError: 19:0:(tgt_lastrcvd.c:451:tgt_client_free()) LBUG
...
Call Trace:
 ? __schedule+0x2ad/0xb00
 schedule+0x34/0x80
 lbug_with_loc+0x79/0x80 [libcfs]
 ? tgt_client_free+0x2b0/0x330 [ptlrpc]
 ? mdt_destroy_export+0x87/0x2a0 [mdt]
 ? class_export_destroy+0xe9/0x460 [obdclass]
 ? process_one_work+0x249/0x5d0
 ? worker_thread+0x48/0x3d0
 ? kthread+0x100/0x140

umount          D    0 24858  24857 0x00000000
Call Trace:
 ? __schedule+0x2ad/0xb00
 schedule+0x34/0x80
 schedule_timeout+0x323/0x500
 ? wait_for_common+0x3b/0x160
 wait_for_common+0xc9/0x160
 ? wake_up_q+0x60/0x60
 flush_workqueue+0x143/0x4a0
 ? obd_exports_barrier+0x43/0x1a0 [obdclass]
 ? obd_exports_barrier+0x76/0x1a0 [obdclass]
 mgs_device_fini+0xdb/0x5c0 [mgs]
 class_cleanup+0x689/0xb50 [obdclass]
 class_process_config+0x153e/0x30f0 [obdclass]
 ? cache_alloc_debugcheck_after+0x138/0x150
 class_manual_cleanup+0x197/0x670 [obdclass]
 server_put_super+0x1525/0x1d50 [obdclass]
 ? evict_inodes+0x138/0x180
 generic_shutdown_super+0x5f/0xf0

looks like MDT umount didn't wait for all exports to be gone?

Comment by Alex Zhuravlev [ 07/Dec/20 ]

there is no serialization between export destroy and obd destroy:

00000020:00000080:0.0:1607303759.080403:0:10539:0:(genops.c:984:class_export_put()) final put 0000000048c8f7e8/7bdf7e52-e46c-4201-82b5-5380be291135
00000020:00000001:1.0:1607303759.082137:0:11815:0:(tgt_main.c:570:tgt_fini()) Process entered
00000020:00000001:1.0:1607303759.082148:0:11815:0:(tgt_main.c:610:tgt_fini()) Process leaving
00000020:00000080:1.0:1607303759.082811:0:8175:0:(genops.c:943:class_export_destroy()) destroying export 0000000048c8f7e8/7bdf7e52-e46c-4201-82b5-5380be291135 for lustre-OST0000
00000001:00040000:1.0:1607303759.082843:0:8175:0:(tgt_lastrcvd.c:451:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed: 

IMHO, the check for freed OBD is very naive:

	/* Target may have been freed (see LU-7430)
	 * Slot may be not yet assigned */
	if (exp->exp_obd->u.obt.obt_magic != OBT_MAGIC ||
	    ted->ted_lr_idx < 0)
		return;
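
The trace above shows tgt_fini() completing before class_export_destroy() runs for the same export, so the export destroy path finds the bitmap already gone. A minimal user-space sketch of the kind of serialization the later patch subject ("wait for all exports to go") suggests is given below. It is an analogy built on pthreads, not Lustre code: every name in it (demo_target, demo_export_destroy, demo_target_fini and so on) is hypothetical, and the actual patch contents are not shown in this ticket.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a target that owns a per-client bitmap. */
struct demo_target {
	pthread_mutex_t lock;
	pthread_cond_t  no_exports;    /* signalled when exports drops to 0 */
	int             exports;       /* exports not yet destroyed */
	unsigned long  *client_bitmap; /* one bit per client slot */
};

static void demo_target_init(struct demo_target *t)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->no_exports, NULL);
	t->exports = 0;
	t->client_bitmap = calloc(1, sizeof(*t->client_bitmap));
}

/* Register an export and mark its slot as used. */
static void demo_export_add(struct demo_target *t, int slot)
{
	pthread_mutex_lock(&t->lock);
	t->exports++;
	*t->client_bitmap |= 1UL << slot;
	pthread_mutex_unlock(&t->lock);
}

/* Analogue of the export destroy path: it needs the bitmap to exist. */
static void demo_export_destroy(struct demo_target *t, int slot)
{
	pthread_mutex_lock(&t->lock);
	assert(t->client_bitmap != NULL); /* the condition that failed in this ticket */
	*t->client_bitmap &= ~(1UL << slot);
	if (--t->exports == 0)
		pthread_cond_broadcast(&t->no_exports);
	pthread_mutex_unlock(&t->lock);
}

/* Analogue of target teardown: wait for every export, then free. */
static void demo_target_fini(struct demo_target *t)
{
	pthread_mutex_lock(&t->lock);
	while (t->exports > 0)
		pthread_cond_wait(&t->no_exports, &t->lock);
	free(t->client_bitmap);
	t->client_bitmap = NULL;
	pthread_mutex_unlock(&t->lock);
}

/* Plays the role of the zombie/worker thread destroying the last export. */
static void *demo_zombie_thread(void *arg)
{
	demo_export_destroy(arg, 0);
	return NULL;
}

/* Run teardown concurrently with export destroy; returns exports left. */
static int demo_run(void)
{
	struct demo_target t;
	pthread_t zombie;

	demo_target_init(&t);
	demo_export_add(&t, 0);
	pthread_create(&zombie, NULL, demo_zombie_thread, &t);
	demo_target_fini(&t); /* blocks until the export is destroyed */
	pthread_join(zombie, NULL);
	return t.exports;
}
```

Whichever thread runs first, demo_target_fini() cannot free the bitmap while an export is still outstanding, so the assertion in demo_export_destroy() holds. A check of a magic field without such a wait, as in the snippet quoted above, can still pass just before the target is torn down by another thread.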

Comment by Gerrit Updater [ 27/Feb/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50147
Subject: LU-9806 obdclass: wait for all exports to go
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8895829088251d37576a01d959689d4d9e9204a7

Comment by Gerrit Updater [ 19/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50147/
Subject: LU-9806 obdclass: wait for all exports to go
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 08f9ebe93b300c39d2af1fb8e82a22e9c84f401b

Comment by Peter Jones [ 19/Jul/23 ]

Landed for 2.16
