[LU-9806] tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed Created: 29/Jul/17 Updated: 19/Jul/23 Resolved: 19/Jul/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This seems to be a return of
[291606.098200] Lustre: DEBUG MARKER: == replay-ost-single test 7: Fail OST before obd_destroy ============================================= 23:53:41 (1501300421)
[291616.783248] Lustre: DEBUG MARKER: before: 623720 after_dd: 618600 took 1 seconds
[291617.134646] LustreError: 28072:0:(osd_handler.c:2184:osd_ro()) *** setting lustre-OST0000 read-only ***
[291617.152901] Turning device loop1 (0x700001) read-only
[291617.224927] Lustre: DEBUG MARKER: ost1 REPLAY BARRIER on lustre-OST0000
[291617.277436] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-OST0000
[291617.590847] Lustre: Failing over lustre-OST0000
[291617.601802] LustreError: 22375:0:(tgt_lastrcvd.c:440:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed:
[291617.602975] LustreError: 22375:0:(tgt_lastrcvd.c:440:tgt_client_free()) LBUG
[291617.603578] Pid: 22375, comm: obd_zombid
[291617.604096] Call Trace:
[291617.606669] [<ffffffffa02857ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[291617.607349] [<ffffffffa028585c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[291617.608122] [<ffffffffa05ddde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
[291617.608814] [<ffffffffa0db5b12>] ofd_destroy_export+0x62/0x180 [ofd]
[291617.609551] [<ffffffffa0389239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
[291617.622563] [<ffffffffa038967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[291617.628967] [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[291617.629676] [<ffffffffa0389610>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[291617.631230] [<ffffffff810a2eba>] kthread+0xea/0xf0
[291617.631906] [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[291617.632572] [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[291617.633236] [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[291617.639601]
[291617.640036] Kernel panic - not syncing: LBUG
[291617.640462] CPU: 4 PID: 22375 Comm: obd_zombid Tainted: P OE ------------ 3.10.0-debug #2
[291617.641354] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[291617.641830] ffffffffa02a4ed2 0000000025d32961 ffff8800a16b3cc0 ffffffff816fd3e4
[291617.642712] ffff8800a16b3d40 ffffffff816f8c34 ffffffff00000008 ffff8800a16b3d50
[291617.644582] ffff8800a16b3cf0 0000000025d32961 0000000025d32961 ffff88033e48d948
[291617.645811] Call Trace:
[291617.646408] [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[291617.647142] [<ffffffff816f8c34>] panic+0xd8/0x1e7
[291617.647765] [<ffffffffa0285874>] lbug_with_loc+0x64/0xb0 [libcfs]
[291617.648540] [<ffffffffa05ddde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
[291617.649224] [<ffffffffa0db5b12>] ofd_destroy_export+0x62/0x180 [ofd]
[291617.649911] [<ffffffffa0389239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
[291617.651165] [<ffffffffa038967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[291617.652377] [<ffffffff810b7cc0>] ? wake_up_state+0x20/0x20
[291617.653065] [<ffffffffa0389610>] ? obd_zombie_impexp_cull+0x920/0x920 [obdclass]
[291617.654285] [<ffffffff810a2eba>] kthread+0xea/0xf0
[291617.654920] [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[291617.655610] [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[291617.656262] [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
Crashdump on onyx-68 in /exports/crashdumps/192.168.123.181-2017-07-28-23:53:59 |
| Comments |
| Comment by Oleg Drokin [ 30/Jul/17 ] |
|
Just had another one:
[11716.272157] Lustre: DEBUG MARKER: == recovery-small test 29b: error adding new clients doesn't cause LBUG (bug 22273) ================== 23:21:29 (1501384889)
[11716.438161] Lustre: Failing over lustre-OST0000
[11716.527043] LustreError: 9005:0:(tgt_lastrcvd.c:440:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed:
[11716.528524] LustreError: 9005:0:(tgt_lastrcvd.c:440:tgt_client_free()) LBUG
[11716.529497] Pid: 9005, comm: obd_zombid
[11716.530209] Call Trace:
[11716.532127] [<ffffffffa02c57ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[11716.534315] [<ffffffffa02c585c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[11716.535401] [<ffffffffa061dde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
[11716.536214] [<ffffffffa1412b12>] ofd_destroy_export+0x62/0x180 [ofd]
[11716.537110] [<ffffffffa03c9239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
[11716.551808] [<ffffffffa03c967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[11716.553655] [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[11716.554770] [<ffffffffa03c9610>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[11716.556332] [<ffffffff810a2eba>] kthread+0xea/0xf0
[11716.557232] [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11716.558361] [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[11716.564715] [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11716.567093]
[11716.568045] Kernel panic - not syncing: LBUG
[11716.568703] CPU: 4 PID: 9005 Comm: obd_zombid Tainted: P OE ------------ 3.10.0-debug #2
[11716.570244] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[11716.570937] ffffffffa02e4ed2 00000000809dbba9 ffff8800b7697cc0 ffffffff816fd3e4
[11716.572370] ffff8800b7697d40 ffffffff816f8c34 ffffffff00000008 ffff8800b7697d50
[11716.573539] ffff8800b7697cf0 00000000809dbba9 00000000809dbba9 ffff88033e48d948
[11716.574459] Call Trace:
[11716.574892] [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[11716.575383] [<ffffffff816f8c34>] panic+0xd8/0x1e7
[11716.575862] [<ffffffffa02c5874>] lbug_with_loc+0x64/0xb0 [libcfs]
[11716.576514] [<ffffffffa061dde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
[11716.577065] [<ffffffffa1412b12>] ofd_destroy_export+0x62/0x180 [ofd]
[11716.577577] [<ffffffffa03c9239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
[11716.578500] [<ffffffffa03c967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[11716.579429] [<ffffffff810b7cc0>] ? wake_up_state+0x20/0x20
[11716.579917] [<ffffffffa03c9610>] ? obd_zombie_impexp_cull+0x920/0x920 [obdclass]
[11716.580827] [<ffffffff810a2eba>] kthread+0xea/0xf0
[11716.581306] [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[11716.581801] [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[11716.582316] [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
The crashdump is in 192.168.123.146-2017-07-29-23:21:* on onyx-68 |
| Comment by Oleg Drokin [ 25/Feb/19 ] |
|
This still seems to trigger regularly in my testing. |
| Comment by Alex Zhuravlev [ 17/Nov/19 ] |
Lustre: DEBUG MARKER: == recovery-small test 60: Add Changelog entries during MDS failover ================================= 04:12:39 (1573945959)
Lustre: lustre-MDD0000: changelog on
Lustre: lustre-MDT0001: haven't heard from client 128ea591-f299-4 (at 192.168.122.22@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp 000000007725ad20, cur 1573945996 expire 1573945966 last 1573945948
Lustre: lustre-OST0000: haven't heard from client 128ea591-f299-4 (at 192.168.122.22@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp 00000000a202a5e3, cur 1573945996 expire 1573945966 last 1573945948
LustreError: 19:0:(tgt_lastrcvd.c:451:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed:
LustreError: 19:0:(tgt_lastrcvd.c:451:tgt_client_free()) LBUG
...
Call Trace:
 ? __schedule+0x2ad/0xb00
 schedule+0x34/0x80
 lbug_with_loc+0x79/0x80 [libcfs]
 ? tgt_client_free+0x2b0/0x330 [ptlrpc]
 ? mdt_destroy_export+0x87/0x2a0 [mdt]
 ? class_export_destroy+0xe9/0x460 [obdclass]
 ? process_one_work+0x249/0x5d0
 ? worker_thread+0x48/0x3d0
 ? kthread+0x100/0x140
umount D 0 24858 24857 0x00000000
Call Trace:
 ? __schedule+0x2ad/0xb00
 schedule+0x34/0x80
 schedule_timeout+0x323/0x500
 ? wait_for_common+0x3b/0x160
 wait_for_common+0xc9/0x160
 ? wake_up_q+0x60/0x60
 flush_workqueue+0x143/0x4a0
 ? obd_exports_barrier+0x43/0x1a0 [obdclass]
 ? obd_exports_barrier+0x76/0x1a0 [obdclass]
 mgs_device_fini+0xdb/0x5c0 [mgs]
 class_cleanup+0x689/0xb50 [obdclass]
 class_process_config+0x153e/0x30f0 [obdclass]
 ? cache_alloc_debugcheck_after+0x138/0x150
 class_manual_cleanup+0x197/0x670 [obdclass]
 server_put_super+0x1525/0x1d50 [obdclass]
 ? evict_inodes+0x138/0x180
 generic_shutdown_super+0x5f/0xf0
Looks like the MDT umount didn't wait for all exports to be gone? |
| Comment by Alex Zhuravlev [ 07/Dec/20 ] |
|
There is no serialization between export destroy and obd destroy:
00000020:00000080:0.0:1607303759.080403:0:10539:0:(genops.c:984:class_export_put()) final put 0000000048c8f7e8/7bdf7e52-e46c-4201-82b5-5380be291135
00000020:00000001:1.0:1607303759.082137:0:11815:0:(tgt_main.c:570:tgt_fini()) Process entered
00000020:00000001:1.0:1607303759.082148:0:11815:0:(tgt_main.c:610:tgt_fini()) Process leaving
00000020:00000080:1.0:1607303759.082811:0:8175:0:(genops.c:943:class_export_destroy()) destroying export 0000000048c8f7e8/7bdf7e52-e46c-4201-82b5-5380be291135 for lustre-OST0000
00000001:00040000:1.0:1607303759.082843:0:8175:0:(tgt_lastrcvd.c:451:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed:
IMHO, the check for a freed OBD is very naive:
	/* Target may have been freed (see LU-7430)
	 * Slot may be not yet assigned */
	if (exp->exp_obd->u.obt.obt_magic != OBT_MAGIC ||
	    ted->ted_lr_idx < 0)
		return; |
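In the debug log above, tgt_fini() enters and leaves between the final class_export_put() and the deferred class_export_destroy(), so tgt_client_free() appears to run after the client bitmap is already gone. The following is a minimal user-space sketch of one way such teardown could be serialized: the target counts its live exports and postpones freeing the bitmap until the last export has been destroyed. It is only an illustration of the idea and is not the change that landed in https://review.whamcloud.com/c/fs/lustre-release/+/50147. Every name prefixed with sketch_ is hypothetical and does not exist in Lustre, and the model is single-threaded, so real kernel code would additionally need locking or a wait primitive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; none of these names exist in Lustre. */
struct sketch_target {
	uint64_t *client_bitmap;  /* plays the role of lut_client_bitmap */
	int       live_exports;   /* exports that have not been destroyed yet */
	bool      fini_requested; /* teardown was asked for while exports remained */
};

struct sketch_export {
	struct sketch_target *tgt;
	int                   lr_idx; /* plays the role of ted_lr_idx; -1 = no slot */
};

/* Allocate an empty 64-slot bitmap; returns 0 on success, -1 on failure. */
static int sketch_target_init(struct sketch_target *t)
{
	t->client_bitmap = calloc(1, sizeof(*t->client_bitmap));
	t->live_exports = 0;
	t->fini_requested = false;
	return t->client_bitmap ? 0 : -1;
}

static void sketch_bitmap_release(struct sketch_target *t)
{
	free(t->client_bitmap);
	t->client_bitmap = NULL;
}

/* Register an export in slot idx (0..63) and count it as live. */
static void sketch_export_add(struct sketch_target *t,
			      struct sketch_export *e, int idx)
{
	assert(idx >= 0 && idx < 64);
	e->tgt = t;
	e->lr_idx = idx;
	*t->client_bitmap |= UINT64_C(1) << idx;
	t->live_exports++;
}

/* Returns true if the bitmap was freed immediately, false if deferred. */
static bool sketch_target_fini(struct sketch_target *t)
{
	if (t->live_exports > 0) {
		t->fini_requested = true; /* last export will do the free */
		return false;
	}
	sketch_bitmap_release(t);
	return true;
}

/* Clear the export's slot; the last export completes a deferred fini. */
static void sketch_export_destroy(struct sketch_export *e)
{
	struct sketch_target *t = e->tgt;

	/* Holds by construction: fini never frees the bitmap while we are live. */
	assert(t && t->client_bitmap);
	if (e->lr_idx >= 0)
		*t->client_bitmap &= ~(UINT64_C(1) << e->lr_idx);
	e->lr_idx = -1;

	if (--t->live_exports == 0 && t->fini_requested)
		sketch_bitmap_release(t);
}
```

Calling sketch_target_fini() while an export is still registered returns false and leaves the bitmap allocated; the later sketch_export_destroy() clears its slot and then releases the bitmap. The point of the design is that the ordering is enforced by the counter rather than inferred from a magic value such as obt_magic, which can only detect a freed target after the fact.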
| Comment by Gerrit Updater [ 27/Feb/23 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50147 |
| Comment by Gerrit Updater [ 19/Jul/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50147/ |
| Comment by Peter Jones [ 19/Jul/23 ] |
|
Landed for 2.16 |