[LU-9806] tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This looks like a recurrence of LU-7430 and a few other similar bugs, but now happening on current master.

      [291606.098200] Lustre: DEBUG MARKER: == replay-ost-single test 7: Fail OST before obd_destroy ============================================= 23:53:41 (1501300421)
      [291616.783248] Lustre: DEBUG MARKER: before: 623720 after_dd: 618600 took 1 seconds
      [291617.134646] LustreError: 28072:0:(osd_handler.c:2184:osd_ro()) *** setting lustre-OST0000 read-only ***
      [291617.152901] Turning device loop1 (0x700001) read-only
      [291617.224927] Lustre: DEBUG MARKER: ost1 REPLAY BARRIER on lustre-OST0000
      [291617.277436] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-OST0000
      [291617.590847] Lustre: Failing over lustre-OST0000
      [291617.601802] LustreError: 22375:0:(tgt_lastrcvd.c:440:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed: 
      [291617.602975] LustreError: 22375:0:(tgt_lastrcvd.c:440:tgt_client_free()) LBUG
      [291617.603578] Pid: 22375, comm: obd_zombid
      [291617.604096] 
      Call Trace:
      [291617.606669]  [<ffffffffa02857ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [291617.607349]  [<ffffffffa028585c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [291617.608122]  [<ffffffffa05ddde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
      [291617.608814]  [<ffffffffa0db5b12>] ofd_destroy_export+0x62/0x180 [ofd]
      [291617.609551]  [<ffffffffa0389239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
      [291617.622563]  [<ffffffffa038967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
      [291617.628967]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
      [291617.629676]  [<ffffffffa0389610>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      [291617.631230]  [<ffffffff810a2eba>] kthread+0xea/0xf0
      [291617.631906]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
      [291617.632572]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
      [291617.633236]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
      [291617.639601] 
      [291617.640036] Kernel panic - not syncing: LBUG
      [291617.640462] CPU: 4 PID: 22375 Comm: obd_zombid Tainted: P           OE  ------------   3.10.0-debug #2
      [291617.641354] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [291617.641830]  ffffffffa02a4ed2 0000000025d32961 ffff8800a16b3cc0 ffffffff816fd3e4
      [291617.642712]  ffff8800a16b3d40 ffffffff816f8c34 ffffffff00000008 ffff8800a16b3d50
      [291617.644582]  ffff8800a16b3cf0 0000000025d32961 0000000025d32961 ffff88033e48d948
      [291617.645811] Call Trace:
      [291617.646408]  [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
      [291617.647142]  [<ffffffff816f8c34>] panic+0xd8/0x1e7
      [291617.647765]  [<ffffffffa0285874>] lbug_with_loc+0x64/0xb0 [libcfs]
      [291617.648540]  [<ffffffffa05ddde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
      [291617.649224]  [<ffffffffa0db5b12>] ofd_destroy_export+0x62/0x180 [ofd]
      [291617.649911]  [<ffffffffa0389239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
      [291617.651165]  [<ffffffffa038967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
      [291617.652377]  [<ffffffff810b7cc0>] ? wake_up_state+0x20/0x20
      [291617.653065]  [<ffffffffa0389610>] ? obd_zombie_impexp_cull+0x920/0x920 [obdclass]
      [291617.654285]  [<ffffffff810a2eba>] kthread+0xea/0xf0
      [291617.654920]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
      [291617.655610]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
      [291617.656262]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
      

      Crashdump is on onyx-68 in /exports/crashdumps/192.168.123.181-2017-07-28-23:53:59
      Modules are also there.

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50147/
            Subject: LU-9806 obdclass: wait for all exports to go
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 08f9ebe93b300c39d2af1fb8e82a22e9c84f401b

            gerrit Gerrit Updater added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50147
            Subject: LU-9806 obdclass: wait for all exports to go
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8895829088251d37576a01d959689d4d9e9204a7

            bzzz Alex Zhuravlev added a comment -

            There is no serialization between export destroy and obd destroy:

            00000020:00000080:0.0:1607303759.080403:0:10539:0:(genops.c:984:class_export_put()) final put 0000000048c8f7e8/7bdf7e52-e46c-4201-82b5-5380be291135
            00000020:00000001:1.0:1607303759.082137:0:11815:0:(tgt_main.c:570:tgt_fini()) Process entered
            00000020:00000001:1.0:1607303759.082148:0:11815:0:(tgt_main.c:610:tgt_fini()) Process leaving
            00000020:00000080:1.0:1607303759.082811:0:8175:0:(genops.c:943:class_export_destroy()) destroying export 0000000048c8f7e8/7bdf7e52-e46c-4201-82b5-5380be291135 for lustre-OST0000
            00000001:00040000:1.0:1607303759.082843:0:8175:0:(tgt_lastrcvd.c:451:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed: 

            IMHO, the check for a freed OBD is very naive:

            	/* Target may have been freed (see LU-7430)
            	 * Slot may be not yet assigned */
            	if (exp->exp_obd->u.obt.obt_magic != OBT_MAGIC ||
            	    ted->ted_lr_idx < 0)
            		return;

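
The log and the quoted check can be read as a check-then-use problem: the magic test passes, nothing prevents teardown from running afterwards, and the bitmap is gone by the time it is used. The following is a minimal single-threaded userspace sketch of that ordering, not Lustre code. Every name in it (`model_target`, `model_export`, `MODEL_MAGIC`, `model_client_free()`, the `between` hook) is a hypothetical stand-in, and the assumption that teardown can slip in between the check and the use is taken from the trace above rather than from the Lustre source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names below are hypothetical stand-ins, not Lustre definitions. */
#define MODEL_MAGIC 0x600DF00Du

struct model_target {
	unsigned int   magic;         /* plays the role of obt_magic */
	unsigned long *client_bitmap; /* plays the role of lut_client_bitmap */
};

struct model_export {
	struct model_target *tgt;
	int                  lr_idx;  /* plays the role of ted_lr_idx */
};

enum model_result { MODEL_SKIPPED, MODEL_FREED, MODEL_WOULD_LBUG };

/* Called between check and use to stand in for a concurrent teardown. */
typedef void (*model_race_hook)(struct model_target *);

static int model_target_init(struct model_target *t)
{
	t->client_bitmap = calloc(1, sizeof(*t->client_bitmap));
	if (t->client_bitmap == NULL)
		return -1;
	t->magic = MODEL_MAGIC;
	return 0;
}

/* Teardown: release the bitmap and invalidate the magic. */
static void model_target_fini(struct model_target *t)
{
	free(t->client_bitmap);
	t->client_bitmap = NULL;
	t->magic = 0;
}

static enum model_result model_client_free(struct model_export *exp,
					   model_race_hook between)
{
	struct model_target *t = exp->tgt;

	/* Time of check: same shape as the magic test quoted above. */
	if (t->magic != MODEL_MAGIC || exp->lr_idx < 0)
		return MODEL_SKIPPED;

	/* Nothing prevents teardown from running in this window. */
	if (between != NULL)
		between(t);

	/* Time of use: the point where the real code asserts. */
	if (t->client_bitmap == NULL)
		return MODEL_WOULD_LBUG;

	/* Normal path: clear this client's slot in the bitmap. */
	t->client_bitmap[0] &= ~(1UL << exp->lr_idx);
	return MODEL_FREED;
}
```

Passing `model_target_fini` as the hook reproduces the bad ordering and yields `MODEL_WOULD_LBUG`, while calling teardown strictly before `model_client_free()` yields `MODEL_SKIPPED`. The model keeps the target memory itself alive, so it does not cover the case where the whole OBD has already been freed.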
            bzzz Alex Zhuravlev added a comment -

            Lustre: DEBUG MARKER: == recovery-small test 60: Add Changelog entries during MDS failover ================================= 04:12:39 (1573945959)
            Lustre: lustre-MDD0000: changelog on
            Lustre: lustre-MDT0001: haven't heard from client 128ea591-f299-4 (at 192.168.122.22@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp 000000007725ad20, cur 1573945996 expire 1573945966 last 1573945948
            Lustre: lustre-OST0000: haven't heard from client 128ea591-f299-4 (at 192.168.122.22@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp 00000000a202a5e3, cur 1573945996 expire 1573945966 last 1573945948
            LustreError: 19:0:(tgt_lastrcvd.c:451:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed: 
            LustreError: 19:0:(tgt_lastrcvd.c:451:tgt_client_free()) LBUG
            ...
            Call Trace:
             ? __schedule+0x2ad/0xb00
             schedule+0x34/0x80
             lbug_with_loc+0x79/0x80 [libcfs]
             ? tgt_client_free+0x2b0/0x330 [ptlrpc]
             ? mdt_destroy_export+0x87/0x2a0 [mdt]
             ? class_export_destroy+0xe9/0x460 [obdclass]
             ? process_one_work+0x249/0x5d0
             ? worker_thread+0x48/0x3d0
             ? kthread+0x100/0x140

            umount          D    0 24858  24857 0x00000000
            Call Trace:
             ? __schedule+0x2ad/0xb00
             schedule+0x34/0x80
             schedule_timeout+0x323/0x500
             ? wait_for_common+0x3b/0x160
             wait_for_common+0xc9/0x160
             ? wake_up_q+0x60/0x60
             flush_workqueue+0x143/0x4a0
             ? obd_exports_barrier+0x43/0x1a0 [obdclass]
             ? obd_exports_barrier+0x76/0x1a0 [obdclass]
             mgs_device_fini+0xdb/0x5c0 [mgs]
             class_cleanup+0x689/0xb50 [obdclass]
             class_process_config+0x153e/0x30f0 [obdclass]
             ? cache_alloc_debugcheck_after+0x138/0x150
             class_manual_cleanup+0x197/0x670 [obdclass]
             server_put_super+0x1525/0x1d50 [obdclass]
             ? evict_inodes+0x138/0x180
             generic_shutdown_super+0x5f/0xf0

            Looks like MDT umount didn't wait for all exports to be gone?

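
The question above, and the subject line of the merged patch ("wait for all exports to go"), both point at ordering: deferred export destroys should finish before the target state is torn down. The sketch below is a single-threaded userspace model of that idea only. It is not the content of patch 50147, and all names (`model_obd`, `model_export_final_put()`, `model_exports_barrier()`, `model_cleanup()`) are hypothetical. It assumes, as the traces suggest, that the final put only queues the export and that the destroy runs later from a worker.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not a Lustre structure. */
struct model_obd {
	int  zombie_pending; /* final put done, destroy still queued */
	bool target_alive;   /* false once target state is torn down */
	int  late_destroys;  /* destroys that ran after teardown */
};

static void model_obd_init(struct model_obd *obd)
{
	obd->zombie_pending = 0;
	obd->target_alive = true;
	obd->late_destroys = 0;
}

/* Final put: the export is only queued, its destroy is deferred. */
static void model_export_final_put(struct model_obd *obd)
{
	obd->zombie_pending++;
}

/* Worker step: destroy one queued export, if there is one. */
static bool model_zombie_destroy_one(struct model_obd *obd)
{
	if (obd->zombie_pending == 0)
		return false;
	obd->zombie_pending--;
	if (!obd->target_alive)
		obd->late_destroys++; /* this is where the assertion would fire */
	return true;
}

/* Barrier: do not return while any deferred destroy is outstanding. */
static void model_exports_barrier(struct model_obd *obd)
{
	while (model_zombie_destroy_one(obd))
		;
	assert(obd->zombie_pending == 0);
}

static void model_cleanup(struct model_obd *obd, bool wait_for_exports)
{
	if (wait_for_exports)
		model_exports_barrier(obd);

	obd->target_alive = false;

	/* Any destroy still queued now runs against a dead target. */
	while (model_zombie_destroy_one(obd))
		;
}
```

With two exports queued, `model_cleanup(&obd, false)` records two late destroys, while `model_cleanup(&obd, true)` records none. In the kernel the wait would have to block on concurrent workers rather than drain a counter in place, so this only illustrates the ordering requirement.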
            green Oleg Drokin added a comment -

            This still seems to trigger regularly in my testing.

            green Oleg Drokin added a comment -

            Just had another one

            [11716.272157] Lustre: DEBUG MARKER: == recovery-small test 29b: error adding new clients doesn't cause LBUG (bug 22273) ================== 23:21:29 (1501384889)
            [11716.438161] Lustre: Failing over lustre-OST0000
            [11716.527043] LustreError: 9005:0:(tgt_lastrcvd.c:440:tgt_client_free()) ASSERTION( lut && lut->lut_client_bitmap ) failed: 
            [11716.528524] LustreError: 9005:0:(tgt_lastrcvd.c:440:tgt_client_free()) LBUG
            [11716.529497] Pid: 9005, comm: obd_zombid
            [11716.530209] 
            Call Trace:
            [11716.532127]  [<ffffffffa02c57ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [11716.534315]  [<ffffffffa02c585c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [11716.535401]  [<ffffffffa061dde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
            [11716.536214]  [<ffffffffa1412b12>] ofd_destroy_export+0x62/0x180 [ofd]
            [11716.537110]  [<ffffffffa03c9239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
            [11716.551808]  [<ffffffffa03c967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
            [11716.553655]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
            [11716.554770]  [<ffffffffa03c9610>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
            [11716.556332]  [<ffffffff810a2eba>] kthread+0xea/0xf0
            [11716.557232]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
            [11716.558361]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
            [11716.564715]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
            [11716.567093] 
            [11716.568045] Kernel panic - not syncing: LBUG
            [11716.568703] CPU: 4 PID: 9005 Comm: obd_zombid Tainted: P           OE  ------------   3.10.0-debug #2
            [11716.570244] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [11716.570937]  ffffffffa02e4ed2 00000000809dbba9 ffff8800b7697cc0 ffffffff816fd3e4
            [11716.572370]  ffff8800b7697d40 ffffffff816f8c34 ffffffff00000008 ffff8800b7697d50
            [11716.573539]  ffff8800b7697cf0 00000000809dbba9 00000000809dbba9 ffff88033e48d948
            [11716.574459] Call Trace:
            [11716.574892]  [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
            [11716.575383]  [<ffffffff816f8c34>] panic+0xd8/0x1e7
            [11716.575862]  [<ffffffffa02c5874>] lbug_with_loc+0x64/0xb0 [libcfs]
            [11716.576514]  [<ffffffffa061dde2>] tgt_client_free+0x2a2/0x360 [ptlrpc]
            [11716.577065]  [<ffffffffa1412b12>] ofd_destroy_export+0x62/0x180 [ofd]
            [11716.577577]  [<ffffffffa03c9239>] obd_zombie_impexp_cull+0x549/0x920 [obdclass]
            [11716.578500]  [<ffffffffa03c967d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
            [11716.579429]  [<ffffffff810b7cc0>] ? wake_up_state+0x20/0x20
            [11716.579917]  [<ffffffffa03c9610>] ? obd_zombie_impexp_cull+0x920/0x920 [obdclass]
            [11716.580827]  [<ffffffff810a2eba>] kthread+0xea/0xf0
            [11716.581306]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
            [11716.581801]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
            [11716.582316]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
            

            Crashdump is in 192.168.123.146-2017-07-29-23:21:* on onyx-68


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 6
