Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 5964

    Description

      Just hit this issue where obd_zombid hits a bad pointer while running replay-dual (test 16):

      (gdb) bt
      #0  atomic_read (v=0xffff880090a74140)
          at /home/green/bk/linux-2.6.32-279.2.1.el6-debug/arch/x86/include/asm/atomic_64.h:23
      #1  osc_cleanup (obd=0xffff880050ed0fc0)
          at /home/green/git/lustre-release/lustre/osc/osc_request.c:3533
      #2  0xffffffffa0540ec2 in class_decref ()
      #3  0xffffffffa051d8a4 in obd_zombie_impexp_cull ()
      #4  0xffffffffa051dc75 in obd_zombie_impexp_thread ()
      #5  0xffffffff8100c14a in child_rip () at arch/x86/kernel/entry_64.S:1211
      #6  0x0000000000000000 in ?? ()
      

      Line 3533 is:
      LASSERT(cfs_atomic_read(&cli->cl_cache->ccc_users) > 0);
      The captured crashdump is in /exports/crashdumps/t/vmdump

      Attachments

        Activity

          [LU-2543] obd_zombid oops

          niu Niu Yawei (Inactive) added a comment -

          Patch landed for 2.4.
          niu Niu Yawei (Inactive) added a comment -

          http://review.whamcloud.com/4951

          niu Niu Yawei (Inactive) added a comment -

          The ll_cache comes from the sbi, which might already have been freed by the time the zombie_cull thread cleans up the osc. I think we should allocate the ll_cache separately and free it on the last osc cleanup.
          pjones Peter Jones added a comment -

          Niu

          Could you please look into this one?

          Thanks

          Peter

          green Oleg Drokin added a comment -

          Probably hit this again, but with a failed assertion this time:

          [195137.958889] Lustre: DEBUG MARKER: == replay-single test 59: test log_commit_thread vs filter_destroy race == 07:08:26 (1356782906)
          [195139.807509] LustreError: 32245:0:(osc_request.c:3533:osc_cleanup()) ASSERTION( atomic_read(&cli->cl_cache->ccc_users) > 0 ) failed: 
          [195139.808102] LustreError: 32245:0:(osc_request.c:3533:osc_cleanup()) LBUG
          [195139.808357] Pid: 32245, comm: obd_zombid
          [195139.808575] 
          [195139.808575] Call Trace:
          [195139.808943]  [<ffffffffa074b915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          [195139.809202]  [<ffffffffa074bf27>] lbug_with_loc+0x47/0xb0 [libcfs]
          [195139.809452]  [<ffffffffa09ca445>] osc_cleanup+0x185/0x190 [osc]
          [195139.809726]  [<ffffffffa0552ec2>] class_decref+0x212/0x590 [obdclass]
          [195139.809992]  [<ffffffffa052f8a4>] obd_zombie_impexp_cull+0x314/0x620 [obdclass]
          [195139.810415]  [<ffffffffa052fc75>] obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
          [195139.810830]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
          [195139.811089]  [<ffffffffa052fbb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
          [195139.827546]  [<ffffffff8100c14a>] child_rip+0xa/0x20
          [195139.827811]  [<ffffffffa052fbb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
          [195139.828244]  [<ffffffffa052fbb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
          [195139.828664]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          [195139.828907] 
          [195139.853551] Kernel panic - not syncing: LBUG
          

          Crashdump is in /exports/crashdumps/192.168.10.223-2012-12-29-07\:08\:30/

          I am upgrading this to blocker, as I seem to be hitting it pretty often.

          green Oleg Drokin added a comment -

          Apparently just hit it again:

          [83158.070254] Lustre: DEBUG MARKER: == recovery-small test 106: lightweight connection support == 00:02:04 (1356670924)
          [83158.488164] Lustre: *** cfs_fail_loc=805, val=0***
          [83158.503808] Lustre: Mounted lustre-client
          [83159.236432] LustreError: 31142:0:(osd_handler.c:1064:osd_ro()) *** setting lustre-MDT0000 read-only ***
          [83159.236995] Turning device loop0 (0x700000) read-only
          [83159.289819] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
          [83159.293581] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
          [83160.711456] Removing read-only on unknown block (0x700000)
          [83172.237746] LDISKFS-fs (loop0): recovery complete
          [83172.301007] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
          [83182.261840] LustreError: 31414:0:(mdc_locks.c:784:mdc_enqueue()) ldlm_cli_enqueue: -5
          [83183.335108] LustreError: 31443:0:(obd_config.c:1175:class_process_config()) no device for: lustre-OST0000-osc-ffff88000bdb3bf0
          [83183.335694] LustreError: 31443:0:(obd_config.c:1730:class_manual_cleanup()) cleanup failed -22: lustre-OST0000-osc-ffff88000bdb3bf0
          [83183.344765] Lustre: Unmounted lustre-client
          [83184.784895] Lustre: DEBUG MARKER: == recovery-small test 107: drop reint reply, then restart MDT == 00:02:30 (1356670950)
          [83184.813350] BUG: spinlock bad magic on CPU#7, obd_zombid/15001 (Not tainted)
          [83184.824083] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
          [83184.824416] last sysfs file: /sys/devices/system/cpu/possible
          [83184.824698] CPU 7 
          [83184.824739] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
          [83184.827831] 
          [83184.828036] Pid: 15001, comm: obd_zombid Not tainted 2.6.32-debug #6 Bochs Bochs
          [83184.828537] RIP: 0010:[<ffffffff81280961>]  [<ffffffff81280961>] spin_bug+0x81/0x100
          [83184.829056] RSP: 0018:ffff880031c97d80  EFLAGS: 00010206
          [83184.829331] RAX: 0000000000000056 RBX: ffff880049b14158 RCX: 00000000ffffffff
          [83184.829641] RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0000000000000246
          [83184.829981] RBP: ffff880031c97da0 R08: 0000000000000000 R09: 0000000000000000
          [83184.830303] R10: 0000000000000001 R11: 0000000000000000 R12: 5a5a5a5a5a5a5a5a
          [83184.830615] R13: ffffffff817c5f07 R14: ffff880031c97e30 R15: ffffffffffffffff
          [83184.830925] FS:  00007f615b0b6700(0000) GS:ffff8800063c0000(0000) knlGS:0000000000000000
          [83184.831362] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          [83184.831603] CR2: ffff880049b14160 CR3: 0000000001a25000 CR4: 00000000000006e0
          [83184.831871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          [83184.832130] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          [83184.832390] Process obd_zombid (pid: 15001, threadinfo ffff880031c96000, task ffff880036378040)
          [83184.832816] Stack:
          [83184.833001]  ffff880084318638 ffff880049b14158 ffff880084318638 ffffffffa06bd6c0
          [83184.833278] <d> ffff880031c97df0 ffffffff81280b25 0000000000000001 0000000000000001
          [83184.833707] <d> 0000000000000000 ffff88005cf61c80 ffff880084318638 ffffffffa06bd6c0
          [83184.864346] Call Trace:
          [83184.864551]  [<ffffffff81280b25>] _raw_spin_lock+0xa5/0x180
          [83184.864839]  [<ffffffff814fafde>] _spin_lock+0xe/0x10
          [83184.865112]  [<ffffffffa0679306>] osc_cleanup+0x46/0x190 [osc]
          [83184.865404]  [<ffffffffa0c7cec2>] class_decref+0x212/0x590 [obdclass]
          [83184.865722]  [<ffffffffa0c598a4>] obd_zombie_impexp_cull+0x314/0x620 [obdclass]
          [83184.883080]  [<ffffffffa0c59c75>] obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
          [83184.884778]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
          [83184.885099]  [<ffffffffa0c59bb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
          [83184.885588]  [<ffffffff8100c14a>] child_rip+0xa/0x20
          [83184.885963]  [<ffffffffa0c59bb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
          [83184.886459]  [<ffffffffa0c59bb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
          [83184.886930]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          [83184.887194] Code: 8d 8e a0 06 00 00 49 89 c1 4c 89 ee 31 c0 48 c7 c7 a8 62 7c 81 65 8b 14 25 b8 e0 00 00 e8 54 6d 27 00 4d 85 e4 44 8b 4b 08 74 6b <45> 8b 84 24 a8 04 00 00 49 8d 8c 24 a0 06 00 00 8b 53 04 48 89 
          [83184.888377] RIP  [<ffffffff81280961>] spin_bug+0x81/0x100
          [83184.888659]  RSP <ffff880031c97d80>
          

          crashdump is in /exports/crashdumps/192.168.10.222-2012-12-28-00:02:33/vmcore


          People

            niu Niu Yawei (Inactive)
            green Oleg Drokin