[LU-2543] obd_zombid oops Created: 27/Dec/12  Updated: 22/Apr/13  Resolved: 13/Jan/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Severity: 3
Rank (Obsolete): 5964

 Description   

Just hit an issue where obd_zombid dereferences a bad pointer while running replay-dual (test 16):

(gdb) bt
#0  atomic_read (v=0xffff880090a74140)
    at /home/green/bk/linux-2.6.32-279.2.1.el6-debug/arch/x86/include/asm/atomic_64.h:23
#1  osc_cleanup (obd=0xffff880050ed0fc0)
    at /home/green/git/lustre-release/lustre/osc/osc_request.c:3533
#2  0xffffffffa0540ec2 in class_decref ()
#3  0xffffffffa051d8a4 in obd_zombie_impexp_cull ()
#4  0xffffffffa051dc75 in obd_zombie_impexp_thread ()
#5  0xffffffff8100c14a in child_rip () at arch/x86/kernel/entry_64.S:1211
#6  0x0000000000000000 in ?? ()

Line 3533 is:
LASSERT(cfs_atomic_read(&cli->cl_cache->ccc_users) > 0);
The captured crashdump is in /exports/crashdumps/t/vmdump



 Comments   
Comment by Oleg Drokin [ 28/Dec/12 ]

Apparently just hit it again:

[83158.070254] Lustre: DEBUG MARKER: == recovery-small test 106: lightweight connection support == 00:02:04 (1356670924)
[83158.488164] Lustre: *** cfs_fail_loc=805, val=0***
[83158.503808] Lustre: Mounted lustre-client
[83159.236432] LustreError: 31142:0:(osd_handler.c:1064:osd_ro()) *** setting lustre-MDT0000 read-only ***
[83159.236995] Turning device loop0 (0x700000) read-only
[83159.289819] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[83159.293581] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[83160.711456] Removing read-only on unknown block (0x700000)
[83172.237746] LDISKFS-fs (loop0): recovery complete
[83172.301007] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
[83182.261840] LustreError: 31414:0:(mdc_locks.c:784:mdc_enqueue()) ldlm_cli_enqueue: -5
[83183.335108] LustreError: 31443:0:(obd_config.c:1175:class_process_config()) no device for: lustre-OST0000-osc-ffff88000bdb3bf0
[83183.335694] LustreError: 31443:0:(obd_config.c:1730:class_manual_cleanup()) cleanup failed -22: lustre-OST0000-osc-ffff88000bdb3bf0
[83183.344765] Lustre: Unmounted lustre-client
[83184.784895] Lustre: DEBUG MARKER: == recovery-small test 107: drop reint reply, then restart MDT == 00:02:30 (1356670950)
[83184.813350] BUG: spinlock bad magic on CPU#7, obd_zombid/15001 (Not tainted)
[83184.824083] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[83184.824416] last sysfs file: /sys/devices/system/cpu/possible
[83184.824698] CPU 7 
[83184.824739] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[83184.827831] 
[83184.828036] Pid: 15001, comm: obd_zombid Not tainted 2.6.32-debug #6 Bochs Bochs
[83184.828537] RIP: 0010:[<ffffffff81280961>]  [<ffffffff81280961>] spin_bug+0x81/0x100
[83184.829056] RSP: 0018:ffff880031c97d80  EFLAGS: 00010206
[83184.829331] RAX: 0000000000000056 RBX: ffff880049b14158 RCX: 00000000ffffffff
[83184.829641] RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0000000000000246
[83184.829981] RBP: ffff880031c97da0 R08: 0000000000000000 R09: 0000000000000000
[83184.830303] R10: 0000000000000001 R11: 0000000000000000 R12: 5a5a5a5a5a5a5a5a
[83184.830615] R13: ffffffff817c5f07 R14: ffff880031c97e30 R15: ffffffffffffffff
[83184.830925] FS:  00007f615b0b6700(0000) GS:ffff8800063c0000(0000) knlGS:0000000000000000
[83184.831362] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[83184.831603] CR2: ffff880049b14160 CR3: 0000000001a25000 CR4: 00000000000006e0
[83184.831871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[83184.832130] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[83184.832390] Process obd_zombid (pid: 15001, threadinfo ffff880031c96000, task ffff880036378040)
[83184.832816] Stack:
[83184.833001]  ffff880084318638 ffff880049b14158 ffff880084318638 ffffffffa06bd6c0
[83184.833278] <d> ffff880031c97df0 ffffffff81280b25 0000000000000001 0000000000000001
[83184.833707] <d> 0000000000000000 ffff88005cf61c80 ffff880084318638 ffffffffa06bd6c0
[83184.864346] Call Trace:
[83184.864551]  [<ffffffff81280b25>] _raw_spin_lock+0xa5/0x180
[83184.864839]  [<ffffffff814fafde>] _spin_lock+0xe/0x10
[83184.865112]  [<ffffffffa0679306>] osc_cleanup+0x46/0x190 [osc]
[83184.865404]  [<ffffffffa0c7cec2>] class_decref+0x212/0x590 [obdclass]
[83184.865722]  [<ffffffffa0c598a4>] obd_zombie_impexp_cull+0x314/0x620 [obdclass]
[83184.883080]  [<ffffffffa0c59c75>] obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
[83184.884778]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[83184.885099]  [<ffffffffa0c59bb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[83184.885588]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[83184.885963]  [<ffffffffa0c59bb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[83184.886459]  [<ffffffffa0c59bb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[83184.886930]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[83184.887194] Code: 8d 8e a0 06 00 00 49 89 c1 4c 89 ee 31 c0 48 c7 c7 a8 62 7c 81 65 8b 14 25 b8 e0 00 00 e8 54 6d 27 00 4d 85 e4 44 8b 4b 08 74 6b <45> 8b 84 24 a8 04 00 00 49 8d 8c 24 a0 06 00 00 8b 53 04 48 89 
[83184.888377] RIP  [<ffffffff81280961>] spin_bug+0x81/0x100
[83184.888659]  RSP <ffff880031c97d80>

crashdump is in /exports/crashdumps/192.168.10.222-2012-12-28-00:02:33/vmcore

Comment by Oleg Drokin [ 31/Dec/12 ]

Probably hit this again, but with a failed assertion this time:

[195137.958889] Lustre: DEBUG MARKER: == replay-single test 59: test log_commit_thread vs filter_destroy race == 07:08:26 (1356782906)
[195139.807509] LustreError: 32245:0:(osc_request.c:3533:osc_cleanup()) ASSERTION( atomic_read(&cli->cl_cache->ccc_users) > 0 ) failed: 
[195139.808102] LustreError: 32245:0:(osc_request.c:3533:osc_cleanup()) LBUG
[195139.808357] Pid: 32245, comm: obd_zombid
[195139.808575] 
[195139.808575] Call Trace:
[195139.808943]  [<ffffffffa074b915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[195139.809202]  [<ffffffffa074bf27>] lbug_with_loc+0x47/0xb0 [libcfs]
[195139.809452]  [<ffffffffa09ca445>] osc_cleanup+0x185/0x190 [osc]
[195139.809726]  [<ffffffffa0552ec2>] class_decref+0x212/0x590 [obdclass]
[195139.809992]  [<ffffffffa052f8a4>] obd_zombie_impexp_cull+0x314/0x620 [obdclass]
[195139.810415]  [<ffffffffa052fc75>] obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
[195139.810830]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[195139.811089]  [<ffffffffa052fbb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[195139.827546]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[195139.827811]  [<ffffffffa052fbb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[195139.828244]  [<ffffffffa052fbb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[195139.828664]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[195139.828907] 
[195139.853551] Kernel panic - not syncing: LBUG

Crashdump is in /exports/crashdumps/192.168.10.223-2012-12-29-07\:08\:30/

I am upgrading this to a blocker, as I seem to be hitting it pretty often.

Comment by Peter Jones [ 03/Jan/13 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 04/Jan/13 ]

The ll_cache comes from the sbi, which might already have been freed by the time the zombie_cull thread cleans up the osc. I think we should allocate the ll_cache separately and free it on the last osc cleanup.

Comment by Niu Yawei (Inactive) [ 04/Jan/13 ]

http://review.whamcloud.com/4951

Comment by Niu Yawei (Inactive) [ 13/Jan/13 ]

Patch landed for 2.4.

Generated at Sat Feb 10 01:26:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.