[LU-7297] BUG: spinlock bad magic, probably on oh->oh_lock Created: 14/Oct/15  Updated: 04/Dec/15  Resolved: 04/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: llnl, patch
Environment:

Linux kernel 2.6.32-504.16.2.1chaos.ch5.3.x86_64.debug
Lustre 2.7.54 plus 3 patches, see description


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After building Lustre, I attempted to run it in-tree on a single node. I ran:

FSTYPE=zfs llmount.sh

mkfs.lustre completed without errors.
mount produced the following output in dmesg:

Lustre: Lustre: Build Version: v2_7_54_0-g99fd511-CHANGED-2.6.32-504.16.2.1chaos.ch5.3.x86_64.debug
LNet: Added LNI 192.168.122.12@tcp [8/256/0/180]
LNet: Accept secure, port 988
Lustre: Echo OBD driver; http://www.lustre.org/
zpool used greatest stack depth: 2928 bytes left
txg_sync used greatest stack depth: 2872 bytes left
BUG: spinlock bad magic on CPU#0, mount.lustre/4152 (Not tainted)
lock: ffff88002713a770, .magic: 00000000, .owner: mount.lustre/4152, .owner_cpu: 0
Pid: 4152, comm: mount.lustre Not tainted 2.6.32-504.16.2.1chaos.ch5.3.x86_64.debug #1
Call Trace:
[<ffffffff812c3d2a>] ? spin_bug+0xaa/0x100
[<ffffffff812c3de5>] ? _raw_spin_unlock+0x65/0xa0
[<ffffffff81561a4b>] ? _spin_unlock+0x2b/0x40
[<ffffffffa0a8d32c>] ? lprocfs_oh_tally+0x3c/0x50 [obdclass]
[<ffffffffa0e54319>] ? record_start_io+0x39/0x90 [osd_zfs]
[<ffffffffa0e5601d>] ? osd_write+0x1ad/0x3a0 [osd_zfs]
[<ffffffffa0ab747d>] ? dt_record_write+0x3d/0x130 [obdclass]
[<ffffffffa0a97895>] ? local_oid_storage_init+0xe55/0x1410 [obdclass]
[<ffffffffa11226a4>] ? mgs_fs_setup+0xa4/0x4b0 [mgs]
[<ffffffff8156190b>] ? _read_unlock+0x2b/0x40
[<ffffffffa1121aaf>] ? mgs_init0+0xeff/0x17c0 [mgs]
[<ffffffff8118f215>] ? kmem_cache_alloc_trace+0x1c5/0x2e0
[<ffffffff81545b30>] ? kmemleak_alloc+0x20/0xd0
[<ffffffffa111a399>] ? mgs_type_start+0x19/0x20 [mgs]
[<ffffffffa1122480>] ? mgs_device_alloc+0x110/0x1f0 [mgs]
[<ffffffffa0a9d19f>] ? obd_setup+0x1bf/0x290 [obdclass]
[<ffffffffa0a9d477>] ? class_setup+0x207/0x870 [obdclass]
[<ffffffffa0aa4bfc>] ? class_process_config+0x113c/0x2710 [obdclass]
[<ffffffff8118c983>] ? cache_alloc_debugcheck_after+0xf3/0x230
[<ffffffff81545b30>] ? kmemleak_alloc+0x20/0xd0
[<ffffffff8118ffdb>] ? __kmalloc+0x21b/0x330
[<ffffffffa0aaaf98>] ? do_lcfg+0x198/0x9c0 [obdclass]
[<ffffffffa0aab422>] ? do_lcfg+0x622/0x9c0 [obdclass]
[<ffffffffa0aab854>] ? lustre_start_simple+0x94/0x200 [obdclass]
[<ffffffffa0ae0ae1>] ? server_fill_super+0x1161/0x1690 [obdclass]
[<ffffffffa0ab0c58>] ? lustre_fill_super+0x5d8/0xa80 [obdclass]
[<ffffffffa0ab0680>] ? lustre_fill_super+0x0/0xa80 [obdclass]
[<ffffffff811af06f>] ? get_sb_nodev+0x5f/0xa0
[<ffffffffa0aa8345>] ? lustre_get_sb+0x25/0x30 [obdclass]
[<ffffffff811ae61b>] ? vfs_kern_mount+0x7b/0x1b0
[<ffffffff811ae7c2>] ? do_kern_mount+0x52/0x130
[<ffffffff811d0b0b>] ? do_mount+0x2fb/0x920
[<ffffffff811d11c0>] ? sys_mount+0x90/0xe0
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Lustre: Setting parameter lustre-MDT0000.mdt.identity_upcall in log lustre-MDT0000
Lustre: ctl-lustre-MDT0000: No data found on store. Initialize space
Lustre: lustre-MDT0000: new disk, initializing
mount.lustre used greatest stack depth: 2536 bytes left

Code that triggered the BUG:

void lprocfs_oh_tally(struct obd_histogram *oh, unsigned int value)
{
        /* clamp out-of-range samples into the last bucket */
        if (value >= OBD_HIST_MAX)
                value = OBD_HIST_MAX - 1;

        spin_lock(&oh->oh_lock);
        oh->oh_buckets[value]++;
        spin_unlock(&oh->oh_lock);      /* per the trace, the debug check fires here */
}
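
For context, here is a minimal userspace sketch of why a lock-debugging kernel reports this. It is a model, not kernel code: SPINLOCK_MAGIC matches the kernel's CONFIG_DEBUG_SPINLOCK magic value, but struct debug_spinlock, model_spin_lock_init(), and model_check_magic() are hypothetical stand-ins for the real machinery. A spinlock sitting in zeroed memory that never went through spin_lock_init() carries magic == 0 and trips the check on first use, matching the ".magic: 00000000" above.

#include <stdio.h>
#include <stdint.h>

#define SPINLOCK_MAGIC 0xdead4eadU  /* the kernel's debug magic value */

struct debug_spinlock {
        uint32_t magic;  /* stamped by spin_lock_init(); 0 in fresh zeroed memory */
        /* .owner, .owner_cpu and the raw lock word are omitted in this model */
};

static void model_spin_lock_init(struct debug_spinlock *l)
{
        l->magic = SPINLOCK_MAGIC;
}

static void model_check_magic(const struct debug_spinlock *l)
{
        if (l->magic != SPINLOCK_MAGIC)
                printf("BUG: spinlock bad magic, .magic: %08x\n", l->magic);
}

int main(void)
{
        struct debug_spinlock lock = { 0 };  /* like a zeroed, never-initialized oh_lock */

        model_check_magic(&lock);   /* fires: magic is 00000000, as in the trace */
        model_spin_lock_init(&lock);
        model_check_magic(&lock);   /* silent once the lock is initialized */
        return 0;
}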


 Comments   
Comment by Olaf Faaland [ 14/Oct/15 ]

Lustre involved is:

* 99fd511 LU-6765 obdecho: initialize cs_lu.ls_purge_mutex
* 4a604d2 LU-6816 utils: remove libzfs_load_module() call
* 4dbb82b LU-6747 osd-zfs: initialize obd_statfs in osd_statfs()
* d20d17e New tag 2.7.54
* 2b5ebbb LU-6599 header: Change erroneous GPLv3 header to GPLv2
* c6aab2c LU-6068 misc: update old URLs to hpdd.intel.com
* 8badb39 LU-6389 llite: restart short read/write for normal IO

I will check with current master and report back whether the issue has already been resolved and the ticket can be closed.

Comment by Olaf Faaland [ 14/Oct/15 ]

This is running under a kernel with lock debugging features enabled.

Comment by Olaf Faaland [ 14/Oct/15 ]

Rebuilt from current master and the problem does not occur. False alarm, sorry. The ticket can be closed.

Comment by Joseph Gmitter (Inactive) [ 14/Oct/15 ]

Thanks Olaf, we will close the ticket.

Comment by Olaf Faaland [ 14/Oct/15 ]

I find it does occur with current master after all; the kernel just doesn't report it every time.

The BUG report that occurs with current master Lustre is:

BUG: spinlock bad magic on CPU#0, mount.lustre/4407 (Not tainted)
lock: ffff880028996770, .magic: 00000000, .owner: mount.lustre/4407, .owner_cpu: 0
Pid: 4407, comm: mount.lustre Not tainted 2.6.32-504.16.2.1chaos.ch5.3.x86_64.debug #1
Call Trace:
[<ffffffff812c3d2a>] ? spin_bug+0xaa/0x100
[<ffffffff812c3de5>] ? _raw_spin_unlock+0x65/0xa0
[<ffffffff81561a4b>] ? _spin_unlock+0x2b/0x40
[<ffffffffa0a8f41c>] ? lprocfs_oh_tally+0x3c/0x50 [obdclass]
[<ffffffffa0e8e869>] ? record_start_io+0x39/0x90 [osd_zfs]
[<ffffffffa0e908ad>] ? osd_write+0x1ad/0x3a0 [osd_zfs]
[<ffffffffa0aba6bd>] ? dt_record_write+0x3d/0x130 [obdclass]
[<ffffffffa0a99f34>] ? local_oid_storage_init+0xeb4/0x14a0 [obdclass]
[<ffffffffa11495c4>] ? mgs_fs_setup+0xa4/0x4b0 [mgs]
[<ffffffff8156190b>] ? _read_unlock+0x2b/0x40
[<ffffffffa1148a8f>] ? mgs_init0+0xecf/0x1790 [mgs]
[<ffffffff81545b30>] ? kmemleak_alloc+0x20/0xd0
[<ffffffffa11416b9>] ? mgs_type_start+0x19/0x20 [mgs]
[<ffffffffa11493e8>] ? mgs_device_alloc+0x98/0x140 [mgs]
[<ffffffffa0a9f42f>] ? obd_setup+0x1bf/0x290 [obdclass]
[<ffffffffa0a9f758>] ? class_setup+0x258/0x930 [obdclass]
[<ffffffffa0aa6011>] ? class_process_config+0x1151/0x26d0 [obdclass]
[<ffffffff8118c983>] ? cache_alloc_debugcheck_after+0xf3/0x230
[<ffffffff81545b30>] ? kmemleak_alloc+0x20/0xd0
[<ffffffff8118ffdb>] ? __kmalloc+0x21b/0x330
[<ffffffffa0ab1198>] ? do_lcfg+0x198/0xb60 [obdclass]
[<ffffffffa0ab12cb>] ? do_lcfg+0x2cb/0xb60 [obdclass]
[<ffffffffa0ab1bf4>] ? lustre_start_simple+0x94/0x200 [obdclass]
[<ffffffffa0ae206e>] ? server_fill_super+0x115e/0x1688 [obdclass]
[<ffffffffa0ab3e28>] ? lustre_fill_super+0x338/0x990 [obdclass]
[<ffffffffa0ab3af0>] ? lustre_fill_super+0x0/0x990 [obdclass]
[<ffffffff811af06f>] ? get_sb_nodev+0x5f/0xa0
[<ffffffffa0aab195>] ? lustre_get_sb+0x25/0x30 [obdclass]
[<ffffffff811ae61b>] ? vfs_kern_mount+0x7b/0x1b0
[<ffffffff811ae7c2>] ? do_kern_mount+0x52/0x130
[<ffffffff811d0b0b>] ? do_mount+0x2fb/0x920
[<ffffffff811d11c0>] ? sys_mount+0x90/0xe0
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

record_start_io() calls lprocfs_oh_tally() with osd->od_brw_stats.hist.
I don't see where od_brw_stats.hist[i].oh_lock is initialized in the ZFS OSD.

[faaland1@hefe branch:follow_master lustre] $git grep 'spin_lock_init.*oh_lock'
lustre/ldlm/ldlm_lib.c: spin_lock_init(&cli->cl_read_rpc_hist.oh_lock);
lustre/ldlm/ldlm_lib.c: spin_lock_init(&cli->cl_write_rpc_hist.oh_lock);
lustre/ldlm/ldlm_lib.c: spin_lock_init(&cli->cl_read_page_hist.oh_lock);
lustre/ldlm/ldlm_lib.c: spin_lock_init(&cli->cl_write_page_hist.oh_lock);
lustre/ldlm/ldlm_lib.c: spin_lock_init(&cli->cl_read_offset_hist.oh_lock);
lustre/ldlm/ldlm_lib.c: spin_lock_init(&cli->cl_write_offset_hist.oh_lock);
lustre/ldlm/ldlm_lib.c: spin_lock_init(&cli->cl_mod_rpcs_hist.oh_lock);
lustre/mdt/mdt_lproc.c:         spin_lock_init(&mdt->mdt_rename_stats.hist[i].oh_lock);
lustre/osd-ldiskfs/osd_lproc.c:         spin_lock_init(&osd->od_brw_stats.hist[i].oh_lock);

The ldiskfs OSD initializes the locks via
osd_device_init->osd_lprocfs_init->osd_stats_init->spin_lock_init

But the ZFS OSD returns from osd_device_init() immediately without doing anything. It's not clear to me where the locks should be initialized, if that's not the place.

Comment by Joseph Gmitter (Inactive) [ 15/Oct/15 ]

Hi Emoly,
Could you have a look at this one?
Thanks.
Joe

Comment by Gerrit Updater [ 22/Oct/15 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: http://review.whamcloud.com/16919
Subject: LU-7297 osd-zfs: initialize oh_lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d186ea11a1fb76d6532f6bf3eaa15105d5e617b0

Comment by Olaf Faaland [ 22/Oct/15 ]

Somehow I'd overlooked osd-zfs/osd_lproc.c:osd_stats_init(). Patch submitted.
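
For reference, a minimal sketch of the shape of the fix, assuming osd-zfs's osd_stats_init() mirrors the ldiskfs version found by the git grep above; the function signature is approximated, BRW_LAST is the histogram count used by struct brw_stats, and the actual change is in http://review.whamcloud.com/16919:

/* lustre/osd-zfs/osd_lproc.c -- sketch only, not the merged patch */
static void osd_stats_init(struct osd_device *osd)
{
        int i;

        /* Stamp every histogram lock before record_start_io() can reach
         * lprocfs_oh_tally(); without this the locks sit in zeroed memory
         * and a debug kernel reports "bad magic" on first use. */
        for (i = 0; i < BRW_LAST; i++)
                spin_lock_init(&osd->od_brw_stats.hist[i].oh_lock);

        /* ... rest of the stats setup unchanged ... */
}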

Comment by Gerrit Updater [ 04/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16919/
Subject: LU-7297 osd-zfs: initialize oh_lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f4ea6cd384f152c04c478bf19278130802ad8e67

Comment by Joseph Gmitter (Inactive) [ 04/Dec/15 ]

Landed for 2.8
