[LU-10607] uninitialized spinlock in osd_zfs Created: 06/Feb/18  Updated: 09/May/20  Resolved: 06/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-10642 osd-zfs: non-initialized spinlock dur... Closed
Related
is related to LU-13542 osd stats are initialized too late Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I missed this originally, but it looks like https://review.whamcloud.com/#/c/30909/2 introduced a regression into master.

Every time I mount Lustre on ZFS, I get this:

[   54.320874] ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5
[   82.005278] BUG: spinlock bad magic on CPU#1, mount.lustre/5874
[   82.007911]  lock: 0xffff88030066e688, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[   82.008146] CPU: 1 PID: 5874 Comm: mount.lustre Tainted: P           OE  ------------   3.10.0-debug #2
[   82.008381] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   82.008526]  0000000000000000 00000000c03a9b87 ffff8800b00ff798 ffffffff816fd3e4
[   82.010041]  ffff8800b00ff7b8 ffffffff816fd472 ffff88030066e688 ffffffff81aa1a78
[   82.013275]  ffff8800b00ff7d8 ffffffff816fd498 ffff88030066e688 0000000000000001
[   82.014974] Call Trace:
[   82.015815]  [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[   82.016589]  [<ffffffff816fd472>] spin_dump+0x8c/0x91
[   82.017340]  [<ffffffff816fd498>] spin_bug+0x21/0x26
[   82.018086]  [<ffffffff8138ec10>] do_raw_spin_lock+0x130/0x150
[   82.018858]  [<ffffffff81706249>] _raw_spin_lock+0x49/0x50
[   82.019714]  [<ffffffffa032ebf7>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]
[   82.032661]  [<ffffffffa032ebf7>] lprocfs_oh_tally+0x17/0x40 [obdclass]
[   82.036879]  [<ffffffffa113781b>] record_start_io.part.13+0x2b/0x40 [osd_zfs]
[   82.037792]  [<ffffffffa1137f1d>] osd_read+0xfd/0x2b0 [osd_zfs]
[   82.038385]  [<ffffffffa0356244>] dt_read+0x14/0x50 [obdclass]
[   82.039012]  [<ffffffffa037a52f>] scrub_file_load+0x4f/0x410 [obdclass]
[   82.039792]  [<ffffffffa115290a>] osd_scrub_setup+0x3ea/0xb50 [osd_zfs]
[   82.040618]  [<ffffffffa112c26a>] osd_mount+0xdfa/0x1170 [osd_zfs]
[   82.041425]  [<ffffffff81706717>] ? _raw_read_unlock+0x27/0x40
[   82.042127]  [<ffffffffa112cc1d>] osd_device_alloc+0x34d/0x400 [osd_zfs]
[   82.043068]  [<ffffffffa0341084>] obd_setup+0x114/0x2a0 [obdclass]
[   82.043668]  [<ffffffffa03414b6>] class_setup+0x2a6/0x840 [obdclass]
[   82.044592]  [<ffffffffa0345032>] class_process_config+0x1b62/0x28a0 [obdclass]
[   82.046302]  [<ffffffff811cd4f9>] ? __kmalloc+0x649/0x660
[   82.047316]  [<ffffffffa01f3f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[   82.048126]  [<ffffffffa0348ae8>] do_lcfg+0x258/0x4b0 [obdclass]
[   82.048907]  [<ffffffffa034cb48>] lustre_start_simple+0x88/0x210 [obdclass]
[   82.049714]  [<ffffffffa0379154>] server_fill_super+0xf34/0x1860 [obdclass]
[   82.050487]  [<ffffffffa01f3f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[   82.062653]  [<ffffffffa0350680>] lustre_fill_super+0x3d0/0x8b0 [obdclass]
[   82.063465]  [<ffffffffa03502b0>] ? lustre_common_put_super+0xb50/0xb50 [obdclass]
[   82.065632]  [<ffffffff811f0ebd>] mount_nodev+0x4d/0xb0
[   82.069485]  [<ffffffffa0348818>] lustre_mount+0x38/0x60 [obdclass]
[   82.070379]  [<ffffffff811f1899>] mount_fs+0x39/0x1b0
[   82.071309]  [<ffffffff8120e843>] vfs_kern_mount+0x63/0xf0
[   82.072876]  [<ffffffff81210ebe>] do_mount+0x24e/0xa40
[   82.073642]  [<ffffffff811767fe>] ? __get_free_pages+0xe/0x50
[   82.074377]  [<ffffffff81211746>] SyS_mount+0x96/0xf0
[   82.075246]  [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Peter Jones [ 06/Feb/18 ]

Fan Yong

Can you please advise?

Thanks

Peter

Comment by Gerrit Updater [ 06/Feb/18 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31180
Subject: LU-10607 osd-zfs: skip io stat before lproc initialized
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a6096058f1ed4af0b246babba9166978c0a6d9b9
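
The patch subject ("skip io stat before lproc initialized") describes a guard on the accounting path. As a rough illustration of that idea only, and not the content of the actual patch, here is a userspace sketch in which the read path skips accounting until a readiness flag is set. All `demo_` names and the `ready` flag are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device whose I/O stats may not be set up yet. */
struct demo_stats {
	bool          ready;         /* set once stats initialization has run */
	unsigned long reads_started; /* stand-in for the real I/O histograms */
};

struct demo_device {
	struct demo_stats stats;
};

void demo_stats_init(struct demo_device *d)
{
	d->stats.reads_started = 0;
	d->stats.ready = true;
}

/* Record the start of an I/O; returns false if accounting was skipped. */
bool demo_record_start_io(struct demo_device *d)
{
	assert(d != NULL);
	if (!d->stats.ready)
		return false; /* too early: do not touch uninitialized stats */
	d->stats.reads_started++;
	return true;
}

/* The read itself proceeds whether or not it was accounted. */
size_t demo_read(struct demo_device *d, size_t len)
{
	(void)demo_record_start_io(d);
	return len;
}
```

The trade-off in a guard like this is that reads issued during early mount are simply not counted, which seems acceptable for internal I/O such as scrub file loading.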

Comment by Gerrit Updater [ 06/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31180/
Subject: LU-10607 osd-zfs: skip io stat for OI scrub
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 763c07c431b2ae0a91f4caccf3a4ce86fbc67243

Comment by Peter Jones [ 06/Mar/18 ]

Landed for 2.11

Comment by Andrew Perepechko [ 27/Nov/19 ]

This fix seems incomplete. Even with it applied, Lustre still crashes slightly later in the mount path from the same root cause (incomplete osd stats initialization):

[ 5696.621248]  [<ffffffffc169719b>] record_start_io.part.14+0x2b/0x40 [osd_zfs]
[ 5696.623630]  [<ffffffffc1698322>] osd_read+0xa2/0x180 [osd_zfs]
[ 5696.625807]  [<ffffffffc1167dee>] dt_record_read+0x1e/0x70 [obdclass]
[ 5696.628063]  [<ffffffffc1190997>] lustre_index_restore+0x527/0x1720 [obdclass]
[ 5696.637015]  [<ffffffffc16b2564>] osd_initial_OI_scrub+0xa34/0xd50 [osd_zfs]
[ 5696.639357]  [<ffffffffc16b34fd>] osd_scrub_setup+0x9ed/0xb90 [osd_zfs]
[ 5696.641585]  [<ffffffffc168a97b>] osd_mount+0xf4b/0x1380 [osd_zfs]