[LU-16792] dirtying dbuf but not tx_held Created: 02/May/23  Updated: 02/May/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

Brian recommended running Lustre against ZFS built with --enable-debug, since that enables some extra checks. This is the first thing that cropped up, right at mount time.

I guess dirtying blocks outside of a transaction hold is not very good?
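
For readers unfamiliar with the check, here is a minimal toy model of what "dirtying but not tx_held" means. It is a sketch under stated assumptions, not ZFS or Lustre code: `ToyTx`, `hold_write`, `write`, and `TxHoldError` are hypothetical names invented for illustration, and the 4096-byte block size is arbitrary. The idea it models is that a writer declares up front which blocks a transaction may touch, and a debug build refuses any write that lands on an undeclared block.

```python
class TxHoldError(RuntimeError):
    """Toy stand-in for the debug-build panic: a block was dirtied without a hold."""


class ToyTx:
    """Toy transaction that remembers which (object, block) pairs were declared."""

    def __init__(self, block_size=4096):
        self.block_size = block_size  # arbitrary size chosen for the illustration
        self.holds = set()            # (obj, blkid) pairs declared before writing
        self.dirty = set()            # (obj, blkid) pairs actually dirtied

    def _blocks(self, offset, length):
        # Block ids covered by the byte range [offset, offset + length).
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        return range(first, last + 1)

    def hold_write(self, obj, offset, length):
        # Declare the intent to write this byte range of the object.
        for blkid in self._blocks(offset, length):
            self.holds.add((obj, blkid))

    def write(self, obj, offset, length):
        # Dirty each covered block, refusing any block that was never declared.
        for blkid in self._blocks(offset, length):
            if (obj, blkid) not in self.holds:
                raise TxHoldError(
                    f"dirtying obj={obj:x} blkid={blkid} without a hold"
                )
            self.dirty.add((obj, blkid))


if __name__ == "__main__":
    tx = ToyTx()
    tx.hold_write(0x20e, 0, 4096)   # only block 0 is declared
    tx.write(0x20e, 0, 100)         # fine: stays inside block 0
    try:
        tx.write(0x20e, 4000, 200)  # spills into block 1, which was not declared
    except TxHoldError as err:
        print("rejected:", err)
```

In this model the second write fails because the declared range was smaller than the range actually written, which is one plausible shape for this kind of bug; the ticket itself does not say which declaration is missing on the llog write path.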

[  128.003305] Lustre: lustre-MDT0000: mounting server target with '-t lustre' deprecated, use '-t lustre_tgt'
[  129.332151] Kernel panic - not syncing: dirtying dbuf obj=20e lvl=0 blkid=10 but not tx_held
[  129.332151]
[  129.333151] CPU: 3 PID: 9383 Comm: ll_mgs_0001 Kdump: loaded Tainted: G           O     --------- -  - 4.18.0rh8.7-debug #2
[  129.334318] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  129.334919] Call Trace:
[  129.335196]  ? dump_stack+0xf2/0x15e
[  129.335569]  ? panic+0x17a/0x4ac
[  129.335932]  ? dmu_tx_dirty_buf+0x40c/0x5b0 [zfs]
[  129.336722]  ? _raw_spin_unlock+0x3f/0x60
[  129.337155]  ? dbuf_dirty+0x6e/0x29a0 [zfs]
[  129.337765]  ? dbuf_read+0x753/0xe40 [zfs]
[  129.338380]  ? lock_release+0x343/0x770
[  129.338808]  ? __mutex_unlock_slowpath+0x49/0x430
[  129.339330]  ? dmu_buf_will_dirty_impl+0x19b/0x570 [zfs]
[  129.340103]  ? dmu_buf_will_dirty+0x1a/0x30 [zfs]
[  129.340766]  ? dmu_write_impl+0x5c/0x1d0 [zfs]
[  129.341418]  ? dmu_write_by_dnode+0xa6/0x110 [zfs]
[  129.342162]  ? osd_write+0x177/0x8d0 [osd_zfs]
[  129.342685]  ? dt_record_write+0x3b/0x180 [obdclass]
[  129.343277]  ? llog_osd_write_rec+0xe88/0x1ed0 [obdclass]
[  129.343907]  ? llog_write_rec+0x4d8/0x6c0 [obdclass]
[  129.344490]  ? llog_write+0x6be/0x760 [obdclass]
[  129.345034]  ? record_marker+0x180/0x2a0 [mgs]
[  129.345513]  ? mgs_write_log_lov.isra.7+0x2ff/0x980 [mgs]
[  129.346119]  ? mgs_write_log_mdt0+0x35e/0xa60 [mgs]
[  129.346630]  ? mgs_write_log_mdt+0x115/0x10c0 [mgs]
[  129.347203]  ? mgs_write_log_target+0x74b/0x8d0 [mgs]
[  129.347743]  ? mgs_target_reg+0xf8f/0x1a90 [mgs]
[  129.348242]  ? tgt_handle_request0+0xf9/0xa80 [ptlrpc]
[  129.348947]  ? tgt_request_handle+0x3a5/0x1c00 [ptlrpc]
[  129.349595]  ? ptlrpc_server_handle_request+0x632/0x11e0 [ptlrpc]
[  129.350328]  ? lprocfs_counter_add+0x172/0x240 [obdclass]
[  129.350974]  ? ptlrpc_main+0xd30/0x1440 [ptlrpc]
[  129.351555]  ? ptlrpc_wait_event+0x990/0x990 [ptlrpc]
[  129.352197]  ? kthread+0x197/0x1d0
[  129.352560]  ? set_kthread_struct+0x80/0x80

 



 Comments   
Comment by Alex Zhuravlev [ 02/May/23 ]

checking..

Comment by Alex Zhuravlev [ 02/May/23 ]

interesting, I can build with --enable-debug, but can't start:

osd_zfs: Unknown symbol zfs_refcount_add (err 0)
insmod: ERROR: could not insert module /mnt/build/lustre/tests/../osd-zfs/osd_zfs.ko: Unknown symbol in module

zfs_refcount_add is not exported in 2.1.2, only in 2.1.5+

Comment by Alex Zhuravlev [ 02/May/23 ]

can't build any ZFS

checking whether blk_queue_update_readahead() exists... checking whether disk_update_readahead() exists... no
checking whether blk_queue_discard() is available... configure: error: 
	*** None of the expected "blk_queue_discard" interfaces were detected.
	*** This may be because your kernel version is newer than what is
	*** supported, or you are using a patched custom kernel with
	*** incompatible modifications.
	***
	*** ZFS Version: zfs-2.1.3-1
	*** Compatible Kernels: 3.10 - 5.16
...

so the root cause is:

make: Entering directory '/home/alexey/linux-4.18.0-425.3.1.el8'
  CC [M]  /home/alexey/zfs/build/blk_queue_discard/blk_queue_discard.o
/home/alexey/zfs/build/blk_queue_discard/blk_queue_discard.c: In function ‘main’:
/home/alexey/zfs/build/blk_queue_discard/blk_queue_discard.c:103:1: error: the frame size of 4256 bytes is larger than 4096 bytes [-Werror=frame-larger-than=]
  103 | }
      | ^
cc1: all warnings being treated as errors

I guess it's the kernel's debug options inflating struct request_queue:

CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y

Comment by Alex Zhuravlev [ 02/May/23 ]

disabling CONFIG_LOCK_STAT helped.
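
To illustrate the guess above about debug options inflating a structure past the frame-size limit, here is a rough sketch using Python's `ctypes`. Everything in it is hypothetical: `PlainLock`, `DebugLock`, `make_queue`, the field names, and all the sizes are made up for the illustration and do not reflect the real kernel layouts. It only shows the mechanism: when every embedded lock gains debug bookkeeping, a structure that embeds many locks can grow past a fixed 4096-byte limit.

```python
import ctypes

FRAME_LIMIT = 4096  # the limit quoted in the -Werror=frame-larger-than= message


class PlainLock(ctypes.Structure):
    # Hypothetical lock with only a state word (no debug options).
    _fields_ = [("state", ctypes.c_uint32)]


class DebugLock(ctypes.Structure):
    # Hypothetical lock carrying extra debug bookkeeping; sizes are invented.
    _fields_ = [
        ("state", ctypes.c_uint32),
        ("magic", ctypes.c_uint32),
        ("owner", ctypes.c_void_p),
        ("dep_map", ctypes.c_byte * 48),
        ("stats", ctypes.c_byte * 64),
    ]


def make_queue(lock_type, n_locks=32):
    """Build a hypothetical queue-like struct embedding n_locks locks."""

    class Queue(ctypes.Structure):
        _fields_ = [
            ("payload", ctypes.c_byte * 1024),
            ("locks", lock_type * n_locks),
        ]

    return Queue


plain_size = ctypes.sizeof(make_queue(PlainLock))
debug_size = ctypes.sizeof(make_queue(DebugLock))

if __name__ == "__main__":
    print("plain:", plain_size, "fits:", plain_size <= FRAME_LIMIT)
    print("debug:", debug_size, "fits:", debug_size <= FRAME_LIMIT)
```

A configure test that places one such structure on the stack would then compile without debug options and fail with them, which would be consistent with dropping a single option being enough to get back under the limit.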

Comment by Alex Zhuravlev [ 02/May/23 ]

so.. this is a dup of LU-2160 and LU-7409
