[LU-3104] Oops in dmu_tx_hold_spill()
Created: 04/Apr/13  Updated: 09/Oct/21  Resolved: 09/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.4.0

Type: Bug
Priority: Minor
Reporter: John Hammond
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: llnl, mq313
Environment:
  # git describe
  2.3.63-47-g1ee132f
  # rpm -q zfs
  zfs-0.6.0-rc14.x86_64
  # rpm -q spl
  spl-0.6.0-rc14.x86_64

Severity: 3
Rank (Obsolete): 7544

Description
# FSTYPE=zfs MOUNT_2=y llmount.sh
...
# cd /mnt/lustre; while true; do sys_mknod f0 r 0 0; sys_unlink f0; done
# cd /mnt/lustre2; while true; do setfattr -n user.0 -v 0 f0; done &

last sysfs file: /sys/devices/pci0000:00/0000:00:05.0/local_cpus
CPU 0
Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) osd_zfs(U) \
lquota(U) jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdcl\
ass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) autofs4 nfs lo\
ckd fscache nfs_acl auth_rpcgss sunrpc ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)\
(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon virtio_net i2c_piix4 i2c_\
core ext4 mbcache jbd2 virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring v\
irtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 5997, comm: mdt00_002 Tainted: P           ---------------    2.6.32-279.19.1.el6_lustre_gcov.x86_64 #1 Bochs Bochs
RIP: 0010:[<ffffffffa021ff76>]  [<ffffffffa021ff76>] dmu_tx_hold_spill+0x26/0xa0 [zfs]
RSP: 0018:ffff8801629cd9f0  EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffa02b1580 RDI: ffff88015457d998
RBP: ffff8801629cda00 R08: 000000000000000e R09: ffff880153c38000
R10: ffff8801629cd908 R11: 0000000000000000 R12: ffff88018d2d6500
R13: ffff88015386e6d0 R14: 0000000000000001 R15: 0000000000000755
FS:  00007f65f60cc700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000018 CR3: 0000000154724000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mdt00_002 (pid: 5997, threadinfo ffff8801629cc000, task ffff880197c85540)
Stack:
 ffff880156c56200 ffff88018d2d6500 ffff8801629cda40 ffffffffa022175f
<d> ffff8801629cda20 ffff88015468d648 ffff880156c56200 ffffc9000bcdc9d0
<d> 0000000000000001 ffff880167f1e800 ffff8801629cda90 ffffffffa0bc4437
Call Trace:
 [<ffffffffa022175f>] dmu_tx_hold_sa+0x10f/0x190 [zfs]
 [<ffffffffa0bc4437>] __osd_xattr_declare_set+0x107/0x2a0 [osd_zfs]
 [<ffffffffa0bc46f3>] osd_declare_xattr_set+0x123/0x1b0 [osd_zfs]
 [<ffffffffa0cd3853>] lod_declare_xattr_set+0x143/0x410 [lod]
 [<ffffffffa0988dde>] mdd_declare_xattr_set+0x7e/0x1a0 [mdd]
 [<ffffffffa098bd4c>] mdd_xattr_set+0x1dc/0xbf0 [mdd]
 [<ffffffffa0862a74>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
 [<ffffffffa0c1d46a>] ? mdt_version_save+0x8a/0x1a0 [mdt]
 [<ffffffffa0c23c61>] mdt_reint_setxattr+0x6f1/0x1850 [mdt]
 [<ffffffffa06f57a0>] ? lu_ucred+0x20/0x30 [obdclass]
 [<ffffffffa0c17fcc>] ? mdt_root_squash+0x2c/0x410 [mdt]
 [<ffffffffa0c1c7c1>] mdt_reint_rec+0x41/0xe0 [mdt]
 [<ffffffffa0c15e03>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
 [<ffffffffa0c16134>] mdt_reint+0x44/0xe0 [mdt]
 [<ffffffffa0c040f8>] mdt_handle_common+0x648/0x1660 [mdt]
 [<ffffffffa0c40345>] mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa08713cc>] ptlrpc_server_handle_request+0x40c/0xd90 [ptlrpc]
 [<ffffffffa051e5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa08689f9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa08728c5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
 [<ffffffffa0871d50>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0871d50>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffffa0871d50>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 00 00 00 00 00 55 48 89 e5 41 54 53 0f 1f 44 00 00 45 31 c9 45 31 c0 b9 06 00 00 00\
 48 89 f2 48 8b 77 20 e8 fd fd ff ff 48 89 c3 <48> 8b 40 18 48 85 c0 74 14 4c 8b 60 58 41\
 f6 44 24 07 04 75 15
RIP  [<ffffffffa021ff76>] dmu_tx_hold_spill+0x26/0xa0 [zfs]
 RSP <ffff8801629cd9f0>
CR2: 0000000000000018

crash> dis -l dmu_tx_hold_spill+0x26
/tmp/zfs-build-root-PfxtLzjz/BUILD/zfs-0.6.0/module/zfs/../../module/zfs/dmu_tx.c: 1322
0xffffffffa021ff76 <dmu_tx_hold_spill+38>:      mov    0x18(%rax),%rax

void
dmu_tx_hold_spill(dmu_tx_t *tx, uint64_t object)
{
        dnode_t *dn;
        dmu_tx_hold_t *txh;

        txh = dmu_tx_hold_object_impl(tx, tx->tx_objset, object,
            THT_SPILL, 0, 0);

        dn = txh->txh_dnode; /* HERE */

        ....
}
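
The register dump above (RAX = 0, CR2 = 0x18) and the faulting instruction `mov 0x18(%rax),%rax` are consistent with `txh` itself being NULL at the marked line, with `txh_dnode` sitting 0x18 bytes into the structure. The `Code:` bytes also show `48 8b 77 20` (`mov 0x20(%rdi),%rsi`, the load of `tx->tx_objset`) completing before the call, which suggests `tx` was readable. The following is a minimal sketch of that reading, not the ZFS source: the structure layout is an assumption that mirrors only the leading fields of `dmu_tx_hold_t` on an LP64 build, and `mock_hold_object_impl()`, `mock_hold_spill()` and `txh_dnode_offset()` are hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ZFS types; only the leading fields matter. */
typedef struct list_node {
	struct list_node *next;
	struct list_node *prev;
} list_node_t;

typedef struct dnode dnode_t;	/* opaque in this sketch */
typedef struct dmu_tx dmu_tx_t;	/* opaque in this sketch */

/* Assumed layout: pointer, two-pointer list node, then the dnode pointer. */
typedef struct dmu_tx_hold {
	dmu_tx_t	*txh_tx;	/* offset 0x00 */
	list_node_t	txh_node;	/* offset 0x08 */
	dnode_t		*txh_dnode;	/* offset 0x18 on LP64 */
} dmu_tx_hold_t;

/* Offset that a NULL txh would turn into the faulting address. */
static size_t
txh_dnode_offset(void)
{
	return (offsetof(dmu_tx_hold_t, txh_dnode));
}

/*
 * Stand-in for dmu_tx_hold_object_impl(): hands back the caller's hold
 * when the object can be held, NULL otherwise (assumed failure mode).
 */
static dmu_tx_hold_t *
mock_hold_object_impl(dmu_tx_hold_t *slot, int object_exists)
{
	return (object_exists ? slot : NULL);
}

/*
 * Guarded variant of the pattern in dmu_tx_hold_spill(): check the
 * returned hold before reading txh_dnode. Returns 0 on success and
 * -1 when no hold was created.
 */
static int
mock_hold_spill(dmu_tx_hold_t *slot, int object_exists, dnode_t **dnp)
{
	dmu_tx_hold_t *txh;

	assert(dnp != NULL);
	txh = mock_hold_object_impl(slot, object_exists);
	if (txh == NULL)
		return (-1);	/* the unguarded code would fault here */
	*dnp = txh->txh_dnode;
	return (0);
}
```

With this assumed layout, `txh_dnode_offset()` evaluates to 0x18 on x86_64, matching CR2. Calling `mock_hold_spill()` with `object_exists` set to 0 models a hold that fails (for example, if the object were unlinked concurrently, as the reproducer's mknod/unlink loop might allow) and returns -1 instead of dereferencing NULL. Whether that is what actually happened in this oops is not established by the report.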


Comments
Comment by Brian Behlendorf [ 24/Apr/13 ]

We had an occurrence of this issue; here's the relevant part of our logs. It sure looks like Lustre somehow called dmu_tx_hold_spill() with a NULL dmu_tx_t pointer.

2013-03-27 11:16:13 BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
2013-03-27 11:16:13 IP: [<ffffffffa036d2b6>] dmu_tx_hold_spill+0x26/0xa0 [zfs]