[LU-2844] NULL pointer deref on unmount Created: 20/Feb/13  Updated: 27/Feb/13  Resolved: 27/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Ned Bass Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: LB
Environment:

2.6.32-279.9.1.1chaos.ch5.1.x86_64


Severity: 3
Rank (Obsolete): 6884

 Description   

After reproducing LU-2843, mounting and unmounting the MDT leads to a NULL pointer dereference.

padlock: VIA PadLock Hash Engine not detected.
Lustre: Lustre: Build Version: 2.3.58-g6125dec-CHANGED-2.6.32-279.9.1.1chaos.ch5.1.x86_64
LNet: Added LNI 192.168.122.43@tcp [8/256/0/180]
LNet: Accept secure, port 988
Lustre: Echo OBD driver; http://www.lustre.org/
LDISKFS-fs (sda): recovery complete
LDISKFS-fs (sda): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: lustre-MDT0000: used disk, loading
Lustre: 29186:0:(mdt_lproc.c:380:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /home/bass6/lustre-release/lustre/tests/../utils/l_getidentity
LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
LustreError: 28964:0:(llog_osd.c:712:llog_osd_prev_block()) lustre-MDT0000-osd: invalid llog tail at log id 35/0 offset 5033984
LustreError: 28964:0:(mdd_device.c:323:mdd_changelog_llog_init()) lustre-MDD0000: changelog init failed: rc = -22
LustreError: 28964:0:(mdd_device.c:398:mdd_changelog_init()) lustre-MDD0000: changelog setup during init failed: rc = -22
LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: 28668:0:(client.c:1866:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1361403145/real 1361403145]  req@ffff880015c35000 x1427534649491496/t0(0) o8->lustre-OST0001-osc-MDT0000@0@lo:28/4 lens 400/544 e 0 to 1 dl 1361403150 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: Failing over lustre-MDT0000
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81283626>] __list_add+0x26/0xa0
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/queue/max_sectors_kb
CPU 1 
Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) exportfs mdd(U) mgs(U) lquota(U) jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ebtable_nat ebtables fuse autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate vhost_net macvtap macvlan tun kvm virtio_balloon virtio_net snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core sg ext4 mbcache jbd2 virtio_blk sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 29211, comm: umount Tainted: P           ---------------    2.6.32-279.9.1.1chaos.ch5.1.x86_64 #1 Bochs Bochs
crash> bt
PID: 29211  TASK: ffff88007b6f0040  CPU: 1   COMMAND: "umount"
 #0 [ffff8800182bd5c0] machine_kexec at ffffffff8103283b
 #1 [ffff8800182bd620] crash_kexec at ffffffff810ba492
 #2 [ffff8800182bd6f0] oops_end at ffffffff81501e60
 #3 [ffff8800182bd720] no_context at ffffffff81043bfb
 #4 [ffff8800182bd770] __bad_area_nosemaphore at ffffffff81043e85
 #5 [ffff8800182bd7c0] bad_area at ffffffff81043fae
 #6 [ffff8800182bd7f0] __do_page_fault at ffffffff81044760
 #7 [ffff8800182bd910] do_page_fault at ffffffff81503e3e
 #8 [ffff8800182bd940] page_fault at ffffffff815011f5
    [exception RIP: __list_add+38]
    RIP: ffffffff81283626  RSP: ffff8800182bd9f8  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff8800182bda38  RCX: 00000000ffffffff
    RDX: ffff8800182ee150  RSI: 0000000000000000  RDI: ffff8800182bda38
    RBP: ffff8800182bda18   R8: 0000000000000000   R9: ffff88001636ff80
    R10: 000000000000000f  R11: 0000000000000000  R12: ffff8800182ee150
    R13: 0000000000000000  R14: ffff8800182ee150  R15: ffffffffffffffff
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8800182bda20] __mutex_lock_slowpath at ffffffff814ff9ef
#10 [ffff8800182bda90] mutex_lock at ffffffff814ff8fb
#11 [ffff8800182bdab0] mdd_lfsck_stop at ffffffffa0c86eec [mdd]
#12 [ffff8800182bdb30] mdd_iocontrol at ffffffffa0c827bc [mdd]
#13 [ffff8800182bdb90] mdt_device_fini at ffffffffa0dca7fa [mdt]
#14 [ffff8800182bdbf0] class_cleanup at ffffffffa079bcb7 [obdclass]
#15 [ffff8800182bdc70] class_process_config at ffffffffa079d59c [obdclass]
#16 [ffff8800182bdd00] class_manual_cleanup at ffffffffa079e2d9 [obdclass]
#17 [ffff8800182bddc0] server_put_super at ffffffffa07aae7c [obdclass]
#18 [ffff8800182bde30] generic_shutdown_super at ffffffff8117d10b
#19 [ffff8800182bde50] kill_anon_super at ffffffff8117d1f6
#20 [ffff8800182bde70] lustre_kill_super at ffffffffa07a0136 [obdclass]
#21 [ffff8800182bde90] deactivate_super at ffffffff8117e270
#22 [ffff8800182bdeb0] mntput_no_expire at ffffffff8119a2ef
#23 [ffff8800182bdee0] sys_umount at ffffffff8119ad8b
#24 [ffff8800182bdf80] system_call_fastpath at ffffffff8100b0f2
    RIP: 00007f405efc7d67  RSP: 00007fff44ccb308  RFLAGS: 00010202
    RAX: 00000000000000a6  RBX: ffffffff8100b0f2  RCX: 000000000000ffc0
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00007f405fe84ad0
    RBP: 00007f405fe84ab0   R8: 0000000000000006   R9: 0000000000000000
    R10: 00007fff44ccb130  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 00007f405fe84b30
    ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b


 Comments   
Comment by Peter Jones [ 21/Feb/13 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 22/Feb/13 ]

The cause is that mdd_prepare fails before lfsck is set up. During cleanup, lfsck stop is still called, and it accesses data that was never initialized.
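The failure mode described here can be illustrated with a small userspace sketch. This is not the Lustre code and not the patch under review; the ticket does not show the fix. The names `demo_lfsck`, `demo_lfsck_setup`, and `demo_lfsck_stop`, the `initialized` flag, and the `-ENODEV` return value are all hypothetical. It only shows the general pattern of a stop path that refuses to take a lock that setup never initialized, which is consistent with the backtrace above, where mdd_lfsck_stop faults inside mutex_lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device lfsck state; not the Lustre struct. */
struct demo_lfsck {
    pthread_mutex_t lock;  /* only valid after demo_lfsck_setup() */
    bool initialized;      /* set once setup has completed */
    bool running;          /* whether a scan is considered active */
};

/* Setup path: initialize the lock, then mark the state as usable. */
int demo_lfsck_setup(struct demo_lfsck *lfsck)
{
    assert(lfsck != NULL);

    int rc = pthread_mutex_init(&lfsck->lock, NULL);
    if (rc != 0)
        return -rc;

    lfsck->running = false;
    lfsck->initialized = true;
    return 0;
}

/*
 * Stop path: if setup never ran (for example because an earlier
 * prepare step failed), return an error instead of touching the lock.
 * The -ENODEV value is an arbitrary choice for this sketch.
 */
int demo_lfsck_stop(struct demo_lfsck *lfsck)
{
    assert(lfsck != NULL);

    if (!lfsck->initialized)
        return -ENODEV;

    pthread_mutex_lock(&lfsck->lock);
    lfsck->running = false;
    pthread_mutex_unlock(&lfsck->lock);
    return 0;
}
```

With a zero-filled `struct demo_lfsck`, calling `demo_lfsck_stop()` before `demo_lfsck_setup()` returns the error code without touching the lock; after setup it takes the lock and clears `running`. An alternative design would be to skip the stop call entirely in the cleanup path when setup did not complete.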

Comment by Lai Siyao [ 22/Feb/13 ]

The patch is at http://review.whamcloud.com/#change,5519

Comment by Peter Jones [ 27/Feb/13 ]

Landed for 2.4
