[LU-5747] NULL pointer dereference in task_rq_lock when running mds-survey Created: 15/Oct/14  Updated: 20/Jul/15  Resolved: 20/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Trivial
Reporter: Isaac Huang (Inactive) Assignee: Isaac Huang (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 16140

 Description   

I can reliably hit this oops when running mds-survey (master at de24d3e0fe4e77654358ed7d5d672fa94e957ef5 on kernel 2.6.32-358.18.1.el6_lustre.x86_64):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff81055d52>] task_rq_lock+0x42/0xa0
PGD 341879067 PUD 3571d1067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 6
Modules linked in: obdecho(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) nodemap(U) mgc(U) osd_zfs(U) lquota(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) netconsole configfs ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support r8169 mii sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i7core_edac edac_core shpchp ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic pata_jmicron ahci nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core mxm_wmi video output wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 2157, comm: lctl Tainted: P           ---------------    2.6.32-358.18.1.el6_lustre.x86_64 #1 OEM OEM/132-BL-E758
RIP: 0010:[<ffffffff81055d52>]  [<ffffffff81055d52>] task_rq_lock+0x42/0xa0
RSP: 0018:ffff88033d3f56d8  EFLAGS: 00010086
RAX: 0000000000000286 RBX: 0000000000016740 RCX: ffff880351405378
RDX: 0000000000000286 RSI: ffff88033d3f5730 RDI: 0000000000000000
RBP: ffff88033d3f56f8 R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000000
R13: ffff88033d3f5730 R14: 0000000000000006 R15: 000000000000000f
FS:  00007f7e0f644700(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000033cc78000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lctl (pid: 2157, threadinfo ffff88033d3f4000, task ffff880357118aa0)
Stack:
 0000000000000000 ffff8803447c74a0 0000000000000000 0000000000000006
<d> ffff88033d3f5768 ffffffff8106306c ffff88033d3f5728 ffffffffa0f0e219
<d> ffff88033ce5aa70 ffff88033c2bc1c8 ffff88033d3f57a8 0000000000000286
Call Trace:
 [<ffffffff8106306c>] try_to_wake_up+0x3c/0x3e0
 [<ffffffffa0f0e219>] ? echo_object_free+0x159/0x2f0 [obdecho]
 [<ffffffff81063465>] wake_up_process+0x15/0x20
 [<ffffffff8150f7e4>] __mutex_unlock_slowpath+0x44/0x60
 [<ffffffff8150f79b>] mutex_unlock+0x1b/0x20
 [<ffffffffa07a4907>] lu_site_purge+0x3f7/0x4e0 [obdclass]
 [<ffffffffa07a4e31>] lu_object_limit+0x71/0x80 [obdclass]
 [<ffffffffa07a4f93>] lu_object_find_try+0x153/0x2b0 [obdclass]
 [<ffffffffa07a51a3>] lu_object_find_at+0xb3/0x100 [obdclass]
 [<ffffffffa0b5d6ca>] ? mdd_lookup+0x12a/0x170 [mdd]
 [<ffffffffa0f10013>] echo_md_create_internal+0x153/0x640 [obdecho]
 [<ffffffffa0f18af3>] echo_md_handler+0x1383/0x1930 [obdecho]
 [<ffffffffa0f1c84e>] echo_client_iocontrol+0x1bae/0x30f0 [obdecho]
 [<ffffffff81281826>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa063d27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
 [<ffffffffa0753ed5>] ? obd_ioctl_getdata+0x145/0x1150 [obdclass]
 [<ffffffffa076c77c>] class_handle_ioctl+0x163c/0x21c0 [obdclass]
 [<ffffffffa07532ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff81195352>] vfs_ioctl+0x22/0xa0
 [<ffffffff81511365>] ? page_fault+0x25/0x30
 [<ffffffff811954f4>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81195a71>] sys_ioctl+0x81/0xa0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

This is likely a bug in the Linux kernel:
https://bugzilla.kernel.org/show_bug.cgi?id=27142

The mutex in question was introduced by http://review.whamcloud.com/#/c/11099/
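
For triage, the call trace above can be reduced to its reliable frames. Lines marked with `?` are addresses the kernel found on the stack that may not belong to the actual call chain. The following is a minimal Python sketch, assuming the oops text is available as a string; the helper name `reliable_frames` and the `SAMPLE` excerpt are illustrative only and are not part of Lustre or mds-survey:

```python
import re

# Matches one call-trace line, for example:
#  [<ffffffff8106306c>] try_to_wake_up+0x3c/0x3e0
#  [<ffffffffa0f0e219>] ? echo_object_free+0x159/0x2f0 [obdecho]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)

def reliable_frames(oops_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in oops_text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group("unreliable"):
            # Frames without a [module] suffix belong to the core kernel.
            frames.append((m.group("sym"), m.group("mod") or "kernel"))
    return frames

# Excerpt copied from the call trace in this report.
SAMPLE = """\
 [<ffffffff8106306c>] try_to_wake_up+0x3c/0x3e0
 [<ffffffffa0f0e219>] ? echo_object_free+0x159/0x2f0 [obdecho]
 [<ffffffff81063465>] wake_up_process+0x15/0x20
 [<ffffffff8150f7e4>] __mutex_unlock_slowpath+0x44/0x60
 [<ffffffff8150f79b>] mutex_unlock+0x1b/0x20
 [<ffffffffa07a4907>] lu_site_purge+0x3f7/0x4e0 [obdclass]
"""

for sym, mod in reliable_frames(SAMPLE):
    print(f"{sym} ({mod})")
```

On this excerpt the filter drops the `echo_object_free` entry and leaves the chain from `lu_site_purge` through `mutex_unlock` down to `try_to_wake_up`, which is the path that ends in `task_rq_lock`.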



 Comments   
Comment by Andreas Dilger [ 17/Oct/14 ]

Isaac, can you see if this bug is fixed in the 2.6.32-431.29.2.el6 (RHEL6.5) kernel? That is what is supported for 2.5.3 and 2.6+ so it makes sense to be using that kernel for testing. I am also running the 2.6.32-358.23.2.el6 kernel on my test system, but I'm going to update it because of LU-5722, which looks like it may also be a kernel bug.

Comment by Isaac Huang (Inactive) [ 17/Oct/14 ]

Maybe I missed something, but the latest kernel RPM I could find on our download site was 2.6.32-431.20.3.el6_lustre.x86_64, on which the same error happened:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff81058e52>] task_rq_lock+0x42/0xa0

Comment by Andreas Dilger [ 17/Oct/14 ]

Can you please submit a patch to Lustre that applies the upstream patch to our server kernel? It would also be good to figure out how to request that this patch be included in RHEL.

Comment by Jian Yu [ 17/Jan/15 ]

While verifying patch http://review.whamcloud.com/10130 on the master branch, mds-survey hit the same failure on the MDS:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
PGD 6c82e067 PUD 6c909067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/online
CPU 1 
Modules linked in: obdecho(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 5628, comm: lctl Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff81058e52>]  [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
RSP: 0018:ffff88006d1e3738  EFLAGS: 00010082 
RAX: 0000000000000282 RBX: 0000000000016880 RCX: ffff880071abdd38
RDX: 0000000000000282 RSI: ffff88006d1e3790 RDI: 0000000000000000
RBP: ffff88006d1e3758 R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000000
R13: ffff88006d1e3790 R14: 0000000000000001 R15: 000000000000000f
FS:  00007f2b4b7fe700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000006d3fb000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lctl (pid: 5628, threadinfo ffff88006d1e2000, task ffff880073bc0080)
Stack:
 0000000000000000 ffff88006d7540a0 0000000000000000 0000000000000001
<d> ffff88006d1e37c8 ffffffff8106195c ffff88006d1e3788 ffffffffa122eb0c
<d> ffff88007213fb18 ffff88006b7b56f8 ffff88006d1e3808 0000000000000282
Call Trace:
 [<ffffffff8106195c>] try_to_wake_up+0x3c/0x3e0
 [<ffffffffa122eb0c>] ? echo_object_free+0x24c/0x460 [obdecho]
 [<ffffffff81061d55>] wake_up_process+0x15/0x20
 [<ffffffff8152aa74>] __mutex_unlock_slowpath+0x44/0x60
 [<ffffffff8152aa2b>] mutex_unlock+0x1b/0x20
 [<ffffffffa07742af>] lu_site_purge+0x3ff/0x4e0 [obdclass]
 [<ffffffffa07747d1>] lu_object_limit+0x71/0x80 [obdclass]
 [<ffffffffa0774933>] lu_object_find_try+0x153/0x2b0 [obdclass]
 [<ffffffffa0774b41>] lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffffa1170e91>] ? mdd_lookup+0xe1/0x170 [mdd]
 [<ffffffffa1230a03>] echo_md_create_internal+0x153/0x640 [obdecho]
 [<ffffffffa123abc0>] echo_md_handler+0x1300/0x1860 [obdecho]
 [<ffffffffa123c90c>] echo_client_iocontrol+0x17ec/0x2aa0 [obdecho]
 [<ffffffffa060627b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs] 
 [<ffffffff8118d475>] ? chrdev_open+0x125/0x230
 [<ffffffff811ab820>] ? mntput_no_expire+0x30/0x110
 [<ffffffff8116fe9c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa0723f51>] ? obd_ioctl_getdata+0xe1/0x1140 [obdclass]
 [<ffffffffa073c7fc>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
 [<ffffffffa07232ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
 [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
 [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810a5e07>] ? getnstimeofday+0x57/0xe0
 [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
 [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: 89 74 24 18 0f 1f 44 00 00 48 c7 c3 80 68 01 00 49 89 fc 49 89 f5 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 <49> 8b 44 24 08 49 89 de 8b 40 18 4c 03 34 c5 a0 cf bf 81 4c 89 
RIP  [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
 RSP <ffff88006d1e3738>
CR2: 0000000000000008

The kernel version was 2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64.

Maloo report: https://testing.hpdd.intel.com/test_sets/8a17fb50-9e4c-11e4-8c99-5254006e85c2

The failure did not occur every time mds-survey ran. For example, the following reports show mds-survey passing with kernel 2.6.32-431.29.2.el6:
https://testing.hpdd.intel.com/test_sessions/56234146-8f2a-11e4-89f3-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/96e23bb2-795c-11e4-9e8a-5254006e85c2
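
The register dump in the oops above can be cross-checked against the `Code:` line. The faulting bytes `<49> 8b 44 24 08` decode to `mov 0x8(%r12),%rax`, and the earlier bytes `49 89 fc` decode to `mov %rdi,%r12`, so R12 holds the first argument of `task_rq_lock`. The sketch below is a rough Python check, assuming the register lines are pasted as text; `parse_registers` is a hypothetical helper and the displacement is taken by hand from the instruction bytes:

```python
import re

# Register lines copied from the oops in this comment.
REGS = """\
RAX: 0000000000000282 RBX: 0000000000016880 RCX: ffff880071abdd38
RDX: 0000000000000282 RSI: ffff88006d1e3790 RDI: 0000000000000000
RBP: ffff88006d1e3758 R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000000
CR2: 0000000000000008 CR3: 000000006d3fb000 CR4: 00000000000006e0
"""

def parse_registers(text):
    """Map register names (RAX, R12, CR2, ...) to integer values."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}

regs = parse_registers(REGS)

# Displacement byte of the faulting instruction 49 8b 44 24 08,
# i.e. mov 0x8(%r12),%rax (decoded by hand, not by this script).
DISP = 0x8

fault_addr = regs["R12"] + DISP
print(hex(fault_addr), fault_addr == regs["CR2"])  # prints: 0x8 True
```

The computed address matches CR2, which is consistent with `task_rq_lock` having been called with a NULL task pointer by the wakeup in `__mutex_unlock_slowpath`. This check only confirms what was dereferenced; it does not show why the pointer was NULL.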

Comment by Niu Yawei (Inactive) [ 20/Jul/15 ]

This is likely a bug in the Linux kernel:
https://bugzilla.kernel.org/show_bug.cgi?id=27142

This kernel defect appears to affect only user-space applications.

Comment by Niu Yawei (Inactive) [ 20/Jul/15 ]

Duplicate of LU-6765.
