[LU-5747] NULL pointer dereference in task_rq_lock when running mds-survey Created: 15/Oct/14 Updated: 20/Jul/15 Resolved: 20/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Trivial |
| Reporter: | Isaac Huang (Inactive) | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 16140 |
| Description |
|
I can reliably hit it when running mds-survey (master at de24d3e0fe4e77654358ed7d5d672fa94e957ef5 on 2.6.32-358.18.1.el6_lustre.x86_64):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff81055d52>] task_rq_lock+0x42/0xa0
PGD 341879067 PUD 3571d1067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 6
Modules linked in: obdecho(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) nodemap(U) mgc(U) osd_zfs(U) lquota(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) netconsole configfs ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support r8169 mii sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i7core_edac edac_core shpchp ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic pata_jmicron ahci nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core mxm_wmi video output wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 2157, comm: lctl Tainted: P --------------- 2.6.32-358.18.1.el6_lustre.x86_64 #1 OEM OEM/132-BL-E758
RIP: 0010:[<ffffffff81055d52>] [<ffffffff81055d52>] task_rq_lock+0x42/0xa0
RSP: 0018:ffff88033d3f56d8 EFLAGS: 00010086
RAX: 0000000000000286 RBX: 0000000000016740 RCX: ffff880351405378
RDX: 0000000000000286 RSI: ffff88033d3f5730 RDI: 0000000000000000
RBP: ffff88033d3f56f8 R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000000
R13: ffff88033d3f5730 R14: 0000000000000006 R15: 000000000000000f
FS: 00007f7e0f644700(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000033cc78000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lctl (pid: 2157, threadinfo ffff88033d3f4000, task ffff880357118aa0)
Stack:
0000000000000000 ffff8803447c74a0 0000000000000000 0000000000000006
<d> ffff88033d3f5768 ffffffff8106306c ffff88033d3f5728 ffffffffa0f0e219
<d> ffff88033ce5aa70 ffff88033c2bc1c8 ffff88033d3f57a8 0000000000000286
Call Trace:
[<ffffffff8106306c>] try_to_wake_up+0x3c/0x3e0
[<ffffffffa0f0e219>] ? echo_object_free+0x159/0x2f0 [obdecho]
[<ffffffff81063465>] wake_up_process+0x15/0x20
[<ffffffff8150f7e4>] __mutex_unlock_slowpath+0x44/0x60
[<ffffffff8150f79b>] mutex_unlock+0x1b/0x20
[<ffffffffa07a4907>] lu_site_purge+0x3f7/0x4e0 [obdclass]
[<ffffffffa07a4e31>] lu_object_limit+0x71/0x80 [obdclass]
[<ffffffffa07a4f93>] lu_object_find_try+0x153/0x2b0 [obdclass]
[<ffffffffa07a51a3>] lu_object_find_at+0xb3/0x100 [obdclass]
[<ffffffffa0b5d6ca>] ? mdd_lookup+0x12a/0x170 [mdd]
[<ffffffffa0f10013>] echo_md_create_internal+0x153/0x640 [obdecho]
[<ffffffffa0f18af3>] echo_md_handler+0x1383/0x1930 [obdecho]
[<ffffffffa0f1c84e>] echo_client_iocontrol+0x1bae/0x30f0 [obdecho]
[<ffffffff81281826>] ? vsnprintf+0x336/0x5e0
[<ffffffffa063d27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
[<ffffffffa0753ed5>] ? obd_ioctl_getdata+0x145/0x1150 [obdclass]
[<ffffffffa076c77c>] class_handle_ioctl+0x163c/0x21c0 [obdclass]
[<ffffffffa07532ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff81195352>] vfs_ioctl+0x22/0xa0
[<ffffffff81511365>] ? page_fault+0x25/0x30
[<ffffffff811954f4>] do_vfs_ioctl+0x84/0x580
[<ffffffff81195a71>] sys_ioctl+0x81/0xa0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

This is likely a bug in the Linux kernel:

The mutex in question was introduced by http://review.whamcloud.com/#/c/11099/ |
| Comments |
| Comment by Andreas Dilger [ 17/Oct/14 ] |
|
Isaac, can you see if this bug is fixed in the 2.6.32-431.29.2.el6 (RHEL6.5) kernel? That is what is supported for 2.5.3 and 2.6+ so it makes sense to be using that kernel for testing. I am also running the 2.6.32-358.23.2.el6 kernel on my test system, but I'm going to update it because of |
| Comment by Isaac Huang (Inactive) [ 17/Oct/14 ] |
|
Maybe I missed something, but the latest kernel RPM I could find on our download site was 2.6.32-431.20.3.el6_lustre.x86_64, on which the same error happened:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff81058e52>] task_rq_lock+0x42/0xa0 |
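The reports on the two kernels show different instruction addresses but the same symbol and offset (task_rq_lock+0x42/0xa0), which is what suggests the same fault site. A minimal Python sketch of that comparison follows. The regular expression and the helper name `parse_ip` are illustrative assumptions, not part of any Lustre or kernel tooling, and the pattern only covers the `IP:` line format quoted in this ticket.

```python
import re

# Assumed pattern for the "IP:" line of an oops as quoted in this ticket, e.g.
#   IP: [<ffffffff81055d52>] task_rq_lock+0x42/0xa0
IP_RE = re.compile(r"IP: \[<([0-9a-f]+)>\] ([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_ip(line):
    """Hypothetical helper: split an oops IP line into address, symbol, offset, size."""
    match = IP_RE.search(line)
    if match is None:
        return None
    addr, symbol, offset, size = match.groups()
    return {
        "addr": int(addr, 16),
        "symbol": symbol,
        "offset": int(offset, 16),
        "size": int(size, 16),
    }


# IP lines copied from the two reports in this ticket.
first = parse_ip("IP: [<ffffffff81055d52>] task_rq_lock+0x42/0xa0")
second = parse_ip("IP: [<ffffffff81058e52>] task_rq_lock+0x42/0xa0")

# Absolute addresses differ between kernel builds; symbol+offset is the stable key.
same_site = (first["symbol"], first["offset"]) == (second["symbol"], second["offset"])
print(same_site)
```

Comparing on symbol and offset rather than on the raw address is the design choice here, since the address of task_rq_lock moves between kernel builds.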
| Comment by Andreas Dilger [ 17/Oct/14 ] |
|
Can you please submit a patch to Lustre to apply the upstream patch to our server kernel? It would also be good to figure out how to request that this patch be included in RHEL. |
| Comment by Jian Yu [ 17/Jan/15 ] |
|
While verifying patch http://review.whamcloud.com/10130 on the master branch, mds-survey hit the same failure on the MDS:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
PGD 6c82e067 PUD 6c909067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 1
Modules linked in: obdecho(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 5628, comm: lctl Tainted: P --------------- 2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff81058e52>] [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
RSP: 0018:ffff88006d1e3738 EFLAGS: 00010082
RAX: 0000000000000282 RBX: 0000000000016880 RCX: ffff880071abdd38
RDX: 0000000000000282 RSI: ffff88006d1e3790 RDI: 0000000000000000
RBP: ffff88006d1e3758 R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000000
R13: ffff88006d1e3790 R14: 0000000000000001 R15: 000000000000000f
FS: 00007f2b4b7fe700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000006d3fb000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lctl (pid: 5628, threadinfo ffff88006d1e2000, task ffff880073bc0080)
Stack:
0000000000000000 ffff88006d7540a0 0000000000000000 0000000000000001
<d> ffff88006d1e37c8 ffffffff8106195c ffff88006d1e3788 ffffffffa122eb0c
<d> ffff88007213fb18 ffff88006b7b56f8 ffff88006d1e3808 0000000000000282
Call Trace:
[<ffffffff8106195c>] try_to_wake_up+0x3c/0x3e0
[<ffffffffa122eb0c>] ? echo_object_free+0x24c/0x460 [obdecho]
[<ffffffff81061d55>] wake_up_process+0x15/0x20
[<ffffffff8152aa74>] __mutex_unlock_slowpath+0x44/0x60
[<ffffffff8152aa2b>] mutex_unlock+0x1b/0x20
[<ffffffffa07742af>] lu_site_purge+0x3ff/0x4e0 [obdclass]
[<ffffffffa07747d1>] lu_object_limit+0x71/0x80 [obdclass]
[<ffffffffa0774933>] lu_object_find_try+0x153/0x2b0 [obdclass]
[<ffffffffa0774b41>] lu_object_find_at+0xb1/0xe0 [obdclass]
[<ffffffffa1170e91>] ? mdd_lookup+0xe1/0x170 [mdd]
[<ffffffffa1230a03>] echo_md_create_internal+0x153/0x640 [obdecho]
[<ffffffffa123abc0>] echo_md_handler+0x1300/0x1860 [obdecho]
[<ffffffffa123c90c>] echo_client_iocontrol+0x17ec/0x2aa0 [obdecho]
[<ffffffffa060627b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
[<ffffffff8118d475>] ? chrdev_open+0x125/0x230
[<ffffffff811ab820>] ? mntput_no_expire+0x30/0x110
[<ffffffff8116fe9c>] ? __kmalloc+0x20c/0x220
[<ffffffffa0723f51>] ? obd_ioctl_getdata+0xe1/0x1140 [obdclass]
[<ffffffffa073c7fc>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
[<ffffffffa07232ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
[<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
[<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
[<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff810a5e07>] ? getnstimeofday+0x57/0xe0
[<ffffffff8119f091>] sys_ioctl+0x81/0xa0
[<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: 89 74 24 18 0f 1f 44 00 00 48 c7 c3 80 68 01 00 49 89 fc 49 89 f5 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 <49> 8b 44 24 08 49 89 de 8b 40 18 4c 03 34 c5 a0 cf bf 81 4c 89
RIP [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
RSP <ffff88006d1e3738>
CR2: 0000000000000008

The kernel version was 2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64.

Maloo report: https://testing.hpdd.intel.com/test_sets/8a17fb50-9e4c-11e4-8c99-5254006e85c2

The failure did not occur every time mds-survey was run. E.g., the following are passing reports for mds-survey running with kernel 2.6.32-431.29.2.el6: |
| Comment by Niu Yawei (Inactive) [ 20/Jul/15 ] |
This kernel defect appears to affect only user-space applications. |
| Comment by Niu Yawei (Inactive) [ 20/Jul/15 ] |
|
Dup of |