LU-5747

NULL pointer dereference in task_rq_lock when running mds-survey

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Trivial
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 16140

    Description

      I can reliably hit the following oops when running mds-survey (master at de24d3e0fe4e77654358ed7d5d672fa94e957ef5 on kernel 2.6.32-358.18.1.el6_lustre.x86_64):

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      IP: [<ffffffff81055d52>] task_rq_lock+0x42/0xa0
      PGD 341879067 PUD 3571d1067 PMD 0
      Oops: 0000 [#1] SMP
      last sysfs file: /sys/devices/system/cpu/online
      CPU 6
      Modules linked in: obdecho(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) nodemap(U) mgc(U) osd_zfs(U) lquota(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) netconsole configfs ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support r8169 mii sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i7core_edac edac_core shpchp ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic pata_jmicron ahci nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core mxm_wmi video output wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      
      Pid: 2157, comm: lctl Tainted: P           ---------------    2.6.32-358.18.1.el6_lustre.x86_64 #1 OEM OEM/132-BL-E758
      RIP: 0010:[<ffffffff81055d52>]  [<ffffffff81055d52>] task_rq_lock+0x42/0xa0
      RSP: 0018:ffff88033d3f56d8  EFLAGS: 00010086
      RAX: 0000000000000286 RBX: 0000000000016740 RCX: ffff880351405378
      RDX: 0000000000000286 RSI: ffff88033d3f5730 RDI: 0000000000000000
      RBP: ffff88033d3f56f8 R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
      R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000000
      R13: ffff88033d3f5730 R14: 0000000000000006 R15: 000000000000000f
      FS:  00007f7e0f644700(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000008 CR3: 000000033cc78000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process lctl (pid: 2157, threadinfo ffff88033d3f4000, task ffff880357118aa0)
      Stack:
       0000000000000000 ffff8803447c74a0 0000000000000000 0000000000000006                                                                                                                                                                                           
      <d> ffff88033d3f5768 ffffffff8106306c ffff88033d3f5728 ffffffffa0f0e219
      <d> ffff88033ce5aa70 ffff88033c2bc1c8 ffff88033d3f57a8 0000000000000286
      Call Trace:
       [<ffffffff8106306c>] try_to_wake_up+0x3c/0x3e0
       [<ffffffffa0f0e219>] ? echo_object_free+0x159/0x2f0 [obdecho]
       [<ffffffff81063465>] wake_up_process+0x15/0x20
       [<ffffffff8150f7e4>] __mutex_unlock_slowpath+0x44/0x60
       [<ffffffff8150f79b>] mutex_unlock+0x1b/0x20
       [<ffffffffa07a4907>] lu_site_purge+0x3f7/0x4e0 [obdclass]
       [<ffffffffa07a4e31>] lu_object_limit+0x71/0x80 [obdclass]
       [<ffffffffa07a4f93>] lu_object_find_try+0x153/0x2b0 [obdclass]
       [<ffffffffa07a51a3>] lu_object_find_at+0xb3/0x100 [obdclass]
       [<ffffffffa0b5d6ca>] ? mdd_lookup+0x12a/0x170 [mdd]
       [<ffffffffa0f10013>] echo_md_create_internal+0x153/0x640 [obdecho]
       [<ffffffffa0f18af3>] echo_md_handler+0x1383/0x1930 [obdecho]
       [<ffffffffa0f1c84e>] echo_client_iocontrol+0x1bae/0x30f0 [obdecho]
       [<ffffffff81281826>] ? vsnprintf+0x336/0x5e0
       [<ffffffffa063d27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
       [<ffffffffa0753ed5>] ? obd_ioctl_getdata+0x145/0x1150 [obdclass]
       [<ffffffffa076c77c>] class_handle_ioctl+0x163c/0x21c0 [obdclass]
       [<ffffffffa07532ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
       [<ffffffff81195352>] vfs_ioctl+0x22/0xa0
       [<ffffffff81511365>] ? page_fault+0x25/0x30
       [<ffffffff811954f4>] do_vfs_ioctl+0x84/0x580
       [<ffffffff81195a71>] sys_ioctl+0x81/0xa0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      This is likely a bug in the Linux kernel:
      https://bugzilla.kernel.org/show_bug.cgi?id=27142

      The mutex in question was introduced by http://review.whamcloud.com/#/c/11099/
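The register dump above can be cross-checked mechanically. The sketch below parses the register lines copied from the oops and reports the fault address (CR2), the general-purpose registers that hold zero, and the registers filled with a repeated 0x5a byte. Two assumptions are made here and are not stated in the report: that RDI carries the first function argument (the usual x86-64 calling convention), which would make the task pointer passed to task_rq_lock NULL, and that the 0x5a fill is a freed-memory poison pattern. The helper names are hypothetical.

```python
import re

# Register lines copied verbatim from the oops in the description.
OOPS = """\
RAX: 0000000000000286 RBX: 0000000000016740 RCX: ffff880351405378
RDX: 0000000000000286 RSI: ffff88033d3f5730 RDI: 0000000000000000
RBP: ffff88033d3f56f8 R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000000
R13: ffff88033d3f5730 R14: 0000000000000006 R15: 000000000000000f
CR2: 0000000000000008 CR3: 000000033cc78000 CR4: 00000000000007e0
"""

# Matches "NAME: <16 hex digits>", e.g. "RDI: 0000000000000000".
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}):\s+([0-9a-f]{16})\b")


def parse_registers(text):
    """Return a dict mapping register name to its integer value."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


def summarize(regs, poison_byte=0x5A):
    """Report the fault address, NULL registers, and poisoned registers.

    poison_byte is an assumption: a byte repeated across the whole
    register is treated as a poison pattern.
    """
    poison = int.from_bytes(bytes([poison_byte]) * 8, "big")
    general = {n: v for n, v in regs.items() if not n.startswith(("CR", "DR"))}
    return {
        "fault_address": regs.get("CR2"),
        "null": sorted(n for n, v in general.items() if v == 0),
        "poisoned": sorted(n for n, v in general.items() if v == poison),
    }


if __name__ == "__main__":
    print(summarize(parse_registers(OOPS)))
```

With RDI and R12 both zero and CR2 equal to 0x8, the dump is consistent with a read at offset 8 from a NULL pointer.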

        Activity

          [LU-5747] NULL pointer dereference in task_rq_lock when running mds-survey
          niu Niu Yawei (Inactive) added a comment - Duplicate of LU-6765.

          niu Niu Yawei (Inactive) added a comment -

          This is likely a bug in the Linux kernel:
          https://bugzilla.kernel.org/show_bug.cgi?id=27142

          This kernel defect appears to affect only user-space applications.

          yujian Jian Yu added a comment -

          While verifying patch http://review.whamcloud.com/10130 on the master branch, mds-survey hit the same failure on the MDS:

          BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
          IP: [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
          PGD 6c82e067 PUD 6c909067 PMD 0 
          Oops: 0000 [#1] SMP 
          last sysfs file: /sys/devices/system/cpu/online
          CPU 1 
          Modules linked in: obdecho(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
          
          Pid: 5628, comm: lctl Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64 #1 Red Hat KVM
          RIP: 0010:[<ffffffff81058e52>]  [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
          RSP: 0018:ffff88006d1e3738  EFLAGS: 00010082 
          RAX: 0000000000000282 RBX: 0000000000016880 RCX: ffff880071abdd38
          RDX: 0000000000000282 RSI: ffff88006d1e3790 RDI: 0000000000000000
          RBP: ffff88006d1e3758 R08: 0000000000000002 R09: 5a5a5a5a5a5a5a5a
          R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000000
          R13: ffff88006d1e3790 R14: 0000000000000001 R15: 000000000000000f
          FS:  00007f2b4b7fe700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 0000000000000008 CR3: 000000006d3fb000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          Process lctl (pid: 5628, threadinfo ffff88006d1e2000, task ffff880073bc0080)
          Stack:
           0000000000000000 ffff88006d7540a0 0000000000000000 0000000000000001
          <d> ffff88006d1e37c8 ffffffff8106195c ffff88006d1e3788 ffffffffa122eb0c
          <d> ffff88007213fb18 ffff88006b7b56f8 ffff88006d1e3808 0000000000000282
          Call Trace:
           [<ffffffff8106195c>] try_to_wake_up+0x3c/0x3e0
           [<ffffffffa122eb0c>] ? echo_object_free+0x24c/0x460 [obdecho]
           [<ffffffff81061d55>] wake_up_process+0x15/0x20
           [<ffffffff8152aa74>] __mutex_unlock_slowpath+0x44/0x60
           [<ffffffff8152aa2b>] mutex_unlock+0x1b/0x20
           [<ffffffffa07742af>] lu_site_purge+0x3ff/0x4e0 [obdclass]
           [<ffffffffa07747d1>] lu_object_limit+0x71/0x80 [obdclass]
           [<ffffffffa0774933>] lu_object_find_try+0x153/0x2b0 [obdclass]
           [<ffffffffa0774b41>] lu_object_find_at+0xb1/0xe0 [obdclass]
           [<ffffffffa1170e91>] ? mdd_lookup+0xe1/0x170 [mdd]
           [<ffffffffa1230a03>] echo_md_create_internal+0x153/0x640 [obdecho]
           [<ffffffffa123abc0>] echo_md_handler+0x1300/0x1860 [obdecho]
           [<ffffffffa123c90c>] echo_client_iocontrol+0x17ec/0x2aa0 [obdecho]
           [<ffffffffa060627b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs] 
           [<ffffffff8118d475>] ? chrdev_open+0x125/0x230
           [<ffffffff811ab820>] ? mntput_no_expire+0x30/0x110
           [<ffffffff8116fe9c>] ? __kmalloc+0x20c/0x220
           [<ffffffffa0723f51>] ? obd_ioctl_getdata+0xe1/0x1140 [obdclass]
           [<ffffffffa073c7fc>] class_handle_ioctl+0x15fc/0x2180 [obdclass]
           [<ffffffffa07232ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
           [<ffffffff8119e972>] vfs_ioctl+0x22/0xa0
           [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
           [<ffffffff8119eb14>] do_vfs_ioctl+0x84/0x580
           [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
           [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
           [<ffffffff810a5e07>] ? getnstimeofday+0x57/0xe0
           [<ffffffff8119f091>] sys_ioctl+0x81/0xa0
           [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
           [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
          Code: 89 74 24 18 0f 1f 44 00 00 48 c7 c3 80 68 01 00 49 89 fc 49 89 f5 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 <49> 8b 44 24 08 49 89 de 8b 40 18 4c 03 34 c5 a0 cf bf 81 4c 89 
          RIP  [<ffffffff81058e52>] task_rq_lock+0x42/0xa0
           RSP <ffff88006d1e3738>
          CR2: 0000000000000008
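
The faulting instruction can be read directly from the "Code:" line, where the byte in angle brackets marks the instruction pointer. The sketch below extracts the bytes from that marker onward and decodes the one encoding that occurs here: a REX.W-prefixed `mov` load (opcode 0x8b) with a SIB byte and an 8-bit displacement. It is a hand-written decoder for this single form, not a general disassembler, and the function names are hypothetical.

```python
# "Code:" line copied verbatim from the oops above.
CODE_LINE = (
    "Code: 89 74 24 18 0f 1f 44 00 00 48 c7 c3 80 68 01 00 49 89 fc 49 89 f5 "
    "9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 "
    "<49> 8b 44 24 08 49 89 de 8b 40 18 4c 03 34 c5 a0 cf bf 81 4c 89"
)

REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def faulting_bytes(code_line):
    """Return the bytes starting at the <xx> marker (the faulting RIP)."""
    tokens = code_line.split()[1:]  # drop the leading "Code:"
    for i, tok in enumerate(tokens):
        if tok.startswith("<"):
            tail = [tok.strip("<>")] + tokens[i + 1:]
            return bytes(int(t, 16) for t in tail)
    raise ValueError("no <..> marker found")


def decode_mov_load_disp8(b):
    """Decode 'REX.W 8B /r' with mod=01 and a SIB byte without an index.

    Returns (destination register, base register, signed displacement).
    """
    rex, opcode, modrm, sib, disp = b[0], b[1], b[2], b[3], b[4]
    if rex & 0xF8 != 0x48 or opcode != 0x8B:
        raise ValueError("not a REX.W mov r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm != 4:
        raise ValueError("expected disp8 addressing with a SIB byte")
    index, base = (sib >> 3) & 7, sib & 7
    if index != 4 or rex & 0x02:
        raise ValueError("index registers are not handled in this sketch")
    reg |= (rex & 0x04) << 1   # REX.R extends the destination register
    base |= (rex & 0x01) << 3  # REX.B extends the base register
    if disp >= 0x80:
        disp -= 0x100          # sign-extend the 8-bit displacement
    return REGS[reg], REGS[base], disp


if __name__ == "__main__":
    dest, base, disp = decode_mov_load_disp8(faulting_bytes(CODE_LINE))
    print(f"mov {disp:#x}(%{base}),%{dest}")
```

The decoded load reads from offset 8 of R12. R12 is zero in the register dump, so the access lands at address 0x8, which matches CR2.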
          

          The kernel version was 2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64.

          Maloo report: https://testing.hpdd.intel.com/test_sets/8a17fb50-9e4c-11e4-8c99-5254006e85c2

          The failure did not occur every time mds-survey was run. For example, the following are passing reports for mds-survey running with kernel 2.6.32-431.29.2.el6:
          https://testing.hpdd.intel.com/test_sessions/56234146-8f2a-11e4-89f3-5254006e85c2
          https://testing.hpdd.intel.com/test_sessions/96e23bb2-795c-11e4-9e8a-5254006e85c2


          adilger Andreas Dilger added a comment -

          Can you please submit a patch to Lustre to apply the upstream patch to our server kernel? It would also be good to figure out how to request that this patch be included in RHEL.

          isaac Isaac Huang (Inactive) added a comment -

          Maybe I missed something, but the latest kernel RPM I could find on our download site was 2.6.32-431.20.3.el6_lustre.x86_64, on which the same error happened:

          BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
          IP: [<ffffffff81058e52>] task_rq_lock+0x42/0xa0

          adilger Andreas Dilger added a comment -

          Isaac, can you see if this bug is fixed in the 2.6.32-431.29.2.el6 (RHEL6.5) kernel? That is the kernel supported for 2.5.3 and 2.6+, so it makes sense to use it for testing. I am also running the 2.6.32-358.23.2.el6 kernel on my test system, but I'm going to update it because of LU-5722, which looks like it may also be a kernel bug.

          People

            Assignee: isaac Isaac Huang (Inactive)
            Reporter: isaac Isaac Huang (Inactive)