
"BUG: unable to handle kernel NULL pointer dereference" in lprocfs_rd_import()

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.9.0
    • None
    • None
    • Lustre-2.0, RHEL6.0
    • 3
    • 24,449
    • 6530

    Description

      We've been hitting this problem for several months when reading "/proc/fs/lustre/osc/<OST>/import".

      I saw there may be a related patch (BZ#22032, WC's git commit 839280926956f16552194fe803ba21096770ebc4) which was integrated for Lustre 2.1. What do you think of this? If the BZ#22032 patch is not related, does this sound like a known problem to you?

      ==============================================================================
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      IP: [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
      PGD c7cf9f067 PUD ae9bcc067 PMD 0
      Oops: 0000 1 SMP
      last sysfs file: /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/infiniband/mlx4_0/ports/1/rate
      CPU 5
      Modules linked in: sit(U) tunnel4(U) lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U)
      ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U)
      ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) ipmi_devintf(U) ipmi_si(U)
      ipmi_msghandler(U) iptable_filter(U) ip_tables(U) x_tables(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) sunrpc(U)
      acpi_cpufreq(U) freq_table(U) vtune_drv(U) autofs4(U) ipv6(U) sg(U) i7core_edac(U) edac_core(U) i2c_i801(U) i2c_core(U)
      igb(U) ioatdma(U) dca(U) iTCO_wdt(U) iTCO_vendor_support(U) ext3(U) jbd(U) mbcache(U) sd_mod(U) crc_t10dif(U) usbhid(U)
      hid(U) ehci_hcd(U) ahci(U) uhci_hcd(U) dm_mod(U) [last unloaded: libcfs]

      Pid: 29413, comm: grep Not tainted 2.6.32-30.el6.Bull.14.x86_64 #1 bullx super-node
      RIP: 0010:[<ffffffffa0482d3d>] [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
      RSP: 0018:ffff8806e57ffd78 EFLAGS: 00010206
      RAX: 0000000000000000 RBX: ffff880c7db5a000 RCX: 0000000000000038
      RDX: ffff880c6fd42105 RSI: 00000000fffffffe RDI: 0000000000000013
      RBP: ffff8806e57ffe38 R08: 0000000000000000 R09: 00000000fffffffe
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000105 R14: 0000000000000000 R15: 0000000000001000
      FS: 00002b8d09d85f60(0000) GS:ffff88088e440000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000018 CR3: 00000009da84e000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process grep (pid: 29413, threadinfo ffff8806e57fe000, task ffff8807bf266c50)
      Stack:
      0000000000000000 ffffea0029a6ddf8 00000200011cce48 00000010e68896a0
      <0> ffff880a4323e948 ffff880c7db5a000 ffff880a4323e438 ffff880c6fd42000
      <0> ffff8806e57ffde8 ffff880880001d80 ffff880c7d7c2300 00000000000000d0
      Call Trace:
      [<ffffffff8113e377>] ? alloc_pages_current+0x87/0xd0
      [<ffffffffa0480651>] lprocfs_fops_read+0xd1/0x1e0 [obdclass]
      [<ffffffff811b6a36>] proc_reg_read+0x76/0xb0
      [<ffffffff81157f55>] vfs_read+0xb5/0x1a0
      [<ffffffff810c5282>] ? audit_syscall_entry+0x252/0x280
      [<ffffffff81158091>] sys_read+0x51/0x90
      [<ffffffff8100c172>] system_call_fastpath+0x16/0x1b
      Code: 18 08 75 a2 48 8b 9d 68 ff ff ff 66 ff 83 78 02 00 00 48 8b 43 60 44 8b 83 28 02 00 00 44 8b b3 14 01 00 00 44
      8b a3 24 02 00 00 <48> 8b 78 18 44 89 85 58 ff ff ff e8 d3 5e dc ff 49 63 fd 48 03
      RIP [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
      RSP <ffff8806e57ffd78>
      ==============================================================================
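One reading of the register dump above, offered as a sketch rather than a confirmed diagnosis: RAX is 0, the faulting instruction bytes `<48> 8b 78 18` decode to an 8-byte load from 0x18(%rax), and CR2 reports 0x18. That pattern is what a read of a structure member at offset 0x18 through a NULL structure pointer would produce. The struct below is hypothetical; only the 0x18 offset is taken from the oops, and the names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the real Lustre structure: the only detail
 * taken from the oops is that the member being read sits at offset 0x18.
 */
struct conn_layout_sketch {
    uint8_t  earlier_fields[0x18]; /* whatever precedes the member */
    uint64_t field_at_0x18;        /* the 8-byte member being loaded */
};

/* Address the CPU touches when it reads base->field_at_0x18. */
static uintptr_t fault_address(uintptr_t base)
{
    return base + offsetof(struct conn_layout_sketch, field_at_0x18);
}
```

Calling `fault_address(0)` models the NULL base pointer held in RAX; a valid base would yield an address 0x18 bytes past the start of the structure.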

      Further in-depth analysis clearly indicates that this problem comes from a race between a process reading
      the "/proc/fs/lustre/osc/<OST>/import" special file via the lprocfs layer and other Lustre layers dealing with
      imports.

      Thanks,

      Attachments

        Activity

          [LU-815] "BUG: unable to handle kernel NULL pointer dereference" in lprocfs_rd_import()

          bfaccini Bruno Faccini (Inactive) added a comment -

          Oops, thanks to Andreas for asking me to review the code and patches top-down, starting with the master branch! And bingo, a similar patch has already been applied starting with b2_3. It comes from JIRA LU-1448, where the same issue was found for disabled OSCs, whereas in our case it also happens during OSC mount.

          The patch on master is at http://review.whamcloud.com/2995, so it needs to be cherry-picked from there to be applied to the b2_1/b2_2 branches.

          In the meantime, should I "Abandon" my change on Gerrit and point to the master change?

          bfaccini Bruno Faccini (Inactive) added a comment -

          Hmm, even when running Lustre 2.1.1 (which includes the fix for BZ#22032) we can still reproduce the same crash/Oops, so I would like to re-open this JIRA.

          Again, the crash is due to imp->imp_connection being NULL and being dereferenced in lprocfs_rd_import().

          So I am back to my earlier fix idea, which Bull R&D did not choose at the time in favor of BZ#22032: imp->imp_connection must also be accessed under imp->imp_lock protection, and a NULL value must be detected.

          Patch against b2_1 is at http://review.whamcloud.com/4187
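The fix idea described in this comment (read the connection pointer only while holding the import lock, and treat NULL as a valid state) can be sketched in user-space C as follows. This is not the patch from Gerrit: the struct names, the pthread mutex standing in for imp->imp_lock, the "<none>" placeholder, and the output line format are all assumptions made for illustration.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the Lustre structures. */
struct conn_sketch {
    char remote_nid[32];
};

struct import_sketch {
    pthread_mutex_t     lock;       /* plays the role of imp->imp_lock */
    struct conn_sketch *connection; /* plays the role of imp->imp_connection; may be NULL */
};

/*
 * Format one line describing the current connection.
 * The pointer is read and dereferenced only while the lock is held, and
 * the data is copied to a local buffer so nothing is touched after unlock.
 * Returns the value of snprintf() for the formatted line.
 */
static int import_show_connection(struct import_sketch *imp,
                                  char *buf, size_t len)
{
    char nid[32] = "<none>"; /* illustrative placeholder for "no connection" */

    pthread_mutex_lock(&imp->lock);
    if (imp->connection != NULL)
        snprintf(nid, sizeof(nid), "%s", imp->connection->remote_nid);
    pthread_mutex_unlock(&imp->lock);

    return snprintf(buf, len, "current_connection: %s\n", nid);
}
```

The sketch only closes the race if every writer that sets or clears the pointer takes the same lock; copying the string out before unlocking avoids using the connection after another thread may have released it.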

          adilger Andreas Dilger added a comment -

          I'm going to mark this fixed in 2.1.0. Please reopen if the customer hits this problem again.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Yes, we integrated the proposed patch and delivered it to the customer, but we do not have any feedback yet.

          pjones Peter Jones added a comment -

          Any feedback on this ticket? Have you been able to try the suggested fix yet? If not, when do you expect to be able to do so?

          dmoreno Diego Moreno (Inactive) added a comment -

          Actually, I don't think the patch in 839280926956f16552194fe803ba21096770ebc4 is in the official 2.0.0 release. In git we can see this patch was introduced between the 2.0.52.0 and 2.0.53.0 tags, so the result shown by git describe is very strange.

          We're going to integrate this patch into our 2.0.0 (which is the official release), and if we still have the problem we'll try LU-615.

          Thanks Andreas

          adilger Andreas Dilger added a comment -

          Looking at git for 839280926956f16552194fe803ba21096770ebc4, it definitely seems related, but "git describe" shows that this should be included in v2_0_0-rc1a, which means it should be in the Lustre 2.0.0 release already. Are you running the official 2.0.0 release, or some earlier build?

          The other possibility is that this is related to the patch in http://review.whamcloud.com/1544 (LU-615), which fixes reads from .../import to avoid overflowing the page buffer. This patch has not yet landed on the master (2.2) release branch, so I would recommend testing it first if you plan to apply it before testing and landing have completed.

          People

            adilger Andreas Dilger
            lustre-bull Lustre Bull (Inactive)