Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Environment: CentOS 6.5
    • 3
    • 15202

    Description

      We got the following oops during testing:

      <4>PGD 0 
      <4>Oops: 0002 [#1] SMP 
      <4>last sysfs file: /sys/devices/system/cpu/online
      <4>CPU 1 
      <4>Modules linked in: lmv(U) fld(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc VSMqfs(P)(U) autofs4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ppdev vmware_balloon parport_pc parport vmxnet3 sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom vmw_pvscsi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      <4>
      <4>Pid: 24385, comm: ldlm_cb00_056 Tainted: P           ---------------    2.6.32-431.17.1.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
      <4>RIP: 0010:[<ffffffff8118a509>]  [<ffffffff8118a509>] fput+0x9/0x30
      <4>RSP: 0018:ffff88012db55c20  EFLAGS: 00010246
      <4>RAX: 00000000ffffffe0 RBX: ffff8800a8ea4fc0 RCX: 0000000000000000
      <4>RDX: ffffffffa03c9eb0 RSI: 0000000000000000 RDI: 0000000000000000
      <4>RBP: ffff88012db55c20 R08: 00000000ffffff0a R09: 00000000fffffffc
      <4>R10: 0000000000000001 R11: 282064656c696166 R12: ffffffffa03c9c60
      <4>R13: ffff88005df240f8 R14: 0000000000000000 R15: ffff88013b4ca000
      <4>FS:  0000000000000000(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
      <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>CR2: 0000000000000030 CR3: 0000000001a85000 CR4: 00000000000407e0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process ldlm_cb00_056 (pid: 24385, threadinfo ffff88012db54000, task ffff88012da5f500)
      <4>Stack:
      <4> ffff88012db55c60 ffffffffa0388044 0000000000000002 00000000ffffffe0
      <4><d> ffff88005df240f8 ffff88005df24100 ffffffffa07103be ffff8801398d7000
      <4><d> ffff88012db55cc0 ffffffffa08649f7 ffff88008c9af3f0 000000008116f303
      <4>Call Trace:
      <4> [<ffffffffa0388044>] libcfs_kkuc_group_put+0x94/0x180 [libcfs]
      <4> [<ffffffffa08649f7>] mdc_set_info_async+0x147/0x780 [mdc]
      <4> [<ffffffffa0699fad>] ldlm_callback_handler+0x4dd/0x1800 [ptlrpc]
      <4> [<ffffffffa04c321f>] ? keys_fill+0x6f/0x190 [obdclass]
      <4> [<ffffffffa06b8f6c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
      <4> [<ffffffffa06bf61b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
      <4> [<ffffffffa06c7a35>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      <4> [<ffffffffa036f4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa03804ff>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa06bf119>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa06c8d9d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
      <4> [<ffffffffa06c82b0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      <4> [<ffffffff8109ab56>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>Code: fe ff ff 31 d2 48 89 de 83 cf ff ff d0 e9 da fe ff ff 48 89 df e8 f8 63 04 00 e9 bb fe ff ff 0f 1f 00 55 48 89 e5 0f 1f 44 00 00 <f0> 48 ff 4f 30 0f 94 c0 84 c0 75 0b c9 c3 66 0f 1f 84 00 00 00 
      <1>RIP  [<ffffffff8118a509>] fput+0x9/0x30
      <4> RSP <ffff88012db55c20>
      <4>CR2: 0000000000000030
      

      crash bt:

      PID: 24385  TASK: ffff88012da5f500  CPU: 1   COMMAND: "ldlm_cb00_056"
       #0 [ffff88012db55810] machine_kexec at ffffffff81038f3b
       #1 [ffff88012db55870] crash_kexec at ffffffff810c59f2
       #2 [ffff88012db55940] oops_end at ffffffff8152b7f0
       #3 [ffff88012db55970] no_context at ffffffff8104a00b
       #4 [ffff88012db559c0] __bad_area_nosemaphore at ffffffff8104a295
       #5 [ffff88012db55a10] bad_area_nosemaphore at ffffffff8104a363
       #6 [ffff88012db55a20] __do_page_fault at ffffffff8104aabf
       #7 [ffff88012db55b40] do_page_fault at ffffffff8152d73e
       #8 [ffff88012db55b70] page_fault at ffffffff8152aaf5
          [exception RIP: fput+9]
          RIP: ffffffff8118a509  RSP: ffff88012db55c20  RFLAGS: 00010246
          RAX: 00000000ffffffe0  RBX: ffff8800a8ea4fc0  RCX: 0000000000000000
          RDX: ffffffffa03c9eb0  RSI: 0000000000000000  RDI: 0000000000000000
          RBP: ffff88012db55c20   R8: 00000000ffffff0a   R9: 00000000fffffffc
          R10: 0000000000000001  R11: 282064656c696166  R12: ffffffffa03c9c60
          R13: ffff88005df240f8  R14: 0000000000000000  R15: ffff88013b4ca000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff88012db55c28] libcfs_kkuc_group_put at ffffffffa0388044 [libcfs]
      #10 [ffff88012db55c68] mdc_set_info_async at ffffffffa08649f7 [mdc]
      #11 [ffff88012db55cc8] ldlm_callback_handler at ffffffffa0699fad [ptlrpc]
      #12 [ffff88012db55d68] ptlrpc_server_handle_request at ffffffffa06c7a35 [ptlrpc]
      #13 [ffff88012db55e48] ptlrpc_main at ffffffffa06c8d9d [ptlrpc]
      #14 [ffff88012db55ee8] kthread at ffffffff8109ab56
      #15 [ffff88012db55f48] kernel_thread at ffffffff8100c20a
      

      The offending line in libcfs_kkuc_group_put() is

        fput(reg->kr_fp);
      

      reg is coming from kkuc_groups, which is an array of lists.

      crash64> rd kkuc_groups 8
      ffffffffa03c9c40:  0000000000000000 0000000000000000   ................
      ffffffffa03c9c50:  0000000000000000 0000000000000000   ................
      ffffffffa03c9c60:  ffff8800a8ea4fc0 ffff8800a8ea4fc0   .O.......O......
      ffffffffa03c9c70:  0000000000000000 0000000000000000   ................
      

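      So only one of the lists is non-empty: the head at ffffffffa03c9c60
      (kkuc_groups+32, i.e. kkuc_groups[2]), whose next and prev both point
      to the same single element.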
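
      The arithmetic behind reading that dump can be sketched as follows. This is only an illustration: list_head_sketch, group_index() and has_single_entry() are hypothetical names, not libcfs or crash symbols, and the sketch assumes a list head is two 8-byte pointers, as on x86_64.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel list head: two pointers,
 * 16 bytes on x86_64. Addresses are kept as plain integers
 * because they come from a crash dump, not from live memory. */
struct list_head_sketch {
    uint64_t next;
    uint64_t prev;
};

/* Index of the list head at head_addr within an array starting at base. */
static size_t group_index(uint64_t base, uint64_t head_addr)
{
    return (size_t)((head_addr - base) / sizeof(struct list_head_sketch));
}

/* A head holds exactly one entry when next and prev point to the same
 * node and that node is not the head itself. A zeroed head is treated
 * here as never initialized, hence as holding no entry. */
static int has_single_entry(uint64_t head_addr,
                            const struct list_head_sketch *head)
{
    return head->next != 0 &&
           head->next == head->prev &&
           head->next != head_addr;
}
```

      With the values from the dump, the head at ffffffffa03c9c60 is at index 2 of the array starting at ffffffffa03c9c40, and it qualifies as a single-entry list.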

      list ffffffffa03c9c60 -s kkuc_reg
      ...
      ffff8800a8ea4fc0
      struct kkuc_reg {
        kr_chain = {
          next = 0xffffffffa03c9c60 <kkuc_groups+32>,
          prev = 0xffffffffa03c9c60 <kkuc_groups+32>
        },
        kr_uid = 23389,
        kr_fp = 0x0,
        kr_data = 0xffff8800a8ea4f80
      }
      

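      So reg->kr_fp is NULL. I cannot prove that this is the reg that was
      being processed at the time of the crash, but it is the only entry in
      the list and its address matches RBX=ffff8800a8ea4fc0, so it must be.
      This also agrees with the oops: RDI, which carries the argument to
      fput(), is 0, and the faulting address in CR2 is 0x30, a small offset
      from a NULL pointer.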

      Looking at libcfs_kkuc_group_put(), it appears that it takes kg_sem
      only for read even though it modifies the list entries:

              down_read(&kg_sem);
              cfs_list_for_each_entry(reg, &kkuc_groups[group], kr_chain) {
      ...
                                      fput(reg->kr_fp);
                                      reg->kr_fp = NULL;
      ...
              up_read(&kg_sem);
      

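      Since the loop modifies reg (it clears kr_fp), a read lock is not
      enough; the function should use down_write()/up_write() instead. I
      suspect a race in which two callers executed that function at the same
      time: the first one released the file and set kr_fp to NULL, and the
      second one then passed the NULL pointer to fput() and crashed.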

          People

            cliffw Cliff White (Inactive)
            fzago Frank Zago (Inactive)