LU-8354: soft lockup in ldlm_plain_compat_queue

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • None
    • 3

    Description

      <6>[1058680.630618] Lustre: Setting parameter snx11001-MDT0000.mdd.changelog_mask in log snx11001-MDT0000
      <0>[1058752.944434] BUG: soft lockup - CPU#9 stuck for 67s! [lctl:79108]
      <4>[1058753.055094] CPU 9 
      <4>[1058753.057343] Modules linked in: ost(U) osd_ldiskfs(U) ldiskfs(U) mdt(U) mdd(U) lfsck(U) mgs(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ko2iblnd(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) raid1 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx ext4 jbd2 mbcache ib_ipoib(U) rdma_ucm(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) rdma_cm(U) ib_cm(U) iw_cm(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack iptable_raw ip_tables ipmi_devintf acpi_cpufreq freq_table mperf dm_mod sg ses enclosure sd_mod crc_t10dif wmi iTCO_wdt iTCO_vendor_support isci libsas mpt2sas scsi_transport_sas raid_class sb_edac edac_core ahci i2c_i801 lpc_ich mfd_core shpchp nfs lockd fscache auth_rpcgss nfs_acl sunrpc igb dca i2c_algo_bit i2c_core mlx4_en(U) ptp pps_core mlx4_core(U) compat(U) bonding ipv6 8021q garp stp llc [last unloaded: ib_core]
      <4>[1058753.161188] 
      <4>[1058753.163132] Pid: 79108, comm: lctl Not tainted 2.6.32-431.17.1.x2.0.76.x86_64 #1 Intel Corporation S2600JF/S2600JF
      <4>[1058753.175129] RIP: 0010:[<ffffffffa0897750>]  [<ffffffffa0897750>] ldlm_add_ast_work_item+0x30/0x150 [ptlrpc]
      <4>[1058753.186440] RSP: 0018:ffff880f4540da48  EFLAGS: 00000246
      <4>[1058753.192658] RAX: ffff880fc0039e40 RBX: ffff880f4540da68 RCX: 00000000000013cf
      <4>[1058753.201000] RDX: ffff880f4540daa8 RSI: ffff880e610c7340 RDI: ffff880e3a3ddd00
      <4>[1058753.209342] RBP: ffffffff8100bb8e R08: ffff880fb713fd50 R09: ffff880e610c7340
      <4>[1058753.217678] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880fb713fd50
      <4>[1058753.226014] R13: ffff880e610c7340 R14: 0000000000000000 R15: 0000000000000000
      <4>[1058753.234349] FS:  00007f17bd91c700(0000) GS:ffff880060720000(0000) knlGS:0000000000000000
      <4>[1058753.243757] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      <4>[1058753.250465] CR2: 00007fd267c5f000 CR3: 0000000eea234000 CR4: 00000000000407e0
      <4>[1058753.258809] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>[1058753.267146] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>[1058753.275483] Process lctl (pid: 79108, threadinfo ffff880f4540c000, task ffff881033667540)
      <4>[1058753.284983] Stack:
      <4>[1058753.287507]  ffffffffa08b0948 0000000000000010 ffff880fb9d7fac0 ffff880f4540daa8
      <4>[1058753.295905] <d> ffff880f4540dae8 ffffffffa08b0958 ffff880f4540dc40 ffff880e610c73a0
      <4>[1058753.304897] <d> ffff880fc0039e58 ffff880fc0039e80 ffff880fc0039e40 0000000100000001
      <4>[1058753.314187] Call Trace:
      <4>[1058753.317222]  [<ffffffffa08b0948>] ? ldlm_process_plain_lock+0x1b8/0xa80 [ptlrpc]
      <4>[1058753.325871]  [<ffffffffa08b0958>] ? ldlm_process_plain_lock+0x1c8/0xa80 [ptlrpc]
      <4>[1058753.334520]  [<ffffffffa089bbab>] ? ldlm_lock_enqueue+0x48b/0xa60 [ptlrpc]
      <4>[1058753.342508]  [<ffffffffa08bbac1>] ? ldlm_cli_enqueue_local+0x1b1/0x810 [ptlrpc]
      <4>[1058753.351055]  [<ffffffffa0d5d650>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
      <4>[1058753.359319]  [<ffffffffa08ba880>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
      <4>[1058753.367097]  [<ffffffffa0d5d30b>] ? mgs_revoke_lock+0x1fb/0x350 [mgs]
      <4>[1058753.374599]  [<ffffffffa08ba880>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
      <4>[1058753.382377]  [<ffffffffa0d5d650>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
      <4>[1058753.390629]  [<ffffffffa0d7b32f>] ? mgs_setparam+0xe6f/0x10f0 [mgs]
      <4>[1058753.397924]  [<ffffffffa0d63712>] ? mgs_iocontrol+0x15b2/0x18e0 [mgs]
      <4>[1058753.405456]  [<ffffffffa0661ed5>] ? obd_ioctl_getdata+0x145/0x1150 [obdclass]
      <4>[1058753.413811]  [<ffffffffa067b2be>] ? class_handle_ioctl+0x16fe/0x2270 [obdclass]
      <4>[1058753.422346]  [<ffffffffa06612ab>] ? obd_class_ioctl+0x4b/0x190 [obdclass]
      <4>[1058753.430224]  [<ffffffff8119e0e2>] ? vfs_ioctl+0x22/0xa0
      <4>[1058753.436345]  [<ffffffff8119e284>] ? do_vfs_ioctl+0x84/0x580
      <4>[1058753.442861]  [<ffffffff8119e801>] ? sys_ioctl+0x81/0xa0
      <4>[1058753.448989]  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      <4>[1058753.456180] Code: 54 53 48 83 ec 10 0f 1f 44 00 00 f6 05 0d b2 cb ff 01 48 89 fb 49 89 f4 74 0d f6 05 fc b1 cb ff 01 0f 85 9c 00 00 00 48 8b 43 48 <8b> 40 18 89 c1 c1 f9 10 66 39 c1 0f 84 ff 00 00 00 4d 85 e4 0f 
      

      The lock iteration in ldlm_plain_compat_queue() was previously optimized to skip locks of the same type, but this optimization was broken by patch http://review.whamcloud.com/10945 "LU-3963 ldlm: convert to linux list api", which converted list_for_each() to list_for_each_entry(). The original loop used the "tmp" pointer as its cursor and advanced it to the end of the skip list of locks with the same type, so each group of such locks was visited only once. The converted loop iterates over every lock on the queue, which can take long enough to trigger the soft lockup shown above when a large number of clients are connected to the MGS.
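
The effect described above can be reproduced outside the kernel with a small sketch. Everything in it is an assumption made for illustration: the list macros are minimal userspace stand-ins for the kernel list API, `struct demo_lock` and its `group_last` pointer are a hypothetical simplification of the real skip list, and the third function shows one possible way to restore the skip, not necessarily the change that landed for this ticket.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel list API (not the real headers). */
struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)
#define list_for_each_entry(pos, head, member) \
	for (pos = list_entry((head)->next, __typeof__(*pos), member); \
	     &pos->member != (head); \
	     pos = list_entry(pos->member.next, __typeof__(*pos), member))

/* Hypothetical, simplified lock: group_last points at the last lock in the
 * run of same-mode locks this lock belongs to (stand-in for the skip list). */
struct demo_lock {
	int mode;
	struct demo_lock *group_last;
	struct list_head res_link;
};

#define DEMO_MAX_LOCKS 64
static struct demo_lock demo_locks[DEMO_MAX_LOCKS];
static struct list_head demo_queue;

/* Build a queue of `groups` runs, each holding `per_group` same-mode locks. */
static void demo_build(int groups, int per_group)
{
	int g, i, n = 0;

	assert(groups * per_group <= DEMO_MAX_LOCKS);
	demo_queue.next = demo_queue.prev = &demo_queue;
	for (g = 0; g < groups; g++) {
		for (i = 0; i < per_group; i++, n++) {
			struct demo_lock *lk = &demo_locks[n];

			lk->mode = g;
			lk->group_last = &demo_locks[(g + 1) * per_group - 1];
			/* Append at the tail of the queue. */
			lk->res_link.prev = demo_queue.prev;
			lk->res_link.next = &demo_queue;
			demo_queue.prev->next = &lk->res_link;
			demo_queue.prev = &lk->res_link;
		}
	}
}

/* Old style: `tmp` is the loop cursor, so moving it skips the whole group. */
static int visits_list_for_each(void)
{
	struct list_head *tmp;
	int visits = 0;

	list_for_each(tmp, &demo_queue) {
		struct demo_lock *lk = list_entry(tmp, struct demo_lock, res_link);

		visits++;
		tmp = &lk->group_last->res_link;
	}
	return visits;
}

/* Converted style: the cursor is now `lk`, so assigning `tmp` has no effect. */
static int visits_entry_stale_tmp(void)
{
	struct demo_lock *lk;
	struct list_head *tmp;
	int visits = 0;

	list_for_each_entry(lk, &demo_queue, res_link) {
		visits++;
		tmp = &lk->group_last->res_link;
		(void)tmp;
	}
	return visits;
}

/* One possible repair (an assumption): advance the entry cursor itself. */
static int visits_entry_skip(void)
{
	struct demo_lock *lk;
	int visits = 0;

	list_for_each_entry(lk, &demo_queue, res_link) {
		visits++;
		lk = lk->group_last;
	}
	return visits;
}
```

After `demo_build(4, 8)`, the two skipping variants visit one lock per group while the stale-`tmp` variant visits all 32 locks. With thousands of locks on one resource, that gap is the difference between a short loop and one long enough to trip the soft lockup watchdog.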

            People

              jhammond John Hammond
              askulysh Andriy Skulysh