Details

    • Bug
    • Resolution: Incomplete
    • Critical
    • None
    • Lustre 2.10.3
    • client sles12sp2 lustre 2.10.3
      servers 2.7.3 and 2.10.3
    • 2

    Description

      Clients hang in LNetMDUnlink. This may be a duplicate of LU-11092 and LU-10669.

       

      [166855.238376] CPU: 33 PID: 2938 Comm: ptlrpcd_01_02 Tainted: P        W  OEL  NX 4.4.90-92.45.1.20171031-nasa #1
      [166855.238378] Hardware name: SGI.COM SUMMIT/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
      [166855.238381] task: ffff8807db820bc0 ti: ffff8807db824000 task.ti: ffff8807db824000
      [166855.238383] RIP: 0010:[<ffffffff810cc0a1>]  [<ffffffff810cc0a1>] native_queued_spin_lock_slowpath+0x111/0x1a0
      [166855.238392] RSP: 0018:ffff8807db827b98  EFLAGS: 00000246
      [166855.238393] RAX: 0000000000000000 RBX: ffff880fe93574e0 RCX: 0000000000880000
      [166855.238395] RDX: ffff88081e2567c0 RSI: 0000000000280001 RDI: ffff88101cdb6e00
      [166855.238396] RBP: ffff8807db827b98 R08: ffff88101db567c0 R09: 0000000000000000
      [166855.238398] R10: 0000000000000000 R11: ffff880ee98f8817 R12: 0000000000000008
      [166855.238400] R13: 000000000a222d0f R14: 0000000000000001 R15: 0000000000000000
      [166855.238402] FS:  0000000000000000(0000) GS:ffff88101db40000(0000) knlGS:0000000000000000
      [166855.238403] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [166855.238405] CR2: 0000000000641038 CR3: 0000000001afe000 CR4: 00000000001406e0
      [166855.238407] Stack:
      [166855.238408]  ffff8807db827ba8 ffffffff8119162a ffff8807db827bb8 ffffffff8161e640
      [166855.238411]  ffff8807db827be0 ffffffffa0a96683 ffffffffa1dc78e7 0000000000000001
      [166855.238414]  000000002888b43d ffff8807db827cb8 ffffffffa0b254f5 ffffffffa1dc78d8
      [166855.238417] Call Trace:
      [166855.238431]  [<ffffffff8119162a>] queued_spin_lock_slowpath+0xb/0xf
      [166855.238439]  [<ffffffff8161e640>] _raw_spin_lock+0x20/0x30
      [166855.238467]  [<ffffffffa0a96683>] cfs_percpt_lock+0x53/0x100 [libcfs]
      [166855.238510]  [<ffffffffa0b254f5>] LNetMDUnlink+0x65/0x150 [lnet]
      [166855.238573]  [<ffffffffa1d5cc88>] ptlrpc_unregister_reply+0xf8/0x6f0 [ptlrpc]
      [166855.238636]  [<ffffffffa1d616d8>] ptlrpc_expire_one_request+0xb8/0x430 [ptlrpc]
      [166855.238674]  [<ffffffffa1d61aff>] ptlrpc_expired_set+0xaf/0x190 [ptlrpc]
      [166855.238719]  [<ffffffffa1d8f998>] ptlrpcd+0x258/0x4e0 [ptlrpc]
      [166855.238729]  [<ffffffff8109f276>] kthread+0xd6/0xf0
      [166855.238735]  [<ffffffff8161ed3f>] ret_from_fork+0x3f/0x70
      [166855.241341] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
      [166855.241342] 
      [166855.241343] Leftover inexact backtrace:
                      
      [166855.241348]  [<ffffffff8109f1a0>] ? kthread_park+0x60/0x60
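When comparing traces like the one above across nodes or kernel versions, it can help to reduce each one to its function frames. The following is a minimal sketch, not part of the original report: the name `parse_call_traces` and the regular expression are illustrative assumptions, and the pattern only covers the two frame formats that appear in this ticket (with and without a leading `[<address>]`). Frames prefixed with `?` are marked as unreliable.

```python
import re

# One stack frame, e.g.
#   "[166855.238510]  [<ffffffffa0b254f5>] LNetMDUnlink+0x65/0x150 [lnet]"
#   "... kernel: [1579630760.775887]  ? wake_up_q+0x70/0x70"
FRAME_RE = re.compile(
    r"(?:^|\])\s*"                      # start of line or end of timestamp
    r"(?:\[<[0-9a-f]+>\]\s+)?"          # optional [<address>] (older kernels)
    r"(?P<unreliable>\?\s+)?"           # "?" marks a frame the unwinder is unsure of
    r"(?P<func>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"  # optional [module]
)


def parse_call_traces(log_text):
    """Return a list of traces; each trace is a list of frame dicts."""
    traces, current = [], None
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            current = []
            traces.append(current)
            continue
        if current is None:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            current = None  # the first non-frame line ends the trace
            continue
        current.append({
            "func": match.group("func"),
            "offset": int(match.group("offset"), 16),
            "module": match.group("module"),
            "reliable": match.group("unreliable") is None,
        })
    return traces
```

Calling `parse_call_traces(open("console.log").read())` (the file name is a placeholder) returns one list per `Call Trace:` header, so the frames leading into `LNetMDUnlink` can be compared between dumps.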
       

      We will try to get a reproducer.

      Attachments

        Issue Links

          Activity

            [LU-11100] Clients hang in LNetMDUnlink

            This can be closed

            mhanafi Mahmoud Hanafi added a comment

            Is it possible to grab the stack traces of all the tasks when we hit this issue:

            echo t > /proc/sysrq-trigger 

            It would be useful to see who is holding the lock. When I last looked at the crash dump for this case, the MD/ME lists appeared to be growing, so I suspected that walking them takes a long time.

            ashehata Amir Shehata (Inactive) added a comment
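The request above can be scripted so that the task dump is saved as soon as a hang is noticed. This is a rough sketch under several assumptions: it must run as root, the `dmesg` utility is assumed to be installed, and the function names and the `read_log` parameter are hypothetical helpers introduced here for illustration. Writing `t` to `/proc/sysrq-trigger` is the same operation as the `echo` command in the comment.

```python
import subprocess
from pathlib import Path


def read_kernel_log():
    """Return the kernel ring buffer as text (assumes `dmesg` is available)."""
    return subprocess.run(
        ["dmesg"], check=True, capture_output=True, text=True
    ).stdout


def dump_task_stacks(out_path, trigger_path="/proc/sysrq-trigger",
                     read_log=read_kernel_log):
    """Request a dump of all task stacks and save the kernel log to out_path.

    Returns the number of "Call Trace:" headers found in the saved log.
    """
    # Writing "t" asks the kernel to log the state and stack of every task.
    Path(trigger_path).write_text("t\n")
    log = read_log()
    Path(out_path).write_text(log)
    return log.count("Call Trace:")
```

For example, `dump_task_stacks("/tmp/lu11100-tasks.log")` (an arbitrary output path) saves the log for attaching to the ticket. On a node with many tasks the kernel ring buffer may wrap before it is read, so the saved dump can be incomplete.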

            We hit this issue again today.

            mhanafi Mahmoud Hanafi added a comment

            We are also hitting this with Lustre 2.12.3, so the above patches did not fix the issue.

            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] Modules linked in: iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) binfmt_misc(E) fuse(E) mgc(OEN) lustre(OEN) lmv(OEN) mdc(OEN) fid(OEN) osc(OEN) rpcsec_gss_krb5(E) auth_rpcgss(E) lov(OEN) nfsv4(E) fld(OEN) dns_resolver(E) ko2iblnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) nfsv3(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) libcfs(OEN) rdma_ucm(OEX) ib_ucm(OEX) rdma_cm(OEX) iw_cm(OEX) configfs(E) ib_ipoib(OEX) ib_cm(OEX) ib_umad(OEX) bonding(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nf_log_ipv6(E) nf_log_common(E) xt_LOG(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_filter(E) ip6_tables(E) xt_tcpudp(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) iptable_filter(E) xt_CT(E) nf_conntrack(E) libcrc32c(E) iptable_raw(E) ip_tables(E) x_tables(E) mlx4_ib(OEX) ib_uverbs(OEX) ib_core(OEX)
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  tcp_bic(EN) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) ipmi_ssif(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) mlx4_core(OEX) aesni_intel(E) iTCO_wdt(E) iTCO_vendor_support(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) mlx_compat(OEX) devlink(E) cryptd(E) ipmi_si(E) ioatdma(E) igb(E) mei_me(E) mei(E) lpc_ich(E) wmi(E) ipmi_devintf(E) ipmi_msghandler(E) i2c_i801(E) shpchp(E) mfd_core(E) pcspkr(E) dca(E) button(E) acpi_cpufreq(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) sd_mod(E) csiostor(E) sr_mod(E) cdrom(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) isci(EX) ahci(E) cxgb4(E) drm(E) libahci(E) libsas(E) ptp(E) crc32c_intel(E) serio_raw(E) drm_panel_orientation_quirks(E)
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  libata(E) scsi_transport_fc(E) pps_core(E) scsi_transport_sas(E) hwperf(OEX) numatools(OEX) xpmem(OEX) gru(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E)
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] Supported: No, Unreleased kernel
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] CPU: 10 PID: 4337 Comm: ptlrpcd_01_05 Tainted: G           OEL     4.12.14-95.40.1.20191112-nasa #1 SLE12-SP4 (unreleased)
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] Hardware name: SGI.COM C1104-RP7/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.00 09/12/2013
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] task: ffff91f29dfeca00 task.stack: ffffa1c3cb810000
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RIP: 0010:native_queued_spin_lock_slowpath+0xda/0x1d0
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RSP: 0018:ffffa1c3cb813c30 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00000000002c0000
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RDX: ffff91fadf2a3a00 RSI: ffff91fadf3e3a00 RDI: ffff91facfa7d040
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RBP: ffffa1c3cb813d10 R08: 0000000000000000 R09: 0000000000000150
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] R10: 000000000000002d R11: ffff91f982e8d817 R12: 000000007068765d
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] R13: 000000001c1a1d97 R14: 0000000000000000 R15: ffff91f29dfeca00
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] FS:  0000000000000000(0000) GS:ffff91fadf280000(0000) knlGS:0000000000000000
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] CR2: 00002aaaaacc4000 CR3: 0000000d3700a001 CR4: 00000000000606e0
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] Call Trace:
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  queued_spin_lock_slowpath+0x7/0xa
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  LNetMDUnlink+0x65/0x150 [lnet]
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ptlrpc_unregister_reply+0xf2/0x6f0 [ptlrpc]
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ? ptlrpc_set_import_discon+0xf5/0x6e0 [ptlrpc]
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ptlrpc_expire_one_request+0xe4/0x4d0 [ptlrpc]
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ptlrpc_expired_set+0xa9/0x180 [ptlrpc]
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ptlrpcd+0x22e/0x4a0 [ptlrpc]
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ? wake_up_q+0x70/0x70
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  kthread+0xff/0x140
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ? ptlrpcd_check+0x560/0x560 [ptlrpc]
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ? __kthread_parkme+0x70/0x70
            Jan 21 10:19:20 pfe24 kernel: [1579630760.775887]  ret_from_fork+0x35/0x40
            
             
            mhanafi Mahmoud Hanafi added a comment

            For documentation purposes: the backport of LU-9230 to b2_10 is tracked in LU-11352, "backport of LU-9230 to 2.10.5".

            jaylan Jay Lan (Inactive) added a comment

            We tried the settings recommended in LU-11092, but they are making things worse: we are now hitting LU-9230 more frequently.

            mhanafi Mahmoud Hanafi added a comment

            Hi Julien,

            Thank you for the information. I did not realize that Mahmoud had already applied the ldlm settings at our site. I was trying to find out whether I had missed any patch that I need to cherry-pick. Thanks.

            jaylan Jay Lan (Inactive) added a comment

            FYI: This sounds similar to an issue we hit: LU-11092. We fixed it by changing the ldlm lru_* settings.

            jwallior Julien Wallior added a comment

            Amir,

            We still hit this problem with LU-11079 applied late last week.

            Jay

            jaylan Jay Lan (Inactive) added a comment

            Ah, I figured out the problem that caused the compilation errors. The nasa_LU_11079.patch in the attachments is not the same as the patch in the b2_10 reviews. Applying the patch from the attachments to nas-2.11.0 addressed the merge and compilation problem.

            A backport to b2_11 is not needed. Thanks.

            jaylan Jay Lan (Inactive) added a comment

            People

              ashehata Amir Shehata (Inactive)
              mhanafi Mahmoud Hanafi