Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Environment: CentOS 7.6, kernel 3.10.0-957.1.3.el7_lustre.x86_64, MOFED 4.5
    • Severity: 3
    • 9223372036854775807

    Description

      This just occurred on a Lustre 2.12.0 client running Robinhood. It is a many-core machine with 2 x AMD EPYC 7401 CPUs, so 96 CPU threads total.

      [1488057.711176] CPU: 63 PID: 54246 Comm: ptlrpcd_07_06 Kdump: loaded Tainted: G           OEL ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
      [1488057.711177] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.3.6 04/20/2018
      [1488057.711178] task: ffff8abafc522080 ti: ffff8abaf4f54000 task.ti: ffff8abaf4f54000
      [1488057.711182] RIP: 0010:[<ffffffff897121e6>]  [<ffffffff897121e6>] native_queued_spin_lock_slowpath+0x126/0x200
      [1488057.711183] RSP: 0018:ffff8abaf4f57b78  EFLAGS: 00000246
      [1488057.711183] RAX: 0000000000000000 RBX: ffffffffc0ff5a97 RCX: 0000000001f90000
      [1488057.711184] RDX: ffff8a9affa9b780 RSI: 0000000000910000 RDI: ffff8adafa7b1b00
      [1488057.711185] RBP: ffff8abaf4f57b78 R08: ffff8afb3f9db780 R09: 0000000000000000
      [1488057.711185] R10: 0000000000000000 R11: 000000000000000f R12: ffff8aed569e2a00
      [1488057.711186] R13: 0005c4aa17c51860 R14: ffff8aed24f64b00 R15: 0000000000000007
      [1488057.711187] FS:  00007ea1d389b700(0000) GS:ffff8afb3f9c0000(0000) knlGS:0000000000000000
      [1488057.711188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1488057.711189] CR2: 00007ecf3e8c7ff6 CR3: 000000276ae10000 CR4: 00000000003407e0
      [1488057.711190] Call Trace:
      [1488057.711192]  [<ffffffff89d5bfcb>] queued_spin_lock_slowpath+0xb/0xf
      [1488057.711193]  [<ffffffff89d6a480>] _raw_spin_lock+0x20/0x30
      [1488057.711201]  [<ffffffffc0c23418>] cfs_percpt_lock+0x58/0x110 [libcfs]
      [1488057.711211]  [<ffffffffc0c889d8>] LNetMDUnlink+0x78/0x180 [lnet]
      [1488057.711250]  [<ffffffffc0f276bf>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
      [1488057.711287]  [<ffffffffc0f2c35e>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
      [1488057.711324]  [<ffffffffc0f2c83f>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
      [1488057.711362]  [<ffffffffc0f5cc5c>] ptlrpcd+0x28c/0x550 [ptlrpc]
      [1488057.711364]  [<ffffffff896d67b0>] ? wake_up_state+0x20/0x20
      [1488057.711402]  [<ffffffffc0f5c9d0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
      [1488057.711404]  [<ffffffff896c1c31>] kthread+0xd1/0xe0
      [1488057.711406]  [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
      [1488057.711408]  [<ffffffff89d74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1488057.711410]  [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
      

      A lot of CPUs were stuck in LNetMDUnlink. The server crashed with a hard lockup at the end. A vmcore is available on demand; vmcore-dmesg.txt is attached.
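
      For reference, a dump like this would typically be examined with the crash utility against the matching kernel-debuginfo vmlinux. A minimal sketch (the paths are illustrative and assume the debuginfo RPM is installed and the dump has been uncompressed):

      # open the dump with the debug vmlinux that matches the running kernel
      crash /usr/lib/debug/lib/modules/3.10.0-957.1.3.el7_lustre.x86_64/vmlinux \
            /var/crash/<timestamp>/vmcore
      # back-traces for all CPUs, to see how many are spinning in LNetMDUnlink
      crash> bt -a
      # list the ptlrpcd threads
      crash> ps | grep ptlrpcd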

      I kept the default lru_size/lru_max_age values (0/3900000) on this server, so I'll try reducing them as follows, matching our Lustre 2.10 Robinhood server for Oak (example commands below):

      • lru_size=100
      • lru_max_age=1200
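
      A minimal sketch of applying these at runtime on the client with lctl (the ldlm.namespaces.* wildcard is an assumption about the usual parameter paths, and the values revert on remount/reboot):

      # apply on the Lustre client
      lctl set_param ldlm.namespaces.*.lru_size=100
      lctl set_param ldlm.namespaces.*.lru_max_age=1200
      # verify
      lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age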

      Any other recommendations are welcome.

      Thanks!
      Stephane

          Activity

            [LU-11895] CPU lockup in LNetMDUnlink

            Uploaded original crash dump of fir-rbh01 to your FTP as vmcore-fir-rbh01-2019-01-28-15-10-41.gz

            The kernel used is 3.10.0-957.1.3.el7_lustre.x86_64 (even though it's not a Lustre server).
            kernel-debuginfo-3.10.0-957.1.3.el7_lustre.x86_64.rpm and -common should already be available on the FTP.

            The issue that occurred today is on a server with fewer CPUs (dual Xeon E5-2650 v2), running kernel 3.10.0-957.10.1.el7.x86_64 (unpatched).

            sthiell Stephane Thiell added a comment

            Hi Amir,

            Oops, sorry for being so slow on this one; we had only seen the problem once so far. But today a login node started doing the same thing, although it did recover by itself!

            Anyway, we had a lot of soft lockups like this one:

            [1230992.435041] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [ptlrpcd_00_10:42801]
            [1230992.444173] Modules linked in: binfmt_misc squashfs overlay(T) rpcsec_gss_krb5 nfsv4 dns_resolver fuse mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE)
             lnet(OE) libcfs(OE) xt_multiport ip_set_hash_ip nfsv3 nfs_acl nfs lockd grace ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_set ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJ
            ECT nf_reject_ipv6 xt_conntrack ip_set_hash_net ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw
             iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
            [1230992.524464]  rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dell_rbu cachefiles fscache sb
            _edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm mgag200 i2c_algo_bit irqbypass ttm drm_kms_helper crc32_pclmul syscopyarea ghash_clmulni_intel sysfillrect sysimgblt aesni_intel lrw fb_sys_fops 
            gf128mul drm iTCO_wdt iTCO_vendor_support glue_helper ablk_helper dcdbas cryptd cdc_ether usbnet mii drm_panel_orientation_quirks pcspkr sg ipmi_si joydev wmi lpc_ich ipmi_devintf mei_me mei ipmi_msghandler acpi
            _pad acpi_power_meter ext4 mbcache jbd2 loop auth_rpcgss sunrpc ip_tables xfs mlx4_ib(OE) ib_core(OE) sr_mod cdrom sd_mod crc_t10dif crct10dif_generic bnx2x ahci crct10dif_pclmul
            [1230992.602267]  mlx4_core(OE) crct10dif_common libahci mlx_compat(OE) crc32c_intel mdio devlink ptp libata megaraid_sas pps_core libcrc32c
            [1230992.614517] CPU: 8 PID: 42801 Comm: ptlrpcd_00_10 Kdump: loaded Tainted: G           OEL ------------ T 3.10.0-957.10.1.el7.x86_64 #1
            [1230992.628103] Hardware name: Dell Inc. PowerEdge R620/0GFKVD, BIOS 2.7.0 05/23/2018
            [1230992.636647] task: ffff8eb9f3334100 ti: ffff8eb9f47c4000 task.ti: ffff8eb9f47c4000
            [1230992.645192] RIP: 0010:[<ffffffff9cd12226>]  [<ffffffff9cd12226>] native_queued_spin_lock_slowpath+0x126/0x200
            [1230992.656465] RSP: 0018:ffff8eb9f47c7b78  EFLAGS: 00000246
            [1230992.662583] RAX: 0000000000000000 RBX: ffffffffc1014956 RCX: 0000000000410000
            [1230992.670738] RDX: ffff8eba1f09b780 RSI: 0000000000290001 RDI: ffff8eb21a6e0a00
            [1230992.678893] RBP: ffff8eb9f47c7b78 R08: ffff8eb21f91b780 R09: 0000000000000000
            [1230992.687047] R10: 0000000000000000 R11: 000000000000000f R12: ffff8eb01a572a00
            [1230992.695201] R13: 0005c991f29a0f30 R14: ffff8eb69a22b000 R15: 0000000000000007
            [1230992.703356] FS:  0000000000000000(0000) GS:ffff8eb21f900000(0000) knlGS:0000000000000000
            [1230992.712577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [1230992.719182] CR2: 00007fd45e897090 CR3: 0000000fb8c10000 CR4: 00000000001607e0
            [1230992.727339] Call Trace:
            [1230992.730262]  [<ffffffff9d35cfcb>] queued_spin_lock_slowpath+0xb/0xf
            [1230992.737453]  [<ffffffff9d36b480>] _raw_spin_lock+0x20/0x30
            [1230992.743785]  [<ffffffffc0bbe418>] cfs_percpt_lock+0x58/0x110 [libcfs]
            [1230992.751181]  [<ffffffffc0c3c9d8>] LNetMDUnlink+0x78/0x180 [lnet]
            [1230992.758106]  [<ffffffffc0fa685f>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
            [1230992.766280]  [<ffffffffc0fab4fe>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
            [1230992.774648]  [<ffffffffc0fab9df>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
            [1230992.782343]  [<ffffffffc0fdae8c>] ptlrpcd+0x28c/0x550 [ptlrpc]
            [1230992.789047]  [<ffffffff9ccd67f0>] ? wake_up_state+0x20/0x20
            [1230992.795477]  [<ffffffffc0fdac00>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
            [1230992.802956]  [<ffffffff9ccc1c71>] kthread+0xd1/0xe0
            [1230992.808590]  [<ffffffff9ccc1ba0>] ? insert_kthread_work+0x40/0x40
            [1230992.815585]  [<ffffffff9d375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [1230992.823063]  [<ffffffff9ccc1ba0>] ? insert_kthread_work+0x40/0x40
            

            When I checked at the time, the login node (Lustre client) was completely loaded, with all 32 of its CPUs at 100% (in ptlrpcd).

            Then, the client recovered with the following logs:

            [1231021.312842] Lustre: 42815:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1554765376/real 1554764903]  req@ffff8eb33a496000 x1629003522849584/t0(0) o103->fir-OST0007-osc-ffff8eba05ae6000@10.0.10.102@o2ib7:17/18 lens 328/224 e 0 to 1 dl 1554765478 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
            [1231021.345646] Lustre: 42815:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1744666 previous similar messages
            [1231161.240794] LNetError: 42764:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231161.254982] LNetError: 42764:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231171.900676] LNetError: 42765:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231171.914863] LNetError: 42765:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231247.531671] Lustre: fir-OST000d-osc-ffff8eba05ae6000: Connection to fir-OST000d (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete
            [1231247.550706] Lustre: Skipped 104 previous similar messages                  
            [1231374.205074] LNetError: 42762:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231374.219259] LNetError: 42762:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231416.399571] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231416.413765] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231486.750358] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231486.755230] LustreError: 167-0: fir-MDT0003-mdc-ffff8eba05ae6000: This client was evicted by fir-MDT0003; in progress operations using this service will fail.
            [1231486.755365] LustreError: 34127:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x28000f583:0x2cf0:0x0] error: rc = -5
            [1231486.755368] LustreError: 34127:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 1 previous similar message
            [1231486.805305] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231486.844678] Lustre: fir-MDT0003-mdc-ffff8eba05ae6000: Connection restored to 10.0.10.52@o2ib7 (at 10.0.10.52@o2ib7)
            [1231486.856531] Lustre: Skipped 152 previous similar messages                  
            [1231555.033198] Lustre: Evicted from MGS (at 10.0.10.51@o2ib7) after server handle changed from 0x5c08f3702ce50dae to 0x5c08f37030f7e96c
            [1231557.546491] Lustre: Evicted from fir-MDT0000_UUID (at 10.0.10.51@o2ib7) after server handle changed from 0x5c08f3702ce68e45 to 0x5c08f37030f7e957
            [1231557.561265] LustreError: 167-0: fir-MDT0000-mdc-ffff8eba05ae6000: This client was evicted by fir-MDT0000; in progress operations using this service will fail.
            [1231557.578742] LustreError: 31049:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240000406:0x138:0x0] error: rc = -5
            [1231557.592241] LustreError: 31049:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 9 previous similar messages
            [1231558.858554] LustreError: 32190:0:(file.c:216:ll_close_inode_openhandle()) fir-clilmv-ffff8eba05ae6000: inode [0x20000fd69:0x154bf:0x0] mdc close failed: rc = -5
            [1231566.830744] LustreError: 32327:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240000406:0x138:0x0] error: rc = -5
            [1231566.844246] LustreError: 32327:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
            [1231596.575109] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail
            [1231621.286801] Lustre: 42791:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1554766045/real 0]  req@ffff8eace510a900 x1629003644418896/t0(0) o103->fir-OST002b-osc-ffff8eba05ae6000@10.0.10.108@o2ib7:17/18 lens 328/224 e 0 to 1 dl 1554766077 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [1231621.318732] Lustre: 42791:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1009312 previous similar messages
            [1231723.524429] LustreError: 43001:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
            [1231873.637411] Lustre: DEBUG MARKER: Mon Apr  8 16:32:27 2019   
            

            We are not running with the patches from LU-11100 for now. I'll see if I can find the original crash dump.

            sthiell Stephane Thiell added a comment

            This appears similar to: https://jira.whamcloud.com/browse/LU-11100

            Would you be able to try the patches suggested there?

            In the meantime, if you're able to make the vmcore and vmlinux available so I can take a further look, that would be helpful.

            ashehata Amir Shehata (Inactive) added a comment
            [root@fir-rbh01 ~]# lctl get_param cpu*
            cpu_partition_distance=
            0	: 0:10 1:16 2:16 3:16 4:28 5:28 6:22 7:28
            1	: 0:16 1:10 2:16 3:16 4:28 5:28 6:28 7:22
            2	: 0:16 1:16 2:10 3:16 4:22 5:28 6:28 7:28
            3	: 0:16 1:16 2:16 3:10 4:28 5:22 6:28 7:28
            4	: 0:28 1:28 2:22 3:28 4:10 5:16 6:16 7:16
            5	: 0:28 1:28 2:28 3:22 4:16 5:10 6:16 7:16
            6	: 0:22 1:28 2:28 3:28 4:16 5:16 6:10 7:16
            7	: 0:28 1:22 2:28 3:28 4:16 5:16 6:16 7:10
            cpu_partition_table=
            0	: 0 8 16 24 32 40 48 56 64 72 80 88
            1	: 2 10 18 26 34 42 50 58 66 74 82 90
            2	: 4 12 20 28 36 44 52 60 68 76 84 92
            3	: 6 14 22 30 38 46 54 62 70 78 86 94
            4	: 1 9 17 25 33 41 49 57 65 73 81 89
            5	: 3 11 19 27 35 43 51 59 67 75 83 91
            6	: 5 13 21 29 37 45 53 61 69 77 85 93
            7	: 7 15 23 31 39 47 55 63 71 79 87 95
            

            I think that makes sense, as each EPYC CPU socket has 4 NUMA domains.

            [root@fir-rbh01 ~]# numactl --hardware
            available: 8 nodes (0-7)
            node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
            node 0 size: 65213 MB
            node 0 free: 41206 MB
            node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
            node 1 size: 65535 MB
            node 1 free: 34082 MB
            node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
            node 2 size: 65535 MB
            node 2 free: 63939 MB
            node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
            node 3 size: 65535 MB
            node 3 free: 63611 MB
            node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
            node 4 size: 65535 MB
            node 4 free: 63923 MB
            node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
            node 5 size: 65535 MB
            node 5 free: 63569 MB
            node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
            node 6 size: 65535 MB
            node 6 free: 63449 MB
            node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
            node 7 size: 65535 MB
            node 7 free: 63535 MB
            node distances:
            node   0   1   2   3   4   5   6   7 
              0:  10  16  16  16  28  28  22  28 
              1:  16  10  16  16  28  28  28  22 
              2:  16  16  10  16  22  28  28  28 
              3:  16  16  16  10  28  22  28  28 
              4:  28  28  22  28  10  16  16  16 
              5:  28  28  28  22  16  10  16  16 
              6:  22  28  28  28  16  16  10  16 
              7:  28  22  28  28  16  16  16  10
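
            For reference (not a recommendation from this ticket, just a sketch of where the knob lives): the CPT layout shown above can be overridden at module load time through the standard libcfs options, for example to experiment with fewer partitions. The values below are purely illustrative:

            # /etc/modprobe.d/libcfs.conf (takes effect on next module load)
            # collapse the auto-detected 8 CPTs (one per NUMA node) into 2:
            options libcfs cpu_npartitions=2
            # or pin explicit CPU ranges per partition instead:
            # options libcfs cpu_pattern="0[0-47] 1[48-95]"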
            
            sthiell Stephane Thiell added a comment

            How many CPTs on this system? lctl get_param cpu*

            adilger Andreas Dilger added a comment

            Amir

            Could you please investigate?

            Thanks

            Peter

            pjones Peter Jones added a comment

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 9
