-- Logs begin at Fri 2019-10-25 09:03:32 PDT, end at Mon 2019-11-04 11:22:14 PST. --
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys cpuset
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys cpu
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys cpuacct
Oct 25 09:03:32 sh-103-53 kernel: Linux version 3.10.0-957.27.2.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Mon Jul 29 17:46:05 UTC 2019
Oct 25 09:03:32 sh-103-53 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-957.27.2.el7.x86_64 root=UUID=1dd663c3-13ea-4d40-9f79-d517c89819f2 ro transparent_hugepage=madvise namespace.unpriv_enable=1 user_namespace.enable=1 nopti nospectre_v2 spectre_v2_user=off spec_store_bypass_disable=off l1tf=off crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8
Oct 25 09:03:32 sh-103-53 kernel: e820: BIOS-provided physical RAM map:
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009cfff] usable
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x000000000009d000-0x000000000009ffff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x0000000000100000-0x000000005ddfefff] usable
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x000000005ddff000-0x000000006cffefff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x000000006cfff000-0x000000006effefff] ACPI NVS
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x000000006efff000-0x000000006f7fefff] ACPI data
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x00000000fec80000-0x00000000fed00fff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
Oct 25 09:03:32 sh-103-53 kernel: BIOS-e820: [mem 0x0000000100000000-0x000000303fffffff] usable
Oct 25 09:03:32 sh-103-53 kernel: NX (Execute Disable) protection: active
Oct 25 09:03:32 sh-103-53 kernel: SMBIOS 3.2 present.
Oct 25 09:03:32 sh-103-53 kernel: DMI: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019
Oct 25 09:03:32 sh-103-53 kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Oct 25 09:03:32 sh-103-53 kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
Oct 25 09:03:32 sh-103-53 kernel: e820: last_pfn = 0x3040000 max_arch_pfn = 0x400000000
Oct 25 09:03:32 sh-103-53 kernel: MTRR default type: uncachable
Oct 25 09:03:32 sh-103-53 kernel: MTRR fixed ranges enabled:
Oct 25 09:03:32 sh-103-53 kernel: 00000-9FFFF write-back
Oct 25 09:03:32 sh-103-53 kernel: A0000-BFFFF uncachable
Oct 25 09:03:32 sh-103-53 kernel: C0000-FFFFF write-protect
Oct 25 09:03:32 sh-103-53 kernel: MTRR variable ranges enabled:
Oct 25 09:03:32 sh-103-53 kernel: 0 base 000000000000 mask 3FE000000000 write-back
Oct 25 09:03:32 sh-103-53 kernel: 1 base 002000000000 mask 3FF000000000 write-back
Oct 25 09:03:32 sh-103-53 kernel: 2 base 003000000000 mask 3FFFC0000000 write-back
Oct 25 09:03:32 sh-103-53 kernel: 3 base 000080000000 mask 3FFF80000000 uncachable
Oct 25 09:03:32 sh-103-53 kernel: 4 base 00007F000000 mask 3FFFFF000000 uncachable
Oct 25 09:03:32 sh-103-53 kernel: 5 disabled
Oct 25 09:03:32 sh-103-53 kernel: 6 disabled
Oct 25 09:03:32 sh-103-53 kernel: 7 disabled
Oct 25 09:03:32 sh-103-53 kernel: 8 disabled
Oct 25 09:03:32 sh-103-53 kernel: 9 disabled
Oct 25 09:03:32 sh-103-53 kernel: PAT configuration [0-7]: WB WC UC- UC WB WP UC- UC
Oct 25 09:03:32 sh-103-53 kernel: original variable MTRRs
Oct 25 09:03:32 sh-103-53 kernel: reg 0, base: 0GB, range: 128GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 1, base: 128GB, range: 64GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 2, base: 192GB, range: 1GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 3, base: 2GB, range: 2GB, type UC
Oct 25 09:03:32 sh-103-53 kernel: reg 4, base: 2032MB, range: 16MB, type UC
Oct 25 09:03:32 sh-103-53 kernel: total RAM covered: 195568M
Oct 25 09:03:32 sh-103-53 kernel: Found optimal setting for mtrr clean up
Oct 25 09:03:32 sh-103-53 kernel: gran_size: 64K chunk_size: 32M num_reg: 9 lose cover RAM: 0G
Oct 25 09:03:32 sh-103-53 kernel: New variable MTRRs
Oct 25 09:03:32 sh-103-53 kernel: reg 0, base: 0GB, range: 2GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 1, base: 2032MB, range: 16MB, type UC
Oct 25 09:03:32 sh-103-53 kernel: reg 2, base: 4GB, range: 4GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 3, base: 8GB, range: 8GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 4, base: 16GB, range: 16GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 5, base: 32GB, range: 32GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 6, base: 64GB, range: 64GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 7, base: 128GB, range: 64GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: reg 8, base: 192GB, range: 1GB, type WB
Oct 25 09:03:32 sh-103-53 kernel: e820: update [mem 0x7f000000-0xffffffff] usable ==> reserved
Oct 25 09:03:32 sh-103-53 kernel: e820: last_pfn = 0x6f800 max_arch_pfn = 0x400000000
Oct 25 09:03:32 sh-103-53 kernel: Base memory trampoline at [ffff975240097000] 97000 size 24576
Oct 25 09:03:32 sh-103-53 kernel: Using GB pages for direct mapping
Oct 25 09:03:32 sh-103-53 kernel: BRK [0xe63053000, 0xe63053fff] PGTABLE
Oct 25 09:03:32 sh-103-53 kernel: BRK [0xe63054000, 0xe63054fff] PGTABLE
Oct 25 09:03:32 sh-103-53 kernel: BRK [0xe63055000, 0xe63055fff] PGTABLE
Oct 25 09:03:32 sh-103-53 kernel: BRK [0xe63056000, 0xe63056fff] PGTABLE
Oct 25 09:03:32 sh-103-53 kernel: BRK [0xe63057000, 0xe63057fff] PGTABLE
Oct 25 09:03:32 sh-103-53 kernel: BRK [0xe63058000, 0xe63058fff] PGTABLE
Oct 25 09:03:32 sh-103-53 kernel: BRK [0xe63059000, 0xe63059fff] PGTABLE
Oct 25 09:03:32 sh-103-53 kernel: BRK [0xe6305a000, 0xe6305afff] PGTABLE
Oct 25 09:03:32 sh-103-53 kernel: RAMDISK: [mem 0x35a03000-0x36cf9fff]
Oct 25 09:03:32 sh-103-53 kernel: Early table checksum verification disabled
Oct 25 09:03:32 sh-103-53 kernel: ACPI: RSDP 00000000000fe320 00024 (v02 DELL )
Oct 25 09:03:32 sh-103-53 kernel: ACPI: XSDT 000000006f40b188 000EC (v01 DELL PE_SC3 00000000 01000013)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: FACP 000000006f7f8000 00114 (v06 DELL PE_SC3 00000000 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: DSDT 000000006f50c000 2DC584 (v02 DELL PE_SC3 00000003 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: FACS 000000006edee000 00040
Oct 25 09:03:32 sh-103-53 kernel: ACPI: SSDT 000000006f7fc000 0046C (v02 INTEL ADDRXLAT 00000001 INTL 20180508)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: MCEJ 000000006f7fb000 00130 (v01 DELL PE_SC3 00000002 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: WD__ 000000006f7fa000 00134 (v01 DELL PE_SC3 00000001 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: SLIC 000000006f7f9000 00024 (v01 DELL PE_SC3 00000001 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: HPET 000000006f7f7000 00038 (v01 DELL PE_SC3 00000001 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: APIC 000000006f7f5000 016DE (v04 DELL PE_SC3 00000000 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: MCFG 000000006f7f4000 0003C (v01 DELL PE_SC3 00000001 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: MIGT 000000006f7f3000 00040 (v01 DELL PE_SC3 00000000 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: MSCT 000000006f7f2000 00090 (v01 DELL PE_SC3 00000001 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: PCAT 000000006f7f1000 00088 (v02 DELL PE_SC3 00000002 DELL 00000001)
Oct 25 09:03:32 sh-103-53 kernel: ACPI: PCCT 000000006f7f0000 0006E (v01 DELL PE_SC3 00000002 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: RASF 000000006f7ef000 00030 (v01 DELL PE_SC3 00000001 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: SLIT 000000006f7ee000 0042C (v01 DELL PE_SC3 00000001 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: SRAT 000000006f7eb000 02D30 (v03 DELL PE_SC3 00000002 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: SVOS 000000006f7ea000 00032 (v01 DELL PE_SC3 00000000 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: WSMT 000000006f7e9000 00028 (v01 DELL PE_SC3 00000000 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: OEM4 000000006f45e000 AD1C1 (v02 INTEL CPU CST 00003000 INTL 20180508) Oct 25 09:03:32 sh-103-53 kernel: ACPI: SSDT 000000006f426000 37465 (v02 INTEL SSDT PM 00004000 INTL 20180508) Oct 25 09:03:32 sh-103-53 kernel: ACPI: SSDT 000000006f40c000 00A1F (v02 DELL PE_SC3 00000000 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: SSDT 000000006f422000 0357F (v02 INTEL SpsNm 00000002 INTL 20180508) Oct 25 09:03:32 sh-103-53 kernel: ACPI: HEST 000000006f7fd000 0017C (v01 DELL PE_SC3 00000002 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: BERT 000000006f421000 00030 (v01 DELL PE_SC3 00000002 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: ERST 000000006f420000 00230 (v01 DELL PE_SC3 00000002 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: EINJ 000000006f41f000 00150 (v01 DELL PE_SC3 00000002 DELL 00000001) Oct 25 09:03:32 sh-103-53 kernel: ACPI: Local APIC address 0xfee00000 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x20 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x0a -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x2a -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x02 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 
0x22 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x08 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x28 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x04 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x24 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x06 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x26 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x10 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x30 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x1a -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x3a -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x12 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x32 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x18 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x38 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x14 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x34 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 0 -> APIC 0x16 -> Node 0 Oct 25 09:03:32 sh-103-53 kernel: SRAT: PXM 1 -> APIC 0x36 -> Node 1 Oct 25 09:03:32 sh-103-53 kernel: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff] Oct 25 09:03:32 sh-103-53 kernel: SRAT: Node 0 PXM 0 [mem 0x100000000-0x183fffffff] Oct 25 09:03:32 sh-103-53 kernel: SRAT: Node 1 PXM 1 [mem 0x1840000000-0x303fffffff] Oct 25 09:03:32 sh-103-53 kernel: NUMA: Initialized distance table, cnt=2 Oct 25 09:03:32 sh-103-53 kernel: NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x183fffffff] -> [mem 0x00000000-0x183fffffff] Oct 25 09:03:32 sh-103-53 kernel: NODE_DATA(0) allocated [mem 0x183ffd9000-0x183fffffff] Oct 25 09:03:32 sh-103-53 kernel: NODE_DATA(1) allocated [mem 0x303ffd8000-0x303fffefff] Oct 25 09:03:32 sh-103-53 kernel: Reserving 172MB of memory at 672MB for crashkernel (System RAM: 
195037MB) Oct 25 09:03:32 sh-103-53 kernel: Zone ranges: Oct 25 09:03:32 sh-103-53 kernel: DMA [mem 0x00001000-0x00ffffff] Oct 25 09:03:32 sh-103-53 kernel: DMA32 [mem 0x01000000-0xffffffff] Oct 25 09:03:32 sh-103-53 kernel: Normal [mem 0x100000000-0x303fffffff] Oct 25 09:03:32 sh-103-53 kernel: Movable zone start for each node Oct 25 09:03:32 sh-103-53 kernel: Early memory node ranges Oct 25 09:03:32 sh-103-53 kernel: node 0: [mem 0x00001000-0x0009cfff] Oct 25 09:03:32 sh-103-53 kernel: node 0: [mem 0x00100000-0x5ddfefff] Oct 25 09:03:32 sh-103-53 kernel: node 0: [mem 0x6f7ff000-0x6f7fffff] Oct 25 09:03:32 sh-103-53 kernel: node 0: [mem 0x100000000-0x183fffffff] Oct 25 09:03:32 sh-103-53 kernel: node 1: [mem 0x1840000000-0x303fffffff] Oct 25 09:03:32 sh-103-53 kernel: Initmem setup node 0 [mem 0x00001000-0x183fffffff] Oct 25 09:03:32 sh-103-53 kernel: On node 0 totalpages: 24763804 Oct 25 09:03:32 sh-103-53 kernel: DMA zone: 64 pages used for memmap Oct 25 09:03:32 sh-103-53 kernel: DMA zone: 21 pages reserved Oct 25 09:03:32 sh-103-53 kernel: DMA zone: 3996 pages, LIFO batch:0 Oct 25 09:03:32 sh-103-53 kernel: DMA32 zone: 5944 pages used for memmap Oct 25 09:03:32 sh-103-53 kernel: DMA32 zone: 380416 pages, LIFO batch:31 Oct 25 09:03:32 sh-103-53 kernel: Normal zone: 380928 pages used for memmap Oct 25 09:03:32 sh-103-53 kernel: Normal zone: 24379392 pages, LIFO batch:31 Oct 25 09:03:32 sh-103-53 kernel: Initmem setup node 1 [mem 0x1840000000-0x303fffffff] Oct 25 09:03:32 sh-103-53 kernel: On node 1 totalpages: 25165824 Oct 25 09:03:32 sh-103-53 kernel: Normal zone: 393216 pages used for memmap Oct 25 09:03:32 sh-103-53 kernel: Normal zone: 25165824 pages, LIFO batch:31 Oct 25 09:03:32 sh-103-53 kernel: ACPI: PM-Timer IO Port: 0x508 Oct 25 09:03:32 sh-103-53 kernel: ACPI: Local APIC address 0xfee00000 Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x20] lapic_id[0x20] 
enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x0a] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x21] lapic_id[0x2a] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x22] lapic_id[0x22] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x03] lapic_id[0x08] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x23] lapic_id[0x28] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x24] lapic_id[0x24] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x05] lapic_id[0x06] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x25] lapic_id[0x26] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x06] lapic_id[0x10] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x26] lapic_id[0x30] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x07] lapic_id[0x1a] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x27] lapic_id[0x3a] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x08] lapic_id[0x12] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x28] lapic_id[0x32] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x09] lapic_id[0x18] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x29] lapic_id[0x38] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x14] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x2a] lapic_id[0x34] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x16] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0x2b] lapic_id[0x36] enabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: 
ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC 
(acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] 
lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] 
disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 
09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 
kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC 
(acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] 
lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] 
disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x00] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x01] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x02] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x03] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x04] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x05] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x06] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x07] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x08] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x09] 
disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x0a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x0b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x0c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x0d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x0e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x0f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x10] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x11] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x12] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x13] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x14] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x15] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x16] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x17] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x18] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x19] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x1a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x1b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x1c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x1d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x1e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x1f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC 
(apic_id[0xffffffff] uid[0x20] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x21] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x22] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x23] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x24] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x25] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x26] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x27] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x28] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x29] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x2a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x2b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x2c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x2d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x2e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x2f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x30] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x31] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x32] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x33] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x34] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x35] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x36] disabled) Oct 25 09:03:32 
sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x37] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x38] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x39] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x3a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x3b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x3c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x3d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x3e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x3f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x40] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x41] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x42] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x43] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x44] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x45] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x46] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x47] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x48] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x49] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x4a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x4b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x4c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x4d] 
disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x4e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x4f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x50] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x51] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x52] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x53] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x54] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x55] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x56] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x57] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x58] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x59] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x5a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x5b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x5c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x5d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x5e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x5f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x60] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x61] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x62] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x63] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC 
(apic_id[0xffffffff] uid[0x64] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x65] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x66] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x67] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x68] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x69] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x6a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x6b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x6c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x6d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x6e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x6f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x70] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x71] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x72] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x73] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x74] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x75] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x76] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x77] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x78] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x79] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x7a] disabled) Oct 25 09:03:32 
sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x7b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x7c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x7d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x7e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x7f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x80] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x81] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x82] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x83] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x84] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x85] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x86] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x87] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x88] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x89] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x8a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x8b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x8c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x8d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x8e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x8f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x90] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x91] 
disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x92] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x93] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x94] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x95] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x96] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x97] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x98] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x99] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x9a] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x9b] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x9c] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x9d] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x9e] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0x9f] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa0] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa1] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa2] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa3] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa4] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa5] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa6] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa7] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC 
(apic_id[0xffffffff] uid[0xa8] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xa9] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xaa] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xab] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xac] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xad] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xae] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xaf] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb0] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb1] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb2] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb3] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb4] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb5] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb6] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb7] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb8] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xb9] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xba] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xbb] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xbc] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xbd] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xbe] disabled) Oct 25 09:03:32 
sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xbf] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc0] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc1] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc2] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc3] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc4] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc5] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc6] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc7] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc8] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xc9] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xca] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xcb] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xcc] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xcd] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xce] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xcf] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd0] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd1] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd2] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd3] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd4] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd5] 
disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd6] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd7] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd8] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xd9] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xda] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xdb] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xdc] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xdd] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xde] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC (apic_id[0xffffffff] uid[0xdf] disabled) Oct 25 09:03:32 sh-103-53 kernel: ACPI: X2APIC_NMI (uid[0xffffffff] high level lint[0x1]) Oct 25 09:03:32 sh-103-53 kernel: ACPI: LAPIC_NMI (acpi_id[0xff] high level lint[0x1]) Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23 Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x09] address[0xfec01000] gsi_base[24]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-31 Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x0a] address[0xfec08000] gsi_base[32]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[2]: apic_id 10, version 32, address 0xfec08000, GSI 32-39 Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x0b] address[0xfec10000] gsi_base[40]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[3]: apic_id 11, version 32, address 0xfec10000, GSI 40-47 Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x0c] address[0xfec18000] gsi_base[48]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[4]: apic_id 12, version 32, address 
0xfec18000, GSI 48-55 Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x0f] address[0xfec20000] gsi_base[72]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[5]: apic_id 15, version 32, address 0xfec20000, GSI 72-79 Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x10] address[0xfec28000] gsi_base[80]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[6]: apic_id 16, version 32, address 0xfec28000, GSI 80-87 Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x11] address[0xfec30000] gsi_base[88]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[7]: apic_id 17, version 32, address 0xfec30000, GSI 88-95 Oct 25 09:03:32 sh-103-53 kernel: ACPI: IOAPIC (id[0x12] address[0xfec38000] gsi_base[96]) Oct 25 09:03:32 sh-103-53 kernel: IOAPIC[8]: apic_id 18, version 32, address 0xfec38000, GSI 96-103 Oct 25 09:03:32 sh-103-53 kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) Oct 25 09:03:32 sh-103-53 kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) Oct 25 09:03:32 sh-103-53 kernel: ACPI: IRQ0 used by override. Oct 25 09:03:32 sh-103-53 kernel: ACPI: IRQ9 used by override. 
Oct 25 09:03:32 sh-103-53 kernel: Using ACPI (MADT) for SMP configuration information
Oct 25 09:03:32 sh-103-53 kernel: ACPI: HPET id: 0x8086a701 base: 0xfed00000
Oct 25 09:03:32 sh-103-53 kernel: smpboot: Allowing 448 CPUs, 424 hotplug CPUs
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0x0009d000-0x0009ffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0x000a0000-0x000dffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0x000e0000-0x000fffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0x5ddff000-0x6cffefff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0x6cfff000-0x6effefff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0x6efff000-0x6f7fefff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0x6f800000-0x8fffffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0x90000000-0xfcffffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xfd000000-0xfe7fffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xfe800000-0xfebfffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xfec00000-0xfec00fff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xfec01000-0xfec7ffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xfec80000-0xfed00fff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xfed01000-0xfed3ffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xfed40000-0xfed44fff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xfed45000-0xfeffffff]
Oct 25 09:03:32 sh-103-53 kernel: PM: Registered nosave memory: [mem 0xff000000-0xffffffff]
Oct 25 09:03:32 sh-103-53 kernel: e820: [mem 0x90000000-0xfcffffff] available for PCI devices
Oct 25 09:03:32 sh-103-53 kernel: Booting paravirtualized kernel on bare hardware
Oct 25 09:03:32
sh-103-53 kernel: setup_percpu: NR_CPUS:5120 nr_cpumask_bits:448 nr_cpu_ids:448 nr_node_ids:2 Oct 25 09:03:32 sh-103-53 kernel: PERCPU: Embedded 38 pages/cpu @ffff976a1dc00000 s118784 r8192 d28672 u262144 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: s118784 r8192 d28672 u262144 alloc=1*2097152 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 000 002 004 006 008 010 012 014 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 016 018 020 022 024 026 028 030 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 032 034 036 038 040 042 044 046 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 048 050 052 054 056 058 060 062 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 064 066 068 070 072 074 076 078 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 080 082 084 086 088 090 092 094 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 096 098 100 102 104 106 108 110 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 112 114 116 118 120 122 124 126 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 128 130 132 134 136 138 140 142 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 144 146 148 150 152 154 156 158 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 160 162 164 166 168 170 172 174 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 176 178 180 182 184 186 188 190 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 192 194 196 198 200 202 204 206 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 208 210 212 214 216 218 220 222 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 224 226 228 230 232 234 236 238 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 240 242 244 246 248 250 252 254 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 256 258 260 262 264 266 268 270 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 272 274 276 278 280 282 284 286 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 288 290 292 294 296 298 300 302 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 304 306 308 310 312 314 316 318 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 320 322 324 
326 328 330 332 334 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 336 338 340 342 344 346 348 350 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 352 354 356 358 360 362 364 366 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 368 370 372 374 376 378 380 382 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 384 386 388 390 392 394 396 398 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 400 402 404 406 408 410 412 414 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 416 418 420 422 424 426 428 430 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [0] 432 434 436 438 440 442 444 446 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 001 003 005 007 009 011 013 015 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 017 019 021 023 025 027 029 031 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 033 035 037 039 041 043 045 047 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 049 051 053 055 057 059 061 063 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 065 067 069 071 073 075 077 079 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 081 083 085 087 089 091 093 095 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 097 099 101 103 105 107 109 111 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 113 115 117 119 121 123 125 127 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 129 131 133 135 137 139 141 143 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 145 147 149 151 153 155 157 159 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 161 163 165 167 169 171 173 175 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 177 179 181 183 185 187 189 191 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 193 195 197 199 201 203 205 207 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 209 211 213 215 217 219 221 223 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 225 227 229 231 233 235 237 239 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 241 243 245 247 249 251 253 255 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 257 259 261 263 265 267 269 271 Oct 25 
09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 273 275 277 279 281 283 285 287 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 289 291 293 295 297 299 301 303 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 305 307 309 311 313 315 317 319 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 321 323 325 327 329 331 333 335 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 337 339 341 343 345 347 349 351 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 353 355 357 359 361 363 365 367 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 369 371 373 375 377 379 381 383 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 385 387 389 391 393 395 397 399 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 401 403 405 407 409 411 413 415 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 417 419 421 423 425 427 429 431 Oct 25 09:03:32 sh-103-53 kernel: pcpu-alloc: [1] 433 435 437 439 441 443 445 447 Oct 25 09:03:32 sh-103-53 kernel: Built 2 zonelists in Zone order, mobility grouping on. Total pages: 49149455 Oct 25 09:03:32 sh-103-53 kernel: Policy zone: Normal Oct 25 09:03:32 sh-103-53 kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-957.27.2.el7.x86_64 root=UUID=1dd663c3-13ea-4d40-9f79-d517c89819f2 ro transparent_hugepage=madvise namespace.unpriv_enable=1 user_namespace.enable=1 nopti nospectre_v2 spectre_v2_user=off spec_store_bypass_disable=off l1tf=off crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8 Oct 25 09:03:32 sh-103-53 kernel: PID hash table entries: 4096 (order: 3, 32768 bytes) Oct 25 09:03:32 sh-103-53 kernel: x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100 Oct 25 09:03:32 sh-103-53 kernel: x86/fpu: xstate_offset[3]: 03c0, xstate_sizes[3]: 0040 Oct 25 09:03:32 sh-103-53 kernel: x86/fpu: xstate_offset[4]: 0400, xstate_sizes[4]: 0040 Oct 25 09:03:32 sh-103-53 kernel: x86/fpu: xstate_offset[5]: 0440, xstate_sizes[5]: 0040 Oct 25 09:03:32 sh-103-53 kernel: x86/fpu: xstate_offset[6]: 0480, xstate_sizes[6]: 0200 Oct 25 09:03:32 sh-103-53 kernel: 
x86/fpu: xstate_offset[7]: 0680, xstate_sizes[7]: 0400 Oct 25 09:03:32 sh-103-53 kernel: x86/fpu: xstate_offset[8]: 0000, xstate_sizes[8]: 0080 Oct 25 09:03:32 sh-103-53 kernel: x86/fpu: xstate_offset[9]: 0a80, xstate_sizes[9]: 0008 Oct 25 09:03:32 sh-103-53 kernel: xsave: enabled xstate_bv 0x2ff, cntxt size 0xa88 using standard form Oct 25 09:03:32 sh-103-53 kernel: Memory: 5470752k/202375168k available (7672k kernel code, 2656656k absent, 3472572k reserved, 6049k data, 1876k init) Oct 25 09:03:32 sh-103-53 kernel: SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=448, Nodes=2 Oct 25 09:03:32 sh-103-53 kernel: Hierarchical RCU implementation. Oct 25 09:03:32 sh-103-53 kernel: RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=448. Oct 25 09:03:32 sh-103-53 kernel: NR_IRQS:327936 nr_irqs:5368 0 Oct 25 09:03:32 sh-103-53 kernel: Console: colour VGA+ 80x25 Oct 25 09:03:32 sh-103-53 kernel: console [ttyS0] enabled Oct 25 09:03:32 sh-103-53 kernel: allocated 799539200 bytes of page_cgroup Oct 25 09:03:32 sh-103-53 kernel: please try 'cgroup_disable=memory' option if you don't want memory cgroups Oct 25 09:03:32 sh-103-53 kernel: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl Oct 25 09:03:32 sh-103-53 kernel: hpet clockevent registered Oct 25 09:03:32 sh-103-53 kernel: tsc: Detected 2300.000 MHz processor Oct 25 09:03:32 sh-103-53 kernel: Calibrating delay loop (skipped), value calculated using timer frequency.. 4600.00 BogoMIPS (lpj=2300000) Oct 25 09:03:32 sh-103-53 kernel: pid_max: default: 458752 minimum: 3584 Oct 25 09:03:32 sh-103-53 kernel: Security Framework initialized Oct 25 09:03:32 sh-103-53 kernel: SELinux: Initializing. Oct 25 09:03:32 sh-103-53 kernel: SELinux: Starting in permissive mode Oct 25 09:03:32 sh-103-53 kernel: Yama: becoming mindful. 
Oct 25 09:03:32 sh-103-53 kernel: Dentry cache hash table entries: 33554432 (order: 16, 268435456 bytes)
Oct 25 09:03:32 sh-103-53 kernel: Inode-cache hash table entries: 16777216 (order: 15, 134217728 bytes)
Oct 25 09:03:32 sh-103-53 kernel: Mount-cache hash table entries: 524288 (order: 10, 4194304 bytes)
Oct 25 09:03:32 sh-103-53 kernel: Mountpoint-cache hash table entries: 524288 (order: 10, 4194304 bytes)
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys memory
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys devices
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys freezer
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys net_cls
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys blkio
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys perf_event
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys hugetlb
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys pids
Oct 25 09:03:32 sh-103-53 kernel: Initializing cgroup subsys net_prio
Oct 25 09:03:32 sh-103-53 kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Oct 25 09:03:32 sh-103-53 kernel: ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
Oct 25 09:03:32 sh-103-53 kernel: CPU0: Thermal monitoring enabled (TM1)
Oct 25 09:03:32 sh-103-53 kernel: Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
Oct 25 09:03:32 sh-103-53 kernel: Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0
Oct 25 09:03:32 sh-103-53 kernel: tlb_flushall_shift: 6
Oct 25 09:03:32 sh-103-53 kernel: Speculative Store Bypass: Vulnerable
Oct 25 09:03:32 sh-103-53 kernel: FEATURE SPEC_CTRL Present
Oct 25 09:03:32 sh-103-53 kernel: FEATURE IBPB_SUPPORT Present
Oct 25 09:03:32 sh-103-53 kernel: Spectre V2 : Enabling Indirect Branch Prediction Barrier
Oct 25 09:03:32 sh-103-53 kernel: Spectre V2 : Vulnerable
Oct 25 09:03:32 sh-103-53 kernel: MDS: Mitigation: Clear CPU buffers
Oct 25 09:03:32 sh-103-53 kernel: Freeing SMP alternatives: 28k freed
Oct 25 09:03:32 sh-103-53 kernel: ACPI: Core revision 20130517
Oct 25 09:03:32 sh-103-53 kernel: ACPI: All ACPI Tables successfully acquired
Oct 25 09:03:32 sh-103-53 kernel: ftrace: allocating 29213 entries in 115 pages
Oct 25 09:03:32 sh-103-53 kernel: IRQ remapping doesn't support X2APIC mode, disable x2apic.
Oct 25 09:03:32 sh-103-53 kernel: Switched APIC routing to physical flat.
Oct 25 09:03:32 sh-103-53 kernel: ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
Oct 25 09:03:32 sh-103-53 kernel: smpboot: CPU0: Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (fam: 06, model: 55, stepping: 04)
Oct 25 09:03:32 sh-103-53 kernel: TSC deadline timer enabled
Oct 25 09:03:32 sh-103-53 kernel: Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
Oct 25 09:03:32 sh-103-53 kernel: ... version: 4
Oct 25 09:03:32 sh-103-53 kernel: ... bit width: 48
Oct 25 09:03:32 sh-103-53 kernel: ... generic registers: 8
Oct 25 09:03:32 sh-103-53 kernel: ... value mask: 0000ffffffffffff
Oct 25 09:03:32 sh-103-53 kernel: ... max period: 00007fffffffffff
Oct 25 09:03:32 sh-103-53 kernel: ... fixed-purpose events: 3
Oct 25 09:03:32 sh-103-53 kernel: ... event mask: 00000007000000ff
Oct 25 09:03:32 sh-103-53 kernel: NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #1 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #2 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #3 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #4 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #5 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #6 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #7 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #8 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #9 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #10 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #11 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #12 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #13 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #14 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #15 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #16 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #17 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #18 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #19 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #20 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #21 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 0, Processors #22 OK Oct 25 09:03:32 sh-103-53 kernel: smpboot: Booting Node 1, Processors #23 Oct 25 09:03:32 sh-103-53 kernel: Brought up 24 CPUs Oct 25 09:03:32 sh-103-53 kernel: smpboot: Max logical packages: 38 Oct 25 09:03:32 sh-103-53 kernel: smpboot: Total of 24 processors activated (110456.35 BogoMIPS) Oct 25 09:03:32 
sh-103-53 kernel: node 0 initialised, 23454177 pages in 370ms Oct 25 09:03:32 sh-103-53 kernel: node 1 initialised, 24239616 pages in 384ms Oct 25 09:03:32 sh-103-53 kernel: devtmpfs: initialized Oct 25 09:03:32 sh-103-53 kernel: EVM: security.selinux Oct 25 09:03:32 sh-103-53 kernel: EVM: security.ima Oct 25 09:03:32 sh-103-53 kernel: EVM: security.capability Oct 25 09:03:32 sh-103-53 kernel: PM: Registering ACPI NVS region [mem 0x6cfff000-0x6effefff] (33554432 bytes) Oct 25 09:03:32 sh-103-53 kernel: atomic64 test passed for x86-64 platform with CX8 and with SSE Oct 25 09:03:32 sh-103-53 kernel: pinctrl core: initialized pinctrl subsystem Oct 25 09:03:32 sh-103-53 kernel: RTC time: 16:03:28, date: 10/25/19 Oct 25 09:03:32 sh-103-53 kernel: NET: Registered protocol family 16 Oct 25 09:03:32 sh-103-53 kernel: ACPI FADT declares the system doesn't support PCIe ASPM, so disable it Oct 25 09:03:32 sh-103-53 kernel: ACPI: bus type PCI registered Oct 25 09:03:32 sh-103-53 kernel: acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 Oct 25 09:03:32 sh-103-53 kernel: PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000) Oct 25 09:03:32 sh-103-53 kernel: PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820 Oct 25 09:03:32 sh-103-53 kernel: PCI: Using configuration type 1 for base access Oct 25 09:03:32 sh-103-53 kernel: PCI: Dell System detected, enabling pci=bfsort. 
Oct 25 09:03:32 sh-103-53 kernel: ACPI: Added _OSI(Module Device) Oct 25 09:03:32 sh-103-53 kernel: ACPI: Added _OSI(Processor Device) Oct 25 09:03:32 sh-103-53 kernel: ACPI: Added _OSI(3.0 _SCP Extensions) Oct 25 09:03:32 sh-103-53 kernel: ACPI: Added _OSI(Processor Aggregator Device) Oct 25 09:03:32 sh-103-53 kernel: ACPI: Added _OSI(Linux-Dell-Video) Oct 25 09:03:32 sh-103-53 kernel: ACPI: EC: Look up EC in DSDT Oct 25 09:03:32 sh-103-53 kernel: ACPI: Executed 4 blocks of module-level executable AML code Oct 25 09:03:32 sh-103-53 kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored Oct 25 09:03:32 sh-103-53 kernel: ACPI: Executed 1 blocks of module-level executable AML code Oct 25 09:03:32 sh-103-53 kernel: ACPI: Dynamic OEM Table Load: Oct 25 09:03:32 sh-103-53 kernel: ACPI: OEM4 (null) AD1C1 (v02 INTEL CPU CST 00003000 INTL 20180508) Oct 25 09:03:32 sh-103-53 kernel: ACPI: Interpreter enabled Oct 25 09:03:32 sh-103-53 kernel: ACPI: (supports S0 S5) Oct 25 09:03:32 sh-103-53 kernel: ACPI: Using IOAPIC for interrupt routing Oct 25 09:03:32 sh-103-53 kernel: HEST: Table parsing has been initialized. 
Oct 25 09:03:32 sh-103-53 kernel: PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug Oct 25 09:03:32 sh-103-53 kernel: ACPI: Enabled 5 GPEs in block 00 to 7F Oct 25 09:03:32 sh-103-53 kernel: ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-16]) Oct 25 09:03:32 sh-103-53 kernel: acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI] Oct 25 09:03:32 sh-103-53 kernel: acpi PNP0A08:00: PCIe AER handled by firmware Oct 25 09:03:32 sh-103-53 kernel: acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug] Oct 25 09:03:32 sh-103-53 kernel: acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability] Oct 25 09:03:32 sh-103-53 kernel: acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration Oct 25 09:03:32 sh-103-53 kernel: PCI host bridge to bus 0000:00 Oct 25 09:03:32 sh-103-53 kernel: pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window] Oct 25 09:03:32 sh-103-53 kernel: pci_bus 0000:00: root bus resource [io 0x1000-0x3fff window] Oct 25 09:03:32 sh-103-53 kernel: pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window] Oct 25 09:03:32 sh-103-53 kernel: pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000c7fff window] Oct 25 09:03:32 sh-103-53 kernel: pci_bus 0000:00: root bus resource [mem 0xfe010000-0xfe010fff window] Oct 25 09:03:32 sh-103-53 kernel: pci_bus 0000:00: root bus resource [mem 0x90000000-0x9d7fffff window] Oct 25 09:03:32 sh-103-53 kernel: pci_bus 0000:00: root bus resource [mem 0x380000000000-0x380fffffffff window] Oct 25 09:03:32 sh-103-53 kernel: pci_bus 0000:00: root bus resource [bus 00-16] Oct 25 09:03:32 sh-103-53 kernel: pci 0000:00:00.0: [8086:2020] type 00 class 0x060000 Oct 25 09:03:32 sh-103-53 kernel: pci 0000:00:05.0: [8086:2024] type 00 class 0x088000 Oct 25 09:03:32 sh-103-53 kernel: pci 0000:00:05.2: [8086:2025] type 00 class 0x088000 Oct 25 09:03:32 sh-103-53 kernel: pci 0000:00:05.4: [8086:2026] type 00 class 
0x080020 Oct 25 09:03:32 sh-103-53 kernel: pci 0000:00:05.4: reg 0x10: [mem 0x93120000-0x93120fff] Oct 25 09:03:32 sh-103-53 kernel: pci 0000:00:08.0: [8086:2014] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:08.1: [8086:2015] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:08.2: [8086:2016] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.0: [8086:a1ec] type 00 class 0xff0000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.5: [8086:a1d2] type 00 class 0x010601 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.5: reg 0x10: [mem 0x93116000-0x93117fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.5: reg 0x14: [mem 0x9311f000-0x9311f0ff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.5: reg 0x18: [io 0x2068-0x206f] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.5: reg 0x1c: [io 0x2074-0x2077] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.5: reg 0x20: [io 0x2040-0x205f] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.5: reg 0x24: [mem 0x93080000-0x930fffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:11.5: PME# supported from D3hot Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:14.0: [8086:a1af] type 00 class 0x0c0330 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:14.0: reg 0x10: [mem 0x93100000-0x9310ffff 64bit] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:14.0: PME# supported from D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:14.0: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:14.2: [8086:a1b1] type 00 class 0x118000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:14.2: reg 0x10: [mem 0x9311c000-0x9311cfff 64bit] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.0: [8086:a1ba] type 00 class 0x078000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.0: reg 0x10: [mem 0x9311b000-0x9311bfff 64bit] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.0: PME# supported from D3hot Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.1: [8086:a1bb] type 00 class 
0x078000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.1: reg 0x10: [mem 0x9311a000-0x9311afff 64bit] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.1: PME# supported from D3hot Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.4: [8086:a1be] type 00 class 0x078000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.4: reg 0x10: [mem 0x93119000-0x93119fff 64bit] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:16.4: PME# supported from D3hot Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:17.0: [8086:a182] type 00 class 0x010601 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:17.0: reg 0x10: [mem 0x93114000-0x93115fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:17.0: reg 0x14: [mem 0x9311e000-0x9311e0ff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:17.0: reg 0x18: [io 0x2060-0x2067] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:17.0: reg 0x1c: [io 0x2070-0x2073] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:17.0: reg 0x20: [io 0x2020-0x203f] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:17.0: reg 0x24: [mem 0x93000000-0x9307ffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:17.0: PME# supported from D3hot Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.0: [8086:a190] type 01 class 0x060400 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.0: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.0: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.4: [8086:a194] type 01 class 0x060400 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.4: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.4: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.5: [8086:a195] type 01 class 0x060400 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.5: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.5: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1f.0: [8086:a1c1] type 00 class 0x060100 Oct 25 09:03:33 
sh-103-53 kernel: pci 0000:00:1f.2: [8086:a1a1] type 00 class 0x058000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1f.2: reg 0x10: [mem 0x93110000-0x93113fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1f.4: [8086:a1a3] type 00 class 0x0c0500 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1f.4: reg 0x10: [mem 0x93118000-0x931180ff 64bit] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1f.4: reg 0x20: [io 0x2000-0x201f] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1f.5: [8086:a1a4] type 00 class 0x0c8000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1f.5: reg 0x10: [mem 0xfe010000-0xfe010fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.0: PCI bridge to [bus 01] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.0: bridge window [mem 0x92a00000-0x92dfffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: [1556:be00] type 01 class 0x060400 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.4: PCI bridge to [bus 02-03] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.4: bridge window [mem 0x92000000-0x928fffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.4: bridge window [mem 0x91000000-0x91ffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:03:00.0: [102b:0536] type 00 class 0x030000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:03:00.0: reg 0x10: [mem 0x91000000-0x91ffffff pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:03:00.0: reg 0x14: [mem 0x92808000-0x9280bfff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:03:00.0: reg 0x18: [mem 0x92000000-0x927fffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: PCI bridge to [bus 03] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: bridge window [mem 0x92000000-0x928fffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: bridge window [mem 0x91000000-0x91ffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: [8086:1521] type 00 class 0x020000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: reg 
0x10: [mem 0x92e00000-0x92efffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: reg 0x1c: [mem 0x92f00000-0x92f03fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: reg 0x30: [mem 0xfff00000-0xffffffff pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.5: PCI bridge to [bus 04] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.5: bridge window [mem 0x92e00000-0x92ffffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:00: on NUMA node 0 Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 10 *11 12 14 15) Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 *6 10 11 12 14 15) Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 10 11 12 14 15) Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 10 *11 12 14 15) Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 10 *11 12 14 15) Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 *6 10 11 12 14 15) Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 *5 6 10 11 12 14 15) Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 10 *11 12 14 15) Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Root Bridge [PC01] (domain 0000 [bus 17-39]) Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:01: PCIe AER handled by firmware Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:01: _OSC: platform does not support [SHPCHotplug] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME PCIeCapability] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:01: FADT indicates ASPM is unsupported, using BIOS configuration Oct 25 09:03:33 sh-103-53 kernel: PCI host bridge to bus 0000:17 Oct 25 09:03:33 
sh-103-53 kernel: pci_bus 0000:17: root bus resource [io 0x4000-0x5fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:17: root bus resource [mem 0x9d800000-0xaaffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:17: root bus resource [mem 0x381000000000-0x381fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:17: root bus resource [bus 17-39] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:05.0: [8086:2034] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:05.2: [8086:2035] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:05.4: [8086:2036] type 00 class 0x080020 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:05.4: reg 0x10: [mem 0x9d800000-0x9d800fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:08.0: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:08.1: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:08.2: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:08.3: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:08.4: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:08.5: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:08.6: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:08.7: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:09.0: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:09.1: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:09.2: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:09.3: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:09.4: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:09.5: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:09.6: 
[8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:09.7: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0a.0: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0a.1: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0e.0: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0e.1: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0e.2: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0e.3: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0e.4: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0e.5: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0e.6: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0e.7: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0f.0: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0f.1: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0f.2: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0f.3: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0f.4: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0f.5: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0f.6: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:0f.7: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:10.0: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:10.1: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1d.0: [8086:2054] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 
0000:17:1d.1: [8086:2055] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1d.2: [8086:2056] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1d.3: [8086:2057] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1e.0: [8086:2080] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1e.1: [8086:2081] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1e.2: [8086:2082] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1e.3: [8086:2083] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1e.4: [8086:2084] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1e.5: [8086:2085] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:17:1e.6: [8086:2086] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:17: on NUMA node 0 Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Root Bridge [PC02] (domain 0000 [bus 3a-5c]) Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:02: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:02: PCIe AER handled by firmware Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:02: _OSC: platform does not support [SHPCHotplug] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:02: _OSC: OS now controls [PCIeHotplug PME PCIeCapability] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:02: FADT indicates ASPM is unsupported, using BIOS configuration Oct 25 09:03:33 sh-103-53 kernel: PCI host bridge to bus 0000:3a Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:3a: root bus resource [io 0x6000-0x7fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:3a: root bus resource [mem 0xab000000-0xb87fffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:3a: root bus resource [mem 0x382000000000-0x382fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:3a: root bus resource [bus 3a-5c] Oct 25 09:03:33 sh-103-53 kernel: pci 
0000:3a:05.0: [8086:2034] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:05.2: [8086:2035] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:05.4: [8086:2036] type 00 class 0x080020 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:05.4: reg 0x10: [mem 0xab000000-0xab000fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:08.0: [8086:2066] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:09.0: [8086:2066] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0a.0: [8086:2040] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0a.1: [8086:2041] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0a.2: [8086:2042] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0a.3: [8086:2043] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0a.4: [8086:2044] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0a.5: [8086:2045] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0a.6: [8086:2046] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0a.7: [8086:2047] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0b.0: [8086:2048] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0b.1: [8086:2049] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0b.2: [8086:204a] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0b.3: [8086:204b] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0c.0: [8086:2040] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0c.1: [8086:2041] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0c.2: [8086:2042] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0c.3: [8086:2043] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0c.4: [8086:2044] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: 
pci 0000:3a:0c.5: [8086:2045] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0c.6: [8086:2046] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0c.7: [8086:2047] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0d.0: [8086:2048] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0d.1: [8086:2049] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0d.2: [8086:204a] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:3a:0d.3: [8086:204b] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:3a: on NUMA node 0 Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Root Bridge [PC03] (domain 0000 [bus 5d-7f]) Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:03: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:03: PCIe AER handled by firmware Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:03: _OSC: platform does not support [SHPCHotplug] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:03: _OSC: OS now controls [PCIeHotplug PME PCIeCapability] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:03: FADT indicates ASPM is unsupported, using BIOS configuration Oct 25 09:03:33 sh-103-53 kernel: PCI host bridge to bus 0000:5d Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5d: root bus resource [io 0x8000-0x9fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5d: root bus resource [mem 0xb8800000-0xc5ffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5d: root bus resource [mem 0x383000000000-0x383fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5d: root bus resource [bus 5d-7f] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: [8086:2030] type 01 class 0x060400 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 
0000:5d:05.0: [8086:2034] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:05.2: [8086:2035] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:05.4: [8086:2036] type 00 class 0x080020 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:05.4: reg 0x10: [mem 0xbc000000-0xbc000fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:0e.0: [8086:2058] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:0e.1: [8086:2059] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:0f.0: [8086:2058] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:0f.1: [8086:2059] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:12.0: [8086:204c] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:12.1: [8086:204d] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:12.2: [8086:204e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:15.0: [8086:2018] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:16.0: [8086:2018] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:16.4: [8086:2018] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5e:00.0: [15b3:1017] type 00 class 0x020700 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5e:00.0: reg 0x10: [mem 0xba000000-0xbbffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5e:00.0: reg 0x30: [mem 0xfff00000-0xffffffff pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5e:00.0: PME# supported from D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: PCI bridge to [bus 5e] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: bridge window [mem 0xba000000-0xbbffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5d: on NUMA node 0 Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Root Bridge [PC06] (domain 0000 [bus 80-84]) Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:06: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI] Oct 25 
09:03:33 sh-103-53 kernel: acpi PNP0A08:06: PCIe AER handled by firmware Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:06: _OSC: platform does not support [SHPCHotplug] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME PCIeCapability] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:06: FADT indicates ASPM is unsupported, using BIOS configuration Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:06: host bridge window [io 0x0000 window] (ignored, not CPU addressable) Oct 25 09:03:33 sh-103-53 kernel: PCI host bridge to bus 0000:80 Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:80: root bus resource [mem 0xc6000000-0xd37fffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:80: root bus resource [mem 0x384000000000-0x384fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:80: root bus resource [bus 80-84] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:80:05.0: [8086:2024] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:80:05.2: [8086:2025] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:80:05.4: [8086:2026] type 00 class 0x080020 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:80:05.4: reg 0x10: [mem 0xc6000000-0xc6000fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:80:08.0: [8086:2014] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:80:08.1: [8086:2015] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:80:08.2: [8086:2016] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:80: on NUMA node 1 Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Root Bridge [PC07] (domain 0000 [bus 85-ad]) Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:07: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:07: PCIe AER handled by firmware Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:07: _OSC: platform does not support [SHPCHotplug] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:07: _OSC: OS now controls 
[PCIeHotplug PME PCIeCapability] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:07: FADT indicates ASPM is unsupported, using BIOS configuration Oct 25 09:03:33 sh-103-53 kernel: PCI host bridge to bus 0000:85 Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:85: root bus resource [io 0xa000-0xbfff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:85: root bus resource [mem 0xd3800000-0xe0ffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:85: root bus resource [mem 0x385000000000-0x385fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:85: root bus resource [bus 85-ad] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:05.0: [8086:2034] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:05.2: [8086:2035] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:05.4: [8086:2036] type 00 class 0x080020 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:05.4: reg 0x10: [mem 0xd3800000-0xd3800fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:08.0: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:08.1: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:08.2: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:08.3: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:08.4: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:08.5: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:08.6: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:08.7: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:09.0: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:09.1: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:09.2: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:09.3: [8086:208d] type 00 class 0x088000 
Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:09.4: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:09.5: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:09.6: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:09.7: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0a.0: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0a.1: [8086:208d] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0e.0: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0e.1: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0e.2: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0e.3: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0e.4: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0e.5: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0e.6: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0e.7: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0f.0: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0f.1: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0f.2: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0f.3: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0f.4: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0f.5: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0f.6: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:0f.7: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:10.0: [8086:208e] type 00 class 
0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:10.1: [8086:208e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1d.0: [8086:2054] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1d.1: [8086:2055] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1d.2: [8086:2056] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1d.3: [8086:2057] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1e.0: [8086:2080] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1e.1: [8086:2081] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1e.2: [8086:2082] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1e.3: [8086:2083] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1e.4: [8086:2084] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1e.5: [8086:2085] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:85:1e.6: [8086:2086] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:85: on NUMA node 1 Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Root Bridge [PC08] (domain 0000 [bus ae-d6]) Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:08: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:08: PCIe AER handled by firmware Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:08: _OSC: platform does not support [SHPCHotplug] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:08: _OSC: OS now controls [PCIeHotplug PME PCIeCapability] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:08: FADT indicates ASPM is unsupported, using BIOS configuration Oct 25 09:03:33 sh-103-53 kernel: PCI host bridge to bus 0000:ae Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:ae: root bus resource [io 0xc000-0xdfff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:ae: root bus resource [mem 0xe1000000-0xee7fffff window] Oct 25 09:03:33 
sh-103-53 kernel: pci_bus 0000:ae: root bus resource [mem 0x386000000000-0x386fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:ae: root bus resource [bus ae-d6] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:05.0: [8086:2034] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:05.2: [8086:2035] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:05.4: [8086:2036] type 00 class 0x080020 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:05.4: reg 0x10: [mem 0xe1000000-0xe1000fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:08.0: [8086:2066] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:09.0: [8086:2066] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0a.0: [8086:2040] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0a.1: [8086:2041] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0a.2: [8086:2042] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0a.3: [8086:2043] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0a.4: [8086:2044] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0a.5: [8086:2045] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0a.6: [8086:2046] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0a.7: [8086:2047] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0b.0: [8086:2048] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0b.1: [8086:2049] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0b.2: [8086:204a] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0b.3: [8086:204b] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0c.0: [8086:2040] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0c.1: [8086:2041] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0c.2: [8086:2042] type 00 class 
0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0c.3: [8086:2043] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0c.4: [8086:2044] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0c.5: [8086:2045] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0c.6: [8086:2046] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0c.7: [8086:2047] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0d.0: [8086:2048] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0d.1: [8086:2049] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0d.2: [8086:204a] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:ae:0d.3: [8086:204b] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:ae: on NUMA node 1 Oct 25 09:03:33 sh-103-53 kernel: ACPI: PCI Root Bridge [PC09] (domain 0000 [bus d7-ff]) Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:09: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:09: PCIe AER handled by firmware Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:09: _OSC: platform does not support [SHPCHotplug] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:09: _OSC: OS now controls [PCIeHotplug PME PCIeCapability] Oct 25 09:03:33 sh-103-53 kernel: acpi PNP0A08:09: FADT indicates ASPM is unsupported, using BIOS configuration Oct 25 09:03:33 sh-103-53 kernel: PCI host bridge to bus 0000:d7 Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d7: root bus resource [io 0xe000-0xffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d7: root bus resource [mem 0xee800000-0xfbffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d7: root bus resource [mem 0x387000000000-0x387fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d7: root bus resource [bus d7-ff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: [8086:2030] type 01 class 0x060400 
Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: [8086:2031] type 01 class 0x060400 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:02.0: [8086:2032] type 01 class 0x060400 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:02.0: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:02.0: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:03.0: [8086:2033] type 01 class 0x060400 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:03.0: PME# supported from D0 D3hot D3cold Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:03.0: System wakeup disabled by ACPI Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:05.0: [8086:2034] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:05.2: [8086:2035] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:05.4: [8086:2036] type 00 class 0x080020 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:05.4: reg 0x10: [mem 0xef000000-0xef000fff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:0e.0: [8086:2058] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:0e.1: [8086:2059] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:0f.0: [8086:2058] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:0f.1: [8086:2059] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:12.0: [8086:204c] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:12.1: [8086:204d] type 00 class 0x110100 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:12.2: [8086:204e] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:15.0: [8086:2018] type 00 class 0x088000 Oct 25 
09:03:33 sh-103-53 kernel: pci 0000:d7:16.0: [8086:2018] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:16.4: [8086:2018] type 00 class 0x088000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: PCI bridge to [bus d8] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: bridge window [mem 0xeec00000-0xeeffffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: PCI bridge to [bus d9] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: bridge window [mem 0xee800000-0xeebfffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:02.0: PCI bridge to [bus da] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:03.0: PCI bridge to [bus db] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d7: on NUMA node 1 Oct 25 09:03:33 sh-103-53 kernel: vgaarb: device added: PCI:0000:03:00.0,decodes=io+mem,owns=io+mem,locks=none Oct 25 09:03:33 sh-103-53 kernel: vgaarb: loaded Oct 25 09:03:33 sh-103-53 kernel: vgaarb: bridge control possible 0000:03:00.0 Oct 25 09:03:33 sh-103-53 kernel: SCSI subsystem initialized Oct 25 09:03:33 sh-103-53 kernel: ACPI: bus type USB registered Oct 25 09:03:33 sh-103-53 kernel: usbcore: registered new interface driver usbfs Oct 25 09:03:33 sh-103-53 kernel: usbcore: registered new interface driver hub Oct 25 09:03:33 sh-103-53 kernel: usbcore: registered new device driver usb Oct 25 09:03:33 sh-103-53 kernel: EDAC MC: Ver: 3.0.0 Oct 25 09:03:33 sh-103-53 kernel: PCI: Using ACPI for IRQ routing Oct 25 09:03:33 sh-103-53 kernel: PCI: pci_cache_line_size set to 64 bytes Oct 25 09:03:33 sh-103-53 kernel: Expanded resource reserved due to conflict with PNP0003:00 Oct 25 09:03:33 sh-103-53 kernel: e820: reserve RAM buffer [mem 0x0009d000-0x0009ffff] Oct 25 09:03:33 sh-103-53 kernel: e820: reserve RAM buffer [mem 0x5ddff000-0x5fffffff] Oct 25 09:03:33 sh-103-53 kernel: e820: reserve RAM buffer [mem 0x6f800000-0x6fffffff] Oct 25 09:03:33 sh-103-53 kernel: NetLabel: Initializing Oct 25 09:03:33 sh-103-53 kernel: NetLabel: domain hash size = 128 Oct 
25 09:03:33 sh-103-53 kernel: NetLabel: protocols = UNLABELED CIPSOv4 Oct 25 09:03:33 sh-103-53 kernel: NetLabel: unlabeled traffic allowed by default Oct 25 09:03:33 sh-103-53 kernel: hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0 Oct 25 09:03:33 sh-103-53 kernel: hpet0: 8 comparators, 64-bit 24.000000 MHz counter Oct 25 09:03:33 sh-103-53 kernel: amd_nb: Cannot enumerate AMD northbridges Oct 25 09:03:33 sh-103-53 kernel: Switched to clocksource hpet Oct 25 09:03:33 sh-103-53 kernel: pnp: PnP ACPI init Oct 25 09:03:33 sh-103-53 kernel: ACPI: bus type PNP registered Oct 25 09:03:33 sh-103-53 kernel: pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active) Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [io 0x0500-0x05fe] could not be reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [io 0x0400-0x047f] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [io 0x0600-0x061f] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [io 0x0ca0-0x0ca5] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [io 0x0880-0x0883] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [io 0x0800-0x081f] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [mem 0xfed1c000-0xfed3ffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [mem 0xfed45000-0xfed8bfff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [mem 0xff000000-0xffffffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [mem 0xfee00000-0xfeefffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [mem 0xfed12000-0xfed1200f] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [mem 0xfed12010-0xfed1201f] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: [mem 0xfed1b000-0xfed1bfff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:01: Plug and Play ACPI device, IDs PNP0c02 (active) Oct 25 09:03:33 sh-103-53 kernel: pnp 00:02: Plug and 
Play ACPI device, IDs PNP0501 (active) Oct 25 09:03:33 sh-103-53 kernel: system 00:03: [mem 0xfd000000-0xfdabffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:03: [mem 0xfdad0000-0xfdadffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:03: [mem 0xfdb00000-0xfdffffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:03: [mem 0xfe000000-0xfe00ffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:03: [mem 0xfe011000-0xfe01ffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:03: [mem 0xfe036000-0xfe03bfff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:03: [mem 0xfe03d000-0xfe3fffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:03: [mem 0xfe410000-0xfe7fffff] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:03: Plug and Play ACPI device, IDs PNP0c02 (active) Oct 25 09:03:33 sh-103-53 kernel: system 00:04: [io 0x1000-0x10fe] has been reserved Oct 25 09:03:33 sh-103-53 kernel: system 00:04: Plug and Play ACPI device, IDs PNP0c02 (active) Oct 25 09:03:33 sh-103-53 kernel: pnp: PnP ACPI: found 5 devices Oct 25 09:03:33 sh-103-53 kernel: ACPI: bus type PNP unregistered Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: can't claim BAR 6 [mem 0xfff00000-0xffffffff pref]: no compatible bridge window Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5e:00.0: can't claim BAR 6 [mem 0xfff00000-0xffffffff pref]: no compatible bridge window Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.0: PCI bridge to [bus 01] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.0: bridge window [mem 0x92a00000-0x92dfffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: PCI bridge to [bus 03] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: bridge window [mem 0x92000000-0x928fffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: bridge window [mem 0x91000000-0x91ffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.4: PCI bridge to [bus 02-03] Oct 25 09:03:33 
sh-103-53 kernel: pci 0000:00:1c.4: bridge window [mem 0x92000000-0x928fffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.4: bridge window [mem 0x91000000-0x91ffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: BAR 6: no space for [mem size 0x00100000 pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: BAR 6: failed to assign [mem size 0x00100000 pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.5: PCI bridge to [bus 04] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:00:1c.5: bridge window [mem 0x92e00000-0x92ffffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7 window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:00: resource 5 [io 0x1000-0x3fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:00: resource 7 [mem 0x000c4000-0x000c7fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:00: resource 8 [mem 0xfe010000-0xfe010fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:00: resource 9 [mem 0x90000000-0x9d7fffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:00: resource 10 [mem 0x380000000000-0x380fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:01: resource 1 [mem 0x92a00000-0x92dfffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:02: resource 1 [mem 0x92000000-0x928fffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:02: resource 2 [mem 0x91000000-0x91ffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:03: resource 1 [mem 0x92000000-0x928fffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:03: resource 2 [mem 0x91000000-0x91ffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:04: resource 1 [mem 0x92e00000-0x92ffffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:17: resource 4 [io 0x4000-0x5fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:17: resource 5 [mem 0x9d800000-0xaaffffff window] Oct 25 09:03:33 sh-103-53 
kernel: pci_bus 0000:17: resource 6 [mem 0x381000000000-0x381fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:3a: resource 4 [io 0x6000-0x7fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:3a: resource 5 [mem 0xab000000-0xb87fffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:3a: resource 6 [mem 0x382000000000-0x382fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: BAR 14: assigned [mem 0xb8800000-0xb88fffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5e:00.0: BAR 6: assigned [mem 0xb8800000-0xb88fffff pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: PCI bridge to [bus 5e] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: bridge window [mem 0xb8800000-0xb88fffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5d:00.0: bridge window [mem 0xba000000-0xbbffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5d: resource 4 [io 0x8000-0x9fff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5d: resource 5 [mem 0xb8800000-0xc5ffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5d: resource 6 [mem 0x383000000000-0x383fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5e: resource 1 [mem 0xb8800000-0xb88fffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:5e: resource 2 [mem 0xba000000-0xbbffffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:80: resource 4 [mem 0xc6000000-0xd37fffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:80: resource 5 [mem 0x384000000000-0x384fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:85: resource 4 [io 0xa000-0xbfff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:85: resource 5 [mem 0xd3800000-0xe0ffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:85: resource 6 [mem 0x385000000000-0x385fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:ae: resource 4 [io 0xc000-0xdfff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:ae: resource 5 [mem 0xe1000000-0xee7fffff window] 
Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:ae: resource 6 [mem 0x386000000000-0x386fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: bridge window [io 0x1000-0x0fff] to [bus d8] add_size 1000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus d8] add_size 200000 add_align 100000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: bridge window [io 0x1000-0x0fff] to [bus d9] add_size 1000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus d9] add_size 200000 add_align 100000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: res[15]=[mem 0x00100000-0x000fffff 64bit pref] res_to_dev_res add_size 200000 min_align 100000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: res[15]=[mem 0x00100000-0x002fffff 64bit pref] res_to_dev_res add_size 200000 min_align 100000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: res[15]=[mem 0x00100000-0x000fffff 64bit pref] res_to_dev_res add_size 200000 min_align 100000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: res[15]=[mem 0x00100000-0x002fffff 64bit pref] res_to_dev_res add_size 200000 min_align 100000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: res[13]=[io 0x1000-0x0fff] res_to_dev_res add_size 1000 min_align 1000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: res[13]=[io 0x1000-0x1fff] res_to_dev_res add_size 1000 min_align 1000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: res[13]=[io 0x1000-0x0fff] res_to_dev_res add_size 1000 min_align 1000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: res[13]=[io 0x1000-0x1fff] res_to_dev_res add_size 1000 min_align 1000 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: BAR 15: assigned [mem 0x387000000000-0x3870001fffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: BAR 15: assigned [mem 0x387000200000-0x3870003fffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: BAR 13: 
assigned [io 0xe000-0xefff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: BAR 13: assigned [io 0xf000-0xffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: PCI bridge to [bus d8] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: bridge window [io 0xe000-0xefff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: bridge window [mem 0xeec00000-0xeeffffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:00.0: bridge window [mem 0x387000000000-0x3870001fffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: PCI bridge to [bus d9] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: bridge window [io 0xf000-0xffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: bridge window [mem 0xee800000-0xeebfffff] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:01.0: bridge window [mem 0x387000200000-0x3870003fffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:02.0: PCI bridge to [bus da] Oct 25 09:03:33 sh-103-53 kernel: pci 0000:d7:03.0: PCI bridge to [bus db] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d7: resource 4 [io 0xe000-0xffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d7: resource 5 [mem 0xee800000-0xfbffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d7: resource 6 [mem 0x387000000000-0x387fffffffff window] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d8: resource 0 [io 0xe000-0xefff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d8: resource 1 [mem 0xeec00000-0xeeffffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d8: resource 2 [mem 0x387000000000-0x3870001fffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d9: resource 0 [io 0xf000-0xffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d9: resource 1 [mem 0xee800000-0xeebfffff] Oct 25 09:03:33 sh-103-53 kernel: pci_bus 0000:d9: resource 2 [mem 0x387000200000-0x3870003fffff 64bit pref] Oct 25 09:03:33 sh-103-53 kernel: NET: Registered protocol family 2 Oct 25 09:03:33 sh-103-53 kernel: TCP established hash table entries: 524288 
(order: 10, 4194304 bytes) Oct 25 09:03:33 sh-103-53 kernel: TCP bind hash table entries: 65536 (order: 8, 1048576 bytes) Oct 25 09:03:33 sh-103-53 kernel: TCP: Hash tables configured (established 524288 bind 65536) Oct 25 09:03:33 sh-103-53 kernel: TCP: reno registered Oct 25 09:03:33 sh-103-53 kernel: UDP hash table entries: 65536 (order: 9, 2097152 bytes) Oct 25 09:03:33 sh-103-53 kernel: UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes) Oct 25 09:03:33 sh-103-53 kernel: NET: Registered protocol family 1 Oct 25 09:03:33 sh-103-53 kernel: pci 0000:03:00.0: Boot video device Oct 25 09:03:33 sh-103-53 kernel: PCI: CLS 128 bytes, default 64 Oct 25 09:03:33 sh-103-53 kernel: Unpacking initramfs... Oct 25 09:03:33 sh-103-53 kernel: Freeing initrd memory: 19420k freed Oct 25 09:03:33 sh-103-53 kernel: PCI-DMA: Using software bounce buffering for IO (SWIOTLB) Oct 25 09:03:33 sh-103-53 kernel: software IO TLB [mem 0x59dff000-0x5ddff000] (64MB) mapped at [ffff975299dff000-ffff97529ddfefff] Oct 25 09:03:33 sh-103-53 kernel: RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 655360 ms ovfl timer Oct 25 09:03:33 sh-103-53 kernel: RAPL PMU: hw unit of domain pp0-core 2^-14 Joules Oct 25 09:03:33 sh-103-53 kernel: RAPL PMU: hw unit of domain package 2^-14 Joules Oct 25 09:03:33 sh-103-53 kernel: RAPL PMU: hw unit of domain dram 2^-16 Joules Oct 25 09:03:33 sh-103-53 kernel: sha1_ssse3: Using AVX2 optimized SHA-1 implementation Oct 25 09:03:33 sh-103-53 kernel: sha256_ssse3: Using AVX2 optimized SHA-256 implementation Oct 25 09:03:33 sh-103-53 kernel: futex hash table entries: 131072 (order: 11, 8388608 bytes) Oct 25 09:03:33 sh-103-53 kernel: Initialise system trusted keyring Oct 25 09:03:33 sh-103-53 kernel: audit: initializing netlink socket (disabled) Oct 25 09:03:33 sh-103-53 kernel: type=2000 audit(1572019405.801:1): initialized Oct 25 09:03:33 sh-103-53 kernel: HugeTLB registered 1 GB page size, pre-allocated 0 pages Oct 25 09:03:33 sh-103-53 kernel: 
HugeTLB registered 2 MB page size, pre-allocated 0 pages Oct 25 09:03:33 sh-103-53 kernel: zpool: loaded Oct 25 09:03:33 sh-103-53 kernel: zbud: loaded Oct 25 09:03:33 sh-103-53 kernel: VFS: Disk quotas dquot_6.5.2 Oct 25 09:03:33 sh-103-53 kernel: Dquot-cache hash table entries: 512 (order 0, 4096 bytes) Oct 25 09:03:33 sh-103-53 kernel: msgmni has been set to 32768 Oct 25 09:03:33 sh-103-53 kernel: Key type big_key registered Oct 25 09:03:33 sh-103-53 kernel: SELinux: Registering netfilter hooks Oct 25 09:03:33 sh-103-53 kernel: NET: Registered protocol family 38 Oct 25 09:03:33 sh-103-53 kernel: Key type asymmetric registered Oct 25 09:03:33 sh-103-53 kernel: Asymmetric key parser 'x509' registered Oct 25 09:03:33 sh-103-53 kernel: Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248) Oct 25 09:03:33 sh-103-53 kernel: io scheduler noop registered Oct 25 09:03:33 sh-103-53 kernel: io scheduler deadline registered (default) Oct 25 09:03:33 sh-103-53 kernel: io scheduler cfq registered Oct 25 09:03:33 sh-103-53 kernel: io scheduler mq-deadline registered Oct 25 09:03:33 sh-103-53 kernel: io scheduler kyber registered Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:00:1c.0: irq 24 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:00:1c.4: irq 25 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:00:1c.5: irq 26 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:5d:00.0: irq 28 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:d7:00.0: irq 30 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:d7:01.0: irq 31 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:d7:02.0: irq 32 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:d7:03.0: irq 33 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:00:1c.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pcie_pme 0000:00:1c.0:pcie001: service driver pcie_pme loaded Oct 25 09:03:33 sh-103-53 
kernel: pcieport 0000:00:1c.4: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pci 0000:02:00.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pci 0000:03:00.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pcie_pme 0000:00:1c.4:pcie001: service driver pcie_pme loaded Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:00:1c.5: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pci 0000:04:00.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pcie_pme 0000:00:1c.5:pcie001: service driver pcie_pme loaded Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:5d:00.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pci 0000:5e:00.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pcie_pme 0000:5d:00.0:pcie001: service driver pcie_pme loaded Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:d7:00.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pcie_pme 0000:d7:00.0:pcie001: service driver pcie_pme loaded Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:d7:01.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pcie_pme 0000:d7:01.0:pcie001: service driver pcie_pme loaded Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:d7:02.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pcie_pme 0000:d7:02.0:pcie001: service driver pcie_pme loaded Oct 25 09:03:33 sh-103-53 kernel: pcieport 0000:d7:03.0: Signaling PME through PCIe PME interrupt Oct 25 09:03:33 sh-103-53 kernel: pcie_pme 0000:d7:03.0:pcie001: service driver pcie_pme loaded Oct 25 09:03:33 sh-103-53 kernel: pci_hotplug: PCI Hot Plug PCI Core version: 0.5 Oct 25 09:03:33 sh-103-53 kernel: pciehp 0000:d7:00.0:pcie004: Slot #160 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl- LLActRep+ Oct 25 09:03:33 sh-103-53 kernel: pciehp 
0000:d7:00.0:pcie004: service driver pciehp loaded Oct 25 09:03:33 sh-103-53 kernel: pciehp 0000:d7:01.0:pcie004: Slot #161 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl- LLActRep+ Oct 25 09:03:33 sh-103-53 kernel: pciehp 0000:d7:01.0:pcie004: service driver pciehp loaded Oct 25 09:03:33 sh-103-53 kernel: pciehp: PCI Express Hot Plug Controller Driver version: 0.4 Oct 25 09:03:33 sh-103-53 kernel: shpchp: Standard Hot Plug PCI Controller Driver version: 0.4 Oct 25 09:03:33 sh-103-53 kernel: intel_idle: MWAIT substates: 0x2020 Oct 25 09:03:33 sh-103-53 kernel: intel_idle: v0.4.1 model 0x55 Oct 25 09:03:33 sh-103-53 kernel: intel_idle: lapic_timer_reliable_states 0xffffffff Oct 25 09:03:33 sh-103-53 kernel: input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0 Oct 25 09:03:33 sh-103-53 kernel: ACPI: Power Button [PWRF] Oct 25 09:03:33 sh-103-53 kernel: ERST: Error Record Serialization Table (ERST) support is initialized. Oct 25 09:03:33 sh-103-53 kernel: pstore: Registered erst as persistent store backend Oct 25 09:03:33 sh-103-53 kernel: GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC. 
Oct 25 09:03:33 sh-103-53 kernel: Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled Oct 25 09:03:33 sh-103-53 kernel: 00:02: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A Oct 25 09:03:33 sh-103-53 kernel: serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A Oct 25 09:03:33 sh-103-53 kernel: Non-volatile memory driver v1.3 Oct 25 09:03:33 sh-103-53 kernel: Linux agpgart interface v0.103 Oct 25 09:03:33 sh-103-53 kernel: crash memory driver: version 1.1 Oct 25 09:03:33 sh-103-53 kernel: rdac: device handler registered Oct 25 09:03:33 sh-103-53 kernel: hp_sw: device handler registered Oct 25 09:03:33 sh-103-53 kernel: emc: device handler registered Oct 25 09:03:33 sh-103-53 kernel: alua: device handler registered Oct 25 09:03:33 sh-103-53 kernel: libphy: Fixed MDIO Bus: probed Oct 25 09:03:33 sh-103-53 kernel: ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver Oct 25 09:03:33 sh-103-53 kernel: ehci-pci: EHCI PCI platform driver Oct 25 09:03:33 sh-103-53 kernel: ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver Oct 25 09:03:33 sh-103-53 kernel: ohci-pci: OHCI PCI platform driver Oct 25 09:03:33 sh-103-53 kernel: uhci_hcd: USB Universal Host Controller Interface driver Oct 25 09:03:33 sh-103-53 kernel: xhci_hcd 0000:00:14.0: xHCI Host Controller Oct 25 09:03:33 sh-103-53 kernel: xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 1 Oct 25 09:03:33 sh-103-53 kernel: xhci_hcd 0000:00:14.0: hcc params 0x200077c1 hci version 0x100 quirks 0x00009810 Oct 25 09:03:33 sh-103-53 kernel: xhci_hcd 0000:00:14.0: cache line size of 128 is not supported Oct 25 09:03:33 sh-103-53 kernel: xhci_hcd 0000:00:14.0: irq 34 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: usb usb1: New USB device found, idVendor=1d6b, idProduct=0002 Oct 25 09:03:33 sh-103-53 kernel: usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Oct 25 09:03:33 sh-103-53 kernel: usb usb1: Product: xHCI Host Controller Oct 25 09:03:33 sh-103-53 kernel: usb usb1: Manufacturer: 
Linux 3.10.0-957.27.2.el7.x86_64 xhci-hcd Oct 25 09:03:33 sh-103-53 kernel: usb usb1: SerialNumber: 0000:00:14.0 Oct 25 09:03:33 sh-103-53 kernel: hub 1-0:1.0: USB hub found Oct 25 09:03:33 sh-103-53 kernel: hub 1-0:1.0: 16 ports detected Oct 25 09:03:33 sh-103-53 kernel: xhci_hcd 0000:00:14.0: xHCI Host Controller Oct 25 09:03:33 sh-103-53 kernel: xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 2 Oct 25 09:03:33 sh-103-53 kernel: usb usb2: New USB device found, idVendor=1d6b, idProduct=0003 Oct 25 09:03:33 sh-103-53 kernel: usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Oct 25 09:03:33 sh-103-53 kernel: usb usb2: Product: xHCI Host Controller Oct 25 09:03:33 sh-103-53 kernel: usb usb2: Manufacturer: Linux 3.10.0-957.27.2.el7.x86_64 xhci-hcd Oct 25 09:03:33 sh-103-53 kernel: usb usb2: SerialNumber: 0000:00:14.0 Oct 25 09:03:33 sh-103-53 kernel: hub 2-0:1.0: USB hub found Oct 25 09:03:33 sh-103-53 kernel: hub 2-0:1.0: 10 ports detected Oct 25 09:03:33 sh-103-53 kernel: usb: port power management may be unreliable Oct 25 09:03:33 sh-103-53 kernel: usbcore: registered new interface driver usbserial_generic Oct 25 09:03:33 sh-103-53 kernel: usbserial: USB Serial support registered for generic Oct 25 09:03:33 sh-103-53 kernel: i8042: PNP: No PS/2 controller found. Probing ports directly. 
Oct 25 09:03:33 sh-103-53 kernel: usb 1-14: new high-speed USB device number 2 using xhci_hcd Oct 25 09:03:33 sh-103-53 kernel: usb 1-14: New USB device found, idVendor=1604, idProduct=10c0 Oct 25 09:03:33 sh-103-53 kernel: usb 1-14: New USB device strings: Mfr=0, Product=0, SerialNumber=0 Oct 25 09:03:33 sh-103-53 kernel: hub 1-14:1.0: USB hub found Oct 25 09:03:33 sh-103-53 kernel: hub 1-14:1.0: 4 ports detected Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.1: new high-speed USB device number 3 using xhci_hcd Oct 25 09:03:33 sh-103-53 kernel: i8042: No controller found Oct 25 09:03:33 sh-103-53 kernel: tsc: Refined TSC clocksource calibration: 2294.607 MHz Oct 25 09:03:33 sh-103-53 kernel: mousedev: PS/2 mouse device common for all mice Oct 25 09:03:33 sh-103-53 kernel: rtc_cmos 00:00: RTC can wake from S4 Oct 25 09:03:33 sh-103-53 kernel: rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0 Oct 25 09:03:33 sh-103-53 kernel: rtc_cmos 00:00: alarms up to one month, y3k, 114 bytes nvram, hpet irqs Oct 25 09:03:33 sh-103-53 kernel: cpuidle: using governor menu Oct 25 09:03:33 sh-103-53 kernel: hidraw: raw HID events driver (C) Jiri Kosina Oct 25 09:03:33 sh-103-53 kernel: usbcore: registered new interface driver usbhid Oct 25 09:03:33 sh-103-53 kernel: usbhid: USB HID core driver Oct 25 09:03:33 sh-103-53 kernel: Detected 1 PCC Subspaces Oct 25 09:03:33 sh-103-53 kernel: Registering PCC driver as Mailbox controller Oct 25 09:03:33 sh-103-53 kernel: drop_monitor: Initializing network drop monitor service Oct 25 09:03:33 sh-103-53 kernel: TCP: cubic registered Oct 25 09:03:33 sh-103-53 kernel: Initializing XFRM netlink socket Oct 25 09:03:33 sh-103-53 kernel: NET: Registered protocol family 10 Oct 25 09:03:33 sh-103-53 kernel: NET: Registered protocol family 17 Oct 25 09:03:33 sh-103-53 kernel: mpls_gso: MPLS GSO support Oct 25 09:03:33 sh-103-53 kernel: intel_rdt: Intel RDT MB allocation detected Oct 25 09:03:33 sh-103-53 kernel: mce: Using 20 MCE banks Oct 25 09:03:33 
sh-103-53 kernel: microcode: sig=0x50654, pf=0x80, revision=0x2000060 Oct 25 09:03:33 sh-103-53 kernel: microcode: Microcode Update Driver: v2.01 , Peter Oruba Oct 25 09:03:33 sh-103-53 kernel: PM: Hibernation image not present or could not be loaded. Oct 25 09:03:33 sh-103-53 kernel: Loading compiled-in X.509 certificates Oct 25 09:03:33 sh-103-53 kernel: Loaded X.509 cert 'CentOS Linux kpatch signing key: ea0413152cde1d98ebdca3fe6f0230904c9ef717' Oct 25 09:03:33 sh-103-53 kernel: Loaded X.509 cert 'CentOS Linux Driver update signing key: 7f421ee0ab69461574bb358861dbe77762a4201b' Oct 25 09:03:33 sh-103-53 kernel: Loaded X.509 cert 'CentOS Linux kernel signing key: 520a4e2d9d553ef84201c188b87fe51b9de11a5e' Oct 25 09:03:33 sh-103-53 kernel: registered taskstats version 1 Oct 25 09:03:33 sh-103-53 kernel: Key type trusted registered Oct 25 09:03:33 sh-103-53 kernel: Key type encrypted registered Oct 25 09:03:33 sh-103-53 kernel: IMA: No TPM chip found, activating TPM-bypass! (rc=-19) Oct 25 09:03:33 sh-103-53 kernel: Magic number: 7:115:86 Oct 25 09:03:33 sh-103-53 kernel: clockevents clockevent357: hash matches Oct 25 09:03:33 sh-103-53 kernel: rtc_cmos 00:00: setting system clock to 2019-10-25 16:03:32 UTC (1572019412) Oct 25 09:03:33 sh-103-53 kernel: Switched to clocksource tsc Oct 25 09:03:33 sh-103-53 kernel: Freeing unused kernel memory: 1876k freed Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.1: New USB device found, idVendor=1604, idProduct=10c0 Oct 25 09:03:33 sh-103-53 kernel: Write protecting the kernel read-only data: 12288k Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0 Oct 25 09:03:33 sh-103-53 kernel: hub 1-14.1:1.0: USB hub found Oct 25 09:03:33 sh-103-53 kernel: Freeing unused kernel memory: 508k freed Oct 25 09:03:33 sh-103-53 kernel: hub 1-14.1:1.0: 4 ports detected Oct 25 09:03:33 sh-103-53 kernel: Freeing unused kernel memory: 596k freed Oct 25 09:03:33 sh-103-53 kernel: random: systemd: 
uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN) Oct 25 09:03:33 sh-103-53 systemd[1]: Detected architecture x86-64. Oct 25 09:03:33 sh-103-53 systemd[1]: Running in initial RAM disk. Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.3: new high-speed USB device number 4 using xhci_hcd Oct 25 09:03:33 sh-103-53 systemd[1]: Set hostname to . Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 kernel: random: systemd: uninitialized urandom read (16 bytes read) Oct 25 09:03:33 sh-103-53 systemd[1]: Started Dispatch Password Requests to Console Directory Watch. Oct 25 09:03:33 sh-103-53 systemd[1]: Reached target Local File Systems. Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.3: New USB device found, idVendor=413c, idProduct=a102 Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0 Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.3: Product: iDRAC Virtual NIC USB Device Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.3: Manufacturer: Dell(TM) Oct 25 09:03:33 sh-103-53 systemd[1]: Reached target Timers. 
Oct 25 09:03:33 sh-103-53 systemd[1]: Created slice Root Slice. Oct 25 09:03:33 sh-103-53 systemd[1]: Listening on udev Kernel Socket. Oct 25 09:03:33 sh-103-53 systemd[1]: Listening on Journal Socket. Oct 25 09:03:33 sh-103-53 systemd[1]: Listening on udev Control Socket. Oct 25 09:03:33 sh-103-53 systemd[1]: Reached target Sockets. Oct 25 09:03:33 sh-103-53 systemd[1]: Created slice System Slice. Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.4: new high-speed USB device number 5 using xhci_hcd Oct 25 09:03:33 sh-103-53 systemd[1]: Starting Journal Service... Oct 25 09:03:33 sh-103-53 systemd[1]: Starting Apply Kernel Variables... Oct 25 09:03:33 sh-103-53 systemd[1]: Starting Create list of required static device nodes for the current kernel... Oct 25 09:03:33 sh-103-53 systemd[1]: Starting dracut cmdline hook... Oct 25 09:03:33 sh-103-53 systemd[1]: Reached target Swap. Oct 25 09:03:33 sh-103-53 systemd[1]: Reached target Slices. Oct 25 09:03:33 sh-103-53 systemd[1]: Reached target Paths. Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.4: New USB device found, idVendor=1604, idProduct=10c0 Oct 25 09:03:33 sh-103-53 kernel: usb 1-14.4: New USB device strings: Mfr=0, Product=0, SerialNumber=0 Oct 25 09:03:33 sh-103-53 kernel: hub 1-14.4:1.0: USB hub found Oct 25 09:03:33 sh-103-53 kernel: hub 1-14.4:1.0: 4 ports detected Oct 25 09:03:33 sh-103-53 systemd[1]: Started Journal Service. Oct 25 09:03:33 sh-103-53 kernel: dca service started, version 1.12.1 Oct 25 09:03:33 sh-103-53 kernel: pps_core: LinuxPPS API ver. 1 registered Oct 25 09:03:33 sh-103-53 kernel: pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti Oct 25 09:03:33 sh-103-53 kernel: PTP clock support registered Oct 25 09:03:33 sh-103-53 kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:33 sh-103-53 kernel: mlx_compat: loading out-of-tree module taints kernel. 
Oct 25 09:03:33 sh-103-53 kernel: libata version 3.00 loaded. Oct 25 09:03:33 sh-103-53 kernel: mlx_compat: module verification failed: signature and/or required key missing - tainting kernel Oct 25 09:03:33 sh-103-53 kernel: Compat-mlnx-ofed backport release: 1c4bf42 Oct 25 09:03:33 sh-103-53 kernel: Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git 1c4bf42 Oct 25 09:03:33 sh-103-53 kernel: compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git Oct 25 09:03:33 sh-103-53 kernel: igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k Oct 25 09:03:33 sh-103-53 kernel: igb: Copyright (c) 2007-2014 Intel Corporation. Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 35 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 35 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 36 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 37 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 38 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 39 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 40 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 41 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 42 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: irq 43 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: ahci 0000:00:11.5: version 3.0 Oct 25 09:03:33 sh-103-53 kernel: ahci 0000:00:11.5: irq 44 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: ahci 0000:00:11.5: AHCI 0001.0301 32 slots 6 ports 6 Gbps 0x3f impl SATA mode Oct 25 09:03:33 sh-103-53 kernel: ahci 0000:00:11.5: flags: 64bit ncq sntf pm led clo only pio slum part ems deso sadm sds apst Oct 25 09:03:33 sh-103-53 kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:33 sh-103-53 kernel: scsi host0: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host1: ahci Oct 25 
09:03:33 sh-103-53 kernel: scsi host2: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host3: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host4: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host5: ahci Oct 25 09:03:33 sh-103-53 kernel: ata1: SATA max UDMA/133 abar m524288@0x93080000 port 0x93080100 irq 44 Oct 25 09:03:33 sh-103-53 kernel: ata2: SATA max UDMA/133 abar m524288@0x93080000 port 0x93080180 irq 44 Oct 25 09:03:33 sh-103-53 kernel: ata3: SATA max UDMA/133 abar m524288@0x93080000 port 0x93080200 irq 44 Oct 25 09:03:33 sh-103-53 kernel: ata4: SATA max UDMA/133 abar m524288@0x93080000 port 0x93080280 irq 44 Oct 25 09:03:33 sh-103-53 kernel: ata5: SATA max UDMA/133 abar m524288@0x93080000 port 0x93080300 irq 44 Oct 25 09:03:33 sh-103-53 kernel: ata6: SATA max UDMA/133 abar m524288@0x93080000 port 0x93080380 irq 44 Oct 25 09:03:33 sh-103-53 kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:33 sh-103-53 kernel: ahci 0000:00:17.0: irq 46 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: ahci 0000:00:17.0: AHCI 0001.0301 32 slots 8 ports 6 Gbps 0xff impl SATA mode Oct 25 09:03:33 sh-103-53 kernel: ahci 0000:00:17.0: flags: 64bit ncq sntf pm led clo only pio slum part ems deso sadm sds apst Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: firmware version: 16.25.4062 Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link) Oct 25 09:03:33 sh-103-53 kernel: scsi host6: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host7: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host8: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host9: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host10: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host11: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host12: ahci Oct 25 09:03:33 sh-103-53 kernel: scsi host13: ahci Oct 25 09:03:33 sh-103-53 kernel: ata7: SATA max UDMA/133 abar m524288@0x93000000 port 
0x93000100 irq 46 Oct 25 09:03:33 sh-103-53 kernel: ata8: SATA max UDMA/133 abar m524288@0x93000000 port 0x93000180 irq 46 Oct 25 09:03:33 sh-103-53 kernel: ata9: SATA max UDMA/133 abar m524288@0x93000000 port 0x93000200 irq 46 Oct 25 09:03:33 sh-103-53 kernel: ata10: SATA max UDMA/133 abar m524288@0x93000000 port 0x93000280 irq 46 Oct 25 09:03:33 sh-103-53 kernel: ata11: SATA max UDMA/133 abar m524288@0x93000000 port 0x93000300 irq 46 Oct 25 09:03:33 sh-103-53 kernel: ata12: SATA max UDMA/133 abar m524288@0x93000000 port 0x93000380 irq 46 Oct 25 09:03:33 sh-103-53 kernel: ata13: SATA max UDMA/133 abar m524288@0x93000000 port 0x93000400 irq 46 Oct 25 09:03:33 sh-103-53 kernel: ata14: SATA max UDMA/133 abar m524288@0x93000000 port 0x93000480 irq 46 Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: added PHC on eth0 Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: Intel(R) Gigabit Ethernet Network Connection Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: eth0: (PCIe:5.0Gb/s:Width x1) 6c:2b:59:a0:5e:34 Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: eth0: PBA No: 106500-000 Oct 25 09:03:33 sh-103-53 kernel: igb 0000:04:00.0: Using MSI-X interrupts. 
8 rx queue(s), 8 tx queue(s) Oct 25 09:03:33 sh-103-53 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata6: SATA link down (SStatus 0 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata3: SATA link down (SStatus 0 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata5: SATA link down (SStatus 0 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata2: SATA link down (SStatus 0 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata4: SATA link down (SStatus 0 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata1.00: ATA-10: SSDSCKJB240G7R, N201DL42, max UDMA/133 Oct 25 09:03:33 sh-103-53 kernel: ata1.00: 468862128 sectors, multi 1: LBA48 NCQ (depth 31/32) Oct 25 09:03:33 sh-103-53 kernel: ata1.00: configured for UDMA/133 Oct 25 09:03:33 sh-103-53 kernel: scsi 0:0:0:0: Direct-Access ATA SSDSCKJB240G7R DL42 PQ: 0 ANSI: 5 Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 47 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 48 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 49 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 50 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 51 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 52 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 53 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 54 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 55 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 56 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 57 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 58 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 59 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 60 for MSI/MSI-X Oct 25 09:03:33 
sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 61 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 62 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 63 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 64 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 65 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 66 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 67 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 68 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 69 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 70 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: irq 71 for MSI/MSI-X Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: Port module event: module 0, Cable plugged Oct 25 09:03:33 sh-103-53 kernel: mlx5_core 0000:5e:00.0: mlx5_fw_tracer_start:776:(pid 330): FWTracer: Ownership granted and active Oct 25 09:03:33 sh-103-53 kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:33 sh-103-53 kernel: random: fast init done Oct 25 09:03:33 sh-103-53 kernel: ata12: SATA link down (SStatus 4 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata14: SATA link down (SStatus 4 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata8: SATA link down (SStatus 0 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata10: SATA link down (SStatus 4 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata9: SATA link down (SStatus 4 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata11: SATA link down (SStatus 4 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata7: SATA link down (SStatus 0 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: ata13: SATA link down (SStatus 4 SControl 300) Oct 25 09:03:33 sh-103-53 kernel: Request for unknown 
module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:33 sh-103-53 kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:33 sh-103-53 kernel: mlx5_ib: Mellanox Connect-IB Infiniband driver v4.7-1.0.0 Oct 25 09:03:33 sh-103-53 kernel: sd 0:0:0:0: [sda] 468862128 512-byte logical blocks: (240 GB/223 GiB) Oct 25 09:03:33 sh-103-53 kernel: sd 0:0:0:0: [sda] 4096-byte physical blocks Oct 25 09:03:33 sh-103-53 kernel: sd 0:0:0:0: [sda] Write Protect is off Oct 25 09:03:33 sh-103-53 kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 Oct 25 09:03:33 sh-103-53 kernel: sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA Oct 25 09:03:33 sh-103-53 kernel: sda: sda1 sda2 sda3 Oct 25 09:03:33 sh-103-53 kernel: sd 0:0:0:0: [sda] Attached SCSI disk Oct 25 09:03:34 sh-103-53 kernel: SGI XFS with ACLs, security attributes, no debug enabled Oct 25 09:03:34 sh-103-53 kernel: XFS (sda1): Mounting V5 Filesystem Oct 25 09:03:34 sh-103-53 kernel: XFS (sda1): Ending clean mount Oct 25 09:03:34 sh-103-53 kernel: random: crng init done Oct 25 09:03:34 sh-103-53.int systemd-journald[214]: Received SIGTERM from PID 1 (systemd). Oct 25 09:03:34 sh-103-53.int kernel: SELinux: Disabled at runtime. Oct 25 09:03:34 sh-103-53.int kernel: SELinux: Unregistering netfilter hooks Oct 25 09:03:34 sh-103-53.int kernel: type=1404 audit(1572019414.606:2): selinux=0 auid=4294967295 ses=4294967295 Oct 25 09:03:34 sh-103-53.int kernel: ip_tables: (C) 2000-2006 Netfilter Core Team Oct 25 09:03:34 sh-103-53.int systemd[1]: Inserted module 'ip_tables' Oct 25 09:03:35 sh-103-53.int kernel: loop: module loaded Oct 25 09:03:35 sh-103-53.int kernel: EXT4-fs (loop0): mounted filesystem with ordered data mode. 
Opts: user_xattr Oct 25 09:03:35 sh-103-53.int kernel: ACPI Error: No handler for Region [SYSI] (ffff97539b9cad80) [IPMI] (20130517/evregion-162) Oct 25 09:03:35 sh-103-53.int kernel: ACPI Error: Oct 25 09:03:35 sh-103-53.int kernel: Region IPMI (ID=7) has no handler (20130517/exfldio-305) Oct 25 09:03:35 sh-103-53.int kernel: ACPI Error: Method parse/execution failed [\_SB_.PMI0._GHL] (Node ffff97539b93a618), AE_NOT_EXIST (20130517/psparse-536) Oct 25 09:03:35 sh-103-53.int kernel: ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMC] (Node ffff97539b93a578), AE_NOT_EXIST (20130517/psparse-536) Oct 25 09:03:35 sh-103-53.int kernel: ACPI Exception: AE_NOT_EXIST, Evaluating _PMC (20130517/power_meter-753) Oct 25 09:03:35 sh-103-53.int kernel: ipmi message handler version 39.2 Oct 25 09:03:35 sh-103-53.int kernel: wmi_bus wmi_bus-PNP0C14:00: WQBC data block query control method not found Oct 25 09:03:35 sh-103-53.int kernel: lpc_ich 0000:00:1f.0: I/O space for ACPI uninitialized Oct 25 09:03:35 sh-103-53.int kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0 Oct 25 09:03:35 sh-103-53.int kernel: lpc_ich 0000:00:1f.0: No MFD cells added Oct 25 09:03:35 sh-103-53.int kernel: i801_smbus 0000:00:1f.4: SMBus using PCI interrupt Oct 25 09:03:35 sh-103-53.int kernel: input: PC Speaker as /devices/platform/pcspkr/input/input1 Oct 25 09:03:35 sh-103-53.int kernel: ipmi device interface Oct 25 09:03:35 sh-103-53.int kernel: cryptd: max_cpu_qlen set to 1000 Oct 25 09:03:35 sh-103-53.int kernel: mei_me 0000:00:16.0: irq 72 for MSI/MSI-X Oct 25 09:03:35 sh-103-53.int kernel: IPMI System Interface driver Oct 25 09:03:35 sh-103-53.int kernel: ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS Oct 25 09:03:35 sh-103-53.int kernel: ipmi_si: SMBIOS: io 0xca8 regsize 1 spacing 4 irq 10 Oct 25 09:03:35 sh-103-53.int kernel: ipmi_si: Adding SMBIOS-specified kcs state machine Oct 25 09:03:35 sh-103-53.int kernel: ipmi_si IPI0001:00: ipmi_platform: probing via ACPI Oct 25 
09:03:35 sh-103-53.int kernel: ipmi_si IPI0001:00: [io 0x0ca8] regsize 1 spacing 4 irq 10 Oct 25 09:03:35 sh-103-53.int kernel: ipmi_si dmi-ipmi-si.0: Removing SMBIOS-specified kcs state machine in favor of ACPI Oct 25 09:03:35 sh-103-53.int kernel: ipmi_si: Adding ACPI-specified kcs state machine Oct 25 09:03:35 sh-103-53.int kernel: ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca8, slave address 0x20, irq 10 Oct 25 09:03:36 sh-103-53.int kernel: AVX2 version of gcm_enc/dec engaged. Oct 25 09:03:36 sh-103-53.int kernel: AES CTR mode by8 optimization enabled Oct 25 09:03:36 sh-103-53.int kernel: cdc_ether 1-14.3:1.0 eth0: register 'cdc_ether' at usb-0000:00:14.0-14.3, CDC Ethernet Device, 6c:2b:59:a0:5b:84 Oct 25 09:03:36 sh-103-53.int kernel: alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni) Oct 25 09:03:36 sh-103-53.int kernel: alg: No test for __generic-gcm-aes-aesni (__driver-generic-gcm-aes-aesni) Oct 25 09:03:36 sh-103-53.int kernel: XFS (sda2): Mounting V5 Filesystem Oct 25 09:03:36 sh-103-53.int kernel: ipmi_si IPI0001:00: The BMC does not support setting the recv irq bit, compensating, but the BMC needs to be fixed. Oct 25 09:03:36 sh-103-53.int kernel: XFS (sda2): Ending clean mount Oct 25 09:03:36 sh-103-53.int kernel: ipmi_si IPI0001:00: Using irq 10 Oct 25 09:03:36 sh-103-53.int kernel: usbcore: registered new interface driver cdc_ether Oct 25 09:03:36 sh-103-53.int kernel: ipmi_si IPI0001:00: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20) Oct 25 09:03:36 sh-103-53.int kernel: Adding 4194300k swap on /dev/sda3. 
Priority:-2 extents:1 across:4194300k SSFS Oct 25 09:03:36 sh-103-53.int kernel: [TTM] Zone kernel: Available graphics memory: 98134184 kiB Oct 25 09:03:36 sh-103-53.int kernel: [TTM] Zone dma32: Available graphics memory: 2097152 kiB Oct 25 09:03:36 sh-103-53.int kernel: [TTM] Initializing pool allocator Oct 25 09:03:36 sh-103-53.int kernel: kvm: disabled by bios Oct 25 09:03:36 sh-103-53.int kernel: [TTM] Initializing DMA pool allocator Oct 25 09:03:36 sh-103-53.int kernel: ipmi_si IPI0001:00: IPMI kcs interface initialized Oct 25 09:03:36 sh-103-53.int kernel: intel_rapl: Found RAPL domain package Oct 25 09:03:36 sh-103-53.int kernel: intel_rapl: Found RAPL domain dram Oct 25 09:03:36 sh-103-53.int kernel: intel_rapl: DRAM domain energy unit 15300pj Oct 25 09:03:36 sh-103-53.int kernel: intel_rapl: Found RAPL domain package Oct 25 09:03:36 sh-103-53.int kernel: intel_rapl: Found RAPL domain dram Oct 25 09:03:36 sh-103-53.int kernel: intel_rapl: DRAM domain energy unit 15300pj Oct 25 09:03:36 sh-103-53.int kernel: kvm: disabled by bios Oct 25 09:03:36 sh-103-53.int kernel: EDAC MC0: Giving out device to 'skx_edac.c' 'Skylake Socket#0 IMC#0': DEV 0000:3a:0a.0 Oct 25 09:03:36 sh-103-53.int kernel: EDAC MC1: Giving out device to 'skx_edac.c' 'Skylake Socket#0 IMC#1': DEV 0000:3a:0c.0 Oct 25 09:03:36 sh-103-53.int kernel: dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.3) Oct 25 09:03:36 sh-103-53.int kernel: EDAC MC2: Giving out device to 'skx_edac.c' 'Skylake Socket#1 IMC#0': DEV 0000:ae:0a.0 Oct 25 09:03:36 sh-103-53.int kernel: EDAC MC3: Giving out device to 'skx_edac.c' 'Skylake Socket#1 IMC#1': DEV 0000:ae:0c.0 Oct 25 09:03:36 sh-103-53.int kernel: iTCO_vendor_support: vendor-support=0 Oct 25 09:03:36 sh-103-53.int kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11 Oct 25 09:03:36 sh-103-53.int kernel: mgag200 0000:03:00.0: VGA-1: EDID is invalid: Oct 25 09:03:36 sh-103-53.int kernel: [00] BAD 09 d1 9e 76 6e 08 00 00 ff ff ff ff ff ff 
ff ff Oct 25 09:03:36 sh-103-53.int kernel: [00] BAD ee d3 d0 a1 5a 4c 94 24 ff ff ff ff ff ff ff ff Oct 25 09:03:36 sh-103-53.int kernel: [00] BAD 01 01 01 01 01 01 01 01 ff ff ff ff ff ff ff ff Oct 25 09:03:36 sh-103-53.int kernel: [00] BAD 00 40 41 00 26 30 18 88 ff ff ff ff ff ff ff ff Oct 25 09:03:36 sh-103-53.int kernel: [00] BAD d5 09 80 a0 20 5e 63 10 ff ff ff ff ff ff ff ff Oct 25 09:03:36 sh-103-53.int kernel: [00] BAD 00 1a 00 00 00 fd 00 38 ff ff ff ff ff ff ff ff Oct 25 09:03:36 sh-103-53.int kernel: [00] BAD 20 20 20 20 00 00 00 fc ff ff ff ff ff ff ff ff Oct 25 09:03:36 sh-103-53.int kernel: [00] BAD 35 31 47 0a 20 20 00 68 ff ff ff ff ff ff ff ff Oct 25 09:03:36 sh-103-53.int kernel: iTCO_wdt: Found a Intel PCH TCO device (Version=4, TCOBASE=0x0400) Oct 25 09:03:36 sh-103-53.int kernel: fbcon: mgadrmfb (fb0) is primary device Oct 25 09:03:36 sh-103-53.int kernel: iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0) Oct 25 09:03:36 sh-103-53.int kernel: Console: switching to colour frame buffer device 128x48 Oct 25 09:03:36 sh-103-53.int kernel: mgag200 0000:03:00.0: fb0: mgadrmfb frame buffer device Oct 25 09:03:36 sh-103-53.int kernel: [drm] Initialized mgag200 1.0.0 20110418 for 0000:03:00.0 on minor 0 Oct 25 09:03:37 sh-103-53.int kernel: RPC: Registered named UNIX socket transport module. Oct 25 09:03:37 sh-103-53.int kernel: RPC: Registered udp transport module. Oct 25 09:03:37 sh-103-53.int kernel: RPC: Registered tcp transport module. Oct 25 09:03:37 sh-103-53.int kernel: RPC: Registered tcp NFSv4.1 backchannel transport module. 
Oct 25 09:03:37 sh-103-53.int kernel: FS-Cache: Loaded Oct 25 09:03:37 sh-103-53.int kernel: CacheFiles: Loaded Oct 25 09:03:37 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:37 sh-103-53.int kernel: FS-Cache: Cache "mycache" added (type cachefiles) Oct 25 09:03:37 sh-103-53.int kernel: CacheFiles: File cache on loop0 registered Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: mlx5_core 0000:5e:00.0: slow_pci_heuristic:5575:(pid 7309): Max link speed = 100000, PCI BW = 126016 Oct 25 09:03:38 sh-103-53.int kernel: mlx5_core 0000:5e:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0) Oct 25 09:03:38 sh-103-53.int kernel: mlx5_core 0000:5e:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0) Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 
61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:38 sh-103-53.int kernel: Request for unknown module key 'Mellanox Technologies signing key: 61feb074fc7292f958419386ffdd9d5ca999e403' err -11 Oct 25 09:03:39 sh-103-53.int kernel: igb 0000:04:00.0: changing MTU from 1500 to 9000 Oct 25 09:03:39 sh-103-53.int kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready Oct 25 09:03:42 sh-103-53.int kernel: igb 0000:04:00.0 em1: igb: em1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX Oct 25 09:03:42 sh-103-53.int kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready Oct 25 09:03:43 sh-103-53.int kernel: IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready Oct 25 09:03:43 sh-103-53.int kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready Oct 25 09:03:49 sh-103-53.int kernel: FS-Cache: Netfs 'nfs' registered for caching Oct 25 09:06:46 sh-103-53.int kernel: kvm: disabled by bios Oct 25 09:06:52 sh-103-53.int kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2 Oct 25 09:06:52 sh-103-53.int kernel: alg: No test for adler32 (adler32-zlib) Oct 25 09:06:53 sh-103-53.int kernel: Lustre: Lustre: Build Version: 2.12.3 Oct 25 09:06:53 sh-103-53.int kernel: LNet: Using FastReg for registration Oct 25 09:06:53 sh-103-53.int kernel: LNet: Added LNI 10.9.103.53@o2ib4 [8/256/0/180] Oct 25 09:06:53 sh-103-53.int kernel: LNetError: 91091:0:(api-ni.c:3214:lnet_dyn_del_ni()) net tcp not found Oct 25 09:06:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 206 seconds Oct 25 09:06:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 09:06:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 207 seconds Oct 25 09:06:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 209 seconds Oct 25 09:07:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.24@o2ib4 added to recovery queue. Health = 900 Oct 25 09:07:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 25 09:07:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 1 previous similar message Oct 25 09:07:09 sh-103-53.int kernel: Lustre: Mounted fir-client Oct 25 09:07:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 09:07:38 sh-103-53.int kernel: LNetError: 3539:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:07:53 sh-103-53.int kernel: LNetError: 3539:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:07:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 09:08:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 09:08:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 7 previous similar messages Oct 25 09:08:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 09:08:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 3 previous similar messages Oct 25 09:09:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 09:09:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 3 previous similar messages Oct 25 09:09:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:09:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages Oct 25 09:09:28 sh-103-53.int kernel: INFO: task mount.lustre:91199 blocked for more than 120 seconds. Oct 25 09:09:28 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 25 09:09:28 sh-103-53.int kernel: mount.lustre D ffff9769fcab2080 0 91199 91198 0x00000000 Oct 25 09:09:28 sh-103-53.int kernel: Call Trace: Oct 25 09:09:28 sh-103-53.int kernel: [] ? __enqueue_entity+0x78/0x80 Oct 25 09:09:28 sh-103-53.int kernel: [] ? enqueue_entity+0x2ef/0xbe0 Oct 25 09:09:28 sh-103-53.int kernel: [] schedule+0x29/0x70 Oct 25 09:09:28 sh-103-53.int kernel: [] schedule_timeout+0x221/0x2d0 Oct 25 09:09:28 sh-103-53.int kernel: [] ? tracing_record_cmdline+0x1d/0x120 Oct 25 09:09:29 sh-103-53.int kernel: [] ? probe_sched_wakeup+0x2b/0xa0 Oct 25 09:09:29 sh-103-53.int kernel: [] ? ttwu_do_wakeup+0xb5/0xe0 Oct 25 09:09:29 sh-103-53.int kernel: [] wait_for_completion+0xfd/0x140 Oct 25 09:09:29 sh-103-53.int kernel: [] ? 
wake_up_state+0x20/0x20 Oct 25 09:09:29 sh-103-53.int kernel: [] llog_process_or_fork+0x244/0x450 [obdclass] Oct 25 09:09:29 sh-103-53.int kernel: [] llog_process+0x14/0x20 [obdclass] Oct 25 09:09:29 sh-103-53.int kernel: [] class_config_parse_llog+0x125/0x350 [obdclass] Oct 25 09:09:29 sh-103-53.int kernel: [] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 25 09:09:29 sh-103-53.int kernel: [] mgc_process_log+0x3d9/0x8f0 [mgc] Oct 25 09:09:29 sh-103-53.int kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc] Oct 25 09:09:29 sh-103-53.int kernel: [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 25 09:09:29 sh-103-53.int kernel: [] mgc_process_config+0x88b/0x13f0 [mgc] Oct 25 09:09:29 sh-103-53.int kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass] Oct 25 09:09:29 sh-103-53.int kernel: [] ? kobject_uevent+0xb/0x10 Oct 25 09:09:29 sh-103-53.int kernel: [] ? kset_register+0x56/0x70 Oct 25 09:09:29 sh-103-53.int kernel: [] ? ll_debugfs_register_super+0x480/0x740 [lustre] Oct 25 09:09:29 sh-103-53.int kernel: [] ll_fill_super+0x4eb/0x11f0 [lustre] Oct 25 09:09:29 sh-103-53.int kernel: [] lustre_fill_super+0x28c/0x920 [obdclass] Oct 25 09:09:29 sh-103-53.int kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 25 09:09:29 sh-103-53.int kernel: [] mount_nodev+0x4f/0xb0 Oct 25 09:09:29 sh-103-53.int kernel: [] lustre_mount+0x38/0x60 [obdclass] Oct 25 09:09:29 sh-103-53.int kernel: [] mount_fs+0x3e/0x1b0 Oct 25 09:09:29 sh-103-53.int kernel: [] vfs_kern_mount+0x67/0x110 Oct 25 09:09:29 sh-103-53.int kernel: [] do_mount+0x1ef/0xce0 Oct 25 09:09:29 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Oct 25 09:09:29 sh-103-53.int kernel: [] ? 
kmem_cache_alloc_trace+0x3c/0x200 Oct 25 09:09:29 sh-103-53.int kernel: [] SyS_mount+0x83/0xd0 Oct 25 09:09:29 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Oct 25 09:10:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:10:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages Oct 25 09:10:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 09:10:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 25 09:11:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:11:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages Oct 25 09:11:29 sh-103-53.int kernel: INFO: task mount.lustre:91199 blocked for more than 120 seconds. Oct 25 09:11:29 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 25 09:11:29 sh-103-53.int kernel: mount.lustre D ffff9769fcab2080 0 91199 91198 0x00000000 Oct 25 09:11:29 sh-103-53.int kernel: Call Trace: Oct 25 09:11:29 sh-103-53.int kernel: [] ? __enqueue_entity+0x78/0x80 Oct 25 09:11:29 sh-103-53.int kernel: [] ? enqueue_entity+0x2ef/0xbe0 Oct 25 09:11:29 sh-103-53.int kernel: [] schedule+0x29/0x70 Oct 25 09:11:29 sh-103-53.int kernel: [] schedule_timeout+0x221/0x2d0 Oct 25 09:11:29 sh-103-53.int kernel: [] ? tracing_record_cmdline+0x1d/0x120 Oct 25 09:11:29 sh-103-53.int kernel: [] ? probe_sched_wakeup+0x2b/0xa0 Oct 25 09:11:29 sh-103-53.int kernel: [] ? 
ttwu_do_wakeup+0xb5/0xe0 Oct 25 09:11:29 sh-103-53.int kernel: [] wait_for_completion+0xfd/0x140 Oct 25 09:11:29 sh-103-53.int kernel: [] ? wake_up_state+0x20/0x20 Oct 25 09:11:29 sh-103-53.int kernel: [] llog_process_or_fork+0x244/0x450 [obdclass] Oct 25 09:11:29 sh-103-53.int kernel: [] llog_process+0x14/0x20 [obdclass] Oct 25 09:11:29 sh-103-53.int kernel: [] class_config_parse_llog+0x125/0x350 [obdclass] Oct 25 09:11:29 sh-103-53.int kernel: [] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 25 09:11:29 sh-103-53.int kernel: [] mgc_process_log+0x3d9/0x8f0 [mgc] Oct 25 09:11:29 sh-103-53.int kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc] Oct 25 09:11:29 sh-103-53.int kernel: [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 25 09:11:29 sh-103-53.int kernel: [] mgc_process_config+0x88b/0x13f0 [mgc] Oct 25 09:11:29 sh-103-53.int kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass] Oct 25 09:11:29 sh-103-53.int kernel: [] ? kobject_uevent+0xb/0x10 Oct 25 09:11:29 sh-103-53.int kernel: [] ? kset_register+0x56/0x70 Oct 25 09:11:29 sh-103-53.int kernel: [] ? ll_debugfs_register_super+0x480/0x740 [lustre] Oct 25 09:11:29 sh-103-53.int kernel: [] ll_fill_super+0x4eb/0x11f0 [lustre] Oct 25 09:11:29 sh-103-53.int kernel: [] lustre_fill_super+0x28c/0x920 [obdclass] Oct 25 09:11:29 sh-103-53.int kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 25 09:11:29 sh-103-53.int kernel: [] mount_nodev+0x4f/0xb0 Oct 25 09:11:29 sh-103-53.int kernel: [] lustre_mount+0x38/0x60 [obdclass] Oct 25 09:11:29 sh-103-53.int kernel: [] mount_fs+0x3e/0x1b0 Oct 25 09:11:29 sh-103-53.int kernel: [] vfs_kern_mount+0x67/0x110 Oct 25 09:11:29 sh-103-53.int kernel: [] do_mount+0x1ef/0xce0 Oct 25 09:11:29 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Oct 25 09:11:29 sh-103-53.int kernel: [] ? 
kmem_cache_alloc_trace+0x3c/0x200 Oct 25 09:11:29 sh-103-53.int kernel: [] SyS_mount+0x83/0xd0 Oct 25 09:11:29 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Oct 25 09:12:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:12:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages Oct 25 09:12:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 09:12:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 12 previous similar messages Oct 25 09:13:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:13:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages Oct 25 09:13:29 sh-103-53.int kernel: INFO: task mount.lustre:91199 blocked for more than 120 seconds. Oct 25 09:13:29 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 25 09:13:29 sh-103-53.int kernel: mount.lustre D ffff9769fcab2080 0 91199 91198 0x00000000 Oct 25 09:13:29 sh-103-53.int kernel: Call Trace: Oct 25 09:13:29 sh-103-53.int kernel: [] ? __enqueue_entity+0x78/0x80 Oct 25 09:13:29 sh-103-53.int kernel: [] ? enqueue_entity+0x2ef/0xbe0 Oct 25 09:13:29 sh-103-53.int kernel: [] schedule+0x29/0x70 Oct 25 09:13:29 sh-103-53.int kernel: [] schedule_timeout+0x221/0x2d0 Oct 25 09:13:29 sh-103-53.int kernel: [] ? tracing_record_cmdline+0x1d/0x120 Oct 25 09:13:29 sh-103-53.int kernel: [] ? probe_sched_wakeup+0x2b/0xa0 Oct 25 09:13:29 sh-103-53.int kernel: [] ? 
ttwu_do_wakeup+0xb5/0xe0 Oct 25 09:13:29 sh-103-53.int kernel: [] wait_for_completion+0xfd/0x140 Oct 25 09:13:29 sh-103-53.int kernel: [] ? wake_up_state+0x20/0x20 Oct 25 09:13:29 sh-103-53.int kernel: [] llog_process_or_fork+0x244/0x450 [obdclass] Oct 25 09:13:29 sh-103-53.int kernel: [] llog_process+0x14/0x20 [obdclass] Oct 25 09:13:29 sh-103-53.int kernel: [] class_config_parse_llog+0x125/0x350 [obdclass] Oct 25 09:13:29 sh-103-53.int kernel: [] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 25 09:13:29 sh-103-53.int kernel: [] mgc_process_log+0x3d9/0x8f0 [mgc] Oct 25 09:13:29 sh-103-53.int kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc] Oct 25 09:13:29 sh-103-53.int kernel: [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 25 09:13:29 sh-103-53.int kernel: [] mgc_process_config+0x88b/0x13f0 [mgc] Oct 25 09:13:29 sh-103-53.int kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass] Oct 25 09:13:29 sh-103-53.int kernel: [] ? kobject_uevent+0xb/0x10 Oct 25 09:13:29 sh-103-53.int kernel: [] ? kset_register+0x56/0x70 Oct 25 09:13:29 sh-103-53.int kernel: [] ? ll_debugfs_register_super+0x480/0x740 [lustre] Oct 25 09:13:29 sh-103-53.int kernel: [] ll_fill_super+0x4eb/0x11f0 [lustre] Oct 25 09:13:29 sh-103-53.int kernel: [] lustre_fill_super+0x28c/0x920 [obdclass] Oct 25 09:13:29 sh-103-53.int kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 25 09:13:29 sh-103-53.int kernel: [] mount_nodev+0x4f/0xb0 Oct 25 09:13:29 sh-103-53.int kernel: [] lustre_mount+0x38/0x60 [obdclass] Oct 25 09:13:29 sh-103-53.int kernel: [] mount_fs+0x3e/0x1b0 Oct 25 09:13:29 sh-103-53.int kernel: [] vfs_kern_mount+0x67/0x110 Oct 25 09:13:29 sh-103-53.int kernel: [] do_mount+0x1ef/0xce0 Oct 25 09:13:29 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Oct 25 09:13:29 sh-103-53.int kernel: [] ? 
kmem_cache_alloc_trace+0x3c/0x200 Oct 25 09:13:29 sh-103-53.int kernel: [] SyS_mount+0x83/0xd0 Oct 25 09:13:29 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Oct 25 09:14:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:14:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages Oct 25 09:15:29 sh-103-53.int kernel: INFO: task mount.lustre:91199 blocked for more than 120 seconds. Oct 25 09:15:29 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 25 09:15:29 sh-103-53.int kernel: mount.lustre D ffff9769fcab2080 0 91199 91198 0x00000000 Oct 25 09:15:29 sh-103-53.int kernel: Call Trace: Oct 25 09:15:29 sh-103-53.int kernel: [] ? __enqueue_entity+0x78/0x80 Oct 25 09:15:29 sh-103-53.int kernel: [] ? enqueue_entity+0x2ef/0xbe0 Oct 25 09:15:29 sh-103-53.int kernel: [] schedule+0x29/0x70 Oct 25 09:15:29 sh-103-53.int kernel: [] schedule_timeout+0x221/0x2d0 Oct 25 09:15:29 sh-103-53.int kernel: [] ? tracing_record_cmdline+0x1d/0x120 Oct 25 09:15:29 sh-103-53.int kernel: [] ? probe_sched_wakeup+0x2b/0xa0 Oct 25 09:15:29 sh-103-53.int kernel: [] ? ttwu_do_wakeup+0xb5/0xe0 Oct 25 09:15:29 sh-103-53.int kernel: [] wait_for_completion+0xfd/0x140 Oct 25 09:15:29 sh-103-53.int kernel: [] ? wake_up_state+0x20/0x20 Oct 25 09:15:29 sh-103-53.int kernel: [] llog_process_or_fork+0x244/0x450 [obdclass] Oct 25 09:15:29 sh-103-53.int kernel: [] llog_process+0x14/0x20 [obdclass] Oct 25 09:15:29 sh-103-53.int kernel: [] class_config_parse_llog+0x125/0x350 [obdclass] Oct 25 09:15:29 sh-103-53.int kernel: [] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 25 09:15:29 sh-103-53.int kernel: [] mgc_process_log+0x3d9/0x8f0 [mgc] Oct 25 09:15:29 sh-103-53.int kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc] Oct 25 09:15:29 sh-103-53.int kernel: [] ? 
class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 25 09:15:29 sh-103-53.int kernel: [] mgc_process_config+0x88b/0x13f0 [mgc] Oct 25 09:15:29 sh-103-53.int kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass] Oct 25 09:15:29 sh-103-53.int kernel: [] ? kobject_uevent+0xb/0x10 Oct 25 09:15:29 sh-103-53.int kernel: [] ? kset_register+0x56/0x70 Oct 25 09:15:29 sh-103-53.int kernel: [] ? ll_debugfs_register_super+0x480/0x740 [lustre] Oct 25 09:15:29 sh-103-53.int kernel: [] ll_fill_super+0x4eb/0x11f0 [lustre] Oct 25 09:15:29 sh-103-53.int kernel: [] lustre_fill_super+0x28c/0x920 [obdclass] Oct 25 09:15:29 sh-103-53.int kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 25 09:15:29 sh-103-53.int kernel: [] mount_nodev+0x4f/0xb0 Oct 25 09:15:29 sh-103-53.int kernel: [] lustre_mount+0x38/0x60 [obdclass] Oct 25 09:15:29 sh-103-53.int kernel: [] mount_fs+0x3e/0x1b0 Oct 25 09:15:29 sh-103-53.int kernel: [] vfs_kern_mount+0x67/0x110 Oct 25 09:15:29 sh-103-53.int kernel: [] do_mount+0x1ef/0xce0 Oct 25 09:15:29 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Oct 25 09:15:29 sh-103-53.int kernel: [] ? kmem_cache_alloc_trace+0x3c/0x200 Oct 25 09:15:29 sh-103-53.int kernel: [] SyS_mount+0x83/0xd0 Oct 25 09:15:29 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Oct 25 09:16:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 09:16:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 6 previous similar messages Oct 25 09:16:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 09:16:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 29 previous similar messages Oct 25 09:17:14 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 09:17:29 sh-103-53.int kernel: INFO: task mount.lustre:91199 blocked for more than 120 seconds. Oct 25 09:17:29 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 25 09:17:29 sh-103-53.int kernel: mount.lustre D ffff9769fcab2080 0 91199 91198 0x00000000 Oct 25 09:17:29 sh-103-53.int kernel: Call Trace: Oct 25 09:17:29 sh-103-53.int kernel: [] ? __enqueue_entity+0x78/0x80 Oct 25 09:17:29 sh-103-53.int kernel: [] ? enqueue_entity+0x2ef/0xbe0 Oct 25 09:17:29 sh-103-53.int kernel: [] schedule+0x29/0x70 Oct 25 09:17:29 sh-103-53.int kernel: [] schedule_timeout+0x221/0x2d0 Oct 25 09:17:29 sh-103-53.int kernel: [] ? tracing_record_cmdline+0x1d/0x120 Oct 25 09:17:29 sh-103-53.int kernel: [] ? probe_sched_wakeup+0x2b/0xa0 Oct 25 09:17:29 sh-103-53.int kernel: [] ? ttwu_do_wakeup+0xb5/0xe0 Oct 25 09:17:29 sh-103-53.int kernel: [] wait_for_completion+0xfd/0x140 Oct 25 09:17:29 sh-103-53.int kernel: [] ? 
wake_up_state+0x20/0x20 Oct 25 09:17:29 sh-103-53.int kernel: [] llog_process_or_fork+0x244/0x450 [obdclass] Oct 25 09:17:29 sh-103-53.int kernel: [] llog_process+0x14/0x20 [obdclass] Oct 25 09:17:29 sh-103-53.int kernel: [] class_config_parse_llog+0x125/0x350 [obdclass] Oct 25 09:17:29 sh-103-53.int kernel: [] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 25 09:17:29 sh-103-53.int kernel: [] mgc_process_log+0x3d9/0x8f0 [mgc] Oct 25 09:17:29 sh-103-53.int kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc] Oct 25 09:17:29 sh-103-53.int kernel: [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 25 09:17:29 sh-103-53.int kernel: [] mgc_process_config+0x88b/0x13f0 [mgc] Oct 25 09:17:30 sh-103-53.int kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass] Oct 25 09:17:30 sh-103-53.int kernel: [] ? kobject_uevent+0xb/0x10 Oct 25 09:17:30 sh-103-53.int kernel: [] ? kset_register+0x56/0x70 Oct 25 09:17:30 sh-103-53.int kernel: [] ? ll_debugfs_register_super+0x480/0x740 [lustre] Oct 25 09:17:30 sh-103-53.int kernel: [] ll_fill_super+0x4eb/0x11f0 [lustre] Oct 25 09:17:30 sh-103-53.int kernel: [] lustre_fill_super+0x28c/0x920 [obdclass] Oct 25 09:17:30 sh-103-53.int kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 25 09:17:30 sh-103-53.int kernel: [] mount_nodev+0x4f/0xb0 Oct 25 09:17:30 sh-103-53.int kernel: [] lustre_mount+0x38/0x60 [obdclass] Oct 25 09:17:30 sh-103-53.int kernel: [] mount_fs+0x3e/0x1b0 Oct 25 09:17:30 sh-103-53.int kernel: [] vfs_kern_mount+0x67/0x110 Oct 25 09:17:30 sh-103-53.int kernel: [] do_mount+0x1ef/0xce0 Oct 25 09:17:30 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Oct 25 09:17:30 sh-103-53.int kernel: [] ? 
kmem_cache_alloc_trace+0x3c/0x200 Oct 25 09:17:30 sh-103-53.int kernel: [] SyS_mount+0x83/0xd0 Oct 25 09:17:30 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Oct 25 09:19:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:19:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 11 previous similar messages Oct 25 09:19:30 sh-103-53.int kernel: INFO: task mount.lustre:91199 blocked for more than 120 seconds. Oct 25 09:19:30 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 25 09:19:30 sh-103-53.int kernel: mount.lustre D ffff9769fcab2080 0 91199 91198 0x00000000 Oct 25 09:19:30 sh-103-53.int kernel: Call Trace: Oct 25 09:19:30 sh-103-53.int kernel: [] ? __enqueue_entity+0x78/0x80 Oct 25 09:19:30 sh-103-53.int kernel: [] ? enqueue_entity+0x2ef/0xbe0 Oct 25 09:19:30 sh-103-53.int kernel: [] schedule+0x29/0x70 Oct 25 09:19:30 sh-103-53.int kernel: [] schedule_timeout+0x221/0x2d0 Oct 25 09:19:30 sh-103-53.int kernel: [] ? tracing_record_cmdline+0x1d/0x120 Oct 25 09:19:30 sh-103-53.int kernel: [] ? probe_sched_wakeup+0x2b/0xa0 Oct 25 09:19:30 sh-103-53.int kernel: [] ? ttwu_do_wakeup+0xb5/0xe0 Oct 25 09:19:30 sh-103-53.int kernel: [] wait_for_completion+0xfd/0x140 Oct 25 09:19:30 sh-103-53.int kernel: [] ? wake_up_state+0x20/0x20 Oct 25 09:19:30 sh-103-53.int kernel: [] llog_process_or_fork+0x244/0x450 [obdclass] Oct 25 09:19:30 sh-103-53.int kernel: [] llog_process+0x14/0x20 [obdclass] Oct 25 09:19:30 sh-103-53.int kernel: [] class_config_parse_llog+0x125/0x350 [obdclass] Oct 25 09:19:30 sh-103-53.int kernel: [] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 25 09:19:30 sh-103-53.int kernel: [] mgc_process_log+0x3d9/0x8f0 [mgc] Oct 25 09:19:30 sh-103-53.int kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc] Oct 25 09:19:30 sh-103-53.int kernel: [] ? 
class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 25 09:19:30 sh-103-53.int kernel: [] mgc_process_config+0x88b/0x13f0 [mgc] Oct 25 09:19:30 sh-103-53.int kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass] Oct 25 09:19:30 sh-103-53.int kernel: [] ? kobject_uevent+0xb/0x10 Oct 25 09:19:30 sh-103-53.int kernel: [] ? kset_register+0x56/0x70 Oct 25 09:19:30 sh-103-53.int kernel: [] ? ll_debugfs_register_super+0x480/0x740 [lustre] Oct 25 09:19:30 sh-103-53.int kernel: [] ll_fill_super+0x4eb/0x11f0 [lustre] Oct 25 09:19:30 sh-103-53.int kernel: [] lustre_fill_super+0x28c/0x920 [obdclass] Oct 25 09:19:30 sh-103-53.int kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 25 09:19:30 sh-103-53.int kernel: [] mount_nodev+0x4f/0xb0 Oct 25 09:19:30 sh-103-53.int kernel: [] lustre_mount+0x38/0x60 [obdclass] Oct 25 09:19:30 sh-103-53.int kernel: [] mount_fs+0x3e/0x1b0 Oct 25 09:19:30 sh-103-53.int kernel: [] vfs_kern_mount+0x67/0x110 Oct 25 09:19:30 sh-103-53.int kernel: [] do_mount+0x1ef/0xce0 Oct 25 09:19:30 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Oct 25 09:19:30 sh-103-53.int kernel: [] ? kmem_cache_alloc_trace+0x3c/0x200 Oct 25 09:19:30 sh-103-53.int kernel: [] SyS_mount+0x83/0xd0 Oct 25 09:19:30 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Oct 25 09:21:30 sh-103-53.int kernel: INFO: task mount.lustre:91199 blocked for more than 120 seconds. Oct 25 09:21:30 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 25 09:21:30 sh-103-53.int kernel: mount.lustre D ffff9769fcab2080 0 91199 91198 0x00000000 Oct 25 09:21:30 sh-103-53.int kernel: Call Trace: Oct 25 09:21:30 sh-103-53.int kernel: [] ? __enqueue_entity+0x78/0x80 Oct 25 09:21:30 sh-103-53.int kernel: [] ? enqueue_entity+0x2ef/0xbe0 Oct 25 09:21:30 sh-103-53.int kernel: [] schedule+0x29/0x70 Oct 25 09:21:30 sh-103-53.int kernel: [] schedule_timeout+0x221/0x2d0 Oct 25 09:21:30 sh-103-53.int kernel: [] ? 
tracing_record_cmdline+0x1d/0x120 Oct 25 09:21:30 sh-103-53.int kernel: [] ? probe_sched_wakeup+0x2b/0xa0 Oct 25 09:21:30 sh-103-53.int kernel: [] ? ttwu_do_wakeup+0xb5/0xe0 Oct 25 09:21:30 sh-103-53.int kernel: [] wait_for_completion+0xfd/0x140 Oct 25 09:21:30 sh-103-53.int kernel: [] ? wake_up_state+0x20/0x20 Oct 25 09:21:30 sh-103-53.int kernel: [] llog_process_or_fork+0x244/0x450 [obdclass] Oct 25 09:21:30 sh-103-53.int kernel: [] llog_process+0x14/0x20 [obdclass] Oct 25 09:21:30 sh-103-53.int kernel: [] class_config_parse_llog+0x125/0x350 [obdclass] Oct 25 09:21:30 sh-103-53.int kernel: [] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 25 09:21:30 sh-103-53.int kernel: [] mgc_process_log+0x3d9/0x8f0 [mgc] Oct 25 09:21:30 sh-103-53.int kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc] Oct 25 09:21:30 sh-103-53.int kernel: [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 25 09:21:30 sh-103-53.int kernel: [] mgc_process_config+0x88b/0x13f0 [mgc] Oct 25 09:21:30 sh-103-53.int kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass] Oct 25 09:21:30 sh-103-53.int kernel: [] ? kobject_uevent+0xb/0x10 Oct 25 09:21:30 sh-103-53.int kernel: [] ? kset_register+0x56/0x70 Oct 25 09:21:30 sh-103-53.int kernel: [] ? ll_debugfs_register_super+0x480/0x740 [lustre] Oct 25 09:21:30 sh-103-53.int kernel: [] ll_fill_super+0x4eb/0x11f0 [lustre] Oct 25 09:21:30 sh-103-53.int kernel: [] lustre_fill_super+0x28c/0x920 [obdclass] Oct 25 09:21:30 sh-103-53.int kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 25 09:21:30 sh-103-53.int kernel: [] mount_nodev+0x4f/0xb0 Oct 25 09:21:30 sh-103-53.int kernel: [] lustre_mount+0x38/0x60 [obdclass] Oct 25 09:21:30 sh-103-53.int kernel: [] mount_fs+0x3e/0x1b0 Oct 25 09:21:30 sh-103-53.int kernel: [] vfs_kern_mount+0x67/0x110 Oct 25 09:21:30 sh-103-53.int kernel: [] do_mount+0x1ef/0xce0 Oct 25 09:21:30 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Oct 25 09:21:30 sh-103-53.int kernel: [] ? 
kmem_cache_alloc_trace+0x3c/0x200 Oct 25 09:21:30 sh-103-53.int kernel: [] SyS_mount+0x83/0xd0 Oct 25 09:21:30 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Oct 25 09:23:30 sh-103-53.int kernel: INFO: task mount.lustre:91199 blocked for more than 120 seconds. Oct 25 09:23:30 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 25 09:23:30 sh-103-53.int kernel: mount.lustre D ffff9769fcab2080 0 91199 91198 0x00000000 Oct 25 09:23:30 sh-103-53.int kernel: Call Trace: Oct 25 09:23:30 sh-103-53.int kernel: [] ? __enqueue_entity+0x78/0x80 Oct 25 09:23:30 sh-103-53.int kernel: [] ? enqueue_entity+0x2ef/0xbe0 Oct 25 09:23:30 sh-103-53.int kernel: [] schedule+0x29/0x70 Oct 25 09:23:30 sh-103-53.int kernel: [] schedule_timeout+0x221/0x2d0 Oct 25 09:23:30 sh-103-53.int kernel: [] ? tracing_record_cmdline+0x1d/0x120 Oct 25 09:23:30 sh-103-53.int kernel: [] ? probe_sched_wakeup+0x2b/0xa0 Oct 25 09:23:30 sh-103-53.int kernel: [] ? ttwu_do_wakeup+0xb5/0xe0 Oct 25 09:23:30 sh-103-53.int kernel: [] wait_for_completion+0xfd/0x140 Oct 25 09:23:30 sh-103-53.int kernel: [] ? wake_up_state+0x20/0x20 Oct 25 09:23:30 sh-103-53.int kernel: [] llog_process_or_fork+0x244/0x450 [obdclass] Oct 25 09:23:30 sh-103-53.int kernel: [] llog_process+0x14/0x20 [obdclass] Oct 25 09:23:30 sh-103-53.int kernel: [] class_config_parse_llog+0x125/0x350 [obdclass] Oct 25 09:23:30 sh-103-53.int kernel: [] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 25 09:23:30 sh-103-53.int kernel: [] mgc_process_log+0x3d9/0x8f0 [mgc] Oct 25 09:23:30 sh-103-53.int kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc] Oct 25 09:23:30 sh-103-53.int kernel: [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 25 09:23:30 sh-103-53.int kernel: [] mgc_process_config+0x88b/0x13f0 [mgc] Oct 25 09:23:30 sh-103-53.int kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass] Oct 25 09:23:30 sh-103-53.int kernel: [] ? 
kobject_uevent+0xb/0x10 Oct 25 09:23:30 sh-103-53.int kernel: [] ? kset_register+0x56/0x70 Oct 25 09:23:30 sh-103-53.int kernel: [] ? ll_debugfs_register_super+0x480/0x740 [lustre] Oct 25 09:23:30 sh-103-53.int kernel: [] ll_fill_super+0x4eb/0x11f0 [lustre] Oct 25 09:23:30 sh-103-53.int kernel: [] lustre_fill_super+0x28c/0x920 [obdclass] Oct 25 09:23:30 sh-103-53.int kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 25 09:23:30 sh-103-53.int kernel: [] mount_nodev+0x4f/0xb0 Oct 25 09:23:30 sh-103-53.int kernel: [] lustre_mount+0x38/0x60 [obdclass] Oct 25 09:23:30 sh-103-53.int kernel: [] mount_fs+0x3e/0x1b0 Oct 25 09:23:30 sh-103-53.int kernel: [] vfs_kern_mount+0x67/0x110 Oct 25 09:23:30 sh-103-53.int kernel: [] do_mount+0x1ef/0xce0 Oct 25 09:23:30 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Oct 25 09:23:30 sh-103-53.int kernel: [] ? kmem_cache_alloc_trace+0x3c/0x200 Oct 25 09:23:30 sh-103-53.int kernel: [] SyS_mount+0x83/0xd0 Oct 25 09:23:30 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Oct 25 09:24:18 sh-103-53.int kernel: Lustre: Mounted oak-client Oct 25 09:24:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:24:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 25 09:25:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 09:25:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 25 09:26:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:27:04 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 09:27:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:27:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:28:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:28:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:29:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:29:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:30:31 sh-103-53.int kernel: Key type dns_resolver registered Oct 25 09:30:31 sh-103-53.int kernel: NFS: Registering the id_resolver key type Oct 25 09:30:31 sh-103-53.int kernel: Key type id_resolver registered Oct 25 09:30:31 sh-103-53.int kernel: Key type id_legacy registered Oct 25 09:31:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:31:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 25 09:33:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 09:33:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 25 09:34:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:34:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 25 09:35:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 25 09:35:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 25 09:36:34 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 09:37:34 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 09:40:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:40:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 25 09:41:34 sh-103-53.int kernel: perf: interrupt took too long (2516 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 Oct 25 09:43:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 09:43:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 09:45:28 sh-103-53.int kernel: perf: interrupt took too long (3154 > 3145), lowering kernel.perf_event_max_sample_rate to 63000 Oct 25 09:45:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 25 09:45:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 25 09:49:13 sh-103-53.int kernel: LNetError: 3539:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 09:49:13 sh-103-53.int kernel: LNetError: 3539:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 25 09:51:59 sh-103-53.int kernel: perf: interrupt took too long (3945 > 3942), lowering kernel.perf_event_max_sample_rate to 50000 Oct 25 09:53:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 09:53:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 09:55:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 09:55:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 76 previous similar messages Oct 25 09:59:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 09:59:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 25 10:04:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 10:04:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 10:05:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Oct 25 10:05:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 83 previous similar messages Oct 25 10:05:57 sh-103-53.int kernel: perf: interrupt took too long (4939 > 4931), lowering kernel.perf_event_max_sample_rate to 40000 Oct 25 10:06:48 sh-103-53.int kernel: Lustre: 91119:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023201/real 1572023201] req@ffff97585be31b00 x1648382038265520/t0(0) o400->fir-MDT0002-mdc-ffff9781f2230800@10.0.10.53@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572023208 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:06:48 sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection to fir-MDT0002 (at 10.0.10.53@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:07:03 sh-103-53.int kernel: infiniband mlx5_0: dump_cqe:286:(pid 91088): dump error cqe Oct 25 10:07:03 sh-103-53.int kernel: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Oct 25 10:07:03 sh-103-53.int kernel: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Oct 25 10:07:03 sh-103-53.int kernel: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Oct 25 10:07:03 sh-103-53.int kernel: 00000030: 00 00 00 00 00 00 89 14 0a 00 00 88 1c 55 21 d3 Oct 25 10:07:09 sh-103-53.int kernel: LNetError: 
91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds Oct 25 10:07:09 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.21@o2ib4 (15): c: 6, oc: 0, rc: 8 Oct 25 10:07:19 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Oct 25 10:07:19 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Skipped 1 previous similar message Oct 25 10:07:19 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.23@o2ib4 (16): c: 0, oc: 0, rc: 8 Oct 25 10:07:19 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Skipped 1 previous similar message Oct 25 10:07:19 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.102@o2ib7 from Oct 25 10:07:19 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.10.102@o2ib7: -113 Oct 25 10:07:19 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1572023233/real 1572023239] req@ffff97699b2e1b00 x1648382038270208/t0(0) o400->fir-OST0005-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572023278 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 Oct 25 10:07:19 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:07:19 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 10:07:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:07:19 sh-103-53.int kernel: 
Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 75 previous similar messages Oct 25 10:07:44 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.52@o2ib7 from Oct 25 10:07:44 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 200 previous similar messages Oct 25 10:08:09 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.51@o2ib7 from Oct 25 10:08:09 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 100 previous similar messages Oct 25 10:08:34 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.52@o2ib7 from Oct 25 10:08:34 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 100 previous similar messages Oct 25 10:09:00 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.51@o2ib7 from Oct 25 10:09:00 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 100 previous similar messages Oct 25 10:09:25 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.52@o2ib7 from Oct 25 10:09:25 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 100 previous similar messages Oct 25 10:09:50 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.51@o2ib7 from Oct 25 10:09:50 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 100 previous similar messages Oct 25 10:10:04 sh-103-53.int kernel: LNetError: 3539:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 25 10:10:04 sh-103-53.int kernel: LNetError: 3539:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 25 10:10:07 sh-103-53.int kernel: LustreError: 95950:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Oct 25 10:10:15 sh-103-53.int kernel: LustreError: 167-0: fir-MDT0002-mdc-ffff9781f2230800: This client was evicted by fir-MDT0002; in progress operations using this service will fail. Oct 25 10:10:15 sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection restored to 10.0.10.53@o2ib7 (at 10.0.10.53@o2ib7) Oct 25 10:10:40 sh-103-53.int kernel: Lustre: Evicted from MGS (at MGC10.0.10.51@o2ib7_0) after server handle changed from 0xc9be2abaacfbc142 to 0xc9be2abad0572bfb Oct 25 10:10:40 sh-103-53.int kernel: LustreError: 167-0: fir-MDT0000-mdc-ffff9781f2230800: This client was evicted by fir-MDT0000; in progress operations using this service will fail. Oct 25 10:10:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 10:14:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. 
Health = 900 Oct 25 10:14:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 25 10:14:58 sh-103-53.int kernel: Lustre: 91129:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023691/real 1572023691] req@ffff9769ab746780 x1648382038354368/t0(0) o400->fir-OST0030-osc-ffff9781f2230800@10.0.10.109@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572023698 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:14:58 sh-103-53.int kernel: Lustre: fir-OST0032-osc-ffff9781f2230800: Connection to fir-OST0032 (at 10.0.10.109@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:14:58 sh-103-53.int kernel: Lustre: Skipped 98 previous similar messages Oct 25 10:14:58 sh-103-53.int kernel: Lustre: fir-OST0032-osc-ffff9781f2230800: Connection restored to 10.0.10.109@o2ib7 (at 10.0.10.109@o2ib7) Oct 25 10:14:58 sh-103-53.int kernel: Lustre: Skipped 93 previous similar messages Oct 25 10:14:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:14:58 sh-103-53.int kernel: Lustre: 91129:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 39 previous similar messages Oct 25 10:15:00 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023691/real 1572023691] req@ffff9769f2fda880 x1648382038354048/t0(0) o400->fir-OST0016-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572023700 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:15:00 sh-103-53.int kernel: Lustre: fir-OST0014-osc-ffff9781f2230800: Connection to fir-OST0014 (at 10.0.10.103@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:15:00 sh-103-53.int kernel: Lustre: Skipped 18 previous 
similar messages Oct 25 10:15:00 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Oct 25 10:15:00 sh-103-53.int kernel: Lustre: fir-OST0016-osc-ffff9781f2230800: Connection restored to 10.0.10.103@o2ib7 (at 10.0.10.103@o2ib7) Oct 25 10:15:00 sh-103-53.int kernel: Lustre: Skipped 14 previous similar messages Oct 25 10:15:14 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023691/real 1572023691] req@ffff9769f2fdec00 x1648382038353696/t0(0) o400->fir-OST0000-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572023714 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:15:14 sh-103-53.int kernel: Lustre: fir-OST0008-osc-ffff9781f2230800: Connection to fir-OST0008 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:15:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 10:15:14 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Oct 25 10:15:21 sh-103-53.int kernel: Lustre: fir-OST0052-osc-ffff9781f2230800: Connection restored to 10.0.10.113@o2ib7 (at 10.0.10.113@o2ib7) Oct 25 10:15:22 sh-103-53.int kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023714/real 1572023714] req@ffff97699dee8000 x1648382038359120/t0(0) o400->fir-OST0033-osc-ffff9781f2230800@10.0.10.110@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572023721 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:15:22 sh-103-53.int kernel: Lustre: fir-OST003c-osc-ffff9781f2230800: Connection to fir-OST003c (at 10.0.10.111@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:15:22 sh-103-53.int kernel: Lustre: Skipped 4 previous similar messages Oct 25 10:15:22 sh-103-53.int 
kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Oct 25 10:15:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 10:15:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 79 previous similar messages Oct 25 10:15:53 sh-103-53.int kernel: Lustre: fir-OST0001-osc-ffff9781f2230800: Connection restored to 10.0.10.102@o2ib7 (at 10.0.10.102@o2ib7) Oct 25 10:15:53 sh-103-53.int kernel: Lustre: Skipped 7 previous similar messages Oct 25 10:15:58 sh-103-53.int kernel: Lustre: 91127:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023714/real 1572023714] req@ffff978214e58480 x1648382038359616/t0(0) o400->fir-OST0058-osc-ffff9781f2230800@10.0.10.115@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572023758 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:15:58 sh-103-53.int kernel: Lustre: 91127:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Oct 25 10:15:58 sh-103-53.int kernel: Lustre: fir-OST0058-osc-ffff9781f2230800: Connection to fir-OST0058 (at 10.0.10.115@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:15:58 sh-103-53.int kernel: Lustre: Skipped 6 previous similar messages Oct 25 10:16:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:16:00 sh-103-53.int kernel: LustreError: 91159:0:(mgc_request.c:599:do_requeue()) failed processing log: -5 Oct 25 10:16:27 sh-103-53.int kernel: Lustre: fir-OST0000-osc-ffff9781f2230800: Connection restored to 10.0.10.101@o2ib7 (at 10.0.10.101@o2ib7) Oct 25 10:16:27 sh-103-53.int kernel: Lustre: Skipped 14 previous similar messages Oct 25 10:16:34 sh-103-53.int kernel: Lustre: 
91128:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023787/real 1572023787] req@ffff9769da205e80 x1648382038368240/t0(0) o400->fir-OST0003-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572023794 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:16:34 sh-103-53.int kernel: Lustre: fir-OST0049-osc-ffff9781f2230800: Connection to fir-OST0049 (at 10.0.10.114@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:16:34 sh-103-53.int kernel: Lustre: Skipped 14 previous similar messages Oct 25 10:16:34 sh-103-53.int kernel: Lustre: 91128:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 20 previous similar messages Oct 25 10:16:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:16:54 sh-103-53.int kernel: LustreError: 91159:0:(mgc_request.c:2127:mgc_process_log()) MGC10.0.10.51@o2ib7: recover log fir-cliir failed, not fatal: rc = -5 Oct 25 10:16:54 sh-103-53.int kernel: LustreError: 96498:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978201b8d080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 10:17:01 sh-103-53.int kernel: Lustre: fir-OST0037-osc-ffff9781f2230800: Connection restored to 10.0.10.110@o2ib7 (at 10.0.10.110@o2ib7) Oct 25 10:17:01 sh-103-53.int kernel: Lustre: Skipped 10 previous similar messages Oct 25 10:17:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:17:38 sh-103-53.int kernel: LustreError: 91159:0:(mgc_request.c:599:do_requeue()) failed processing log: -5 Oct 25 10:18:12 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023691/real 1572023691] req@ffff9769ccb16c00 x1648382038354992/t0(0) o400->fir-OST0057-osc-ffff9781f2230800@10.0.10.116@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572023892 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:18:12 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 26 previous similar messages Oct 25 10:19:14 sh-103-53.int kernel: Lustre: fir-OST005d-osc-ffff9781f2230800: Connection to fir-OST005d (at 10.0.10.116@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:19:14 sh-103-53.int kernel: Lustre: Skipped 28 previous similar messages Oct 25 10:19:14 sh-103-53.int kernel: Lustre: fir-OST005d-osc-ffff9781f2230800: Connection restored to 10.0.10.116@o2ib7 (at 10.0.10.116@o2ib7) Oct 25 10:19:14 sh-103-53.int kernel: Lustre: Skipped 30 previous similar messages Oct 25 10:20:35 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572023787/real 1572023787] req@ffff9769dabfd100 x1648382038368576/t0(0) o400->fir-OST0018-osc-ffff9781f2230800@10.0.10.105@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572024035 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 10:20:35 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) 
Skipped 12 previous similar messages Oct 25 10:20:53 sh-103-53.int kernel: LNetError: 95683:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 10:20:53 sh-103-53.int kernel: LNetError: 95683:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 25 10:21:35 sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection to fir-MDT0002 (at 10.0.10.53@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 10:21:35 sh-103-53.int kernel: Lustre: Skipped 12 previous similar messages Oct 25 10:21:35 sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection restored to 10.0.10.53@o2ib7 (at 10.0.10.53@o2ib7) Oct 25 10:21:35 sh-103-53.int kernel: Lustre: Skipped 18 previous similar messages Oct 25 10:22:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:22:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572023868, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6c321200/0x51ab3c4ee66d6be lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad05b0431 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 10:22:48 sh-103-53.int kernel: LustreError: 97102:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978201b8d200) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 10:23:19 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 10:23:19 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 100 previous similar messages Oct 25 10:24:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 25 10:24:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 28 previous similar messages Oct 25 10:26:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 10:26:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 25 10:27:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:27:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572024175, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769ce77d7c0/0x51ab3c4ee66de04 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad16e1b95 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 10:27:55 sh-103-53.int kernel: LustreError: 97634:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769fcbc46c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 10:27:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 10:27:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 10:28:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 10:30:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 10:30:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 21 previous similar messages Oct 25 10:33:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:33:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572024483, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769b4253840/0x51ab3c4ee66df00 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad2175f47 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 10:33:03 sh-103-53.int kernel: LustreError: 116214:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781ffb17080) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 10:34:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 10:34:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 52 previous similar messages Oct 25 10:36:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 10:36:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 25 10:38:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:38:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572024788, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769b4254380/0x51ab3c4ee66df54 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad286c50f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 10:38:08 sh-103-53.int kernel: LustreError: 116793:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976ad9d56cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 10:38:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 10:38:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 10:41:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 10:41:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 25 previous similar messages Oct 25 10:41:40 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 10:43:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:43:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572025095, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93aad00/0x51ab3c4ee66dfa1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad2d78fed expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 10:43:15 sh-103-53.int kernel: LustreError: 117164:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782151f9e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 10:44:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 10:44:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 58 previous similar messages Oct 25 10:46:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 10:46:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 56 previous similar messages Oct 25 10:48:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:48:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572025404, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699ceab180/0x51ab3c4ee66dfcb lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad3469479 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 10:48:24 sh-103-53.int kernel: LustreError: 117941:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e225b5c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 10:48:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 10:48:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 10:49:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 10:51:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 10:51:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 22 previous similar messages Oct 25 10:53:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572025710, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699ceaba80/0x51ab3c4ee66e011 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad4ed9bea expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 10:53:30 sh-103-53.int kernel: LustreError: 118570:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977793a27440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 10:54:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 10:54:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 58 previous similar messages Oct 25 10:56:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 10:56:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 25 10:58:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 10:58:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 10:58:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572026019, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976999f406c0/0x51ab3c4ee66e057 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad72309d3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 10:58:39 sh-103-53.int kernel: LustreError: 119370:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f3766000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 10:58:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 10:58:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 11:01:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 11:01:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 25 previous similar messages Oct 25 11:03:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572026326, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976999f42880/0x51ab3c4ee66e09d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abad91b6972 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 11:03:46 sh-103-53.int kernel: LustreError: 119785:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f37672c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 11:05:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 11:05:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 11:06:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 11:06:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 25 11:08:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 11:08:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 11:08:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572026634, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820733ad00/0x51ab3c4ee66ea13 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abadb27b36b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 11:08:54 sh-103-53.int kernel: LustreError: 120339:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820ac2b2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 11:08:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 11:08:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 11:11:49 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 11:11:49 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 18 previous similar messages Oct 25 11:14:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572026943, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6c3218c0/0x51ab3c4ee66f65a lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abadd9bba1b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 11:15:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 11:15:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 11:16:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 11:16:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 25 11:19:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 11:19:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 11:19:09 sh-103-53.int kernel: LustreError: 122718:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978211feed80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 11:19:09 sh-103-53.int kernel: LustreError: 122718:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 11:19:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 11:19:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 11:21:50 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 11:21:50 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 26 previous similar messages Oct 25 11:24:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572027557, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0ad00/0x51ab3c4ee671e70 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abae228b6b1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 11:24:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 11:25:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 11:25:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 11:26:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 11:26:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 55 previous similar messages Oct 25 11:28:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 11:29:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 11:29:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 11:29:25 sh-103-53.int kernel: LustreError: 124110:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820ac2a300) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 11:29:25 sh-103-53.int kernel: LustreError: 124110:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 11:29:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 11:29:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 11:32:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 11:32:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 29 previous similar messages Oct 25 11:34:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572028171, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978211f6d580/0x51ab3c4ee6746e1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abae597ce26 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 11:34:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 11:35:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 11:35:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 11:37:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 11:37:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 56 previous similar messages Oct 25 11:39:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 11:39:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 11:39:38 sh-103-53.int kernel: LustreError: 125453:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782057d3740) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 11:39:38 sh-103-53.int kernel: LustreError: 125453:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 11:39:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 11:39:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 11:42:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 11:42:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 28 previous similar messages Oct 25 11:44:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572028787, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d9825a880/0x51ab3c4ee677008 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abae8c54b0c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 11:44:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 11:45:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 11:45:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 11:47:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 11:47:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 25 11:47:50 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572029263/real 1572029263] req@ffff9782125f3180 x1648382040349936/t0(0) o400->oak-OST0021-osc-ffff9781f565a800@10.0.2.102@o2ib5:28/4 lens 224/224 e 0 to 1 dl 1572029270 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 11:47:50 sh-103-53.int kernel: Lustre: oak-OST0001-osc-ffff9781f565a800: Connection to oak-OST0001 (at 10.0.2.102@o2ib5) was lost; in progress operations using this service will wait for recovery to complete Oct 25 11:47:50 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 64 previous similar messages Oct 25 11:48:22 sh-103-53.int kernel: Lustre: 120846:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572029257/real 1572029257] req@ffff975580e6bf00 x1648382040347712/t0(0) o101->fir-MDT0002-mdc-ffff9781f2230800@10.0.10.53@o2ib7:12/10 lens 1784/11712 e 0 to 1 dl 1572029302 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1 Oct 25 11:48:22 sh-103-53.int kernel: Lustre: 120846:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Oct 25 11:48:22 sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection to fir-MDT0002 (at 10.0.10.53@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 11:48:22 sh-103-53.int kernel: Lustre: Skipped 62 previous similar messages Oct 25 11:49:56 sh-103-53.int kernel: LustreError: 166-1: 
MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 11:49:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 11:49:57 sh-103-53.int kernel: LustreError: 127040:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976329e85500) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 11:49:57 sh-103-53.int kernel: LustreError: 127040:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 11:49:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 11:49:57 sh-103-53.int kernel: Lustre: Skipped 65 previous similar messages Oct 25 11:52:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 11:52:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 27 previous similar messages Oct 25 11:53:49 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 11:55:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572029405, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d9825ba80/0x51ab3c4ee679603 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abaeb5540dd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 11:55:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 11:55:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 
10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 11:55:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 11:57:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 11:57:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 25 11:58:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 12:00:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 12:00:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 12:00:11 sh-103-53.int kernel: LustreError: 130619:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711e72d500) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 12:00:11 sh-103-53.int kernel: LustreError: 130619:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 12:00:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 12:00:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 12:02:00 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 12:02:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 12:02:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 28 previous similar messages Oct 25 12:05:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572030020, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c2a0b40/0x51ab3c4ee67d24e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abaed9335fb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 12:05:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 12:06:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 12:06:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 12:07:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 12:07:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 25 12:10:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 12:10:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 12:10:30 sh-103-53.int kernel: LustreError: 132900:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc37a180) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 12:10:30 sh-103-53.int kernel: LustreError: 132900:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 12:10:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 12:10:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 12:12:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 12:12:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 31 previous similar messages Oct 25 12:15:10 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 12:15:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572030636, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699cea86c0/0x51ab3c4ee680810 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abaee023364 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 12:15:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 12:16:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 12:16:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 12:17:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 12:17:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 25 12:20:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 12:20:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 12:20:44 sh-103-53.int kernel: LustreError: 134353:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975519ba1380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 12:20:44 sh-103-53.int kernel: LustreError: 134353:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 12:20:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 12:20:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 12:22:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 12:22:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 34 previous similar messages Oct 25 12:25:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572031250, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff975730a698c0/0x51ab3c4ee6838b5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abaeeabf20d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 12:25:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 12:26:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 12:26:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 12:28:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 12:28:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 25 12:29:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 12:30:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 12:30:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 12:30:57 sh-103-53.int kernel: LustreError: 135838:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f3615800) refcount nonzero (1) after lock cleanup; forcing 
cleanup. Oct 25 12:30:57 sh-103-53.int kernel: LustreError: 135838:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 12:30:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 12:30:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 12:32:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 12:32:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 36 previous similar messages Oct 25 12:36:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572031866, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769b4252d00/0x51ab3c4ee688553 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abaf4a36cfb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 12:36:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 12:36:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 12:36:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 12:38:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 12:38:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 53 previous similar messages Oct 25 12:39:34 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 12:41:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 12:41:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 12:41:16 sh-103-53.int kernel: LustreError: 136804:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978211fef980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 12:41:16 sh-103-53.int kernel: LustreError: 136804:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 12:41:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 12:41:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 12:42:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 12:42:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 44 previous similar messages Oct 25 12:46:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572032482, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0900/0x51ab3c4ee68f385 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abaff21ba5a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 12:46:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 12:46:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 12:46:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 12:48:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 12:48:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 53 previous similar messages Oct 25 12:51:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 12:51:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 12:51:31 sh-103-53.int kernel: LustreError: 137904:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 12:51:31 sh-103-53.int kernel: LustreError: 137904:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 12:51:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 12:51:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 12:53:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 12:53:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 24 previous similar messages Oct 25 12:55:50 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 12:56:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572033100, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea3180/0x51ab3c4ee691eab lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb0320bd30 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 12:56:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 12:56:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 12:56:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 12:58:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 12:58:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 25 13:01:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 13:01:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 13:01:45 sh-103-53.int kernel: LustreError: 138795:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 13:01:45 sh-103-53.int kernel: LustreError: 138795:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 13:01:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 13:01:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 13:03:33 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 13:03:33 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 14 previous similar messages Oct 25 13:06:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572033712, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978215008000/0x51ab3c4ee691f5a lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb070c7c71 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 13:06:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 13:06:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 13:06:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 58 previous similar messages Oct 25 13:08:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 13:08:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 75 previous similar messages Oct 25 13:12:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 13:12:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 13:12:00 sh-103-53.int kernel: LustreError: 139531:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a12bb6240) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 13:12:00 sh-103-53.int kernel: LustreError: 139531:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 13:12:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 13:12:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 13:13:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 13:13:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 25 13:15:11 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 13:17:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572034326, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500ca40/0x51ab3c4ee691f92 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb0afa5427 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 13:17:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 13:17:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 13:17:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 13:18:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 13:18:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 25 13:22:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 13:22:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 13:22:13 sh-103-53.int kernel: LustreError: 140258:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978205a0cb40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 13:22:13 sh-103-53.int kernel: LustreError: 140258:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 13:22:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 13:22:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 13:24:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 13:24:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 19 previous similar messages Oct 25 13:27:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 13:27:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 57 previous similar messages Oct 25 13:27:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572034939, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500ec00/0x51ab3c4ee6923eb lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb0eab2358 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 13:27:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 13:28:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 13:28:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 54 previous similar messages Oct 25 13:32:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 13:32:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 13:32:26 sh-103-53.int kernel: LustreError: 140830:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978205a0db00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 13:32:26 sh-103-53.int kernel: LustreError: 140830:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 13:32:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 13:32:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 13:34:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 13:34:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 14 previous similar messages Oct 25 13:37:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 13:37:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 13:37:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572035552, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500a400/0x51ab3c4ee692423 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb1215512e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 13:37:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 13:39:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 25 13:39:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 25 13:42:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 13:42:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 13:42:38 sh-103-53.int kernel: LustreError: 141400:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978205a0dd40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 13:42:38 sh-103-53.int kernel: LustreError: 141400:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 13:42:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 13:42:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 13:44:24 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 13:44:24 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 17 previous similar messages Oct 25 13:47:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 13:47:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 13:47:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572036165, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4380/0x51ab3c4ee692462 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb15762711 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 13:47:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 13:49:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 13:49:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 50 previous similar messages Oct 25 13:52:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 13:52:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 13:52:52 sh-103-53.int kernel: LustreError: 141974:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186600) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 13:52:52 sh-103-53.int kernel: LustreError: 141974:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 13:52:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 13:52:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 13:54:39 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 13:54:39 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 18 previous similar messages Oct 25 13:57:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 13:57:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 13:58:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572036781, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea5580/0x51ab3c4ee69249a lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb18ede454 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 13:58:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 13:59:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 13:59:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 56 previous similar messages Oct 25 14:03:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 14:03:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 14:03:07 sh-103-53.int kernel: LustreError: 142565:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 14:03:07 sh-103-53.int kernel: LustreError: 142565:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 14:03:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 14:03:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 14:04:51 sh-103-53.int kernel: LNetError: 118051:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 14:04:51 sh-103-53.int kernel: LNetError: 118051:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 19 previous similar messages Oct 25 14:07:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 14:07:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 14:08:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572037392, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea6780/0x51ab3c4ee6924d2 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb1b9179d2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 14:08:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 14:09:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 14:09:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 25 14:13:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 14:13:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 14:13:22 sh-103-53.int kernel: LustreError: 143140:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a091869c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 14:13:22 sh-103-53.int kernel: LustreError: 143140:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 14:13:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 14:13:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 14:14:59 sh-103-53.int kernel: LNetError: 118051:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 14:14:59 sh-103-53.int kernel: LNetError: 118051:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 17 previous similar messages Oct 25 14:18:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 14:18:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 14:18:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572038010, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7980/0x51ab3c4ee69250a lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb1f060a6d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 14:18:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 14:19:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 14:19:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 25 14:23:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 14:23:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 14:23:40 sh-103-53.int kernel: LustreError: 143717:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 14:23:40 sh-103-53.int kernel: LustreError: 143717:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 14:23:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 14:23:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 14:25:05 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 14:25:05 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 16 previous similar messages Oct 25 14:28:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 14:28:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 58 previous similar messages Oct 25 14:28:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572038626, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee692542 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb22e8420b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 14:28:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 14:30:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 14:30:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 54 previous similar messages Oct 25 14:33:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 14:33:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 14:33:53 sh-103-53.int kernel: LustreError: 144289:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186fc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 14:33:53 sh-103-53.int kernel: LustreError: 144289:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 14:33:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 14:33:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 14:35:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 14:35:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 17 previous similar messages Oct 25 14:38:34 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 14:38:34 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 14:38:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572039239, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4ec0/0x51ab3c4ee69257a lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb2685e230 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 14:38:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 14:40:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 14:40:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 54 previous similar messages Oct 25 14:44:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 14:44:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 14:44:08 sh-103-53.int kernel: LustreError: 145070:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187d40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 14:44:08 sh-103-53.int kernel: LustreError: 145070:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 14:44:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 14:44:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 14:45:24 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 14:45:24 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 20 previous similar messages Oct 25 14:48:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 14:48:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 14:49:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572039855, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea1b00/0x51ab3c4ee692be7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb29ef0971 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 14:49:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 14:50:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 14:50:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 49 previous similar messages Oct 25 14:54:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 14:54:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 14:54:22 sh-103-53.int kernel: LustreError: 145650:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 14:54:22 sh-103-53.int kernel: LustreError: 145650:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 14:54:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 14:54:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 14:55:34 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 14:55:34 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 19 previous similar messages Oct 25 14:58:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 14:58:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 57 previous similar messages Oct 25 14:59:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572040471, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea33c0/0x51ab3c4ee692c1f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb2d2fdb42 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 14:59:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 15:00:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 15:00:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 51 previous similar messages Oct 25 15:04:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 15:04:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 15:04:40 sh-103-53.int kernel: LustreError: 146245:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 15:04:40 sh-103-53.int kernel: LustreError: 146245:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 15:04:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 15:04:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 15:05:40 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 15:05:40 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 18 previous similar messages Oct 25 15:08:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 15:08:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 58 previous similar messages Oct 25 15:09:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572041088, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2400/0x51ab3c4ee692c57 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb43a0c9c6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 15:09:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 15:10:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 15:10:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 51 previous similar messages Oct 25 15:14:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 15:14:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 15:14:55 sh-103-53.int kernel: LustreError: 146817:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186180) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 15:14:55 sh-103-53.int kernel: LustreError: 146817:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 15:14:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 15:14:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 15:15:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 15:15:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 17 previous similar messages Oct 25 15:19:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 15:19:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 15:20:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572041703, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0b40/0x51ab3c4ee692c8f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb5bd4bcab expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 15:20:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 15:21:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 25 15:21:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 51 previous similar messages Oct 25 15:25:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 15:25:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 15:25:13 sh-103-53.int kernel: LustreError: 147394:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186a80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 15:25:13 sh-103-53.int kernel: LustreError: 147394:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 15:25:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 15:25:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 15:26:04 sh-103-53.int kernel: LNetError: 118051:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 15:26:04 sh-103-53.int kernel: LNetError: 118051:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 17 previous similar messages Oct 25 15:29:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 15:29:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 58 previous similar messages Oct 25 15:30:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572042321, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0b40/0x51ab3c4ee692cc7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb73fa10ea expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 15:30:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 15:31:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 25 15:31:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 42 previous similar messages Oct 25 15:35:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 15:35:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 15:35:30 sh-103-53.int kernel: LustreError: 147969:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187140) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 15:35:30 sh-103-53.int kernel: LustreError: 147969:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 15:35:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 15:35:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 15:36:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 15:36:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 17 previous similar messages Oct 25 15:39:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 15:39:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 57 previous similar messages Oct 25 15:40:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572042936, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2400/0x51ab3c4ee692cff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abb8e3b755e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 15:40:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 15:41:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 15:41:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 45 previous similar messages Oct 25 15:45:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 15:45:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 15:45:44 sh-103-53.int kernel: LustreError: 148544:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 15:45:44 sh-103-53.int kernel: LustreError: 148544:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 15:45:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 15:45:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 15:47:04 sh-103-53.int kernel: LNetError: 118051:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 15:47:04 sh-103-53.int kernel: LNetError: 118051:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 16 previous similar messages Oct 25 15:49:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 15:49:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 58 previous similar messages Oct 25 15:50:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572043551, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea33c0/0x51ab3c4ee692d37 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abba839e97e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 15:50:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 15:51:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 25 15:51:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 49 previous similar messages Oct 25 15:55:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 15:55:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 15:55:57 sh-103-53.int kernel: LustreError: 149129:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a091860c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 15:55:57 sh-103-53.int kernel: LustreError: 149129:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 15:55:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 15:55:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 15:57:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 15:57:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 16 previous similar messages Oct 25 15:59:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 15:59:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 16:01:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572044164, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0d80/0x51ab3c4ee692d6f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abbbf87efee expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 16:01:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 16:01:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 16:01:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 25 16:06:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 16:06:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 16:06:11 sh-103-53.int kernel: LustreError: 149733:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187bc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 16:06:11 sh-103-53.int kernel: LustreError: 149733:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 16:06:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 16:06:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 16:07:32 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 16:07:32 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 15 previous similar messages Oct 25 16:09:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 16:09:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 16:11:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572044779, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea45c0/0x51ab3c4ee692f1a lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abbdb699de3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 16:11:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 16:11:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 25 16:11:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 49 previous similar messages Oct 25 16:16:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 16:16:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 16:16:26 sh-103-53.int kernel: LustreError: 150331:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 16:16:26 sh-103-53.int kernel: LustreError: 150331:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 16:16:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 16:16:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 16:17:49 sh-103-53.int kernel: LNetError: 150281:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 16:17:49 sh-103-53.int kernel: LNetError: 150281:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 18 previous similar messages Oct 25 16:20:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 16:20:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 58 previous similar messages Oct 25 16:21:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572045395, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee693063 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abbf82ce917 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 16:21:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 16:22:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 16:22:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 41 previous similar messages Oct 25 16:26:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 16:26:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 16:26:40 sh-103-53.int kernel: LustreError: 150917:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a091860c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 16:26:40 sh-103-53.int kernel: LustreError: 150917:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 16:26:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 16:26:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 16:28:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 16:28:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 16 previous similar messages Oct 25 16:30:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 16:30:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 16:31:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572046010, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7980/0x51ab3c4ee6931ac lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc0b942403 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 16:31:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 16:32:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 16:32:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 49 previous similar messages Oct 25 16:36:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 16:36:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 16:36:58 sh-103-53.int kernel: LustreError: 151512:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186180) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 16:36:58 sh-103-53.int kernel: LustreError: 151512:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 16:36:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 16:36:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 16:38:04 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 16:38:04 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 16 previous similar messages Oct 25 16:40:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 16:40:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 54 previous similar messages Oct 25 16:42:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572046628, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea69c0/0x51ab3c4ee6931e4 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc216d604e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 16:42:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 16:42:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 16:42:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 44 previous similar messages Oct 25 16:47:14 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 16:47:14 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 16:47:14 sh-103-53.int kernel: LustreError: 152094:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 16:47:14 sh-103-53.int kernel: LustreError: 152094:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 16:47:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 16:47:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 16:48:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 16:48:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 21 previous similar messages Oct 25 16:50:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 16:50:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 16:52:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572047243, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee69321c lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc38ca46b0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 16:52:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 16:52:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 16:52:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 47 previous similar messages Oct 25 16:57:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 16:57:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 16:57:30 sh-103-53.int kernel: LustreError: 152670:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186f00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 16:57:30 sh-103-53.int kernel: LustreError: 152670:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 16:57:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 16:57:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 16:58:24 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 16:58:24 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 18 previous similar messages Oct 25 17:00:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 17:00:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 17:02:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572047856, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea45c0/0x51ab3c4ee6932af lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc4ffe54c4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 17:02:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 17:02:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 17:02:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 51 previous similar messages Oct 25 17:07:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 17:07:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 17:07:44 sh-103-53.int kernel: LustreError: 153258:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186e40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 17:07:44 sh-103-53.int kernel: LustreError: 153258:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 17:07:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 17:07:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 17:08:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 17:08:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 15 previous similar messages Oct 25 17:10:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 17:10:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 57 previous similar messages Oct 25 17:12:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572048471, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea3cc0/0x51ab3c4ee6932e7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc659dce68 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 17:12:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 17:12:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 17:12:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 45 previous similar messages Oct 25 17:17:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 17:17:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 17:17:59 sh-103-53.int kernel: LustreError: 153842:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186780) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 17:17:59 sh-103-53.int kernel: LustreError: 153842:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 17:18:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 17:18:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 17:18:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 17:18:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 20 previous similar messages Oct 25 17:21:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 17:21:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 53 previous similar messages Oct 25 17:23:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 17:23:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 45 previous similar messages Oct 25 17:23:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572049087, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2d00/0x51ab3c4ee6933d5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc72b0f716 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 17:23:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 17:28:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 17:28:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 17:28:12 sh-103-53.int kernel: LustreError: 154424:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 17:28:12 sh-103-53.int kernel: LustreError: 154424:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 17:28:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 17:28:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 17:29:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 17:29:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 20 previous similar messages Oct 25 17:31:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 17:31:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 54 previous similar messages Oct 25 17:33:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 25 17:33:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 38 previous similar messages Oct 25 17:33:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572049700, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2880/0x51ab3c4ee6934c3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc81628f4b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 17:33:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 17:38:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 17:38:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 17:38:26 sh-103-53.int kernel: LustreError: 154974:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a091878c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 17:38:27 sh-103-53.int kernel: LustreError: 154974:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 17:38:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 17:38:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 17:39:19 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 17:39:19 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 17 previous similar messages Oct 25 17:41:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 17:41:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 54 previous similar messages Oct 25 17:43:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 17:43:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 53 previous similar messages Oct 25 17:43:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572050316, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2d00/0x51ab3c4ee693667 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc8d46f193 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 17:43:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 17:48:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 17:48:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 17:48:41 sh-103-53.int kernel: LustreError: 155550:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975bbc21f200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 17:48:41 sh-103-53.int kernel: LustreError: 155550:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 17:48:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 17:48:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 17:49:32 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 17:49:32 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 25 17:51:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 17:51:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 25 17:53:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 17:53:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 25 17:53:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572050927, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2f40/0x51ab3c4ee69369f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abc9b792809 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 17:53:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 17:58:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 17:58:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 17:58:54 sh-103-53.int kernel: LustreError: 156121:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975bbc21e240) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 17:58:54 sh-103-53.int kernel: LustreError: 156121:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 17:58:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 17:58:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 18:00:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 18:00:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 25 18:01:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 18:01:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 57 previous similar messages Oct 25 18:03:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 18:03:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 25 18:04:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572051542, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef6e40/0x51ab3c4ee69377f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abca89b606e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 18:04:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 18:09:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 18:09:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 18:09:12 sh-103-53.int kernel: LustreError: 156715:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296b40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 18:09:12 sh-103-53.int kernel: LustreError: 156715:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 18:09:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 18:09:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 18:10:33 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 18:10:33 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 25 18:11:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 18:11:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 18:13:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 25 18:13:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 25 18:14:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572052160, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69f980/0x51ab3c4ee6937b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abcb7dfec0f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 18:14:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 18:19:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 18:19:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 18:19:28 sh-103-53.int kernel: LustreError: 157292:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769cdf74540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 18:19:28 sh-103-53.int kernel: LustreError: 157292:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 18:19:28 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 18:19:28 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 18:20:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 18:20:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 25 18:22:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 18:22:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 57 previous similar messages Oct 25 18:23:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 18:23:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 25 18:24:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572052774, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e68d80/0x51ab3c4ee6937ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abcc72d9f52 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 18:24:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 18:29:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 18:29:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 18:29:42 sh-103-53.int kernel: LustreError: 157868:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774412b7e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 18:29:42 sh-103-53.int kernel: LustreError: 157868:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 18:29:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 18:29:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 18:31:05 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 18:31:05 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 25 18:32:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 18:32:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 55 previous similar messages Oct 25 18:33:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 18:33:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 25 18:34:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572053387, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0336c00/0x51ab3c4ee693827 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abcd694ec39 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 18:34:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 18:39:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 18:39:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 18:39:58 sh-103-53.int kernel: LustreError: 158444:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774412b66c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 18:39:58 sh-103-53.int kernel: LustreError: 158444:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 18:39:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 18:39:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 18:41:20 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 18:41:20 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 14 previous similar messages Oct 25 18:42:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 18:42:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 57 previous similar messages Oct 25 18:43:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 25 18:43:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 25 18:45:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572054005, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0330d80/0x51ab3c4ee69385f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abce61028e4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 18:45:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 18:50:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 18:50:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 18:50:13 sh-103-53.int kernel: LustreError: 159019:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774412b7d40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 18:50:13 sh-103-53.int kernel: LustreError: 159019:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 18:50:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 18:50:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 18:51:33 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 18:51:33 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 25 18:52:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 18:52:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 55 previous similar messages Oct 25 18:54:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 18:54:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 25 18:55:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572054618, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0333cc0/0x51ab3c4ee693897 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abcf5edf118 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 18:55:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 19:00:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 19:00:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 19:00:28 sh-103-53.int kernel: LustreError: 159594:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978205734540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 19:00:28 sh-103-53.int kernel: LustreError: 159594:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 19:00:28 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 19:00:28 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 19:02:40 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 19:02:40 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 14 previous similar messages Oct 25 19:02:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 19:02:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 53 previous similar messages Oct 25 19:04:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 19:04:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 25 19:05:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572055233, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57980/0x51ab3c4ee6938cf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd03ee0eae expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 19:05:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 19:10:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 19:10:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 19:10:43 sh-103-53.int kernel: LustreError: 160201:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978205735740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 19:10:43 sh-103-53.int kernel: LustreError: 160201:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 19:10:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 19:10:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 19:12:51 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 19:12:51 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 25 19:13:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 19:13:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 52 previous similar messages Oct 25 19:14:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 25 19:14:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 75 previous similar messages Oct 25 19:15:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572055850, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55580/0x51ab3c4ee693907 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd14c6bfd7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 19:15:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 19:20:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 19:20:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 19:20:56 sh-103-53.int kernel: LustreError: 160773:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978205734d80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 19:20:56 sh-103-53.int kernel: LustreError: 160773:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 19:20:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 19:20:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 19:22:51 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 19:22:51 sh-103-53.int kernel: LNetError: 122478:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 25 19:23:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 19:23:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 19:24:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 19:24:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 25 19:26:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572056464, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf51d40/0x51ab3c4ee69393f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd25e3d3eb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 19:26:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 19:31:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 19:31:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 19:31:11 sh-103-53.int kernel: LustreError: 161347:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978205735740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 19:31:11 sh-103-53.int kernel: LustreError: 161347:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 19:31:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 19:31:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 19:32:08 sh-103-53.int kernel: Lustre: 91125:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572057121/real 1572057121] req@ffff976a0c040480 x1648382045931008/t0(0) o400->MGC10.0.10.51@o2ib7@10.0.10.51@o2ib7:26/25 lens 224/224 e 0 to 1 dl 1572057128 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 19:33:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 19:33:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 14 previous similar messages Oct 25 19:33:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 19:33:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 54 previous similar messages Oct 25 19:34:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 19:34:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 25 19:37:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572057134, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf560c0/0x51ab3c4ee693993 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd38ef4ca5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 19:37:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 19:42:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 19:42:20 sh-103-53.int kernel: LustreError: Skipped 2 previous similar messages Oct 25 19:42:20 sh-103-53.int kernel: LustreError: 161981:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978205734180) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 19:42:20 sh-103-53.int kernel: LustreError: 161981:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 2 previous similar messages Oct 25 19:42:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 19:42:20 sh-103-53.int kernel: Lustre: Skipped 2 previous similar messages Oct 25 19:43:21 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 19:43:21 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 25 19:43:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 19:43:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 56 previous similar messages Oct 25 19:44:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 19:44:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 25 19:47:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572057746, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf53f00/0x51ab3c4ee6939cb lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd42e4a3f9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 19:47:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 19:52:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 19:52:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 19:52:31 sh-103-53.int kernel: LustreError: 162551:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c018c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 19:52:31 sh-103-53.int kernel: LustreError: 162551:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 19:52:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 19:52:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 19:53:29 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 19:53:29 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 15 previous similar messages Oct 25 19:53:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 19:53:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 49 previous similar messages Oct 25 19:54:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 19:54:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 25 19:57:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572058357, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea3180/0x51ab3c4ee693a03 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd4d274766 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 19:57:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 20:02:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 20:02:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 20:02:45 sh-103-53.int kernel: LustreError: 163141:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187d40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 20:02:45 sh-103-53.int kernel: LustreError: 163141:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 20:02:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 20:02:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 20:03:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 20:03:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 25 20:03:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 20:03:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 49 previous similar messages Oct 25 20:05:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 20:05:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 25 20:07:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572058971, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2640/0x51ab3c4ee693a3b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd5a5b1bee expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 20:07:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 20:13:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 20:13:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 20:13:00 sh-103-53.int kernel: LustreError: 163718:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 20:13:00 sh-103-53.int kernel: LustreError: 163718:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 20:13:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 20:13:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 20:13:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 20:13:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 25 20:14:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 25 20:14:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 49 previous similar messages Oct 25 20:15:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 20:15:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 25 20:18:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572059586, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef4a40/0x51ab3c4ee693a73 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd639ed20e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 20:18:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 20:23:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 20:23:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 20:23:15 sh-103-53.int kernel: LustreError: 164297:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 20:23:15 sh-103-53.int kernel: LustreError: 164297:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 20:23:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 20:23:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 20:24:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 20:24:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 49 previous similar messages Oct 25 20:24:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 20:24:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 25 20:25:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 20:25:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 55 previous similar messages Oct 25 20:28:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572060202, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd3a80/0x51ab3c4ee693aab lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd6e6bac52 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 20:28:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 20:33:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 20:33:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 20:33:29 sh-103-53.int kernel: LustreError: 164872:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 20:33:29 sh-103-53.int kernel: LustreError: 164872:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 20:33:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 20:33:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 20:34:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 20:34:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 49 previous similar messages Oct 25 20:34:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 20:34:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 14 previous similar messages Oct 25 20:35:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 20:35:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 25 20:38:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572060818, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd2f40/0x51ab3c4ee693ae3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd7c3c54e3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 20:38:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 20:43:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 20:43:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 20:43:47 sh-103-53.int kernel: LustreError: 165450:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e652966c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 20:43:47 sh-103-53.int kernel: LustreError: 165450:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 20:43:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 20:43:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 20:44:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 20:44:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 52 previous similar messages Oct 25 20:45:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 20:45:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 25 20:45:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 20:45:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 25 20:48:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572061435, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea77bc0/0x51ab3c4ee693b1b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd88486597 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 20:48:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 20:54:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using 
this service will fail Oct 25 20:54:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 20:54:06 sh-103-53.int kernel: LustreError: 166032:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 20:54:06 sh-103-53.int kernel: LustreError: 166032:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 20:54:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 20:54:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 20:54:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 20:54:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 53 previous similar messages Oct 25 20:55:26 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 20:55:26 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 16 previous similar messages Oct 25 20:55:28 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572062121/real 1572062121] req@ffff977e6cef9200 x1648382046823296/t0(0) o400->fir-MDT0003-mdc-ffff9781f2230800@10.0.10.54@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572062128 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 20:55:28 sh-103-53.int kernel: Lustre: fir-MDT0003-mdc-ffff9781f2230800: Connection to fir-MDT0003 (at 10.0.10.54@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 20:55:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 20:55:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 25 20:56:22 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds Oct 25 20:56:22 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.21@o2ib4 (19): c: 0, oc: 0, rc: 8 Oct 25 20:56:50 sh-103-53.int kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572062153/real 1572062153] req@ffff9754f76d4c80 x1648382046827840/t0(0) o400->fir-MDT0002-mdc-ffff9781f2230800@10.0.10.53@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572062210 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 20:56:50 sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection to fir-MDT0002 (at 10.0.10.53@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 20:57:15 sh-103-53.int kernel: Lustre: 91128:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow 
reply: [sent 1572062178/real 1572062178] req@ffff976a17eef500 x1648382046832400/t0(0) o400->fir-MDT0002-mdc-ffff9781f2230800@10.0.10.53@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572062235 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 20:57:40 sh-103-53.int kernel: Lustre: 91123:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572062203/real 1572062203] req@ffff97820f4e3a80 x1648382046836992/t0(0) o400->fir-MDT0002-mdc-ffff9781f2230800@10.0.10.54@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572062260 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 20:59:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572062053, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4ec0/0x51ab3c4ee693b53 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abd9a285311 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 20:59:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 21:01:50 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Oct 25 21:01:50 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Skipped 1 previous similar message Oct 25 21:01:50 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.22@o2ib4 (14): c: 0, oc: 0, rc: 8 Oct 25 21:01:50 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Skipped 1 previous similar message Oct 25 21:01:50 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.51@o2ib7 from Oct 25 21:01:50 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 
100 previous similar messages Oct 25 21:01:50 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.10.51@o2ib7: -113 Oct 25 21:01:50 sh-103-53.int kernel: Lustre: 91130:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1572062504/real 1572062510] req@ffff9769f2fdf080 x1648382046887712/t0(0) o400->fir-OST0016-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572062511 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 Oct 25 21:01:50 sh-103-53.int kernel: Lustre: fir-OST0003-osc-ffff9781f2230800: Connection to fir-OST0003 (at 10.0.10.102@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 21:01:50 sh-103-53.int kernel: Lustre: 91130:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 10 previous similar messages Oct 25 21:01:52 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.10.109@o2ib7: -113 Oct 25 21:01:52 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 15 previous similar messages Oct 25 21:01:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.10.108@o2ib7: -113 Oct 25 21:01:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 15 previous similar messages Oct 25 21:01:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.114@o2ib7 from Oct 25 21:01:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 141 previous similar messages Oct 25 21:02:21 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.52@o2ib7 from Oct 25 21:02:21 sh-103-53.int kernel: LNetError: 
91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 22 previous similar messages Oct 25 21:02:46 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.51@o2ib7 from Oct 25 21:02:46 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 100 previous similar messages Oct 25 21:03:36 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.10.51@o2ib7 from Oct 25 21:03:36 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 201 previous similar messages Oct 25 21:04:26 sh-103-53.int kernel: LustreError: 166630:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 21:04:26 sh-103-53.int kernel: Lustre: fir-OST0000-osc-ffff9781f2230800: Connection restored to 10.0.10.101@o2ib7 (at 10.0.10.101@o2ib7) Oct 25 21:04:26 sh-103-53.int kernel: Lustre: Skipped 2 previous similar messages Oct 25 21:04:26 sh-103-53.int kernel: LustreError: 166630:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 21:04:49 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572062479/real 1572062479] req@ffff9769a6f42d00 x1648382046882800/t0(0) o400->fir-MDT0003-mdc-ffff9781f2230800@10.0.10.52@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572062689 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 21:04:49 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 89 previous similar messages Oct 25 21:04:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. 
Health = 900 Oct 25 21:04:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 25 21:04:58 sh-103-53.int kernel: Lustre: fir-OST0058-osc-ffff9781f2230800: Connection to fir-OST0058 (at 10.0.10.115@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 21:04:58 sh-103-53.int kernel: Lustre: Skipped 119 previous similar messages Oct 25 21:05:23 sh-103-53.int kernel: LustreError: 167-0: fir-MDT0000-mdc-ffff9781f2230800: This client was evicted by fir-MDT0000; in progress operations using this service will fail. Oct 25 21:05:23 sh-103-53.int kernel: LustreError: Skipped 92 previous similar messages Oct 25 21:05:35 sh-103-53.int kernel: Lustre: 91126:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572062691/real 1572062691] req@ffff9769cb21f980 x1648382046922352/t0(0) o400->fir-OST0053-osc-ffff9781f2230800@10.0.10.114@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572062735 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 21:05:35 sh-103-53.int kernel: Lustre: 91126:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 29 previous similar messages Oct 25 21:06:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 21:06:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 231 previous similar messages Oct 25 21:08:13 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572062886/real 1572062886] req@ffff976a15428480 x1648382046953664/t0(0) o400->fir-OST0012-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572062893 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 21:08:13 sh-103-53.int kernel: Lustre: fir-OST0017-osc-ffff9781f2230800: Connection to fir-OST0017 (at 10.0.10.104@o2ib7) was lost; in 
progress operations using this service will wait for recovery to complete Oct 25 21:08:13 sh-103-53.int kernel: Lustre: Skipped 8 previous similar messages Oct 25 21:08:13 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Oct 25 21:08:48 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 21:08:48 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 25 21:09:32 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 21:09:32 sh-103-53.int kernel: LustreError: Skipped 2 previous similar messages Oct 25 21:09:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572062672, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660d7c0/0x51ab3c4ee693b8b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abda1757172 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 21:09:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 21:10:36 sh-103-53.int kernel: Lustre: 91135:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572063029/real 1572063029] req@ffff976a20aaec00 x1648382046972416/t0(0) o400->fir-OST0015-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572063036 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 21:10:36 sh-103-53.int kernel: Lustre: fir-OST000d-osc-ffff9781f2230800: Connection to fir-OST000d (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will 
wait for recovery to complete Oct 25 21:10:36 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Oct 25 21:10:36 sh-103-53.int kernel: Lustre: 91135:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Oct 25 21:14:41 sh-103-53.int kernel: LustreError: 167209:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed0c00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 21:14:41 sh-103-53.int kernel: LustreError: 167209:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 21:14:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 21:14:41 sh-103-53.int kernel: Lustre: Skipped 137 previous similar messages Oct 25 21:15:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 25 21:15:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 25 21:16:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 21:16:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 25 21:18:56 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 21:18:56 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 25 21:19:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 21:19:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 21:19:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572063287, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c3600/0x51ab3c4ee693bc3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abda7346ad2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 21:19:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 21:22:25 sh-103-53.int kernel: Lustre: 91135:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572063738/real 1572063738] req@ffff976a197ef980 x1648382047095728/t0(0) o400->fir-OST0017-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572063745 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 21:22:25 sh-103-53.int kernel: Lustre: fir-OST0017-osc-ffff9781f2230800: Connection to fir-OST0017 (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 21:22:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 21:24:54 sh-103-53.int kernel: LustreError: 167781:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832f380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 21:24:54 sh-103-53.int kernel: LustreError: 167781:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 21:24:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 21:24:54 sh-103-53.int kernel: Lustre: Skipped 6 previous similar messages Oct 25 21:25:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 25 21:25:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 20 previous similar messages Oct 25 21:26:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 21:26:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 25 21:29:06 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 21:29:06 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 25 21:30:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 21:30:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 21:30:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572063901, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb09f80/0x51ab3c4ee693bfb lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abdaaee6587 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 21:30:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 21:35:07 sh-103-53.int kernel: LustreError: 168355:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a8a80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 21:35:07 sh-103-53.int kernel: LustreError: 168355:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 21:35:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 21:35:07 sh-103-53.int kernel: Lustre: Skipped 3 previous similar messages Oct 25 21:35:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 25 21:35:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 25 21:36:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 21:36:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 25 21:39:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 21:39:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 25 21:40:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 21:40:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 21:40:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572064516, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0cc80/0x51ab3c4ee693c33 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abdb52075ff expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 21:40:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 21:42:27 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 21:42:27 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 31 previous similar messages Oct 25 21:42:36 sh-103-53.int kernel: Lustre: 91139:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ 
Request sent has timed out for slow reply: [sent 1572064941/real 1572064941] req@ffff9782125f2880 x1648382047306432/t0(0) o400->fir-OST0011-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572064955 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 21:42:36 sh-103-53.int kernel: Lustre: 91139:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Oct 25 21:42:36 sh-103-53.int kernel: Lustre: fir-OST0011-osc-ffff9781f2230800: Connection to fir-OST0011 (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 21:42:36 sh-103-53.int kernel: Lustre: Skipped 4 previous similar messages Oct 25 21:45:25 sh-103-53.int kernel: LustreError: 168933:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a8e40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 21:45:25 sh-103-53.int kernel: LustreError: 168933:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 21:45:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 21:45:25 sh-103-53.int kernel: Lustre: Skipped 3 previous similar messages Oct 25 21:45:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 25 21:45:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 25 21:46:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 21:46:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 25 21:50:06 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 21:50:06 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 25 21:50:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 21:50:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 21:50:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572065131, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c2a8000/0x51ab3c4ee693c6b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abdbfcd9f4a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 21:50:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 21:52:00 sh-103-53.int kernel: Lustre: 91139:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572065506/real 1572065506] req@ffff97781f71e780 x1648382047402608/t0(0) o400->fir-OST0011-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572065520 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 21:52:00 sh-103-53.int kernel: Lustre: 
91139:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Oct 25 21:52:00 sh-103-53.int kernel: Lustre: fir-OST0011-osc-ffff9781f2230800: Connection to fir-OST0011 (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 21:52:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 21:54:42 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 21:55:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 25 21:55:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 25 21:55:41 sh-103-53.int kernel: LustreError: 169508:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebd380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 21:55:41 sh-103-53.int kernel: LustreError: 169508:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 21:55:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 21:55:41 sh-103-53.int kernel: Lustre: Skipped 2 previous similar messages Oct 25 21:56:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 21:56:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 25 22:00:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 22:00:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 25 22:00:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 22:00:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 22:00:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572065748, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d60c0/0x51ab3c4ee693ca3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abdc9aeafdf expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 22:00:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 22:00:56 sh-103-53.int kernel: Lustre: 91135:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572066042/real 1572066042] req@ffff977e6f2a8480 x1648382047494096/t0(0) o400->fir-OST0013-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572066056 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 22:00:56 sh-103-53.int kernel: Lustre: fir-OST0013-osc-ffff9781f2230800: Connection to fir-OST0013 (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 22:02:47 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 22:05:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 25 22:05:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 25 22:05:53 sh-103-53.int kernel: LustreError: 170107:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebcf00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 22:05:53 sh-103-53.int kernel: LustreError: 170107:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 22:05:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 22:05:53 sh-103-53.int kernel: Lustre: Skipped 3 previous similar messages Oct 25 22:07:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 22:07:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 25 22:10:31 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 22:10:31 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 25 22:10:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 22:10:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 22:10:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572066358, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d7980/0x51ab3c4ee693cdb lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abdd6cd173f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 22:10:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 22:11:01 sh-103-53.int kernel: Lustre: 91134:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572066654/real 1572066654] req@ffff975399d2e300 x1648382047599264/t0(0) o400->fir-OST000d-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572066661 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 22:11:01 sh-103-53.int kernel: Lustre: fir-OST000e-osc-ffff9781f2230800: Connection to fir-OST000e (at 10.0.10.103@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 22:11:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 22:11:01 sh-103-53.int kernel: Lustre: 91134:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Oct 25 22:16:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 25 22:16:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 25 22:16:08 sh-103-53.int kernel: LustreError: 170683:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed0f00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 22:16:08 sh-103-53.int kernel: LustreError: 170683:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 22:16:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 22:16:08 sh-103-53.int kernel: Lustre: Skipped 4 previous similar messages Oct 25 22:17:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 22:17:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 25 22:20:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 22:20:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 25 22:21:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 22:21:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 22:21:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572066976, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0331440/0x51ab3c4ee693d13 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abde11af410 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 22:21:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 22:26:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 22:26:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages Oct 25 22:26:21 sh-103-53.int kernel: LustreError: 171256:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae32c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 22:26:21 sh-103-53.int kernel: LustreError: 171256:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 22:26:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 22:26:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 22:27:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 22:27:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 25 22:29:24 sh-103-53.int kernel: Lustre: 91126:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572067757/real 1572067757] req@ffff9769ab755a00 x1648382047796304/t0(0) o400->fir-OST000f-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572067764 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 22:29:24 sh-103-53.int kernel: Lustre: 91126:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Oct 25 22:29:24 sh-103-53.int kernel: Lustre: fir-OST000f-osc-ffff9781f2230800: Connection to fir-OST000f (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 22:29:24 sh-103-53.int kernel: Lustre: Skipped 2 previous similar messages Oct 25 22:31:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 22:31:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 22:31:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572067587, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a03372c0/0x51ab3c4ee693d4b lrc: 4/1,0 mode: --/CR res: 
[0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abdeed55f63 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 22:31:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 22:32:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 22:32:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 25 22:36:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 22:36:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 28 previous similar messages Oct 25 22:36:32 sh-103-53.int kernel: LustreError: 171828:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2780) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 22:36:32 sh-103-53.int kernel: LustreError: 171828:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 22:36:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 22:36:32 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Oct 25 22:37:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 22:37:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 25 22:39:57 sh-103-53.int kernel: Lustre: 91128:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572068390/real 1572068390] req@ffff9769a6eb4c80 x1648382047905984/t0(0) o400->fir-OST0011-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572068397 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 22:39:57 sh-103-53.int kernel: Lustre: 91128:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Oct 25 22:39:57 sh-103-53.int kernel: Lustre: fir-OST0011-osc-ffff9781f2230800: Connection to fir-OST0011 (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 22:39:57 sh-103-53.int kernel: Lustre: Skipped 4 previous similar messages Oct 25 22:41:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 22:41:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 22:41:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572068197, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0331200/0x51ab3c4ee693d83 lrc: 4/1,0 mode: --/CR res: 
[0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abdfe29dedc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 22:41:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 22:43:21 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 22:43:21 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 25 22:46:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 22:46:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 28 previous similar messages Oct 25 22:46:43 sh-103-53.int kernel: LustreError: 172403:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186c00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 22:46:43 sh-103-53.int kernel: LustreError: 172403:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 22:46:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 22:46:43 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Oct 25 22:47:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 25 22:47:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 25 22:51:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 22:51:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 22:51:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572068811, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee693dbb lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abe0b434a05 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 22:51:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 22:53:37 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 22:54:01 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 25 22:54:01 sh-103-53.int kernel: LNetError: 160882:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 25 22:56:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 22:56:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 25 22:57:00 sh-103-53.int kernel: LustreError: 172981:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc312840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 22:57:00 sh-103-53.int kernel: LustreError: 172981:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 22:57:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 22:57:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 22:57:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 22:57:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 25 23:02:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 23:02:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 23:02:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572069428, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd6c00/0x51ab3c4ee693df3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2abe19983581 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 23:02:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 23:04:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 25 23:04:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 25 23:06:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 25 23:06:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 25 23:07:14 sh-103-53.int kernel: LustreError: 173572:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296a80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 23:07:14 sh-103-53.int kernel: LustreError: 173572:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 23:07:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 23:07:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 23:07:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 23:07:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 25 23:08:36 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572070109/real 1572070109] req@ffff976a12c40900 x1648382048209248/t0(0) o400->fir-OST005e-osc-ffff9781f2230800@10.0.10.115@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572070116 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 23:08:36 sh-103-53.int kernel: Lustre: fir-OST0006-osc-ffff9781f2230800: Connection to fir-OST0006 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 23:08:36 sh-103-53.int kernel: Lustre: Skipped 2 previous similar messages Oct 25 23:08:36 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 38 previous similar messages Oct 25 23:10:25 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572070181/real 1572070181] req@ffff9769c6af0900 x1648382048218560/t0(0) o400->fir-OST0018-osc-ffff9781f2230800@10.0.10.105@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572070225 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 23:10:25 sh-103-53.int kernel: Lustre: fir-OST005f-osc-ffff9781f2230800: Connection to fir-OST005f (at 10.0.10.116@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 
23:10:25 sh-103-53.int kernel: Lustre: fir-OST0008-osc-ffff9781f2230800: Connection to fir-OST0008 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 25 23:10:25 sh-103-53.int kernel: Lustre: Skipped 82 previous similar messages Oct 25 23:10:25 sh-103-53.int kernel: Lustre: Skipped 82 previous similar messages Oct 25 23:10:25 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 51 previous similar messages Oct 25 23:11:12 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds Oct 25 23:11:12 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Skipped 1 previous similar message Oct 25 23:11:12 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.21@o2ib4 (16): c: 7, oc: 0, rc: 8 Oct 25 23:11:12 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Skipped 1 previous similar message Oct 25 23:11:35 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds Oct 25 23:11:35 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.23@o2ib4 (9): c: 4, oc: 0, rc: 8 Oct 25 23:14:35 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572070282/real 1572070282] req@ffff977e876a9f80 x1648382048232112/t0(0) o400->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572070475 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 25 23:14:35 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 47 previous similar messages Oct 25 23:16:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in 
progress operations using this service will fail Oct 25 23:16:38 sh-103-53.int kernel: LustreError: Skipped 2 previous similar messages Oct 25 23:16:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572070298, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d60c0/0x51ab3c4ee693e47 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abe27a11f1f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 23:16:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 23:17:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 25 23:17:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 22 previous similar messages Oct 25 23:17:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 23:17:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 25 23:21:43 sh-103-53.int kernel: LustreError: 174393:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 25 23:21:43 sh-103-53.int kernel: LustreError: 174393:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 2 previous similar messages Oct 25 23:21:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 23:21:43 sh-103-53.int kernel: Lustre: Skipped 137 previous similar messages Oct 25 23:26:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 23:26:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 23:26:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572070912, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7080/0x51ab3c4ee693e7f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abe359716dc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 23:26:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 23:27:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. 
Health = 900 Oct 25 23:27:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 25 23:27:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 23:27:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 25 23:31:57 sh-103-53.int kernel: LustreError: 174965:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 23:31:57 sh-103-53.int kernel: LustreError: 174965:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 23:31:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 23:31:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 23:37:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 23:37:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 23:37:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572071526, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea6780/0x51ab3c4ee693eb7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abe467b37ef expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 23:37:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 23:37:21 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 25 23:37:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 25 23:38:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 25 23:38:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 25 23:39:22 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 25 23:42:14 sh-103-53.int kernel: LustreError: 175544:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 23:42:14 sh-103-53.int kernel: LustreError: 175544:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 23:42:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 23:42:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 23:47:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 23:47:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 23:47:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572072140, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea5580/0x51ab3c4ee693eef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abe56f93ea0 expref: -99 pid: 91159 
timeout: 0 lvb_type: 0 Oct 25 23:47:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 23:47:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 25 23:47:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 25 23:48:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 25 23:48:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 25 23:52:27 sh-103-53.int kernel: LustreError: 176124:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 25 23:52:27 sh-103-53.int kernel: LustreError: 176124:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 25 23:52:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 25 23:52:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 25 23:57:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 25 23:57:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 25 23:57:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572072753, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea45c0/0x51ab3c4ee693f27 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: 
local remote: 0xc9be2abe66570e4f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 25 23:57:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 25 23:57:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 25 23:57:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages Oct 25 23:58:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 25 23:58:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 25 23:59:42 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 00:02:41 sh-103-53.int kernel: LustreError: 176724:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 00:02:41 sh-103-53.int kernel: LustreError: 176724:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 00:02:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 00:02:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 00:07:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 00:07:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 00:07:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572073366, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0d80/0x51ab3c4ee693f5f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abe76201f54 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 00:07:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 00:07:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 00:07:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 00:07:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 00:08:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 00:08:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 00:12:54 sh-103-53.int kernel: LustreError: 177291:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186fc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 00:12:54 sh-103-53.int kernel: LustreError: 177291:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 00:12:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 00:12:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 00:18:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 00:18:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 00:18:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572073980, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4800/0x51ab3c4ee693f97 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abe84aa8e8d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 00:18:00 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 00:18:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 00:18:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 00:18:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 00:18:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 00:23:09 sh-103-53.int kernel: LustreError: 177865:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc313680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 00:23:09 sh-103-53.int kernel: LustreError: 177865:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 00:23:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 00:23:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 00:28:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 00:28:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 00:28:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 00:28:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 00:28:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572074599, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef1f80/0x51ab3c4ee693fcf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abe95a9dc96 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 00:28:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 00:28:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 00:28:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 26 00:29:12 sh-103-53.int kernel: LNetError: 166589:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 26 00:29:12 sh-103-53.int kernel: LNetError: 166589:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 3 previous similar messages Oct 26 00:33:28 sh-103-53.int kernel: LustreError: 178454:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc313b00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 00:33:28 sh-103-53.int kernel: LustreError: 178454:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 00:33:28 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 00:33:28 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 00:34:17 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 00:38:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 00:38:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 00:38:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 00:38:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 00:38:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572075213, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef5100/0x51ab3c4ee694007 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abeaa54967c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 00:38:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 00:38:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 00:38:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 26 00:43:40 sh-103-53.int 
kernel: LustreError: 179028:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc313bc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 00:43:40 sh-103-53.int kernel: LustreError: 179028:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 00:43:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 00:43:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 00:48:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 00:48:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 00:48:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 00:48:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 00:48:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572075826, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef0900/0x51ab3c4ee69403f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abec3c3fc0f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 00:48:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 00:49:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 00:49:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 
previous similar messages Oct 26 00:50:33 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 00:53:54 sh-103-53.int kernel: LustreError: 179602:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc312900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 00:53:54 sh-103-53.int kernel: LustreError: 179602:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 00:53:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 00:53:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 00:58:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 00:58:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 00:58:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 00:58:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 00:58:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572076439, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef06c0/0x51ab3c4ee694077 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abedc0fb512 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 00:58:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 00:59:17 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 00:59:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 26 01:04:05 sh-103-53.int kernel: LustreError: 180193:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc312000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 01:04:05 sh-103-53.int kernel: LustreError: 180193:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 01:04:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 01:04:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 01:08:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 01:08:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 01:09:14 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 01:09:14 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 01:09:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572077054, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef3180/0x51ab3c4ee6940af lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abefaa5e11e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 01:09:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 01:09:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 01:09:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 01:14:21 sh-103-53.int kernel: LustreError: 180768:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc313380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 01:14:21 sh-103-53.int kernel: LustreError: 180768:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 01:14:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 01:14:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 01:19:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 01:19:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 01:19:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 01:19:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 01:19:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572077669, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef5580/0x51ab3c4ee6940e7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf12457a55 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 01:19:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 01:19:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 01:19:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 01:24:38 sh-103-53.int kernel: LustreError: 181348:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc3129c0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 01:24:38 sh-103-53.int kernel: LustreError: 181348:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 01:24:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 01:24:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 01:29:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.24@o2ib4 added to recovery queue. Health = 900 Oct 26 01:29:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 01:29:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 01:29:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 26 01:29:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 01:29:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 01:29:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572078286, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef7980/0x51ab3c4ee69411f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf332e88c4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 01:29:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 01:34:52 sh-103-53.int kernel: LustreError: 181923:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 01:34:52 sh-103-53.int kernel: LustreError: 181923:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 01:34:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 01:34:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 01:39:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 01:39:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 01:39:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 26 01:39:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 01:40:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 01:40:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 01:40:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572078901, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc6986c0/0x51ab3c4ee694157 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf51883bc4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 01:40:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 01:45:07 sh-103-53.int kernel: LustreError: 
182497:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 01:45:07 sh-103-53.int kernel: LustreError: 182497:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 01:45:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 01:45:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 01:49:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 01:49:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 01:49:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 26 01:49:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 26 01:49:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 26 01:50:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 01:50:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 01:50:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572079512, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69ec00/0x51ab3c4ee69418f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf71fb9a1e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 01:50:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 01:55:19 sh-103-53.int kernel: LustreError: 183071:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccb40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 01:55:19 sh-103-53.int kernel: LustreError: 183071:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 01:55:19 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 01:55:19 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 01:59:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 01:59:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 01:59:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 26 01:59:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 26 02:00:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 02:00:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 02:00:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572080127, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69c5c0/0x51ab3c4ee6941c7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf88b67759 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 02:00:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 02:05:37 sh-103-53.int kernel: LustreError: 183667:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 02:05:37 sh-103-53.int kernel: LustreError: 183667:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 02:05:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 02:05:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 02:09:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 26 02:09:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 26 02:09:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 02:09:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 02:10:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 02:10:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 02:10:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572080742, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69a400/0x51ab3c4ee6941ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf8da0a531 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 02:10:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 02:15:52 sh-103-53.int kernel: LustreError: 184244:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdb00) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 02:15:52 sh-103-53.int kernel: LustreError: 184244:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 02:15:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 02:15:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 02:19:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 26 02:19:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 02:20:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 02:20:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 02:20:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 02:20:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 02:20:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572081359, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699200/0x51ab3c4ee694237 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf930220b8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 02:20:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 02:26:08 sh-103-53.int kernel: LustreError: 184819:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccd80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 02:26:08 sh-103-53.int kernel: LustreError: 184819:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 02:26:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 02:26:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 02:30:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 02:30:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 02:30:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 02:30:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 02:31:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 02:31:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 02:31:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572081978, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69a1c0/0x51ab3c4ee69426f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf975bd4a4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 02:31:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 02:36:28 sh-103-53.int kernel: LustreError: 
185396:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdd40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 02:36:28 sh-103-53.int kernel: LustreError: 185396:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 02:36:28 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 02:36:28 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 02:40:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 02:40:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 02:40:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 02:40:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 02:41:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 02:41:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 02:41:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572082596, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69c380/0x51ab3c4ee6942a7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf9b6f2bb4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 02:41:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 26 02:42:22 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 02:46:46 sh-103-53.int kernel: LustreError: 185974:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 02:46:46 sh-103-53.int kernel: LustreError: 185974:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 02:46:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 02:46:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 02:50:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 02:50:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 26 02:50:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 02:50:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 02:51:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 02:51:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 02:51:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572083215, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69e9c0/0x51ab3c4ee6942df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abf9f36de86 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 02:51:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 02:57:05 sh-103-53.int kernel: LustreError: 186555:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccf00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 02:57:05 sh-103-53.int kernel: LustreError: 186555:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 02:57:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 02:57:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 03:00:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 03:00:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 26 03:00:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 03:00:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 03:02:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 03:02:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 03:02:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572083835, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc698480/0x51ab3c4ee694317 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfa06d8299 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 03:02:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 03:07:23 sh-103-53.int kernel: LustreError: 187162:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccc00) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 03:07:23 sh-103-53.int kernel: LustreError: 187162:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 03:07:23 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 03:07:23 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 03:10:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 03:10:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 26 03:10:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 03:10:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 03:12:32 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 03:12:32 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 03:12:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572084452, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69e300/0x51ab3c4ee69434f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfa4638b0e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 03:12:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 03:13:57 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 
12345-10.9.0.31@o2ib4: -125 Oct 26 03:17:37 sh-103-53.int kernel: LustreError: 187737:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 03:17:37 sh-103-53.int kernel: LustreError: 187737:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 03:17:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 03:17:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 03:20:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 03:20:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 26 03:21:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 03:21:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 03:22:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 03:22:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 03:22:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572085066, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69d340/0x51ab3c4ee694387 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfa8794619 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 03:22:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 03:27:52 sh-103-53.int kernel: LustreError: 188333:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcce40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 03:27:52 sh-103-53.int kernel: LustreError: 188333:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 03:27:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 03:27:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 03:30:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 03:30:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 26 03:31:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 03:31:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 03:32:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 03:32:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 03:32:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572085677, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69f2c0/0x51ab3c4ee6943bf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfacaf3a9d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 03:32:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 03:38:03 sh-103-53.int kernel: LustreError: 188902:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdbc0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 03:38:03 sh-103-53.int kernel: LustreError: 188902:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 03:38:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 03:38:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 03:40:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 26 03:40:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 03:41:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.24@o2ib4 added to recovery queue. Health = 900 Oct 26 03:41:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 03:43:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 03:43:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 03:43:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572086289, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69e300/0x51ab3c4ee6943f7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfb0fe3818 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 03:43:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 03:43:22 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 
12345-10.9.0.31@o2ib4: -125 Oct 26 03:48:16 sh-103-53.int kernel: LustreError: 189477:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccd80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 03:48:16 sh-103-53.int kernel: LustreError: 189477:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 03:48:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 03:48:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 03:50:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 03:50:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 26 03:51:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.24@o2ib4 added to recovery queue. 
Health = 900 Oct 26 03:51:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 26 03:53:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 03:53:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 03:53:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572086903, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a03333c0/0x51ab3c4ee69442f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfb563c025 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 03:53:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 03:58:32 sh-103-53.int kernel: LustreError: 190059:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2480) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 03:58:32 sh-103-53.int kernel: LustreError: 190059:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 03:58:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 03:58:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 04:01:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 04:01:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 04:01:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 04:01:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 04:03:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 04:03:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 04:03:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572087518, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0334140/0x51ab3c4ee694467 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfb9a1cb13 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 04:03:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 04:08:47 sh-103-53.int kernel: LustreError: 190653:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2f00) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 04:08:47 sh-103-53.int kernel: LustreError: 190653:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 04:08:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 04:08:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 04:11:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 04:11:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 26 04:11:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 04:11:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 04:13:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 04:13:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 04:13:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572088135, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0334140/0x51ab3c4ee69449f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfbd63eaa2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 04:13:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 04:19:02 sh-103-53.int kernel: LustreError: 191227:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae32c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 04:19:02 sh-103-53.int kernel: LustreError: 191227:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 04:19:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 04:19:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 04:21:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 04:21:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 26 04:22:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 04:22:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 04:24:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 04:24:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 04:24:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572088748, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a03321c0/0x51ab3c4ee6944d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfc129f3cb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 04:24:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 04:29:15 sh-103-53.int kernel: LustreError: 
191799:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 04:29:15 sh-103-53.int kernel: LustreError: 191799:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 04:29:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 04:29:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 04:31:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Oct 26 04:31:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 26 04:32:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 04:32:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 04:34:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 04:34:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 04:34:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572089362, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52880/0x51ab3c4ee69450f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfc4bc6965 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 04:34:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 26 04:39:28 sh-103-53.int kernel: LustreError: 192375:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5500) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 04:39:28 sh-103-53.int kernel: LustreError: 192375:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 04:39:28 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 04:39:28 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 04:41:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 26 04:41:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 04:42:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 04:42:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 04:44:22 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 04:44:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 04:44:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 04:44:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572089975, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52d00/0x51ab3c4ee694547 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfc86355bb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 04:44:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 04:49:42 sh-103-53.int kernel: LustreError: 192951:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5e00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 04:49:42 sh-103-53.int kernel: LustreError: 192951:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 04:49:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 04:49:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 04:51:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 04:51:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 26 04:52:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 04:52:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 04:54:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 04:54:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 04:54:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572090591, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf53f00/0x51ab3c4ee69457f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfcc3b97ca expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 04:54:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 04:59:59 sh-103-53.int kernel: LustreError: 193526:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4300) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 04:59:59 sh-103-53.int kernel: LustreError: 193526:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 04:59:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 04:59:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 05:01:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 05:01:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 26 05:02:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 05:02:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 05:05:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 05:05:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 05:05:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572091207, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf56540/0x51ab3c4ee6945b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfcfde17d9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 05:05:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 05:09:48 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 
12345-10.9.0.31@o2ib4: -125 Oct 26 05:10:14 sh-103-53.int kernel: LustreError: 194123:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 05:10:14 sh-103-53.int kernel: LustreError: 194123:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 05:10:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 05:10:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 05:11:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 05:11:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 26 05:12:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 05:12:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 05:15:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 05:15:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 05:15:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572091823, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf51200/0x51ab3c4ee6945ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfd1c38711 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 05:15:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 05:20:30 sh-103-53.int kernel: LustreError: 194700:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4480) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 05:20:30 sh-103-53.int kernel: LustreError: 194700:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 05:20:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 05:20:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 05:21:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 05:21:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 26 05:23:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 05:23:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 05:25:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 05:25:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 05:25:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572092439, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54140/0x51ab3c4ee694627 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfd2ad5b4b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 05:25:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 05:30:48 sh-103-53.int kernel: LustreError: 195278:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b46c0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 05:30:48 sh-103-53.int kernel: LustreError: 195278:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 05:30:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 05:30:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 05:31:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 05:31:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 26 05:33:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 05:33:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 05:35:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 05:35:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 05:35:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572093056, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf56c00/0x51ab3c4ee69465f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfd5fc751b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 05:35:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 05:41:05 sh-103-53.int kernel: LustreError: 195852:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 05:41:05 sh-103-53.int kernel: LustreError: 195852:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 05:41:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 05:41:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 05:42:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 05:42:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 05:43:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 05:43:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 05:46:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 05:46:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 05:46:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572093671, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57080/0x51ab3c4ee694697 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfd9e256f1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 05:46:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 05:51:17 sh-103-53.int kernel: LustreError: 
196425:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 05:51:17 sh-103-53.int kernel: LustreError: 196425:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 05:51:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 05:51:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 05:52:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 05:52:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 26 05:53:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 05:53:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 05:56:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 05:56:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 05:56:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572094283, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf569c0/0x51ab3c4ee6946cf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfdd12f684 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 05:56:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 26 06:01:30 sh-103-53.int kernel: LustreError: 197015:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b40c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 06:01:30 sh-103-53.int kernel: LustreError: 197015:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 06:01:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 06:01:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 06:02:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 26 06:02:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 26 06:03:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 06:03:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 06:06:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 06:06:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 06:06:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572094896, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54140/0x51ab3c4ee694707 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfdfa3f3ed expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 06:06:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 06:11:43 sh-103-53.int kernel: LustreError: 197590:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4840) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 06:11:43 sh-103-53.int kernel: LustreError: 197590:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 06:11:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 06:11:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 06:12:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 06:12:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 26 06:13:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 06:13:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 26 06:16:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 06:16:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 06:16:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572095509, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea76c00/0x51ab3c4ee69473f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfdfc16b9a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 06:16:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 06:21:57 sh-103-53.int kernel: LustreError: 198164:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00b40) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 06:21:57 sh-103-53.int kernel: LustreError: 198164:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 06:21:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 06:21:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 06:22:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 06:22:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 26 06:24:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 06:24:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 06:27:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 06:27:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 06:27:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572096123, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea74a40/0x51ab3c4ee694777 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfe1ba28ee expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 06:27:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 06:32:09 sh-103-53.int kernel: LustreError: 198740:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 06:32:09 sh-103-53.int kernel: LustreError: 198740:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 06:32:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 06:32:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 06:32:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 06:32:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 26 06:34:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 06:34:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 06:37:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 06:37:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 06:37:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572096735, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea73840/0x51ab3c4ee6947af lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfe3e883b3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 06:37:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 06:42:25 sh-103-53.int kernel: LustreError: 
199321:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00300) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 06:42:25 sh-103-53.int kernel: LustreError: 199321:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 06:42:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 06:42:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 06:42:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 26 06:42:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 26 06:44:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 06:44:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 26 06:47:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 06:47:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 06:47:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572097351, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea745c0/0x51ab3c4ee6947e7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfe6023c03 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 06:47:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 26 06:52:41 sh-103-53.int kernel: LustreError: 199896:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 06:52:41 sh-103-53.int kernel: LustreError: 199896:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 06:52:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 06:52:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 06:53:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 06:53:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 26 06:54:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 06:54:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 06:56:33 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 06:57:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 06:57:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 06:57:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572097969, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea70000/0x51ab3c4ee69481f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfe7ceb6e9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 06:57:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 07:02:55 sh-103-53.int kernel: LustreError: 200490:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c01d40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 07:02:55 sh-103-53.int kernel: LustreError: 200490:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 07:02:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 07:02:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 07:03:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 07:03:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 26 07:04:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 07:04:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 07:08:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 07:08:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 07:08:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572098585, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea77980/0x51ab3c4ee694857 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfe95decff expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 07:08:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 07:13:13 sh-103-53.int kernel: LustreError: 201065:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c01bc0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 07:13:13 sh-103-53.int kernel: LustreError: 201065:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 07:13:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 07:13:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 07:13:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 26 07:13:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 56 previous similar messages Oct 26 07:14:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 07:14:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 07:18:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 07:18:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 07:18:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572099201, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea73600/0x51ab3c4ee69488f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfeaf6c2cf expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 07:18:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 07:23:30 sh-103-53.int kernel: LustreError: 201653:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c009c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 07:23:30 sh-103-53.int kernel: LustreError: 201653:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 07:23:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 07:23:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 07:23:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 07:23:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 26 07:25:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 07:25:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 07:28:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 07:28:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 07:28:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572099820, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea75c40/0x51ab3c4ee6948c7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfec3c9842 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 07:28:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 07:32:07 sh-103-53.int kernel: LNetError: 
91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 07:33:46 sh-103-53.int kernel: LustreError: 202229:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c01b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 07:33:46 sh-103-53.int kernel: LustreError: 202229:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 07:33:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 07:33:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 07:34:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 07:34:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 52 previous similar messages Oct 26 07:35:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 07:35:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 07:38:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 07:38:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 07:38:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572100433, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea76780/0x51ab3c4ee6948ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfed674113 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 07:38:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 07:44:00 sh-103-53.int kernel: LustreError: 202801:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1d800) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 07:44:00 sh-103-53.int kernel: LustreError: 202801:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 07:44:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 07:44:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 07:44:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 26 07:44:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 26 07:45:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 07:45:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 07:49:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 07:49:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 07:49:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572101045, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93adc40/0x51ab3c4ee694937 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfee78ddd6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 07:49:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 07:54:15 sh-103-53.int kernel: LustreError: 203378:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1dc80) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 07:54:15 sh-103-53.int kernel: LustreError: 203378:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 07:54:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 07:54:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 07:54:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 07:54:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 26 07:55:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 07:55:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 07:59:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 07:59:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 07:59:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572101665, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93a8900/0x51ab3c4ee69496f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfefbd22c8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 07:59:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 08:02:38 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 
12345-10.9.0.31@o2ib4: -125 Oct 26 08:04:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 08:04:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 26 08:04:34 sh-103-53.int kernel: LustreError: 203974:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 08:04:34 sh-103-53.int kernel: LustreError: 203974:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 08:04:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 08:04:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 08:05:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 08:05:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 08:09:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 08:09:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 08:09:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572102282, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2400/0x51ab3c4ee6949a7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff0675d4f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 08:09:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 08:14:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 26 08:14:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 56 previous similar messages Oct 26 08:14:52 sh-103-53.int kernel: LustreError: 204552:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 08:14:52 sh-103-53.int kernel: LustreError: 204552:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 08:14:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 08:14:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 08:15:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 08:15:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 08:20:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 08:20:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 08:20:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572102900, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea1f80/0x51ab3c4ee6949df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff15ef03d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 08:20:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 08:24:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 26 08:24:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 46 previous similar messages Oct 26 08:25:07 sh-103-53.int kernel: LustreError: 205125:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a091863c0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 08:25:07 sh-103-53.int kernel: LustreError: 205125:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 08:25:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 08:25:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 08:26:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 08:26:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 08:30:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 08:30:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 08:30:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572103514, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2ac0/0x51ab3c4ee694a17 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff227597b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 08:30:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 08:34:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 08:34:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 08:35:22 sh-103-53.int kernel: LustreError: 205701:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186fc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 08:35:22 sh-103-53.int kernel: LustreError: 205701:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 08:35:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 08:35:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 08:36:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 08:36:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 08:40:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 08:40:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 08:40:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572104130, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0000/0x51ab3c4ee694a4f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff3847385 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 08:40:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 08:44:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 08:44:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 08:45:36 sh-103-53.int kernel: LustreError: 
206276:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc313740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 08:45:36 sh-103-53.int kernel: LustreError: 206276:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 08:45:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 08:45:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 08:46:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 08:46:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 08:50:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 08:50:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 08:50:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572104744, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2f40/0x51ab3c4ee694a87 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff454b2d6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 08:50:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 08:55:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 08:55:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 49 previous similar 
messages Oct 26 08:55:51 sh-103-53.int kernel: LustreError: 206851:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e652966c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 08:55:51 sh-103-53.int kernel: LustreError: 206851:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 08:55:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 08:55:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 08:56:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 08:56:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 09:01:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 09:01:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 09:01:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572105360, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd69c0/0x51ab3c4ee694abf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff53e9c02 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 09:01:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 09:05:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 09:05:13 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 26 09:06:10 sh-103-53.int kernel: LustreError: 207447:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 09:06:10 sh-103-53.int kernel: LustreError: 207447:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 09:06:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 09:06:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 09:06:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 09:06:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 09:11:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 09:11:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 09:11:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572105975, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e69680/0x51ab3c4ee694af7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff6140e9c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 09:11:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 09:15:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 0 seconds Oct 26 09:15:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 09:16:23 sh-103-53.int kernel: LustreError: 208019:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcde00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 09:16:23 sh-103-53.int kernel: LustreError: 208019:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 09:16:23 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 09:16:23 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 09:16:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 09:16:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 09:21:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 09:21:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 09:21:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572106590, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69ec00/0x51ab3c4ee694b2f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff7673150 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 09:21:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 09:25:47 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 09:25:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 26 09:26:38 sh-103-53.int kernel: LustreError: 208598:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc3c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 09:26:38 sh-103-53.int kernel: LustreError: 208598:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 09:26:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 09:26:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 09:27:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 09:27:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 09:31:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 09:31:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 09:31:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572107205, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69cc80/0x51ab3c4ee694b67 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff8aa0476 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 09:31:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous 
similar message Oct 26 09:35:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 09:35:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 09:36:52 sh-103-53.int kernel: LustreError: 209171:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 09:36:52 sh-103-53.int kernel: LustreError: 209171:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 09:36:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 09:36:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 09:37:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 09:37:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 09:41:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 09:41:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 09:41:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572107818, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69af40/0x51ab3c4ee694b9f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abff9afc31d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 09:41:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 09:46:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 20 seconds Oct 26 09:46:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 26 09:47:03 sh-103-53.int kernel: LustreError: 209743:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 09:47:03 sh-103-53.int kernel: LustreError: 209743:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 09:47:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 09:47:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 09:47:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 09:47:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 09:52:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 09:52:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 09:52:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572108432, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc698000/0x51ab3c4ee694bd7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abffab0622b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 09:52:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 09:53:20 sh-103-53.int kernel: LNetError: 209155:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 26 09:54:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 09:56:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 09:56:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 44 previous similar messages Oct 26 09:57:18 sh-103-53.int kernel: LustreError: 210317:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 09:57:18 sh-103-53.int kernel: LustreError: 210317:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 09:57:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 09:57:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 09:57:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 09:57:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 10:02:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 10:02:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 10:02:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572109047, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc698000/0x51ab3c4ee694c0f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abffbee3a2b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 10:02:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 10:06:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 10:06:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 48 previous similar messages Oct 26 10:07:35 sh-103-53.int kernel: LustreError: 210922:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 10:07:35 sh-103-53.int kernel: LustreError: 210922:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 10:07:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 10:07:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 10:07:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 10:07:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 26 10:12:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 10:12:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 10:12:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572109663, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69af40/0x51ab3c4ee694c47 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abffd745c10 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 10:12:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 10:16:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 26 10:16:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 26 10:17:50 sh-103-53.int kernel: LustreError: 211499:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd140) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 10:17:50 sh-103-53.int kernel: LustreError: 211499:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 10:17:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 10:17:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 10:17:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 10:17:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 10:22:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 10:22:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 10:22:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572110277, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69cc80/0x51ab3c4ee694c7f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2abfff4c59da expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 10:22:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 10:26:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 26 10:26:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 26 10:28:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 
added to recovery queue. Health = 900 Oct 26 10:28:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 10:28:04 sh-103-53.int kernel: LustreError: 212073:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 10:28:04 sh-103-53.int kernel: LustreError: 212073:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 10:28:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 10:28:04 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 10:33:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 10:33:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 10:33:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572110893, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69b840/0x51ab3c4ee694cb7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac000ddaf42 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 10:33:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 10:36:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 10:36:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 10:38:13 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 10:38:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 10:38:24 sh-103-53.int kernel: LustreError: 212662:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 10:38:24 sh-103-53.int kernel: LustreError: 212662:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 10:38:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 10:38:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 10:43:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 10:43:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 10:43:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572111509, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699440/0x51ab3c4ee694cef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac00266fc7f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 10:43:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 10:46:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 10:46:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar 
messages Oct 26 10:48:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 10:48:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 10:48:38 sh-103-53.int kernel: LustreError: 213236:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd8c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 10:48:38 sh-103-53.int kernel: LustreError: 213236:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 10:48:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 10:48:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 10:53:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 10:53:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 10:53:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572112124, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69f740/0x51ab3c4ee694d27 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac00437ef60 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 10:53:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 10:56:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 10:56:57 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 26 10:58:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 10:58:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 10:58:51 sh-103-53.int kernel: LustreError: 213808:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd8c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 10:58:51 sh-103-53.int kernel: LustreError: 213808:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 10:58:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 10:58:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 11:03:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 11:03:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 11:03:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572112738, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69d100/0x51ab3c4ee694d5f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac005f77ff3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 11:03:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 11:06:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 0 seconds Oct 26 11:06:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 26 11:08:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 11:08:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 11:09:05 sh-103-53.int kernel: LustreError: 214400:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccf00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 11:09:05 sh-103-53.int kernel: LustreError: 214400:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 11:09:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 11:09:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 11:14:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 11:14:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 11:14:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572113351, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69fbc0/0x51ab3c4ee694d97 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac007bca646 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 11:14:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 11:17:04 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 11:17:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 11:18:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 11:18:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 11:19:17 sh-103-53.int kernel: LustreError: 214970:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc3c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 11:19:17 sh-103-53.int kernel: LustreError: 214970:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 11:19:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 11:19:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 11:21:55 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 11:24:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 11:24:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 11:24:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572113966, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0335a00/0x51ab3c4ee694dcf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac009b80411 expref: -99 pid: 
91159 timeout: 0 lvb_type: 0 Oct 26 11:24:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 11:27:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 11:27:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 26 11:29:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 11:29:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 11:29:34 sh-103-53.int kernel: LustreError: 215550:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 11:29:34 sh-103-53.int kernel: LustreError: 215550:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 11:29:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 11:29:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 11:34:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 11:34:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 11:34:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572114582, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0332640/0x51ab3c4ee694e07 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac00b23da18 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 11:34:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 11:37:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 11:37:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 50 previous similar messages Oct 26 11:39:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 11:39:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 11:39:48 sh-103-53.int kernel: LustreError: 216123:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2e40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 11:39:48 sh-103-53.int kernel: LustreError: 216123:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 11:39:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 11:39:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 11:43:14 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 11:44:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 11:44:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 11:44:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572115193, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0337740/0x51ab3c4ee694e3f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac00c64ab5f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 11:44:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 11:48:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 16 seconds Oct 26 11:48:10 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 44 previous similar messages Oct 26 11:48:20 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 11:49:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 11:49:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 11:50:01 sh-103-53.int kernel: LustreError: 216704:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 11:50:01 sh-103-53.int kernel: LustreError: 216704:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 11:50:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 11:50:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 11:55:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 11:55:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 11:55:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572115807, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93af740/0x51ab3c4ee694e77 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac00d6359f9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 11:55:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) 
Skipped 1 previous similar message Oct 26 11:58:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 26 11:58:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 40 previous similar messages Oct 26 11:59:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 11:59:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 12:00:17 sh-103-53.int kernel: LustreError: 217279:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1d680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 12:00:17 sh-103-53.int kernel: LustreError: 217279:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 12:00:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 12:00:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 12:05:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 12:05:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 12:05:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572116426, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93a98c0/0x51ab3c4ee694eaf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac00e88f5a6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 12:05:26 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 12:08:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 12:08:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 53 previous similar messages Oct 26 12:09:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 12:09:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 12:10:32 sh-103-53.int kernel: LustreError: 217871:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1d8c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 12:10:32 sh-103-53.int kernel: LustreError: 217871:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 12:10:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 12:10:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 12:15:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 12:15:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 12:15:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572117041, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93ade80/0x51ab3c4ee694ee7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac010299614 expref: -99 pid: 91159 timeout: 0 
lvb_type: 0 Oct 26 12:15:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 12:18:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 12:18:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 26 12:19:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 12:19:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 12:20:46 sh-103-53.int kernel: LustreError: 218447:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1c780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 12:20:46 sh-103-53.int kernel: LustreError: 218447:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 12:20:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 12:20:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 12:25:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 12:25:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 12:25:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572117657, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93a98c0/0x51ab3c4ee694f1f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local 
remote: 0xc9be2ac011acd055 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 12:25:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 12:28:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 12:28:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 26 12:30:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 12:30:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 12:31:06 sh-103-53.int kernel: LustreError: 219027:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1c0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 12:31:06 sh-103-53.int kernel: LustreError: 219027:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 12:31:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 12:31:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 12:36:14 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 12:36:14 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 12:36:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572118274, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93a8d80/0x51ab3c4ee694f57 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac013806fed expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 12:36:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 12:38:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 26 12:38:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 77 previous similar messages Oct 26 12:40:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 12:40:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 12:41:20 sh-103-53.int kernel: LustreError: 219602:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1c240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 12:41:20 sh-103-53.int kernel: LustreError: 219602:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 12:41:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 12:41:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 12:46:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 12:46:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 12:46:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572118886, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93adc40/0x51ab3c4ee694f8f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac01580fb35 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 12:46:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 12:48:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 12:48:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 75 previous similar messages Oct 26 12:50:19 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 12:50:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 12:51:31 sh-103-53.int kernel: LustreError: 220175:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1d680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 12:51:31 sh-103-53.int kernel: LustreError: 220175:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 12:51:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 12:51:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 12:56:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 12:56:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 12:56:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572119499, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93a9680/0x51ab3c4ee694fc7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0176062e6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 12:56:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 12:58:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 12:58:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 78 previous similar 
messages Oct 26 13:00:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 13:00:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 13:00:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 13:01:49 sh-103-53.int kernel: LustreError: 220766:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1c180) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 13:01:49 sh-103-53.int kernel: LustreError: 220766:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 13:01:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 13:01:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 13:06:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 13:06:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 13:06:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572120114, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93aad00/0x51ab3c4ee694fff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0187396f9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 13:06:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 13:08:46 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 13:08:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 26 13:10:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 13:10:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 13:12:03 sh-103-53.int kernel: LustreError: 221343:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1dd40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 13:12:03 sh-103-53.int kernel: LustreError: 221343:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 13:12:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 13:12:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 13:17:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 13:17:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 13:17:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572120730, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4140/0x51ab3c4ee695037 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0193154cb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 13:17:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous 
similar message Oct 26 13:19:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 13:19:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 26 13:20:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 13:20:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 13:22:15 sh-103-53.int kernel: LustreError: 221914:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09187b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 13:22:15 sh-103-53.int kernel: LustreError: 221914:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 13:22:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 13:22:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 13:26:57 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 13:27:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 13:27:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 13:27:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572121343, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef1b00/0x51ab3c4ee69506f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 
nid: local remote: 0xc9be2ac01a3de6c4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 13:27:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 13:29:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 13:29:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 13:31:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 13:31:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 13:32:31 sh-103-53.int kernel: LustreError: 222492:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 13:32:31 sh-103-53.int kernel: LustreError: 222492:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 13:32:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 13:32:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 13:37:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 13:37:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 13:37:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572121958, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef3a80/0x51ab3c4ee6950a7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac01b98141b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 13:37:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 13:39:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 13:39:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 13:41:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 13:41:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 13:42:46 sh-103-53.int kernel: LustreError: 223065:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 13:42:46 sh-103-53.int kernel: LustreError: 223065:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 13:42:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 13:42:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 13:47:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 13:47:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 13:47:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572122576, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2f40/0x51ab3c4ee6950df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac01c894cd2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 13:47:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 13:50:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 13:50:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 26 13:51:21 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 13:51:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 26 13:53:03 sh-103-53.int kernel: LustreError: 223642:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 13:53:03 sh-103-53.int kernel: LustreError: 223642:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 13:53:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 13:53:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 13:58:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 13:58:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 13:58:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572123190, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef60c0/0x51ab3c4ee695117 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac01d9e652e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 13:58:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 14:00:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 14:00:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 53 previous similar 
messages Oct 26 14:01:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 14:01:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 14:03:16 sh-103-53.int kernel: LustreError: 224233:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 14:03:16 sh-103-53.int kernel: LustreError: 224233:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 14:03:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 14:03:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 14:08:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 14:08:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 14:08:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572123803, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef1d40/0x51ab3c4ee69514f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac01e3ab51e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 14:08:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 14:10:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 14:10:34 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 55 previous similar messages Oct 26 14:11:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 14:11:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 14:13:30 sh-103-53.int kernel: LustreError: 224818:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f50f00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 14:13:30 sh-103-53.int kernel: LustreError: 224818:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 14:13:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 14:13:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 14:18:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 14:18:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 14:18:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572124419, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef57c0/0x51ab3c4ee695187 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac01f046778 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 14:18:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 14:20:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 0 seconds Oct 26 14:20:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 14:21:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 14:21:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 14:23:47 sh-103-53.int kernel: LustreError: 225396:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 14:23:47 sh-103-53.int kernel: LustreError: 225396:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 14:23:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 14:23:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 14:28:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 14:28:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 14:28:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572125033, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2400/0x51ab3c4ee6951bf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac020606775 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 14:28:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 14:30:52 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 14:30:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 14:32:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 14:32:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 14:34:01 sh-103-53.int kernel: LustreError: 225970:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f506c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 14:34:01 sh-103-53.int kernel: LustreError: 225970:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 14:34:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 14:34:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 14:39:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 14:39:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 14:39:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572125651, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef4c80/0x51ab3c4ee6951f7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0227c8fd6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 14:39:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous 
similar message Oct 26 14:41:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 14:41:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 14:42:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 14:42:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 14:44:18 sh-103-53.int kernel: LustreError: 226548:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f512c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 14:44:18 sh-103-53.int kernel: LustreError: 226548:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 14:44:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 14:44:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 14:49:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 14:49:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 14:49:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572126264, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef6780/0x51ab3c4ee69522f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac024674b47 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 14:49:24 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 14:51:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 14:51:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 26 14:52:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 14:52:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 14:54:33 sh-103-53.int kernel: LustreError: 227123:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f512c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 14:54:33 sh-103-53.int kernel: LustreError: 227123:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 14:54:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 14:54:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 14:59:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 14:59:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 14:59:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572126883, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2880/0x51ab3c4ee695267 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac025ecf48f expref: -99 pid: 91159 timeout: 0 
lvb_type: 0 Oct 26 14:59:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 15:01:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 15:01:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 50 previous similar messages Oct 26 15:02:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 15:02:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 15:04:52 sh-103-53.int kernel: LustreError: 227728:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f515c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 15:04:52 sh-103-53.int kernel: LustreError: 227728:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 15:04:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 15:04:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 15:10:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 15:10:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 15:10:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572127500, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef5340/0x51ab3c4ee69529f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local 
remote: 0xc9be2ac027bed14f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 15:10:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 15:11:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 15:11:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 56 previous similar messages Oct 26 15:12:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 15:12:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 15:15:11 sh-103-53.int kernel: LustreError: 228306:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f50f00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 15:15:11 sh-103-53.int kernel: LustreError: 228306:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 15:15:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 15:15:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 15:20:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 15:20:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 15:20:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572128116, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef0480/0x51ab3c4ee6952d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0291fb9d3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 15:20:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 15:21:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 15:21:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 26 15:22:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 15:22:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 15:25:25 sh-103-53.int kernel: LustreError: 228880:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 15:25:25 sh-103-53.int kernel: LustreError: 228880:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 15:25:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 15:25:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 15:30:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 15:30:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 15:30:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572128734, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef60c0/0x51ab3c4ee69530f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac02a46cb82 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 15:30:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 15:31:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 15:31:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 53 previous similar messages Oct 26 15:33:02 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 15:33:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 15:35:39 sh-103-53.int kernel: LustreError: 229453:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f50cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 15:35:39 sh-103-53.int kernel: LustreError: 229453:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 15:35:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 15:35:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 15:40:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 15:40:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 15:40:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572129346, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2f40/0x51ab3c4ee695347 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac02b801fd3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 15:40:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 15:42:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 15:42:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar 
messages Oct 26 15:43:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 15:43:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 15:45:52 sh-103-53.int kernel: LustreError: 230026:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f50480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 15:45:52 sh-103-53.int kernel: LustreError: 230026:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 15:45:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 15:45:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 15:50:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 15:50:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 15:50:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572129957, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef21c0/0x51ab3c4ee69537f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac02cb1c7b6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 15:50:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 15:52:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 15:52:11 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 26 15:53:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 15:53:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 15:56:03 sh-103-53.int kernel: LustreError: 230596:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51a40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 15:56:03 sh-103-53.int kernel: LustreError: 230596:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 15:56:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 15:56:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 16:01:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 16:01:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 16:01:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572130570, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2ac0/0x51ab3c4ee6953b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac02e16dc31 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 16:01:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 16:02:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 0 seconds Oct 26 16:02:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 26 16:03:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 16:03:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 16:06:16 sh-103-53.int kernel: LustreError: 231187:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f515c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 16:06:16 sh-103-53.int kernel: LustreError: 231187:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 16:06:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 16:06:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 16:11:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 16:11:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 16:11:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572131183, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef60c0/0x51ab3c4ee6953ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac02f6890fb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 16:11:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 16:12:27 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 16:12:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 26 16:13:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 16:13:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 16:16:29 sh-103-53.int kernel: LustreError: 231759:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 16:16:29 sh-103-53.int kernel: LustreError: 231759:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 16:16:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 16:16:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 16:21:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 16:21:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 16:21:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572131798, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef3cc0/0x51ab3c4ee695427 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac030cab8d8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 16:21:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous 
similar message Oct 26 16:22:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 16:22:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 16:23:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 16:23:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 26 16:26:46 sh-103-53.int kernel: LustreError: 232336:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f506c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 16:26:46 sh-103-53.int kernel: LustreError: 232336:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 16:26:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 16:26:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 16:31:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 16:31:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 16:31:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572132412, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef57c0/0x51ab3c4ee69545f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0322496f8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 16:31:52 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 16:32:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 16:32:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 16:34:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 16:34:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 16:37:01 sh-103-53.int kernel: LustreError: 232911:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a1a683440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 16:37:01 sh-103-53.int kernel: LustreError: 232911:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 16:37:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 16:37:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 16:42:08 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 16:42:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 16:42:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 16:42:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572133028, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6a880/0x51ab3c4ee695497 lrc: 4/1,0 
mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0336dbb6b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 16:42:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 16:42:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 16:42:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 16:44:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 16:44:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 16:47:18 sh-103-53.int kernel: LustreError: 233490:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820bf2d800) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 16:47:18 sh-103-53.int kernel: LustreError: 233490:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 16:47:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 16:47:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 16:52:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 16:52:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 16:52:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572133645, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6ba80/0x51ab3c4ee6954cf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac034ba561b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 16:52:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 16:52:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 16:52:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 16:54:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 16:54:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 16:57:33 sh-103-53.int kernel: LustreError: 234069:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 16:57:33 sh-103-53.int kernel: LustreError: 234069:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 16:57:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 16:57:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 17:02:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 17:02:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 17:02:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572134262, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0333600/0x51ab3c4ee695507 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac036109b83 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 17:02:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 17:02:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 17:02:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 26 17:04:32 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 17:04:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 17:07:47 sh-103-53.int kernel: LustreError: 234664:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 17:07:47 sh-103-53.int kernel: LustreError: 234664:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 17:07:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 17:07:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 17:11:38 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 17:12:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 17:12:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 17:12:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572134873, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0335580/0x51ab3c4ee69553f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0375852fa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 17:12:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 17:13:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed 
out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 17:13:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 17:14:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 17:14:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 17:18:03 sh-103-53.int kernel: LustreError: 235238:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 17:18:03 sh-103-53.int kernel: LustreError: 235238:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 17:18:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 17:18:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 17:20:37 sh-103-53.int kernel: LNetError: 233182:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 26 17:23:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 17:23:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 17:23:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 17:23:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 17:23:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572135493, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0337080/0x51ab3c4ee695577 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0389e579f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 17:23:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 17:24:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 17:24:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 17:27:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 17:28:18 sh-103-53.int kernel: LustreError: 235815:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2a80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 17:28:18 sh-103-53.int kernel: LustreError: 235815:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 17:28:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 17:28:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 17:33:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 17:33:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 26 17:33:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 17:33:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 17:33:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572136107, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0337740/0x51ab3c4ee6955af lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac03a090a70 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 17:33:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 17:34:03 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 17:35:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 17:35:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 17:38:34 sh-103-53.int kernel: LustreError: 236391:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2c00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 17:38:34 sh-103-53.int kernel: LustreError: 236391:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 17:38:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 17:38:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 17:43:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 17:43:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 26 17:43:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 17:43:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 17:43:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572136721, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0332640/0x51ab3c4ee6955e7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac03b7e537c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 17:43:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 17:45:12 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 17:45:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 17:48:49 sh-103-53.int kernel: LustreError: 236972:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3080) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 17:48:49 sh-103-53.int kernel: LustreError: 236972:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 17:48:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 17:48:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 17:53:18 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 17:53:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 17:53:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 26 17:53:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 17:53:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 17:53:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572137336, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0330900/0x51ab3c4ee69561f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac03cd08e65 expref: -99 pid: 91159 
timeout: 0 lvb_type: 0 Oct 26 17:53:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 17:55:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 17:55:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 17:57:23 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 17:59:06 sh-103-53.int kernel: LustreError: 237549:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae26c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 17:59:06 sh-103-53.int kernel: LustreError: 237549:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 17:59:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 17:59:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 18:03:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 18:03:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 18:04:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 18:04:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 18:04:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572137955, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 
ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0330b40/0x51ab3c4ee695657 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac03e415558 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 18:04:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 18:05:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 18:05:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 18:09:22 sh-103-53.int kernel: LustreError: 238142:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 18:09:22 sh-103-53.int kernel: LustreError: 238142:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 18:09:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 18:09:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 18:14:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 18:14:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 26 18:14:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 18:14:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 18:14:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572138569, 
300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf50000/0x51ab3c4ee69568f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac03fb5f0fa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 18:14:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 18:15:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 18:15:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 18:19:38 sh-103-53.int kernel: LustreError: 238718:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4480) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 18:19:38 sh-103-53.int kernel: LustreError: 238718:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 18:19:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 18:19:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 18:24:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 18:24:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 26 18:24:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 18:24:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 18:24:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572139188, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55a00/0x51ab3c4ee6956c7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac04130fd0e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 18:24:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 18:25:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 18:25:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 18:29:53 sh-103-53.int kernel: LustreError: 239295:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 18:29:53 sh-103-53.int kernel: LustreError: 239295:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 18:29:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 18:29:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 18:34:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 18:34:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 26 18:35:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 18:35:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 18:35:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572139801, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf53600/0x51ab3c4ee6956ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac042a154b7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 18:35:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 18:36:02 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 18:36:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 18:37:07 sh-103-53.int kernel: LNetError: 233182:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 226 Oct 26 18:40:10 sh-103-53.int kernel: LustreError: 239875:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b49c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 18:40:10 sh-103-53.int kernel: LustreError: 239875:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 18:40:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 18:40:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 18:44:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 18:44:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 26 18:45:17 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 18:45:17 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 18:45:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572140417, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf51680/0x51ab3c4ee695737 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac044d3f669 expref: -99 pid: 91159 
timeout: 0 lvb_type: 0 Oct 26 18:45:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 18:46:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 18:46:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 18:50:25 sh-103-53.int kernel: LustreError: 240449:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5500) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 18:50:25 sh-103-53.int kernel: LustreError: 240449:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 18:50:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 18:50:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 18:55:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 18:55:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 18:55:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 18:55:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 18:55:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572141033, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf53600/0x51ab3c4ee69576f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: 
local remote: 0xc9be2ac04630db8a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 18:55:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 18:56:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 18:56:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 19:00:41 sh-103-53.int kernel: LustreError: 241023:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 19:00:41 sh-103-53.int kernel: LustreError: 241023:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 19:00:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 19:00:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 19:05:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 19:05:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 26 19:05:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 19:05:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 19:05:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572141648, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55a00/0x51ab3c4ee6957a7 lrc: 4/1,0 mode: --/CR res: 
[0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0478380ae expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 19:05:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 19:06:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 19:06:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 19:10:57 sh-103-53.int kernel: LustreError: 241615:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b58c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 19:10:57 sh-103-53.int kernel: LustreError: 241615:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 19:10:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 19:10:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 19:15:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Oct 26 19:15:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 19:16:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 19:16:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 19:16:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572142265, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: 
ffff9754fbf50000/0x51ab3c4ee6957df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac048db69d7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 19:16:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 19:16:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 19:16:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 19:21:12 sh-103-53.int kernel: LustreError: 242196:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4a80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 19:21:12 sh-103-53.int kernel: LustreError: 242196:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 19:21:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 19:21:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 19:25:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 19:25:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 19:26:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 19:26:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 19:26:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572142878, 300s ago), entering recovery for 
MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52640/0x51ab3c4ee695817 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac04a504e7b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 19:26:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 19:26:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 19:26:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 19:31:26 sh-103-53.int kernel: LustreError: 242769:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b40c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 19:31:26 sh-103-53.int kernel: LustreError: 242769:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 19:31:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 19:31:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 19:35:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 19:35:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 19:36:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 19:36:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 19:36:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out 
(enqueued at 1572143495, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55100/0x51ab3c4ee69584f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac04babe10a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 19:36:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 19:37:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 19:37:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 19:41:42 sh-103-53.int kernel: LustreError: 243344:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5a40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 19:41:42 sh-103-53.int kernel: LustreError: 243344:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 19:41:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 19:41:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 19:45:08 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 19:45:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 19:45:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 19:46:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 19:46:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 19:46:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572144108, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57bc0/0x51ab3c4ee695887 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac04d26ae64 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 19:46:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 19:47:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 19:47:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 19:51:53 sh-103-53.int kernel: LustreError: 243916:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 19:51:53 sh-103-53.int kernel: LustreError: 243916:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 19:51:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 19:51:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 19:55:18 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 19:55:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 19:55:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 26 19:57:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 19:57:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 19:57:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572144720, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57740/0x51ab3c4ee6958bf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac04e96b6ac expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 19:57:00 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 19:57:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 19:57:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 20:00:23 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 20:02:10 sh-103-53.int kernel: LustreError: 244510:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4fc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 20:02:10 sh-103-53.int kernel: LustreError: 244510:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 20:02:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 20:02:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 20:05:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 20:05:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 26 20:07:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 20:07:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 20:07:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572145335, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55340/0x51ab3c4ee6958f7 lrc: 4/1,0 
mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac04ff1ea67 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 20:07:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 20:07:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 20:07:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 20:12:23 sh-103-53.int kernel: LustreError: 245085:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5c80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 20:12:23 sh-103-53.int kernel: LustreError: 245085:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 20:12:23 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 20:12:23 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 20:15:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 26 20:15:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 76 previous similar messages Oct 26 20:17:32 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 20:17:32 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 20:17:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572145952, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: 
MGC10.0.10.51@o2ib7 lock: ffff9754fbf52640/0x51ab3c4ee69592f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0514078f6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 20:17:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 20:17:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 20:17:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 20:22:41 sh-103-53.int kernel: LustreError: 245663:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 20:22:41 sh-103-53.int kernel: LustreError: 245663:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 20:22:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 20:22:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 20:25:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 20 seconds Oct 26 20:25:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 20:27:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 20:27:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 20:27:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572146569, 300s 
ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf50b40/0x51ab3c4ee695967 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac052bf0de6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 20:27:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 20:27:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 20:27:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 20:32:58 sh-103-53.int kernel: LustreError: 246240:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 20:32:58 sh-103-53.int kernel: LustreError: 246240:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 20:32:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 20:32:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 20:35:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Oct 26 20:35:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 79 previous similar messages Oct 26 20:38:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 20:38:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 20:38:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 20:38:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 20:38:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572147188, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea721c0/0x51ab3c4ee69599f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac054271cff expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 20:38:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 20:43:17 sh-103-53.int kernel: LustreError: 246819:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c01c80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 20:43:17 sh-103-53.int kernel: LustreError: 246819:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 20:43:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 20:43:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 20:45:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 20:45:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 76 previous similar messages Oct 26 20:47:09 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 20:48:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 20:48:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 20:48:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 20:48:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 20:48:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572147805, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea745c0/0x51ab3c4ee6959d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac055a42113 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 20:48:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 20:53:33 sh-103-53.int 
kernel: LustreError: 247394:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 20:53:33 sh-103-53.int kernel: LustreError: 247394:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 20:53:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 20:53:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 20:55:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 26 20:55:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 26 20:58:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 20:58:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 20:58:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 20:58:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 20:58:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572148422, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea72400/0x51ab3c4ee695a0f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0574a9315 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 20:58:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 
1 previous similar message Oct 26 21:03:49 sh-103-53.int kernel: LustreError: 248000:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00b40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 21:03:49 sh-103-53.int kernel: LustreError: 248000:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 21:03:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 21:03:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 21:06:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 26 21:06:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 78 previous similar messages Oct 26 21:08:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 21:08:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 21:08:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 21:08:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 21:08:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572149039, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea706c0/0x51ab3c4ee695a47 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac058d6bd29 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 21:08:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 21:14:08 sh-103-53.int kernel: LustreError: 248587:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 21:14:08 sh-103-53.int kernel: LustreError: 248587:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 21:14:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 21:14:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 21:16:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 26 21:16:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 21:18:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 21:18:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 21:19:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 21:19:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 21:19:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572149656, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea77500/0x51ab3c4ee695a7f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac05a80a84e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 21:19:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 21:24:26 sh-103-53.int kernel: LustreError: 249165:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c01680) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 21:24:26 sh-103-53.int kernel: LustreError: 249165:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 21:24:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 21:24:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 21:26:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 21:26:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 26 21:28:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 21:28:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 26 21:29:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 21:29:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 21:29:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572150276, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea73f00/0x51ab3c4ee695ab7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac05c3abdfb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 21:29:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 21:34:42 sh-103-53.int kernel: LustreError: 249741:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c01200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 21:34:42 sh-103-53.int kernel: LustreError: 249741:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 21:34:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 21:34:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 21:36:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 21:36:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 26 21:39:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 21:39:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 21:39:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 21:39:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 21:39:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572150888, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea73180/0x51ab3c4ee695aef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac05dca4bde expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 21:39:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 21:44:57 sh-103-53.int kernel: LustreError: 
250319:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c01980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 21:44:57 sh-103-53.int kernel: LustreError: 250319:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 21:44:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 21:44:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 21:46:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 21:46:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 26 21:47:55 sh-103-53.int kernel: LNetError: 246982:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 201 Oct 26 21:49:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 21:49:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 21:50:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 21:50:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 21:50:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572151505, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea760c0/0x51ab3c4ee695b27 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac05fc86fb2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 21:50:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 21:55:14 sh-103-53.int kernel: LustreError: 250895:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c01c80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 21:55:14 sh-103-53.int kernel: LustreError: 250895:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 21:55:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 21:55:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 21:56:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 21:56:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 26 21:59:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 21:59:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 22:00:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 22:00:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 22:00:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572152123, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea76e40/0x51ab3c4ee695b5f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac06180877e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 22:00:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 22:03:26 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 22:05:30 sh-103-53.int 
kernel: LustreError: 251488:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c00840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 22:05:30 sh-103-53.int kernel: LustreError: 251488:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 22:05:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 22:05:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 22:06:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 22:06:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 26 22:09:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 22:09:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 22:10:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 22:10:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 22:10:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572152737, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea76780/0x51ab3c4ee695b97 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac062ff2a6e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 22:10:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 
1 previous similar message Oct 26 22:15:43 sh-103-53.int kernel: LustreError: 252063:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a13c012c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 22:15:43 sh-103-53.int kernel: LustreError: 252063:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 22:15:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 22:15:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 22:17:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 22:17:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 22:19:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 22:19:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 22:20:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 22:20:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 22:20:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572153351, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea1200/0x51ab3c4ee695bcf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac064534070 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 22:20:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 22:25:59 sh-103-53.int kernel: LustreError: 252639:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186b40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 22:25:59 sh-103-53.int kernel: LustreError: 252639:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 22:25:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 22:25:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 22:27:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 26 22:27:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 77 previous similar messages Oct 26 22:29:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 22:29:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 26 22:31:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 22:31:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 22:31:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572153968, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee695c07 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0659b4f1a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 22:31:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 22:36:14 sh-103-53.int kernel: LustreError: 253211:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186000) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 22:36:14 sh-103-53.int kernel: LustreError: 253211:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 22:36:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 22:36:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 22:37:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 22:37:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 79 previous similar messages Oct 26 22:40:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 22:40:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 22:41:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 22:41:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 22:41:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572154583, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7980/0x51ab3c4ee695c3f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac066ea9729 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 22:41:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 22:46:30 sh-103-53.int kernel: LustreError: 253789:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff976a09186300) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 22:46:30 sh-103-53.int kernel: LustreError: 253789:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 22:46:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 22:46:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 22:47:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 26 22:47:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 22:50:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 22:50:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 22:51:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 22:51:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 22:51:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572155198, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7500/0x51ab3c4ee695c77 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac067ff1497 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 22:51:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 22:56:46 sh-103-53.int kernel: LustreError: 
254364:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296b40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 22:56:46 sh-103-53.int kernel: LustreError: 254364:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 22:56:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 22:56:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 22:57:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 22:57:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 26 23:00:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 23:00:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 23:01:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 23:01:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 23:01:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572155816, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6bcc0/0x51ab3c4ee695caf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0691538c1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 23:01:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 26 23:05:27 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 23:07:05 sh-103-53.int kernel: LustreError: 254959:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a16d0b2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 23:07:05 sh-103-53.int kernel: LustreError: 254959:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 23:07:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 23:07:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 23:07:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 23:07:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 26 23:10:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 26 23:10:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 23:12:14 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 23:12:14 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 23:12:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572156434, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6cec0/0x51ab3c4ee695ce7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac06a27b1c5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 23:12:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 23:17:22 sh-103-53.int kernel: LustreError: 255537:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978200750e40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 26 23:17:22 sh-103-53.int kernel: LustreError: 255537:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 23:17:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 23:17:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 23:18:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 23:18:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 26 23:20:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 23:20:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 23:22:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 23:22:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 23:22:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572157051, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc698900/0x51ab3c4ee695d1f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac06b3b09a3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 23:22:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 23:27:41 sh-103-53.int kernel: LustreError: 256116:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd200) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 23:27:41 sh-103-53.int kernel: LustreError: 256116:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 23:27:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 23:27:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 23:28:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 23:28:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 23:30:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 26 23:30:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 23:32:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 23:32:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 23:32:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572157670, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69e9c0/0x51ab3c4ee695d57 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac06c75b64b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 23:32:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 23:37:56 sh-103-53.int kernel: LustreError: 256693:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 23:37:56 sh-103-53.int kernel: LustreError: 256693:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 23:37:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 23:37:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 23:38:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 23:38:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 26 23:41:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 26 23:41:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 26 23:43:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 23:43:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 23:43:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572158282, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69aac0/0x51ab3c4ee695d8f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac06e03a0ac expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 23:43:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 23:43:03 sh-103-53.int kernel: LNetError: 
91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 26 23:48:11 sh-103-53.int kernel: LustreError: 257271:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 23:48:11 sh-103-53.int kernel: LustreError: 257271:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 23:48:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 23:48:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 23:48:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 23:48:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 26 23:49:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 26 23:50:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 26 23:51:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 26 23:51:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 26 23:51:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 26 23:52:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 26 23:52:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 26 23:53:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 26 23:53:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 26 23:53:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572158901, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699f80/0x51ab3c4ee695dc7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac06f8189c8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 26 23:53:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 26 23:54:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 26 23:55:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 26 23:56:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 26 23:57:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 26 23:57:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 26 23:58:26 sh-103-53.int kernel: LustreError: 257853:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd8c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 26 23:58:26 sh-103-53.int kernel: LustreError: 257853:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 26 23:58:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 26 23:58:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 26 23:58:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 26 23:58:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 27 00:00:11 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 00:00:11 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 27 00:01:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. 
Health = 900 Oct 27 00:01:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 27 00:03:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 00:03:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 00:03:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572159516, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc698240/0x51ab3c4ee695dff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac074f3ae86 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 00:03:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 00:04:57 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 00:04:57 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 3 previous similar messages Oct 27 00:08:42 sh-103-53.int kernel: LustreError: 258454:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd800) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 00:08:42 sh-103-53.int kernel: LustreError: 258454:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 00:08:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 00:08:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 00:08:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 00:08:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 27 00:11:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 27 00:11:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 27 00:13:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 00:13:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 00:13:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572160130, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69af40/0x51ab3c4ee695e37 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac076764360 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 00:13:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 00:16:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 00:16:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 00:18:56 sh-103-53.int kernel: LustreError: 259028:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 00:18:56 sh-103-53.int kernel: LustreError: 259028:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 00:18:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 00:18:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 00:19:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 00:19:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 27 00:21:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 00:21:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 25 previous similar messages Oct 27 00:24:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 00:24:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 00:24:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572160746, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69d7c0/0x51ab3c4ee695e6f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0777ee1bd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 00:24:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 00:26:37 sh-103-53.int kernel: LNetError: 250076:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 00:26:37 sh-103-53.int kernel: LNetError: 250076:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 00:29:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 00:29:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 27 00:29:14 sh-103-53.int kernel: LustreError: 259612:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc600) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 00:29:14 sh-103-53.int kernel: LustreError: 259612:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 00:29:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 00:29:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 00:31:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 00:31:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 00:34:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 00:34:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 00:34:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572161362, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69d580/0x51ab3c4ee695ea7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac078613a20 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 00:34:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 00:34:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 00:37:49 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 00:37:49 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 00:39:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 00:39:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 27 00:39:29 sh-103-53.int kernel: LustreError: 260185:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 00:39:29 sh-103-53.int kernel: LustreError: 260185:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 00:39:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 00:39:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 00:42:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 00:42:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 00:44:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 00:44:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 00:44:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572161977, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699440/0x51ab3c4ee695edf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0793dc993 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 00:44:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 00:48:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 00:48:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 27 00:49:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 00:49:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 00:49:44 sh-103-53.int kernel: LustreError: 260810:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 00:49:44 sh-103-53.int kernel: LustreError: 260810:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 00:49:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 00:49:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 00:52:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 00:52:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 00:54:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 00:54:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 00:54:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572162594, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69f080/0x51ab3c4ee695f17 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0795c7f8f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 00:54:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 00:58:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 00:58:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 27 01:00:02 sh-103-53.int kernel: LustreError: 261389:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcde00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 01:00:02 sh-103-53.int kernel: LustreError: 261389:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 01:00:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 01:00:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 01:00:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 27 01:00:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 27 01:02:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 01:02:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 01:05:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 01:05:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 01:05:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572163207, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69f2c0/0x51ab3c4ee695f4f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac079974d8d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 01:05:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 01:09:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 01:09:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 01:10:16 sh-103-53.int kernel: LustreError: 261985:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc3c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 01:10:16 sh-103-53.int kernel: LustreError: 261985:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 01:10:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 01:10:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 01:10:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 01:10:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 27 01:12:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 01:12:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 01:15:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 01:15:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 01:15:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572163826, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc6998c0/0x51ab3c4ee695f87 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac079e82f5c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 01:15:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 01:20:29 sh-103-53.int kernel: LNetError: 262403:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 01:20:29 sh-103-53.int kernel: LNetError: 262403:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 01:20:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 01:20:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 01:20:34 sh-103-53.int kernel: LustreError: 262561:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 01:20:34 sh-103-53.int kernel: LustreError: 262561:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 01:20:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 01:20:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 01:22:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 01:22:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 01:25:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 01:25:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 01:25:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572164440, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69e0c0/0x51ab3c4ee695fbf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac07a04b8df expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 01:25:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 01:30:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 01:30:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 01:30:48 sh-103-53.int kernel: LustreError: 263138:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3200) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 01:30:48 sh-103-53.int kernel: LustreError: 263138:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 01:30:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 01:30:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 01:32:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 01:32:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 01:34:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 01:34:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 01:35:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 01:35:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 01:35:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572165058, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0334a40/0x51ab3c4ee695ff7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac07a88b2f2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 01:35:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 01:40:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 01:40:49 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 01:41:06 sh-103-53.int kernel: LustreError: 263715:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae20c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 01:41:06 sh-103-53.int kernel: LustreError: 263715:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 01:41:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 01:41:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 01:42:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 01:42:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 01:44:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 01:44:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 01:46:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 01:46:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 01:46:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572165673, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0332d00/0x51ab3c4ee69602f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac08074e1e4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 01:46:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 01:50:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 01:50:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 01:51:23 sh-103-53.int kernel: LustreError: 264290:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 01:51:23 sh-103-53.int kernel: LustreError: 264290:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 01:51:23 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 01:51:23 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 01:53:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 01:53:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 01:56:05 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 01:56:05 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 01:56:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 01:56:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 01:56:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572166289, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0333180/0x51ab3c4ee696067 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac08b540fef expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 01:56:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 01:58:16 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 
27 02:01:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 02:01:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 02:01:38 sh-103-53.int kernel: LustreError: 264886:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 02:01:38 sh-103-53.int kernel: LustreError: 264886:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 02:01:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 02:01:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 02:03:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 02:03:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 02:06:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 02:06:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 02:06:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572166907, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0336e40/0x51ab3c4ee69609f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac095c4cdef expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 02:06:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 02:10:14 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 02:10:14 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 02:11:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 02:11:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 02:11:55 sh-103-53.int kernel: LustreError: 265460:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 02:11:55 sh-103-53.int kernel: LustreError: 265460:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 02:11:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 02:11:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 02:13:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 02:13:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 02:14:32 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 02:17:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 02:17:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 02:17:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572167525, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0332ac0/0x51ab3c4ee6960d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0a18e02a0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 02:17:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 02:20:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 02:20:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 27 02:21:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 02:21:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 27 02:22:12 sh-103-53.int kernel: LustreError: 266038:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2b40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 02:22:12 sh-103-53.int kernel: LustreError: 266038:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 02:22:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 02:22:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 02:23:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 02:23:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 02:27:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 02:27:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 02:27:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572168139, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0334800/0x51ab3c4ee69610f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0aa2e9fd4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 02:27:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 02:31:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 02:31:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 02:31:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 02:31:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 02:32:27 sh-103-53.int kernel: LustreError: 266612:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2a80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 02:32:27 sh-103-53.int kernel: LustreError: 266612:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 02:32:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 02:32:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 02:33:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 02:33:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 02:37:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 02:37:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 02:37:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572168756, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0335100/0x51ab3c4ee696147 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0b507e878 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 02:37:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 02:41:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 02:41:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 02:41:52 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 02:41:52 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 02:42:45 sh-103-53.int kernel: LustreError: 267188:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 02:42:45 sh-103-53.int kernel: LustreError: 267188:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 02:42:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 02:42:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 02:44:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 02:44:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 02:47:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 02:47:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 02:47:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572169371, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0335100/0x51ab3c4ee69617f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0bc41eebe expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 02:47:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 02:51:55 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 02:51:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 56 previous similar messages Oct 27 02:52:03 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 02:52:03 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 02:53:00 sh-103-53.int kernel: LustreError: 267763:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 02:53:00 sh-103-53.int kernel: LustreError: 267763:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 02:53:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 02:53:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 02:54:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 02:54:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 02:58:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 02:58:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 02:58:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572169987, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0334800/0x51ab3c4ee6961b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0beb74464 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 02:58:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 03:01:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 03:01:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 03:02:46 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 03:02:46 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 27 03:03:12 sh-103-53.int kernel: LustreError: 268352:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2f00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 03:03:12 sh-103-53.int kernel: LustreError: 268352:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 03:03:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 03:03:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 03:04:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 03:04:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 03:05:22 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 03:08:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 03:08:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 03:08:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572170601, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0332ac0/0x51ab3c4ee6961ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0d11615f8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 03:08:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 03:12:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 03:12:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 27 03:13:10 sh-103-53.int 
kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 03:13:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 03:13:30 sh-103-53.int kernel: LustreError: 268929:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 03:13:30 sh-103-53.int kernel: LustreError: 268929:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 03:13:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 03:13:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 03:14:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 03:14:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 03:16:33 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 03:18:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 03:18:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 03:18:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572171219, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a03372c0/0x51ab3c4ee696227 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0e7995b14 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 03:18:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 03:22:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 03:22:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 27 03:23:49 sh-103-53.int kernel: LustreError: 269508:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3c80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 03:23:49 sh-103-53.int kernel: LustreError: 269508:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 03:23:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 03:23:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 03:24:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 03:24:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 03:25:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 03:25:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 03:28:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 03:28:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 03:28:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572171838, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0333180/0x51ab3c4ee69625f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0f36011de expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 03:28:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 03:32:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 03:32:12 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 27 03:34:06 sh-103-53.int kernel: LustreError: 270136:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 03:34:06 sh-103-53.int kernel: LustreError: 270136:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 03:34:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 03:34:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 03:34:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 03:34:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 03:36:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 03:36:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 27 03:39:14 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 03:39:14 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 03:39:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572172454, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0332400/0x51ab3c4ee696297 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac0fb545d04 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 03:39:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 03:42:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 03:42:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 03:44:20 sh-103-53.int kernel: LustreError: 270710:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 03:44:20 sh-103-53.int kernel: LustreError: 270710:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 03:44:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 03:44:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 03:45:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 03:45:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 03:46:56 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 03:46:56 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 03:49:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 03:49:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 03:49:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572173067, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0334a40/0x51ab3c4ee6962cf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac105378dfc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 03:49:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 03:52:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 03:52:54 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 03:54:34 sh-103-53.int kernel: LustreError: 271284:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2f00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 03:54:34 sh-103-53.int kernel: LustreError: 271284:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 03:54:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 03:54:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 03:55:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 03:55:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 03:57:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 03:57:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 03:59:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 03:59:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 03:59:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572173681, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0332400/0x51ab3c4ee696307 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac10df0a3f1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 03:59:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 04:03:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 04:03:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 04:04:50 sh-103-53.int kernel: LustreError: 271874:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 04:04:50 sh-103-53.int kernel: LustreError: 271874:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 04:04:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 04:04:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 04:05:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 04:05:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 04:07:23 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 04:07:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 04:07:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 04:09:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 04:09:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 04:09:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572174299, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd5580/0x51ab3c4ee69633f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac11476a475 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 04:09:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message 
Oct 27 04:13:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 27 04:13:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 04:15:06 sh-103-53.int kernel: LustreError: 272461:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781f4a7af00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 04:15:06 sh-103-53.int kernel: LustreError: 272461:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 04:15:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 04:15:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 04:15:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 04:15:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 04:18:02 sh-103-53.int kernel: LNetError: 262403:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 04:18:02 sh-103-53.int kernel: LNetError: 262403:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 04:18:33 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 04:20:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 04:20:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 04:20:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572174915, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6e9c0/0x51ab3c4ee696377 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1230531d6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 04:20:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 04:23:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 04:23:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 27 04:25:22 sh-103-53.int kernel: LustreError: 273039:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781f4a7a600) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 04:25:22 sh-103-53.int kernel: LustreError: 273039:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 04:25:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 04:25:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 04:25:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 04:25:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 04:28:43 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 04:29:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 04:29:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 04:30:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 04:30:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 04:30:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572175529, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6fbc0/0x51ab3c4ee6963af lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac13b653201 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 04:30:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message 
Oct 27 04:33:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 04:33:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 04:35:39 sh-103-53.int kernel: LustreError: 273615:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 04:35:39 sh-103-53.int kernel: LustreError: 273615:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 04:35:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 04:35:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 04:35:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 04:35:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 04:39:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 04:39:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 04:40:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 04:40:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 04:40:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572176147, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69b840/0x51ab3c4ee6963e7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1459b81be expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 04:40:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 04:42:59 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 04:43:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 04:43:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 27 04:45:56 sh-103-53.int kernel: LustreError: 274190:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 04:45:56 sh-103-53.int kernel: LustreError: 274190:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 04:45:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 04:45:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 04:46:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 04:46:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 04:50:59 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 04:50:59 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 04:51:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 04:51:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 04:51:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572176765, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a03333c0/0x51ab3c4ee69641f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac14cb8035a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 04:51:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 04:53:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 04:53:27 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 04:54:08 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 04:56:11 sh-103-53.int kernel: LustreError: 274764:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 04:56:11 sh-103-53.int kernel: LustreError: 274764:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 04:56:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 04:56:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 04:56:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 04:56:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 05:01:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 05:01:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 05:01:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572177381, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0335a00/0x51ab3c4ee696457 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac15c13907f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 05:01:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 05:01:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 05:01:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 05:03:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 05:03:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 27 05:06:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 05:06:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 05:06:26 sh-103-53.int kernel: LustreError: 275358:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2c00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 05:06:26 sh-103-53.int kernel: LustreError: 275358:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 05:06:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 05:06:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 05:11:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 05:11:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 05:11:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572177995, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0337980/0x51ab3c4ee69648f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac15fdf6f10 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 05:11:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 05:12:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 05:12:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 05:13:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 05:13:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 05:16:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 05:16:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 05:16:43 sh-103-53.int kernel: LustreError: 275937:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 05:16:43 sh-103-53.int kernel: LustreError: 275937:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 05:16:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 05:16:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 05:20:34 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 05:21:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 05:21:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 05:21:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572178609, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: 
MGC10.0.10.51@o2ib7 lock: ffff9769a0337740/0x51ab3c4ee6964c7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac16530ca50 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 05:21:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 05:23:31 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 05:23:31 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 27 05:23:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 05:23:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 27 05:25:39 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 05:26:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 05:26:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 05:26:54 sh-103-53.int kernel: LustreError: 276508:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae29c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 05:26:54 sh-103-53.int kernel: LustreError: 276508:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 05:26:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 05:26:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 05:32:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 05:32:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 05:32:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572179220, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0330d80/0x51ab3c4ee6964ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac17813ca31 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 05:32:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 05:33:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 05:33:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 05:34:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 05:34:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 05:36:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 05:36:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 05:37:06 sh-103-53.int kernel: LustreError: 277081:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 05:37:06 sh-103-53.int kernel: LustreError: 277081:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 05:37:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 05:37:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 05:42:14 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 05:42:14 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 05:42:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572179834, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7980/0x51ab3c4ee696537 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac18e7ba8e4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 05:42:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 05:43:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 05:43:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 27 05:44:42 sh-103-53.int kernel: LNetError: 
91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 05:44:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 27 05:47:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 05:47:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 05:47:23 sh-103-53.int kernel: LustreError: 277656:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 05:47:23 sh-103-53.int kernel: LustreError: 277656:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 05:47:23 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 05:47:23 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 05:52:09 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 05:52:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 05:52:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 05:52:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572180449, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea5a00/0x51ab3c4ee69656f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ac19725bef7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 05:52:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 05:54:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 05:54:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 05:55:02 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 05:55:02 sh-103-53.int kernel: LNetError: 257067:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 05:57:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 05:57:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 05:57:39 sh-103-53.int kernel: LustreError: 278237:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feeda40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 05:57:39 sh-103-53.int kernel: LustreError: 278237:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 05:57:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 05:57:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 06:02:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 06:02:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 06:02:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572181068, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea1b00/0x51ab3c4ee6965a7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac19de20dba expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 06:02:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 06:04:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 06:04:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 06:06:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 06:06:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 06:07:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 06:07:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 06:07:54 sh-103-53.int kernel: LustreError: 278831:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 06:07:54 sh-103-53.int kernel: LustreError: 278831:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 06:07:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 06:07:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 06:13:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 06:13:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 06:13:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572181680, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0000/0x51ab3c4ee6965df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1ad1bf5dd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 06:13:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 06:14:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 06:14:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 06:17:12 sh-103-53.int kernel: LNetError: 
91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 06:17:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 06:17:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 06:17:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 06:18:07 sh-103-53.int kernel: LustreError: 279404:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 06:18:07 sh-103-53.int kernel: LustreError: 279404:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 06:18:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 06:18:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 06:23:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 06:23:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 06:23:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572182296, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2ac0/0x51ab3c4ee696617 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1b438710c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 06:23:16 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 06:24:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 06:24:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 06:27:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 06:27:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 06:27:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 06:27:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 06:28:22 sh-103-53.int kernel: LustreError: 279981:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51a40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 06:28:22 sh-103-53.int kernel: LustreError: 279981:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 06:28:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 06:28:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 06:33:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 06:33:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 06:33:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572182910, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2880/0x51ab3c4ee69664f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1c116d16e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 06:33:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 06:34:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 06:34:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 27 06:36:49 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 06:37:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 06:37:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 06:37:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 06:37:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 06:38:39 sh-103-53.int kernel: LustreError: 280555:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 06:38:39 sh-103-53.int kernel: LustreError: 280555:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 06:38:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 06:38:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 06:43:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 06:43:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 06:43:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572183527, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef5100/0x51ab3c4ee696687 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1d8afa43b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 06:43:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 06:44:51 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 06:44:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 06:48:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 06:48:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 06:48:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 06:48:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 06:48:53 sh-103-53.int kernel: LustreError: 281129:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51a40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 06:48:53 sh-103-53.int kernel: LustreError: 281129:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 06:48:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 06:48:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 06:53:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 06:53:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 06:53:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572184139, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef5100/0x51ab3c4ee6966bf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1e4ae13d1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 06:53:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 06:54:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 06:54:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 06:58:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 06:58:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 06:59:04 sh-103-53.int kernel: LustreError: 281711:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e652969c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 06:59:04 sh-103-53.int kernel: LustreError: 281711:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 06:59:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 06:59:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 06:59:05 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 06:59:05 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 07:04:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 07:04:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 07:04:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572184755, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd0fc0/0x51ab3c4ee6966f7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1eb5e79c7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 07:04:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 07:04:58 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 07:04:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 27 07:08:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 07:08:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 07:09:24 sh-103-53.int kernel: LustreError: 282305:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 07:09:24 sh-103-53.int kernel: LustreError: 282305:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 07:09:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 07:09:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 07:11:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 07:11:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 27 07:14:32 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 07:14:32 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 07:14:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572185372, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd0b40/0x51ab3c4ee69672f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac1fcfbe7b8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 07:14:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 07:15:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 07:15:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 07:15:29 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 07:18:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 07:18:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 07:19:40 sh-103-53.int kernel: LustreError: 282880:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 07:19:40 sh-103-53.int kernel: LustreError: 282880:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 07:19:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 07:19:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 07:21:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 07:21:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 27 07:24:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 07:24:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 07:24:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572185990, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd3cc0/0x51ab3c4ee696767 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2031f27df expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 07:24:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 07:25:18 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 07:25:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 07:28:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 07:28:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 07:30:00 sh-103-53.int kernel: LustreError: 283468:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 07:30:00 sh-103-53.int kernel: LustreError: 283468:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 07:30:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 07:30:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 07:32:33 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 07:32:33 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 27 07:35:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 07:35:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 07:35:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572186608, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd7080/0x51ab3c4ee69679f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2078f16c4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 07:35:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 07:35:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 07:35:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 27 07:38:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 07:38:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 07:40:13 sh-103-53.int kernel: LustreError: 284043:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0d8315c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 07:40:13 sh-103-53.int kernel: LustreError: 284043:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 07:40:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 07:40:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 07:42:48 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 07:42:48 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 07:45:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 07:45:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 07:45:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572187219, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6e9c0/0x51ab3c4ee6967d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac21ef2d579 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 07:45:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 07:45:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 07:45:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 07:49:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 07:49:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 07:50:26 sh-103-53.int kernel: LustreError: 284620:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 07:50:26 sh-103-53.int kernel: LustreError: 284620:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 07:50:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 07:50:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 07:52:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 07:52:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 27 07:55:32 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 07:55:32 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 07:55:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572187832, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69ca40/0x51ab3c4ee69680f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac235272b24 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 07:55:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 07:55:53 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 07:55:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 27 07:59:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 07:59:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 08:00:38 sh-103-53.int kernel: LustreError: 285191:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 08:00:38 sh-103-53.int kernel: LustreError: 285191:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 08:00:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 08:00:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 08:04:04 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 08:04:04 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 08:05:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 08:05:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 08:05:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572188444, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69c140/0x51ab3c4ee696847 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac23eeb3a6b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 08:05:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 08:06:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 08:06:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 08:09:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 08:09:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 08:10:53 sh-103-53.int kernel: LustreError: 285784:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcce40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 08:10:53 sh-103-53.int kernel: LustreError: 285784:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 08:10:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 08:10:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 08:13:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 08:14:19 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 08:14:19 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 08:16:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 08:16:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 08:16:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572189061, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69a1c0/0x51ab3c4ee69687f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac244097c33 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 08:16:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 08:16:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 08:16:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 08:19:33 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 08:19:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 08:21:08 sh-103-53.int kernel: LustreError: 286361:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 08:21:08 sh-103-53.int kernel: LustreError: 286361:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 08:21:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 08:21:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 08:24:34 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 08:24:34 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 08:26:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 08:26:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 08:26:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 08:26:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 08:26:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572189678, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc698240/0x51ab3c4ee6968b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac254afafc7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 08:26:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 08:27:39 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 08:29:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 08:29:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 08:31:26 sh-103-53.int kernel: LustreError: 286939:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccc00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 08:31:26 sh-103-53.int kernel: LustreError: 286939:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 08:31:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 08:31:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 08:35:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 08:35:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 08:36:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 08:36:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 27 08:36:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 08:36:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 08:36:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572190296, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69b3c0/0x51ab3c4ee6968ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ac25ba6c358 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 08:36:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 08:39:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 08:39:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 08:41:47 sh-103-53.int kernel: LustreError: 287521:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccf00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 08:41:47 sh-103-53.int kernel: LustreError: 287521:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 08:41:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 08:41:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 08:46:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 08:46:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 08:46:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 08:46:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 08:46:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 08:46:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 08:46:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572190915, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69b180/0x51ab3c4ee696927 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac268d0357d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 08:46:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 08:49:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 08:49:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 08:52:04 sh-103-53.int kernel: LustreError: 288098:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 08:52:04 sh-103-53.int kernel: LustreError: 288098:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 08:52:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 08:52:04 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 08:55:05 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 08:56:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 08:56:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 08:57:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 08:57:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 08:57:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 08:57:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 08:57:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572191530, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea3600/0x51ab3c4ee69695f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac28055bf59 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 08:57:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 09:00:09 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 09:00:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 09:02:16 sh-103-53.int kernel: LustreError: 288688:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 09:02:16 sh-103-53.int kernel: LustreError: 288688:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 09:02:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 09:02:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 09:06:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 09:06:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 09:07:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 09:07:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 75 previous similar messages Oct 27 09:07:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 09:07:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 09:07:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572192147, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef0d80/0x51ab3c4ee696997 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac28f3156ae expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 09:07:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 09:10:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 09:10:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 09:12:36 sh-103-53.int kernel: LustreError: 289267:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 09:12:36 sh-103-53.int kernel: LustreError: 289267:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 09:12:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 09:12:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 09:17:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 09:17:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 09:17:14 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 09:17:14 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 09:17:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 09:17:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 09:17:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572192766, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd57c0/0x51ab3c4ee6969cf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac294070b15 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 09:17:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 09:20:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 09:20:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 09:22:53 sh-103-53.int kernel: LustreError: 289849:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 09:22:53 sh-103-53.int kernel: LustreError: 289849:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 09:22:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 09:22:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 09:27:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 09:27:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 27 09:27:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 09:27:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 09:27:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 09:27:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 09:27:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572193378, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6dc40/0x51ab3c4ee696a07 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2a2da1634 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 09:27:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 09:30:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 09:30:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 09:33:06 sh-103-53.int kernel: LustreError: 290423:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f362b5c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 09:33:06 sh-103-53.int kernel: LustreError: 290423:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 09:33:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 09:33:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 09:37:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 09:37:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 09:38:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 09:38:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 09:38:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 09:38:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 09:38:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572193996, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6cec0/0x51ab3c4ee696a3f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2a8ca0147 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 09:38:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 09:40:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 09:40:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 09:43:25 sh-103-53.int kernel: LustreError: 291000:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0d8318c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 09:43:25 sh-103-53.int kernel: LustreError: 291000:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 09:43:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 09:43:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 09:47:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 09:47:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 09:48:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 09:48:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 09:48:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572194614, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69a880/0x51ab3c4ee696a77 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2ab257fd8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 09:48:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 09:49:49 sh-103-53.int kernel: LNetError: 
262403:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 09:49:49 sh-103-53.int kernel: LNetError: 262403:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 27 09:51:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 09:51:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 09:53:41 sh-103-53.int kernel: LustreError: 291575:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 09:53:41 sh-103-53.int kernel: LustreError: 291575:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 09:53:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 09:53:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 09:57:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 09:57:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 09:58:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 09:58:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 09:58:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572195230, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0337bc0/0x51ab3c4ee696aaf lrc: 4/1,0 
mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2c012ef56 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 09:58:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 09:59:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 09:59:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 10:01:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 10:01:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 10:03:57 sh-103-53.int kernel: LustreError: 292169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae23c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 10:03:57 sh-103-53.int kernel: LustreError: 292169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 10:03:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 10:03:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 10:07:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 10:07:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 10:09:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 10:09:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 10:09:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572195847, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0335340/0x51ab3c4ee696ae7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2d16b312a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 10:09:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 10:10:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 10:10:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 10:11:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 10:11:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 10:14:17 sh-103-53.int kernel: LustreError: 292763:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 10:14:17 sh-103-53.int kernel: LustreError: 292763:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 10:14:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 10:14:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 10:14:26 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 10:18:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 10:18:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 10:19:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 10:19:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 10:19:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572196465, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0334c80/0x51ab3c4ee696b1f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2de4a876c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 10:19:25 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 10:21:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 10:21:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 27 10:21:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 10:21:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 10:24:35 sh-103-53.int kernel: LustreError: 293345:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2a80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 10:24:35 sh-103-53.int kernel: LustreError: 293345:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 10:24:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 10:24:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 10:27:36 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 10:28:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 10:28:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 27 10:29:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 10:29:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar 
message Oct 27 10:29:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572197084, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a03369c0/0x51ab3c4ee696b57 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2ea0ecd06 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 10:29:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 10:31:25 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 10:31:25 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 10:31:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 10:31:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 10:34:50 sh-103-53.int kernel: LustreError: 293919:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae2b40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 10:34:50 sh-103-53.int kernel: LustreError: 293919:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 10:34:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 10:34:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 10:38:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 10:38:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 10:39:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 10:39:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 10:39:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572197697, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54800/0x51ab3c4ee696b8f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac2f33c6bff expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 10:39:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 10:41:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 10:41:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 10:41:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 10:41:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 10:45:05 sh-103-53.int kernel: LustreError: 294506:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5c80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 10:45:05 sh-103-53.int kernel: LustreError: 294506:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 10:45:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 10:45:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 10:48:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 10:48:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 27 10:50:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 10:50:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 10:50:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572198313, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52d00/0x51ab3c4ee696bc7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac306a365f3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 10:50:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 10:51:54 sh-103-53.int kernel: LNetError: 
91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 10:51:54 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 10:52:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 10:52:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 10:55:21 sh-103-53.int kernel: LustreError: 295083:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 10:55:21 sh-103-53.int kernel: LustreError: 295083:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 10:55:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 10:55:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 10:58:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 10:58:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 27 11:00:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 11:00:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 11:00:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572198931, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52d00/0x51ab3c4ee696bff lrc: 4/1,0 mode: 
--/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac30f5f1504 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 11:00:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 11:02:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 11:02:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 11:02:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 11:02:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 27 11:05:37 sh-103-53.int kernel: LustreError: 295675:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b58c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 11:05:37 sh-103-53.int kernel: LustreError: 295675:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 11:05:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 11:05:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 11:09:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 11:09:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 11:10:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 11:10:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 11:10:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572199543, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54a40/0x51ab3c4ee696c37 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3224be097 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 11:10:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 11:12:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 11:12:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 11:13:15 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 11:13:15 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 11:15:48 sh-103-53.int kernel: LustreError: 296250:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 11:15:48 sh-103-53.int kernel: LustreError: 296250:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 11:15:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 11:15:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 11:19:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 11:19:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 11:19:27 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 11:20:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 11:20:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 11:20:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572200155, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf56540/0x51ab3c4ee696c6f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac335064d90 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 11:20:55 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 11:22:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 11:22:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 11:23:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 11:23:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 11:26:01 sh-103-53.int kernel: LustreError: 296823:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4c00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 11:26:01 sh-103-53.int kernel: LustreError: 296823:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 11:26:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 11:26:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 11:29:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 11:29:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 27 11:31:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 11:31:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 11:31:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 
1572200770, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea70fc0/0x51ab3c4ee696ca7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac33e419071 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 11:31:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 11:32:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 11:32:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 11:34:36 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 11:34:36 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 27 11:36:15 sh-103-53.int kernel: LustreError: 297398:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feecfc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 11:36:15 sh-103-53.int kernel: LustreError: 297398:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 11:36:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 11:36:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 11:39:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 27 11:39:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 11:41:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 11:41:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 11:41:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572201382, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea1d40/0x51ab3c4ee696cdf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac346b33dfb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 11:41:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 11:42:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 11:42:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 11:44:46 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 11:44:46 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 11:46:32 sh-103-53.int kernel: LustreError: 297977:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed500) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 11:46:32 sh-103-53.int kernel: LustreError: 297977:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 11:46:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 11:46:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 11:49:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 11:49:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 11:51:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 11:51:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 11:51:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572202001, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea21c0/0x51ab3c4ee696d17 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac35559bcb5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 11:51:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 11:53:01 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 11:53:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 11:55:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 11:55:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 11:56:51 sh-103-53.int kernel: LustreError: 298554:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 11:56:51 sh-103-53.int kernel: LustreError: 298554:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 11:56:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 11:56:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 11:59:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 11:59:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 27 12:02:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 12:02:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 12:02:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572202621, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef3180/0x51ab3c4ee696d4f lrc: 4/1,0 
mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac35dcd3800 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 12:02:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 12:03:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 12:03:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 12:07:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 12:07:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 12:07:09 sh-103-53.int kernel: LustreError: 299150:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 12:07:09 sh-103-53.int kernel: LustreError: 299150:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 12:07:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 12:07:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 12:09:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 12:09:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 27 12:12:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 12:12:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 12:12:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572203235, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef45c0/0x51ab3c4ee696d87 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac363c73b6a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 12:12:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 12:13:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 12:13:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 12:17:18 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 12:17:18 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 27 12:17:20 sh-103-53.int kernel: LustreError: 299722:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f50540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 12:17:20 sh-103-53.int kernel: LustreError: 299722:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 12:17:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 12:17:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 12:20:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 12:20:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 12:22:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 12:22:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 12:22:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572203846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd3a80/0x51ab3c4ee696dbf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac37501c5aa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 12:22:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 12:23:32 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 12:23:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 12:27:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 12:27:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 12:27:34 sh-103-53.int kernel: LustreError: 300294:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 12:27:34 sh-103-53.int kernel: LustreError: 300294:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 12:27:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 12:27:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 12:30:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 12:30:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 12:32:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 12:32:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 12:32:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572204462, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd18c0/0x51ab3c4ee696df7 lrc: 4/1,0 
mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac37db685cb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 12:32:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 12:33:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 12:33:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 12:37:49 sh-103-53.int kernel: LustreError: 300872:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977dc76643c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 12:37:49 sh-103-53.int kernel: LustreError: 300872:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 12:37:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 12:37:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 12:38:39 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 12:38:39 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 27 12:40:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 12:40:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 27 12:42:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 12:42:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 12:42:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572205075, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6aac0/0x51ab3c4ee696e2f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac39036776f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 12:42:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 12:43:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 12:43:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 12:48:02 sh-103-53.int kernel: LustreError: 301447:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f362b2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 12:48:02 sh-103-53.int kernel: LustreError: 301447:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 12:48:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 12:48:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 12:49:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 12:49:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 12:50:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 12:50:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 27 12:53:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 12:53:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 12:53:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572205688, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699440/0x51ab3c4ee696e67 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac39feb1a55 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 12:53:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 12:54:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 12:54:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 12:58:15 sh-103-53.int kernel: LustreError: 302021:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 12:58:15 sh-103-53.int kernel: LustreError: 302021:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 12:58:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 12:58:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 12:59:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 12:59:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 13:00:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 13:00:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 13:03:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 13:03:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 13:03:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572206303, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69f740/0x51ab3c4ee696e9f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ac3b7365c21 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 13:03:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 13:04:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 13:04:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 13:08:32 sh-103-53.int kernel: LustreError: 302613:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcce40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 13:08:32 sh-103-53.int kernel: LustreError: 302613:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 13:08:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 13:08:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 13:11:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 13:11:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 13:11:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 13:11:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 13:13:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 13:13:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 13:13:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572206919, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69d100/0x51ab3c4ee696ed7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3bf738edd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 13:13:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 13:14:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 13:14:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 13:18:46 sh-103-53.int kernel: LustreError: 303187:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccc00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 13:18:46 sh-103-53.int kernel: LustreError: 303187:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 13:18:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 13:18:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 13:21:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 13:21:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 13:21:17 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 13:21:17 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 13:23:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 13:23:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 13:23:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572207535, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc6998c0/0x51ab3c4ee696f0f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3c3b9104d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 13:23:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 13:24:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 13:24:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 13:29:02 sh-103-53.int kernel: LustreError: 303764:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 13:29:02 sh-103-53.int kernel: LustreError: 303764:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 13:29:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 13:29:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 13:31:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 27 13:31:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 27 13:31:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 13:31:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 13:34:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 13:34:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 13:34:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572208150, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699680/0x51ab3c4ee696f47 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3c88f4ae1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 13:34:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 13:34:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 13:34:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 13:39:17 sh-103-53.int kernel: LustreError: 304345:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 13:39:17 sh-103-53.int kernel: LustreError: 304345:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 13:39:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 13:39:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 13:41:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 13:41:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 13:42:43 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 13:42:43 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 13:44:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 13:44:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 13:44:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572208764, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc698480/0x51ab3c4ee696f7f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3ce1d93c5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 13:44:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 13:44:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 13:44:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 13:46:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 13:49:35 sh-103-53.int kernel: LustreError: 304922:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 13:49:35 sh-103-53.int kernel: LustreError: 304922:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 13:49:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 13:49:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 13:51:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 13:51:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 13:53:33 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 13:53:33 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 13:54:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 13:54:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 13:54:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572209383, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54140/0x51ab3c4ee696fb7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3d505188e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 13:54:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 13:55:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 13:55:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 13:59:51 sh-103-53.int kernel: LustreError: 305496:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b43c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 13:59:51 sh-103-53.int kernel: LustreError: 305496:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 13:59:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 13:59:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 14:01:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 27 14:01:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 14:04:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 14:04:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 14:05:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 14:05:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 14:05:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572210001, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf56780/0x51ab3c4ee696fef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3dbc23b1f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 14:05:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 14:05:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 14:05:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 14:10:07 sh-103-53.int kernel: LustreError: 306093:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5500) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 14:10:07 sh-103-53.int kernel: LustreError: 306093:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 14:10:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 14:10:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 14:11:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 14:11:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 14:15:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 14:15:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 27 14:15:17 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 14:15:17 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 14:15:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572210617, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf56e40/0x51ab3c4ee697027 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3e11d6169 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 14:15:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 14:15:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 14:15:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 14:19:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 14:20:26 sh-103-53.int kernel: LustreError: 306680:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 14:20:26 sh-103-53.int kernel: LustreError: 306680:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 14:20:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 14:20:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 14:21:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 14:21:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 14:23:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 14:25:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 14:25:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 14:25:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 14:25:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 14:25:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572211236, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf56c00/0x51ab3c4ee69705f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3e6478a8b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 14:25:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 14:27:23 sh-103-53.int 
kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 14:27:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 14:30:44 sh-103-53.int kernel: LustreError: 307258:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed500) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 14:30:44 sh-103-53.int kernel: LustreError: 307258:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 14:30:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 14:30:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 14:31:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 14:31:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 27 14:35:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 14:35:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 14:35:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 14:35:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 14:35:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572211849, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea3cc0/0x51ab3c4ee697097 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3ea8e2608 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 14:35:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 14:37:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 14:37:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 14:40:57 sh-103-53.int kernel: LustreError: 307831:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feece40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 14:40:57 sh-103-53.int kernel: LustreError: 307831:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 14:40:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 14:40:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 14:42:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 14:42:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 14:45:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 14:45:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 14:46:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 14:46:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 14:46:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572212466, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea18c0/0x51ab3c4ee6970cf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3ef2cdecd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 14:46:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 14:48:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 14:48:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 14:51:14 sh-103-53.int kernel: LustreError: 308409:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feecd80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 14:51:14 sh-103-53.int kernel: LustreError: 308409:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 14:51:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 14:51:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 14:52:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 14:52:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 14:55:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 14:55:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 14:56:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 14:56:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 14:56:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572213082, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4a40/0x51ab3c4ee697107 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac3f4949abb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 14:56:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 14:58:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 14:58:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 15:01:29 sh-103-53.int kernel: LustreError: 309001:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec3c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 15:01:29 sh-103-53.int kernel: LustreError: 309001:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 15:01:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 15:01:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 15:02:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 15:02:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 27 15:06:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 15:06:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 15:06:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 15:06:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 15:06:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572213698, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea5c40/0x51ab3c4ee69713f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4063b599f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 15:06:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 15:09:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 15:09:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 15:11:47 sh-103-53.int kernel: LustreError: 309580:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feedbc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 15:11:47 sh-103-53.int kernel: LustreError: 309580:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 15:11:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 15:11:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 15:12:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 15:12:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 27 15:16:24 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 15:16:24 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 15:16:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 15:16:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 15:16:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572214315, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7500/0x51ab3c4ee697177 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac40ef6ba05 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 15:16:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 15:19:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 15:19:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 15:22:03 sh-103-53.int kernel: LustreError: 310154:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feeda40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 15:22:03 sh-103-53.int kernel: LustreError: 310154:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 15:22:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 15:22:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 15:22:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 15:22:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 15:26:34 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 15:26:34 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 15:27:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 15:27:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 15:27:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572214933, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7980/0x51ab3c4ee6971af lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac413e46eb3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 15:27:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 15:31:29 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 15:31:29 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 27 15:32:25 sh-103-53.int kernel: LustreError: 310734:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 15:32:25 sh-103-53.int kernel: LustreError: 310734:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 15:32:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 15:32:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 15:32:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 15:32:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 15:36:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 15:36:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 15:37:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 15:37:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 15:37:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572215555, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee6971e7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac41822ff7a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 15:37:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 15:42:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 15:42:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 15:42:45 sh-103-53.int kernel: LustreError: 311312:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feecc00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 15:42:45 sh-103-53.int kernel: LustreError: 311312:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 15:42:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 15:42:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 15:43:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 15:43:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 15:46:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 15:46:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 15:47:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 15:47:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 15:47:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572216171, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4c80/0x51ab3c4ee69721f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac41c40c14b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 15:47:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 15:52:57 sh-103-53.int kernel: LustreError: 311884:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed2c0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 15:52:57 sh-103-53.int kernel: LustreError: 311884:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 15:52:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 15:52:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 15:52:57 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 15:52:57 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 15:53:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 27 15:53:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 27 15:57:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 15:57:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 15:58:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 15:58:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 15:58:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572216788, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4140/0x51ab3c4ee697257 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac420575ea2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 15:58:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 16:03:16 sh-103-53.int kernel: LustreError: 312481:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feeda40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 16:03:16 sh-103-53.int kernel: LustreError: 312481:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 16:03:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 16:03:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 16:04:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 16:04:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 16:04:35 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 16:04:35 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 16:07:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 16:07:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 16:08:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 16:08:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 16:08:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572217407, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2640/0x51ab3c4ee69728f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4261ce5a1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 16:08:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 16:12:16 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 16:13:33 sh-103-53.int kernel: LustreError: 313060:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 16:13:33 sh-103-53.int kernel: LustreError: 313060:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 16:13:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 16:13:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 16:14:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 16:14:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 16:14:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 16:14:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 16:17:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 16:17:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 16:18:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 16:18:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 16:18:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572218021, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0480/0x51ab3c4ee6972c7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac43777aae9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 16:18:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 16:23:49 sh-103-53.int kernel: LustreError: 313634:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec480) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 16:23:49 sh-103-53.int kernel: LustreError: 313634:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 16:23:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 16:23:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 16:24:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 16:24:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 16:25:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 16:25:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 16:27:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 16:27:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 16:28:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 16:28:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 16:28:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572218634, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2880/0x51ab3c4ee6972ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac43fab4f8e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 16:28:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 16:34:01 sh-103-53.int kernel: LustreError: 314205:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feeccc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 16:34:01 sh-103-53.int kernel: LustreError: 314205:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 16:34:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 16:34:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 16:34:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 16:34:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 16:35:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 16:35:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 27 16:37:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 16:37:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 16:39:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 16:39:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 16:39:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572219250, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0480/0x51ab3c4ee697337 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac445c72240 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 16:39:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 16:44:16 sh-103-53.int kernel: LustreError: 314780:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feecd80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 16:44:16 sh-103-53.int kernel: LustreError: 314780:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 16:44:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 16:44:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 16:44:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 16:44:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 16:45:36 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 16:45:36 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 16:47:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 16:47:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 16:49:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 16:49:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 16:49:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572219862, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea33c0/0x51ab3c4ee69736f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4553b3b20 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 16:49:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 16:54:31 sh-103-53.int kernel: LustreError: 315354:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feede00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 16:54:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 16:54:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 16:54:31 sh-103-53.int kernel: LustreError: 315354:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 16:54:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 16:54:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 16:55:51 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 16:55:51 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 16:58:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 16:58:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 16:59:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 16:59:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 16:59:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572220480, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4140/0x51ab3c4ee6973a7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac45b8463a8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 16:59:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 17:04:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 17:04:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 17:04:49 sh-103-53.int kernel: LustreError: 315951:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed740) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 17:04:49 sh-103-53.int kernel: LustreError: 315951:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 17:04:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 17:04:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 17:07:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 17:07:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 17:08:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 17:08:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 17:09:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 17:09:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 17:09:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572221098, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4c80/0x51ab3c4ee6973df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4617a5983 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 17:09:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 17:14:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 17:14:52 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 17:15:06 sh-103-53.int kernel: LustreError: 316539:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed080) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 17:15:06 sh-103-53.int kernel: LustreError: 316539:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 17:15:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 17:15:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 17:17:14 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 17:17:14 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 17:18:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 17:18:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 17:20:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 17:20:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 17:20:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572221712, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee697417 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac46764a3d8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 17:20:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 17:25:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 17:25:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 17:25:18 sh-103-53.int kernel: LustreError: 317110:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feedd40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 17:25:18 sh-103-53.int kernel: LustreError: 317110:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 17:25:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 17:25:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 17:27:28 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 17:27:28 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 17:28:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 17:28:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 17:30:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 17:30:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 17:30:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572222326, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7980/0x51ab3c4ee69744f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac46d4fa5a0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 17:30:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 17:35:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 17:35:07 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 27 17:35:32 sh-103-53.int kernel: LustreError: 317684:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 17:35:32 sh-103-53.int kernel: LustreError: 317684:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 17:35:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 17:35:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 17:38:34 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 17:38:34 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 17:38:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 17:38:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 17:40:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 17:40:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 17:40:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572222941, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7980/0x51ab3c4ee697487 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4736c6fcf expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 17:40:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 17:43:49 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 17:45:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 17:45:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 17:45:50 sh-103-53.int kernel: LustreError: 318263:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed5c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 17:45:50 sh-103-53.int kernel: LustreError: 318263:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 17:45:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 17:45:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 17:48:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 17:48:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 17:49:17 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 17:49:17 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 17:50:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 17:50:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 17:50:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572223556, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee6974bf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac479908f08 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 17:50:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 17:55:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 17:55:27 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 27 17:56:03 sh-103-53.int kernel: LustreError: 318849:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feeca80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 17:56:03 sh-103-53.int kernel: LustreError: 318849:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 17:56:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 17:56:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 17:59:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 17:59:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 17:59:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 17:59:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 18:01:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 18:01:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 18:01:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572224172, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4c80/0x51ab3c4ee6974f7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac48c99991d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 18:01:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 18:05:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 18:05:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 27 18:06:21 sh-103-53.int kernel: LustreError: 319443:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feecc00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 18:06:21 sh-103-53.int kernel: LustreError: 319443:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 18:06:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 18:06:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 18:09:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 18:09:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 18:11:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 18:11:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 18:11:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 18:11:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 18:11:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572224791, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4140/0x51ab3c4ee69752f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac492b8c0e8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 18:11:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 18:15:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 18:15:41 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 18:16:41 sh-103-53.int kernel: LustreError: 320021:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 18:16:41 sh-103-53.int kernel: LustreError: 320021:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 18:16:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 18:16:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 18:19:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 18:19:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 18:21:22 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 18:21:22 sh-103-53.int kernel: LNetError: 291785:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 27 18:21:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 18:21:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 18:21:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 18:21:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572225410, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea33c0/0x51ab3c4ee697567 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac49a8bb762 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 18:21:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 18:25:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 18:25:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 18:26:59 sh-103-53.int kernel: LustreError: 320599:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed5c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 18:26:59 sh-103-53.int kernel: LustreError: 320599:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 18:26:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 18:26:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 18:29:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 18:29:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 18:32:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 18:32:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 18:32:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572226024, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea3180/0x51ab3c4ee69759f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4a02d6c4c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 18:32:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 18:33:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 18:33:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 18:35:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 18:35:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 18:37:12 sh-103-53.int kernel: LustreError: 321172:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feedb00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 18:37:12 sh-103-53.int kernel: LustreError: 321172:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 18:37:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 18:37:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 18:39:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 18:39:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 18:42:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 18:42:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 18:42:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572226640, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea2880/0x51ab3c4ee6975d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4b597cb79 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 18:42:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 18:43:39 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 18:43:39 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 18:46:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 18:46:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 76 previous similar messages Oct 27 18:47:31 sh-103-53.int kernel: LustreError: 321754:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 18:47:31 sh-103-53.int kernel: LustreError: 321754:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 18:47:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 18:47:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 18:49:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 18:49:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 18:52:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 18:52:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 18:52:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572227258, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef69c0/0x51ab3c4ee69760f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4bbb76bfd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 18:52:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 18:54:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 18:54:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 2 previous similar messages Oct 27 18:56:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 18:56:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 27 18:57:45 sh-103-53.int kernel: LustreError: 322328:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 18:57:45 sh-103-53.int kernel: LustreError: 322328:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 18:57:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 18:57:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 18:59:05 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 19:00:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 19:00:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 19:02:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 19:02:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 19:02:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572227873, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef72c0/0x51ab3c4ee697647 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4c1bb0cce expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 19:02:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 19:05:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 19:05:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 19:06:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 19:06:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 19:08:03 sh-103-53.int kernel: LustreError: 322923:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f50cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 19:08:03 sh-103-53.int kernel: LustreError: 322923:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 19:08:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 19:08:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 19:10:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 19:10:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 19:12:13 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 19:13:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 19:13:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 19:13:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572228490, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef4140/0x51ab3c4ee69767f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4c7bed29e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 19:13:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 19:16:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 19:16:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 19:16:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 19:16:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 19:18:21 sh-103-53.int kernel: LustreError: 323501:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 19:18:21 sh-103-53.int kernel: LustreError: 323501:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 19:18:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 19:18:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 19:20:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 19:20:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 19:21:23 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 19:23:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 19:23:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 19:23:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572229110, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef2d00/0x51ab3c4ee6976b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4ce5af539 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 19:23:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 19:26:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 19:26:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 27 19:26:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 19:26:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 19:28:40 sh-103-53.int kernel: LustreError: 324079:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51140) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 19:28:40 sh-103-53.int kernel: LustreError: 324079:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 19:28:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 19:28:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 19:30:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 19:30:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 19:33:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 19:33:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 19:33:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572229727, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd5100/0x51ab3c4ee6976ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4d4e01b58 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 19:33:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 19:37:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 19:37:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 19:38:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 19:38:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 19:38:58 sh-103-53.int kernel: LustreError: 324658:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296b40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 19:38:58 sh-103-53.int kernel: LustreError: 324658:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 19:38:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 19:38:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 19:40:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 19:40:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 19:44:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 19:44:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 19:44:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572230348, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd6c00/0x51ab3c4ee697727 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4dbac300a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 19:44:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 19:47:03 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 19:47:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 19:48:29 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 19:48:29 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 19:48:50 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 19:49:17 sh-103-53.int kernel: LustreError: 325237:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977dc766a900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 19:49:17 sh-103-53.int kernel: LustreError: 325237:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 19:49:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 19:49:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 19:50:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 19:50:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 19:54:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 19:54:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 19:54:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572230964, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e69680/0x51ab3c4ee69775f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4e26765a7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 19:54:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 19:57:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 19:57:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 27 19:59:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 19:59:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 15 previous similar messages Oct 27 19:59:30 sh-103-53.int kernel: LustreError: 325810:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 19:59:30 sh-103-53.int kernel: LustreError: 325810:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 19:59:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 19:59:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 20:01:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 20:01:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 20:04:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 20:04:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 20:04:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572231577, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69ba80/0x51ab3c4ee697797 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4e932d419 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 20:04:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 20:07:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 20:07:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 27 20:09:43 sh-103-53.int kernel: LustreError: 326400:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccc00) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 20:09:43 sh-103-53.int kernel: LustreError: 326400:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 20:09:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 20:09:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 20:09:43 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 20:09:43 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 20:11:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 20:11:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 20:14:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 20:14:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 20:14:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572232193, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69af40/0x51ab3c4ee6977cf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4f0397ac4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 20:14:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 20:17:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed 
out tx for 10.9.0.31@o2ib4: 3 seconds Oct 27 20:17:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 27 20:18:20 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 20:20:00 sh-103-53.int kernel: LustreError: 326975:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 20:20:00 sh-103-53.int kernel: LustreError: 326975:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 20:20:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 20:20:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 20:20:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 20:20:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 20:21:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 20:21:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 20:25:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 20:25:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 20:25:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572232808, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a0333180/0x51ab3c4ee697807 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac4f719c14a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 20:25:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 20:27:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 20:27:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 27 20:30:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 20:30:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 20:30:14 sh-103-53.int kernel: LustreError: 327554:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bf2ae3980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 20:30:14 sh-103-53.int kernel: LustreError: 327554:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 20:30:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 20:30:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 20:31:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 20:31:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 20:35:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 20:35:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 20:35:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572233420, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52f40/0x51ab3c4ee69783f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac501fdf9ee expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 20:35:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 20:35:40 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 20:37:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 20:37:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 20:40:24 sh-103-53.int 
kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 20:40:24 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 27 20:40:29 sh-103-53.int kernel: LustreError: 328130:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4180) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 20:40:29 sh-103-53.int kernel: LustreError: 328130:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 20:40:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 20:40:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 20:41:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 20:41:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 20:45:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 20:45:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 20:45:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572234036, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea75100/0x51ab3c4ee697877 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac52dc09a81 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 20:45:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 20:47:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 20:47:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 20:50:39 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 20:50:39 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 27 20:50:45 sh-103-53.int kernel: LustreError: 328713:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1dc80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 20:50:45 sh-103-53.int kernel: LustreError: 328713:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 20:50:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 20:50:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 20:51:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 20:51:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 20:55:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 20:55:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 20:55:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572234652, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0480/0x51ab3c4ee6978af lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac55943987e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 20:55:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 20:57:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 20:57:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 21:00:55 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 21:00:55 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 27 21:01:00 sh-103-53.int kernel: LustreError: 329291:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feece40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 21:01:00 sh-103-53.int kernel: LustreError: 329291:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 21:01:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 21:01:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 21:02:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 21:02:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 21:06:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 21:06:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 21:06:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572235268, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef4ec0/0x51ab3c4ee6978e7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac577d9cccc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 21:06:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 21:07:06 sh-103-53.int kernel: LNetError: 
91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 21:07:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 21:07:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 21:11:17 sh-103-53.int kernel: LustreError: 329884:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218f51980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 21:11:17 sh-103-53.int kernel: LustreError: 329884:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 21:11:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 21:11:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 21:12:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 21:12:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 27 21:13:00 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 21:13:00 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 21:16:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 21:16:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 21:16:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572235884, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660ec00/0x51ab3c4ee69791f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac5993ee15f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 21:16:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 21:17:16 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 21:18:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 21:18:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 21:21:31 sh-103-53.int kernel: LustreError: 330465:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832e180) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 21:21:31 sh-103-53.int kernel: LustreError: 330465:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 21:21:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 21:21:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 21:22:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 21:22:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages Oct 27 21:23:17 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 21:23:17 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 21:26:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 21:26:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 21:26:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572236501, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c5340/0x51ab3c4ee697957 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac5c3cf79ac expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 21:26:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 21:28:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 21:28:05 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 21:31:48 sh-103-53.int kernel: LustreError: 331041:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 21:31:48 sh-103-53.int kernel: LustreError: 331041:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 21:31:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 21:31:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 21:32:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 21:32:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 27 21:34:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 21:34:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 27 21:36:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 21:36:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 21:36:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572237115, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb09d40/0x51ab3c4ee69798f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac5f3a5bd22 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 21:36:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 21:38:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 21:38:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 27 21:42:05 sh-103-53.int kernel: LustreError: 331618:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a86c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 21:42:05 sh-103-53.int kernel: LustreError: 331618:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 21:42:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 21:42:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 21:42:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 21:42:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 21:42:41 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 21:44:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 21:44:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 27 21:47:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 21:47:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 21:47:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572237736, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0ee40/0x51ab3c4ee6979c7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac61fd2b7c5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 21:47:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 
27 21:47:47 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 21:48:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 21:48:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 21:52:26 sh-103-53.int kernel: LustreError: 332197:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a9380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 21:52:26 sh-103-53.int kernel: LustreError: 332197:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 21:52:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 21:52:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 21:52:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 21:52:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 21:55:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 21:55:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 27 21:57:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 21:57:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 21:57:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572238355, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0f080/0x51ab3c4ee6979ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac64f2c05f1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 21:57:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 21:58:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 21:58:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 22:02:42 sh-103-53.int kernel: LustreError: 332789:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebc240) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 22:02:42 sh-103-53.int kernel: LustreError: 332789:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 22:02:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 22:02:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 22:03:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 22:03:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 22:05:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 22:05:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 27 22:07:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 22:07:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 22:07:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572238972, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c5a00/0x51ab3c4ee697a37 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac67b7c2b04 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 22:07:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 22:08:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 22:08:56 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 22:13:00 sh-103-53.int kernel: LustreError: 333369:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832e240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 22:13:00 sh-103-53.int kernel: LustreError: 333369:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 22:13:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 22:13:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 22:13:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 22:13:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 22:16:01 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 22:16:01 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 27 22:18:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 22:18:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 22:18:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572239590, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c6c00/0x51ab3c4ee697a6f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac6a75689ce expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 22:18:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 22:19:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 22:19:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 27 22:23:19 sh-103-53.int kernel: LustreError: 333949:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e146c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 22:23:19 sh-103-53.int kernel: LustreError: 333949:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 22:23:19 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 22:23:19 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 22:23:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 22:23:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 22:26:16 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 22:26:16 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 27 22:28:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 22:28:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 22:28:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572240206, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d612f40/0x51ab3c4ee697aa7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac6d6ed4dca expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 22:28:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 22:29:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 22:29:11 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 27 22:33:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 22:33:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 22:33:35 sh-103-53.int kernel: LustreError: 334523:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 22:33:35 sh-103-53.int kernel: LustreError: 334523:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 22:33:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 22:33:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 22:36:32 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 22:36:32 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 22:36:37 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 22:38:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 22:38:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 22:38:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572240826, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d614140/0x51ab3c4ee697adf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac7054c9895 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 22:38:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 22:39:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 22:39:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 27 22:43:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 27 22:43:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 27 22:43:58 sh-103-53.int kernel: LustreError: 335103:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 22:43:58 sh-103-53.int kernel: LustreError: 335103:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 22:43:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 22:43:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 22:47:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 22:47:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 22:49:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 22:49:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 22:49:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572241446, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d612880/0x51ab3c4ee697b17 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac733cb5bb8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 22:49:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 22:49:43 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 22:49:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 22:53:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 22:53:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 27 22:54:14 sh-103-53.int kernel: LustreError: 335681:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae5b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 22:54:14 sh-103-53.int kernel: LustreError: 335681:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 22:54:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 22:54:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 22:57:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 22:57:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 22:59:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 22:59:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 22:59:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572242061, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb08b40/0x51ab3c4ee697b4f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac75c32b410 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 22:59:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 22:59:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 27 seconds Oct 27 22:59:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 27 23:04:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 23:04:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 23:04:29 sh-103-53.int kernel: LustreError: 336274:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779579b980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 23:04:29 sh-103-53.int kernel: LustreError: 336274:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 23:04:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 23:04:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 23:07:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 23:07:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 23:09:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 23:09:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 23:09:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572242678, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d5a00/0x51ab3c4ee697b87 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac77e519bfb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 23:09:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 23:09:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 23:09:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 23:10:07 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 27 23:14:11 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 23:14:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 23:14:46 sh-103-53.int kernel: LustreError: 336848:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832e3c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 23:14:46 sh-103-53.int kernel: LustreError: 336848:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 23:14:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 23:14:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 23:18:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 23:18:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 27 23:19:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 23:19:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 23:19:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572243294, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c7500/0x51ab3c4ee697bbf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac7a047ec74 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 23:19:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 23:20:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 27 23:20:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 27 23:24:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 23:24:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 23:25:04 sh-103-53.int kernel: LustreError: 337430:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae5200) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 23:25:04 sh-103-53.int kernel: LustreError: 337430:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 23:25:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 23:25:04 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 23:28:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 23:28:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 23:30:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 23:30:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 27 23:30:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 23:30:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 23:30:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572243911, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626f740/0x51ab3c4ee697bf7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac7c28528f5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 23:30:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 23:34:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 27 23:34:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 27 23:35:22 sh-103-53.int kernel: LustreError: 338007:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae49c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 23:35:22 sh-103-53.int kernel: LustreError: 338007:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 23:35:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 23:35:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 23:39:14 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 27 23:39:14 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 27 23:40:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 23:40:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 27 23:40:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 23:40:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 23:40:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572244530, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626c5c0/0x51ab3c4ee697c2f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ac7dfba5006 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 23:40:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 23:44:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 23:44:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 23:45:39 sh-103-53.int kernel: LustreError: 338594:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae4b40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 27 23:45:39 sh-103-53.int kernel: LustreError: 338594:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 23:45:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 23:45:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 27 23:50:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 27 23:50:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 27 23:50:41 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 27 23:50:41 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 27 23:50:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 27 23:50:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 27 23:50:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572245149, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626f500/0x51ab3c4ee697c67 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac7fde2f4ab expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 27 23:50:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 27 23:54:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 27 23:54:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 27 23:55:58 sh-103-53.int kernel: LustreError: 339182:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae46c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 27 23:55:58 sh-103-53.int kernel: LustreError: 339182:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 27 23:55:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 27 23:55:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 00:00:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 00:00:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 00:00:42 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 00:00:42 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 00:01:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 00:01:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 00:01:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572245766, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0bf00/0x51ab3c4ee697c9f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac81ce9bf8e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 00:01:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 00:05:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 00:05:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 00:06:14 sh-103-53.int kernel: LustreError: 339777:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a95c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 00:06:14 sh-103-53.int kernel: LustreError: 339777:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 00:06:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 00:06:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 00:10:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 00:10:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 00:10:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 00:10:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 3 previous similar messages Oct 28 00:11:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 00:11:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 00:11:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572246382, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febf740/0x51ab3c4ee697cd7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac83a1570c2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 00:11:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 00:15:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 00:15:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 00:16:29 sh-103-53.int kernel: LustreError: 340353:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779579ad80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 00:16:29 sh-103-53.int kernel: LustreError: 340353:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 00:16:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 00:16:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 00:20:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 00:20:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 00:20:18 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 00:21:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 00:21:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 00:21:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 00:21:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 00:21:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572246997, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d4a40/0x51ab3c4ee697d0f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac855e31531 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 00:21:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 00:25:22 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 00:25:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 00:26:48 sh-103-53.int kernel: LustreError: 340943:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebc600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 00:26:48 sh-103-53.int kernel: LustreError: 340943:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 00:26:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 00:26:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 00:30:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 00:30:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 00:31:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 00:31:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 00:31:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572247615, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660f740/0x51ab3c4ee697d47 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac870c6e77a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 00:31:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 28 00:32:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 00:32:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 00:35:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 00:35:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 00:37:01 sh-103-53.int kernel: LustreError: 341517:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed1980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 00:37:01 sh-103-53.int kernel: LustreError: 341517:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 00:37:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 00:37:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 00:40:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 28 00:40:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 00:42:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 00:42:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 00:42:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572248228, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 
lock: ffff97743a7c72c0/0x51ab3c4ee697d7f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac8899cdc9b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 00:42:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 00:43:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 00:43:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 00:45:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 00:45:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 00:47:14 sh-103-53.int kernel: LustreError: 342091:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e155c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 00:47:14 sh-103-53.int kernel: LustreError: 342091:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 00:47:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 00:47:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 00:50:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 00:50:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 00:52:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 00:52:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 00:52:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572248844, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626ec00/0x51ab3c4ee697db7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac8a6919337 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 00:52:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 00:53:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 00:53:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 00:55:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 00:55:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 00:57:35 sh-103-53.int kernel: LustreError: 342668:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a9bc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 00:57:35 sh-103-53.int kernel: LustreError: 342668:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 00:57:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 00:57:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 01:00:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 01:00:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 01:02:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 01:02:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 01:02:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572249465, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0b600/0x51ab3c4ee697def lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac8c38515e8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 01:02:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 01:04:53 sh-103-53.int kernel: LNetError: 
91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 01:04:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 01:06:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 01:06:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 01:07:57 sh-103-53.int kernel: LustreError: 343268:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677d40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 01:07:57 sh-103-53.int kernel: LustreError: 343268:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 01:07:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 01:07:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 01:10:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 01:10:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 28 01:13:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 01:13:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 01:13:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572250090, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbf500/0x51ab3c4ee697e27 lrc: 4/1,0 mode: 
--/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac8de36dbc2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 01:13:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 01:15:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 01:15:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 01:16:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 01:16:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 01:18:21 sh-103-53.int kernel: LustreError: 343854:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebc900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 01:18:21 sh-103-53.int kernel: LustreError: 343854:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 01:18:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 01:18:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 01:20:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 01:20:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 01:23:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 01:23:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 01:23:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572250711, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626c800/0x51ab3c4ee697e5f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac8fd7d406d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 01:23:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 01:24:19 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 01:26:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 01:26:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 01:26:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 01:26:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 01:28:41 sh-103-53.int kernel: LustreError: 344431:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae4000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 01:28:41 sh-103-53.int kernel: LustreError: 344431:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 01:28:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 01:28:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 01:30:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 01:30:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 01:33:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 01:33:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 01:33:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572251329, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269680/0x51ab3c4ee697e97 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ac91d94d517 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 01:33:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 01:36:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 01:36:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 28 01:36:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 01:36:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 01:38:56 sh-103-53.int kernel: LustreError: 345006:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae4a80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 01:38:56 sh-103-53.int kernel: LustreError: 345006:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 01:38:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 01:38:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 01:40:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 01:40:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 01:42:40 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 01:44:04 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 01:44:04 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 01:44:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572251944, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626d100/0x51ab3c4ee697ecf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac93aec7a23 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 01:44:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 01:46:40 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 01:46:40 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 01:46:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 01:46:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 01:49:13 sh-103-53.int kernel: LustreError: 345583:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a9a40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 01:49:13 sh-103-53.int kernel: LustreError: 345583:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 01:49:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 01:49:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 01:51:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 28 01:51:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 01:54:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 01:54:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 01:54:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572252559, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb08fc0/0x51ab3c4ee697f07 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ac9587b1f46 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 01:54:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 01:56:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 01:56:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 01:57:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 01:57:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 01:59:36 sh-103-53.int kernel: LustreError: 346167:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677a40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 01:59:36 sh-103-53.int kernel: LustreError: 346167:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 01:59:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 01:59:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 02:01:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 02:01:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 28 02:04:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 02:04:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 02:04:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572253186, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbe540/0x51ab3c4ee697f3f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac976eaae33 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 02:04:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 02:07:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 02:07:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 02:09:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 02:09:53 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 28 02:09:54 sh-103-53.int kernel: LustreError: 346765:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebc540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 02:09:54 sh-103-53.int kernel: LustreError: 346765:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 02:09:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 02:09:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 02:12:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 02:12:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 28 02:15:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 02:15:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 02:15:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572253803, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660da00/0x51ab3c4ee697f77 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac9998705b0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 02:15:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 02:17:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) 
lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 02:17:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 02:20:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 02:20:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 02:20:13 sh-103-53.int kernel: LustreError: 347344:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e146c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 02:20:13 sh-103-53.int kernel: LustreError: 347344:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 02:20:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 02:20:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 02:22:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 28 02:22:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 28 02:25:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 02:25:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 02:25:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572254424, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d610900/0x51ab3c4ee697faf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 
0x1000000000000 nid: local remote: 0xc9be2ac9bc73d18e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 02:25:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 02:27:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 02:27:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 02:30:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 02:30:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 02:30:37 sh-103-53.int kernel: LustreError: 347929:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14b40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 02:30:37 sh-103-53.int kernel: LustreError: 347929:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 02:30:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 02:30:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 02:32:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 02:32:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 02:35:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 02:35:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 02:35:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572255047, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d614c80/0x51ab3c4ee697fe7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ac9e0ba409f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 02:35:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 02:37:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 02:37:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 28 02:40:25 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 02:40:25 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 02:40:54 sh-103-53.int kernel: LustreError: 348506:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 02:40:54 sh-103-53.int kernel: LustreError: 348506:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 02:40:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 02:40:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 02:42:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 02:42:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 02:46:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 02:46:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 02:46:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572255662, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d611b00/0x51ab3c4ee69801f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2aca03dc1bb0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 02:46:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 02:47:40 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 02:47:40 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 02:51:08 sh-103-53.int kernel: LustreError: 349078:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 02:51:08 sh-103-53.int kernel: LustreError: 349078:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 02:51:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 02:51:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 02:51:36 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 02:51:36 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 02:52:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 02:52:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 02:56:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 02:56:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 02:56:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572256275, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d613f00/0x51ab3c4ee698057 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2aca25d1bd37 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 02:56:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 02:57:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 02:57:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 03:01:25 sh-103-53.int kernel: LustreError: 349671:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a8780) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 03:01:25 sh-103-53.int kernel: LustreError: 349671:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 03:01:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 03:01:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 03:01:56 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 03:02:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 03:02:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 03:02:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 03:02:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 03:06:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 03:06:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 03:06:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572256893, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0ca40/0x51ab3c4ee69808f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2aca4d87b5ed expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 03:06:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 03:08:00 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 03:08:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 03:11:41 sh-103-53.int kernel: LustreError: 350258:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fcab6cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 03:11:41 sh-103-53.int kernel: LustreError: 350258:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 03:11:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 03:11:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 03:12:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 19 seconds Oct 28 03:12:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 03:13:00 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 03:13:00 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 03:16:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 03:16:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 03:16:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572257509, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febcc80/0x51ab3c4ee6980c7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2aca731cac1d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 03:16:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 03:18:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 03:18:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 03:21:58 sh-103-53.int kernel: LustreError: 350841:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebdd40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 03:21:58 sh-103-53.int kernel: LustreError: 350841:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 03:21:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 03:21:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 03:23:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 03:23:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 03:24:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 03:24:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 28 03:27:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 03:27:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 03:27:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572258123, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660d580/0x51ab3c4ee6980ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2aca99d7f590 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 03:27:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 03:28:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 03:28:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 03:32:12 sh-103-53.int kernel: LustreError: 351415:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed0840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 03:32:12 sh-103-53.int kernel: LustreError: 351415:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 03:32:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 03:32:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 03:33:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 03:33:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 03:34:20 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 03:34:20 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 03:37:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 03:37:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 03:37:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572258740, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660aac0/0x51ab3c4ee698137 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acac15f7d6e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 03:37:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 03:38:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 03:38:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 03:42:26 sh-103-53.int kernel: LustreError: 351992:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed0600) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 03:42:26 sh-103-53.int kernel: LustreError: 351992:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 03:42:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 03:42:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 03:43:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 03:43:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 03:44:37 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 03:44:37 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 03:44:38 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 03:47:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 03:47:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 03:47:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572259353, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c6540/0x51ab3c4ee69816f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acaeb4a6bb6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 03:47:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 03:48:41 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 03:48:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 03:52:40 sh-103-53.int kernel: LustreError: 352584:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14180) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 03:52:40 sh-103-53.int kernel: LustreError: 352584:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 03:52:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 03:52:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 03:53:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 03:53:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 03:55:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 03:55:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 03:57:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 03:57:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 03:57:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572259967, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d2f40/0x51ab3c4ee6981a7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acb1616cf82 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 03:57:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 03:58:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 03:58:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 04:02:55 sh-103-53.int kernel: LustreError: 353178:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832f680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 04:02:55 sh-103-53.int kernel: LustreError: 353178:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 04:02:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 04:02:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 04:03:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 04:03:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 04:05:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 04:05:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 04:08:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 04:08:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 04:08:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572260585, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c5a00/0x51ab3c4ee6981df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acb428a4959 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 04:08:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 04:09:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 04:09:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 04:13:15 sh-103-53.int kernel: LustreError: 353756:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e140c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 04:13:15 sh-103-53.int kernel: LustreError: 353756:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 04:13:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 04:13:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 04:13:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 04:13:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 04:15:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 04:15:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 28 04:18:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 04:18:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 04:18:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572261205, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e762698c0/0x51ab3c4ee698217 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acb6c8fe715 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 04:18:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 04:19:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 04:19:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 04:23:31 sh-103-53.int kernel: LustreError: 354331:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae5080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 04:23:31 sh-103-53.int kernel: LustreError: 354331:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 04:23:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 04:23:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 04:24:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 04:24:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 28 04:26:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 04:26:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 04:28:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 04:28:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 04:28:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572261822, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626cc80/0x51ab3c4ee69824f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acb98bba30e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 04:28:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 04:29:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 04:29:21 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 04:33:50 sh-103-53.int kernel: LustreError: 354909:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae5e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 04:33:50 sh-103-53.int kernel: LustreError: 354909:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 04:33:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 04:33:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 04:34:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 04:34:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 28 04:34:27 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 04:36:22 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 04:36:22 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 04:38:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 04:38:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 04:38:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572262436, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76268480/0x51ab3c4ee698287 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acbc18e6208 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 04:38:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 04:39:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 04:39:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 04:44:02 sh-103-53.int kernel: LustreError: 355482:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae5200) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 04:44:02 sh-103-53.int kernel: LustreError: 355482:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 04:44:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 04:44:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 04:44:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 04:44:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 28 04:47:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 04:47:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 04:49:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 04:49:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 04:49:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572263050, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626c140/0x51ab3c4ee6982bf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acbeb386b58 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 04:49:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 04:49:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 04:49:41 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 04:54:22 sh-103-53.int kernel: LustreError: 356061:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a9a40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 04:54:22 sh-103-53.int kernel: LustreError: 356061:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 04:54:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 04:54:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 04:54:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 04:54:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 04:58:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 04:58:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 04:59:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 04:59:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 04:59:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572263669, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0aac0/0x51ab3c4ee6982f7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acc16253218 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 04:59:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 04:59:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 04:59:51 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 28 04:59:52 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 05:02:57 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 05:04:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 05:04:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 05:04:38 sh-103-53.int kernel: LustreError: 356654:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121676a80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 05:04:38 sh-103-53.int kernel: LustreError: 356654:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 05:04:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 05:04:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 05:08:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 05:08:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 05:09:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 05:09:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 05:09:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572264288, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febd7c0/0x51ab3c4ee69832f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acc3f29383f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 05:09:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 05:10:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 05:10:01 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 05:14:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 05:14:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 75 previous similar messages Oct 28 05:14:57 sh-103-53.int kernel: LustreError: 357231:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebc0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 05:14:57 sh-103-53.int kernel: LustreError: 357231:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 05:14:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 05:14:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 05:20:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 05:20:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 3 previous similar messages Oct 28 05:20:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 05:20:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 05:20:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572264905, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76608240/0x51ab3c4ee698367 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acc678c727d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 05:20:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 05:20:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 05:20:11 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 05:24:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 05:24:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 05:25:15 sh-103-53.int kernel: LustreError: 357810:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed03c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 05:25:15 sh-103-53.int kernel: LustreError: 357810:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 05:25:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 05:25:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 05:30:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 05:30:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 05:30:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 05:30:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 05:30:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572265528, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c5580/0x51ab3c4ee69839f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acc936ce394 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 05:30:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 05:32:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 05:32:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 05:35:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 05:35:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 75 previous similar messages Oct 28 05:35:37 sh-103-53.int kernel: LustreError: 358393:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832f500) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 05:35:37 sh-103-53.int kernel: LustreError: 358393:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 05:35:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 05:35:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 05:39:33 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 05:40:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 05:40:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 05:40:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 05:40:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 05:40:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572266145, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c45c0/0x51ab3c4ee6983d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2accc5020e2a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 05:40:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 05:42:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 05:42:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 28 05:45:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 05:45:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 05:45:53 sh-103-53.int kernel: LustreError: 358971:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832ee40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 05:45:53 sh-103-53.int kernel: LustreError: 358971:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 05:45:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 05:45:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 05:50:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 05:50:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 05:51:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 05:51:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 05:51:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572266761, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d616c00/0x51ab3c4ee69840f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2accf4b1a5d9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 05:51:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 05:53:37 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 05:53:37 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 05:55:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 05:55:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 28 05:56:13 sh-103-53.int kernel: LustreError: 359551:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14e40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 05:56:13 sh-103-53.int kernel: LustreError: 359551:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 05:56:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 05:56:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 06:00:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 06:00:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 06:01:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 06:01:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 06:01:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572267384, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269440/0x51ab3c4ee698447 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd24c1bca4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 06:01:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 06:04:27 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 06:04:27 sh-103-53.int kernel: LNetError: 320937:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 06:05:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 06:05:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 06:06:36 sh-103-53.int kernel: LustreError: 360150:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff975af5ae4780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 06:06:36 sh-103-53.int kernel: LustreError: 360150:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 06:06:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 06:06:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 06:10:03 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 06:11:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 06:11:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 06:11:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 06:11:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 06:11:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572268001, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0af40/0x51ab3c4ee69847f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd4694a8f4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 06:11:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 06:14:49 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 06:14:49 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 28 06:15:08 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 06:15:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 06:15:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 28 06:16:48 sh-103-53.int kernel: LustreError: 360734:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a98c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 06:16:48 sh-103-53.int kernel: LustreError: 360734:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 06:16:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 06:16:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 06:21:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 06:21:12 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 06:21:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 06:21:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 06:21:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572268615, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d4c80/0x51ab3c4ee6984b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd4d1a6e4c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 06:21:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 06:25:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 28 06:25:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 06:26:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 06:26:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 28 06:27:01 sh-103-53.int kernel: LustreError: 361307:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebd380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 06:27:01 sh-103-53.int kernel: LustreError: 361307:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 06:27:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 06:27:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 06:31:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 06:31:22 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 06:32:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 06:32:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 06:32:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572269230, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d21c0/0x51ab3c4ee6984ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd523e6f41 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 06:32:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 06:35:42 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 06:35:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 06:36:22 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 06:36:22 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 06:37:21 sh-103-53.int kernel: LustreError: 361885:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed1b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 06:37:21 sh-103-53.int kernel: LustreError: 361885:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 06:37:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 06:37:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 06:41:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 06:41:32 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 28 06:42:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 06:42:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 06:42:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572269851, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c2880/0x51ab3c4ee698527 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd57b03945 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 06:42:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 06:42:57 sh-103-53.int kernel: Lustre: oak-OST0004-osc-ffff9781f565a800: Connection to oak-OST0004 (at 10.0.2.102@o2ib5) was lost; in progress operations using this service will wait for recovery to complete Oct 28 06:42:57 sh-103-53.int kernel: Lustre: Skipped 69 previous similar messages Oct 28 06:43:28 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds Oct 28 06:43:28 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.41@o2ib4 (22): c: 0, oc: 0, rc: 8 Oct 28 06:43:28 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.2.51@o2ib5 from 10.9.103.53@o2ib4 Oct 28 06:43:28 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 100 previous similar messages Oct 28 06:43:28 sh-103-53.int kernel: LNetError: 
91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.2.51@o2ib5: -113 Oct 28 06:43:28 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1572270202/real 1572270208] req@ffff97820fb5b600 x1648382083885808/t0(0) o400->oak-OST0011-osc-ffff9781f565a800@10.0.2.102@o2ib5:28/4 lens 224/224 e 0 to 1 dl 1572270246 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 Oct 28 06:43:28 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 13 previous similar messages Oct 28 06:43:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.2.105@o2ib5: -113 Oct 28 06:43:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 15 previous similar messages Oct 28 06:43:31 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.2.106@o2ib5: -113 Oct 28 06:43:31 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 15 previous similar messages Oct 28 06:43:35 sh-103-53.int kernel: Lustre: oak-OST0059-osc-ffff9781f565a800: Connection to oak-OST0059 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete Oct 28 06:43:35 sh-103-53.int kernel: Lustre: Skipped 152 previous similar messages Oct 28 06:43:37 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.2.106@o2ib5 from Oct 28 06:43:37 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 220 previous similar messages Oct 28 06:43:38 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:820:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) Oct 28 06:44:02 sh-103-53.int 
kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.2.52@o2ib5 from Oct 28 06:44:02 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 3 previous similar messages Oct 28 06:44:52 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.2.52@o2ib5 from Oct 28 06:44:52 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 339 previous similar messages Oct 28 06:45:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.42@o2ib4: 0 seconds Oct 28 06:45:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 225 previous similar messages Oct 28 06:46:07 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.0.2.51@o2ib5 from Oct 28 06:46:07 sh-103-53.int kernel: LNetError: 91118:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 499 previous similar messages Oct 28 06:46:57 sh-103-53.int kernel: Lustre: Evicted from MGS (at MGC10.0.2.51@o2ib5_0) after server handle changed from 0x66220404ba1738f3 to 0x662204055c9f1ff5 Oct 28 06:46:57 sh-103-53.int kernel: LustreError: 167-0: oak-MDT0000-mdc-ffff9781f565a800: This client was evicted by oak-MDT0000; in progress operations using this service will fail. 
Oct 28 06:46:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 06:47:29 sh-103-53.int kernel: Lustre: 91138:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572270442/real 1572270442] req@ffff97779ea74c80 x1648382083929744/t0(0) o400->oak-OST0056-osc-ffff9781f565a800@10.0.2.105@o2ib5:28/4 lens 224/224 e 0 to 1 dl 1572270449 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 28 06:47:29 sh-103-53.int kernel: Lustre: oak-OST0035-osc-ffff9781f565a800: Connection to oak-OST0035 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete Oct 28 06:47:29 sh-103-53.int kernel: Lustre: Skipped 8 previous similar messages Oct 28 06:47:29 sh-103-53.int kernel: Lustre: 91138:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 163 previous similar messages Oct 28 06:47:39 sh-103-53.int kernel: LustreError: 362599:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832f2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 06:47:39 sh-103-53.int kernel: LustreError: 362599:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 06:47:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 06:47:39 sh-103-53.int kernel: Lustre: Skipped 137 previous similar messages Oct 28 06:50:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 06:50:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 2 previous similar messages Oct 28 06:51:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 06:51:42 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 25 previous similar messages Oct 28 06:52:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 06:52:46 sh-103-53.int kernel: LustreError: Skipped 2 previous similar messages Oct 28 06:52:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572270466, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d617bc0/0x51ab3c4ee69857b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd5ca6447c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 06:52:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 06:55:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 28 06:55:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 06:57:49 sh-103-53.int kernel: Lustre: oak-OST000e-osc-ffff9781f565a800: Connection restored to 10.0.2.101@o2ib5 (at 10.0.2.101@o2ib5) Oct 28 06:57:49 sh-103-53.int kernel: Lustre: Skipped 38 previous similar messages Oct 28 06:57:53 sh-103-53.int kernel: LustreError: 363176:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14d80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 06:57:53 sh-103-53.int kernel: LustreError: 363176:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 07:01:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 07:01:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 28 07:01:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 07:01:52 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 28 07:03:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 07:03:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 07:03:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572271083, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626af40/0x51ab3c4ee6985ba lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd62f8a29f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 07:03:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 07:05:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 28 07:05:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 07:08:13 sh-103-53.int kernel: LustreError: 363854:0:(ldlm_resource.c:1147:ldlm_resource_complain()) 
MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bad380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 07:08:13 sh-103-53.int kernel: LustreError: 363854:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 07:08:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 07:08:13 sh-103-53.int kernel: Lustre: Skipped 34 previous similar messages Oct 28 07:12:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 07:12:02 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 07:12:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 07:12:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 07:13:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 07:13:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 07:13:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572271702, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626d340/0x51ab3c4ee6985f9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd69535871 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 07:13:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 07:15:58 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 07:15:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 28 07:18:30 sh-103-53.int kernel: LustreError: 364431:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bac240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 07:18:30 sh-103-53.int kernel: LustreError: 364431:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 07:18:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 07:18:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 07:22:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 07:22:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 07:23:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 07:23:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 07:23:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572272318, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0e9c0/0x51ab3c4ee698631 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd6f40b6cb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 07:23:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 07:24:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 07:24:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 07:26:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 07:26:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 28 07:28:46 sh-103-53.int kernel: LustreError: 365005:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38a000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 07:28:46 sh-103-53.int kernel: LustreError: 365005:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 07:28:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 07:28:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 07:32:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 07:32:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 07:33:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 07:33:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 07:33:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572272935, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0f740/0x51ab3c4ee698669 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd751f6de9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 07:33:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 07:35:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 07:35:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 07:36:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 07:36:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 75 previous similar messages Oct 28 07:39:02 sh-103-53.int kernel: LustreError: 365593:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38b140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 07:39:02 sh-103-53.int kernel: LustreError: 365593:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 07:39:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 07:39:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 07:42:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 07:42:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 07:44:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 07:44:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 07:44:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572273553, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0cec0/0x51ab3c4ee6986a1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd86669552 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 07:44:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 07:45:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 07:45:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 07:46:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 07:46:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 07:49:24 sh-103-53.int kernel: LustreError: 366196:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38acc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 07:49:24 sh-103-53.int kernel: LustreError: 366196:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 07:49:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 07:49:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 07:52:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 07:52:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 07:54:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 07:54:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 07:54:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572274173, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febc140/0x51ab3c4ee6986d9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd8c31cdee expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 07:54:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 07:56:18 sh-103-53.int kernel: LNetError: 365318:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 07:56:18 sh-103-53.int kernel: LNetError: 365318:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 07:56:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 07:56:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 28 07:59:45 sh-103-53.int kernel: LustreError: 366810:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 07:59:45 sh-103-53.int kernel: LustreError: 366810:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 07:59:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 07:59:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 08:02:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 08:02:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 08:05:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 08:05:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 08:05:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572274800, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c0240/0x51ab3c4ee698711 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acd9d40c705 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 08:05:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 08:06:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 08:06:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 28 08:06:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 08:06:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 08:10:10 sh-103-53.int kernel: LustreError: 367513:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 08:10:10 sh-103-53.int kernel: LustreError: 367513:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 08:10:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 08:10:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 08:13:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 08:13:03 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 08:15:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 08:15:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 08:15:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572275419, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626de80/0x51ab3c4ee698749 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acda339fafa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 08:15:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 08:16:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 28 08:16:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 08:16:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 08:16:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 08:20:27 sh-103-53.int kernel: LustreError: 368191:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bad800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 08:20:27 sh-103-53.int kernel: LustreError: 368191:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 08:20:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 08:20:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 08:23:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 08:23:13 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 08:25:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 08:25:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 08:25:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572276037, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0a400/0x51ab3c4ee698781 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acda96ff429 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 08:25:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 08:26:48 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 08:26:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 08:27:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 08:27:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 08:30:44 sh-103-53.int kernel: LustreError: 368818:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38a600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 08:30:44 sh-103-53.int kernel: LustreError: 368818:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 08:30:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 08:30:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 08:33:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 08:33:23 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 08:35:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 08:35:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 08:35:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572276653, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0f2c0/0x51ab3c4ee6987b9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdafba4dfc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 08:35:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 08:36:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 08:36:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 08:37:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 08:37:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 08:38:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 08:38:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 31 previous similar messages Oct 28 08:41:07 sh-103-53.int kernel: LustreError: 369400:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121676540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 08:41:07 sh-103-53.int kernel: LustreError: 369400:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 08:41:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 08:41:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 08:43:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 08:43:33 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 08:46:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 08:46:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 08:46:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572277279, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783feba400/0x51ab3c4ee6987f1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdb6388f5e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 08:46:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 08:47:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 08:47:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 28 08:48:28 sh-103-53.int kernel: LNetError: 365318:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 08:48:28 sh-103-53.int kernel: LNetError: 365318:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 28 08:51:32 sh-103-53.int kernel: LustreError: 369985:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771216778c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 08:51:32 sh-103-53.int kernel: LustreError: 369985:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 08:51:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 08:51:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 08:53:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 08:53:43 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 28 08:56:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 08:56:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 08:56:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572277904, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d4140/0x51ab3c4ee698829 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdbcc06a61 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 08:56:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 08:57:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 08:57:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 28 09:00:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 09:00:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 28 09:01:51 sh-103-53.int kernel: LustreError: 370587:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebc300) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 09:01:51 sh-103-53.int kernel: LustreError: 370587:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 09:01:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 09:01:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 09:03:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 09:03:53 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 09:06:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 09:06:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 09:06:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572278518, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660c800/0x51ab3c4ee698861 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdc3091211 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 09:06:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 09:07:19 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 28 09:07:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 09:10:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 09:10:51 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 14 previous similar messages Oct 28 09:12:07 sh-103-53.int kernel: LustreError: 371162:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed0d80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 09:12:07 sh-103-53.int kernel: LustreError: 371162:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 09:12:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 09:12:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 09:14:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 09:14:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 09:17:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 09:17:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 09:17:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 09:17:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 09:17:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572279140, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c0fc0/0x51ab3c4ee698899 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdc9619bb7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 09:17:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 09:22:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 09:22:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 09:22:32 sh-103-53.int kernel: LustreError: 371747:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14e40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 09:22:32 sh-103-53.int kernel: LustreError: 371747:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 09:22:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 09:22:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 09:24:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 09:24:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 09:27:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Oct 28 09:27:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 09:27:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 09:27:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 09:27:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572279757, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626da00/0x51ab3c4ee6988d1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdcfabf6cc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 09:27:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 09:32:14 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 09:32:14 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 09:32:44 sh-103-53.int kernel: LustreError: 372319:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bac0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 09:32:44 sh-103-53.int kernel: LustreError: 372319:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 09:32:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 09:32:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 09:34:24 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 09:34:24 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 09:37:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 09:37:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 09:37:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 09:37:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 09:37:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572280373, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626c5c0/0x51ab3c4ee698909 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2acdd49f5602 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 09:37:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 09:39:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 09:43:02 sh-103-53.int kernel: LustreError: 372898:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bad440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 09:43:02 sh-103-53.int kernel: LustreError: 372898:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 09:43:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 09:43:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 09:44:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 09:44:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 28 09:44:34 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 09:44:34 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 09:47:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 09:47:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 09:48:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 09:48:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 09:48:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572280990, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0ad00/0x51ab3c4ee698941 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdda3efd05 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 09:48:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 09:53:27 sh-103-53.int kernel: LustreError: 373516:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38a0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 09:53:27 sh-103-53.int kernel: LustreError: 373516:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 09:53:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 09:53:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 09:54:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 09:54:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 09:55:14 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 09:55:14 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 09:57:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 09:57:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 09:58:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 09:58:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 09:58:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572281614, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0fbc0/0x51ab3c4ee698979 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acde001cf20 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 09:58:34 sh-103-53.int kernel: 
LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 10:03:46 sh-103-53.int kernel: LustreError: 374111:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121676780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 10:03:46 sh-103-53.int kernel: LustreError: 374111:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 10:03:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 10:03:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 10:04:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 10:04:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 10:06:24 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 10:06:24 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 10:07:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 28 10:07:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 10:08:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 10:08:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 10:08:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572282239, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d1680/0x51ab3c4ee6989b1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acde6818a2e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 10:08:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 10:14:13 sh-103-53.int kernel: LustreError: 374704:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 10:14:13 sh-103-53.int kernel: LustreError: 374704:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 10:14:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 10:14:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 10:15:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 10:15:04 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 10:17:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 10:17:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 10:18:00 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 10:18:00 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 10:19:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 10:19:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 10:19:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572282859, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626ee40/0x51ab3c4ee6989e9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acded5fb591 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 10:19:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 10:24:35 sh-103-53.int kernel: LustreError: 375287:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bac480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 10:24:35 sh-103-53.int kernel: LustreError: 375287:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 10:24:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 10:24:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 10:25:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 10:25:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 10:27:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 10:27:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 28 10:29:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 10:29:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 10:29:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 10:29:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 10:29:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572283484, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626e9c0/0x51ab3c4ee698a21 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdf42986bd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 10:29:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 10:34:55 sh-103-53.int kernel: LustreError: 375866:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bada40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 10:34:55 sh-103-53.int kernel: LustreError: 375866:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 10:34:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 10:34:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 10:35:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 10:35:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 10:38:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 10:38:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 28 10:39:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 10:39:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 10:40:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 10:40:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 10:40:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572284107, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269b00/0x51ab3c4ee698a59 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acdfb194632 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 10:40:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 10:45:18 sh-103-53.int kernel: LustreError: 376446:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38b380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 10:45:18 sh-103-53.int kernel: LustreError: 376446:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 10:45:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 10:45:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 10:45:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 10:45:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 10:47:36 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 10:48:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 10:48:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 28 10:49:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 10:49:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 28 10:50:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 10:50:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 10:50:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572284731, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0a880/0x51ab3c4ee698a91 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace01c43e8c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 10:50:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 10:55:44 sh-103-53.int kernel: LustreError: 377032:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121676c00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 10:55:44 sh-103-53.int kernel: LustreError: 377032:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 10:55:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 10:55:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 10:55:45 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 10:55:45 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 10:58:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 10:58:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 28 10:59:35 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 10:59:35 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 11:00:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 11:00:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 11:00:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572285356, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d612640/0x51ab3c4ee698ac9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace07ea46c9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 11:00:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 11:05:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 11:05:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 11:06:04 sh-103-53.int kernel: LustreError: 377630:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14b40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 11:06:04 sh-103-53.int kernel: LustreError: 377630:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 11:06:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 11:06:04 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 11:08:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 11:08:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 11:09:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 11:09:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 11:11:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 11:11:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 11:11:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572285972, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626a880/0x51ab3c4ee698b01 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace0e1998c1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 11:11:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 11:16:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 11:16:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 11:16:24 sh-103-53.int kernel: LustreError: 378201:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bac780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 11:16:24 sh-103-53.int kernel: LustreError: 378201:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 11:16:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 11:16:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 11:18:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 11:18:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 11:19:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 11:19:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 11:21:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 11:21:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 11:21:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572286590, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783feb9f80/0x51ab3c4ee698b39 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace1479ba92 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 11:21:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 11:26:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 11:26:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 11:26:38 sh-103-53.int kernel: LustreError: 378776:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677d40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 11:26:38 sh-103-53.int kernel: LustreError: 378776:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 11:26:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 11:26:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 11:29:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 11:29:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 11:30:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 11:30:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 11:31:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 11:31:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 11:31:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572287209, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d1440/0x51ab3c4ee698b71 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace1ad4664d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 11:31:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 11:35:22 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 11:36:26 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 11:36:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 11:37:01 sh-103-53.int kernel: LustreError: 379364:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 11:37:01 sh-103-53.int kernel: LustreError: 379364:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 11:37:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 11:37:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 11:39:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 11:39:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 11:40:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 11:40:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 11:42:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 11:42:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 11:42:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572287828, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0b3c0/0x51ab3c4ee698ba9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace21985f8d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 11:42:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 11:46:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 11:46:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 11:47:17 sh-103-53.int kernel: LustreError: 379951:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38a9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 11:47:17 sh-103-53.int kernel: LustreError: 379951:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 11:47:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 11:47:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 11:49:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 11:49:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 11:51:32 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 11:51:32 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 11:52:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 11:52:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 11:52:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572288445, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbe300/0x51ab3c4ee698be1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace2853afb7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 11:52:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 11:56:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 11:56:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 11:57:36 sh-103-53.int kernel: LustreError: 380528:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebcfc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 11:57:36 sh-103-53.int kernel: LustreError: 380528:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 11:57:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 11:57:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 11:59:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 11:59:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 28 12:02:41 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 12:02:41 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 12:02:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 12:02:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 12:02:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572289070, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660b840/0x51ab3c4ee698c19 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace2ee828ca expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 12:02:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 12:06:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 12:06:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 12:07:58 sh-103-53.int kernel: LustreError: 381126:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832f200) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 12:07:58 sh-103-53.int kernel: LustreError: 381126:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 12:07:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 12:07:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 12:09:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 12:09:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 28 12:13:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 12:13:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 12:13:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572289693, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c18c0/0x51ab3c4ee698c51 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace35709227 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 12:13:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 12:14:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 12:14:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 12:17:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 12:17:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 12:18:26 sh-103-53.int kernel: LustreError: 381715:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832e0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 12:18:26 sh-103-53.int kernel: LustreError: 381715:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 12:18:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 12:18:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 12:19:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 12:19:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 12:23:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 12:23:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 12:23:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572290315, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c0000/0x51ab3c4ee698c89 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace3c2ef609 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 12:23:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 12:26:07 sh-103-53.int kernel: LNetError: 
91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 12:26:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 12:27:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 12:27:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 12:28:44 sh-103-53.int kernel: LustreError: 382291:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bad800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 12:28:44 sh-103-53.int kernel: LustreError: 382291:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 12:28:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 12:28:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 12:30:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 12:30:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 28 12:33:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 12:33:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 12:33:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572290933, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626cc80/0x51ab3c4ee698cc1 lrc: 4/1,0 mode: 
--/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace429c48f8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 12:33:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 12:37:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 12:37:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 28 12:37:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 12:37:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 12:39:01 sh-103-53.int kernel: LustreError: 382870:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 12:39:01 sh-103-53.int kernel: LustreError: 382870:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 12:39:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 12:39:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 12:40:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 12:40:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 28 12:44:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 12:44:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 12:44:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572291550, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c2d00/0x51ab3c4ee698cf9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace4bc92c72 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 12:44:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 12:47:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 12:47:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 12:48:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 12:48:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 12:49:16 sh-103-53.int kernel: LustreError: 383463:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bad200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 12:49:16 sh-103-53.int kernel: LustreError: 383463:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 12:49:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 12:49:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 12:50:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 12:50:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 12:54:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 12:54:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 12:54:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572292163, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76268480/0x51ab3c4ee698d31 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace512b9ec0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 12:54:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 12:57:46 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 12:57:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 12:58:39 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 12:58:39 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 12:59:31 sh-103-53.int kernel: LustreError: 384039:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bac540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 12:59:31 sh-103-53.int kernel: LustreError: 384039:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 12:59:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 12:59:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 13:00:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 13:00:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 28 13:03:52 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 13:04:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 13:04:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 13:04:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out 
(enqueued at 1572292782, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0c5c0/0x51ab3c4ee698d69 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace583c60b6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 13:04:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 13:07:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 13:07:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 13:09:48 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 13:09:48 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 13:09:52 sh-103-53.int kernel: LustreError: 384636:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771216769c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 13:09:52 sh-103-53.int kernel: LustreError: 384636:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 13:09:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 13:09:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 13:10:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 13:10:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 28 13:14:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 13:14:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 13:14:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572293397, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783feba1c0/0x51ab3c4ee698da1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace5e7fcbbc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 13:14:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 13:18:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 13:18:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 13:20:05 sh-103-53.int kernel: LustreError: 385216:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779579b140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 13:20:05 sh-103-53.int kernel: LustreError: 385216:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 13:20:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 13:20:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 13:20:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 13:20:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 13:23:57 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 13:23:57 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 13:25:14 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 13:25:14 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 13:25:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572294014, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d2880/0x51ab3c4ee698dd9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace644a0b7d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 13:25:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 13:28:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 13:28:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 13:30:24 sh-103-53.int kernel: LustreError: 385793:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebdec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 13:30:24 sh-103-53.int kernel: LustreError: 385793:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 13:30:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 13:30:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 13:31:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 13:31:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 28 13:34:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 13:34:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 13:35:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 13:35:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 13:35:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572294634, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660c380/0x51ab3c4ee698e11 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ace7a01cd2f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 13:35:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 13:38:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 13:38:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 13:40:43 sh-103-53.int kernel: LustreError: 386369:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0832e780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 13:40:43 sh-103-53.int kernel: LustreError: 386369:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 13:40:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 13:40:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 13:41:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 13:41:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 76 previous similar messages Oct 28 13:45:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 13:45:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 13:45:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 13:45:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 13:45:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572295253, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d610240/0x51ab3c4ee698e49 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acea14f8f95 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 13:45:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 13:48:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 13:48:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 13:50:59 sh-103-53.int kernel: LustreError: 386948:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14b40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 13:50:59 sh-103-53.int kernel: LustreError: 386948:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 13:50:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 13:50:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 13:51:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 13:51:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 28 13:56:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 13:56:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 13:56:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572295866, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626c140/0x51ab3c4ee698e81 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acec2fad678 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 13:56:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 13:56:36 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 13:56:36 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 13:58:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 13:58:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 14:01:19 sh-103-53.int kernel: LustreError: 387544:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bac900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 14:01:19 sh-103-53.int kernel: LustreError: 387544:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 14:01:19 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 14:01:19 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 14:01:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 14:01:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 14:06:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 14:06:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 14:06:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572296493, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626e9c0/0x51ab3c4ee698eb9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acee23569ef expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 14:06:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 14:08:41 sh-103-53.int kernel: LNetError: 
91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 14:08:41 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 14:08:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 14:08:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 14:11:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 14:11:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 14:11:44 sh-103-53.int kernel: LustreError: 388128:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38b500) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 14:11:44 sh-103-53.int kernel: LustreError: 388128:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 14:11:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 14:11:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 14:16:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 14:16:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 14:16:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572297110, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb098c0/0x51ab3c4ee698ef1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acf088be1a5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 14:16:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 14:18:56 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 800 Oct 28 14:18:56 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 14:19:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 14:19:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 14:21:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 14:21:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 28 14:21:59 sh-103-53.int kernel: LustreError: 388711:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38b740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 14:21:59 sh-103-53.int kernel: LustreError: 388711:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 14:21:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 14:21:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 14:27:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 14:27:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 14:27:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572297729, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbde80/0x51ab3c4ee698f29 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acf33cc07e3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 14:27:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 14:29:16 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 14:29:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 14:30:06 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 14:30:06 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 14:31:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 14:31:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 14:32:21 sh-103-53.int kernel: LustreError: 389299:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed1e00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 14:32:21 sh-103-53.int kernel: LustreError: 389299:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 14:32:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 14:32:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 14:37:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 14:37:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 14:37:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572298353, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d6121c0/0x51ab3c4ee698f61 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acf6224be5c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 14:37:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 14:39:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 14:39:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 14:40:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 14:40:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 14:41:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 14:41:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 74 previous similar messages Oct 28 14:42:49 sh-103-53.int kernel: LustreError: 389884:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bad440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 14:42:49 sh-103-53.int kernel: LustreError: 389884:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 14:42:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 14:42:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 14:48:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 14:48:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 14:48:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572298981, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76268fc0/0x51ab3c4ee698f99 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acf90e1c669 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 14:48:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 14:49:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) 
lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 14:49:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 14:51:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 14:51:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 14:51:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 14:51:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 14:53:14 sh-103-53.int kernel: LustreError: 390502:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6badec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 14:53:14 sh-103-53.int kernel: LustreError: 390502:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 14:53:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 14:53:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 14:58:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 14:58:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 14:58:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572299601, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626e0c0/0x51ab3c4ee698fd1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acfc20d1c97 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 14:58:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 14:59:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 14:59:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 15:01:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 15:01:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 15:02:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 15:02:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 15:03:33 sh-103-53.int kernel: LustreError: 391097:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bac9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 15:03:33 sh-103-53.int kernel: LustreError: 391097:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 15:03:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 15:03:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 15:08:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 15:08:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 15:08:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572300220, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0e0c0/0x51ab3c4ee699009 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2acfee164414 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 15:08:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 15:09:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 15:09:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 15:12:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 15:12:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 28 15:13:46 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 15:13:46 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 15:13:48 sh-103-53.int kernel: LustreError: 391674:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121676600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 15:13:48 sh-103-53.int kernel: LustreError: 391674:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 15:13:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 15:13:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 15:18:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 15:18:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 15:18:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572300834, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d4380/0x51ab3c4ee699041 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ad01481f58a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 15:18:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 15:20:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 15:20:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 15:22:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 15:22:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 15:23:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 15:23:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 15:24:05 sh-103-53.int kernel: LustreError: 392249:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebcd80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 15:24:05 sh-103-53.int kernel: LustreError: 392249:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 15:24:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 15:24:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 15:29:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 15:29:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 15:29:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572301458, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626d7c0/0x51ab3c4ee699079 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad037edbdca expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 15:29:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 15:30:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 15:30:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 15:33:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 15:33:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 15:34:31 sh-103-53.int kernel: LustreError: 392836:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e797938c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 15:34:31 sh-103-53.int kernel: LustreError: 392836:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 15:34:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 15:34:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 15:35:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 15:35:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 15:39:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 15:39:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 15:39:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572302083, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febc380/0x51ab3c4ee6990b1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad05e523b53 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 15:39:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 15:40:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 15:40:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 15:43:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 15:43:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 15:44:55 sh-103-53.int kernel: LustreError: 393416:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121676e40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 15:44:55 sh-103-53.int kernel: LustreError: 393416:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 15:44:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 15:44:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 15:45:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 15:45:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 15:50:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 15:50:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 15:50:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572302701, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfb9440/0x51ab3c4ee6990e9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad088f8f103 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 15:50:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 15:50:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 15:50:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 15:53:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 15:53:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 15:55:10 sh-103-53.int kernel: LustreError: 393991:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebdec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 15:55:10 sh-103-53.int kernel: LustreError: 393991:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 15:55:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 15:55:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 15:56:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 15:56:12 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 28 16:00:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 16:00:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 16:00:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572303323, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d611d40/0x51ab3c4ee699121 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad0b27efe3e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 16:00:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 16:00:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 16:00:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 16:03:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 16:03:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 28 16:05:36 sh-103-53.int kernel: LustreError: 394592:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 16:05:36 sh-103-53.int kernel: LustreError: 394592:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 16:05:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 16:05:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 16:07:42 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 16:07:42 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 16:10:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 16:10:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 16:10:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572303947, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626f2c0/0x51ab3c4ee699159 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad1009362d1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 16:10:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 16:10:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 16:10:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 16:13:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 16:13:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 16:15:53 sh-103-53.int kernel: LustreError: 395169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38a480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 16:15:53 sh-103-53.int kernel: LustreError: 395169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 16:15:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 16:15:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 16:18:29 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 16:18:29 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 16:21:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 16:21:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 16:21:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572304561, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb086c0/0x51ab3c4ee699191 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad14a4db816 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 16:21:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 16:21:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 16:21:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 16:23:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 28 16:23:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 16:26:08 sh-103-53.int kernel: LustreError: 395745:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38a900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 16:26:08 sh-103-53.int kernel: LustreError: 395745:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 16:26:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 16:26:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 16:29:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 16:29:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 16:31:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 16:31:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 16:31:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 16:31:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 16:31:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572305178, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febe0c0/0x51ab3c4ee6991c9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad16f457596 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 16:31:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 16:33:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 16:33:31 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 16:36:26 sh-103-53.int kernel: LustreError: 396319:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121676600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 16:36:26 sh-103-53.int kernel: LustreError: 396319:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 16:36:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 16:36:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 16:40:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 16:40:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 16:41:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 16:41:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 16:41:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 16:41:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 16:41:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572305796, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febca40/0x51ab3c4ee699201 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad1964a693a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 16:41:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 16:43:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 16:43:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 28 16:46:56 sh-103-53.int kernel: LustreError: 396916:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121676840) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 16:46:56 sh-103-53.int kernel: LustreError: 396916:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 16:46:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 16:46:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 16:50:19 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 16:50:19 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 16:51:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 16:51:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 16:52:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 16:52:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 16:52:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572306430, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d3f00/0x51ab3c4ee699239 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad1bc393955 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 16:52:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 16:53:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 16:53:37 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 16:57:23 sh-103-53.int kernel: LustreError: 397512:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebc240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 16:57:23 sh-103-53.int kernel: LustreError: 397512:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 16:57:23 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 16:57:23 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 17:00:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 17:00:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 17:01:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 17:01:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 17:02:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 17:02:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 17:02:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572307056, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d0240/0x51ab3c4ee699271 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad1e31e89e4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 17:02:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 17:03:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 28 17:03:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 17:07:47 sh-103-53.int kernel: LustreError: 398113:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebc6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 17:07:47 sh-103-53.int kernel: LustreError: 398113:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 17:07:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 17:07:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 17:10:48 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 17:10:48 sh-103-53.int kernel: LNetError: 371905:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 17:11:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 17:11:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 17:13:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 17:13:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 17:13:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572307681, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76609b00/0x51ab3c4ee6992a9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad20ad23693 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 17:13:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 17:13:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 17:13:39 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 28 17:13:58 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 17:18:15 sh-103-53.int kernel: LustreError: 398699:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed1200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 17:18:15 sh-103-53.int kernel: LustreError: 398699:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 17:18:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 17:18:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 17:20:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 17:20:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 17:22:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 17:22:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 17:23:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 17:23:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 17:23:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572308308, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269d40/0x51ab3c4ee6992e1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad23466da39 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 17:23:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 17:23:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 17:23:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 28 17:28:35 sh-103-53.int kernel: LustreError: 399298:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38b680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 17:28:35 sh-103-53.int kernel: LustreError: 399298:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 17:28:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 17:28:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 17:31:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 17:31:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 17:32:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 17:32:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 17:33:18 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 17:33:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 17:33:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 17:33:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572308925, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0ad00/0x51ab3c4ee699319 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad25e765a44 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 17:33:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message 
Oct 28 17:33:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 17:33:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 17:38:55 sh-103-53.int kernel: LustreError: 399877:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 17:38:55 sh-103-53.int kernel: LustreError: 399877:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 17:38:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 17:38:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 17:39:22 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 17:42:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 17:42:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 17:44:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 17:44:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 17:44:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572309545, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febf740/0x51ab3c4ee699351 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad28e2fc1e2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 17:44:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 17:44:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 17:44:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 17:44:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 17:44:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 17:49:13 sh-103-53.int kernel: LustreError: 400454:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677a40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 17:49:13 sh-103-53.int kernel: LustreError: 400454:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 17:49:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 17:49:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 17:52:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 17:52:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 28 17:54:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 17:54:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 17:54:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572310158, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febd580/0x51ab3c4ee699389 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad2bb846fff expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 17:54:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 17:54:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 17:54:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 17:56:06 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 17:56:06 sh-103-53.int kernel: LNetError: 293164:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 17:59:29 sh-103-53.int kernel: LustreError: 401033:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677d40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 17:59:29 sh-103-53.int kernel: LustreError: 401033:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 17:59:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 17:59:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 18:02:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 18:02:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 18:04:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 28 18:04:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 18:04:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 18:04:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 18:04:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572310778, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d5e80/0x51ab3c4ee6993c1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ad2e1bf55a8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 18:04:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 18:06:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 18:06:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 18:09:46 sh-103-53.int kernel: LustreError: 401626:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebcd80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 18:09:46 sh-103-53.int kernel: LustreError: 401626:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 18:09:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 18:09:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 18:12:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 18:12:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 18:14:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 18:14:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 28 18:14:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 18:14:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 18:14:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572311393, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d60c0/0x51ab3c4ee6993f9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad307b32397 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 18:14:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 18:17:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 18:17:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 18:20:03 sh-103-53.int kernel: LustreError: 402206:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed15c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 18:20:03 sh-103-53.int kernel: LustreError: 402206:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 18:20:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 18:20:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 18:23:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 18:23:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 18:24:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 18:24:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 28 18:25:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 18:25:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 18:25:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572312009, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c1b00/0x51ab3c4ee699431 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad32da17c81 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 18:25:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 18:28:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 18:28:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 18:30:15 sh-103-53.int kernel: LustreError: 402777:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bade00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 18:30:15 sh-103-53.int kernel: LustreError: 402777:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 18:30:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 18:30:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 18:33:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 18:33:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 18:34:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 18:34:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 28 18:35:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 18:35:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 18:35:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572312622, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626e0c0/0x51ab3c4ee699470 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ad3518fc2f1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 18:35:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 18:39:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 18:39:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 18:40:31 sh-103-53.int kernel: LustreError: 403355:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bac540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 18:40:31 sh-103-53.int kernel: LustreError: 403355:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 18:40:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 18:40:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 18:43:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 18:43:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 18:44:28 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 18:44:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 18:44:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 28 18:45:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 18:45:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 18:45:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572313239, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626af40/0x51ab3c4ee6994a8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad378d0ea6c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 18:45:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 18:49:25 sh-103-53.int kernel: LNetError: 400924:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 18:49:25 sh-103-53.int kernel: LNetError: 400924:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 18:50:48 sh-103-53.int kernel: LustreError: 403935:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bacd80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 18:50:48 sh-103-53.int kernel: LustreError: 403935:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 18:50:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 18:50:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 18:53:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 18:53:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 18:54:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 18:54:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 18:55:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 18:55:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 18:55:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572313857, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269440/0x51ab3c4ee6994e0 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad3a179bd40 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 18:55:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 18:59:26 sh-103-53.int kernel: LNetError: 400924:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 18:59:26 sh-103-53.int kernel: LNetError: 400924:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 19:01:09 sh-103-53.int kernel: LustreError: 404533:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bacf00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 19:01:09 sh-103-53.int kernel: LustreError: 404533:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 19:01:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 19:01:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 19:03:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 19:03:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 28 19:05:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 19:05:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 19:06:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 19:06:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 19:06:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572314480, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb08900/0x51ab3c4ee699518 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 
0xc9be2ad3cab5ec2c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 19:06:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 19:09:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 19:09:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 19:11:32 sh-103-53.int kernel: LustreError: 405117:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38b200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 19:11:32 sh-103-53.int kernel: LustreError: 405117:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 19:11:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 19:11:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 19:13:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 19:13:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 19:15:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 19:15:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 28 19:16:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 19:16:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 19:16:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572315104, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0a400/0x51ab3c4ee699550 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad3f1c76242 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 19:16:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 19:20:31 sh-103-53.int kernel: LNetError: 405059:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 19:20:31 sh-103-53.int kernel: LNetError: 405059:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 19:21:53 sh-103-53.int kernel: LustreError: 405706:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38b680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 19:21:53 sh-103-53.int kernel: LustreError: 405706:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 19:21:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 19:21:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 19:24:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 19:24:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 19:25:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 19:25:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 19:27:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 19:27:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 19:27:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572315728, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb08000/0x51ab3c4ee699588 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad416e2cc3e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 19:27:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 19:32:15 sh-103-53.int kernel: LustreError: 406294:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38a540) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 19:32:15 sh-103-53.int kernel: LustreError: 406294:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 19:32:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 19:32:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 19:32:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 19:32:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 19:34:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 19:34:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 19:35:17 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 19:35:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 19:35:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 19:37:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 19:37:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 19:37:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572316341, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: 
ffff97743a7d5580/0x51ab3c4ee6995c0 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad43c595ba9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 19:37:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 19:42:31 sh-103-53.int kernel: LustreError: 406868:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebd200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 19:42:31 sh-103-53.int kernel: LustreError: 406868:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 19:42:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 19:42:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 19:43:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 19:43:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 19:44:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 19:44:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 19:45:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 19:45:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 19:47:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 19:47:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 19:47:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572316958, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d7080/0x51ab3c4ee6995f8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad45a78c88a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 19:47:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 19:52:53 sh-103-53.int kernel: LustreError: 407448:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebcd80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 19:52:53 sh-103-53.int kernel: LustreError: 407448:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 19:52:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 19:52:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 19:54:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 19:54:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 19:54:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 19:54:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 28 19:55:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 19:55:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 19:58:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 19:58:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 19:58:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572317582, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d6300/0x51ab3c4ee699630 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad47a92520f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 19:58:02 sh-103-53.int kernel: 
LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 20:03:12 sh-103-53.int kernel: LustreError: 408041:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977795ebd800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 20:03:12 sh-103-53.int kernel: LustreError: 408041:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 20:03:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 20:03:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 20:04:32 sh-103-53.int kernel: LNetError: 405059:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 20:04:32 sh-103-53.int kernel: LNetError: 405059:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 20:04:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 20:04:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 20:06:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 20:06:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 20:08:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 20:08:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 20:08:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572318199, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d610fc0/0x51ab3c4ee699668 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad49b699816 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 20:08:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 20:13:32 sh-103-53.int kernel: LustreError: 408622:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bad080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 20:13:32 sh-103-53.int kernel: LustreError: 408622:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 20:13:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 20:13:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 20:14:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 20:14:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 28 20:15:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 20:15:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 20:16:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 20:16:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 20:18:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 20:18:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 20:18:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572318823, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626b840/0x51ab3c4ee6996a0 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad4bfc04161 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 20:18:43 sh-103-53.int kernel: 
LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 20:23:52 sh-103-53.int kernel: LustreError: 409200:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754f6bad800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 20:23:52 sh-103-53.int kernel: LustreError: 409200:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 20:23:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 20:23:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 20:25:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 20:25:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 20:26:00 sh-103-53.int kernel: LNetError: 405059:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 20:26:00 sh-103-53.int kernel: LNetError: 405059:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 20:26:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 20:26:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 28 20:29:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 20:29:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 20:29:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572319440, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626f980/0x51ab3c4ee6996d8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad4e22b665d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 20:29:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 20:34:11 sh-103-53.int kernel: LustreError: 409784:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38a480) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 20:34:11 sh-103-53.int kernel: LustreError: 409784:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 20:34:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 20:34:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 20:35:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 20:35:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 20:36:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 20:36:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 20:38:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 20:38:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 20:39:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 20:39:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 20:39:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572320062, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febe540/0x51ab3c4ee699710 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad50a48a72b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 20:39:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 20:44:34 sh-103-53.int kernel: LustreError: 410366:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446ed0b40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 20:44:34 sh-103-53.int kernel: LustreError: 410366:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 20:44:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 20:44:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 20:45:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 20:45:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 20:47:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 20:47:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 20:48:23 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 20:48:23 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 28 20:49:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 20:49:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 20:49:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572320681, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d610900/0x51ab3c4ee699748 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad52e9e9cad expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 20:49:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 20:54:51 sh-103-53.int kernel: LustreError: 410946:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9754fc38bbc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 20:54:51 sh-103-53.int kernel: LustreError: 410946:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 20:54:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 20:54:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 20:55:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 20:55:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 28 20:57:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 20:57:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 28 20:59:11 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 20:59:11 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 21:00:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 21:00:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 21:00:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572321305, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0ad00/0x51ab3c4ee699780 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad5508ef558 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 21:00:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 21:05:20 sh-103-53.int kernel: LustreError: 411549:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121677440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 21:05:20 sh-103-53.int kernel: LustreError: 411549:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 21:05:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 21:05:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 21:05:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 21:05:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 28 21:07:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 21:07:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 28 21:10:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 21:10:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 21:10:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572321933, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfba1c0/0x51ab3c4ee6997b8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad5791aefe2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 21:10:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 21:11:05 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds Oct 28 21:11:05 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Skipped 1 previous similar message Oct 28 21:11:05 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.21@o2ib4 (7): c: 0, oc: 0, rc: 8 Oct 28 21:11:05 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Skipped 1 previous similar message Oct 28 21:11:05 sh-103-53.int kernel: Lustre: 91134:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent 
delay: [sent 1572322258/real 0] req@ffff977e7326f500 x1648382093091104/t0(0) o400->fir-OST000d-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572322265 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 28 21:11:05 sh-103-53.int kernel: Lustre: fir-OST0046-osc-ffff9781f2230800: Connection to fir-OST0046 (at 10.0.10.111@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 28 21:11:05 sh-103-53.int kernel: Lustre: Skipped 21 previous similar messages Oct 28 21:11:05 sh-103-53.int kernel: Lustre: 91134:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 15 previous similar messages Oct 28 21:11:26 sh-103-53.int kernel: Lustre: 91159:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572322233/real 1572322233] req@ffff977e7326da00 x1648382093090768/t0(0) o503->MGC10.0.10.51@o2ib7@10.0.10.51@o2ib7:26/25 lens 272/8416 e 0 to 1 dl 1572322286 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Oct 28 21:11:26 sh-103-53.int kernel: Lustre: 91159:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Oct 28 21:11:26 sh-103-53.int kernel: LustreError: 91159:0:(mgc_request.c:599:do_requeue()) failed processing log: -5 Oct 28 21:15:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 21:15:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 28 21:16:47 sh-103-53.int kernel: LustreError: 412185:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 21:16:47 sh-103-53.int kernel: LustreError: 412185:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 21:16:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 21:16:47 sh-103-53.int kernel: Lustre: Skipped 18 previous similar messages Oct 28 21:17:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 28 21:17:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 113 previous similar messages Oct 28 21:18:34 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572322707/real 1572322707] req@ffff9769f2ed2880 x1648382093170320/t0(0) o400->fir-OST005a-osc-ffff9781f2230800@10.0.10.115@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572322714 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 28 21:18:34 sh-103-53.int kernel: Lustre: fir-OST0052-osc-ffff9781f2230800: Connection to fir-OST0052 (at 10.0.10.113@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 28 21:18:34 sh-103-53.int kernel: Lustre: Skipped 15 previous similar messages Oct 28 21:18:34 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 18 previous similar messages Oct 28 21:21:49 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 21:21:49 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 28 21:21:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 21:21:54 sh-103-53.int kernel: LustreError: Skipped 2 previous similar messages Oct 28 21:21:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572322614, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699200/0x51ab3c4ee6997f7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad5999d53c0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 21:21:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 21:23:40 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 21:23:40 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 28 21:26:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 28 21:26:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 26 previous similar messages Oct 28 21:26:36 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 21:26:36 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 2 previous similar messages Oct 28 21:27:05 sh-103-53.int kernel: LustreError: 412773:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccd80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 21:27:05 sh-103-53.int kernel: LustreError: 412773:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 21:27:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 21:27:05 sh-103-53.int kernel: Lustre: Skipped 30 previous similar messages Oct 28 21:27:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 21:27:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 28 21:31:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 21:31:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 21:32:12 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 21:32:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 21:32:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 21:32:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572323233, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69bf00/0x51ab3c4ee69982f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad5a9e15a67 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 21:32:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 21:36:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 28 21:36:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 24 previous similar messages Oct 28 21:37:22 sh-103-53.int kernel: LustreError: 413350:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccc00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 21:37:22 sh-103-53.int kernel: LustreError: 413350:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 21:37:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 21:37:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 21:38:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 21:38:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 28 21:42:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 21:42:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 13 previous similar messages Oct 28 21:42:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 21:42:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 21:42:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572323850, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69aac0/0x51ab3c4ee699867 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad5b93f6dae expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 21:42:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 21:46:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. 
Health = 900 Oct 28 21:46:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 23 previous similar messages Oct 28 21:47:40 sh-103-53.int kernel: LustreError: 413927:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 21:47:40 sh-103-53.int kernel: LustreError: 413927:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 21:47:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 21:47:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 21:48:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 21:48:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 21:52:21 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 21:52:21 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 21:52:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 21:52:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 21:52:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572324472, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69ba80/0x51ab3c4ee69989f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad5c7e7d19f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 21:52:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 21:56:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 21:56:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 26 previous similar messages Oct 28 21:58:04 sh-103-53.int kernel: LustreError: 414510:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcc180) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 21:58:04 sh-103-53.int kernel: LustreError: 414510:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 21:58:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 21:58:04 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 21:58:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 28 seconds Oct 28 21:58:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 22:03:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 22:03:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 22:03:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572325092, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69e9c0/0x51ab3c4ee6998d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad5dd55e47a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 22:03:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 22:04:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 22:04:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 22:06:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 22:06:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages Oct 28 22:08:19 sh-103-53.int kernel: LustreError: 415104:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcda40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 22:08:19 sh-103-53.int kernel: LustreError: 415104:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 22:08:19 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 22:08:19 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 22:08:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 22:08:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 22:13:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 22:13:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 22:13:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572325708, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699440/0x51ab3c4ee69990f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad5f24e9814 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 22:13:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 22:14:47 sh-103-53.int kernel: LNetError: 
408127:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 22:14:47 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 28 22:16:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 22:16:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 28 22:18:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 22:18:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 22:18:38 sh-103-53.int kernel: LustreError: 415693:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcca80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 22:18:38 sh-103-53.int kernel: LustreError: 415693:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 22:18:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 22:18:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 22:23:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 22:23:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 22:23:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572326323, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69e540/0x51ab3c4ee699947 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad60402f16a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 22:23:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 22:25:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 22:25:56 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 22:27:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 22:27:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages Oct 28 22:28:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 22:28:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 66 previous similar messages Oct 28 22:28:53 sh-103-53.int kernel: LustreError: 416268:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcce40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 22:28:53 sh-103-53.int kernel: LustreError: 416268:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 22:28:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 22:28:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 22:34:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 22:34:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 22:34:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572326946, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69c800/0x51ab3c4ee69997f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad609b0e8df expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 22:34:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 22:34:12 sh-103-53.int kernel: LNetError: 
91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 22:37:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 22:37:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 22:37:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 22:37:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 28 22:38:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 22:38:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 22:39:16 sh-103-53.int kernel: LustreError: 416851:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebccb40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 22:39:16 sh-103-53.int kernel: LustreError: 416851:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 22:39:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 22:39:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 22:44:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 22:44:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 22:44:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572327562, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69fbc0/0x51ab3c4ee6999b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad60e835ab7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 22:44:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 22:47:19 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 22:47:19 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 22:47:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 22:47:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages Oct 28 22:49:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 22:49:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 28 22:49:28 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 22:49:32 sh-103-53.int kernel: LustreError: 417431:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 22:49:32 sh-103-53.int kernel: LustreError: 417431:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 22:49:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 22:49:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 22:50:34 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 22:54:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 22:54:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 22:54:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572328180, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea33c0/0x51ab3c4ee6999ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local 
remote: 0xc9be2ad6111e1d73 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 22:54:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 22:57:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 22:57:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 28 22:58:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 22:58:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 22:59:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 22:59:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 28 22:59:48 sh-103-53.int kernel: LustreError: 418010:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 22:59:48 sh-103-53.int kernel: LustreError: 418010:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 22:59:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 22:59:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 23:04:43 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 28 23:05:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 23:05:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 23:05:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572328800, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0240/0x51ab3c4ee699a27 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6140d791e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 23:05:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 23:07:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 23:07:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 28 23:09:12 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 23:09:12 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 23:09:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 28 23:09:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 23:10:12 sh-103-53.int kernel: LustreError: 418615:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 23:10:12 sh-103-53.int kernel: LustreError: 418615:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 23:10:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 23:10:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 23:15:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 23:15:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 23:15:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572329423, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4c80/0x51ab3c4ee699a5f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6183c03f3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 23:15:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 23:17:57 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 23:17:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages Oct 28 23:19:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 28 23:19:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 28 23:20:32 sh-103-53.int kernel: LustreError: 419197:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 23:20:32 sh-103-53.int kernel: LustreError: 419197:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 23:20:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 23:20:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 23:21:52 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 23:21:52 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 28 23:25:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 23:25:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 23:25:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572330040, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee699a97 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad62304b622 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 23:25:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 23:28:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 23:28:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 28 23:29:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 23:29:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 28 23:30:49 sh-103-53.int kernel: LustreError: 419775:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feeda40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 23:30:49 sh-103-53.int kernel: LustreError: 419775:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 23:30:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 23:30:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 23:33:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 23:33:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 28 23:36:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 23:36:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 23:36:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572330662, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef3cc0/0x51ab3c4ee699acf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad630d6b8a0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 23:36:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 23:38:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 23:38:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 28 23:40:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 23:40:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 28 23:41:16 sh-103-53.int kernel: LustreError: 420362:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781feea75c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 28 23:41:16 sh-103-53.int kernel: LustreError: 420362:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 23:41:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 23:41:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 23:43:02 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 28 23:43:02 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 28 23:46:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 23:46:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 23:46:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572331283, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd6780/0x51ab3c4ee699b07 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad63d993e9b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 23:46:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 23:48:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 28 23:48:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 28 23:50:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 28 23:50:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 28 23:51:29 sh-103-53.int kernel: LustreError: 420939:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978209273a40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 28 23:51:29 sh-103-53.int kernel: LustreError: 420939:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 28 23:51:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 28 23:51:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 28 23:54:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 28 23:54:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 28 23:56:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 28 23:56:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 28 23:56:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572331904, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e68900/0x51ab3c4ee699b3f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad64b9d6a45 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 28 23:56:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 28 23:58:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 28 23:58:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 29 00:00:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 29 00:00:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 69 previous similar messages Oct 29 00:01:56 sh-103-53.int kernel: LustreError: 421563:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446e849c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 00:01:56 sh-103-53.int kernel: LustreError: 421563:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 00:01:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 00:01:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 00:04:08 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 00:04:08 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 29 00:07:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 00:07:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 00:07:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572332528, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e698c0/0x51ab3c4ee699b77 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad65a90c9cc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 00:07:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 00:08:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 00:08:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 00:10:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 00:10:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 29 00:12:19 sh-103-53.int kernel: LustreError: 422148:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978211fee840) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 00:12:19 sh-103-53.int kernel: LustreError: 422148:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 00:12:19 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 00:12:19 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 00:14:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 00:14:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 29 00:17:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 00:17:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 00:17:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572333146, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6a640/0x51ab3c4ee699baf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad668bf4085 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 00:17:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 00:18:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 00:18:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 29 00:20:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 00:20:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 29 00:22:37 sh-103-53.int kernel: LustreError: 422726:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978211fee600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 00:22:37 sh-103-53.int kernel: LustreError: 422726:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 00:22:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 00:22:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 00:24:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 00:24:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 29 00:27:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 00:27:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 00:27:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572333765, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69f080/0x51ab3c4ee699be7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad67669df91 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 00:27:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 00:29:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 00:29:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 00:30:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 00:30:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 29 00:32:55 sh-103-53.int kernel: LustreError: 423306:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 00:32:55 sh-103-53.int kernel: LustreError: 423306:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 00:32:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 00:32:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 00:35:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 00:35:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 29 00:38:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 00:38:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 00:38:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572334386, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699d40/0x51ab3c4ee699c1f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad68402bfe3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 00:38:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 00:39:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 00:39:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 29 00:41:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 00:41:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 29 00:43:19 sh-103-53.int kernel: LustreError: 423891:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd080) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 00:43:19 sh-103-53.int kernel: LustreError: 423891:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 00:43:19 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 00:43:19 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 00:46:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 00:46:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 29 00:48:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 00:48:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 00:48:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572335011, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69ca40/0x51ab3c4ee699c57 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad690826b5c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 00:48:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 00:49:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 00:49:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 29 00:51:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 00:51:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 29 00:53:45 sh-103-53.int kernel: LustreError: 424486:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4a80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 00:53:45 sh-103-53.int kernel: LustreError: 424486:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 00:53:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 00:53:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 00:56:21 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 00:56:21 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 29 00:58:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 00:58:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 00:58:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572335638, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf51440/0x51ab3c4ee699c8f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad69c89c347 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 00:58:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 00:59:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 00:59:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 28 previous similar messages Oct 29 01:01:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 01:01:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 73 previous similar messages Oct 29 01:04:12 sh-103-53.int kernel: LustreError: 425094:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe99a40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 01:04:12 sh-103-53.int kernel: LustreError: 425094:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 01:04:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 01:04:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 01:07:12 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 01:07:12 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 29 01:09:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 01:09:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 01:09:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572336258, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9ad00/0x51ab3c4ee699cc7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6a7317e6a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 01:09:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 01:09:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 01:09:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 01:11:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 01:11:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 29 01:14:25 sh-103-53.int kernel: LustreError: 425671:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781feea72c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 01:14:25 sh-103-53.int kernel: LustreError: 425671:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 01:14:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 01:14:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 01:17:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 01:17:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 29 01:19:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 01:19:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 01:19:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572336871, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef3180/0x51ab3c4ee699cff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6b2f9cafc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 01:19:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 01:19:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 01:19:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages Oct 29 01:21:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 29 01:21:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 29 01:24:40 sh-103-53.int kernel: LustreError: 426248:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781feea6e40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 01:24:40 sh-103-53.int kernel: LustreError: 426248:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 01:24:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 01:24:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 01:28:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 01:28:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 29 01:29:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 01:29:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 01:29:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572337491, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef0d80/0x51ab3c4ee699d37 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6c191f404 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 01:29:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 01:30:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 01:30:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages Oct 29 01:31:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 01:31:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 29 01:35:05 sh-103-53.int kernel: LustreError: 426835:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781feea7140) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 01:35:05 sh-103-53.int kernel: LustreError: 426835:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 01:35:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 01:35:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 01:38:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 01:38:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 29 01:40:17 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 01:40:17 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 01:40:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572338117, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6b600/0x51ab3c4ee699d6f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6cdc445aa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 01:40:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 01:40:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 01:40:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 29 01:42:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 01:42:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 29 01:43:18 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 01:45:28 sh-103-53.int kernel: LustreError: 427420:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdd40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 01:45:28 sh-103-53.int kernel: LustreError: 427420:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 01:45:28 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 01:45:28 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 01:49:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 01:49:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 29 01:50:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 01:50:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 29 01:50:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 01:50:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 01:50:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572338733, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf56540/0x51ab3c4ee699da7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6dd9d86c4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 01:50:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 01:52:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 01:52:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 29 01:55:43 sh-103-53.int kernel: LustreError: 427997:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe99440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 01:55:43 sh-103-53.int kernel: LustreError: 427997:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 01:55:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 01:55:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 02:00:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 02:00:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 29 02:00:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 02:00:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages Oct 29 02:00:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 02:00:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 02:00:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572339357, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea7740/0x51ab3c4ee699ddf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6ee0ee4c2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 02:00:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 02:02:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 02:02:22 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 72 previous similar messages Oct 29 02:06:08 sh-103-53.int kernel: LustreError: 428603:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feecf00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 02:06:08 sh-103-53.int kernel: LustreError: 428603:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 02:06:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 02:06:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 02:10:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 02:10:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages Oct 29 02:11:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 02:11:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 02:11:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572339981, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea72c0/0x51ab3c4ee699e17 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad6fda12c36 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 02:11:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 02:12:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 02:12:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 29 02:12:41 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 02:12:41 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 29 02:16:29 sh-103-53.int kernel: LustreError: 429182:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 02:16:29 sh-103-53.int kernel: LustreError: 429182:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 02:16:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 02:16:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 02:20:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 02:20:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 29 02:21:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 02:21:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 02:21:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572340599, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea57c0/0x51ab3c4ee699e4f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad70fbe6429 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 02:21:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 02:22:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 02:22:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 29 02:23:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 02:23:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 29 02:26:47 sh-103-53.int kernel: LustreError: 429761:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 02:26:47 sh-103-53.int kernel: LustreError: 429761:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 02:26:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 02:26:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 02:31:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 02:31:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages Oct 29 02:31:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 02:31:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 02:31:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572341214, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4c80/0x51ab3c4ee699e87 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad7227ecd81 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 02:31:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 02:32:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 29 02:32:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 29 02:34:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 02:34:02 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 29 02:37:02 sh-103-53.int kernel: LustreError: 430349:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 02:37:02 sh-103-53.int kernel: LustreError: 430349:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 02:37:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 02:37:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 02:41:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 02:41:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 29 02:42:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 02:42:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 02:42:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572341829, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea3840/0x51ab3c4ee699ebf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad7372e7726 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 02:42:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 02:43:02 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 02:43:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 29 02:45:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 02:45:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 29 02:47:21 sh-103-53.int kernel: LustreError: 430927:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 02:47:21 sh-103-53.int kernel: LustreError: 430927:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 02:47:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 02:47:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 02:51:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 02:51:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 29 02:52:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 02:52:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 02:52:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572342455, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea33c0/0x51ab3c4ee699ef7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad74c2041a3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 02:52:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 02:53:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 02:53:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 29 02:56:24 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 02:56:24 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 29 02:57:46 sh-103-53.int kernel: LustreError: 431512:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296780) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 02:57:46 sh-103-53.int kernel: LustreError: 431512:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 02:57:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 02:57:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 03:01:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 03:01:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 29 03:02:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 03:02:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 03:02:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572343079, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6c140/0x51ab3c4ee699f2f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad761fcce4c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 03:02:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 03:03:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 03:03:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 29 03:08:12 sh-103-53.int kernel: LustreError: 432116:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e7eb0c3c0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 03:08:12 sh-103-53.int kernel: LustreError: 432116:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 03:08:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 03:08:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 03:10:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 03:10:27 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 9 previous similar messages Oct 29 03:11:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 03:11:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 34 previous similar messages Oct 29 03:13:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 03:13:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 03:13:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572343701, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754f8e6d340/0x51ab3c4ee699f67 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad774c7f5ee expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 03:13:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 03:13:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out 
tx for 10.9.0.31@o2ib4: 1 seconds Oct 29 03:13:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 71 previous similar messages Oct 29 03:18:36 sh-103-53.int kernel: LustreError: 432704:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0c2e3200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 03:18:36 sh-103-53.int kernel: LustreError: 432704:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 03:18:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 03:18:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 03:18:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 03:21:49 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 687 Oct 29 03:21:49 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 29 03:21:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 03:21:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages Oct 29 03:23:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 03:23:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 03:23:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572344324, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc69a400/0x51ab3c4ee699f9f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad78d7c9ae6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 03:23:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 03:23:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 03:23:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 29 03:28:52 sh-103-53.int kernel: LustreError: 433304:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcd140) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 03:28:52 sh-103-53.int kernel: LustreError: 433304:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 03:28:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 03:28:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 03:31:54 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 03:31:54 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 11 previous similar messages Oct 29 03:32:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 03:32:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages Oct 29 03:33:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 03:33:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 29 03:34:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 03:34:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 03:34:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572344942, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54ec0/0x51ab3c4ee699fd7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad79f46c906 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 03:34:02 sh-103-53.int kernel: 
LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 03:39:17 sh-103-53.int kernel: LustreError: 433888:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 03:39:17 sh-103-53.int kernel: LustreError: 433888:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 03:39:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 03:39:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 03:42:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 03:42:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 03:42:47 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 03:42:47 sh-103-53.int kernel: LNetError: 409818:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 29 03:43:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 03:43:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 29 03:44:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 03:44:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 03:44:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572345565, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9e780/0x51ab3c4ee69a00f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad7b3a77bb5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 03:44:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 03:49:35 sh-103-53.int kernel: LustreError: 434470:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe98840) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 03:49:35 sh-103-53.int kernel: LustreError: 434470:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 03:49:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 03:49:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 03:52:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 03:52:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 29 03:53:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 03:53:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages Oct 29 03:54:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 03:54:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 29 03:54:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 03:54:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 03:54:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572346186, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9f2c0/0x51ab3c4ee69a047 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad7c9565c5f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 03:54:46 sh-103-53.int kernel: 
LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 03:55:28 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 03:59:58 sh-103-53.int kernel: LustreError: 435062:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feece40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 03:59:58 sh-103-53.int kernel: LustreError: 435062:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 03:59:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 03:59:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 04:02:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 04:02:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 32 previous similar messages Oct 29 04:04:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 04:04:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 29 04:04:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 04:04:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 29 04:05:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 04:05:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 04:05:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572346811, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd57c0/0x51ab3c4ee69a07f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad7dffac50b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 04:05:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 04:10:24 sh-103-53.int kernel: LustreError: 435668:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc2afd40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 04:10:24 sh-103-53.int kernel: LustreError: 435668:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 04:10:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 04:10:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 04:12:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 04:12:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages Oct 29 04:14:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 04:14:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 68 previous similar messages Oct 29 04:14:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 04:14:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 10 previous similar messages Oct 29 04:15:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 04:15:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 04:15:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572347434, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fc699f80/0x51ab3c4ee69a0b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad7fb64e984 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 04:15:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 04:20:43 sh-103-53.int kernel: LustreError: 436246:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcda40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 04:20:43 sh-103-53.int kernel: LustreError: 436246:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 04:20:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 04:20:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 04:22:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 04:22:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages Oct 29 04:24:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 04:24:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 70 previous similar messages Oct 29 04:25:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 04:25:46 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 29 04:25:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 04:25:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 04:25:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572348053, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52880/0x51ab3c4ee69a0ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad8152932e4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 04:25:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 04:31:06 sh-103-53.int kernel: LustreError: 436831:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 04:31:06 sh-103-53.int kernel: LustreError: 436831:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 04:31:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 04:31:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 04:33:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 04:33:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages Oct 29 04:34:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 04:34:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 29 04:36:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 04:36:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 04:36:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572348678, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52d00/0x51ab3c4ee69a127 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad82e2dcf71 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 04:36:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 04:37:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 04:37:43 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 7 previous similar messages Oct 29 04:38:04 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Oct 29 04:38:04 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.23@o2ib4 (6): c: 0, oc: 0, rc: 8 Oct 29 04:38:06 sh-103-53.int kernel: Lustre: 91129:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572349078/real 1572349078] req@ffff976a0ec54c80 x1648382097832640/t0(0) o400->fir-OST0009-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572349085 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 04:38:06 sh-103-53.int kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572349078/real 1572349078] req@ffff976a0c436300 x1648382097832880/t0(0) o400->fir-OST0018-osc-ffff9781f2230800@10.0.10.105@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572349085 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 04:38:06 sh-103-53.int kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 9 previous similar messages Oct 29 04:38:06 sh-103-53.int kernel: Lustre: fir-OST0018-osc-ffff9781f2230800: Connection to fir-OST0018 (at 10.0.10.105@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 04:38:06 sh-103-53.int kernel: Lustre: Skipped 28 previous similar messages Oct 29 04:38:06 sh-103-53.int kernel: Lustre: 91129:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Oct 29 04:38:13 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 04:38:42 sh-103-53.int kernel: Lustre: 91119:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent 
has timed out for slow reply: [sent 1572349078/real 1572349078] req@ffff976a0c432400 x1648382097833056/t0(0) o400->fir-OST0023-osc-ffff9781f2230800@10.0.10.106@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572349122 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 04:38:42 sh-103-53.int kernel: Lustre: 91119:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Oct 29 04:38:42 sh-103-53.int kernel: Lustre: fir-OST0023-osc-ffff9781f2230800: Connection to fir-OST0023 (at 10.0.10.106@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 04:38:43 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Oct 29 04:41:30 sh-103-53.int kernel: LustreError: 437416:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b43c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 04:41:30 sh-103-53.int kernel: LustreError: 437416:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 04:41:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 04:41:30 sh-103-53.int kernel: Lustre: Skipped 8 previous similar messages Oct 29 04:43:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 04:43:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages Oct 29 04:44:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 29 04:44:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 124 previous similar messages Oct 29 04:46:18 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 04:46:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 04:46:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 04:46:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572349302, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf53a80/0x51ab3c4ee69a15f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad845f0f74f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 04:46:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 04:51:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 04:51:55 sh-103-53.int kernel: LustreError: 438002:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5500) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 04:51:55 sh-103-53.int kernel: LustreError: 438002:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 04:51:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 04:51:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 04:53:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 04:53:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 04:54:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 29 04:54:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 156 previous similar messages Oct 29 04:57:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 04:57:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 04:57:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572349923, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55a00/0x51ab3c4ee69a197 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad85e3d56ed expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 04:57:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 05:02:15 sh-103-53.int kernel: LustreError: 438611:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4d80) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 05:02:15 sh-103-53.int kernel: LustreError: 438611:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 05:02:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 05:02:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 05:03:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 05:03:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 05:04:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 05:04:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 137 previous similar messages Oct 29 05:07:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 05:07:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 05:07:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572350547, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf50fc0/0x51ab3c4ee69a1cf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad8759f44ae expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 05:07:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 05:12:39 sh-103-53.int kernel: LustreError: 439197:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 05:12:39 sh-103-53.int kernel: LustreError: 439197:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 05:12:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 05:12:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 05:13:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 05:13:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 05:14:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 29 05:14:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 149 previous similar messages Oct 29 05:17:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 05:17:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 05:17:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572351175, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52640/0x51ab3c4ee69a207 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad8936f5f6f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 05:17:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 05:23:07 sh-103-53.int kernel: LustreError: 
439784:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 05:23:07 sh-103-53.int kernel: LustreError: 439784:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 05:23:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 05:23:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 05:23:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 05:23:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 05:25:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 29 05:25:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 140 previous similar messages Oct 29 05:28:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 05:28:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 05:28:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572351795, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55100/0x51ab3c4ee69a23f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad8b5e0553b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 05:28:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 29 05:33:24 sh-103-53.int kernel: LustreError: 440365:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b5a40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 05:33:24 sh-103-53.int kernel: LustreError: 440365:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 05:33:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 05:33:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 05:33:58 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 05:33:58 sh-103-53.int kernel: LNetError: 408127:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 29 05:34:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 29 05:34:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 05:35:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 29 05:35:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 129 previous similar messages Oct 29 05:38:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 05:38:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 05:38:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572352417, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea75580/0x51ab3c4ee69a277 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad8d217c10e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 05:38:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 05:43:51 sh-103-53.int kernel: LustreError: 440952:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0e346300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 05:43:51 sh-103-53.int kernel: LustreError: 440952:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 05:43:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 05:43:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 05:44:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 29 05:44:14 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 05:45:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 05:45:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 139 previous similar messages Oct 29 05:49:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 05:49:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 05:49:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572353045, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea733c0/0x51ab3c4ee69a2af lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad8eebf87d3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 05:49:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 05:54:18 sh-103-53.int kernel: LustreError: 441541:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0e3463c0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 05:54:18 sh-103-53.int kernel: LustreError: 441541:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 05:54:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 05:54:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 05:54:24 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 05:54:24 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 28 previous similar messages Oct 29 05:55:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 1 seconds Oct 29 05:55:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 146 previous similar messages Oct 29 05:59:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 05:59:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 05:59:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572353669, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93a9200/0x51ab3c4ee69a2e7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad910781d86 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 05:59:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 05:59:31 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 
12345-10.9.0.23@o2ib4: -125 Oct 29 06:04:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 189 Oct 29 06:04:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 06:04:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 06:04:44 sh-103-53.int kernel: LustreError: 442144:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1ccc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 06:04:44 sh-103-53.int kernel: LustreError: 442144:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 06:04:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 06:04:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 06:05:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 06:05:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 141 previous similar messages Oct 29 06:09:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 06:09:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 06:09:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572354298, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93aaac0/0x51ab3c4ee69a31f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 
0x1000000000000 nid: local remote: 0xc9be2ad9302967e6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 06:09:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 06:14:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 29 06:14:44 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 06:15:11 sh-103-53.int kernel: LustreError: 442742:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1c6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 06:15:11 sh-103-53.int kernel: LustreError: 442742:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 06:15:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 06:15:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 06:15:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 7 seconds Oct 29 06:15:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 147 previous similar messages Oct 29 06:20:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 06:20:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 06:20:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572354922, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93a9680/0x51ab3c4ee69a357 lrc: 4/1,0 mode: 
--/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad950e18da8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 06:20:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 06:22:58 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Oct 29 06:22:58 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.21@o2ib4 (6): c: 0, oc: 0, rc: 8 Oct 29 06:23:00 sh-103-53.int kernel: Lustre: 91123:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572355372/real 1572355372] req@ffff976a12e4ad00 x1648382098935104/t0(0) o400->fir-OST0001-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572355379 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 06:23:00 sh-103-53.int kernel: Lustre: fir-OST0015-osc-ffff9781f2230800: Connection to fir-OST0015 (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 06:23:00 sh-103-53.int kernel: Lustre: 91123:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 10 previous similar messages Oct 29 06:23:36 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572355372/real 1572355372] req@ffff975ab12de300 x1648382098935456/t0(0) o400->fir-OST0017-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572355416 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 06:23:36 sh-103-53.int kernel: Lustre: fir-OST0017-osc-ffff9781f2230800: Connection to fir-OST0017 (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 06:23:36 sh-103-53.int kernel: Lustre: Skipped 10 previous similar messages Oct 29 06:24:54 sh-103-53.int 
kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 29 06:24:54 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 28 previous similar messages Oct 29 06:25:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 06:25:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 193 previous similar messages Oct 29 06:25:31 sh-103-53.int kernel: LustreError: 443326:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9762f3b1d680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 06:25:31 sh-103-53.int kernel: LustreError: 443326:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 06:25:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 06:25:31 sh-103-53.int kernel: Lustre: Skipped 13 previous similar messages Oct 29 06:30:01 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 06:30:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 06:30:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 06:30:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572355541, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9a880/0x51ab3c4ee69a38f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad9741752c6 
expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 06:30:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 06:35:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 29 06:35:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 29 06:35:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 29 06:35:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 237 previous similar messages Oct 29 06:35:48 sh-103-53.int kernel: LustreError: 443907:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe998c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 06:35:48 sh-103-53.int kernel: LustreError: 443907:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 06:35:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 06:35:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 06:41:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 06:41:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 06:41:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572356160, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea1680/0x51ab3c4ee69a3c7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad9943170e1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 06:41:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 06:45:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 29 06:45:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 29 06:45:15 sh-103-53.int kernel: LNetError: 443743:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 101 Oct 29 06:45:16 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 06:45:17 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 06:45:17 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 06:45:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 06:45:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 237 previous similar messages Oct 29 06:46:15 sh-103-53.int kernel: LustreError: 444495:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 06:46:15 sh-103-53.int kernel: LustreError: 444495:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 06:46:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 06:46:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 06:51:21 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 06:51:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 06:51:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 06:51:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572356794, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: 
MGC10.0.10.51@o2ib7 lock: ffff976a0fea06c0/0x51ab3c4ee69a3ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad9b812afb5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 06:51:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 06:52:15 sh-103-53.int kernel: LNetError: 444306:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 06:55:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 29 06:55:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 29 06:55:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 8 seconds Oct 29 06:55:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 212 previous similar messages Oct 29 06:56:40 sh-103-53.int kernel: LNetError: 444306:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 44 Oct 29 06:56:46 sh-103-53.int kernel: LustreError: 445088:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feed080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 06:56:46 sh-103-53.int kernel: LustreError: 445088:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 06:56:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 06:56:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 07:02:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 07:02:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 07:02:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572357420, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754dbef6540/0x51ab3c4ee69a437 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad9da2b8d8e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 07:02:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 07:05:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. 
Health = 900 Oct 29 07:05:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 19 previous similar messages Oct 29 07:05:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 3 seconds Oct 29 07:05:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 235 previous similar messages Oct 29 07:07:10 sh-103-53.int kernel: LustreError: 445700:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781feea75c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 07:07:10 sh-103-53.int kernel: LustreError: 445700:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 07:07:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 07:07:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 07:12:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 07:12:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 07:12:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572358041, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd5a00/0x51ab3c4ee69a46f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ad9fb2fc5b9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 07:12:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 07:15:45 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 07:15:45 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 07:15:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 07:15:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 238 previous similar messages Oct 29 07:17:31 sh-103-53.int kernel: LustreError: 446284:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4fc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 07:17:31 sh-103-53.int kernel: LustreError: 446284:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 07:17:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 07:17:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 07:22:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 07:22:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 07:22:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572358662, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57080/0x51ab3c4ee69a4a7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada17d21916 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 07:22:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 29 07:25:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 07:25:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 222 previous similar messages Oct 29 07:25:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 07:25:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 07:27:51 sh-103-53.int kernel: LustreError: 446867:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820c3b4c00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 07:27:51 sh-103-53.int kernel: LustreError: 446867:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 07:27:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 07:27:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 07:33:04 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 07:33:04 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 07:33:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572359284, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea757c0/0x51ab3c4ee69a4df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada39946af3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 07:33:04 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 07:35:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 17 seconds Oct 29 07:35:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 227 previous similar messages Oct 29 07:36:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 07:36:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 07:38:18 sh-103-53.int kernel: LustreError: 447459:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0e346fc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 07:38:18 sh-103-53.int kernel: LustreError: 447459:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 07:38:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 07:38:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 07:43:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 07:43:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 07:43:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572359909, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea75340/0x51ab3c4ee69a517 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada4d757557 expref: -99 pid: 91159 timeout: 0 
lvb_type: 0 Oct 29 07:43:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 07:46:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 29 07:46:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 254 previous similar messages Oct 29 07:46:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 07:46:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 07:48:35 sh-103-53.int kernel: LustreError: 448041:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0e346d80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 07:48:35 sh-103-53.int kernel: LustreError: 448041:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 07:48:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 07:48:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 07:53:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 07:53:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 07:53:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572360524, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea74ec0/0x51ab3c4ee69a54f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local 
remote: 0xc9be2ada517fd475 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 07:53:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 07:56:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 07:56:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 230 previous similar messages Oct 29 07:56:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 07:56:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 07:58:57 sh-103-53.int kernel: LustreError: 448624:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0e3463c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 07:58:57 sh-103-53.int kernel: LustreError: 448624:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 07:58:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 07:58:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 08:01:21 sh-103-53.int kernel: LNetError: 447739:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 29 08:04:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 08:04:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 08:04:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572361152, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea71440/0x51ab3c4ee69a587 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada55cd09a7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 08:04:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 08:06:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 3 seconds Oct 29 08:06:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 238 previous similar messages Oct 29 08:06:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 08:06:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 08:09:27 sh-103-53.int kernel: LustreError: 449233:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0e347980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 08:09:27 sh-103-53.int kernel: LustreError: 449233:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 08:09:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 08:09:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 08:14:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 08:14:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 08:14:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572361777, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df98900/0x51ab3c4ee69a5bf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada59e4a558 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 08:14:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 08:16:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 10 seconds Oct 29 08:16:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 239 previous similar messages Oct 29 08:16:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 08:16:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 08:19:47 sh-103-53.int kernel: LustreError: 449816:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe983c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 08:19:47 sh-103-53.int kernel: LustreError: 449816:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 08:19:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 08:19:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 08:24:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 08:24:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 08:24:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572362396, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9aac0/0x51ab3c4ee69a5f7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada5e257013 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 08:24:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 08:26:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 08:26:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 236 previous similar messages Oct 29 08:26:56 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 08:26:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 08:27:58 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 08:27:58 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 2 previous similar messages Oct 29 08:30:07 sh-103-53.int kernel: LustreError: 450414:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe99440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 08:30:07 sh-103-53.int kernel: LustreError: 450414:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 08:30:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 08:30:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 08:35:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 08:35:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 08:35:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572363022, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9c800/0x51ab3c4ee69a62f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada626ccfd6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 08:35:22 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 08:36:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 4 seconds Oct 29 08:36:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 222 previous similar messages Oct 29 08:37:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 08:37:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 08:40:34 sh-103-53.int kernel: LustreError: 451004:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe98cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 08:40:34 sh-103-53.int kernel: LustreError: 451004:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 08:40:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 08:40:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 08:43:12 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 08:43:12 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 08:45:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 08:45:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 08:45:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 
1572363643, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df98480/0x51ab3c4ee69a667 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada6679ce99 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 08:45:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 08:46:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 3 seconds Oct 29 08:46:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 249 previous similar messages Oct 29 08:47:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 08:47:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 08:50:53 sh-103-53.int kernel: LustreError: 451586:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe98900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 08:50:53 sh-103-53.int kernel: LustreError: 451586:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 08:50:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 08:50:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 08:56:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 08:56:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 08:56:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572364268, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9bf00/0x51ab3c4ee69a69f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada6adc5b44 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 08:56:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 08:56:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 29 08:56:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 253 previous similar messages Oct 29 08:57:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 08:57:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 09:01:21 sh-103-53.int kernel: LustreError: 452191:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe98240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 09:01:21 sh-103-53.int kernel: LustreError: 452191:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 09:01:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 09:01:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 09:03:32 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 09:03:33 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 09:03:33 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 09:06:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 09:06:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 09:06:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572364893, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9d100/0x51ab3c4ee69a6d7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada6f28ea7c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 09:06:33 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 09:06:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 29 09:06:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 235 previous similar messages Oct 29 09:07:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 09:07:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 09:11:42 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 09:11:46 sh-103-53.int kernel: LustreError: 452780:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe98840) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 09:11:46 sh-103-53.int kernel: LustreError: 452780:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 09:11:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 09:11:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 09:16:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 09:16:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 247 previous similar messages Oct 29 09:17:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 09:17:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 09:17:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572365520, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97699df9ba80/0x51ab3c4ee69a70f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada738e49e0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 09:17:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 09:17:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 09:17:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 29 09:22:10 sh-103-53.int kernel: LustreError: 453380:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0fe983c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 09:22:10 sh-103-53.int kernel: LustreError: 453380:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 09:22:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 09:22:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 09:26:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 2 seconds Oct 29 09:26:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 239 previous similar messages Oct 29 09:27:17 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 09:27:17 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 09:27:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572366137, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea1680/0x51ab3c4ee69a747 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada77dffa32 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 09:27:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 09:27:57 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 09:27:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 09:32:31 sh-103-53.int kernel: LustreError: 453963:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feedc80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 09:32:31 sh-103-53.int kernel: LustreError: 453963:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 09:32:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 09:32:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 09:36:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 09:36:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 242 previous similar messages Oct 29 09:37:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 09:37:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 09:37:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572366763, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea0240/0x51ab3c4ee69a77f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada7c38e4fa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 09:37:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 29 09:38:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 29 09:38:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 09:42:56 sh-103-53.int kernel: LustreError: 454569:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 09:42:56 sh-103-53.int kernel: LustreError: 454569:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 09:42:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 09:42:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 09:46:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 7 seconds Oct 29 09:46:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 237 previous similar messages Oct 29 09:48:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 09:48:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 09:48:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572367389, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea4ec0/0x51ab3c4ee69a7b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada809b4f3f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 09:48:09 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 09:48:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 09:48:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 29 09:53:18 sh-103-53.int kernel: LustreError: 455153:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6feec780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 09:53:18 sh-103-53.int kernel: LustreError: 455153:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 09:53:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 09:53:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 09:54:13 sh-103-53.int kernel: LNetError: 450899:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 09:54:26 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 09:56:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 1 seconds Oct 29 09:56:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 235 previous similar messages Oct 29 09:58:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 09:58:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 09:58:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572368006, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea1b00/0x51ab3c4ee69a7ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada84f64e85 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 09:58:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 09:58:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 09:58:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 29 10:03:40 sh-103-53.int kernel: LustreError: 456124:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97742fa02cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 10:03:40 sh-103-53.int kernel: LustreError: 456124:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 10:03:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 10:03:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 10:04:34 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 10:07:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 10:07:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 241 previous similar messages Oct 29 10:08:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 10:08:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 10:08:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 10:08:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 10:08:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572368626, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c3cc0/0x51ab3c4ee69a90e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada88edcebc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 10:08:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 10:11:39 sh-103-53.int 
kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 10:11:39 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 10:13:57 sh-103-53.int kernel: LustreError: 456966:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820068e900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 10:13:57 sh-103-53.int kernel: LustreError: 456966:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 10:13:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 10:13:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 10:15:45 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 10:17:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 10:17:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 217 previous similar messages Oct 29 10:18:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 10:18:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 10:19:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 10:19:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 10:19:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572369252, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c4c80/0x51ab3c4ee69a946 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada8d272d98 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 10:19:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 10:21:49 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 10:24:24 sh-103-53.int kernel: LustreError: 457819:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771237ecc00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 10:24:24 sh-103-53.int kernel: LustreError: 457819:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 10:24:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 10:24:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 10:27:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 10:27:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 218 previous similar messages Oct 29 10:28:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 10:28:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 10:29:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 10:29:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 10:29:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572369878, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660af40/0x51ab3c4ee69a97e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada917d4450 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 10:29:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 10:34:58 sh-103-53.int kernel: LustreError: 458680:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978214a8ab40) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 10:34:58 sh-103-53.int kernel: LustreError: 458680:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 10:34:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 10:34:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 10:37:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 10:37:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 227 previous similar messages Oct 29 10:39:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 10:39:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 10:40:17 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 10:40:17 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 10:40:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572370517, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d615c40/0x51ab3c4ee69a9b6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada9596fa59 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 10:40:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 10:45:30 sh-103-53.int kernel: LustreError: 1087:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff9779d260ab40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 10:45:30 sh-103-53.int kernel: LustreError: 1087:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 10:45:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 10:45:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 10:47:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 3 seconds Oct 29 10:47:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 235 previous similar messages Oct 29 10:48:32 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 10:49:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 10:49:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 10:50:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 10:50:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 10:50:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572371142, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626b180/0x51ab3c4ee69a9ee lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada997ceb0f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 10:50:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 10:52:19 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 10:55:48 sh-103-53.int kernel: LustreError: 2041:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782003b4e40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 10:55:48 sh-103-53.int kernel: LustreError: 2041:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 10:55:48 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 10:55:48 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 10:57:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 10:57:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 227 previous similar messages Oct 29 10:59:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 10:59:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 11:00:29 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 11:01:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 11:01:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 11:01:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572371763, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660e0c0/0x51ab3c4ee69aa3b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ada9d85c4aa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 11:01:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 11:04:55 sh-103-53.int 
kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Oct 29 11:04:55 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.22@o2ib4 (6): c: 0, oc: 0, rc: 8 Oct 29 11:04:56 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1572372289/real 0] req@ffff977e7d27ec00 x1648382102048656/t0(0) o400->fir-OST004f-osc-ffff9781f2230800@10.0.10.114@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572372296 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 11:04:56 sh-103-53.int kernel: Lustre: fir-OST000f-osc-ffff9781f2230800: Connection to fir-OST000f (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 11:04:56 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 34 previous similar messages Oct 29 11:05:00 sh-103-53.int kernel: Lustre: 91132:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572372289/real 1572372289] req@ffff978215481680 x1648382102047344/t0(0) o400->fir-MDT0001-mdc-ffff9781f2230800@10.0.10.52@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572372300 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 11:05:00 sh-103-53.int kernel: Lustre: fir-MDT0001-mdc-ffff9781f2230800: Connection to fir-MDT0001 (at 10.0.10.52@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 11:05:00 sh-103-53.int kernel: Lustre: Skipped 35 previous similar messages Oct 29 11:05:33 sh-103-53.int kernel: Lustre: 91133:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572372289/real 1572372289] req@ffff978215483600 x1648382102047360/t0(0) o400->fir-MDT0002-mdc-ffff9781f2230800@10.0.10.53@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572372333 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 11:05:33 
sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection to fir-MDT0002 (at 10.0.10.53@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 11:05:47 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572372289/real 1572372289] req@ffff978215480d80 x1648382102047312/t0(0) o400->MGC10.0.10.51@o2ib7@10.0.10.51@o2ib7:26/25 lens 224/224 e 0 to 1 dl 1572372347 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 11:05:51 sh-103-53.int kernel: LustreError: 2694:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978214a8afc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 11:05:51 sh-103-53.int kernel: LustreError: 2694:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 11:05:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 11:05:51 sh-103-53.int kernel: Lustre: Skipped 39 previous similar messages Oct 29 11:07:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 11:07:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 254 previous similar messages Oct 29 11:09:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.24@o2ib4 added to recovery queue. 
Health = 900 Oct 29 11:09:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 33 previous similar messages Oct 29 11:11:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 11:11:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 11:11:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572372388, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c6e40/0x51ab3c4ee69aa73 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adaa159cf2a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 11:16:46 sh-103-53.int kernel: LustreError: 3626:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978214a8b2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 11:16:46 sh-103-53.int kernel: LustreError: 3626:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 11:16:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 11:16:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 11:17:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 19 seconds Oct 29 11:17:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 222 previous similar messages Oct 29 11:19:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 11:19:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 11:22:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 11:22:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 11:22:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572373027, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d611200/0x51ab3c4ee69aaab lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adaa5913c65 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 11:22:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 11:27:17 sh-103-53.int kernel: LustreError: 4492:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774413cf800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 11:27:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 11:27:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 11:27:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 4 seconds Oct 29 11:27:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 215 previous similar messages Oct 29 11:29:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 11:29:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 11:30:00 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 11:32:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 11:32:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 11:32:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572373644, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c7740/0x51ab3c4ee69aaea lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adaa8b2953d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 11:32:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 11:37:34 sh-103-53.int kernel: LustreError: 5347:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978214a8a3c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 11:37:34 sh-103-53.int kernel: LustreError: 5347:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 11:37:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 11:37:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 11:37:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 4 seconds Oct 29 11:37:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 204 previous similar messages Oct 29 11:40:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 11:40:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 11:41:55 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 11:41:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 11:42:10 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 11:42:16 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 11:42:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 11:42:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 11:42:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 11:42:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572374268, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663ad00/0x51ab3c4ee69ab22 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adaac112a1e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 11:42:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 11:47:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 11:47:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 220 previous similar messages Oct 29 11:47:59 sh-103-53.int kernel: LustreError: 6297:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9779d260b500) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 11:47:59 sh-103-53.int kernel: LustreError: 6297:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 11:47:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 11:47:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 11:50:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 11:50:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 11:53:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 11:53:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 11:53:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572374895, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febb840/0x51ab3c4ee69aba7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adaaf6b8922 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 11:53:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 11:57:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 11:57:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 228 previous similar messages Oct 29 11:58:33 sh-103-53.int kernel: LustreError: 7174:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977adfe343c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 11:58:33 sh-103-53.int kernel: LustreError: 7174:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 11:58:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 11:58:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 12:00:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.24@o2ib4 added to recovery queue. Health = 900 Oct 29 12:00:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 12:03:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 12:03:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 12:03:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572375532, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660da00/0x51ab3c4ee69abdf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adabf6435e4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 12:03:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 12:07:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 12:07:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 221 previous similar messages Oct 29 12:08:35 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 12:09:04 sh-103-53.int 
kernel: LustreError: 8071:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fbe0a540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 12:09:04 sh-103-53.int kernel: LustreError: 8071:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 12:09:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 12:09:04 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 12:09:35 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 12:10:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 12:10:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 12:10:39 sh-103-53.int kernel: LNetError: 6416:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 12:11:19 sh-103-53.int kernel: LNetError: 6416:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 19 Oct 29 12:14:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 12:14:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 12:14:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572376158, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93af740/0x51ab3c4ee69ac17 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adad7103405 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 12:14:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 12:17:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 29 12:17:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 218 previous similar messages Oct 29 12:19:31 sh-103-53.int kernel: LustreError: 8927:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fbe0b680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 12:19:31 sh-103-53.int kernel: LustreError: 8927:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 12:19:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 12:19:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 12:20:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 12:20:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 12:24:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 12:24:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 12:24:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572376783, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff975723edb840/0x51ab3c4ee69ac4f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adaee7cb997 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 12:24:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 12:28:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 12:28:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 229 previous similar messages Oct 29 12:29:54 sh-103-53.int kernel: LustreError: 9592:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977107791080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 12:29:54 sh-103-53.int kernel: LustreError: 9592:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 12:29:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 12:29:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 12:30:14 sh-103-53.int kernel: LNetError: 6416:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 16 Oct 29 12:30:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 12:30:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 12:35:00 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 12:35:04 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 12:35:04 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 12:35:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572377404, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea60c0/0x51ab3c4ee69ac87 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adb07f2ccff expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 12:35:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 12:35:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery 
queue. Health = 900 Oct 29 12:35:54 sh-103-53.int kernel: LNetError: 6416:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 12:36:00 sh-103-53.int kernel: LNetError: 6416:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 12:36:09 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 12:36:09 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 12:36:24 sh-103-53.int kernel: LNetError: 6416:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 12:38:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 12:38:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 220 previous similar messages Oct 29 12:39:47 sh-103-53.int kernel: Lustre: 91129:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572377980/real 1572377980] req@ffff97820ae7f080 x1648382103197600/t0(0) o400->oak-OST007f-osc-ffff9781f565a800@10.0.2.109@o2ib5:28/4 lens 224/224 e 0 to 1 dl 1572377987 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 12:39:47 sh-103-53.int kernel: Lustre: oak-OST008e-osc-ffff9781f565a800: Connection to oak-OST008e (at 10.0.2.109@o2ib5) was lost; in progress operations using this service will wait for recovery to complete Oct 29 12:39:47 sh-103-53.int kernel: Lustre: 91129:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 20 previous similar messages Oct 29 12:40:11 sh-103-53.int kernel: LustreError: 10564:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976b4da475c0) refcount nonzero (1) after lock 
cleanup; forcing cleanup. Oct 29 12:40:11 sh-103-53.int kernel: LustreError: 10564:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 12:40:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 12:40:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 12:41:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 12:41:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 12:45:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 12:45:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 12:45:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572378022, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269200/0x51ab3c4ee69acd4 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adb1fac6db8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 12:45:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 12:48:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 12:48:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 218 previous similar messages Oct 29 12:49:16 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 
12:50:31 sh-103-53.int kernel: LustreError: 11411:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774416a2000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 12:50:31 sh-103-53.int kernel: LustreError: 11411:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 12:50:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 12:50:31 sh-103-53.int kernel: Lustre: Skipped 25 previous similar messages Oct 29 12:51:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 12:51:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 12:55:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 12:55:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 12:55:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572378642, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0fbc0/0x51ab3c4ee69ad0c lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adb3a3432dc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 12:55:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 12:58:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 12:58:17 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 226 previous similar messages Oct 29 12:58:29 sh-103-53.int kernel: LNetError: 6416:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 13:00:56 sh-103-53.int kernel: LustreError: 12273:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711e7dafc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 13:00:56 sh-103-53.int kernel: LustreError: 12273:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 13:00:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 13:00:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 13:01:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 13:01:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 13:06:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 13:06:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 13:06:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572379273, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d6121c0/0x51ab3c4ee69ad44 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adb4e21b539 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 13:06:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 13:08:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 13:08:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 218 previous similar messages Oct 29 13:11:27 sh-103-53.int kernel: LustreError: 13161:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771237ec000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 13:11:27 sh-103-53.int kernel: LustreError: 13161:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 13:11:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 13:11:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 13:11:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 13:11:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 13:16:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 13:16:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 13:16:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572379905, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0f080/0x51ab3c4ee69ad7c lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adb667d575c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 13:16:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 13:18:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 4 seconds Oct 29 13:18:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 226 previous similar messages Oct 29 13:21:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 13:21:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 13:21:50 sh-103-53.int kernel: LNetError: 6416:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 200 Oct 29 13:22:01 sh-103-53.int kernel: LustreError: 14021:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e14e40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 13:22:01 sh-103-53.int kernel: LustreError: 14021:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 13:22:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 13:22:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 13:27:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 13:27:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 13:27:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572380532, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0af40/0x51ab3c4ee69adb4 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adb825ea7b8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 13:27:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 13:28:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 6 seconds Oct 29 13:28:22 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 217 previous similar messages Oct 29 13:29:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 13:29:20 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 13:29:26 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 13:29:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 13:29:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 13:29:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 29 13:31:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 13:31:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 29 13:32:20 sh-103-53.int kernel: LustreError: 14753:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15a40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 13:32:20 sh-103-53.int kernel: LustreError: 14753:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 13:32:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 13:32:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 13:37:02 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 13:37:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 13:37:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 13:37:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572381149, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0d340/0x51ab3c4ee69ae39 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adb9ca537e2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 13:37:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 13:38:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 13:38:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 221 previous similar messages Oct 29 13:42:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 13:42:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 13:42:44 sh-103-53.int kernel: LustreError: 15598:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977df0a33ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 13:42:44 sh-103-53.int kernel: LustreError: 15598:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 13:42:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 13:42:44 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 13:47:11 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 13:47:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 13:47:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 13:47:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572381779, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbb840/0x51ab3c4ee69ae71 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adbba71b878 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 13:48:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 13:48:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 13:48:31 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 222 previous similar messages Oct 29 13:52:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 13:52:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 13:52:16 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 13:53:15 sh-103-53.int kernel: LustreError: 16456:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820d8e8600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 13:53:15 sh-103-53.int kernel: LustreError: 16456:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 13:53:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 13:53:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 13:58:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 13:58:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 13:58:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572382408, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb08fc0/0x51ab3c4ee69aea9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adbd4a60bbe expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 13:58:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) 
Skipped 1 previous similar message Oct 29 13:58:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 13:58:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 216 previous similar messages Oct 29 14:02:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 14:02:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 14:03:41 sh-103-53.int kernel: LustreError: 17328:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206e15ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 14:03:41 sh-103-53.int kernel: LustreError: 17328:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 14:03:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 14:03:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 14:08:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 14 seconds Oct 29 14:08:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 232 previous similar messages Oct 29 14:08:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 14:08:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 14:08:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572383033, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 
lock: ffff97783feb8fc0/0x51ab3c4ee69aee1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adbefaf5cea expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 14:08:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 14:12:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 14:12:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 29 14:14:09 sh-103-53.int kernel: LustreError: 18186:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bfbb729c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 14:14:09 sh-103-53.int kernel: LustreError: 18186:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 14:14:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 14:14:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 14:18:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 29 14:18:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 222 previous similar messages Oct 29 14:19:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 14:19:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 14:19:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572383659, 300s ago), entering 
recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb09440/0x51ab3c4ee69af19 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adbf55d470e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 14:19:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 14:22:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 14:22:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 14:24:03 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 14:24:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 14:24:15 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 2 previous similar messages Oct 29 14:24:19 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 14:24:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 14:24:30 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 29 14:24:38 sh-103-53.int kernel: LustreError: 19084:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bfbb73980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 14:24:38 sh-103-53.int kernel: LustreError: 19084:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 14:24:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 14:24:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 14:28:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 29 14:28:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 226 previous similar messages Oct 29 14:29:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 14:29:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 14:29:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572384289, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0b600/0x51ab3c4ee69af6d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adbfa925620 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 14:29:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 14:32:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 14:32:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 14:34:56 sh-103-53.int kernel: LustreError: 19902:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711e72ca80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 14:34:56 sh-103-53.int kernel: LustreError: 19902:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 14:34:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 14:34:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 14:36:01 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 14:36:01 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 14:38:03 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 14:39:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 14:39:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 242 previous similar messages Oct 29 14:40:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 14:40:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 14:40:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572384909, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 
lock: ffff977e6cb0b3c0/0x51ab3c4ee69afa5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adbfefeab6d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 14:40:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 14:41:06 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 14:43:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 14:43:10 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 14:45:27 sh-103-53.int kernel: LustreError: 20572:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711e72cc00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 14:45:27 sh-103-53.int kernel: LustreError: 20572:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 14:45:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 14:45:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 14:49:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 14:49:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 209 previous similar messages Oct 29 14:50:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 14:50:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 14:50:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572385541, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c2a8480/0x51ab3c4ee69afdd lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adc028ca0ae expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 14:50:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 14:52:11 sh-103-53.int kernel: LNetError: 21176:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 14:52:11 sh-103-53.int kernel: LNetError: 21176:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 29 14:53:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 14:53:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 14:53:17 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 14:54:45 sh-103-53.int kernel: LNetError: 21176:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 14:56:05 sh-103-53.int kernel: LustreError: 21430:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782186a9bc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 14:56:05 sh-103-53.int kernel: LustreError: 21430:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 14:56:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 14:56:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 14:59:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 14:59:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 223 previous similar messages Oct 29 15:01:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 15:01:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 15:01:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572386176, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769d93af980/0x51ab3c4ee69b015 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local 
remote: 0xc9be2adc052f04a2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 15:01:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 15:03:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 15:03:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 15:06:25 sh-103-53.int kernel: LustreError: 22299:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fbe0b680) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 15:06:25 sh-103-53.int kernel: LustreError: 22299:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 15:06:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 15:06:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 15:09:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 15:09:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 216 previous similar messages Oct 29 15:11:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 15:11:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 15:11:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572386797, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff975723ed9680/0x51ab3c4ee69b04d lrc: 4/1,0 mode: --/CR res: 
[0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adc07387d86 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 15:11:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 15:13:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 15:13:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 15:13:55 sh-103-53.int kernel: LNetError: 22289:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 15:16:05 sh-103-53.int kernel: LNetError: 21176:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 15:16:51 sh-103-53.int kernel: LustreError: 23173:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978218d04480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 15:16:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 15:16:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 15:17:41 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 15:17:42 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 15:17:44 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 15:17:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 15:18:06 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 15:18:41 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 15:19:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 7 seconds Oct 29 15:19:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 217 previous similar messages Oct 29 15:21:39 sh-103-53.int kernel: LNetError: 21176:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 15:21:39 sh-103-53.int kernel: LNetError: 21176:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 29 15:22:04 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 15:22:04 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 15:22:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572387424, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea70480/0x51ab3c4ee69b08c lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adc11a1c5f9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 15:22:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 15:23:50 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.24@o2ib4 added to recovery queue. Health = 900 Oct 29 15:23:50 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 29 15:27:13 sh-103-53.int kernel: LustreError: 24094:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0e346840) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 15:27:13 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 15:27:13 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 15:28:51 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 15:29:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 15:29:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 212 previous similar messages Oct 29 15:32:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 15:32:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 15:32:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572388043, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97602ea77bc0/0x51ab3c4ee69b111 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adc23cccc36 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 15:32:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 
previous similar message Oct 29 15:32:56 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 15:33:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 15:33:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 29 15:33:56 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 15:37:38 sh-103-53.int kernel: LustreError: 24759:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97699dfb5380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 15:37:38 sh-103-53.int kernel: LustreError: 24759:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 15:37:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 15:37:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 15:39:01 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 15:39:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 15:39:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 230 previous similar messages Oct 29 15:42:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 15:42:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 15:42:49 sh-103-53.int kernel: 
LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572388668, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d98259f80/0x51ab3c4ee69b149 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adc379fe128 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 15:42:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 15:44:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 15:44:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 29 15:48:04 sh-103-53.int kernel: LustreError: 25607:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e7a27a900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 15:48:04 sh-103-53.int kernel: LustreError: 25607:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 15:48:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 15:48:04 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 15:49:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 2 seconds Oct 29 15:49:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 223 previous similar messages Oct 29 15:53:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 15:53:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 15:53:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572389302, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d611b00/0x51ab3c4ee69b181 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adc5018f660 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 15:53:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 15:54:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 15:54:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 15:58:35 sh-103-53.int kernel: LustreError: 26485:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978202e2d380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 15:58:35 sh-103-53.int kernel: LustreError: 26485:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 15:58:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 15:58:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 15:59:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 15:59:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 222 previous similar messages Oct 29 16:00:23 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 16:01:56 sh-103-53.int kernel: LNetError: 25243:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 197 Oct 29 16:03:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 16:03:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 16:03:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572389925, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0f980/0x51ab3c4ee69b1c0 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adc6b3d27ac expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 16:03:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 16:04:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 16:04:31 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 16:09:08 sh-103-53.int kernel: LustreError: 27355:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97821199a9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 16:09:08 sh-103-53.int kernel: LustreError: 27355:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 16:09:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 16:09:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 16:09:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 16:09:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 229 previous similar messages Oct 29 16:11:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 16:11:41 sh-103-53.int kernel: LNetError: 24430:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 16:11:57 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 16:11:57 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 29 16:14:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 16:14:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 16:14:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572390564, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d98258b40/0x51ab3c4ee69b1ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adc8743dc49 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 16:14:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 16:14:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 16:14:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 16:16:37 sh-103-53.int kernel: LNetError: 24430:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 29 16:16:37 sh-103-53.int kernel: LNetError: 24430:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 29 16:19:37 sh-103-53.int kernel: LustreError: 28326:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97821199bec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 16:19:37 sh-103-53.int kernel: LustreError: 28326:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 16:19:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 16:19:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 16:19:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 16:19:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 229 previous similar messages Oct 29 16:19:42 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 16:24:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 16:24:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 16:24:46 sh-103-53.int kernel: LNetError: 24430:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 29 16:24:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 16:24:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 16:24:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572391187, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d9825ba80/0x51ab3c4ee69b253 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adca5054498 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 16:24:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 16:24:47 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 16:24:48 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 16:24:48 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 16:27:52 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 16:29:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 7 seconds Oct 29 16:29:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 214 previous similar messages Oct 29 16:30:09 sh-103-53.int kernel: LustreError: 29191:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9779d1b60f00) refcount nonzero (1) after lock cleanup; 
forcing cleanup. Oct 29 16:30:09 sh-103-53.int kernel: LustreError: 29191:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 16:30:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 16:30:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 16:34:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 16:34:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 29 16:35:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 16:35:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 16:35:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572391820, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76268b40/0x51ab3c4ee69b28b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adcc2b5ea6c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 16:35:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 16:39:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 16:39:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 229 previous similar messages Oct 29 16:40:31 sh-103-53.int kernel: LustreError: 30035:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 
(ffff97820c3b4fc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 16:40:31 sh-103-53.int kernel: LustreError: 30035:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 16:40:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 16:40:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 16:45:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 16:45:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 16:45:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 16:45:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 16:45:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572392445, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0e9c0/0x51ab3c4ee69b2c3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adce0c9a8fd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 16:45:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 16:49:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 6 seconds Oct 29 16:49:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 230 previous similar messages Oct 29 16:50:55 sh-103-53.int kernel: LustreError: 30887:0:(ldlm_resource.c:1147:ldlm_resource_complain()) 
MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976ad9d32600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 16:50:55 sh-103-53.int kernel: LustreError: 30887:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 16:50:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 16:50:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 16:55:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 16:55:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages Oct 29 16:56:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 16:56:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 16:56:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572393068, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c5c40/0x51ab3c4ee69b2fb lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adcfdd8d799 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 16:56:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 16:58:41 sh-103-53.int kernel: LNetError: 30121:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 213 Oct 29 16:59:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 16:59:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 227 previous similar messages Oct 29 17:01:22 sh-103-53.int kernel: LustreError: 31776:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9779d1b61740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 17:01:22 sh-103-53.int kernel: LustreError: 31776:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 17:01:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 17:01:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 17:05:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 17:05:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 17:06:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 17:06:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 17:06:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572393699, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d613f00/0x51ab3c4ee69b333 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2add199c3447 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 17:06:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 17:09:32 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 17:09:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 17:09:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 247 previous similar messages Oct 29 17:11:54 sh-103-53.int kernel: LustreError: 32652:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774413cf380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 17:11:54 sh-103-53.int kernel: LustreError: 32652:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 17:11:54 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 17:11:54 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 17:15:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 17:15:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 17:17:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 17:17:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 17:17:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572394331, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660ee40/0x51ab3c4ee69b3b1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2add375d5a86 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 17:17:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 17:19:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 17:19:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 229 previous similar messages Oct 29 17:22:22 sh-103-53.int kernel: LustreError: 33480:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978215f78c00) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 17:22:22 sh-103-53.int kernel: LustreError: 33480:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 17:22:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 17:22:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 17:25:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 17:25:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 17:27:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 17:27:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 17:27:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572394959, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff976a0fea6e40/0x51ab3c4ee69b3e9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2add53ff6295 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 17:27:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 17:28:41 sh-103-53.int kernel: LNetError: 33548:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 29 17:29:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 17:29:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 226 previous similar messages Oct 29 17:32:56 sh-103-53.int kernel: LustreError: 34341:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978215f78a80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 17:32:56 sh-103-53.int kernel: LustreError: 34341:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 17:32:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 17:32:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 17:35:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 17:35:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 17:38:14 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 17:38:14 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 17:38:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572395594, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783feb8900/0x51ab3c4ee69b421 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2add7265964e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 17:38:14 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 17:39:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 17:39:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 220 previous similar messages Oct 29 17:43:29 sh-103-53.int kernel: LustreError: 35194:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771143f58c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 17:43:29 sh-103-53.int kernel: LustreError: 35194:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 17:43:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 17:43:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 17:46:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 17:46:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 17:48:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 17:48:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 17:48:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572396223, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269680/0x51ab3c4ee69b459 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2add8c5f500f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 17:48:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 17:50:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 29 17:50:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 221 previous similar messages Oct 29 17:54:01 sh-103-53.int kernel: LustreError: 36043:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771237ec300) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 17:54:01 sh-103-53.int kernel: LustreError: 36043:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 17:54:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 17:54:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 17:56:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 17:56:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 17:59:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 17:59:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 17:59:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572396852, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783feb9200/0x51ab3c4ee69b491 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2add929d030b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 17:59:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 18:00:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 18:00:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 224 previous similar messages Oct 29 18:04:29 sh-103-53.int kernel: LustreError: 36910:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff9771057ac180) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 18:04:29 sh-103-53.int kernel: LustreError: 36910:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 18:04:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 18:04:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 18:06:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 18:06:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 18:09:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 18:09:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 18:09:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572397481, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c4800/0x51ab3c4ee69b4c9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2add97631165 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 18:09:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 18:10:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 29 18:10:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 224 previous similar messages Oct 29 18:13:34 sh-103-53.int kernel: LNetError: 
91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 18:14:57 sh-103-53.int kernel: LustreError: 37789:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978202e2ca80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 18:14:57 sh-103-53.int kernel: LustreError: 37789:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 18:14:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 18:14:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 18:16:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 18:16:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 18:20:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 18:20:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 229 previous similar messages Oct 29 18:20:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 18:20:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 18:20:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572398118, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d612640/0x51ab3c4ee69b51d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adda06caf17 expref: -99 pid: 91159 
timeout: 0 lvb_type: 0 Oct 29 18:20:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 18:21:43 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 18:25:32 sh-103-53.int kernel: LustreError: 38547:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978202e2ccc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 18:25:32 sh-103-53.int kernel: LustreError: 38547:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 18:25:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 18:25:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 18:26:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 18:26:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 18:30:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 29 18:30:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 262 previous similar messages Oct 29 18:30:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 18:30:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 18:30:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572398739, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d6157c0/0x51ab3c4ee69b555 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adda48dc80e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 18:30:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 18:35:58 sh-103-53.int kernel: LustreError: 39344:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a12bb6e40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 18:35:58 sh-103-53.int kernel: LustreError: 39344:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 18:35:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 18:35:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 18:36:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 18:36:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 29 18:39:00 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 18:40:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 8 seconds Oct 29 18:40:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 251 previous similar messages Oct 29 18:41:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 18:41:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 18:41:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572399371, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663ec00/0x51ab3c4ee69b58d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adda88f9967 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 18:41:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 18:46:25 sh-103-53.int 
kernel: LustreError: 40194:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a12bb72c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 18:46:25 sh-103-53.int kernel: LustreError: 40194:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 18:46:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 18:46:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 18:47:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 18:47:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 36 previous similar messages Oct 29 18:48:17 sh-103-53.int kernel: LNetError: 33548:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 29 18:50:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 18:50:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 224 previous similar messages Oct 29 18:51:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 18:51:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 18:51:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572399993, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76638240/0x51ab3c4ee69b5c5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2addac6436c4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 18:51:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 18:56:53 sh-103-53.int kernel: LustreError: 41031:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a12bb7740) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 18:56:53 sh-103-53.int kernel: LustreError: 41031:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 18:56:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 18:56:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 18:57:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. 
Health = 900 Oct 29 18:57:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 19:00:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 19:00:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 212 previous similar messages Oct 29 19:02:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 19:02:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 19:02:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572400623, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663da00/0x51ab3c4ee69b5fd lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2addb9a76914 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 19:02:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 19:02:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 19:04:23 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 29 19:07:21 sh-103-53.int kernel: LustreError: 41856:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771237ed500) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 19:07:21 sh-103-53.int kernel: LustreError: 41856:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 19:07:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 19:07:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 19:07:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 19:07:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 35 previous similar messages Oct 29 19:10:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 29 19:10:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 230 previous similar messages Oct 29 19:12:22 sh-103-53.int kernel: LNetError: 33548:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 201 Oct 29 19:12:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 19:12:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 19:12:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572401254, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626ec00/0x51ab3c4ee69b635 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2addcc4b642c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 19:12:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 19:17:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 19:17:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 29 19:18:42 sh-103-53.int kernel: LustreError: 42808:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781f4a7a6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 19:18:42 sh-103-53.int kernel: LustreError: 42808:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 19:18:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 19:18:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 19:19:38 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 19:20:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 19:20:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 209 previous similar messages Oct 29 19:20:54 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3350:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds Oct 29 19:20:54 sh-103-53.int kernel: LNetError: 91081:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 10.9.0.24@o2ib4 (7): c: 0, oc: 0, rc: 8 Oct 29 19:20:54 sh-103-53.int kernel: Lustre: 91134:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572402047/real 1572402047] req@ffff97821df3cc80 x1648382107792000/t0(0) o400->fir-OST0002-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572402054 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 19:20:54 sh-103-53.int kernel: Lustre: fir-OST0053-osc-ffff9781f2230800: Connection to fir-OST0053 (at 10.0.10.114@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 19:20:54 sh-103-53.int kernel: Lustre: Skipped 23 previous similar messages Oct 29 19:20:54 sh-103-53.int kernel: Lustre: 91134:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 45 previous similar messages Oct 29 19:20:59 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due 
to network error: [sent 1572402047/real 1572402059] req@ffff97821df3e780 x1648382107791920/t0(0) o400->fir-MDT0001-mdc-ffff9781f2230800@10.0.10.52@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572402104 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 Oct 29 19:20:59 sh-103-53.int kernel: Lustre: fir-MDT0001-mdc-ffff9781f2230800: Connection to fir-MDT0001 (at 10.0.10.52@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 19:20:59 sh-103-53.int kernel: Lustre: Skipped 43 previous similar messages Oct 29 19:21:38 sh-103-53.int kernel: Lustre: 91131:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572402047/real 1572402047] req@ffff97821df3f080 x1648382107791952/t0(0) o400->fir-MDT0003-mdc-ffff9781f2230800@10.0.10.54@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572402098 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 29 19:21:38 sh-103-53.int kernel: Lustre: fir-MDT0003-mdc-ffff9781f2230800: Connection to fir-MDT0003 (at 10.0.10.54@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 29 19:24:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 19:24:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 19:24:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572401941, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783feb8d80/0x51ab3c4ee69b66d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adde5d371f8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 19:24:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 19:27:47 sh-103-53.int 
kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 19:27:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages Oct 29 19:29:18 sh-103-53.int kernel: LustreError: 43662:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779f7460c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 19:29:18 sh-103-53.int kernel: LustreError: 43662:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 19:29:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 19:29:18 sh-103-53.int kernel: Lustre: Skipped 47 previous similar messages Oct 29 19:29:48 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.24@o2ib4: -125 Oct 29 19:30:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 19:30:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 291 previous similar messages Oct 29 19:34:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 19:34:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 19:34:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572402567, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660da00/0x51ab3c4ee69b6a5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2addffb0c911 
expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 19:34:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 19:37:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 19:37:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 19:39:35 sh-103-53.int kernel: LustreError: 44552:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774413cfc80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 19:39:35 sh-103-53.int kernel: LustreError: 44552:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 19:39:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 19:39:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 19:40:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 19:40:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 258 previous similar messages Oct 29 19:44:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 19:44:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 19:44:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572403189, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663a880/0x51ab3c4ee69b7b6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN 
flags: 0x1000000000000 nid: local remote: 0xc9be2ade1b2e7fbc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 19:44:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 19:48:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 19:48:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 19:50:09 sh-103-53.int kernel: LustreError: 45316:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977daca77440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 19:50:09 sh-103-53.int kernel: LustreError: 45316:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 19:50:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 19:50:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 19:50:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 9 seconds Oct 29 19:50:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 258 previous similar messages Oct 29 19:53:13 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 19:55:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 19:55:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 19:55:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out 
(enqueued at 1572403823, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783feba1c0/0x51ab3c4ee69b7ee lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ade379dbc1e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 19:55:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 19:58:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 19:58:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 20:00:35 sh-103-53.int kernel: LustreError: 46168:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820ff2acc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 20:00:35 sh-103-53.int kernel: LustreError: 46168:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 20:00:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 20:00:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 20:00:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 6 seconds Oct 29 20:00:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 268 previous similar messages Oct 29 20:05:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 20:05:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 20:05:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572404454, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660d100/0x51ab3c4ee69b826 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ade519aff01 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 20:05:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 20:08:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 20:08:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 20:11:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 2 seconds Oct 29 20:11:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 280 previous similar messages Oct 29 20:11:12 sh-103-53.int kernel: LustreError: 47020:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9779d1b61440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 20:11:12 sh-103-53.int kernel: LustreError: 47020:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 20:11:12 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 20:11:12 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 20:16:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 20:16:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 20:16:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572405083, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c3600/0x51ab3c4ee69b85e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ade6c6601b1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 20:16:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 20:18:37 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 20:18:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 20:20:38 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 20:20:39 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 20:21:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 20:21:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 264 previous similar messages Oct 29 20:21:42 sh-103-53.int kernel: LustreError: 47797:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9779d1b61800) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 20:21:42 sh-103-53.int kernel: LustreError: 47797:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 20:21:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 20:21:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 20:26:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 20:26:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 20:26:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572405715, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d617500/0x51ab3c4ee69b896 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ade884a628b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 20:26:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 20:28:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 20:28:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 20:31:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 20:31:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 280 previous similar messages Oct 29 20:32:59 sh-103-53.int kernel: LustreError: 48694:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977796ee9b00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 20:32:59 sh-103-53.int kernel: LustreError: 48694:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 20:32:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 20:32:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 20:38:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 20:38:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 20:38:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572406393, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff975723edcec0/0x51ab3c4ee69b8ce lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adea79cf30f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 20:38:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 20:38:57 sh-103-53.int kernel: LNetError: 
91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 20:38:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 20:41:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 20:41:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 275 previous similar messages Oct 29 20:43:30 sh-103-53.int kernel: LustreError: 49469:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711ebcdc80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 20:43:30 sh-103-53.int kernel: LustreError: 49469:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 20:43:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 20:43:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 20:48:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 20:48:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 20:48:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572407032, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff975723ed8fc0/0x51ab3c4ee69b906 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adec386f104 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 20:48:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 29 20:49:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 20:49:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 20:51:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 4 seconds Oct 29 20:51:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 275 previous similar messages Oct 29 20:54:07 sh-103-53.int kernel: LustreError: 50315:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97821e131e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 20:54:07 sh-103-53.int kernel: LustreError: 50315:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 20:54:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 20:54:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 20:57:15 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 20:57:16 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 20:57:16 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 20:59:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 20:59:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 20:59:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 20:59:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 20:59:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572407662, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d9825b600/0x51ab3c4ee69b93e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adee00e6712 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 20:59:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 21:01:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 3 seconds Oct 29 21:01:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 257 previous similar messages Oct 29 21:02:21 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 21:04:41 sh-103-53.int kernel: LustreError: 51197:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779f746cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 21:04:41 sh-103-53.int kernel: LustreError: 51197:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 21:04:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 21:04:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 21:09:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 21:09:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 21:09:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 21:09:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 21:09:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572408297, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d617080/0x51ab3c4ee69ba4f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adefa8aacbd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 21:09:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 21:11:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 0 seconds Oct 29 21:11:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 260 previous similar messages Oct 29 21:15:14 sh-103-53.int kernel: LustreError: 52029:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978207204180) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 21:15:14 sh-103-53.int kernel: LustreError: 52029:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 21:15:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 21:15:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 21:19:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 21:19:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 21:20:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 21:20:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 21:20:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572408930, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626e780/0x51ab3c4ee69ba87 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adf10dc6e6f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 21:20:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 21:21:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 21:21:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 267 previous similar messages Oct 29 21:25:38 sh-103-53.int kernel: LustreError: 52882:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff976ad9dc2480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 21:25:38 sh-103-53.int kernel: LustreError: 52882:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 21:25:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 21:25:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 21:29:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 21:29:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 21:30:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 21:30:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 21:30:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572409546, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626ee40/0x51ab3c4ee69babf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adf2d1bda9b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 21:30:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 21:31:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 29 21:31:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 268 previous similar messages Oct 29 21:35:54 sh-103-53.int kernel: LNetError: 
91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 21:35:56 sh-103-53.int kernel: LustreError: 53629:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781f534be00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 21:35:56 sh-103-53.int kernel: LustreError: 53629:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 21:35:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 21:35:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 21:39:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 21:39:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 21:41:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 21:41:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 21:41:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572410167, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0ba80/0x51ab3c4ee69baf7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adf415d1179 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 21:41:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 21:41:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out 
tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 21:41:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 285 previous similar messages Oct 29 21:46:14 sh-103-53.int kernel: LustreError: 54465:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820bf59080) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 21:46:14 sh-103-53.int kernel: LustreError: 54465:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 21:46:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 21:46:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 21:50:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 21:50:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 21:51:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 21:51:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 21:51:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572410784, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febb180/0x51ab3c4ee69bb2f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adf56f0e8c7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 21:51:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 21:51:34 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 21:51:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 277 previous similar messages Oct 29 21:53:09 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 21:53:10 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 21:53:10 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 21:56:33 sh-103-53.int kernel: LustreError: 55315:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779f747e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 21:56:33 sh-103-53.int kernel: LustreError: 55315:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 21:56:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 21:56:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 22:00:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 22:00:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 22:01:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 22:01:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 256 previous similar messages Oct 29 22:01:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 22:01:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 22:01:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572411403, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76608d80/0x51ab3c4ee69bb67 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adf71d19178 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 22:01:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 22:06:51 sh-103-53.int kernel: LustreError: 56083:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e73fbc900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 22:06:51 sh-103-53.int kernel: LustreError: 56083:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 22:06:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 22:06:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 22:10:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 22:10:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 22:11:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 22:11:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 282 previous similar messages Oct 29 22:11:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 22:11:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 22:11:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572412018, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76609440/0x51ab3c4ee69bb9f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adf8a5e4b73 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 22:11:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 22:17:08 sh-103-53.int kernel: LustreError: 56923:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976be9eb1c80) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 22:17:08 sh-103-53.int kernel: LustreError: 56923:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 22:17:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 22:17:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 22:20:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 22:20:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 22:21:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 29 22:21:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 268 previous similar messages Oct 29 22:22:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 22:22:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 22:22:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572412636, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c1200/0x51ab3c4ee69bbd7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adfa77718c6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 22:22:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 22:27:25 sh-103-53.int kernel: LustreError: 57819:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff976be9eb1380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 22:27:25 sh-103-53.int kernel: LustreError: 57819:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 22:27:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 22:27:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 22:30:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 22:30:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 22:32:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 9 seconds Oct 29 22:32:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 279 previous similar messages Oct 29 22:32:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 22:32:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 22:32:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572413254, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626a640/0x51ab3c4ee69bce8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adfbf2de5ed expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 22:32:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 22:32:49 sh-103-53.int kernel: LNetError: 
91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 22:37:42 sh-103-53.int kernel: LustreError: 58601:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976ad9dc29c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 22:37:42 sh-103-53.int kernel: LustreError: 58601:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 22:37:42 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 22:37:42 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 22:40:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 22:40:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 22:42:00 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 22:42:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 22:42:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 267 previous similar messages Oct 29 22:42:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 22:42:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 22:42:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572413869, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: 
ffff977e7626c5c0/0x51ab3c4ee69bd20 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adfd7575ef2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 22:42:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 22:47:59 sh-103-53.int kernel: LustreError: 59447:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976ad9dc2180) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 22:47:59 sh-103-53.int kernel: LustreError: 59447:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 22:47:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 22:47:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 22:48:05 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 22:51:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 22:51:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 22:52:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 15 seconds Oct 29 22:52:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 283 previous similar messages Oct 29 22:53:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 22:53:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 22:53:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572414487, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626de80/0x51ab3c4ee69bd58 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2adfed563491 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 22:53:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 22:54:09 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 22:58:15 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 22:58:15 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 22:58:16 sh-103-53.int kernel: LustreError: 60269:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976ad9dc3c80) refcount nonzero (1) after lock cleanup; forcing 
cleanup. Oct 29 22:58:16 sh-103-53.int kernel: LustreError: 60269:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 22:58:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 22:58:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 23:01:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 23:01:18 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 23:02:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 1 seconds Oct 29 23:02:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 246 previous similar messages Oct 29 23:03:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 23:03:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 23:03:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572415104, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb08000/0x51ab3c4ee69bd90 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae00aee7fb8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 23:03:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 23:08:34 sh-103-53.int kernel: LustreError: 61050:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 
(ffff9771086eb440) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 23:08:34 sh-103-53.int kernel: LustreError: 61050:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 23:08:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 23:08:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 23:09:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 29 23:09:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 2 previous similar messages Oct 29 23:09:25 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 23:09:25 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 29 23:11:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 23:11:28 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 23:12:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 23:12:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 260 previous similar messages Oct 29 23:13:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 23:13:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 23:13:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572415724, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d98258fc0/0x51ab3c4ee69bdc8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae0295c3370 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 23:13:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 23:18:50 sh-103-53.int kernel: LustreError: 61882:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e77721c80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 23:18:50 sh-103-53.int kernel: LustreError: 61882:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 23:18:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 23:18:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 23:21:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 23:21:38 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 23:22:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 29 23:22:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 282 previous similar messages Oct 29 23:23:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 23:23:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 23:23:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572416339, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d98259d40/0x51ab3c4ee69be00 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae04711698e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 23:23:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 23:29:05 sh-103-53.int kernel: LustreError: 62701:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e77720cc0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 23:29:05 sh-103-53.int kernel: LustreError: 62701:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 23:29:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 23:29:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 23:31:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 23:31:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 23:32:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 7 seconds Oct 29 23:32:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 277 previous similar messages Oct 29 23:34:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 23:34:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 23:34:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572416953, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c18c0/0x51ab3c4ee69be38 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae0650411e0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 23:34:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 23:38:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 
12345-10.9.0.23@o2ib4: -125 Oct 29 23:39:22 sh-103-53.int kernel: LustreError: 63463:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976be9eb1ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 23:39:22 sh-103-53.int kernel: LustreError: 63463:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 23:39:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 23:39:23 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 23:41:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 29 23:41:58 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 23:42:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 29 23:42:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 258 previous similar messages Oct 29 23:42:29 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 124 Oct 29 23:43:48 sh-103-53.int kernel: LNetError: 58068:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 23:43:59 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 23:44:19 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 29 23:44:29 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 29 23:44:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 23:44:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 23:44:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572417569, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d6172c0/0x51ab3c4ee69be70 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae083321040 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 23:44:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 23:49:35 sh-103-53.int kernel: LustreError: 64396:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978207205380) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 29 23:49:35 sh-103-53.int kernel: LustreError: 64396:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 23:49:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 23:49:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 29 23:50:06 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 29 23:52:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 29 23:52:08 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 29 23:52:08 sh-103-53.int kernel: LNetError: 58068:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 101 Oct 29 23:52:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 29 23:52:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 265 previous similar messages Oct 29 23:54:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 29 23:54:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 29 23:54:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572418185, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c1440/0x51ab3c4ee69bf81 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae0a1534937 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 29 23:54:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 29 23:59:55 sh-103-53.int kernel: LustreError: 65245:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976be878fd40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 29 23:59:55 sh-103-53.int kernel: LustreError: 65245:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 29 23:59:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 29 23:59:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 00:00:11 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 30 00:02:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 00:02:19 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 00:02:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 1 seconds Oct 30 00:02:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 265 previous similar messages Oct 30 00:05:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 00:05:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 00:05:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572418802, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663bcc0/0x51ab3c4ee69bfb9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae0bfb5e7fe expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 00:05:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 00:10:09 sh-103-53.int 
kernel: LustreError: 66111:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fcab7c80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 00:10:09 sh-103-53.int kernel: LustreError: 66111:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 00:10:09 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 00:10:09 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 00:12:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 00:12:29 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 00:12:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 00:12:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 265 previous similar messages Oct 30 00:15:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 00:15:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 00:15:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572419415, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d9825c380/0x51ab3c4ee69bff1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae0dde4debb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 00:15:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 
previous similar message Oct 30 00:15:30 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 00:19:36 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 00:20:22 sh-103-53.int kernel: LustreError: 66947:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774413cecc0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 00:20:22 sh-103-53.int kernel: LustreError: 66947:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 00:20:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 00:20:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 00:22:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 00:22:39 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 00:22:39 sh-103-53.int kernel: LNetError: 65734:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 30 00:22:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 30 00:22:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 256 previous similar messages Oct 30 00:24:40 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 30 00:25:32 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 00:25:32 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 00:25:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572420032, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d610480/0x51ab3c4ee69c029 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae0fc2cd451 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 00:25:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 00:30:37 sh-103-53.int kernel: LustreError: 67791:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446e8a9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 00:30:37 sh-103-53.int kernel: LustreError: 67791:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 00:30:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 00:30:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 00:32:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 00:32:49 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 00:33:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 30 00:33:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 257 previous similar messages Oct 30 00:35:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 00:35:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 00:35:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572420644, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663ba80/0x51ab3c4ee69c061 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae119e9913d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 00:35:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 00:35:51 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 00:35:51 sh-103-53.int 
kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 30 00:37:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 00:37:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 00:37:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 00:37:29 sh-103-53.int kernel: LNetError: 65734:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 00:40:53 sh-103-53.int kernel: LustreError: 68727:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771237ed140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 00:40:53 sh-103-53.int kernel: LustreError: 68727:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 00:40:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 00:40:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 00:42:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 00:42:59 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 00:43:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 30 00:43:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 250 previous similar messages Oct 30 00:45:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 00:45:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 00:45:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572421259, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663b180/0x51ab3c4ee69c0a7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae137c4cfcf expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 00:45:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 00:51:06 sh-103-53.int kernel: LustreError: 69561:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9771237ec9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 00:51:06 sh-103-53.int kernel: LustreError: 69561:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 00:51:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 00:51:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 00:53:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 00:53:09 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 00:53:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 9 seconds Oct 30 00:53:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 263 previous similar messages Oct 30 00:56:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 00:56:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 00:56:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572421871, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663e780/0x51ab3c4ee69c0df lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae154e9fbfa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 00:56:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 01:01:18 sh-103-53.int kernel: LustreError: 70416:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976ad9dc3a40) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 01:01:18 sh-103-53.int kernel: LustreError: 70416:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 01:01:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 01:01:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 01:03:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 01:03:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 262 previous similar messages Oct 30 01:03:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 01:03:15 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 01:06:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 01:06:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 01:06:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572422486, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7626af40/0x51ab3c4ee69c117 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae1722d6fa2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 01:06:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 01:11:37 sh-103-53.int kernel: LustreError: 71254:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff9777956a66c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 01:11:37 sh-103-53.int kernel: LustreError: 71254:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 01:11:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 01:11:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 01:13:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 8 seconds Oct 30 01:13:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 251 previous similar messages Oct 30 01:13:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 01:13:30 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 01:16:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 01:16:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 01:16:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572423109, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c2ab180/0x51ab3c4ee69c14f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae18fa16ef1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 01:16:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 01:21:56 sh-103-53.int kernel: LustreError: 
72080:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774413cfa40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 01:21:56 sh-103-53.int kernel: LustreError: 72080:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 01:21:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 01:21:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 01:23:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 4 seconds Oct 30 01:23:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 260 previous similar messages Oct 30 01:23:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 01:23:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 01:27:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 01:27:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 01:27:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572423725, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d614380/0x51ab3c4ee69c187 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae1ad58f9ac expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 01:27:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar 
message Oct 30 01:32:14 sh-103-53.int kernel: LustreError: 72964:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978192781d40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 01:32:14 sh-103-53.int kernel: LustreError: 72964:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 01:32:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 01:32:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 01:32:46 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 30 01:33:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 01:33:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 261 previous similar messages Oct 30 01:33:45 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 01:33:45 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 01:37:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 01:37:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 01:37:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572424342, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c2a9680/0x51ab3c4ee69c213 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae1cac1f528 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 01:37:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 01:42:29 sh-103-53.int kernel: LustreError: 73777:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978192781440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 01:42:29 sh-103-53.int kernel: LustreError: 73777:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 01:42:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 01:42:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 01:43:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 01:43:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 246 previous similar messages Oct 30 01:43:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 01:43:55 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 01:47:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 01:47:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 01:47:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572424955, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663a400/0x51ab3c4ee69c24b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae1e78b0245 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 01:47:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 01:52:41 sh-103-53.int kernel: LustreError: 74625:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777956a75c0) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 01:52:41 sh-103-53.int kernel: LustreError: 74625:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 01:52:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 01:52:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 01:53:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 30 01:53:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 259 previous similar messages Oct 30 01:54:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 01:54:05 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 01:57:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 01:57:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 01:57:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572425570, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0de80/0x51ab3c4ee69c283 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae204b21c9f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 01:57:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 01:59:13 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 
12345-10.9.0.21@o2ib4: -125 Oct 30 02:02:57 sh-103-53.int kernel: LustreError: 75476:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777956a7980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 02:02:57 sh-103-53.int kernel: LustreError: 75476:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 02:02:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 02:02:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 02:03:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 6 seconds Oct 30 02:03:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 262 previous similar messages Oct 30 02:04:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 02:04:20 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 02:08:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 02:08:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 02:08:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572426187, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb08b40/0x51ab3c4ee69c2bb lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae221828cd4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 02:08:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 02:13:14 sh-103-53.int kernel: LustreError: 76316:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781ffad83c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 02:13:14 sh-103-53.int kernel: LustreError: 76316:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 02:13:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 02:13:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 02:13:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 02:13:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 263 previous similar messages Oct 30 02:14:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 02:14:25 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 02:18:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 02:18:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 02:18:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572426805, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febda00/0x51ab3c4ee69c2f3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae23e9d5c6f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 02:18:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 02:18:32 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 02:23:33 sh-103-53.int 
kernel: LustreError: 77285:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711927a480) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 02:23:33 sh-103-53.int kernel: LustreError: 77285:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 02:23:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 02:23:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 02:24:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 30 02:24:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 257 previous similar messages Oct 30 02:24:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 02:24:35 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 02:28:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 02:28:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 02:28:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572427420, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d9825c140/0x51ab3c4ee69c32b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae25bc081d6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 02:28:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 
previous similar message Oct 30 02:33:47 sh-103-53.int kernel: LustreError: 78286:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97572072f200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 02:33:47 sh-103-53.int kernel: LustreError: 78286:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 02:33:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 02:33:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 02:34:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 30 02:34:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 280 previous similar messages Oct 30 02:34:45 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 02:34:45 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 02:38:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 02:38:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 02:38:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572428035, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977d9825f740/0x51ab3c4ee69c371 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae278f7d632 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 02:38:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 02:39:51 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 30 02:39:52 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 02:42:56 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 02:44:01 sh-103-53.int kernel: LustreError: 79122:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97572072f8c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 02:44:01 sh-103-53.int kernel: LustreError: 79122:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 02:44:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 02:44:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 02:44:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 10 seconds Oct 30 02:44:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 259 previous similar messages Oct 30 02:45:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 02:45:00 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 02:49:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 02:49:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 02:49:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572428649, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660ad00/0x51ab3c4ee69c3a9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae2962ad1f7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 02:49:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 02:54:16 sh-103-53.int kernel: LustreError: 79958:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774413cf440) 
refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 02:54:16 sh-103-53.int kernel: LustreError: 79958:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 02:54:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 02:54:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 02:54:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 30 02:54:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 260 previous similar messages Oct 30 02:55:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 02:55:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 02:59:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 02:59:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 02:59:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572429265, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660c380/0x51ab3c4ee69c3e1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae2b35cd210 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 02:59:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 03:04:31 sh-103-53.int kernel: LustreError: 80798:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: 
namespace resource [0x726966:0x2:0x0].0x0 (ffff975721fbe6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 03:04:31 sh-103-53.int kernel: LustreError: 80798:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 03:04:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 03:04:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 03:04:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 03:04:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 276 previous similar messages Oct 30 03:05:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 03:05:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 03:09:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 03:09:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 03:09:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572429879, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c2a2d00/0x51ab3c4ee69c419 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae2d09fb9a3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 03:09:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 03:14:32 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 11 seconds Oct 30 03:14:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 259 previous similar messages Oct 30 03:14:46 sh-103-53.int kernel: LustreError: 81633:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777947e8600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 03:14:46 sh-103-53.int kernel: LustreError: 81633:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 03:14:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 03:14:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 03:15:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 03:15:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 03:19:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 03:19:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 03:19:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572430494, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0d340/0x51ab3c4ee69c451 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae2eddd1150 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 03:19:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous 
similar message Oct 30 03:24:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 30 03:24:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 257 previous similar messages Oct 30 03:24:37 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 03:25:02 sh-103-53.int kernel: LustreError: 82513:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777947e83c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 03:25:02 sh-103-53.int kernel: LustreError: 82513:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 03:25:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 03:25:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 03:25:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 03:25:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 03:30:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 03:30:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 03:30:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572431111, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0c5c0/0x51ab3c4ee69c4dd lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae30b35e8aa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 03:30:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 03:34:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 9 seconds Oct 30 03:34:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 255 previous similar messages Oct 30 03:35:17 sh-103-53.int kernel: LustreError: 83330:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782113a2cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 03:35:17 sh-103-53.int kernel: LustreError: 83330:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 03:35:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 03:35:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 03:35:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 03:35:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 03:36:31 sh-103-53.int kernel: LNetError: 79873:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 30 03:40:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 03:40:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 03:40:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572431722, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c0000/0x51ab3c4ee69c515 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae3284784a2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 03:40:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 03:44:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 03:44:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 254 previous similar messages Oct 30 03:45:29 sh-103-53.int 
kernel: LustreError: 84163:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97550c2aba40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 03:45:29 sh-103-53.int kernel: LustreError: 84163:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 03:45:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 03:45:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 03:45:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 03:45:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 03:50:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 03:50:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 03:50:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572432336, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663af40/0x51ab3c4ee69c54d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae345a79dbf expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 03:50:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 03:54:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 30 03:54:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 265 
previous similar messages Oct 30 03:55:45 sh-103-53.int kernel: LustreError: 85212:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc2ac240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 03:55:45 sh-103-53.int kernel: LustreError: 85212:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 03:55:45 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 03:55:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 03:55:56 sh-103-53.int kernel: LNetError: 84591:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 30 03:56:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 03:56:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 03:58:46 sh-103-53.int kernel: LNetError: 84591:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 125 Oct 30 04:00:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 04:00:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 04:00:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572432952, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0b600/0x51ab3c4ee69c585 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae3641073c4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 04:00:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 04:04:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 04:04:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 279 previous similar messages Oct 30 04:05:59 sh-103-53.int kernel: LustreError: 85876:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977119720a80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 04:05:59 sh-103-53.int kernel: LustreError: 85876:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 04:05:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 04:05:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 04:06:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 04:06:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 04:11:08 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 04:11:08 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 04:11:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572433568, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febe300/0x51ab3c4ee69c5bd lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae3828e92bd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 04:11:08 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 04:14:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 04:14:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 272 previous similar messages Oct 30 04:16:15 sh-103-53.int kernel: LustreError: 86715:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977435afa3c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 04:16:15 sh-103-53.int kernel: LustreError: 86715:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 04:16:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 04:16:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 04:16:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 04:16:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 04:20:32 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 04:21:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 04:21:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 04:21:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572434183, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febf980/0x51ab3c4ee69c5f5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae3a13010e5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 04:21:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 04:24:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 30 04:24:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 272 previous similar messages Oct 30 04:26:32 sh-103-53.int 
kernel: LustreError: 87714:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977445b2c600) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 04:26:32 sh-103-53.int kernel: LustreError: 87714:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 04:26:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 04:26:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 04:26:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 04:26:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 04:31:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 04:31:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 04:31:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572434802, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97779d611f80/0x51ab3c4ee69c63b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae3bf915197 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 04:31:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 04:34:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 30 04:34:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 261 
previous similar messages Oct 30 04:36:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 04:36:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 04:36:51 sh-103-53.int kernel: LustreError: 88686:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc2aca80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 04:36:51 sh-103-53.int kernel: LustreError: 88686:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 04:36:51 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 04:36:51 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 04:37:47 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 30 04:42:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 04:42:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 04:42:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572435420, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663d7c0/0x51ab3c4ee69c673 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae3ddb0f1ee expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 04:42:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 04:44:52 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 4 seconds Oct 30 04:44:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 264 previous similar messages Oct 30 04:46:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 04:46:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 04:47:05 sh-103-53.int kernel: LustreError: 89521:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc2ac780) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 04:47:05 sh-103-53.int kernel: LustreError: 89521:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 04:47:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 04:47:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 04:52:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 04:52:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 04:52:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572436036, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663c5c0/0x51ab3c4ee69c6ab lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae3fb69a1f3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 04:52:16 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 04:54:07 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.23@o2ib4: -125 Oct 30 04:54:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 2 seconds Oct 30 04:54:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 245 previous similar messages Oct 30 04:56:31 sh-103-53.int kernel: LNetError: 88257:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 201 Oct 30 04:57:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 04:57:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 04:57:28 sh-103-53.int kernel: LustreError: 90366:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc2ad200) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 04:57:28 sh-103-53.int kernel: LustreError: 90366:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 04:57:28 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 04:57:28 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 05:02:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 05:02:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 05:02:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572436660, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663ad00/0x51ab3c4ee69c6e3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae419c8b108 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 05:02:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 05:04:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 30 05:04:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 272 previous similar messages Oct 30 05:05:45 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 24 Oct 30 05:07:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 05:07:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 05:07:47 sh-103-53.int kernel: LustreError: 91302:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fc2aca80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 05:07:47 sh-103-53.int kernel: LustreError: 91302:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 05:07:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 05:07:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 05:12:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 05:12:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 05:12:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572437273, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76639680/0x51ab3c4ee69c722 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae437d32da9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 05:12:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 05:13:16 sh-103-53.int kernel: LNetError: 89883:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 30 05:13:22 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 30 05:13:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 05:13:24 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 30 05:15:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 30 05:15:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 257 previous similar messages Oct 30 05:15:34 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 05:17:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 05:17:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 05:18:03 sh-103-53.int kernel: LustreError: 92160:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977446e8b980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 05:18:03 sh-103-53.int kernel: LustreError: 92160:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 05:18:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 05:18:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 05:21:32 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 05:23:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 05:23:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 05:23:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572437893, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfb9f80/0x51ab3c4ee69c7ae lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae455ee4c50 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 05:23:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 05:25:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 8 seconds Oct 30 05:25:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 273 previous similar messages Oct 30 05:27:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 05:27:36 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 05:28:22 sh-103-53.int kernel: LustreError: 92989:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976085a826c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 05:28:22 sh-103-53.int kernel: LustreError: 92989:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 05:28:22 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 05:28:22 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 05:33:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 05:33:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 05:33:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572438510, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663f2c0/0x51ab3c4ee69c7e6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae4731126a8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 05:33:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 05:33:43 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 05:35:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 0 seconds Oct 30 05:35:16 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 267 previous similar messages Oct 30 05:37:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 05:37:46 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 05:38:37 sh-103-53.int kernel: LustreError: 93825:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777947efe00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 05:38:37 sh-103-53.int kernel: LustreError: 93825:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 05:38:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 05:38:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 05:38:49 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 05:43:43 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 05:43:43 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 05:43:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572439123, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e6cb0bcc0/0x51ab3c4ee69c81e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae48f9ec760 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 05:43:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) 
Skipped 1 previous similar message Oct 30 05:45:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 4 seconds Oct 30 05:45:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 266 previous similar messages Oct 30 05:47:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 05:47:56 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 05:48:50 sh-103-53.int kernel: LustreError: 94663:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769faf746c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 05:48:50 sh-103-53.int kernel: LustreError: 94663:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 05:48:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 05:48:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 05:53:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 05:53:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 05:53:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572439738, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c21c0/0x51ab3c4ee69c856 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae4ac10a408 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 05:53:58 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 05:55:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 05:55:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 261 previous similar messages Oct 30 05:58:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 05:58:06 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 05:59:07 sh-103-53.int kernel: LustreError: 95692:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820cadd5c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 05:59:07 sh-103-53.int kernel: LustreError: 95692:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 05:59:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 05:59:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 05:59:36 sh-103-53.int kernel: LNetError: 95187:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 99 Oct 30 05:59:36 sh-103-53.int kernel: LNetError: 95187:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 30 06:04:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 06:04:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 06:04:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572440353, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269440/0x51ab3c4ee69c88e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae4c8f62435 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 06:04:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 06:05:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 06:05:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 257 previous similar messages Oct 30 06:08:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 06:08:16 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 06:09:21 sh-103-53.int kernel: LustreError: 96544:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781f2e675c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 06:09:21 sh-103-53.int kernel: LustreError: 96544:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 06:09:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 06:09:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 06:14:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 06:14:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 06:14:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572440971, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97783febec00/0x51ab3c4ee69c8c6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae4e5fb9d30 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 06:14:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 06:15:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 5 seconds Oct 30 06:15:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 284 previous similar messages Oct 30 06:18:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 06:18:26 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 06:19:40 sh-103-53.int kernel: LustreError: 97339:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781fde6e180) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 06:19:40 sh-103-53.int kernel: LustreError: 97339:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 06:19:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 06:19:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 06:23:32 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 06:24:49 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 06:24:49 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 06:24:49 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572441589, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d2ac0/0x51ab3c4ee69c8fe lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae502ae0439 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 06:24:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 06:25:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 6 seconds Oct 30 06:25:52 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 285 previous similar messages Oct 30 06:28:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 06:28:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 06:29:57 sh-103-53.int kernel: LustreError: 98071:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774413cfec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 06:29:57 sh-103-53.int kernel: LustreError: 98071:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 06:29:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 06:29:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 06:35:06 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 06:35:06 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 06:35:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572442206, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c3a80/0x51ab3c4ee69c936 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae51f5e38fd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 06:35:06 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 06:36:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.21@o2ib4: 0 seconds Oct 30 06:36:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 287 previous similar messages Oct 30 06:36:38 sh-103-53.int kernel: LNetError: 95187:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 0 Oct 30 06:38:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 06:38:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 06:38:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 06:38:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 06:38:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 06:38:36 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 30 06:38:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 06:38:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 06:38:52 sh-103-53.int kernel: LNetError: 95187:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 06:38:52 sh-103-53.int kernel: LNetError: 95187:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 2 previous similar messages Oct 30 06:38:52 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.24@o2ib4: -125 Oct 30 06:38:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 06:39:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 06:39:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 6 previous similar messages Oct 30 06:39:48 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.24@o2ib4: -125 Oct 30 06:40:11 sh-103-53.int kernel: LustreError: 99479:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779ea26900) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 06:40:11 sh-103-53.int kernel: LustreError: 99479:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 06:40:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 06:40:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 06:40:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 06:40:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 8 previous similar messages Oct 30 06:40:50 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 06:40:50 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 30 06:40:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.24@o2ib4: -125 Oct 30 06:41:56 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.24@o2ib4: -125 Oct 30 06:42:50 sh-103-53.int kernel: LNetError: 98305:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 06:42:50 sh-103-53.int kernel: LNetError: 98305:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 16 previous similar messages Oct 30 06:42:57 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.24@o2ib4: -125 Oct 30 06:42:57 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 30 06:43:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 06:43:53 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 30 06:45:17 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 06:45:17 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 06:45:17 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572442817, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660a400/0x51ab3c4ee69d736 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae53be8542f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 06:45:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 06:45:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.24@o2ib4: -125 Oct 30 06:45:54 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 3 previous similar messages Oct 30 06:46:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 0 seconds Oct 30 06:46:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 220 previous similar messages Oct 30 06:48:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 06:48:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 06:50:27 sh-103-53.int kernel: LustreError: 100264:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977dad2252c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 06:50:27 sh-103-53.int kernel: LustreError: 100264:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 06:50:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 06:50:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 06:55:36 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 06:55:36 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 06:55:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572443436, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76269200/0x51ab3c4ee69e020 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae556aaa667 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 06:55:36 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 06:56:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 06:56:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 291 previous similar messages Oct 30 06:59:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 06:59:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 07:00:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 07:00:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 07:05:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 07:05:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 07:05:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572444055, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfb8000/0x51ab3c4ee69e5a6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae5734411b5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 07:05:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 07:06:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 07:06:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 239 previous similar messages Oct 30 07:09:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 07:09:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 07:11:01 sh-103-53.int kernel: LustreError: 101920:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781903f7140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 07:11:01 sh-103-53.int kernel: LustreError: 101920:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 07:11:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 07:11:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 07:11:07 sh-103-53.int kernel: LNetError: 100668:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 07:11:07 sh-103-53.int kernel: LNetError: 100668:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 34 previous similar messages Oct 30 07:16:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 07:16:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 07:16:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572444670, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbf980/0x51ab3c4ee69e6da lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae5902c90a8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 07:16:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 07:16:10 sh-103-53.int kernel: LustreError: 
102235:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781903f7080) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 07:16:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 5 seconds Oct 30 07:16:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 222 previous similar messages Oct 30 07:19:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 07:19:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 07:20:28 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 07:20:28 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message Oct 30 07:21:18 sh-103-53.int kernel: LustreError: 102513:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781903f6300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 07:21:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 07:21:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 07:26:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 18 seconds Oct 30 07:26:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 228 previous similar messages Oct 30 07:26:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 07:26:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 07:26:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572445288, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbee40/0x51ab3c4ee69e766 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae5ad29b902 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 07:26:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 07:26:28 sh-103-53.int kernel: LustreError: 102784:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781903f6540) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 07:29:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 07:29:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 07:31:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 07:31:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 07:36:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 07:36:10 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message Oct 30 07:36:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.23@o2ib4: 2 seconds Oct 30 07:36:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 221 previous similar messages Oct 30 07:36:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 07:36:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 07:36:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 07:36:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572445905, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbfbc0/0x51ab3c4ee69e79e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae5c4e7d53d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 07:36:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 07:36:45 sh-103-53.int kernel: LustreError: 103370:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781903f6180) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 07:36:45 sh-103-53.int kernel: LustreError: 103370:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 07:39:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. 
Health = 900 Oct 30 07:39:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 07:41:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 07:41:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 07:46:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 07:46:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 227 previous similar messages Oct 30 07:47:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 07:47:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 07:47:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572446521, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbde80/0x51ab3c4ee69e800 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae5e1b69d2e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 07:47:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 07:47:02 sh-103-53.int kernel: LustreError: 103919:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781903f6d80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 07:47:02 sh-103-53.int kernel: LustreError: 103919:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 07:49:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 07:49:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 07:52:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 07:52:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 07:56:03 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.24@o2ib4: -125 Oct 30 07:56:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.21@o2ib4: 1 seconds Oct 30 07:56:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 212 previous similar messages Oct 30 07:57:16 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 07:57:16 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 07:57:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572447136, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfb9680/0x51ab3c4ee69e838 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae5fed62fa9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 07:57:16 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 07:57:16 sh-103-53.int 
kernel: LustreError: 104482:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781903f7e00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 07:57:16 sh-103-53.int kernel: LustreError: 104482:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 08:00:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 08:00:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 08:02:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 08:02:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 08:02:47 sh-103-53.int kernel: LNetError: 104125:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 0 Oct 30 08:06:13 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.21@o2ib4: -125 Oct 30 08:06:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 08:06:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 223 previous similar messages Oct 30 08:07:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 08:07:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 08:07:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572447753, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbb3c0/0x51ab3c4ee69e870 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae61b44b5e1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 08:07:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 08:07:33 sh-103-53.int kernel: LustreError: 105044:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781903f6300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 08:07:33 sh-103-53.int kernel: LustreError: 105044:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 08:10:10 sh-103-53.int kernel: Lustre: 91133:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572448203/real 1572448203] req@ffff978215d4a400 x1648382116766336/t0(0) o400->fir-OST003a-osc-ffff9781f2230800@10.0.10.109@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572448210 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 30 08:10:10 sh-103-53.int kernel: Lustre: fir-OST005e-osc-ffff9781f2230800: Connection to fir-OST005e (at 10.0.10.115@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 30 08:10:10 sh-103-53.int kernel: Lustre: 91133:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 33 previous similar messages Oct 30 08:10:11 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572448203/real 1572448203] req@ffff978212cb3600 x1648382116765328/t0(0) o400->MGC10.0.10.51@o2ib7@10.0.10.51@o2ib7:26/25 lens 224/224 e 0 to 1 dl 1572448211 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 30 08:10:11 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Oct 30 08:10:15 sh-103-53.int kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572448208/real 1572448208] req@ffff975580e69f80 x1648382116769904/t0(0) o3->fir-OST002a-osc-ffff9781f2230800@10.0.10.107@o2ib7:6/4 lens 488/4536 e 0 to 1 dl 1572448215 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Oct 30 08:10:15 sh-103-53.int kernel: Lustre: fir-OST002a-osc-ffff9781f2230800: Connection to fir-OST002a (at 10.0.10.107@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 30 08:10:15 sh-103-53.int kernel: Lustre: Skipped 40 previous similar messages Oct 30 08:10:17 
sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.22@o2ib4 added to recovery queue. Health = 900 Oct 30 08:10:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 29 previous similar messages Oct 30 08:10:24 sh-103-53.int kernel: Lustre: 91132:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572448203/real 1572448203] req@ffff978212cb6780 x1648382116765536/t0(0) o400->fir-OST0008-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572448224 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 30 08:10:24 sh-103-53.int kernel: Lustre: fir-OST0002-osc-ffff9781f2230800: Connection to fir-OST0002 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 30 08:10:24 sh-103-53.int kernel: Lustre: 91132:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Oct 30 08:10:47 sh-103-53.int kernel: Lustre: 91132:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572448203/real 1572448203] req@ffff978215d4b600 x1648382116766496/t0(0) o400->fir-OST0044-osc-ffff9781f2230800@10.0.10.111@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572448247 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 30 08:10:47 sh-103-53.int kernel: Lustre: 91132:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Oct 30 08:10:47 sh-103-53.int kernel: Lustre: fir-OST0044-osc-ffff9781f2230800: Connection to fir-OST0044 (at 10.0.10.111@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 30 08:10:47 sh-103-53.int kernel: Lustre: Skipped 2 previous similar messages Oct 30 08:14:59 sh-103-53.int kernel: perf: interrupt took too long (6190 > 6173), lowering kernel.perf_event_max_sample_rate to 32000 Oct 30 08:16:52 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 08:16:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 165 previous similar messages Oct 30 08:17:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 08:17:17 sh-103-53.int kernel: Lustre: Skipped 53 previous similar messages Oct 30 08:19:10 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572448706/real 1572448706] req@ffff977dd97a1680 x1648382116860096/t0(0) o400->fir-OST0027-osc-ffff9781f2230800@10.0.10.108@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572448750 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 30 08:19:10 sh-103-53.int kernel: Lustre: 91142:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Oct 30 08:19:10 sh-103-53.int kernel: Lustre: fir-OST0027-osc-ffff9781f2230800: Connection to fir-OST0027 (at 10.0.10.108@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 30 08:19:10 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Oct 30 08:20:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.41@o2ib4 added to recovery queue. Health = 900 Oct 30 08:20:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 28 previous similar messages Oct 30 08:21:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 08:21:17 sh-103-53.int kernel: LNetError: 104125:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 08:22:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 08:22:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 08:22:35 sh-103-53.int kernel: LustreError: Skipped 2 previous similar messages Oct 30 08:22:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572448655, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d3cc0/0x51ab3c4ee69e9d5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae64f17e085 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 08:22:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 08:22:35 sh-103-53.int kernel: LustreError: 105955:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477fb00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 08:22:35 sh-103-53.int kernel: LustreError: 105955:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 2 previous similar messages Oct 30 08:23:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 08:23:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 08:24:22 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 08:25:13 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 08:25:47 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 08:26:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 08:26:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 30 08:26:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 08:26:58 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 30 08:27:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 08:27:49 sh-103-53.int kernel: Lustre: Skipped 2 previous similar messages Oct 30 08:29:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 08:29:08 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 4 previous similar messages Oct 30 08:30:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 30 08:30:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 26 previous similar messages Oct 30 08:32:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 08:32:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 08:32:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572449277, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d0480/0x51ab3c4ee69eb87 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae687a891d0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 08:32:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 08:32:57 sh-103-53.int kernel: LustreError: 106748:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477e300) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 08:32:57 sh-103-53.int kernel: LustreError: 106748:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 08:33:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 08:33:31 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 30 08:37:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 08:37:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 60 previous similar messages Oct 30 08:38:04 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 08:38:04 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 08:40:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 08:40:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 43 previous similar messages Oct 30 08:42:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 08:42:23 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 19 previous similar messages Oct 30 08:43:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 08:43:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 08:43:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572449893, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d6c00/0x51ab3c4ee69ed63 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae6b82fcdee expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 08:43:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 08:43:13 sh-103-53.int kernel: LustreError: 107559:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 08:43:13 sh-103-53.int kernel: LustreError: 107559:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 08:47:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 08:47:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 57 previous similar messages Oct 30 08:48:23 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 08:48:23 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 08:50:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 08:50:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 48 previous similar messages Oct 30 08:52:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 08:52:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 30 previous similar messages Oct 30 08:53:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 08:53:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 08:53:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572450509, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d72c0/0x51ab3c4ee69f383 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae6dec6244d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 08:53:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 08:53:29 sh-103-53.int kernel: LustreError: 108575:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 08:53:29 sh-103-53.int kernel: LustreError: 108575:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 08:57:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 08:57:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 65 previous similar messages Oct 30 08:58:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 08:58:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 09:01:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 09:01:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 30 09:02:42 sh-103-53.int kernel: LNetError: 108660:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 09:02:42 sh-103-53.int kernel: LNetError: 108660:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 37 previous similar messages Oct 30 09:03:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 09:03:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 09:03:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572451126, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d72c0/0x51ab3c4ee69f47f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae719cfa23f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 09:03:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 09:07:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 09:07:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 48 previous similar messages Oct 30 09:08:56 sh-103-53.int kernel: LustreError: 109409:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f5c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 09:08:56 sh-103-53.int kernel: LustreError: 109409:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 09:08:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 09:08:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 09:11:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 09:11:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 59 previous similar messages Oct 30 09:12:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 09:12:52 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 39 previous similar messages Oct 30 09:14:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 09:14:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 09:14:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572451745, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d5e80/0x51ab3c4ee69f519 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae72384a06a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 09:14:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 09:17:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 09:17:51 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 54 previous similar messages Oct 30 09:19:17 sh-103-53.int kernel: LustreError: 109959:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f980) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 09:19:17 sh-103-53.int kernel: LustreError: 109959:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 09:19:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 09:19:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 09:21:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 09:21:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 47 previous similar messages Oct 30 09:23:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 09:23:18 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 23 previous similar messages Oct 30 09:24:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 09:24:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 09:24:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572452369, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d0b40/0x51ab3c4ee69f558 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae725e8e2c8 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 09:24:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 09:28:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 09:28:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 58 previous similar messages Oct 30 09:29:41 sh-103-53.int kernel: LustreError: 110506:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f5c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 09:29:41 sh-103-53.int kernel: LustreError: 110506:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 09:29:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 09:29:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 09:31:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.23@o2ib4 added to recovery queue. Health = 900 Oct 30 09:31:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 43 previous similar messages Oct 30 09:33:33 sh-103-53.int kernel: LNetError: 105714:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 09:33:33 sh-103-53.int kernel: LNetError: 105714:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 21 previous similar messages Oct 30 09:34:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 09:34:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 09:34:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572452994, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d18c0/0x51ab3c4ee69f590 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7288e140b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 09:34:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 09:38:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Oct 30 09:38:28 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 30 09:40:07 sh-103-53.int kernel: LustreError: 111057:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f140) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 09:40:07 sh-103-53.int kernel: LustreError: 111057:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 09:40:07 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 09:40:07 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 09:41:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 09:41:48 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages Oct 30 09:43:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 09:43:37 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 25 previous similar messages Oct 30 09:45:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 09:45:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 09:45:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572453622, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d0000/0x51ab3c4ee69f5c8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae72aeea6b9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 09:45:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 09:48:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 09:48:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 59 previous similar messages Oct 30 09:50:32 sh-103-53.int kernel: LustreError: 111605:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477ed80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 09:50:32 sh-103-53.int kernel: LustreError: 111605:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 09:50:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 09:50:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 09:51:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 09:51:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 45 previous similar messages Oct 30 09:53:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 09:53:42 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 23 previous similar messages Oct 30 09:55:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 09:55:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 09:55:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572454247, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d3840/0x51ab3c4ee69f600 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae72c936db4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 09:55:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 09:58:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 30 09:58:53 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 30 10:01:00 sh-103-53.int kernel: LustreError: 112174:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477e240) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 10:01:00 sh-103-53.int kernel: LustreError: 112174:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 10:01:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 10:01:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 10:02:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 10:02:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 47 previous similar messages Oct 30 10:04:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 10:04:01 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 23 previous similar messages Oct 30 10:06:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 10:06:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 10:06:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572454872, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d0d80/0x51ab3c4ee69f63f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae72f319cdc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 10:06:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 10:08:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 30 10:08:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 64 previous similar messages Oct 30 10:11:21 sh-103-53.int kernel: LustreError: 112735:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477e0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 10:11:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 10:11:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 10:12:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 30 10:12:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 49 previous similar messages Oct 30 10:14:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 10:14:07 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 17 previous similar messages Oct 30 10:16:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 10:16:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 10:16:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572455490, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d2f40/0x51ab3c4ee69f67e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7313fe0aa expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 10:16:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 10:19:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 10:19:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 30 10:22:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 30 10:22:27 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 48 previous similar messages Oct 30 10:22:32 sh-103-53.int kernel: LustreError: 113322:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477fec0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 10:22:32 sh-103-53.int kernel: LustreError: 113322:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 10:22:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 10:22:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 10:24:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 10:24:17 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 25 previous similar messages Oct 30 10:27:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 10:27:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 10:27:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572456162, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d3180/0x51ab3c4ee69f6b6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7325af73a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 10:27:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 10:29:13 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 10:29:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 30 10:32:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 10:32:37 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 47 previous similar messages Oct 30 10:32:49 sh-103-53.int kernel: LustreError: 113867:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477fd40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 10:32:49 sh-103-53.int kernel: LustreError: 113867:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 10:32:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 10:32:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 10:35:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 10:35:00 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 23 previous similar messages Oct 30 10:38:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 10:38:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 10:38:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572456783, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d3600/0x51ab3c4ee69f6ee lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7346ca9d5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 10:38:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 10:38:43 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 30 10:39:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 10:39:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 63 previous similar messages Oct 30 10:42:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 30 10:42:47 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 50 previous similar messages Oct 30 10:42:52 sh-103-53.int kernel: LNetError: 91067:0:(lib-move.c:2895:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.9.0.31@o2ib4: -125 Oct 30 10:43:17 sh-103-53.int kernel: LustreError: 114488:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 10:43:17 sh-103-53.int kernel: LustreError: 114488:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 10:43:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 10:43:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 10:45:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 10:45:25 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 29 previous similar messages Oct 30 10:48:29 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 10:48:29 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 10:48:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572457409, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d3cc0/0x51ab3c4ee6a107f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7353d4ea6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 10:48:29 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 10:49:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 30 10:49:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 51 previous similar messages Oct 30 10:52:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 10:52:57 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 42 previous similar messages Oct 30 10:53:41 sh-103-53.int kernel: LustreError: 115098:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 10:53:41 sh-103-53.int kernel: LustreError: 115098:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 10:53:41 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 10:53:41 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 10:55:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 10:55:28 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 30 10:58:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 10:58:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 10:58:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572458031, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d7740/0x51ab3c4ee6a10ef lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae735cd125a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 10:58:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 10:59:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 30 10:59:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 67 previous similar messages Oct 30 11:03:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. 
Health = 900 Oct 30 11:03:07 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 39 previous similar messages Oct 30 11:04:01 sh-103-53.int kernel: LustreError: 115662:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f5c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 11:04:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 11:04:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 11:05:54 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. Health = 900 Oct 30 11:05:54 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 12 previous similar messages Oct 30 11:09:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 11:09:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 11:09:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572458652, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d72c0/0x51ab3c4ee6a11c8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae73903b01a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 11:09:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 11:10:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 11:10:09 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 61 previous similar messages Oct 30 11:13:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.0.21@o2ib4 added to recovery queue. Health = 900 Oct 30 11:13:17 sh-103-53.int kernel: LNetError: 91067:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 40 previous similar messages Oct 30 11:14:20 sh-103-53.int kernel: LustreError: 116204:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f800) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 11:14:20 sh-103-53.int kernel: LustreError: 116204:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 11:14:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 11:14:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 11:16:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.103.53@o2ib4 added to recovery queue. 
Health = 900 Oct 30 11:16:09 sh-103-53.int kernel: LNetError: 91081:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 14 previous similar messages Oct 30 11:19:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 11:19:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 11:19:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572459267, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d7bc0/0x51ab3c4ee6a1200 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae73c178c16 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 11:19:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 11:20:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 11:20:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 62 previous similar messages Oct 30 11:24:36 sh-103-53.int kernel: LustreError: 117049:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477ecc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 11:24:36 sh-103-53.int kernel: LustreError: 117049:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 11:24:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 11:24:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 11:29:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 11:29:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 11:29:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572459887, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d69c0/0x51ab3c4ee6a1238 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae73f24f737 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 11:29:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 11:30:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 11:30:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 31 previous similar messages Oct 30 11:34:55 sh-103-53.int kernel: LustreError: 117609:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477fec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 11:34:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 11:34:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 11:40:04 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 11:40:04 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 11:40:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572460504, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d7740/0x51ab3c4ee6a1277 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74238f290 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 11:40:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 11:40:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 30 11:40:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 41 previous similar messages Oct 30 11:45:14 sh-103-53.int kernel: LustreError: 118150:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477e480) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 11:45:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 11:45:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 11:50:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 11:50:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 11:50:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572461122, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d72c0/0x51ab3c4ee6a12b6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae745a4c8c6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 11:50:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 11:50:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 11:50:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 40 previous similar messages Oct 30 11:55:34 sh-103-53.int kernel: LustreError: 118694:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477e780) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 11:55:34 sh-103-53.int kernel: LustreError: 118694:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 11:55:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 11:55:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 12:00:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 12:00:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 37 previous similar messages Oct 30 12:00:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 12:00:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 12:00:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572461746, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d2640/0x51ab3c4ee6a12ee lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74829a244 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 12:00:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 12:05:59 sh-103-53.int kernel: LustreError: 119259:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477f2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 12:05:59 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 12:05:59 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 12:11:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 12:11:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 12:11:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572462373, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d4380/0x51ab3c4ee6a1334 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae748b03b3f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 12:11:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 12:12:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 17 seconds Oct 30 12:12:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 37 previous similar messages Oct 30 12:16:20 sh-103-53.int kernel: LustreError: 119804:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9769e477fec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 12:16:20 sh-103-53.int kernel: LustreError: 119804:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 12:16:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 12:16:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 12:21:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 12:21:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 12:21:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572462987, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7660e300/0x51ab3c4ee6a1373 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7496bbca7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 12:21:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 12:23:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 12:23:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 12:26:35 sh-103-53.int kernel: LustreError: 120592:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a17566000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 12:26:35 sh-103-53.int kernel: LustreError: 120592:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 12:26:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 12:26:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 12:31:45 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 12:31:45 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 12:31:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572463605, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7c45c0/0x51ab3c4ee6a4e83 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74a4d2d69 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 12:31:45 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 12:33:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Oct 30 12:33:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 12:36:53 sh-103-53.int kernel: LustreError: 122088:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976bedeb6000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 12:36:53 sh-103-53.int kernel: LustreError: 122088:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 12:36:53 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 12:36:53 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 12:42:03 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 12:42:03 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 12:42:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572464223, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e7663de80/0x51ab3c4ee6a9c94 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74b6ab021 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 12:42:03 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 12:43:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 30 12:43:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 12:47:11 sh-103-53.int kernel: LustreError: 123279:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820bb76fc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 12:47:11 sh-103-53.int kernel: LustreError: 123279:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 12:47:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 12:47:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 12:52:22 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 12:52:22 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 12:52:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572464842, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781fcfbcec0/0x51ab3c4ee6ab12b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74cbaa3f6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 12:52:22 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 12:53:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 12:53:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 12:57:36 sh-103-53.int kernel: LustreError: 124342:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977121316b40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 12:57:36 sh-103-53.int kernel: LustreError: 124342:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 12:57:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 12:57:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 13:02:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 13:02:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 13:02:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572465471, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769b4250900/0x51ab3c4ee6afeb0 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74d75b19f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 13:02:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 13:04:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 13:04:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 13:08:01 sh-103-53.int kernel: LustreError: 125252:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820be42900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 13:08:01 sh-103-53.int kernel: LustreError: 125252:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 13:08:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 13:08:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 13:13:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 13:13:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 13:13:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572466091, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977107259b00/0x51ab3c4ee6b877d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74e3f4e19 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 13:13:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 13:14:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 13:14:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 13:18:21 sh-103-53.int kernel: LustreError: 126163:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781ffad95c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 13:18:21 sh-103-53.int kernel: LustreError: 126163:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 13:18:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 13:18:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 13:23:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 13:23:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 13:23:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572466708, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97743a7d2f40/0x51ab3c4ee6c1100 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74f207f3a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 13:23:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 13:24:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Oct 30 13:24:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 13:28:36 sh-103-53.int kernel: LustreError: 127070:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978207676d80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 13:28:36 sh-103-53.int kernel: LustreError: 127070:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 13:28:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 13:28:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 13:33:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 13:33:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 13:33:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572467327, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e76ac1b00/0x51ab3c4ee6cdd9d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae74fccd179 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 13:33:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 13:34:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 13:34:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 30 13:38:57 sh-103-53.int kernel: LustreError: 127837:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206a5d380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 13:38:57 sh-103-53.int kernel: LustreError: 127837:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 13:38:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 13:38:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 13:44:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 13:44:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 13:44:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572467945, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97781f6bd340/0x51ab3c4ee6d4fd4 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7507f8e8f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 13:44:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 13:44:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 13:44:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 30 13:49:15 sh-103-53.int kernel: LustreError: 128614:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977799a2c000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 13:49:15 sh-103-53.int kernel: LustreError: 128614:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 13:49:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 13:49:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 13:54:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 13:54:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 13:54:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572468563, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977437eeb600/0x51ab3c4ee6db657 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7513a8cf6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 13:54:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 13:54:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 20 seconds Oct 30 13:54:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 30 13:59:28 sh-103-53.int kernel: LustreError: 130000:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff978206a6eb40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 13:59:28 sh-103-53.int kernel: LustreError: 130000:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 13:59:28 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 13:59:28 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 14:04:35 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 14:04:35 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 14:04:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572469175, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820fc93cc0/0x51ab3c4ee6ee437 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae752014e13 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 14:04:35 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 14:04:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 30 14:04:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 14:09:40 sh-103-53.int kernel: LustreError: 130659:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e64f3c6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 14:09:40 sh-103-53.int kernel: LustreError: 130659:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 14:09:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 14:09:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 14:14:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 14:14:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 14:14:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572469790, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978210b69f80/0x51ab3c4ee6ee4c3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae752a57e65 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 14:14:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 14:17:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 14:17:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 14:20:00 sh-103-53.int kernel: LustreError: 132654:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6c3e1ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 14:20:00 sh-103-53.int kernel: LustreError: 132654:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 14:20:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 14:20:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 14:25:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 14:25:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 14:25:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572470409, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977110760b40/0x51ab3c4ee6f5eb7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae75448d611 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 14:25:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 14:30:18 sh-103-53.int kernel: LustreError: 133355:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e6f363740) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 14:30:18 sh-103-53.int kernel: LustreError: 133355:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 14:30:18 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 14:30:18 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 14:35:26 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 14:35:26 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 14:35:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572471026, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977110767500/0x51ab3c4ee6f7593 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae756b7237c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 14:35:26 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 14:40:32 sh-103-53.int kernel: LustreError: 133984:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774412df980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 14:40:32 sh-103-53.int kernel: LustreError: 133984:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 14:40:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 14:40:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 14:45:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 14:45:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 14:45:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572471640, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977681a4b600/0x51ab3c4ee6f861e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7581fe9d7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 14:45:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 14:50:50 sh-103-53.int kernel: LustreError: 134583:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777963cfb00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 14:50:50 sh-103-53.int kernel: LustreError: 134583:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 14:50:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 14:50:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 14:56:00 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 14:56:00 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 14:56:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572472260, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9781bbaa9d40/0x51ab3c4ee6fa6b6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae759358684 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 14:56:00 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 14:59:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 14:59:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 2 previous similar messages Oct 30 15:01:10 sh-103-53.int kernel: LustreError: 135487:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779c672540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 15:01:10 sh-103-53.int kernel: LustreError: 135487:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 15:01:10 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 15:01:10 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 15:01:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 30 15:04:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Oct 30 15:04:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 1 previous similar message Oct 30 15:06:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 15:06:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 15:06:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572472878, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821d8fba80/0x51ab3c4ee6fd98b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae75a6d56d3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 15:06:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 15:10:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Oct 30 15:10:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 3 previous similar messages Oct 30 15:11:29 sh-103-53.int kernel: LustreError: 
136406:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820971e000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 15:11:29 sh-103-53.int kernel: LustreError: 136406:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 15:11:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 15:11:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 15:16:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 15:16:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 15:16:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572473501, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c39c800/0x51ab3c4ee6fe4c8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae75bbe5c49 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 15:16:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 15:20:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 15:20:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 30 15:21:46 sh-103-53.int kernel: LustreError: 137050:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e64f258c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 15:21:46 sh-103-53.int kernel: LustreError: 137050:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 15:21:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 15:21:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 15:26:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 15:26:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 15:26:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572474113, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c39c800/0x51ab3c4ee6fe562 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae75d464c5f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 15:26:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 15:30:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 15:30:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 15:32:02 sh-103-53.int kernel: LustreError: 137683:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e64f24540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 15:32:02 sh-103-53.int kernel: LustreError: 137683:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 15:32:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 15:32:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 15:37:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 15:37:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 15:37:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572474729, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c39a1c0/0x51ab3c4ee6fe753 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae75f3f7021 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 15:37:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 15:40:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Oct 30 15:40:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 15:42:21 sh-103-53.int kernel: LustreError: 138351:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e64f24cc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 15:42:21 sh-103-53.int kernel: LustreError: 138351:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 15:42:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 15:42:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 15:47:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 15:47:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 15:47:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572475353, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c398000/0x51ab3c4ee6fe879 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae760ba662b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 15:47:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 15:50:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 15:50:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 15:52:40 sh-103-53.int kernel: LustreError: 138951:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e64f24480) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 15:52:40 sh-103-53.int kernel: LustreError: 138951:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 15:52:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 15:52:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 15:57:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 15:57:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 15:57:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572475967, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977796f560c0/0x51ab3c4ee6fecbd lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae761c1eb5b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 15:57:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 16:00:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 16:00:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 16:02:56 sh-103-53.int kernel: LustreError: 139577:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a1a23a3c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 16:02:56 sh-103-53.int kernel: LustreError: 139577:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 16:02:56 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 16:02:56 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 16:08:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 16:08:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 16:08:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572476585, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a6c00/0x51ab3c4ee6feda4 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae76317b1dc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 16:08:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 16:11:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Oct 30 16:11:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 16:13:15 sh-103-53.int kernel: LustreError: 140194:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782092e7500) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 16:13:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 16:13:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 16:18:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 16:18:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 16:18:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572477203, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a6e40/0x51ab3c4ee6fef6b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7649a1f25 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 16:18:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 16:21:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Oct 30 16:21:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 16:23:30 sh-103-53.int kernel: LustreError: 140808:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782092e7980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 16:23:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 16:23:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 16:28:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 16:28:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 16:28:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572477818, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a3840/0x51ab3c4ee6ff060 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae765831c49 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 16:28:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 16:31:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 16:31:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 16:33:49 sh-103-53.int kernel: LustreError: 141410:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782092e72c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 16:33:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 16:33:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 16:38:56 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 16:38:56 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 16:38:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572478436, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a4380/0x51ab3c4ee6ff1b7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae766cd13e9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 16:38:56 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 16:41:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 16:41:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 16:44:03 sh-103-53.int kernel: LustreError: 142014:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782092e6300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 16:44:03 sh-103-53.int kernel: LustreError: 142014:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 16:44:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 16:44:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 16:49:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 16:49:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 16:49:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572479050, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a5a00/0x51ab3c4ee6ff2c8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7697afb77 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 16:49:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 16:52:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 16:52:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 16:54:16 sh-103-53.int kernel: LustreError: 142619:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782092e6600) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 16:54:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 16:54:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 16:59:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 16:59:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 16:59:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572479667, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a0240/0x51ab3c4ee6ff3f5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae76ba54dd1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 16:59:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 17:02:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 17:02:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 17:04:39 sh-103-53.int kernel: LustreError: 143261:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a1a23a300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 17:04:39 sh-103-53.int kernel: LustreError: 143261:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 17:04:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 17:04:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 17:09:53 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 17:09:53 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 17:09:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572480293, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a2ac0/0x51ab3c4ee6ff5f4 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae76cd8bc3d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 17:09:53 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 17:13:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 17:13:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 17:15:06 sh-103-53.int kernel: LustreError: 143871:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782092e7140) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 17:15:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 17:15:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 17:20:19 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 17:20:19 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 17:20:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572480919, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a5340/0x51ab3c4ee6ff713 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae76da4cbfb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 17:20:19 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 17:23:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 30 17:23:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 17:25:32 sh-103-53.int kernel: LustreError: 144476:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782092e7200) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 17:25:32 sh-103-53.int kernel: LustreError: 144476:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 17:25:32 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 17:25:32 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 17:30:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 17:30:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 17:30:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572481546, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a2400/0x51ab3c4ee6ffaaf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae76eba32e6 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 17:30:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 17:33:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 17:33:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 17:36:01 sh-103-53.int kernel: LustreError: 145094:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9782092e7ec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 17:36:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 17:36:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 17:41:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 17:41:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 17:41:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572482172, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a6300/0x51ab3c4ee6ffb6c lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7702042c7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 17:41:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 17:43:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 30 17:43:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 30 17:46:25 sh-103-53.int kernel: LustreError: 145706:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a1a23a540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 17:46:25 sh-103-53.int kernel: LustreError: 145706:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 17:46:25 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 17:46:25 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 17:51:31 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 17:51:31 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 17:51:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572482791, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977796f53f00/0x51ab3c4ee6ffd2c lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae770a59a0f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 17:51:31 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 17:53:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 17:53:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 17:56:38 sh-103-53.int kernel: LustreError: 146306:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a1a23b500) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 17:56:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 17:56:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 18:01:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 18:01:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 18:01:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572483407, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a18c0/0x51ab3c4ee6ffe67 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae771a76de1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 18:01:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 18:03:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Oct 30 18:03:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 18:07:00 sh-103-53.int kernel: LustreError: 146926:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a1a23a6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 18:07:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 18:07:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 18:12:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 18:12:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 18:12:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572484030, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977796f56c00/0x51ab3c4ee6fffa2 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae77302f78d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 18:12:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 18:14:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 18:14:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 18:17:16 sh-103-53.int kernel: LustreError: 147527:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a1a23af00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 18:17:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 18:17:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 18:22:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 18:22:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 18:22:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572484645, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977796f53840/0x51ab3c4ee7001d9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae774348274 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 18:22:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 18:24:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 18:24:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 18:27:33 sh-103-53.int kernel: LustreError: 148212:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781d5aa2a80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 18:27:33 sh-103-53.int kernel: LustreError: 148212:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 18:27:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 18:27:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 18:32:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 18:32:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 18:32:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572485262, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a3180/0x51ab3c4ee7003a0 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7755eef01 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 18:32:43 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 18:35:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 18:35:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 30 18:37:49 sh-103-53.int kernel: LustreError: 148972:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97779d2e03c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 18:37:49 sh-103-53.int kernel: LustreError: 148972:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 18:37:49 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 18:37:49 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 18:42:58 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 18:42:58 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 18:42:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572485878, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a5100/0x51ab3c4ee7005f3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae776b26721 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 18:42:58 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 18:45:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 18:45:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 18:48:05 sh-103-53.int kernel: LustreError: 149772:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781893749c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 18:48:05 sh-103-53.int kernel: LustreError: 149772:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 18:48:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 18:48:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 18:53:17 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 18:53:17 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 18:53:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572486497, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a7980/0x51ab3c4ee7006d3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae777b6d2e2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 18:53:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 18:55:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 30 18:55:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 3 previous similar messages Oct 30 18:58:29 sh-103-53.int kernel: LustreError: 150572:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781d5aa2540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 18:58:29 sh-103-53.int kernel: LustreError: 150572:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 18:58:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 18:58:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 19:03:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 19:03:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 19:03:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572487119, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a3f00/0x51ab3c4ee700942 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae778826c3a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 19:03:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 19:05:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 19:05:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 19:08:46 sh-103-53.int kernel: LustreError: 151357:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781d5aa2480) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 19:08:46 sh-103-53.int kernel: LustreError: 151357:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 19:08:46 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 19:08:46 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 19:13:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 19:13:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 19:13:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572487734, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a7080/0x51ab3c4ee700a8b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7794e0147 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 19:13:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 19:17:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 19:17:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 30 19:19:05 sh-103-53.int kernel: LustreError: 152135:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e77770b40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 19:19:05 sh-103-53.int kernel: LustreError: 152135:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 19:19:05 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 19:19:05 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 19:24:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 19:24:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 19:24:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572488358, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977796f506c0/0x51ab3c4ee700b72 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae77a32c213 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 19:24:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 19:27:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Oct 30 19:27:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 30 19:29:33 sh-103-53.int kernel: LustreError: 152957:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781d5aa26c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 19:29:33 sh-103-53.int kernel: LustreError: 152957:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 19:29:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 19:29:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 19:34:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 19:34:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 19:34:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572488987, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a4800/0x51ab3c4ee7012b1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae77b6eca29 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 19:34:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 19:37:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 19:37:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 30 19:40:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 19:40:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 19:45:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 19:45:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 19:45:07 
sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572489607, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a1680/0x51ab3c4ee7015f2 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae77c2091e4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 19:45:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 19:45:07 sh-103-53.int kernel: LustreError: 154031:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781d5aa2d80) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 19:45:07 sh-103-53.int kernel: LustreError: 154031:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 19:47:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 19:47:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 19:50:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 19:50:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 19:55:27 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 19:55:27 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 19:55:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572490227, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: 
ffff977e8b7a18c0/0x51ab3c4ee701db6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae77d3109bf expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 19:55:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 19:55:27 sh-103-53.int kernel: LustreError: 154835:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97778ea41200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 19:55:27 sh-103-53.int kernel: LustreError: 154835:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 19:57:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 19:57:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 30 20:00:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 20:00:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 20:05:44 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 20:05:44 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 20:05:44 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572490844, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff977e8b7a0480/0x51ab3c4ee702002 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae77e5894dd expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 20:05:44 sh-103-53.int kernel: LustreError: 
91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 20:05:44 sh-103-53.int kernel: LustreError: 155543:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97778ea41d40) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 20:05:44 sh-103-53.int kernel: LustreError: 155543:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 20:08:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 30 20:08:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 20:10:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 20:10:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 20:16:07 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 20:16:07 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 20:16:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572491467, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd4a40/0x51ab3c4ee702630 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae77f1bedfc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 20:16:07 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 20:16:08 sh-103-53.int kernel: LustreError: 156246:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource 
[0x726966:0x2:0x0].0x0 (ffff976019372000) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 30 20:16:08 sh-103-53.int kernel: LustreError: 156246:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 20:20:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 20:20:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 20:21:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 20:21:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 20:26:30 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 20:26:30 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 20:26:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572492089, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd5100/0x51ab3c4ee7027db lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae77fce04d2 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 20:26:30 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 20:26:30 sh-103-53.int kernel: LustreError: 156928:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976019372540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 20:26:30 sh-103-53.int kernel: LustreError: 156928:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 20:30:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 16 seconds Oct 30 20:30:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 20:31:38 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 20:31:38 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 20:36:48 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 20:36:48 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 20:36:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572492708, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd1b00/0x51ab3c4ee702ddf lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78075be62 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 20:36:48 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 20:36:48 sh-103-53.int kernel: LustreError: 157643:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976019372e40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 20:36:48 sh-103-53.int kernel: LustreError: 157643:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 20:40:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 30 20:40:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 20:42:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 20:42:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 20:47:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 20:47:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 20:47:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572493333, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9769a6fd72c0/0x51ab3c4ee7030a9 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7812a379d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 20:47:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 20:47:13 sh-103-53.int kernel: LustreError: 158339:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976019373c80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 20:47:13 sh-103-53.int kernel: LustreError: 158339:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 20:50:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 20:50:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 30 20:52:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 20:52:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 20:57:34 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 20:57:34 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 20:57:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572493954, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c80ba80/0x51ab3c4ee703621 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae781d7df4d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 20:57:34 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 20:57:34 sh-103-53.int kernel: LustreError: 159034:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977797e3c0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 20:57:34 sh-103-53.int kernel: LustreError: 159034:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 21:02:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 21:02:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 21:02:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 30 21:02:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 30 21:07:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 21:07:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 21:07:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572494574, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c809200/0x51ab3c4ee703a0a lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7828168ea expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 21:07:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 21:07:54 sh-103-53.int kernel: LustreError: 159738:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977107242840) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 21:07:54 sh-103-53.int kernel: LustreError: 159738:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 21:13:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 21:13:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 21:13:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Oct 30 21:13:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 30 21:18:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 21:18:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 21:18:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572495201, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c808000/0x51ab3c4ee703fac lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7833b4182 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 21:18:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 21:18:22 sh-103-53.int kernel: LustreError: 160453:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97711f60d680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 21:18:22 sh-103-53.int kernel: LustreError: 160453:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 21:23:36 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 21:23:36 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 21:24:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 21:24:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 21:28:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 21:28:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 21:28:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572495832, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978215008000/0x51ab3c4ee70450f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae783f18cf3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 21:28:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 21:28:52 sh-103-53.int kernel: LustreError: 161157:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0573c300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 21:28:52 sh-103-53.int kernel: LustreError: 161157:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 21:34:03 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 21:34:03 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 21:34:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 30 21:34:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 4 previous similar messages Oct 30 21:39:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 21:39:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 21:39:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572496452, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978215009680/0x51ab3c4ee704b59 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae784a4ec28 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 21:39:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 21:39:12 sh-103-53.int kernel: LustreError: 161859:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0573d2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 21:39:12 sh-103-53.int kernel: LustreError: 161859:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 21:44:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 21:44:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 21:44:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 30 21:44:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 30 21:49:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 21:49:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 21:49:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572497068, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500f740/0x51ab3c4ee704cfd lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78544011e expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 21:49:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 21:49:28 sh-103-53.int kernel: LustreError: 162564:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0573dec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 21:49:28 sh-103-53.int kernel: LustreError: 162564:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 21:54:37 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 21:54:37 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 21:54:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 30 21:54:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 4 previous similar messages Oct 30 21:59:50 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 21:59:50 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 21:59:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572497690, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500de80/0x51ab3c4ee705554 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae785e3023a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 21:59:50 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 21:59:51 sh-103-53.int kernel: LustreError: 163255:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff976a0573d980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 21:59:51 sh-103-53.int kernel: LustreError: 163255:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 22:04:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 22:04:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 30 22:05:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 22:05:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 22:10:11 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 22:10:11 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 22:10:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572498311, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500a1c0/0x51ab3c4ee705991 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae786815d39 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 22:10:11 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 22:15:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 22:15:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 30 22:15:20 sh-103-53.int kernel: LustreError: 164303:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fc6c0) refcount nonzero (1) after lock cleanup; forcing 
cleanup. Oct 30 22:15:20 sh-103-53.int kernel: LustreError: 164303:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 22:15:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 22:15:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 22:20:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 22:20:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 22:20:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572498933, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978215009b00/0x51ab3c4ee705d50 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7871df98f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 22:20:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 22:25:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 22:25:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 22:25:43 sh-103-53.int kernel: LustreError: 164988:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fcd80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 22:25:43 sh-103-53.int kernel: LustreError: 164988:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 22:25:43 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 22:25:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 22:30:51 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 22:30:51 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 22:30:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572499551, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500c800/0x51ab3c4ee7063f5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae787a3f8c9 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 22:30:51 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 22:35:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Oct 30 22:35:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 30 22:36:00 sh-103-53.int kernel: LustreError: 165680:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fcfc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 22:36:00 sh-103-53.int kernel: LustreError: 165680:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 22:36:00 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 22:36:00 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 22:41:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 22:41:10 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 22:41:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572500170, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500bf00/0x51ab3c4ee70679f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78830ef82 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 22:41:10 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 22:45:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 22:45:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 22:46:21 sh-103-53.int kernel: LustreError: 166380:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fc6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 22:46:21 sh-103-53.int kernel: LustreError: 166380:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 22:46:21 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 22:46:21 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 22:51:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 22:51:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 22:51:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572500797, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500e9c0/0x51ab3c4ee706b96 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae788d96a9c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 22:51:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 22:55:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 30 22:55:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 22:56:50 sh-103-53.int kernel: LustreError: 167079:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fd440) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 22:56:50 sh-103-53.int kernel: LustreError: 167079:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 22:56:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 22:56:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 23:02:04 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 23:02:04 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 23:02:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572501424, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500f500/0x51ab3c4ee706fda lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae789b9162c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 23:02:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 23:05:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 23:05:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 30 23:07:16 sh-103-53.int kernel: LustreError: 167803:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fd500) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 23:07:16 sh-103-53.int kernel: LustreError: 167803:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 23:07:16 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 23:07:16 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 23:12:24 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 23:12:24 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 23:12:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572502044, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500ca40/0x51ab3c4ee707521 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78a96fff5 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 23:12:24 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 23:15:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 30 23:15:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 23:17:31 sh-103-53.int kernel: LustreError: 168483:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fc900) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 23:17:31 sh-103-53.int kernel: LustreError: 168483:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 23:17:31 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 23:17:31 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 23:22:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 23:22:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 23:22:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572502658, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978215008000/0x51ab3c4ee707823 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78b638d3c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 23:22:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 23:26:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 23:26:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 23:27:47 sh-103-53.int kernel: LustreError: 169172:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fd980) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 23:27:47 sh-103-53.int kernel: LustreError: 169172:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 23:27:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 23:27:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 23:32:55 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 23:32:55 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 23:32:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572503275, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500a640/0x51ab3c4ee707de1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78c25e982 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 23:32:55 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 23:37:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 30 23:37:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 23:38:08 sh-103-53.int kernel: LustreError: 169874:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fcd80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 23:38:08 sh-103-53.int kernel: LustreError: 169874:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 23:38:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 23:38:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 23:43:23 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 23:43:23 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 23:43:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572503903, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978215008000/0x51ab3c4ee708321 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78ce21a29 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 23:43:23 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 23:47:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 23:47:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 30 23:48:33 sh-103-53.int kernel: LustreError: 170575:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fc6c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 23:48:33 sh-103-53.int kernel: LustreError: 170575:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 23:48:33 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 23:48:33 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 30 23:53:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 30 23:53:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 30 23:53:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572504526, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff978215008d80/0x51ab3c4ee7087ff lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78da4db5c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 30 23:53:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 30 23:57:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 30 23:57:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 30 23:59:01 sh-103-53.int kernel: LustreError: 171294:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fd680) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 30 23:59:01 sh-103-53.int kernel: LustreError: 171294:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 30 23:59:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 30 23:59:01 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 00:04:12 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 00:04:12 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 00:04:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572505152, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97821500f740/0x51ab3c4ee708be8 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78e6b9215 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 00:04:12 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 00:07:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 00:07:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 00:09:30 sh-103-53.int kernel: LustreError: 172029:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fd2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 00:09:30 sh-103-53.int kernel: LustreError: 172029:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 00:09:30 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 00:09:30 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 00:14:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 00:14:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 00:14:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572505777, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9782150086c0/0x51ab3c4ee709017 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78f226208 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 00:14:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 00:19:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 00:19:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 00:19:44 sh-103-53.int kernel: LustreError: 172710:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9774fb7fde00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 00:19:44 sh-103-53.int kernel: LustreError: 172710:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 00:19:44 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 00:19:45 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 00:24:57 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 00:24:57 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 00:24:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572506397, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c80b600/0x51ab3c4ee70951f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae78fd747fb expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 00:24:57 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 00:29:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 00:29:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 00:30:11 sh-103-53.int kernel: LustreError: 173415:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977ac2e0be00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 00:30:11 sh-103-53.int kernel: LustreError: 173415:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 00:30:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 00:30:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 00:35:25 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 00:35:25 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 00:35:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572507025, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c80b3c0/0x51ab3c4ee709a2e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae790851281 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 00:35:25 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 00:39:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 00:39:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 00:40:39 sh-103-53.int kernel: LustreError: 174115:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97742ea7f380) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 00:40:39 sh-103-53.int kernel: LustreError: 174115:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 00:40:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 00:40:39 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 00:45:52 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 00:45:52 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 00:45:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572507652, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c80e780/0x51ab3c4ee70a047 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79132adc0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 00:45:52 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 00:50:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 00:50:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 31 00:51:06 sh-103-53.int kernel: LustreError: 174808:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977ac2e0a9c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 00:51:06 sh-103-53.int kernel: LustreError: 174808:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 00:51:06 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 00:51:06 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 00:56:21 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 00:56:21 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 00:56:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572508281, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c80f980/0x51ab3c4ee70a69f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae791e724e0 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 00:56:21 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 01:00:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 01:00:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 01:01:35 sh-103-53.int kernel: LustreError: 175535:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977ac2e0a840) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 01:01:35 sh-103-53.int kernel: LustreError: 175535:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 01:01:35 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 01:01:35 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 01:06:47 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 01:06:47 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 01:06:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572508907, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c80dc40/0x51ab3c4ee70aac7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79298f570 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 01:06:47 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 01:10:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 31 01:10:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 01:12:02 sh-103-53.int kernel: LustreError: 176243:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9781a2eeafc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 01:12:02 sh-103-53.int kernel: LustreError: 176243:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 01:12:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 01:12:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 01:17:15 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 01:17:15 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 01:17:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572509535, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c80e780/0x51ab3c4ee70afa5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7935aae63 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 01:17:15 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 01:21:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Oct 31 01:21:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 31 01:22:27 sh-103-53.int kernel: LustreError: 176939:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296e40) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 01:22:27 sh-103-53.int kernel: LustreError: 176939:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 01:22:27 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 01:22:27 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 01:27:41 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 01:27:41 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 01:27:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572510161, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf53840/0x51ab3c4ee70b715 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7940fa057 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 01:27:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 01:31:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 01:31:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 31 01:32:55 sh-103-53.int kernel: LustreError: 177632:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e652975c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 01:32:55 sh-103-53.int kernel: LustreError: 177632:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 01:32:55 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 01:32:55 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 01:38:09 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 01:38:09 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 01:38:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572510789, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52ac0/0x51ab3c4ee70ba09 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae794c5d563 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 01:38:09 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 01:42:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 01:42:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 01:43:26 sh-103-53.int kernel: LustreError: 178344:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9775c2fb5200) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 01:43:26 sh-103-53.int kernel: LustreError: 178344:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 01:43:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 01:43:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 01:48:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 01:48:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 01:48:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572511418, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c808000/0x51ab3c4ee70bff1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7958c4757 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 01:48:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 01:52:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 01:52:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 31 01:53:52 sh-103-53.int kernel: LustreError: 179044:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97821e130a80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 01:53:52 sh-103-53.int kernel: LustreError: 179044:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 01:53:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 01:53:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 01:59:05 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 01:59:05 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 01:59:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572512045, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c80c140/0x51ab3c4ee70c267 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae796570444 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 01:59:05 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 02:02:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 31 02:02:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 31 02:04:20 sh-103-53.int kernel: LustreError: 179764:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9770fd3458c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 02:04:20 sh-103-53.int kernel: LustreError: 179764:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 02:04:20 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 02:04:20 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 02:09:33 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 02:09:33 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 02:09:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572512673, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c809200/0x51ab3c4ee70c872 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79714789d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 02:09:33 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 02:12:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Oct 31 02:12:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 02:14:47 sh-103-53.int kernel: LustreError: 180482:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977ac2e0ac00) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 02:14:47 sh-103-53.int kernel: LustreError: 180482:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 02:14:47 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 02:14:47 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 02:20:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 02:20:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 02:20:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572513302, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff97820c8098c0/0x51ab3c4ee70cc15 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae797c4632b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 02:20:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 02:23:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 02:23:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 3 previous similar messages Oct 31 02:25:16 sh-103-53.int kernel: LustreError: 181177:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296240) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 02:25:16 sh-103-53.int kernel: LustreError: 181177:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 02:25:17 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 02:25:17 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 02:30:32 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 02:30:32 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 02:30:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572513931, 301s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf51f80/0x51ab3c4ee70d2a5 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7986f98fc expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 02:30:32 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 02:33:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 02:33:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 02:35:40 sh-103-53.int kernel: LustreError: 181873:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296180) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 02:35:40 sh-103-53.int kernel: LustreError: 181873:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 02:35:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 02:35:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 02:40:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 02:40:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 02:40:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572514546, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55580/0x51ab3c4ee70d633 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7991c812f expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 02:40:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 02:43:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 02:43:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 02:45:57 sh-103-53.int kernel: LustreError: 182558:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e652978c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 02:45:57 sh-103-53.int kernel: LustreError: 182558:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 02:45:57 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 02:45:57 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 02:51:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 02:51:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 02:51:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572515173, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf569c0/0x51ab3c4ee70dc84 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae799cd9a92 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 02:51:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 02:53:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 31 02:53:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 02:56:24 sh-103-53.int kernel: LustreError: 183263:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 02:56:24 sh-103-53.int kernel: LustreError: 183263:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 02:56:24 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 02:56:24 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 03:01:38 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 03:01:38 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 03:01:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572515798, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54380/0x51ab3c4ee70e066 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79a70f6b4 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 03:01:38 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 03:03:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Oct 31 03:03:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 03:06:50 sh-103-53.int kernel: LustreError: 183980:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297140) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 03:06:50 sh-103-53.int kernel: LustreError: 183980:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 03:06:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 03:06:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 03:12:04 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 03:12:04 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 03:12:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572516424, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55340/0x51ab3c4ee70e598 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79b192c29 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 03:12:04 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 03:13:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 03:13:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 03:17:11 sh-103-53.int kernel: LustreError: 184696:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296840) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 03:17:11 sh-103-53.int kernel: LustreError: 184696:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 03:17:11 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 03:17:11 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 03:22:18 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 03:22:18 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 03:22:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572517038, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57080/0x51ab3c4ee70ea76 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79baec7de expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 03:22:18 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 03:24:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 03:24:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 03:27:34 sh-103-53.int kernel: LustreError: 185399:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296d80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 03:27:34 sh-103-53.int kernel: LustreError: 185399:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 03:27:34 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 03:27:34 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 03:32:46 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 03:32:46 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 03:32:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572517666, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf50fc0/0x51ab3c4ee70eec1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79c3f65a3 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 03:32:46 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 03:34:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 03:34:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 03:37:58 sh-103-53.int kernel: LustreError: 186086:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65297740) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 03:37:58 sh-103-53.int kernel: LustreError: 186086:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 03:37:58 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 03:37:58 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 03:43:13 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 03:43:13 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 03:43:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572518293, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52ac0/0x51ab3c4ee70f3f3 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79cda4143 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 03:43:13 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 03:44:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 31 03:44:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 03:48:26 sh-103-53.int kernel: LustreError: 186796:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 03:48:26 sh-103-53.int kernel: LustreError: 186796:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 03:48:26 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 03:48:26 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 03:53:37 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 03:53:37 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 03:53:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572518917, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54800/0x51ab3c4ee70f765 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79d7e262b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 03:53:37 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 03:54:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 03:54:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 03:58:50 sh-103-53.int kernel: LustreError: 187495:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff977e65296a80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 03:58:50 sh-103-53.int kernel: LustreError: 187495:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 03:58:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 03:58:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 04:03:59 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 04:03:59 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 04:03:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572519539, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf518c0/0x51ab3c4ee70fcac lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79e1bb34b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 04:03:59 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 04:04:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Oct 31 04:04:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 04:09:15 sh-103-53.int kernel: LustreError: 188202:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9770fc30fc80) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 04:09:15 sh-103-53.int kernel: LustreError: 188202:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 04:09:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 04:09:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 04:14:28 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 04:14:28 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 04:14:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572520168, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf55c40/0x51ab3c4ee710095 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79eb3b43d expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 04:14:28 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 04:14:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 26 seconds Oct 31 04:14:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 04:19:40 sh-103-53.int kernel: LustreError: 188932:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9770fc30f2c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 04:19:40 sh-103-53.int kernel: LustreError: 188932:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 04:19:40 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 04:19:40 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 04:24:54 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 04:24:54 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 04:24:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572520794, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf52ac0/0x51ab3c4ee7106f4 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79f5281c7 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 04:24:54 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 04:25:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Oct 31 04:25:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 04:30:08 sh-103-53.int kernel: LustreError: 189622:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97744734c240) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 04:30:08 sh-103-53.int kernel: LustreError: 189622:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 04:30:08 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 04:30:08 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 04:35:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 04:35:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 04:35:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 04:35:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 04:35:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572521420, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf50fc0/0x51ab3c4ee710cdc lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae79fe78948 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 04:35:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 04:40:29 sh-103-53.int kernel: LustreError: 190329:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820b7e0000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 04:40:29 sh-103-53.int kernel: LustreError: 190329:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 04:40:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 04:40:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 04:45:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 31 04:45:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 04:45:42 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 04:45:42 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 04:45:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572522042, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57500/0x51ab3c4ee71110b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7a077088b expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 04:45:42 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 04:51:02 sh-103-53.int kernel: LustreError: 191025:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777fd2a40c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 04:51:02 sh-103-53.int kernel: LustreError: 191025:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 04:51:02 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 04:51:02 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 04:55:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 04:55:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 31 04:56:17 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 04:56:17 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 04:56:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572522677, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57980/0x51ab3c4ee71184a lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7a1088b58 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 04:56:17 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 05:01:29 sh-103-53.int kernel: LustreError: 191948:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777fd2a40c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 05:01:29 sh-103-53.int kernel: LustreError: 191948:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 05:01:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 05:01:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 05:05:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 05:05:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 05:06:39 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 05:06:39 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 05:06:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572523299, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf51b00/0x51ab3c4ee711c8e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7a1a1f2a1 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 05:06:39 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 05:11:50 sh-103-53.int kernel: LustreError: 192799:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff97820b7e0000) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 05:11:50 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 05:11:50 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 05:15:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 05:15:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 05:17:01 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 05:17:01 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 05:17:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572523921, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57500/0x51ab3c4ee7121f1 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7a2362e7a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 05:17:01 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 05:22:14 sh-103-53.int kernel: LustreError: 193708:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777fd2a4240) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 05:22:14 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 05:22:14 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 05:26:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 05:26:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 05:27:20 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 05:27:20 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 05:27:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572524540, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf57bc0/0x51ab3c4ee712754 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7a2bcfded expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 05:27:20 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 05:32:29 sh-103-53.int kernel: LustreError: 194612:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777fd2a5bc0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 05:32:29 sh-103-53.int kernel: LustreError: 194612:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 05:32:29 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 05:32:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 05:36:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Oct 31 05:36:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 05:37:40 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 05:37:40 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 05:37:40 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572525160, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf54ec0/0x51ab3c4ee7129a0 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7a348762a expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 05:37:41 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 05:42:52 sh-103-53.int kernel: LustreError: 195499:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777fd2a46c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 05:42:52 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 05:42:52 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 05:46:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 05:46:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 05:48:02 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Oct 31 05:48:02 sh-103-53.int kernel: LustreError: Skipped 1 previous similar message Oct 31 05:48:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572525782, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf50000/0x51ab3c4ee712aaa lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7a3d64f0c expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 05:48:02 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 05:52:13 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 31 05:53:15 sh-103-53.int kernel: LustreError: 196336:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777fd2a4540) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Oct 31 05:53:15 sh-103-53.int kernel: LustreError: 196336:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Oct 31 05:53:15 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 05:53:15 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Oct 31 05:54:23 sh-103-53.int kernel: Lustre: 91124:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572526445/real 1572526445] req@ffff977dc764da00 x1648382135729312/t0(0) o400->MGC10.0.10.51@o2ib7@10.0.10.51@o2ib7:26/25 lens 224/224 e 0 to 1 dl 1572526463 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 31 05:56:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 05:56:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 05:58:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572526407, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9754fbf560c0/0x51ab3c4ee712ae2 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc9be2ae7a459db25 expref: -99 pid: 91159 timeout: 0 lvb_type: 0 Oct 31 05:58:27 sh-103-53.int kernel: LustreError: 91159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message Oct 31 06:06:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 06:06:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 06:15:11 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Oct 31 06:16:53 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 31 06:16:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 06:23:39 sh-103-53.int kernel: Lustre: Evicted from MGS (at MGC10.0.10.51@o2ib7_0) after server handle changed from 0xc9be2abad0572bfb to 0x6f11bfb841877501 Oct 31 06:23:39 sh-103-53.int kernel: LustreError: 198999:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9777fd2a4f00) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 31 06:23:39 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Oct 31 06:28:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 31 06:28:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 06:38:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 06:38:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 06:48:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 06:48:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 06:58:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 06:58:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 07:08:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds 
Oct 31 07:08:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 07:19:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 07:19:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 07:29:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 07:29:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 07:39:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 31 07:39:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 07:50:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 31 07:50:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 08:00:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 08:00:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 08:10:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 08:10:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 08:21:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 08:21:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 
08:31:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 08:31:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 08:42:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 31 08:42:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 08:52:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 08:52:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 09:02:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 09:02:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 09:12:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 09:12:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 09:22:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 09:22:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 09:34:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 31 09:34:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 09:44:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 
09:44:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 09:54:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 09:54:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 10:04:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 10:04:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 10:14:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 10:14:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 10:25:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 10:25:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 10:36:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 10:36:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 10:46:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 10:46:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 10:56:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 10:56:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 11:06:38 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 11:06:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 11:16:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 11:16:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 11:27:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 11:27:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 11:37:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 31 11:37:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 11:47:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 11:47:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 11:57:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 11:57:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 12:07:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Oct 31 12:07:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 12:18:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 12:18:53 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 12:28:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Oct 31 12:28:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 31 12:39:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 16 seconds Oct 31 12:39:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 12:49:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 12:49:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 12:59:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Oct 31 12:59:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 13:09:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Oct 31 13:09:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 13:19:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 13:19:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 13:29:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 31 13:29:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 13:40:09 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Oct 31 13:40:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 13:50:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 13:50:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 14:00:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 14:00:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 14:10:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Oct 31 14:10:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 14:20:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 14:20:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 14:30:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 14:30:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 14:41:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Oct 31 14:41:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 14:52:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 
14:52:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 15:02:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Oct 31 15:02:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 15:12:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 15:12:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 15:22:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 15:22:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 15:33:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Oct 31 15:33:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 15:43:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 15:43:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 15:53:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 15:53:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 16:03:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 16:03:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 
16:17:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 16:17:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 31 16:29:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 16:29:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 4 previous similar messages Oct 31 16:39:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 16:39:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Oct 31 16:50:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 16:50:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 3 previous similar messages Oct 31 17:03:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 17:03:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Oct 31 17:13:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 17:13:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 4 previous similar messages Oct 31 17:23:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 31 17:23:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 17:35:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 31 
17:35:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 17:45:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Oct 31 17:45:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 17:55:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 17:55:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 18:05:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 18:05:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 18:15:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 31 18:15:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 18:26:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 31 18:26:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 18:37:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 18:37:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 18:47:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 18:47:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 
18:57:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 18:57:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 19:07:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Oct 31 19:07:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 19:17:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 19:17:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 19:27:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 19:27:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 19:38:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Oct 31 19:38:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 19:49:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 31 19:49:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 19:59:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 31 19:59:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 20:09:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 
20:09:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 20:19:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 20:19:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 20:29:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 20:29:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 20:41:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Oct 31 20:41:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 20:51:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Oct 31 20:51:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 21:01:15 sh-103-53.int kernel: Lustre: 91126:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572580828/real 1572580828] req@ffff97820e864c80 x1648382236693072/t0(0) o400->fir-OST0004-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572580874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 31 21:01:15 sh-103-53.int kernel: Lustre: 91130:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572580828/real 1572580828] req@ffff97820e863180 x1648382236693136/t0(0) o400->fir-OST0008-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572580874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 31 21:01:15 sh-103-53.int kernel: Lustre: 91130:0:(client.c:2133:ptlrpc_expire_one_request()) 
Skipped 2 previous similar messages Oct 31 21:01:15 sh-103-53.int kernel: Lustre: fir-OST0000-osc-ffff9781f2230800: Connection to fir-OST0000 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Oct 31 21:01:15 sh-103-53.int kernel: Lustre: 91126:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Oct 31 21:01:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 21:01:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 21:01:39 sh-103-53.int kernel: Lustre: 91126:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572580853/real 1572580853] req@ffff976a16dba880 x1648382236697696/t0(0) o400->fir-OST0006-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572580899 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 31 21:02:00 sh-103-53.int kernel: Lustre: 91119:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572580874/real 1572580874] req@ffff975e29b05a00 x1648382236702224/t0(0) o400->fir-OST0004-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572580920 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 31 21:02:00 sh-103-53.int kernel: Lustre: 91119:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Oct 31 21:11:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 31 21:11:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 21:19:33 sh-103-53.int kernel: Lustre: fir-OST0004-osc-ffff9781f2230800: Connection restored to 10.0.10.101@o2ib7 (at 10.0.10.101@o2ib7) Oct 31 21:21:43 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Oct 31 21:21:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 21:32:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Oct 31 21:32:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 21:43:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Oct 31 21:43:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 21:53:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 21:53:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 22:03:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Oct 31 22:03:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 22:13:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Oct 31 22:13:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 22:23:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 22:23:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 22:33:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 31 22:33:53 sh-103-53.int kernel: 
LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 22:44:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Oct 31 22:44:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 22:54:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 22:54:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 23:04:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Oct 31 23:04:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Oct 31 23:14:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Oct 31 23:14:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 23:24:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 23:24:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 23:34:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Oct 31 23:34:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Oct 31 23:45:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Oct 31 23:45:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Oct 31 23:55:22 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Oct 31 23:55:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 00:05:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 00:05:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 00:15:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 01 00:15:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 00:26:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 00:26:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 00:36:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 00:36:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 00:48:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 01 00:48:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 00:58:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 00:58:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 01:08:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 01:08:33 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 01:18:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 01:18:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 01:28:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 01:28:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 01:39:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 01:39:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 01:50:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 01:50:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 02:00:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 02:00:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 02:10:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 02:10:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 02:20:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 02:20:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 02:31:50 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 02:31:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 02:41:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 01 02:41:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 02:52:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 02:52:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 03:02:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 03:02:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 03:12:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 03:12:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 03:23:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 03:23:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 03:33:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 03:33:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 03:44:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 03:44:05 sh-103-53.int kernel: 
LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 03:56:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 03:56:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 04:06:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 01 04:06:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 04:16:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 01 04:16:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 01 04:26:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 04:26:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 04:36:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 01 04:36:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 04:47:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Nov 01 04:47:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 04:57:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 04:57:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 05:09:31 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 05:09:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 3 previous similar messages Nov 01 05:20:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 05:20:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Nov 01 05:30:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 05:30:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 5 previous similar messages Nov 01 05:41:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 05:41:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 1 previous similar message Nov 01 05:52:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 05:52:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 01 06:02:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 06:02:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 06:12:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 06:12:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 06:22:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 06:22:34 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 06:32:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 01 06:32:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 06:44:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 01 06:44:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 06:55:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 06:55:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 07:05:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 01 07:05:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 07:15:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 07:15:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 07:25:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 07:25:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 07:35:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 01 07:35:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 07:45:56 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 01 07:45:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 07:56:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 07:56:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 08:06:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 08:06:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 08:16:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 08:16:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 08:26:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 08:26:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 08:36:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 08:36:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 08:46:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 08:46:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 08:57:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 01 08:57:06 sh-103-53.int kernel: 
LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 09:07:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 09:07:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 09:17:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 09:17:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 09:27:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 09:27:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 09:37:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 09:37:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 09:47:02 sh-103-53.int kernel: Gadget2 (352693): Using mlock ulimits for SHM_HUGETLB is deprecated Nov 01 09:47:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 01 09:47:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 09:58:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 09:58:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 4 previous similar messages Nov 01 10:08:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 10:08:21 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 10:13:18 sh-103-53.int kernel: Lustre: 91129:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572628391/real 1572628391] req@ffff975733aee300 x1648382323273344/t0(0) o400->fir-OST0013-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572628398 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:13:18 sh-103-53.int kernel: Lustre: fir-OST0015-osc-ffff9781f2230800: Connection to fir-OST0015 (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:13:18 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 10:13:18 sh-103-53.int kernel: Lustre: 91129:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Nov 01 10:13:19 sh-103-53.int kernel: Lustre: 91125:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572628391/real 1572628391] req@ffff975733aed100 x1648382323273296/t0(0) o400->fir-OST0010-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572628399 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:13:19 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572628391/real 1572628391] req@ffff97651d102400 x1648382323273616/t0(0) o400->fir-OST0024-osc-ffff9781f2230800@10.0.10.107@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572628399 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:13:19 sh-103-53.int kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572628391/real 1572628391] req@ffff9769cb21d580 x1648382323273392/t0(0) o400->fir-OST0016-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572628399 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:13:19 sh-103-53.int kernel: Lustre: 
fir-OST0016-osc-ffff9781f2230800: Connection to fir-OST0016 (at 10.0.10.103@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:13:19 sh-103-53.int kernel: Lustre: Skipped 3 previous similar messages Nov 01 10:13:19 sh-103-53.int kernel: Lustre: 91125:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Nov 01 10:13:55 sh-103-53.int kernel: Lustre: 91125:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572628391/real 1572628391] req@ffff975b3db12880 x1648382323273104/t0(0) o400->fir-OST0004-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572628435 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:13:55 sh-103-53.int kernel: Lustre: 91130:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572628391/real 1572628391] req@ffff975733aeba80 x1648382323273360/t0(0) o400->fir-OST0014-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572628435 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:13:55 sh-103-53.int kernel: Lustre: 91130:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Nov 01 10:13:55 sh-103-53.int kernel: Lustre: fir-OST002e-osc-ffff9781f2230800: Connection to fir-OST002e (at 10.0.10.107@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:13:55 sh-103-53.int kernel: Lustre: fir-OST000c-osc-ffff9781f2230800: Connection to fir-OST000c (at 10.0.10.103@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:13:55 sh-103-53.int kernel: Lustre: Skipped 3 previous similar messages Nov 01 10:13:55 sh-103-53.int kernel: Lustre: Skipped 4 previous similar messages Nov 01 10:13:55 sh-103-53.int kernel: Lustre: 91125:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 8 previous similar messages Nov 01 10:14:28 sh-103-53.int 
kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572628424/real 1572628424] req@ffff975ebf581680 x1648382323278000/t0(0) o400->fir-OST0012-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572628468 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:14:28 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 15 previous similar messages Nov 01 10:18:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 10:18:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 10:22:48 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 01 10:22:48 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 01 10:22:48 sh-103-53.int kernel: CPU: 7 PID: 257113 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 01 10:22:48 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 01 10:22:48 sh-103-53.int kernel: Call Trace: Nov 01 10:22:48 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 01 10:22:48 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 01 10:22:48 sh-103-53.int kernel: [] ? default_wake_function+0x12/0x20 Nov 01 10:22:48 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 01 10:22:48 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60 Nov 01 10:22:48 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 01 10:22:48 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 01 10:22:48 sh-103-53.int kernel: [] ? 
mem_cgroup_charge_common+0xc0/0xc0 Nov 01 10:22:48 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90 Nov 01 10:22:48 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157 Nov 01 10:22:48 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0 Nov 01 10:22:48 sh-103-53.int kernel: [] do_page_fault+0x35/0x90 Nov 01 10:22:48 sh-103-53.int kernel: [] page_fault+0x28/0x30 Nov 01 10:22:48 sh-103-53.int kernel: Task in /slurm/uid_292669/job_53953107/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_53953107/step_batch Nov 01 10:22:48 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 11556 Nov 01 10:22:48 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 0 Nov 01 10:22:48 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Nov 01 10:22:48 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_53953107/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 01 10:22:48 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_53953107/step_batch/task_0: cache:480KB rss:4193824KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:222720KB active_anon:3970976KB inactive_file:240KB active_file:240KB unevictable:0KB Nov 01 10:22:48 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Nov 01 10:22:48 sh-103-53.int kernel: [256910] 292669 256910 28296 364 13 0 0 slurm_script Nov 01 10:22:48 sh-103-53.int kernel: [256914] 292669 256914 3261470 1057390 2511 0 0 python3 Nov 01 10:22:48 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 257113 (python3) score 1010 or sacrifice child Nov 01 10:22:48 sh-103-53.int kernel: Killed process 256914 (python3) total-vm:13045880kB, anon-rss:4192972kB, file-rss:36588kB, shmem-rss:0kB Nov 01 10:28:20 sh-103-53.int kernel: Lustre: fir-OST0006-osc-ffff9781f2230800: Connection restored 
to 10.0.10.102@o2ib7 (at 10.0.10.102@o2ib7) Nov 01 10:28:20 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 10:29:26 sh-103-53.int kernel: Lustre: 91118:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572629290/real 1572629290] req@ffff977113ab7500 x1648382323438448/t0(0) o400->fir-OST0000-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572629366 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1 Nov 01 10:29:26 sh-103-53.int kernel: Lustre: 91118:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Nov 01 10:29:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 10:29:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 10:29:52 sh-103-53.int kernel: Lustre: 91125:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572629315/real 1572629315] req@ffff976a0d164c80 x1648382323438640/t0(0) o400->fir-OST0001-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572629391 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:29:52 sh-103-53.int kernel: Lustre: fir-OST0009-osc-ffff9781f2230800: Connection to fir-OST0009 (at 10.0.10.102@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:29:52 sh-103-53.int kernel: Lustre: Skipped 14 previous similar messages Nov 01 10:29:52 sh-103-53.int kernel: Lustre: 91125:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 9 previous similar messages Nov 01 10:30:16 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572629340/real 1572629340] req@ffff975ebf581b00 x1648382323445808/t0(0) o400->fir-OST0001-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572629416 ref 
1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:30:16 sh-103-53.int kernel: Lustre: 91123:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572629340/real 1572629340] req@ffff975ebf585580 x1648382323445856/t0(0) o400->fir-OST0006-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572629416 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:30:16 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Nov 01 10:30:24 sh-103-53.int kernel: Lustre: fir-OST0012-osc-ffff9781f2230800: Connection restored to 10.0.10.103@o2ib7 (at 10.0.10.103@o2ib7) Nov 01 10:30:29 sh-103-53.int kernel: Lustre: fir-OST0014-osc-ffff9781f2230800: Connection restored to 10.0.10.103@o2ib7 (at 10.0.10.103@o2ib7) Nov 01 10:30:29 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Nov 01 10:30:41 sh-103-53.int kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572629365/real 1572629365] req@ffff975c85892880 x1648382323447376/t0(0) o400->fir-OST0007-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572629441 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:30:41 sh-103-53.int kernel: Lustre: 91120:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Nov 01 10:30:43 sh-103-53.int kernel: Lustre: fir-OST002e-osc-ffff9781f2230800: Connection restored to 10.0.10.108@o2ib7 (at 10.0.10.108@o2ib7) Nov 01 10:30:43 sh-103-53.int kernel: Lustre: Skipped 3 previous similar messages Nov 01 10:31:07 sh-103-53.int kernel: Lustre: 91123:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572629391/real 1572629391] req@ffff97607737f500 x1648382323452016/t0(0) o400->fir-OST0003-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572629467 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:31:07 sh-103-53.int 
kernel: Lustre: 91123:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Nov 01 10:32:58 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 01 10:32:58 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 01 10:32:58 sh-103-53.int kernel: CPU: 17 PID: 256485 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 01 10:32:58 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 01 10:32:58 sh-103-53.int kernel: Call Trace: Nov 01 10:32:58 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 01 10:32:58 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 01 10:32:58 sh-103-53.int kernel: [] ? default_wake_function+0x12/0x20 Nov 01 10:32:58 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 01 10:32:58 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60 Nov 01 10:32:58 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 01 10:32:58 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 01 10:32:58 sh-103-53.int kernel: [] ? 
mem_cgroup_charge_common+0xc0/0xc0 Nov 01 10:32:58 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90 Nov 01 10:32:58 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157 Nov 01 10:32:58 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0 Nov 01 10:32:58 sh-103-53.int kernel: [] do_page_fault+0x35/0x90 Nov 01 10:32:58 sh-103-53.int kernel: [] page_fault+0x28/0x30 Nov 01 10:32:58 sh-103-53.int kernel: Task in /slurm/uid_292669/job_53953057/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_53953057/step_batch Nov 01 10:32:58 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 47401 Nov 01 10:32:58 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 2 Nov 01 10:32:58 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Nov 01 10:32:58 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_53953057/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 01 10:32:58 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_53953057/step_batch/task_0: cache:2012KB rss:4192292KB rss_huge:0KB mapped_file:748KB swap:0KB inactive_anon:698744KB active_anon:3493548KB inactive_file:1252KB active_file:760KB unevictable:0KB Nov 01 10:32:58 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Nov 01 10:32:58 sh-103-53.int kernel: [256403] 292669 256403 28296 363 13 0 0 slurm_script Nov 01 10:32:58 sh-103-53.int kernel: [256412] 292669 256412 3261468 1053559 2514 0 0 python3 Nov 01 10:32:58 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 256665 (python3) score 1007 or sacrifice child Nov 01 10:32:58 sh-103-53.int kernel: Killed process 256412 (python3) total-vm:13045872kB, anon-rss:4191536kB, file-rss:22700kB, shmem-rss:0kB Nov 01 10:37:52 sh-103-53.int kernel: Lustre: fir-OST0008-osc-ffff9781f2230800: Connection 
restored to 10.0.10.101@o2ib7 (at 10.0.10.101@o2ib7) Nov 01 10:37:52 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 10:39:17 sh-103-53.int kernel: Lustre: fir-OST0001-osc-ffff9781f2230800: Connection restored to 10.0.10.101@o2ib7 (at 10.0.10.101@o2ib7) Nov 01 10:39:17 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 10:39:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 01 10:39:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 10:40:48 sh-103-53.int kernel: Lustre: fir-OST0017-osc-ffff9781f2230800: Connection restored to 10.0.10.103@o2ib7 (at 10.0.10.103@o2ib7) Nov 01 10:40:48 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 10:50:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 10:50:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 10:52:32 sh-103-53.int kernel: Lustre: 91137:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572630699/real 1572630699] req@ffff977e75241200 x1648382323700800/t0(0) o400->fir-OST0012-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572630752 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:52:32 sh-103-53.int kernel: Lustre: fir-OST0013-osc-ffff9781f2230800: Connection to fir-OST0013 (at 10.0.10.103@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:52:32 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 10:52:32 sh-103-53.int kernel: Lustre: 91137:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 15 previous similar messages Nov 01 10:52:57 sh-103-53.int kernel: Lustre: 
91136:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572630724/real 1572630724] req@ffff975399d2b180 x1648382323705376/t0(0) o400->fir-OST0012-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572630777 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:52:57 sh-103-53.int kernel: Lustre: 91136:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Nov 01 10:53:22 sh-103-53.int kernel: Lustre: 91133:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572630749/real 1572630749] req@ffff976a0f6d8d80 x1648382323709952/t0(0) o400->fir-OST0012-osc-ffff9781f2230800@10.0.10.104@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572630802 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:53:22 sh-103-53.int kernel: Lustre: 91133:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Nov 01 10:53:35 sh-103-53.int kernel: Lustre: fir-OST0014-osc-ffff9781f2230800: Connection to fir-OST0014 (at 10.0.10.103@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:53:35 sh-103-53.int kernel: Lustre: Skipped 10 previous similar messages Nov 01 10:54:00 sh-103-53.int kernel: Lustre: 91137:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572630724/real 1572630724] req@ffff977dc764d580 x1648382323705408/t0(0) o400->fir-OST0014-osc-ffff9781f2230800@10.0.10.103@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572630840 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:54:00 sh-103-53.int kernel: Lustre: 91137:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 11 previous similar messages Nov 01 10:54:36 sh-103-53.int kernel: Lustre: 91139:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572630699/real 1572630699] req@ffff975aacb04800 x1648382323700640/t0(0) 
o400->fir-OST0008-osc-ffff9781f2230800@10.0.10.101@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572630875 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:54:36 sh-103-53.int kernel: Lustre: fir-OST000b-osc-ffff9781f2230800: Connection to fir-OST000b (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:54:36 sh-103-53.int kernel: Lustre: 91139:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Nov 01 10:54:43 sh-103-53.int kernel: Lustre: fir-OST0009-osc-ffff9781f2230800: Connection to fir-OST0009 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:54:43 sh-103-53.int kernel: Lustre: Skipped 1 previous similar message Nov 01 10:54:53 sh-103-53.int kernel: Lustre: fir-OST0007-osc-ffff9781f2230800: Connection to fir-OST0007 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 01 10:55:43 sh-103-53.int kernel: Lustre: 91137:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572630749/real 1572630749] req@ffff976a0d162d00 x1648382323709824/t0(0) o400->fir-OST000a-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572630943 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:55:43 sh-103-53.int kernel: Lustre: 91137:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 27 previous similar messages Nov 01 10:58:07 sh-103-53.int kernel: Lustre: 91133:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572630893/real 1572630893] req@ffff977113ab0000 x1648382323733024/t0(0) o400->fir-OST0003-osc-ffff9781f2230800@10.0.10.102@o2ib7:28/4 lens 224/224 e 0 to 1 dl 1572631087 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 01 10:58:07 sh-103-53.int kernel: Lustre: 91133:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 58 previous similar messages Nov 01 11:00:12 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 11:00:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 01 11:10:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 11:10:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 11:17:24 sh-103-53.int kernel: Lustre: fir-OST0007-osc-ffff9781f2230800: Connection restored to 10.0.10.102@o2ib7 (at 10.0.10.102@o2ib7) Nov 01 11:17:24 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 11:18:47 sh-103-53.int kernel: Lustre: fir-OST0017-osc-ffff9781f2230800: Connection restored to 10.0.10.104@o2ib7 (at 10.0.10.104@o2ib7) Nov 01 11:18:47 sh-103-53.int kernel: Lustre: Skipped 7 previous similar messages Nov 01 11:20:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 01 11:20:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 11:21:36 sh-103-53.int kernel: Lustre: fir-OST0014-osc-ffff9781f2230800: Connection restored to 10.0.10.104@o2ib7 (at 10.0.10.104@o2ib7) Nov 01 11:21:36 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 11:27:05 sh-103-53.int kernel: Lustre: fir-OST0000-osc-ffff9781f2230800: Connection restored to 10.0.10.102@o2ib7 (at 10.0.10.102@o2ib7) Nov 01 11:27:05 sh-103-53.int kernel: Lustre: Skipped 5 previous similar messages Nov 01 11:30:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 11:30:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 11:40:48 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 01 11:40:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 11:50:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 11:50:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 12:01:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 12:01:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 12:11:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 01 12:11:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 12:21:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 01 12:21:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 12:31:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 12:31:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 12:41:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 12:41:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 12:51:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 01 12:51:58 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 13:03:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 01 13:03:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 13:13:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 13:13:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 13:23:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 13:23:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 13:33:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 13:33:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 13:43:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 01 13:43:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 13:54:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 13:54:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 14:04:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 14:04:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 14:14:19 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 14:14:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 14:24:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 01 14:24:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 14:34:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 14:34:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 14:44:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 01 14:44:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 14:54:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 01 14:54:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 15:05:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 15:05:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 15:15:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 01 15:15:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 15:25:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 01 15:25:29 sh-103-53.int kernel: 
LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 15:35:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 15:35:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 15:45:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 01 15:45:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 15:56:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Nov 01 15:56:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 16:06:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 16:06:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 16:16:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 16:16:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 16:26:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 01 16:26:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 16:36:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 16:36:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 16:47:51 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Nov 01 16:47:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 16:58:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 16:58:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 17:08:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 17:08:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 17:18:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 01 17:18:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 17:28:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 17:28:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 17:38:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 17:38:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 17:48:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 17:48:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 18:00:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 18:00:01 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 18:10:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 01 18:10:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 18:20:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 18:20:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 18:30:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 18:30:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 18:40:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 18:40:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 18:51:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 18:51:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 19:02:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 19:02:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 19:12:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 19:12:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 19:22:22 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 19:22:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 19:32:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 01 19:32:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 19:42:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 19:42:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 19:52:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 19:52:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 20:03:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 20:03:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 20:13:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 16 seconds Nov 01 20:13:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 20:23:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 20:23:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 20:33:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 01 20:33:32 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 20:43:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 01 20:43:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 20:53:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 20:53:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 21:04:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 21:04:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 21:14:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 21:14:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 21:24:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 21:24:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 21:34:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 21:34:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 21:44:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 21:44:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 21:55:54 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 01 21:55:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 22:06:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 01 22:06:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 22:16:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 22:16:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 22:26:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 22:26:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 22:40:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 01 22:40:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 01 22:50:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 22:50:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 23:01:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 23:01:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 01 23:11:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 01 23:11:06 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 01 23:21:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 23:21:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 23:32:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 23:32:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 23:42:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 01 23:42:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 01 23:52:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 01 23:52:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 00:03:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 00:03:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 00:13:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 02 00:13:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 00:23:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 00:23:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 00:33:27 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 00:33:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 00:43:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 02 00:43:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 00:54:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 02 00:54:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 01:04:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 02 01:04:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 02 01:15:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 01:15:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 01:25:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 01:25:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 01:35:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 01:35:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 01:46:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 02 01:46:41 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 4 previous similar messages Nov 02 01:56:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 01:56:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 02:07:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 02:07:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 02:17:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 02:17:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 02:27:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 02 02:27:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 02:38:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 02 02:38:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 02:48:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 02 02:48:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 02:58:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 02:58:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 03:09:02 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds
Nov 02 03:09:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages
Nov 02 03:19:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds
Nov 02 03:19:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages
Nov 02 03:20:13 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Nov 02 03:20:13 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1
Nov 02 03:20:13 sh-103-53.int kernel: CPU: 17 PID: 357824 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1
Nov 02 03:20:13 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019
Nov 02 03:20:13 sh-103-53.int kernel: Call Trace:
Nov 02 03:20:13 sh-103-53.int kernel: [] dump_stack+0x19/0x1b
Nov 02 03:20:13 sh-103-53.int kernel: [] dump_header+0x90/0x229
Nov 02 03:20:13 sh-103-53.int kernel: [] ? default_wake_function+0x12/0x20
Nov 02 03:20:13 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0
Nov 02 03:20:13 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60
Nov 02 03:20:13 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0
Nov 02 03:20:13 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570
Nov 02 03:20:13 sh-103-53.int kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Nov 02 03:20:13 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90
Nov 02 03:20:13 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157
Nov 02 03:20:13 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0
Nov 02 03:20:13 sh-103-53.int kernel: [] do_page_fault+0x35/0x90
Nov 02 03:20:13 sh-103-53.int kernel: [] page_fault+0x28/0x30
Nov 02 03:20:13 sh-103-53.int kernel: Task in /slurm/uid_292669/job_53955611/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_53955611/step_batch
Nov 02 03:20:13 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 64609
Nov 02 03:20:13 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 0
Nov 02 03:20:13 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Nov 02 03:20:13 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_53955611/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Nov 02 03:20:13 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_53955611/step_batch/task_0: cache:3088KB rss:4191216KB rss_huge:0KB mapped_file:144KB swap:0KB inactive_anon:698616KB active_anon:3492584KB inactive_file:1700KB active_file:1388KB unevictable:0KB
Nov 02 03:20:13 sh-103-53.int kernel: [ pid ]     uid   tgid total_vm     rss nr_ptes swapents oom_score_adj name
Nov 02 03:20:13 sh-103-53.int kernel: [357765] 292669 357765    28296     362      15        0             0 slurm_script
Nov 02 03:20:13 sh-103-53.int kernel: [357774] 292669 357774  3244467 1048059    2503        0             0 python3
Nov 02 03:20:13 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 357981 (python3) score 1001 or sacrifice child
Nov 02 03:20:13 sh-103-53.int kernel: Killed process 357774 (python3) total-vm:12977868kB, anon-rss:4190716kB, file-rss:1520kB, shmem-rss:0kB
Nov 02 03:29:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns())
Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 03:29:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 03:39:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 03:39:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 03:49:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 02 03:49:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 03:59:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 02 03:59:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 04:10:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 04:10:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 04:20:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 02 04:20:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 04:30:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 02 04:30:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 04:40:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 04:40:34 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages
Nov 02 04:50:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds
Nov 02 04:50:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages
Nov 02 05:00:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds
Nov 02 05:00:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages
Nov 02 05:08:33 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Nov 02 05:08:33 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1
Nov 02 05:08:33 sh-103-53.int kernel: CPU: 7 PID: 367041 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1
Nov 02 05:08:33 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019
Nov 02 05:08:33 sh-103-53.int kernel: Call Trace:
Nov 02 05:08:33 sh-103-53.int kernel: [] dump_stack+0x19/0x1b
Nov 02 05:08:33 sh-103-53.int kernel: [] dump_header+0x90/0x229
Nov 02 05:08:33 sh-103-53.int kernel: [] ? _raw_spin_unlock_irqrestore+0x15/0x20
Nov 02 05:08:33 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0
Nov 02 05:08:33 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570
Nov 02 05:08:33 sh-103-53.int kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Nov 02 05:08:33 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90
Nov 02 05:08:33 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157
Nov 02 05:08:33 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0
Nov 02 05:08:33 sh-103-53.int kernel: [] do_page_fault+0x35/0x90
Nov 02 05:08:33 sh-103-53.int kernel: [] page_fault+0x28/0x30
Nov 02 05:08:33 sh-103-53.int kernel: Task in /slurm/uid_292669/job_53956560/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_53956560/step_batch
Nov 02 05:08:33 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 1799
Nov 02 05:08:33 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 0
Nov 02 05:08:33 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Nov 02 05:08:33 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_53956560/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Nov 02 05:08:33 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_53956560/step_batch/task_0: cache:0KB rss:4194304KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:53504KB active_anon:4140800KB inactive_file:0KB active_file:0KB unevictable:0KB
Nov 02 05:08:33 sh-103-53.int kernel: [ pid ]     uid   tgid total_vm     rss nr_ptes swapents oom_score_adj name
Nov 02 05:08:33 sh-103-53.int kernel: [366840] 292669 366840    28296     363      12        0             0 slurm_script
Nov 02 05:08:33 sh-103-53.int kernel: [366844] 292669 366844  3196466 1050157    2504        0             0 python3
Nov 02 05:08:33 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 367041 (python3) score 1003 or sacrifice child
Nov 02 05:08:33 sh-103-53.int kernel: Killed process 366844 (python3) total-vm:12785864kB, anon-rss:4193700kB, file-rss:6928kB, shmem-rss:0kB
Nov 02 05:11:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx
for 10.9.0.31@o2ib4: 0 seconds Nov 02 05:11:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 05:21:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 05:21:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 05:31:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Nov 02 05:31:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 05:41:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 05:41:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 05:51:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 05:51:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 06:02:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 06:02:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 06:13:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 02 06:13:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 06:23:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 02 06:23:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 
previous similar messages Nov 02 06:33:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 06:33:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 06:43:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 06:43:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 06:53:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 06:53:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 07:03:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 07:03:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 07:14:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 07:14:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 07:24:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 07:24:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 07:34:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 07:34:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 07:44:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 1 seconds Nov 02 07:44:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 07:54:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 02 07:54:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 08:04:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 16 seconds Nov 02 08:04:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 08:15:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 08:15:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 08:25:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 02 08:25:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 08:35:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 02 08:35:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 08:45:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 08:45:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 08:55:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 02 08:55:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 
previous similar messages Nov 02 09:05:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 09:05:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 09:16:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 09:16:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 02 09:26:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 09:26:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 09:36:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 02 09:36:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 09:46:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 09:46:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 09:56:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 09:56:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 02 10:06:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 10:06:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 10:17:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 0 seconds Nov 02 10:17:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 10:27:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 10:27:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 10:28:55 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 02 10:28:55 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 02 10:28:55 sh-103-53.int kernel: CPU: 13 PID: 20374 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 02 10:28:55 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 02 10:28:55 sh-103-53.int kernel: Call Trace: Nov 02 10:28:55 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 02 10:28:55 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 02 10:28:55 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 02 10:28:55 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60 Nov 02 10:28:55 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 02 10:28:55 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 02 10:28:55 sh-103-53.int kernel: [] ? 
mem_cgroup_charge_common+0xc0/0xc0
Nov 02 10:28:55 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90
Nov 02 10:28:55 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157
Nov 02 10:28:55 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0
Nov 02 10:28:55 sh-103-53.int kernel: [] do_page_fault+0x35/0x90
Nov 02 10:28:55 sh-103-53.int kernel: [] page_fault+0x28/0x30
Nov 02 10:28:55 sh-103-53.int kernel: Task in /slurm/uid_292669/job_54030852/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_54030852/step_batch
Nov 02 10:28:55 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 120
Nov 02 10:28:55 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 0
Nov 02 10:28:55 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Nov 02 10:28:55 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54030852/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Nov 02 10:28:55 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54030852/step_batch/task_0: cache:0KB rss:4194304KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:46336KB active_anon:4147840KB inactive_file:0KB active_file:0KB unevictable:0KB
Nov 02 10:28:55 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Nov 02 10:28:55 sh-103-53.int kernel: [18096] 292669 18096 28296 364 14 0 0 slurm_script
Nov 02 10:28:55 sh-103-53.int kernel: [18100] 292669 18100 3250050 1065338 2507 0 0 python3
Nov 02 10:28:55 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 20374 (python3) score 1018 or sacrifice child
Nov 02 10:28:55 sh-103-53.int kernel: Killed process 18100 (python3) total-vm:13000200kB, anon-rss:4193432kB, file-rss:67920kB, shmem-rss:0kB
Nov 02 10:30:38 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Nov 02
10:30:38 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 02 10:30:38 sh-103-53.int kernel: CPU: 3 PID: 20680 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 02 10:30:38 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 02 10:30:38 sh-103-53.int kernel: Call Trace: Nov 02 10:30:38 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 02 10:30:38 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 02 10:30:38 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 02 10:30:38 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60 Nov 02 10:30:38 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 02 10:30:38 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 02 10:30:38 sh-103-53.int kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0 Nov 02 10:30:38 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90 Nov 02 10:30:38 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157 Nov 02 10:30:38 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0 Nov 02 10:30:38 sh-103-53.int kernel: [] do_page_fault+0x35/0x90 Nov 02 10:30:38 sh-103-53.int kernel: [] page_fault+0x28/0x30 Nov 02 10:30:38 sh-103-53.int kernel: Task in /slurm/uid_292669/job_54030897/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_54030897/step_batch Nov 02 10:30:38 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 162 Nov 02 10:30:38 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 0 Nov 02 10:30:38 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Nov 02 10:30:38 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54030897/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 02 10:30:38 sh-103-53.int kernel: Memory cgroup stats for 
/slurm/uid_292669/job_54030897/step_batch/task_0: cache:0KB rss:4194304KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:41216KB active_anon:4153088KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 02 10:30:38 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Nov 02 10:30:38 sh-103-53.int kernel: [20637] 292669 20637 28296 363 14 0 0 slurm_script Nov 02 10:30:38 sh-103-53.int kernel: [20641] 292669 20641 3251508 1065355 2505 0 0 python3 Nov 02 10:30:38 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 20851 (python3) score 1018 or sacrifice child Nov 02 10:30:38 sh-103-53.int kernel: Killed process 20641 (python3) total-vm:13006032kB, anon-rss:4193500kB, file-rss:67920kB, shmem-rss:0kB Nov 02 10:38:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 10:38:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 10:48:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 10:48:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 10:58:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 02 10:58:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 11:09:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 11:09:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 11:19:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 11:19:03 sh-103-53.int kernel: 
LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 11:29:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 02 11:29:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 11:40:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 11:40:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 11:50:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 11:50:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 12:00:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 12:00:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 12:10:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 12:10:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 12:21:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 02 12:21:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 02 12:32:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 12:32:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 12:42:28 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 12:42:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 12:53:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 02 12:53:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 13:03:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 02 13:03:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 13:14:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 13:14:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 13:24:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 13:24:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 13:34:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 13:34:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 13:45:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 02 13:45:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 13:55:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 13:55:39 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 14:05:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 14:05:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 14:15:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 14:15:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 14:22:43 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 02 14:22:43 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 02 14:22:43 sh-103-53.int kernel: CPU: 20 PID: 409706 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 02 14:22:43 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 02 14:22:43 sh-103-53.int kernel: Call Trace: Nov 02 14:22:43 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 02 14:22:43 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 02 14:22:43 sh-103-53.int kernel: [] ? default_wake_function+0x12/0x20 Nov 02 14:22:43 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 02 14:22:43 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60 Nov 02 14:22:43 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 02 14:22:43 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 02 14:22:43 sh-103-53.int kernel: [] ? 
mem_cgroup_charge_common+0xc0/0xc0
Nov 02 14:22:43 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90
Nov 02 14:22:43 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157
Nov 02 14:22:43 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0
Nov 02 14:22:43 sh-103-53.int kernel: [] do_page_fault+0x35/0x90
Nov 02 14:22:43 sh-103-53.int kernel: [] page_fault+0x28/0x30
Nov 02 14:22:43 sh-103-53.int kernel: Task in /slurm/uid_292669/job_54017739/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_54017739/step_batch
Nov 02 14:22:43 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 19889
Nov 02 14:22:43 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 2
Nov 02 14:22:43 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Nov 02 14:22:43 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54017739/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Nov 02 14:22:43 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54017739/step_batch/task_0: cache:2612KB rss:4191692KB rss_huge:0KB mapped_file:16KB swap:0KB inactive_anon:485120KB active_anon:3706572KB inactive_file:1400KB active_file:1212KB unevictable:0KB
Nov 02 14:22:43 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Nov 02 14:22:43 sh-103-53.int kernel: [409644] 292669 409644 28296 363 14 0 0 slurm_script
Nov 02 14:22:43 sh-103-53.int kernel: [409659] 292669 409659 3260754 1048603 2511 0 0 python3
Nov 02 14:22:43 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 410129 (python3) score 1002 or sacrifice child
Nov 02 14:22:43 sh-103-53.int kernel: Killed process 409659 (python3) total-vm:13043016kB, anon-rss:4191124kB, file-rss:3288kB, shmem-rss:0kB
Nov 02 14:26:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns())
Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 02 14:26:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 14:37:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 02 14:37:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 14:47:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 14:47:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 14:58:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 02 14:58:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 15:08:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 02 15:08:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 02 15:10:12 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 02 15:10:12 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 02 15:10:12 sh-103-53.int kernel: CPU: 16 PID: 53396 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 02 15:10:12 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 02 15:10:12 sh-103-53.int kernel: Call Trace: Nov 02 15:10:12 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 02 15:10:12 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 02 15:10:12 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 02 15:10:12 sh-103-53.int kernel: [] ? 
try_get_mem_cgroup_from_mm+0x28/0x60 Nov 02 15:10:12 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 02 15:10:12 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 02 15:10:12 sh-103-53.int kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0 Nov 02 15:10:12 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90 Nov 02 15:10:12 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157 Nov 02 15:10:12 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0 Nov 02 15:10:12 sh-103-53.int kernel: [] do_page_fault+0x35/0x90 Nov 02 15:10:12 sh-103-53.int kernel: [] page_fault+0x28/0x30 Nov 02 15:10:12 sh-103-53.int kernel: Task in /slurm/uid_292669/job_54032013/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_54032013/step_batch Nov 02 15:10:12 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 9815 Nov 02 15:10:12 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 0 Nov 02 15:10:12 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Nov 02 15:10:12 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54032013/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 02 15:10:12 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54032013/step_batch/task_0: cache:0KB rss:4194304KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:453248KB active_anon:3740928KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 02 15:10:12 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Nov 02 15:10:12 sh-103-53.int kernel: [53216] 292669 53216 28296 363 13 0 0 slurm_script Nov 02 15:10:12 sh-103-53.int kernel: [53220] 292669 53220 3187389 1065349 2496 0 0 python3 Nov 02 15:10:12 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 53397 (python3) score 1018 or sacrifice child Nov 02 15:10:12 
sh-103-53.int kernel: Killed process 53220 (python3) total-vm:12749556kB, anon-rss:4193508kB, file-rss:67888kB, shmem-rss:0kB Nov 02 15:19:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 15:19:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 15:29:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 02 15:29:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 15:39:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 02 15:39:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 15:49:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 15:49:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 15:59:09 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 02 15:59:09 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 02 15:59:09 sh-103-53.int kernel: CPU: 20 PID: 60312 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 02 15:59:09 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 02 15:59:09 sh-103-53.int kernel: Call Trace: Nov 02 15:59:09 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 02 15:59:09 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 02 15:59:09 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 02 15:59:09 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 02 15:59:09 sh-103-53.int kernel: [] ? 
mem_cgroup_charge_common+0xc0/0xc0
Nov 02 15:59:09 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90
Nov 02 15:59:09 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157
Nov 02 15:59:09 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0
Nov 02 15:59:09 sh-103-53.int kernel: [] do_page_fault+0x35/0x90
Nov 02 15:59:09 sh-103-53.int kernel: [] page_fault+0x28/0x30
Nov 02 15:59:09 sh-103-53.int kernel: Task in /slurm/uid_292669/job_54030926/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_54030926/step_batch
Nov 02 15:59:09 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 12707
Nov 02 15:59:09 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 1
Nov 02 15:59:09 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Nov 02 15:59:09 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54030926/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Nov 02 15:59:09 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54030926/step_batch/task_0: cache:0KB rss:4194304KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:376576KB active_anon:3817600KB inactive_file:0KB active_file:0KB unevictable:0KB
Nov 02 15:59:09 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Nov 02 15:59:09 sh-103-53.int kernel: [60289] 292669 60289 28296 364 13 0 0 slurm_script
Nov 02 15:59:09 sh-103-53.int kernel: [60293] 292669 60293 3184511 1065407 2495 0 0 python3
Nov 02 15:59:09 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 60470 (python3) score 1018 or sacrifice child
Nov 02 15:59:09 sh-103-53.int kernel: Killed process 60293 (python3) total-vm:12738044kB, anon-rss:4193736kB, file-rss:67892kB, shmem-rss:0kB
Nov 02 15:59:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for
10.9.0.31@o2ib4: 0 seconds Nov 02 15:59:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 02 16:10:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 16:10:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 16:22:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 02 16:22:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 16:32:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 16:32:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 16:42:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 02 16:42:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 16:52:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 02 16:52:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 02 17:02:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 17:02:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 17:12:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 17:12:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 
previous similar messages Nov 02 17:23:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 02 17:23:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 17:34:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 17:34:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 17:44:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 02 17:44:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 17:54:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 17:54:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 18:04:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 02 18:04:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 18:14:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 02 18:14:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 18:25:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 18:25:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 18:35:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx 
for 10.9.0.31@o2ib4: 3 seconds Nov 02 18:35:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 18:45:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 02 18:45:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 18:55:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 18:55:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 19:05:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 19:05:43 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 19:17:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 19:17:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 19:27:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 02 19:27:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 19:37:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 02 19:37:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 19:47:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 19:47:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 
previous similar messages Nov 02 19:57:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 19:57:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 20:07:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 02 20:07:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 20:18:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 20:18:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 20:25:30 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 02 20:25:30 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 02 20:25:30 sh-103-53.int kernel: CPU: 22 PID: 441449 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 02 20:25:30 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 02 20:25:30 sh-103-53.int kernel: Call Trace: Nov 02 20:25:30 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 02 20:25:30 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 02 20:25:30 sh-103-53.int kernel: [] ? default_wake_function+0x12/0x20 Nov 02 20:25:30 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 02 20:25:30 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60 Nov 02 20:25:30 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 02 20:25:30 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 02 20:25:30 sh-103-53.int kernel: [] ? 
mem_cgroup_charge_common+0xc0/0xc0 Nov 02 20:25:30 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90 Nov 02 20:25:30 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157 Nov 02 20:25:30 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0 Nov 02 20:25:30 sh-103-53.int kernel: [] do_page_fault+0x35/0x90 Nov 02 20:25:30 sh-103-53.int kernel: [] page_fault+0x28/0x30 Nov 02 20:25:30 sh-103-53.int kernel: Task in /slurm/uid_292669/job_54018291/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_54018291/step_batch Nov 02 20:25:30 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 732771 Nov 02 20:25:30 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 2 Nov 02 20:25:30 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Nov 02 20:25:30 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54018291/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 02 20:25:30 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54018291/step_batch/task_0: cache:23832KB rss:4170472KB rss_huge:0KB mapped_file:12KB swap:0KB inactive_anon:725752KB active_anon:3444720KB inactive_file:11992KB active_file:11840KB unevictable:0KB Nov 02 20:25:30 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Nov 02 20:25:30 sh-103-53.int kernel: [441406] 292669 441406 28296 363 12 0 0 slurm_script Nov 02 20:25:30 sh-103-53.int kernel: [441410] 292669 441410 3189309 1043135 2493 0 0 python3 Nov 02 20:25:30 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 441606 (python3) score 997 or sacrifice child Nov 02 20:25:30 sh-103-53.int kernel: Killed process 441410 (python3) total-vm:12757236kB, anon-rss:4170152kB, file-rss:2388kB, shmem-rss:0kB Nov 02 20:28:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 20:28:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 20:38:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 02 20:38:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 20:48:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 17 seconds Nov 02 20:48:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 20:59:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 02 20:59:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 21:09:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 21:09:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 21:19:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 21:19:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 21:30:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 02 21:30:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 21:41:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 21:41:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Skipped 9 previous similar messages Nov 02 21:53:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 02 21:53:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 22:03:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 16 seconds Nov 02 22:03:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 22:13:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 02 22:13:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 22:23:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 02 22:23:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 22:34:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Nov 02 22:34:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 22:44:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 22:44:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 22:54:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 02 22:54:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 23:04:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed 
out tx for 10.9.0.31@o2ib4: 11 seconds Nov 02 23:04:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 23:14:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 23:14:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 02 23:24:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 23:24:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 02 23:35:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 02 23:35:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 02 23:46:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 02 23:46:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 02 23:56:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 02 23:56:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 00:06:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 00:06:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 00:16:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 00:16:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Skipped 6 previous similar messages Nov 03 00:26:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 03 00:26:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 00:36:39 sh-103-53.int kernel: Lustre: 137145:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572765843/real 1572765843] req@ffff97606cbcf080 x1648382351950800/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 0 to 1 dl 1572766599 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 00:36:39 sh-103-53.int kernel: Lustre: 137145:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Nov 03 00:36:39 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 00:36:39 sh-103-53.int kernel: Lustre: Skipped 8 previous similar messages Nov 03 00:36:39 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 00:36:39 sh-103-53.int kernel: Lustre: Skipped 3 previous similar messages Nov 03 00:37:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 00:37:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 00:47:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 00:47:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 00:57:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 03 
00:57:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 01:07:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 01:07:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 01:17:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 01:17:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 01:28:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 01:28:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 01:39:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 01:39:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 01:49:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 01:49:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 01:57:47 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 03 01:57:47 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 03 01:57:47 sh-103-53.int kernel: CPU: 0 PID: 143563 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 03 01:57:47 sh-103-53.int kernel: Hardware name: Dell Inc. 
PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 03 01:57:47 sh-103-53.int kernel: Call Trace: Nov 03 01:57:47 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 03 01:57:47 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 03 01:57:47 sh-103-53.int kernel: [] ? default_wake_function+0x12/0x20 Nov 03 01:57:47 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 03 01:57:47 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60 Nov 03 01:57:47 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 03 01:57:48 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 03 01:57:48 sh-103-53.int kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0 Nov 03 01:57:48 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90 Nov 03 01:57:48 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157 Nov 03 01:57:48 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0 Nov 03 01:57:48 sh-103-53.int kernel: [] do_page_fault+0x35/0x90 Nov 03 01:57:48 sh-103-53.int kernel: [] page_fault+0x28/0x30 Nov 03 01:57:48 sh-103-53.int kernel: Task in /slurm/uid_292669/job_54034727/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_54034727/step_batch Nov 03 01:57:48 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 48154 Nov 03 01:57:48 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 4 Nov 03 01:57:48 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Nov 03 01:57:48 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54034727/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 03 01:57:48 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54034727/step_batch/task_0: cache:3676KB rss:4190628KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:723584KB active_anon:3467044KB inactive_file:1884KB active_file:1792KB unevictable:0KB Nov 03 
01:57:48 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Nov 03 01:57:48 sh-103-53.int kernel: [137136] 292669 137136 28296 363 13 0 0 slurm_script Nov 03 01:57:48 sh-103-53.int kernel: [137145] 292669 137145 3250334 1047796 2496 0 0 python3 Nov 03 01:57:48 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 143563 (python3) score 1001 or sacrifice child Nov 03 01:57:48 sh-103-53.int kernel: Killed process 137145 (python3) total-vm:13001336kB, anon-rss:4189796kB, file-rss:1388kB, shmem-rss:0kB Nov 03 01:00:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 03 01:00:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 01:10:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 01:10:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 01:20:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 01:20:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 01:31:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 01:31:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 01:41:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 01:41:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 01:52:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 5 seconds Nov 03 01:52:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 02:02:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 02:02:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 02:12:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 02:12:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 02:22:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 02:22:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 02:33:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 02:33:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 02:44:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 02:44:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 02:54:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 02:54:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 03:04:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 03:04:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 
previous similar messages Nov 03 03:14:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 03:14:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 03:24:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 03 03:24:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 03:35:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 03:35:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 03:46:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 03:46:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 03:56:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 03:56:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 04:06:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 03 04:06:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 04:16:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 04:16:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 04:27:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 1 seconds Nov 03 04:27:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 04:37:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 04:37:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 04:47:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 04:47:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 04:58:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 03 04:58:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 05:08:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 03 05:08:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 05:18:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 05:18:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 05:28:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 05:28:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 05:39:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 05:39:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 
previous similar messages Nov 03 05:50:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 05:50:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 06:00:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 03 06:00:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 06:10:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 06:10:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 06:20:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 06:20:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 06:31:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 06:31:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 06:41:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 06:41:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 06:51:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 06:51:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 07:01:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 
10.9.0.31@o2ib4: 5 seconds Nov 03 07:01:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 07:11:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 03 07:11:35 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 07:22:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 07:22:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 07:33:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 07:33:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 07:43:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 07:43:09 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 07:53:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 03 07:53:17 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 08:03:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 08:03:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 08:13:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 08:13:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 
previous similar messages Nov 03 08:23:32 sh-103-53.int kernel: INFO: task slurm_script:82315 blocked for more than 120 seconds. Nov 03 08:23:32 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 03 08:23:32 sh-103-53.int kernel: slurm_script D ffff976017266400 0 82315 82197 0x00000000 Nov 03 08:23:32 sh-103-53.int kernel: Call Trace: Nov 03 08:23:32 sh-103-53.int kernel: [] ? ll_get_acl+0x31/0xf0 [lustre] Nov 03 08:23:32 sh-103-53.int kernel: [] schedule_preempt_disabled+0x29/0x70 Nov 03 08:23:32 sh-103-53.int kernel: [] __mutex_lock_slowpath+0xc7/0x1d0 Nov 03 08:23:32 sh-103-53.int kernel: [] mutex_lock+0x1f/0x2f Nov 03 08:23:32 sh-103-53.int kernel: [] lookup_slow+0x33/0xa7 Nov 03 08:23:32 sh-103-53.int kernel: [] link_path_walk+0x80f/0x8b0 Nov 03 08:23:32 sh-103-53.int kernel: [] path_lookupat+0x7a/0x8b0 Nov 03 08:23:32 sh-103-53.int kernel: [] ? do_wp_page+0x19e/0x720 Nov 03 08:23:32 sh-103-53.int kernel: [] ? kmem_cache_alloc+0x35/0x1f0 Nov 03 08:23:32 sh-103-53.int kernel: [] ? getname_flags+0x4f/0x1a0 Nov 03 08:23:32 sh-103-53.int kernel: [] filename_lookup+0x2b/0xc0 Nov 03 08:23:32 sh-103-53.int kernel: [] user_path_at_empty+0x67/0xc0 Nov 03 08:23:32 sh-103-53.int kernel: [] ? handle_mm_fault+0x39d/0x9b0 Nov 03 08:23:32 sh-103-53.int kernel: [] user_path_at+0x11/0x20 Nov 03 08:23:32 sh-103-53.int kernel: [] vfs_fstatat+0x63/0xc0 Nov 03 08:23:32 sh-103-53.int kernel: [] SYSC_newstat+0x2e/0x60 Nov 03 08:23:32 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Nov 03 08:23:32 sh-103-53.int kernel: [] ? SyS_rt_sigprocmask+0xc4/0x100 Nov 03 08:23:32 sh-103-53.int kernel: [] SyS_newstat+0xe/0x10 Nov 03 08:23:32 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Nov 03 08:23:32 sh-103-53.int kernel: INFO: task slurm_script:82335 blocked for more than 120 seconds. Nov 03 08:23:32 sh-103-53.int kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Nov 03 08:23:32 sh-103-53.int kernel: slurm_script D ffff9757313112c0 0 82335 82190 0x00000000 Nov 03 08:23:32 sh-103-53.int kernel: Call Trace: Nov 03 08:23:32 sh-103-53.int kernel: [] ? ll_get_acl+0x31/0xf0 [lustre] Nov 03 08:23:32 sh-103-53.int kernel: [] schedule_preempt_disabled+0x29/0x70 Nov 03 08:23:32 sh-103-53.int kernel: [] __mutex_lock_slowpath+0xc7/0x1d0 Nov 03 08:23:32 sh-103-53.int kernel: [] mutex_lock+0x1f/0x2f Nov 03 08:23:32 sh-103-53.int kernel: [] lookup_slow+0x33/0xa7 Nov 03 08:23:32 sh-103-53.int kernel: [] link_path_walk+0x80f/0x8b0 Nov 03 08:23:32 sh-103-53.int kernel: [] ? call_rcu_sched+0x1d/0x20 Nov 03 08:23:32 sh-103-53.int kernel: [] ? destroy_inode+0x3b/0x60 Nov 03 08:23:32 sh-103-53.int kernel: [] ? call_rcu_sched+0x1d/0x20 Nov 03 08:23:32 sh-103-53.int kernel: [] path_lookupat+0x7a/0x8b0 Nov 03 08:23:32 sh-103-53.int kernel: [] ? do_wp_page+0x19e/0x720 Nov 03 08:23:32 sh-103-53.int kernel: [] ? kmem_cache_alloc+0x35/0x1f0 Nov 03 08:23:32 sh-103-53.int kernel: [] ? getname_flags+0x4f/0x1a0 Nov 03 08:23:32 sh-103-53.int kernel: [] filename_lookup+0x2b/0xc0 Nov 03 08:23:32 sh-103-53.int kernel: [] user_path_at_empty+0x67/0xc0 Nov 03 08:23:32 sh-103-53.int kernel: [] ? handle_mm_fault+0x39d/0x9b0 Nov 03 08:23:32 sh-103-53.int kernel: [] user_path_at+0x11/0x20 Nov 03 08:23:32 sh-103-53.int kernel: [] vfs_fstatat+0x63/0xc0 Nov 03 08:23:32 sh-103-53.int kernel: [] SYSC_newstat+0x2e/0x60 Nov 03 08:23:32 sh-103-53.int kernel: [] ? __check_object_size+0x1ca/0x250 Nov 03 08:23:32 sh-103-53.int kernel: [] ? 
SyS_rt_sigprocmask+0xc4/0x100 Nov 03 08:23:32 sh-103-53.int kernel: [] SyS_newstat+0xe/0x10 Nov 03 08:23:32 sh-103-53.int kernel: [] system_call_fastpath+0x22/0x27 Nov 03 08:23:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 03 08:23:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 08:32:08 sh-103-53.int kernel: Lustre: 123161:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572797972/real 1572797972] req@ffff975c6fdede80 x1648382359492912/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 0 to 1 dl 1572798728 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 08:32:08 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 08:32:08 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 08:33:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 08:33:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 08:44:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 03 08:44:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 08:44:44 sh-103-53.int kernel: Lustre: 123161:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572798728/real 1572798728] req@ffff975c6fdede80 x1648382359492912/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 0 to 1 dl 
1572799484 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Nov 03 08:44:44 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 08:44:44 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 08:54:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 03 08:54:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 09:04:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 09:04:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 09:14:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 09:14:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 09:24:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 03 09:24:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 09:35:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 09:35:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 09:46:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 03 09:46:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages 
Nov 03 09:56:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 09:56:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 10:02:33 sh-103-53.int kernel: Lustre: 128098:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572803397/real 1572803397] req@ffff975b9f44f080 x1648382360534512/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 0 to 1 dl 1572804153 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 10:02:33 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 10:02:33 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 10:06:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 10:06:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 10:15:09 sh-103-53.int kernel: Lustre: 128098:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572804153/real 1572804153] req@ffff975b9f44f080 x1648382360534512/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 0 to 1 dl 1572804909 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Nov 03 10:15:09 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 10:15:09 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 10:16:37 sh-103-53.int kernel: 
LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 10:16:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 10:26:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 03 10:26:48 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 10:27:45 sh-103-53.int kernel: Lustre: 128098:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572804909/real 1572804909] req@ffff975b9f44f080 x1648382360534512/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 0 to 1 dl 1572805665 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Nov 03 10:27:45 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 10:27:45 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 10:36:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 03 10:36:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 10:47:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 10:47:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 10:57:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 03 10:57:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar 
messages Nov 03 11:07:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 03 11:07:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 11:17:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 11:17:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 11:27:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 11:27:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 11:38:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 11:38:01 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 11:42:52 sh-103-53.int kernel: Lustre: 128070:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572809571/real 1572809571] req@ffff9760637dcc80 x1648382361703216/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572810172 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 11:42:52 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 11:42:52 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 11:48:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 11:48:07 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 11:52:53 sh-103-53.int kernel: Lustre: 128070:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572810172/real 1572810172] req@ffff9760637dcc80 x1648382361703216/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572810773 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Nov 03 11:52:53 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 11:52:53 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 11:58:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 03 11:58:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 12:08:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 12:08:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 12:18:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 12:18:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 12:28:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 12:28:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 12:39:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 
seconds Nov 03 12:39:04 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 12:48:01 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572813480/real 1572813480] req@ffff97602efa3180 x1648382362438736/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572814081 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 12:48:01 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 12:48:01 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 12:49:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 03 12:49:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 12:58:02 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572814081/real 1572814081] req@ffff975b92460d80 x1648382362556880/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 584/2088 e 1 to 1 dl 1572814682 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 12:58:02 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 12:58:02 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 12:59:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 16 seconds Nov 03 12:59:19 
sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 13:08:03 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572814682/real 1572814682] req@ffff975b92460d80 x1648382362556880/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 584/2088 e 1 to 1 dl 1572815283 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 13:08:03 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 13:08:03 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 13:09:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 13:09:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 13:18:04 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572815283/real 1572815283] req@ffff975b92460d80 x1648382362556880/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 584/2088 e 1 to 1 dl 1572815884 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 13:18:04 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 13:18:04 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 13:19:39 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 03 13:19:39 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 13:28:05 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572815884/real 1572815884] req@ffff975b92460d80 x1648382362556880/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 584/2088 e 1 to 1 dl 1572816485 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 13:28:05 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 13:28:05 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 13:29:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 03 13:29:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 13:38:07 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572816485/real 1572816485] req@ffff97678c987080 x1648382363008336/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 584/2088 e 1 to 1 dl 1572817086 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 13:38:07 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 13:38:07 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 13:39:58 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 13:39:58 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 13:48:08 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572817087/real 1572817087] req@ffff977795f48480 x1648382363122016/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572817688 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 13:48:08 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 13:48:08 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 13:50:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 03 13:50:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 13:58:09 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572817688/real 1572817688] req@ffff97687c85b180 x1648382363235168/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572818289 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 13:58:09 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 13:58:09 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 14:00:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 14:00:25 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 14:08:10 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572818289/real 1572818289] req@ffff97687c85b180 x1648382363235168/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572818890 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 14:08:10 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 14:08:10 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 14:10:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 03 14:10:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 14:18:11 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572818890/real 1572818890] req@ffff97687c85b180 x1648382363235168/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572819491 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 14:18:11 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 14:18:11 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 14:20:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 14:20:37 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 14:28:12 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572819491/real 1572819491] req@ffff97779afd2d00 x1648382363576064/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572820092 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 14:28:12 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 14:28:12 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 14:30:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 14:30:53 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 14:38:13 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572820092/real 1572820092] req@ffff97779afd2d00 x1648382363576064/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572820693 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 14:38:13 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 14:38:13 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 14:40:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 03 14:40:57 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 14:48:14 sh-103-53.int kernel: Lustre: 82375:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572820693/real 1572820693] req@ffff9779d3270000 x1648382363796480/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572821294 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 14:48:14 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 14:48:14 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 14:48:14 sh-103-53.int kernel: Lustre: 82375:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Nov 03 14:51:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 03 14:51:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 14:58:15 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572821294/real 1572821294] req@ffff975fb9837500 x1648382363796512/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572821895 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Nov 03 14:58:15 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 14:58:15 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 15:01:21 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 15:01:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 15:08:16 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572821895/real 1572821895] req@ffff975fb9837500 x1648382363796512/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572822496 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 15:08:16 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 15:08:16 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 15:11:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 03 15:11:27 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 15:18:17 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572822496/real 1572822496] req@ffff975fb9837500 x1648382363796512/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572823097 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 15:18:17 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 15:18:17 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 15:21:38 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 03 15:21:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 15:28:18 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572823097/real 1572823097] req@ffff975fb9837500 x1648382363796512/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572823698 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 15:28:18 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 15:28:18 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 15:31:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 15:31:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 15:38:22 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572823701/real 1572823701] req@ffff97712171da00 x1648382364356928/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 1792/8624 e 1 to 1 dl 1572824302 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1 Nov 03 15:38:22 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 15:38:22 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 15:42:06 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 15:42:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 15:48:23 sh-103-53.int kernel: Lustre: 91123:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572824302/real 1572824302] req@ffff97678c983a80 x1648382364469152/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 480/568 e 1 to 1 dl 1572824903 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 15:48:23 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 15:48:23 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 15:52:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 15:52:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 15:58:24 sh-103-53.int kernel: Lustre: 91122:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572824903/real 1572824903] req@ffff975b92463180 x1648382364580432/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 480/568 e 1 to 1 dl 1572825504 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 15:58:24 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 15:58:24 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 16:02:23 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 16:02:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 16:08:25 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572825504/real 1572825504] req@ffff97678c983a80 x1648382364580496/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572826105 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 16:08:25 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 16:08:25 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 16:12:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 03 16:12:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 16:18:26 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572826105/real 1572826105] req@ffff97678c983a80 x1648382364580496/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572826706 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 16:18:26 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 16:18:26 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 16:22:37 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 03 16:22:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 16:28:27 sh-103-53.int kernel: Lustre: 82315:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572826706/real 1572826706] req@ffff97582078ba80 x1648382364912368/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 576/2088 e 1 to 1 dl 1572827307 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 16:28:27 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 16:28:27 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 16:28:27 sh-103-53.int kernel: Lustre: 82315:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1 previous similar message Nov 03 16:32:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 16:32:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 16:38:28 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572827307/real 1572827307] req@ffff9762f5360900 x1648382365026688/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 584/2088 e 1 to 1 dl 1572827908 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 16:38:28 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 16:38:28 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection 
restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 16:42:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 03 16:42:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 16:48:29 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572827908/real 1572827908] req@ffff9762f5360900 x1648382365026688/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 584/2088 e 1 to 1 dl 1572828509 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 16:48:29 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 16:48:29 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 16:54:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 16:54:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 16:58:31 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572828509/real 1572828509] req@ffff9762f5360900 x1648382365026688/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 584/2088 e 1 to 1 dl 1572829110 ref 2 fl Rpc:X/2/ffffffff rc -11/-1 Nov 03 16:58:31 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 16:58:31 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 
10.0.10.51@o2ib7) Nov 03 17:04:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 03 17:04:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 17:08:32 sh-103-53.int kernel: Lustre: 208455:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572829111/real 1572829111] req@ffff97602efa7980 x1648382365362352/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572829712 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 17:08:32 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 17:08:32 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7) Nov 03 17:14:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 17:14:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 17:16:10 sh-103-53.int kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail Nov 03 17:16:10 sh-103-53.int kernel: LustreError: Skipped 2 previous similar messages Nov 03 17:18:33 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572829712/real 1572829712] req@ffff97549da79f80 x1648382365473728/t0(0) o101->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 592/2088 e 1 to 1 dl 1572830313 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Nov 03 17:18:33 sh-103-53.int kernel: Lustre: 212747:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 
1 previous similar message Nov 03 17:18:33 sh-103-53.int kernel: Lustre: fir-MDT0000-mdc-ffff9781f2230800: Connection to fir-MDT0000 (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 03 17:19:05 sh-103-53.int kernel: LustreError: 11-0: fir-MDT0002-mdc-ffff9781f2230800: operation ost_write to node 10.0.10.53@o2ib7 failed: rc = -107 Nov 03 17:24:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 17:24:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 17:28:18 sh-103-53.int kernel: LustreError: 235520:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 17:28:39 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1572830163/real 1572830163] req@ffff975ebf581680 x1648382365552608/t0(0) o400->fir-MDT0000-mdc-ffff9781f2230800@10.0.10.51@o2ib7:12/10 lens 224/224 e 0 to 1 dl 1572830919 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Nov 03 17:28:39 sh-103-53.int kernel: Lustre: 91121:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 86 previous similar messages Nov 03 17:35:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 03 17:35:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 17:39:41 sh-103-53.int kernel: LustreError: 236506:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 17:45:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 17:45:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 
17:51:28 sh-103-53.int kernel: LustreError: 237354:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 17:56:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 17:56:18 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 18:02:50 sh-103-53.int kernel: LustreError: 238192:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 18:06:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 18:06:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 18:14:04 sh-103-53.int kernel: LustreError: 239012:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 18:16:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 03 18:16:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 18:24:43 sh-103-53.int kernel: LustreError: 239771:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 18:26:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 18:26:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 18:34:09 sh-103-53.int kernel: LustreError: 131530:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9775d2689b80@{[20 -> 20/255], [3|0|+|rpc|wiuY|ffff97553a7e92c0], [28672|1|+|-|ffff975a09eef980|256|ffff976a06b20000]} fir-MDT0002-mdc-ffff9781f2230800: wait ext to 0 timedout, recovery in progress? 
Nov 03 18:34:09 sh-103-53.int kernel: LustreError: 131530:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9775d2689b80 ns: fir-MDT0002-mdc-ffff9781f2230800 lock: ffff975a09eef980/0x51ab3c4efd027b4 lrc: 3/0,0 mode: PW/PW res: [0x2c00313e1:0x61e:0x0].0x0 bits 0x40/0x0 rrc: 4 type: IBT flags: 0x49400000000 nid: local remote: 0x37a6454f95ae6903 expref: -99 pid: 127932 timeout: 0 lvb_type: 0 Nov 03 18:35:06 sh-103-53.int kernel: LustreError: 240552:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 18:36:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 18:36:49 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 18:46:38 sh-103-53.int kernel: LustreError: 241382:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 18:47:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 18:47:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 18:57:53 sh-103-53.int kernel: LustreError: 242205:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 18:58:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 18:58:10 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 19:08:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 03 19:08:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 19:09:33 sh-103-53.int kernel: LustreError: 
243063:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 19:18:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 19:18:37 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 19:21:50 sh-103-53.int kernel: LustreError: 243979:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 19:28:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 19:28:40 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 19:33:54 sh-103-53.int kernel: LustreError: 244837:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 19:38:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 19:38:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 19:45:18 sh-103-53.int kernel: LustreError: 245665:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 19:50:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 03 19:50:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 19:56:04 sh-103-53.int kernel: LustreError: 246447:0:(lmv_obd.c:1415:lmv_statfs()) fir-MDT0001-mdc-ffff9781f2230800: can't stat MDS #0: rc = -4 Nov 03 20:00:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 03 20:00:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Skipped 8 previous similar messages Nov 03 20:02:20 sh-103-53.int kernel: LNetError: 91068:0:(peer.c:2453:lnet_peer_merge_data()) Error deleting NID 10.0.10.51@o2ib7 from peer 10.0.10.51@o2ib7: -16 Nov 03 20:04:01 sh-103-53.int kernel: Lustre: Evicted from MGS (at MGC10.0.10.51@o2ib7_0) after server handle changed from 0x6f11bfb841877501 to 0x1306fc8de5f6076d Nov 03 20:04:01 sh-103-53.int kernel: Lustre: MGC10.0.10.51@o2ib7: Connection restored to MGC10.0.10.51@o2ib7_0 (at 10.0.10.51@o2ib7) Nov 03 20:06:46 sh-103-53.int kernel: Lustre: fir-OST0049-osc-ffff9781f2230800: Connection restored to 10.0.10.114@o2ib7 (at 10.0.10.114@o2ib7) Nov 03 20:10:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 03 20:10:23 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 20:20:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 03 20:20:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 20:25:56 sh-103-53.int kernel: LustreError: 11-0: fir-MDT0002-mdc-ffff9781f2230800: operation ldlm_enqueue to node 10.0.10.53@o2ib7 failed: rc = -107 Nov 03 20:25:56 sh-103-53.int kernel: LustreError: 91118:0:(import.c:706:ptlrpc_connect_import_locked()) already connecting Nov 03 20:25:56 sh-103-53.int kernel: LustreError: 167-0: fir-MDT0002-mdc-ffff9781f2230800: This client was evicted by fir-MDT0002; in progress operations using this service will fail. 
Nov 03 20:25:56 sh-103-53.int kernel: LustreError: Skipped 134 previous similar messages Nov 03 20:25:56 sh-103-53.int kernel: CARDAMOM_MDF.ex[127932]: segfault at 0 ip 00007fef19cd3594 sp 00007ffefdea65f0 error 4 Nov 03 20:25:56 sh-103-53.int kernel: CARDAMOM_MDF.ex[128143]: segfault at 0 ip 00007f5e6135e594 sp 00007fffb9607de0 error 4 in libc-2.17.so[7f5e612ef000+1c2000] Nov 03 20:25:56 sh-103-53.int kernel: CARDAMOM_MDF.ex[134373]: segfault at 0 ip 00007f552f196594 sp 00007ffd80bb7830 error 4 in libc-2.17.so[7f552f127000+1c2000] Nov 03 20:25:56 sh-103-53.int kernel: CARDAMOM_MDF.ex[128106]: segfault at 0 ip 00007f9d0917b594 sp 00007ffdcd3e5e70 error 4 in libc-2.17.so[7f9d0910c000+1c2000] Nov 03 20:25:56 sh-103-53.int kernel: Lustre: 91125:0:(llite_lib.c:2747:ll_dirty_page_discard_warn()) fir: dirty page discard: 10.0.10.51@o2ib7:10.0.10.52@o2ib7:/fir/fid: [0x2c00313e1:0x61e:0x0]// may get corrupted (rc -5) Nov 03 20:25:56 sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection restored to 10.0.10.53@o2ib7 (at 10.0.10.53@o2ib7) Nov 03 20:25:56 sh-103-53.int kernel: Lustre: Skipped 68 previous similar messages Nov 03 20:25:56 sh-103-53.int kernel: in libc-2.17.so[7fef19c64000+1c2000] Nov 03 20:28:00 sh-103-53.int kernel: Lustre: fir-OST0001-osc-ffff9781f2230800: Connection restored to 10.0.10.102@o2ib7 (at 10.0.10.102@o2ib7) Nov 03 20:29:09 sh-103-53.int kernel: Lustre: fir-OST0003-osc-ffff9781f2230800: Connection restored to 10.0.10.102@o2ib7 (at 10.0.10.102@o2ib7) Nov 03 20:30:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 20:30:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 20:40:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 03 20:40:51 sh-103-53.int kernel: LNet: 
91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 03 20:51:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 03 20:51:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 21:01:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 21:01:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 21:06:48 sh-103-53.int kernel: python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Nov 03 21:06:48 sh-103-53.int kernel: python3 cpuset=step_batch mems_allowed=0-1 Nov 03 21:06:48 sh-103-53.int kernel: CPU: 0 PID: 214101 Comm: python3 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 Nov 03 21:06:48 sh-103-53.int kernel: Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.3.10 08/15/2019 Nov 03 21:06:48 sh-103-53.int kernel: Call Trace: Nov 03 21:06:48 sh-103-53.int kernel: [] dump_stack+0x19/0x1b Nov 03 21:06:48 sh-103-53.int kernel: [] dump_header+0x90/0x229 Nov 03 21:06:48 sh-103-53.int kernel: [] ? default_wake_function+0x12/0x20 Nov 03 21:06:48 sh-103-53.int kernel: [] ? find_lock_task_mm+0x56/0xc0 Nov 03 21:06:48 sh-103-53.int kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60 Nov 03 21:06:48 sh-103-53.int kernel: [] oom_kill_process+0x254/0x3d0 Nov 03 21:06:48 sh-103-53.int kernel: [] mem_cgroup_oom_synchronize+0x546/0x570 Nov 03 21:06:48 sh-103-53.int kernel: [] ? 
mem_cgroup_charge_common+0xc0/0xc0 Nov 03 21:06:48 sh-103-53.int kernel: [] pagefault_out_of_memory+0x14/0x90 Nov 03 21:06:48 sh-103-53.int kernel: [] mm_fault_error+0x6a/0x157 Nov 03 21:06:48 sh-103-53.int kernel: [] __do_page_fault+0x3c8/0x4f0 Nov 03 21:06:48 sh-103-53.int kernel: [] do_page_fault+0x35/0x90 Nov 03 21:06:48 sh-103-53.int kernel: [] page_fault+0x28/0x30 Nov 03 21:06:48 sh-103-53.int kernel: Task in /slurm/uid_292669/job_54036414/step_batch/task_0 killed as a result of limit of /slurm/uid_292669/job_54036414/step_batch Nov 03 21:06:48 sh-103-53.int kernel: memory: usage 4194304kB, limit 4194304kB, failcnt 46271 Nov 03 21:06:48 sh-103-53.int kernel: memory+swap: usage 4194304kB, limit 4194304kB, failcnt 11 Nov 03 21:06:48 sh-103-53.int kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Nov 03 21:06:48 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54036414/step_batch: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Nov 03 21:06:48 sh-103-53.int kernel: Memory cgroup stats for /slurm/uid_292669/job_54036414/step_batch/task_0: cache:1948KB rss:4192356KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:722552KB active_anon:3469804KB inactive_file:992KB active_file:956KB unevictable:0KB Nov 03 21:06:48 sh-103-53.int kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Nov 03 21:06:48 sh-103-53.int kernel: [208447] 292669 208447 28296 363 13 0 0 slurm_script Nov 03 21:06:48 sh-103-53.int kernel: [208455] 292669 208455 3251453 1048261 2500 0 0 python3 Nov 03 21:06:48 sh-103-53.int kernel: Memory cgroup out of memory: Kill process 214103 (python3) score 1002 or sacrifice child Nov 03 21:06:48 sh-103-53.int kernel: Killed process 208455 (python3) total-vm:13005812kB, anon-rss:4191708kB, file-rss:1336kB, shmem-rss:0kB Nov 03 21:11:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed 
out tx for 10.9.0.31@o2ib4: 4 seconds Nov 03 21:11:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 21:21:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 03 21:21:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 21:31:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 21:31:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 21:41:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 03 21:41:52 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 21:52:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 03 21:52:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 22:02:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 22:02:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 22:12:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 22:12:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 22:22:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 03 22:22:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Skipped 9 previous similar messages Nov 03 22:32:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 22:32:50 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 22:42:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 22:42:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 22:53:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 03 22:53:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 23:03:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 23:03:20 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 23:13:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 03 23:13:29 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 23:23:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 03 23:23:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 03 23:33:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 03 23:33:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 23:43:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed 
out tx for 10.9.0.31@o2ib4: 0 seconds Nov 03 23:43:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 03 23:54:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 03 23:54:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 00:04:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 04 00:04:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 00:14:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 00:14:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 00:24:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 04 00:24:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 00:34:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 04 00:34:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 00:44:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 00:44:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 00:55:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 04 00:55:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Skipped 9 previous similar messages Nov 04 01:05:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 04 01:05:13 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 04 01:15:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 01:15:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 01:25:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 04 01:25:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 01:35:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 11 seconds Nov 04 01:35:41 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 01:45:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 01:45:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 01:56:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 04 01:56:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 04 02:06:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 04 02:06:11 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 02:16:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed 
out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 02:16:30 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 02:26:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 02:26:31 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 02:36:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 04 02:36:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 02:47:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 04 02:47:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 02:58:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 04 02:58:02 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 03:08:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 03:08:21 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 03:18:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 03:18:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 03:28:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 9 seconds Nov 04 03:28:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Skipped 9 previous similar messages Nov 04 03:38:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 03:38:51 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 04 03:49:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 04 03:49:00 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 03:59:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 8 seconds Nov 04 03:59:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 04:09:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 04 04:09:12 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 04:19:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 04:19:26 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 04:30:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 04:30:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 04:40:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 04 04:40:46 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 04:50:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed 
out tx for 10.9.0.31@o2ib4: 8 seconds Nov 04 04:50:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 05:01:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 04 05:01:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 05:11:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 05:11:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 05:21:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 04 05:21:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 04 05:31:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 04 05:31:33 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 05:41:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 05:41:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 05:52:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 15 seconds Nov 04 05:52:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 06:03:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 06:03:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Skipped 9 previous similar messages Nov 04 06:13:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 04 06:13:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 06:23:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 04 06:23:24 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 06:33:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 06:33:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 06:43:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 5 seconds Nov 04 06:43:44 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 06:53:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 14 seconds Nov 04 06:53:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 07:04:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 07:04:08 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 07:14:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 04 07:14:14 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 07:24:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed 
out tx for 10.9.0.31@o2ib4: 12 seconds Nov 04 07:24:25 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 07:34:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 04 07:34:42 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 07:44:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 04 07:44:45 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 07:55:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 0 seconds Nov 04 07:55:59 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 08:06:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 04 08:06:05 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 08:16:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Nov 04 08:16:16 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 08:26:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 08:26:32 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 08:36:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 04 08:36:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) 
Skipped 9 previous similar messages Nov 04 08:46:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 12 seconds Nov 04 08:46:47 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 08:57:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 08:57:03 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 09:07:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 04 09:07:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 09:17:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 04 09:17:15 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 09:24:26 sh-103-53.int kernel: Lustre: fir-MDT0002-mdc-ffff9781f2230800: Connection to fir-MDT0002 (at 10.0.10.53@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 04 09:24:26 sh-103-53.int kernel: Lustre: Skipped 98 previous similar messages Nov 04 09:24:26 sh-103-53.int kernel: LustreError: 167-0: fir-MDT0002-mdc-ffff9781f2230800: This client was evicted by fir-MDT0002; in progress operations using this service will fail. Nov 04 09:24:26 sh-103-53.int kernel: LustreError: 281933:0:(ldlm_resource.c:1147:ldlm_resource_complain()) fir-MDT0002-mdc-ffff9781f2230800: namespace resource [0x2c0032e40:0x1:0x0].0x0 (ffff977dcaaaa300) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Nov 04 09:26:57 sh-103-53.int kernel: Lustre: fir-OST0000-osc-ffff9781f2230800: Connection to fir-OST0000 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete Nov 04 09:26:57 sh-103-53.int kernel: LustreError: 167-0: fir-OST0001-osc-ffff9781f2230800: This client was evicted by fir-OST0001; in progress operations using this service will fail. Nov 04 09:26:57 sh-103-53.int kernel: LustreError: 329854:0:(ldlm_resource.c:1147:ldlm_resource_complain()) fir-OST0001-osc-ffff9781f2230800: namespace resource [0xb40000402:0x1242eae:0x0].0x0 (ffff9781febe8840) refcount nonzero (1) after lock cleanup; forcing cleanup. Nov 04 09:26:57 sh-103-53.int kernel: LustreError: 329854:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message Nov 04 09:26:57 sh-103-53.int kernel: Lustre: Skipped 2 previous similar messages Nov 04 09:27:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 09:27:34 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 09:38:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 13 seconds Nov 04 09:38:38 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 09:48:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 04 09:48:54 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 09:54:07 sh-103-53.int kernel: LustreError: 396369:0:(file.c:4339:ll_inode_revalidate_fini()) fir: revalidate FID [0x240000406:0x138:0x0] error: rc = -108 Nov 04 09:58:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 3 seconds Nov 
04 09:58:57 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 10:09:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 7 seconds Nov 04 10:09:06 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 10:19:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 10:19:22 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 10:29:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 4 seconds Nov 04 10:29:28 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 6 previous similar messages Nov 04 10:39:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 04 10:39:36 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 10:49:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 2 seconds Nov 04 10:49:55 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 9 previous similar messages Nov 04 10:59:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 1 seconds Nov 04 10:59:56 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 7 previous similar messages Nov 04 11:10:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 10 seconds Nov 04 11:10:07 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages Nov 04 
11:21:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.31@o2ib4: 6 seconds Nov 04 11:21:19 sh-103-53.int kernel: LNet: 91081:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Skipped 8 previous similar messages