Lustre / LU-13351

Out of memory on server perhaps due to unreachable lnet network

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.3
    • None
    • CentOS 7.6
    • 3
    • 9223372036854775807

    Description

      Hi,

      Last Saturday, we hit the following crash on one of Fir's OSSes. This is with Lustre 2.12.3.

            KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
          DUMPFILE: fir-io3-s2_vmcore_2020-03-07-23-00-05  [PARTIAL DUMP]
              CPUS: 48
              DATE: Sat Mar  7 22:59:56 2020
            UPTIME: 88 days, 17:04:42
      LOAD AVERAGE: 392.09, 183.09, 89.88
             TASKS: 2344
          NODENAME: fir-io3-s2
           RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
           VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
           MACHINE: x86_64  (1996 Mhz)
            MEMORY: 255.6 GB
             PANIC: "Kernel panic - not syncing: Out of memory and no killable processes..."
               PID: 6889
           COMMAND: "ll_ost_io02_054"
              TASK: ffff9c36fc49b0c0  [THREAD_INFO: ffff9c340a59c000]
               CPU: 38
             STATE: TASK_RUNNING (PANIC)
      

      Kernel memory:

      crash> kmem -i
      ...(garbage)...
                       PAGES        TOTAL      PERCENTAGE
          TOTAL MEM  65891310     251.4 GB         ----
               FREE   590325       2.3 GB    0% of TOTAL MEM
               USED  65300985     249.1 GB   99% of TOTAL MEM
             SHARED   100074     390.9 MB    0% of TOTAL MEM
            BUFFERS    46259     180.7 MB    0% of TOTAL MEM
             CACHED    29672     115.9 MB    0% of TOTAL MEM
               SLAB  63120835     240.8 GB   95% of TOTAL MEM
      
         TOTAL HUGE        0            0         ----
          HUGE FREE        0            0    0% of TOTAL HUGE
      
         TOTAL SWAP  1048575         4 GB         ----
          SWAP USED   267787         1 GB   25% of TOTAL SWAP
          SWAP FREE   780788         3 GB   74% of TOTAL SWAP
      
       COMMIT LIMIT  33994230     129.7 GB         ----
          COMMITTED   270117         1 GB    0% of TOTAL LIMIT
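
As a quick sanity check of the `kmem -i` figures above, the page counts can be converted to sizes by hand. This is only a minimal sketch: the 4096-byte page size is an assumption (the usual x86_64 base page size), and the helper name `pages_to_gib` is made up for illustration.

```python
# Assumption: 4 KiB base pages, as is typical on x86_64.
PAGE_SIZE = 4096


def pages_to_gib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a page count to GiB (2**30 bytes)."""
    return pages * page_size / 2**30


# Page counts copied from the kmem -i output in this report.
total_pages = 65891310
slab_pages = 63120835

slab_gib = pages_to_gib(slab_pages)
slab_share = slab_pages / total_pages

print(f"slab: {slab_gib:.1f} GiB, {slab_share:.1%} of total memory")
```

Under that assumption, the result agrees with the SLAB line reported by crash: almost all of the server's memory was held in slab caches at the time of the panic.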
      

      Unfortunately, the vmcore seems to be partially corrupted, as I cannot access the slab information:

      crash> kmem -s
      kmem: invalid kernel virtual address: b3dd52e21b11fbc  type: "list entry"
      

      Last week, we had to change the LNet routes live on Fir, and we migrated routers from o2ib4/6 to other, new LNet networks for further expansion (long story). This all went well, but this specific server may have kept a reference to o2ib4, as we see many occurrences of the following log messages:

      [7663978.048351] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1583649999/real 1583649999]  req@ffff9c2e05194380 x1652574281110560/t0(0) o106->fir-OST0019@10.9.0.63@o2ib4:15/16 lens 296/280 e 0 to 1 dl 1583650006 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
      [7663978.076127] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 21840089 previous similar messages
      [7663988.066479] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.9.0.63@o2ib4 from <?>
      [7663988.077177] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 21832206 previous similar messages
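
To get a rough idea of how many "no route" errors a dmesg capture contains per LNet network, the lines can be tallied with a short script. This is a sketch only: the function name `tally_unroutable` is hypothetical, and it assumes that a "Skipped N previous similar messages" line refers to the same network as the last "no route to" line seen before it, which the log format itself does not guarantee.

```python
import re
from collections import Counter

# Matches e.g. "no route to 10.9.0.63@o2ib4"; group 2 is the LNet network.
NID_RE = re.compile(r"no route to (\S+)@(\w+)")
SKIPPED_RE = re.compile(r"Skipped (\d+) previous similar messages")


def tally_unroutable(lines):
    """Count 'no route' errors per LNet network, including skipped repeats."""
    counts = Counter()
    last_net = None
    for line in lines:
        if "lnet_handle_find_routed_path" not in line:
            continue
        match = NID_RE.search(line)
        if match:
            last_net = match.group(2)
            counts[last_net] += 1
            continue
        skipped = SKIPPED_RE.search(line)
        # Assumption: skipped messages belong to the last network seen.
        if skipped and last_net is not None:
            counts[last_net] += int(skipped.group(1))
    return counts


# The two LNetError lines quoted in this report.
sample = [
    "[7663988.066479] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.9.0.63@o2ib4 from <?>",
    "[7663988.077177] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 21832206 previous similar messages",
]

counts = tally_unroutable(sample)
for net, n in counts.most_common():
    print(net, n)
```

Run against the full dmesg attachment, a tally like this would show whether the errors are confined to o2ib4 or also involve other networks.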
      

      Again, this is expected, as o2ib4 had just been decommissioned. Fir uses o2ib7, and we have IB/IB routers to several IB fabrics, each with its own o2ib index. It's possible that a memory leak occurred due to this specific situation. We don't consider this a major issue, but wanted to report it anyway because a server crashed.

      I'm attaching dmesg-vmcore.txt as fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt and the output of foreach bt as fir-io3-s2_foreachbt_2020-03-07-23-00-05.txt. Also, the full vmcore has been uploaded to the FTP as fir-io3-s2_vmcore_2020-03-07-23-00-05 (but it looks like it's incomplete).

          People

            ssmirnov Serguei Smirnov
            sthiell Stephane Thiell