[LU-13351] Out of memory on server perhaps due to unreachable lnet network Created: 10/Mar/20  Updated: 11/Mar/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Stephane Thiell Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6


Attachments: Text File fir-io3-s2_foreachbt_2020-03-07-23-00-05.txt     Text File fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

Last Saturday, we hit the following crash on one of Fir's OSS. This is with Lustre 2.12.3.

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
    DUMPFILE: fir-io3-s2_vmcore_2020-03-07-23-00-05  [PARTIAL DUMP]
        CPUS: 48
        DATE: Sat Mar  7 22:59:56 2020
      UPTIME: 88 days, 17:04:42
LOAD AVERAGE: 392.09, 183.09, 89.88
       TASKS: 2344
    NODENAME: fir-io3-s2
     RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
     VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
     MACHINE: x86_64  (1996 Mhz)
      MEMORY: 255.6 GB
       PANIC: "Kernel panic - not syncing: Out of memory and no killable processes..."
         PID: 6889
     COMMAND: "ll_ost_io02_054"
        TASK: ffff9c36fc49b0c0  [THREAD_INFO: ffff9c340a59c000]
         CPU: 38
       STATE: TASK_RUNNING (PANIC)

Kernel memory:

crash> kmem -i
...(garbage)...
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  65891310     251.4 GB         ----
         FREE   590325       2.3 GB    0% of TOTAL MEM
         USED  65300985     249.1 GB   99% of TOTAL MEM
       SHARED   100074     390.9 MB    0% of TOTAL MEM
      BUFFERS    46259     180.7 MB    0% of TOTAL MEM
       CACHED    29672     115.9 MB    0% of TOTAL MEM
         SLAB  63120835     240.8 GB   95% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP  1048575         4 GB         ----
    SWAP USED   267787         1 GB   25% of TOTAL SWAP
    SWAP FREE   780788         3 GB   74% of TOTAL SWAP

 COMMIT LIMIT  33994230     129.7 GB         ----
    COMMITTED   270117         1 GB    0% of TOTAL LIMIT
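As an aside (not part of the original report), the slab dominance in the `kmem -i` output above is easy to quantify: SLAB pages divided by total pages gives roughly 95–96%, which is what points at a slab leak rather than ordinary memory pressure. A minimal Python sketch, using the page counts copied from the dump above:

```python
# Minimal sketch: parse a few rows of crash's `kmem -i` output and
# compute what fraction of total memory the slab caches consume.
# The sample text is copied from the dump above; this relies only on
# whitespace splitting, since column widths vary between crash versions.

KMEM_I = """\
    TOTAL MEM  65891310     251.4 GB
         FREE   590325       2.3 GB
         USED  65300985     249.1 GB
         SLAB  63120835     240.8 GB
"""

def parse_pages(text):
    """Map each row label (e.g. 'TOTAL MEM') to its page count."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # the label is everything before the first all-digit token
        for i, tok in enumerate(parts):
            if tok.isdigit():
                stats[" ".join(parts[:i])] = int(tok)
                break
    return stats

stats = parse_pages(KMEM_I)
slab_pct = 100.0 * stats["SLAB"] / stats["TOTAL MEM"]
print(f"SLAB uses {slab_pct:.1f}% of total memory")  # ~95.8%
```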

Unfortunately, the vmcore seems to be corrupted, as I cannot access the slab information:

crash> kmem -s
kmem: invalid kernel virtual address: b3dd52e21b11fbc  type: "list entry"

Last week, we had to change the LNet routes live on Fir, migrating routers from o2ib4/6 to new LNet networks for further expansion (long story). This all went well, but this specific server may have kept a reference to o2ib4, as we can see many occurrences of the following logs:

[7663978.048351] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1583649999/real 1583649999]  req@ffff9c2e05194380 x1652574281110560/t0(0) o106->fir-OST0019@10.9.0.63@o2ib4:15/16 lens 296/280 e 0 to 1 dl 1583650006 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
[7663978.076127] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 21840089 previous similar messages
[7663988.066479] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.9.0.63@o2ib4 from <?>
[7663988.077177] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 21832206 previous similar messages

Again, this is normal, as o2ib4 had just been decommissioned. Fir is using o2ib7 and we have IB/IB routers to several IB fabrics, each with its own o2ib index. It's possible that a memory leak occurred due to this specific situation. We don't consider this a major issue, but wanted to report it anyway because a server crashed.
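For reference, and as an assumption on my part rather than something tried on this node: with Dynamic LNet Configuration in 2.12, leftover routes to a decommissioned network can normally be inspected and dropped at runtime with `lnetctl`. The gateway NID below is purely illustrative, not taken from this dump:

```shell
# Hypothetical cleanup sketch (the gateway NID is an example, not from this system)

# list currently configured routes, including any leftover o2ib4 entries
lnetctl route show --verbose

# drop a stale route to the decommissioned network
lnetctl route del --net o2ib4 --gateway 10.0.10.201@o2ib7

# confirm nothing still resolves via o2ib4
lnetctl route show --net o2ib4
```

Whether removing the stale route would have prevented the leak here is unclear; the server apparently kept retrying RPCs toward `10.9.0.63@o2ib4` long after the network was gone.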

I'm attaching the vmcore dmesg as fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt and the output of foreach bt as fir-io3-s2_foreachbt_2020-03-07-23-00-05.txt. Also, the full vmcore has been uploaded to the FTP as fir-io3-s2_vmcore_2020-03-07-23-00-05 (but it looks like it's incomplete).



 Comments   
Comment by Peter Jones [ 11/Mar/20 ]

Serguei

Could you please investigate?

Thanks

Peter
