Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.12.3
-
None
-
CentOS 7.6
-
3
-
Description
Hi,
Last Saturday, we hit the following crash on one of Fir's OSS. This is with Lustre 2.12.3.
KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
DUMPFILE: fir-io3-s2_vmcore_2020-03-07-23-00-05 [PARTIAL DUMP]
CPUS: 48
DATE: Sat Mar 7 22:59:56 2020
UPTIME: 88 days, 17:04:42
LOAD AVERAGE: 392.09, 183.09, 89.88
TASKS: 2344
NODENAME: fir-io3-s2
RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
MACHINE: x86_64 (1996 Mhz)
MEMORY: 255.6 GB
PANIC: "Kernel panic - not syncing: Out of memory and no killable processes..."
PID: 6889
COMMAND: "ll_ost_io02_054"
TASK: ffff9c36fc49b0c0 [THREAD_INFO: ffff9c340a59c000]
CPU: 38
STATE: TASK_RUNNING (PANIC)
Kernel memory:
crash> kmem -i
...(garbage)...
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  65891310     251.4 GB         ----
         FREE    590325       2.3 GB    0% of TOTAL MEM
         USED  65300985     249.1 GB   99% of TOTAL MEM
       SHARED    100074     390.9 MB    0% of TOTAL MEM
      BUFFERS     46259     180.7 MB    0% of TOTAL MEM
       CACHED     29672     115.9 MB    0% of TOTAL MEM
         SLAB  63120835     240.8 GB   95% of TOTAL MEM
   TOTAL HUGE         0            0         ----
    HUGE FREE         0            0    0% of TOTAL HUGE
   TOTAL SWAP   1048575         4 GB         ----
    SWAP USED    267787         1 GB   25% of TOTAL SWAP
    SWAP FREE    780788         3 GB   74% of TOTAL SWAP
 COMMIT LIMIT  33994230     129.7 GB         ----
    COMMITTED    270117         1 GB    0% of TOTAL LIMIT
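As a quick sanity check on the kmem -i figures above, the page counts can be converted to sizes and the slab share recomputed. This is only a sketch: the 4096-byte page size is an assumption (the usual x86_64 base page size), and the helper name pages_to_gib is made up for illustration.

```python
# Assumption: 4 KiB base pages, the usual value on x86_64.
PAGE_SIZE = 4096


def pages_to_gib(pages: int) -> float:
    """Convert a page count from `kmem -i` into GiB."""
    return pages * PAGE_SIZE / 2**30


# Page counts copied from the kmem -i output in this report.
total_pages = 65891310
slab_pages = 63120835
free_pages = 590325

# Fraction of total memory held by the slab allocator.
slab_fraction = slab_pages / total_pages

print(f"total: {pages_to_gib(total_pages):.1f} GiB")
print(f"slab:  {pages_to_gib(slab_pages):.1f} GiB ({slab_fraction:.1%} of total)")
print(f"free:  {pages_to_gib(free_pages):.1f} GiB")
```

The results agree with the TOTAL column reported by crash, so the counters themselves look consistent even though the slab lists cannot be walked: almost all of the memory is in slab rather than in page cache or buffers.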
Unfortunately, the vmcore appears to be partially corrupted, as I cannot access the slab information:
crash> kmem -s
kmem: invalid kernel virtual address: b3dd52e21b11fbc type: "list entry"
Last week, we had to change the LNet routes live on Fir, and we migrated routers from o2ib4/6 to other, new LNet networks for further expansion (long story). This all went well, but this specific server may have kept a reference to o2ib4, as we can see many occurrences of the following logs:
[7663978.048351] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1583649999/real 1583649999] req@ffff9c2e05194380 x1652574281110560/t0(0) o106->fir-OST0019@10.9.0.63@o2ib4:15/16 lens 296/280 e 0 to 1 dl 1583650006 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
[7663978.076127] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 21840089 previous similar messages
[7663988.066479] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.9.0.63@o2ib4 from <?>
[7663988.077177] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 21832206 previous similar messages
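To get a feel for how often these messages repeat across the whole dmesg, the "Skipped N previous similar messages" lines can be summed per reporting function. The following is a rough sketch, not a Lustre tool: the regular expression is based only on the log lines quoted here, and the function name sum_skipped is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this report (an assumption,
# not an official Lustre log grammar): "(file.c:LINE:function()) Skipped N ..."
SKIPPED = re.compile(
    r"\(([\w.-]+):(\d+):(\w+)\(\)\) Skipped (\d+) previous similar messages"
)


def sum_skipped(lines):
    """Return {function_name: total skipped message count} for the given lines."""
    totals = Counter()
    for line in lines:
        match = SKIPPED.search(line)
        if match:
            function_name = match.group(3)
            totals[function_name] += int(match.group(4))
    return dict(totals)


# The two "Skipped" lines from the excerpt in this report.
sample = [
    "[7663978.076127] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) "
    "Skipped 21840089 previous similar messages",
    "[7663988.077177] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) "
    "Skipped 21832206 previous similar messages",
]

for function_name, count in sum_skipped(sample).items():
    print(function_name, count)
```

Run against the attached dmesg file (for example by passing an open file object as lines), this would give the total number of suppressed messages per function over the whole log.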
Again, this is normal, as o2ib4 had just been decommissioned. Fir is using o2ib7, and we have IB/IB routers to several IB fabrics, each with its own o2ib index. It's possible that a memory leak occurred due to this specific situation. We don't consider this a major issue, but wanted to report it anyway because a server crashed.
I'm attaching dmesg-vmcore.txt as fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt and the output of foreach bt as fir-io3-s2_foreachbt_2020-03-07-23-00-05.txt. Also, the full vmcore has been uploaded to the FTP as fir-io3-s2_vmcore_2020-03-07-23-00-05 (but it looks like it's incomplete).