[LU-13351] Out of memory on server perhaps due to unreachable lnet network Created: 10/Mar/20 Updated: 11/Mar/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi,
Last Saturday, we hit the following crash on one of Fir's OSS. This is with Lustre 2.12.3.
KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
DUMPFILE: fir-io3-s2_vmcore_2020-03-07-23-00-05 [PARTIAL DUMP]
CPUS: 48
DATE: Sat Mar 7 22:59:56 2020
UPTIME: 88 days, 17:04:42
LOAD AVERAGE: 392.09, 183.09, 89.88
TASKS: 2344
NODENAME: fir-io3-s2
RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
MACHINE: x86_64 (1996 Mhz)
MEMORY: 255.6 GB
PANIC: "Kernel panic - not syncing: Out of memory and no killable processes..."
PID: 6889
COMMAND: "ll_ost_io02_054"
TASK: ffff9c36fc49b0c0 [THREAD_INFO: ffff9c340a59c000]
CPU: 38
STATE: TASK_RUNNING (PANIC)
Kernel memory:
crash> kmem -i
...(garbage)...
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  65891310     251.4 GB         ----
         FREE    590325       2.3 GB    0% of TOTAL MEM
         USED  65300985     249.1 GB   99% of TOTAL MEM
       SHARED    100074     390.9 MB    0% of TOTAL MEM
      BUFFERS     46259     180.7 MB    0% of TOTAL MEM
       CACHED     29672     115.9 MB    0% of TOTAL MEM
         SLAB  63120835     240.8 GB   95% of TOTAL MEM
   TOTAL HUGE         0            0         ----
    HUGE FREE         0            0    0% of TOTAL HUGE
   TOTAL SWAP   1048575         4 GB         ----
    SWAP USED    267787         1 GB   25% of TOTAL SWAP
    SWAP FREE    780788         3 GB   74% of TOTAL SWAP
 COMMIT LIMIT  33994230     129.7 GB         ----
    COMMITTED    270117         1 GB    0% of TOTAL LIMIT
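As a quick sanity check on the kmem -i figures above, the page counts can be converted back to sizes and shares. This is only a sketch: it assumes 4 KiB base pages (the usual x86_64 default; the page size is not stated in the dump summary), and the names PAGE_SIZE, PAGES, pages_to_gib and share_of_total are made up for illustration.

```python
# Page counts copied from the `kmem -i` output above.
PAGE_SIZE = 4096  # assumption: 4 KiB base pages (x86_64 default)

PAGES = {
    "TOTAL MEM": 65891310,
    "FREE": 590325,
    "USED": 65300985,
    "SLAB": 63120835,
}


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


def share_of_total(name):
    """Fraction of TOTAL MEM taken by the named row."""
    return PAGES[name] / PAGES["TOTAL MEM"]


for name, pages in PAGES.items():
    print(f"{name:>10}: {pages_to_gib(pages):6.1f} GiB  ({share_of_total(name):.1%})")
```

Under that assumption the sizes match the TOTAL column, FREE plus USED adds up to TOTAL MEM, and slab comes out at a little under 96% of memory, which crash reports as 95%.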
Unfortunately, the vmcore seems to be partially corrupted, as I cannot access the slab information:
crash> kmem -s
kmem: invalid kernel virtual address: b3dd52e21b11fbc type: "list entry"
Last week, we had to change the LNet routes live on Fir, and we migrated routers from o2ib4/6 to other, new LNet networks for further expansion (long story). This all went well, but this specific server may have kept a reference to o2ib4, as we can see many occurrences of the following logs:
[7663978.048351] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1583649999/real 1583649999] req@ffff9c2e05194380 x1652574281110560/t0(0) o106->fir-OST0019@10.9.0.63@o2ib4:15/16 lens 296/280 e 0 to 1 dl 1583650006 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
[7663978.076127] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 21840089 previous similar messages
[7663988.066479] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.9.0.63@o2ib4 from <?>
[7663988.077177] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 21832206 previous similar messages
Again, this is normal, as o2ib4 had just been decommissioned. Fir is using o2ib7 and we have IB/IB routers to several IB fabrics, each with its own o2ib index. It's possible that a memory leak occurred due to this specific situation. We don't consider this a major issue, but wanted to report it anyway because a server crashed. I'm attaching dmesg-vmcore.txt as fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt |
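To get a feel for how heavily these messages were being rate-limited, the "Skipped N previous similar messages" lines can be pulled out of the attached dmesg text. The following is a minimal sketch, not a Lustre tool: the regular expression is written against the two summary lines quoted in the description, and SAMPLE, SKIPPED_RE and skipped_counts are hypothetical names.

```python
import re

# Two of the rate-limit summary lines quoted in the description, as sample input.
SAMPLE = """\
[7663978.076127] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 21840089 previous similar messages
[7663988.077177] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 21832206 previous similar messages
"""

# Assumed line layout: [timestamp] Facility: pid:ext:(file:line:func()) Skipped N ...
SKIPPED_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"(?P<facility>\w+):\s+\d+:\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"Skipped (?P<count>\d+) previous similar messages"
)


def skipped_counts(text):
    """Return (timestamp, function, skipped count) for each summary line."""
    return [
        (float(m["ts"]), m["func"], int(m["count"]))
        for m in SKIPPED_RE.finditer(text)
    ]


for ts, func, count in skipped_counts(SAMPLE):
    print(f"{ts:.6f} {func}: {count} skipped")
```

The counts alone do not say what interval each summary covers, so any rate estimate would need consecutive summary lines for the same function from the full dmesg file.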
| Comments |
| Comment by Peter Jones [ 11/Mar/20 ] |
|
Serguei, could you please investigate? Thanks, Peter |