[LU-15442] e2freefrag can't complete because of OOM Created: 12/Jan/22 Updated: 21/Mar/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A single 300TB OST, with 85% of the filesystem filled up.

# rpm -qa | grep e2fsprogs
e2fsprogs-1.46.2.wc3-0.el7.x86_64
e2fsprogs-libs-1.46.2.wc3-0.el7.x86_64
e2fsprogs-devel-1.46.2.wc3-0.el7.x86_64

# df -t lustre
Filesystem     1K-blocks         Used    Available Use% Mounted on
/dev/sda    313826717284 262586960268  48064149188  85% /lustre/ost0000

# e2freefrag /dev/sda
Jan 12 12:13:52 es7990e1-vm1 kernel: Call Trace:
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf835a9>] dump_stack+0x19/0x1b
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf7e648>] dump_header+0x90/0x229
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac906492>] ? ktime_get_ts64+0x52/0xf0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac95db1f>] ? delayacct_end+0x8f/0xb0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9c204d>] oom_kill_process+0x2cd/0x490
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9c1a3d>] ? oom_unkillable_task+0xcd/0x120
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9c273a>] out_of_memory+0x31a/0x500
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9c9354>] __alloc_pages_nodemask+0xad4/0xbe0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffaca1c739>] alloc_pages_vma+0xa9/0x200
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9f6337>] handle_mm_fault+0xcb7/0xfb0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf90653>] __do_page_fault+0x213/0x500
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf90a26>] trace_do_page_fault+0x56/0x150
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf8ffa2>] do_async_page_fault+0x22/0xf0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf8c7a8>] async_page_fault+0x28/0x30
Jan 12 12:13:52 es7990e1-vm1 kernel: Mem-Info:
Jan 12 12:13:52 es7990e1-vm1 kernel: active_anon:32965316 inactive_anon:953248 isolated_anon:0
 active_file:19299 inactive_file:18944 isolated_file:0
 unevictable:0 dirty:0 writeback:2 unstable:0
 slab_reclaimable:122379 slab_unreclaimable:74262
 mapped:8092 shmem:8083 pagetables:69898 bounce:0
 free:2249818 free_pcp:3985 free_cma:0
Jan 12 12:13:52 es7990e1-vm1 kernel: Node 0 DMA free:15892kB min:868kB low:1084kB high:1300kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 12 12:13:52 es7990e1-vm1 kernel: lowmem_reserve[]: 0 943 150076 150076
Jan 12 12:13:52 es7990e1-vm1 kernel: Node 0 DMA32 free:648832kB min:52748kB low:65932kB high:79120kB active_anon:59992kB inactive_anon:47604kB active_file:1524kB inactive_file:1020kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080608kB managed:966464kB mlocked:0kB dirty:0kB writeback:0kB mapped:252kB shmem:252kB slab_reclaimable:6616kB slab_unreclaimable:5092kB kernel_stack:384kB pagetables:236kB unstable:0kB bounce:0kB free_pcp:7136kB local_pcp:264kB free_cma:0kB writeback_tmp:0kB pages_scanned:5844902 all_unreclaimable? yes
Jan 12 12:13:52 es7990e1-vm1 kernel: lowmem_reserve[]: 0 0 149132 149132
Jan 12 12:13:52 es7990e1-vm1 kernel: Node 0 Normal free:8334548kB min:8334988kB low:10418732kB high:12502480kB active_anon:131801272kB inactive_anon:3765388kB active_file:75672kB inactive_file:74756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:155189248kB managed:152711608kB mlocked:0kB dirty:0kB writeback:8kB mapped:32116kB shmem:32080kB slab_reclaimable:482900kB slab_unreclaimable:291940kB kernel_stack:19024kB pagetables:279356kB unstable:0kB bounce:0kB free_pcp:8804kB local_pcp:264kB free_cma:0kB writeback_tmp:0kB pages_scanned:261380 all_unreclaimable? yes
Jan 12 12:13:52 es7990e1-vm1 kernel: lowmem_reserve[]: 0 0 0 0
Jan 12 12:13:52 es7990e1-vm1 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Node 0 DMA32: 247*4kB (UE) 268*8kB (UEM) 209*16kB (UE) 161*32kB (UE) 83*64kB (UEM) 10*128kB (UEM) 94*256kB (UEM) 88*512kB (UE) 9*1024kB (EM) 39*2048kB (UEM) 115*4096kB (UEM) = 647468kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Node 0 Normal: 8198*4kB (UEM) 8367*8kB (UEM) 6746*16kB (UE) 4778*32kB (UEM) 3271*64kB (UEM) 1641*128kB (UE) 694*256kB (UEM) 274*512kB (UEM) 146*1024kB (UEM) 110*2048kB (UE) 1675*4096kB (U) = 8333488kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 12 12:13:53 es7990e1-vm1 kernel: 46685 total pagecache pages
Jan 12 12:13:53 es7990e1-vm1 kernel: 227 pages in swap cache
Jan 12 12:13:53 es7990e1-vm1 kernel: Swap cache stats: add 2739967, delete 2739738, find 67504/71273
Jan 12 12:13:53 es7990e1-vm1 kernel: Free swap = 0kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Total swap = 5472252kB
Jan 12 12:13:53 es7990e1-vm1 kernel: 39321462 pages RAM
Jan 12 12:13:53 es7990e1-vm1 kernel: 0 pages HighMem/MovableOnly
Jan 12 12:13:53 es7990e1-vm1 kernel: 897967 pages reserved
Jan 12 12:13:53 es7990e1-vm1 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 807] 0 807 35383 9157 70 31 0 systemd-journal
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 838] 0 838 11412 5 23 134 -1000 systemd-udevd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 973] 0 973 152045 0 38 208 0 lvmetad
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1737] 0 1737 13883 22 27 89 -1000 auditd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1766] 0 1766 6596 20 18 54 0 systemd-logind
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1770] 0 1770 5444 50 16 65 0 irqbalance
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1775] 0 1775 22652 0 47 224 0 rngd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1790] 998 1790 2145 7 10 30 0 lsmd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1800] 32 1800 17314 14 38 129 0 rpcbind
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1806] 81 1806 15046 51 34 82 -900 dbus-daemon
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1919] 0 1919 50357 11 39 123 0 gssproxy
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1970] 0 1970 13220 1 32 205 0 smartd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 2899] 0 2899 25736 1 48 517 0 dhclient
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3184] 0 3184 6261 0 17 58 0 xinetd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3193] 0 3193 76950 5874 80 348 0 rsyslogd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3198] 0 3198 28235 1 57 257 -1000 sshd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3235] 0 3235 6477 0 18 53 0 atd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3247] 0 3247 57127 0 40 188 0 sharpd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3277] 0 3277 27551 1 10 33 0 agetty
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3279] 0 3279 27551 1 13 33 0 agetty
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3472] 0 3472 31596 20 20 135 0 crond
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3473] 38 3473 6954 1 17 150 0 ntpd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 6542] 999 6542 153604 0 63 883 0 polkitd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 6555] 29 6555 16802 0 38 285 0 rpc.statd
Jan 12 12:13:53 es7990e1-vm1 kernel: [13484] 0 13484 39204 40 77 303 0 sshd
Jan 12 12:13:53 es7990e1-vm1 kernel: [13486] 0 13486 29271 1 13 506 0 bash
Jan 12 12:13:53 es7990e1-vm1 kernel: [13770] 0 13770 32001 1 18 144 0 screen
Jan 12 12:13:53 es7990e1-vm1 kernel: [13771] 0 13771 29273 1 14 489 0 bash
Jan 12 12:13:53 es7990e1-vm1 kernel: [13950] 0 13950 35405515 33909205 68922 1355850 0 e2freefrag
Jan 12 12:13:53 es7990e1-vm1 kernel: [13951] 0 13951 27013 0 9 25 0 tee
Jan 12 12:13:53 es7990e1-vm1 kernel: [13984] 0 13984 40796 372 36 77 0 top
Jan 12 12:13:53 es7990e1-vm1 kernel: Out of memory: Kill process 13950 (e2freefrag) score 861 or sacrifice child
Jan 12 12:13:53 es7990e1-vm1 kernel: Killed process 13950 (e2freefrag), UID 0, total-vm:141622060kB, anon-rss:135636820kB, file-rss:0kB, shmem-rss:0kB
Jan 12 12:13:53 es7990e1-vm1 kernel: systemd-journal invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jan 12 12:13:53 es7990e1-vm1 kernel: systemd-journal cpuset=/ mems_allowed=0
Jan 12 12:13:53 es7990e1-vm1 kernel: CPU: 4 PID: 807 Comm: systemd-journal Kdump: loaded Tainted: G OE ------------ T 3.10.0-1160.31.1.el7_lustre.ddn15.x86_64

Setting vm.min_free_kbytes = 8388608 didn't help in this case. |
| Comments |
| Comment by Andreas Dilger [ 12/Jan/22 ] |
|
This is likely because e2freefrag was changed to use a kernel ioctl to fetch the free-extent maps, rather than scanning the block bitmaps directly from the device.
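For reference, here is a minimal sketch of what a free-extent query through the kernel's GETFSMAP interface looks like. This is only an illustration of the ioctl, not the actual e2freefrag code; the iteration logic is simplified, and GETFSMAP must be issued against a file descriptor on the mounted filesystem.

/*
 * Sketch of a GETFSMAP free-extent query (illustration only).
 * FS_IOC_GETFSMAP exists since Linux 4.12; on RHEL7/RHEL8 kernels the
 * ioctl is unsupported, which is what forces the bitmap-scanning
 * fallback discussed in the next comment.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fsmap.h>

#define NR_RECS 128

int main(int argc, char **argv)
{
	struct fsmap_head *head;
	unsigned int i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
	if (!head)
		return 1;
	head->fmh_count = NR_RECS;
	/* High key of all-ones: ask for every mapping on the device. */
	head->fmh_keys[1].fmr_device = UINT32_MAX;
	head->fmh_keys[1].fmr_flags = UINT32_MAX;
	head->fmh_keys[1].fmr_physical = UINT64_MAX;
	head->fmh_keys[1].fmr_owner = UINT64_MAX;
	head->fmh_keys[1].fmr_offset = UINT64_MAX;

	for (;;) {
		if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
			perror("FS_IOC_GETFSMAP"); /* EOPNOTSUPP on RHEL7/8 */
			break;
		}
		if (head->fmh_entries == 0)
			break;
		for (i = 0; i < head->fmh_entries; i++) {
			struct fsmap *r = &head->fmh_recs[i];

			/* Free space is reported under the special owner FMR_OWN_FREE. */
			if ((r->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
			    r->fmr_owner == FMR_OWN_FREE)
				printf("free: %llu bytes at %llu\n",
				       (unsigned long long)r->fmr_length,
				       (unsigned long long)r->fmr_physical);
		}
		if (head->fmh_recs[head->fmh_entries - 1].fmr_flags & FMR_OF_LAST)
			break;
		/* Resume after the last record returned (the fsmap_advance() pattern). */
		head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
	}
	free(head);
	close(fd);
	return 0;
}

Since this query is answered entirely by the kernel, a tool using it never has to hold per-group bitmaps in userspace; the memory question only arises on the fallback path. |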
| Comment by Andreas Dilger [ 19/Mar/22 ] |
|
Shuichi, I looked through the e2freefrag code, and I don't see anywhere in that code that directly allocates memory. The RHEL7 and RHEL8 kernels do not support the GETFSMAP ioctl, so running e2freefrag on those systems should immediately print "/dev/xxxx: Operation not supported while calling fsmap", which means it will fall back to scanning each group's block allocation bitmaps on disk. The ext2fs_open() call allocates bitmaps for the filesystem, but this should not use more memory than e.g. e2fsck (which allocates many more bitmaps to manage state). The memory allocation for a block bitmap is 4KB per 128MB (= 32KB per 1GB) of filesystem, so 300TB / 1GB * 32KB = 9.6GB, plus an additional 1/32 of that for inode allocation bitmaps (= 300MB). It also looks like, since 1.46.0, this code tries to use multi-threaded locking for the filesystem metadata. Does "ltrace e2freefrag /dev/XXX" make it beyond ext2fs_open()? If this is where the allocation is failing, it would also be useful to run "strace e2freefrag /dev/XXXX" to watch for sbrk() calls to see how much memory is being allocated.
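To make that arithmetic concrete, here is a quick back-of-the-envelope sketch (assuming the ext4 defaults above: 4KiB blocks, so one 4KiB block bitmap per 128MiB block group, with inode bitmaps adding roughly another 1/32; these are estimates, not measurements):

#include <stdio.h>

int main(void)
{
	unsigned long long fs_size = 300ULL << 40;            /* 300 TiB */
	unsigned long long groups = fs_size / (128ULL << 20); /* one block group per 128 MiB */
	unsigned long long block_bm = groups * 4096;          /* one 4 KiB block bitmap per group */
	unsigned long long inode_bm = block_bm / 32;          /* inode bitmaps ~1/32 of that */

	printf("%llu groups: block bitmaps %.1f GiB, inode bitmaps %.1f GiB\n",
	       groups, block_bm / (double)(1ULL << 30),
	       inode_bm / (double)(1ULL << 30));
	/* prints: 2457600 groups: block bitmaps 9.4 GiB, inode bitmaps 0.3 GiB */
	return 0;
}

That expected footprint is more than an order of magnitude below the ~135GB anon RSS the OOM killer reported for e2freefrag above, so the bitmaps alone do not explain the memory use.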
|
| Comment by Shuichi Ihara [ 21/Mar/22 ] |
|
Here is "ltrace -ttT -f e2freefrag /dev/sda" output. OST is still 300TB and 85% of 300TB keeps filled up. 6173 08:55:49.123347 __libc_start_main(0x400d40, 2, 0x7ffee42df098, 0x401990 <unfinished ...> 6173 08:55:49.124223 add_error_table(0x603120, 0x7ffee42df098, 0x7ffee42df0b0, 0x401990) = 0 <0.000254> 6173 08:55:49.124501 getopt(2, 0x7ffee42df098, "c:h") = -1 <0.000113> 6173 08:55:49.124642 ext2fs_open(0x7ffee42df3b5, 0x4020400, 0, 0) = 0 <3.110846> 6173 08:55:52.235572 __fprintf_chk(0x7f1054c5e400, 1, 0x401c69, 0xdd91b0) = 17 <0.000331> 6173 08:55:52.235940 __fprintf_chk(0x7f1054c5e400, 1, 0x401c75, 4096) = 22 <0.000153> 6173 08:55:52.236126 ext2fs_check_mount_point(0xdd91b0, 0x7ffee42ddd1c, 0x7ffee42ddf70, 4096) = 0 <0.002036> 6173 08:55:52.238194 ext2fs_free_blocks_count(0xddb4c0, 0x7ffee42ddd30, 0x16070, 12) = 0x3205a91de <0.000100> 6173 08:55:52.238332 ext2fs_read_block_bitmap(0xdd9030, 0x7ffee42ddd30, 0x205a91de, 12 <no return ...> 6192 09:45:39.446987 +++ killed by SIGKILL +++ 6193 09:45:39.450489 +++ killed by SIGKILL +++ 6191 09:45:39.450503 +++ killed by SIGKILL +++ 6190 09:45:39.450512 +++ killed by SIGKILL +++ 6189 09:45:39.450520 +++ killed by SIGKILL +++ 6188 09:45:39.450528 +++ killed by SIGKILL +++ 6187 09:45:39.450536 +++ killed by SIGKILL +++ 6185 09:45:39.450544 +++ killed by SIGKILL +++ 6184 09:45:39.450552 +++ killed by SIGKILL +++ 6183 09:45:39.450560 +++ killed by SIGKILL +++ 6182 09:45:39.450567 +++ killed by SIGKILL +++ 6181 09:45:39.450575 +++ killed by SIGKILL +++ 6180 09:45:39.450583 +++ killed by SIGKILL +++ 6179 09:45:39.450590 +++ killed by SIGKILL +++ 6178 09:45:39.450598 +++ killed by SIGKILL +++ 6177 09:45:39.450762 +++ killed by SIGKILL +++ 6176 09:45:39.450773 +++ killed by SIGKILL +++ 6175 09:45:39.450782 +++ killed by SIGKILL +++ 6174 09:45:39.450790 +++ killed by SIGKILL +++ 6186 09:45:47.466515 +++ killed by SIGKILL +++ 6173 09:45:47.466568 +++ killed by SIGKILL +++ Also this is collected number of syscall # strace -fc -o strace.log e2freefrag % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 97.91 12387.779811 7380 1678443 22485 futex 1.08 136.446473 155 879711 pread64 1.01 127.198697 3 37861227 mprotect 0.00 0.445570 11 37856 fadvise64 0.00 0.009892 4 2362 mmap 0.00 0.005309 4 1171 munmap 0.00 0.000462 23 20 clone 0.00 0.000365 17 21 set_robust_list 0.00 0.000195 13 14 4 open 0.00 0.000179 16 11 read 0.00 0.000084 7 11 close 0.00 0.000080 8 9 fstat 0.00 0.000042 2 20 stat 0.00 0.000032 16 2 write 0.00 0.000032 8 4 brk 0.00 0.000018 9 2 rt_sigaction 0.00 0.000015 15 1 execve 0.00 0.000015 15 1 fsync 0.00 0.000011 11 1 1 access 0.00 0.000010 5 2 ioctl 0.00 0.000009 9 1 getuid 0.00 0.000009 9 1 getgid 0.00 0.000009 9 1 prctl 0.00 0.000008 8 1 getegid 0.00 0.000008 8 1 arch_prctl 0.00 0.000008 8 1 set_tid_address 0.00 0.000007 7 1 rt_sigprocmask 0.00 0.000007 7 1 getrlimit 0.00 0.000006 6 1 geteuid 0.00 0.000000 0 2 lstat 0.00 0.000000 0 4 getdents 0.00 0.000000 0 3 1 openat ------ ----------- ----------- --------- --------- ---------------- 100.00 12651.887363 40460907 22491 total both cases eventually didn't complete due to OOM though. |