[LU-15442] e2freefrag can't complete because of OOM Created: 12/Jan/22  Updated: 21/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On a single 300TB OST that is 85% full, e2freefrag is not able to finish because of the OOM shown below.

# rpm -qa | grep e2fsprogs
e2fsprogs-1.46.2.wc3-0.el7.x86_64
e2fsprogs-libs-1.46.2.wc3-0.el7.x86_64
e2fsprogs-devel-1.46.2.wc3-0.el7.x86_64

# df -t lustre
Filesystem                              1K-blocks         Used   Available Use% Mounted on
/dev/sda                             313826717284 262586960268 48064149188  85% /lustre/ost0000

# e2freefrag /dev/sda
Jan 12 12:13:52 es7990e1-vm1 kernel: Call Trace:
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf835a9>] dump_stack+0x19/0x1b
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf7e648>] dump_header+0x90/0x229
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac906492>] ? ktime_get_ts64+0x52/0xf0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac95db1f>] ? delayacct_end+0x8f/0xb0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9c204d>] oom_kill_process+0x2cd/0x490
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9c1a3d>] ? oom_unkillable_task+0xcd/0x120
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9c273a>] out_of_memory+0x31a/0x500
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9c9354>] __alloc_pages_nodemask+0xad4/0xbe0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffaca1c739>] alloc_pages_vma+0xa9/0x200
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffac9f6337>] handle_mm_fault+0xcb7/0xfb0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf90653>] __do_page_fault+0x213/0x500
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf90a26>] trace_do_page_fault+0x56/0x150
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf8ffa2>] do_async_page_fault+0x22/0xf0
Jan 12 12:13:52 es7990e1-vm1 kernel: [<ffffffffacf8c7a8>] async_page_fault+0x28/0x30
Jan 12 12:13:52 es7990e1-vm1 kernel: Mem-Info:
Jan 12 12:13:52 es7990e1-vm1 kernel: active_anon:32965316 inactive_anon:953248 isolated_anon:0
Jan 12 12:13:52 es7990e1-vm1 kernel:  active_file:19299 inactive_file:18944 isolated_file:0
Jan 12 12:13:52 es7990e1-vm1 kernel:  unevictable:0 dirty:0 writeback:2 unstable:0
Jan 12 12:13:52 es7990e1-vm1 kernel:  slab_reclaimable:122379 slab_unreclaimable:74262
Jan 12 12:13:52 es7990e1-vm1 kernel:  mapped:8092 shmem:8083 pagetables:69898 bounce:0
Jan 12 12:13:52 es7990e1-vm1 kernel:  free:2249818 free_pcp:3985 free_cma:0
Jan 12 12:13:52 es7990e1-vm1 kernel: Node 0 DMA free:15892kB min:868kB low:1084kB high:1300kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 12 12:13:52 es7990e1-vm1 kernel: lowmem_reserve[]: 0 943 150076 150076
Jan 12 12:13:52 es7990e1-vm1 kernel: Node 0 DMA32 free:648832kB min:52748kB low:65932kB high:79120kB active_anon:59992kB inactive_anon:47604kB active_file:1524kB inactive_file:1020kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080608kB managed:966464kB mlocked:0kB dirty:0kB writeback:0kB mapped:252kB shmem:252kB slab_reclaimable:6616kB slab_unreclaimable:5092kB kernel_stack:384kB pagetables:236kB unstable:0kB bounce:0kB free_pcp:7136kB local_pcp:264kB free_cma:0kB writeback_tmp:0kB pages_scanned:5844902 all_unreclaimable? yes
Jan 12 12:13:52 es7990e1-vm1 kernel: lowmem_reserve[]: 0 0 149132 149132
Jan 12 12:13:52 es7990e1-vm1 kernel: Node 0 Normal free:8334548kB min:8334988kB low:10418732kB high:12502480kB active_anon:131801272kB inactive_anon:3765388kB active_file:75672kB inactive_file:74756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:155189248kB managed:152711608kB mlocked:0kB dirty:0kB writeback:8kB mapped:32116kB shmem:32080kB slab_reclaimable:482900kB slab_unreclaimable:291940kB kernel_stack:19024kB pagetables:279356kB unstable:0kB bounce:0kB free_pcp:8804kB local_pcp:264kB free_cma:0kB writeback_tmp:0kB pages_scanned:261380 all_unreclaimable? yes
Jan 12 12:13:52 es7990e1-vm1 kernel: lowmem_reserve[]: 0 0 0 0
Jan 12 12:13:52 es7990e1-vm1 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Node 0 DMA32: 247*4kB (UE) 268*8kB (UEM) 209*16kB (UE) 161*32kB (UE) 83*64kB (UEM) 10*128kB (UEM) 94*256kB (UEM) 88*512kB (UE) 9*1024kB (EM) 39*2048kB (UEM) 115*4096kB (UEM) = 647468kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Node 0 Normal: 8198*4kB (UEM) 8367*8kB (UEM) 6746*16kB (UE) 4778*32kB (UEM) 3271*64kB (UEM) 1641*128kB (UE) 694*256kB (UEM) 274*512kB (UEM) 146*1024kB (UEM) 110*2048kB (UE) 1675*4096kB (U) = 8333488kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 12 12:13:53 es7990e1-vm1 kernel: 46685 total pagecache pages
Jan 12 12:13:53 es7990e1-vm1 kernel: 227 pages in swap cache
Jan 12 12:13:53 es7990e1-vm1 kernel: Swap cache stats: add 2739967, delete 2739738, find 67504/71273
Jan 12 12:13:53 es7990e1-vm1 kernel: Free swap  = 0kB
Jan 12 12:13:53 es7990e1-vm1 kernel: Total swap = 5472252kB
Jan 12 12:13:53 es7990e1-vm1 kernel: 39321462 pages RAM
Jan 12 12:13:53 es7990e1-vm1 kernel: 0 pages HighMem/MovableOnly
Jan 12 12:13:53 es7990e1-vm1 kernel: 897967 pages reserved
Jan 12 12:13:53 es7990e1-vm1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jan 12 12:13:53 es7990e1-vm1 kernel: [  807]     0   807    35383     9157      70       31             0 systemd-journal
Jan 12 12:13:53 es7990e1-vm1 kernel: [  838]     0   838    11412        5      23      134         -1000 systemd-udevd
Jan 12 12:13:53 es7990e1-vm1 kernel: [  973]     0   973   152045        0      38      208             0 lvmetad
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1737]     0  1737    13883       22      27       89         -1000 auditd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1766]     0  1766     6596       20      18       54             0 systemd-logind
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1770]     0  1770     5444       50      16       65             0 irqbalance
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1775]     0  1775    22652        0      47      224             0 rngd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1790]   998  1790     2145        7      10       30             0 lsmd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1800]    32  1800    17314       14      38      129             0 rpcbind
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1806]    81  1806    15046       51      34       82          -900 dbus-daemon
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1919]     0  1919    50357       11      39      123             0 gssproxy
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 1970]     0  1970    13220        1      32      205             0 smartd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 2899]     0  2899    25736        1      48      517             0 dhclient
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3184]     0  3184     6261        0      17       58             0 xinetd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3193]     0  3193    76950     5874      80      348             0 rsyslogd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3198]     0  3198    28235        1      57      257         -1000 sshd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3235]     0  3235     6477        0      18       53             0 atd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3247]     0  3247    57127        0      40      188             0 sharpd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3277]     0  3277    27551        1      10       33             0 agetty
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3279]     0  3279    27551        1      13       33             0 agetty
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3472]     0  3472    31596       20      20      135             0 crond
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 3473]    38  3473     6954        1      17      150             0 ntpd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 6542]   999  6542   153604        0      63      883             0 polkitd
Jan 12 12:13:53 es7990e1-vm1 kernel: [ 6555]    29  6555    16802        0      38      285             0 rpc.statd
Jan 12 12:13:53 es7990e1-vm1 kernel: [13484]     0 13484    39204       40      77      303             0 sshd
Jan 12 12:13:53 es7990e1-vm1 kernel: [13486]     0 13486    29271        1      13      506             0 bash
Jan 12 12:13:53 es7990e1-vm1 kernel: [13770]     0 13770    32001        1      18      144             0 screen
Jan 12 12:13:53 es7990e1-vm1 kernel: [13771]     0 13771    29273        1      14      489             0 bash
Jan 12 12:13:53 es7990e1-vm1 kernel: [13950]     0 13950 35405515 33909205   68922  1355850             0 e2freefrag
Jan 12 12:13:53 es7990e1-vm1 kernel: [13951]     0 13951    27013        0       9       25             0 tee
Jan 12 12:13:53 es7990e1-vm1 kernel: [13984]     0 13984    40796      372      36       77             0 top
Jan 12 12:13:53 es7990e1-vm1 kernel: Out of memory: Kill process 13950 (e2freefrag) score 861 or sacrifice child
Jan 12 12:13:53 es7990e1-vm1 kernel: Killed process 13950 (e2freefrag), UID 0, total-vm:141622060kB, anon-rss:135636820kB, file-rss:0kB, shmem-rss:0kB
Jan 12 12:13:53 es7990e1-vm1 kernel: systemd-journal invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jan 12 12:13:53 es7990e1-vm1 kernel: systemd-journal cpuset=/ mems_allowed=0
Jan 12 12:13:53 es7990e1-vm1 kernel: CPU: 4 PID: 807 Comm: systemd-journal Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.31.1.el7_lustre.ddn15.x86_64 #

Setting vm.min_free_kbytes = 8388608 didn't help in this case.



 Comments   
Comment by Andreas Dilger [ 12/Jan/22 ]

This is likely because e2freefrag was changed to use a kernel ioctl to get free extent maps rather than scanning the block bitmaps directly from the device.

Comment by Andreas Dilger [ 19/Mar/22 ]

Shuichi, I looked through the e2freefrag code, and I don't see any place in it that directly allocates memory. Also, the RHEL7 and RHEL8 kernels do not support the GETFSMAP ioctl, so running e2freefrag on those systems should immediately print "/dev/xxxx: Operation not supported while calling fsmap", which means it falls back to scanning each group's block allocation bitmaps on disk.
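For illustration, a minimal sketch of how a tool can probe for GETFSMAP support and hit the EOPNOTSUPP fallback case described above. This is not the actual e2freefrag code; the ioctl and struct names follow the Linux UAPI header <linux/fsmap.h>, which only exists with reasonably recent kernel headers:

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fsmap.h>	/* FS_IOC_GETFSMAP, struct fsmap_head */

/* Returns 1 if GETFSMAP works, 0 if the kernel lacks it (the RHEL7/8
 * case, which forces the bitmap-scanning fallback), or -errno. */
static int fsmap_supported(const char *device)
{
	struct fsmap_head head;
	int fd, ret;

	fd = open(device, O_RDONLY);
	if (fd < 0)
		return -errno;

	memset(&head, 0, sizeof(head));
	/* Low key all zeroes, high key all ones: query the whole device.
	 * fmh_count == 0 asks only for a record count, so the probe
	 * itself allocates no extent buffer. */
	head.fmh_keys[1].fmr_device = UINT32_MAX;
	head.fmh_keys[1].fmr_physical = UINT64_MAX;
	head.fmh_keys[1].fmr_owner = UINT64_MAX;
	head.fmh_keys[1].fmr_offset = UINT64_MAX;
	head.fmh_keys[1].fmr_flags = UINT32_MAX;

	ret = ioctl(fd, FS_IOC_GETFSMAP, &head);
	if (ret < 0) {
		int err = errno;	/* save before close() clobbers it */
		close(fd);
		return err == EOPNOTSUPP ? 0 : -err;
	}
	close(fd);
	return 1;
}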

The ext2fs_open() call is allocating bitmaps for the filesystem, but this should not use more memory than e.g. e2fsck (which allocates many more bitmaps to manage state). The memory allocation for a block bitmap is 4KB per 128MB (= 32KB per 1GB) of the filesystem, so 300TB / 1GB * 32KB = 9.6GB, plus an additional 1/32 of that for inode allocation bitmaps (=300MB). It looks like since 1.46.0 this is trying to use multi-threaded locking for the filesystem metadata, which might be causing each thread to allocate a full set of bitmaps.
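To make that arithmetic concrete, a small stand-alone calculation, assuming 4KiB blocks, the default 128MiB block groups, and the 1/32 inode-bitmap ratio mentioned above (these are assumptions, not values read from this particular OST):

#include <stdio.h>

int main(void)
{
	unsigned long long fs_bytes      = 300ULL << 40;	/* 300 TiB filesystem */
	unsigned long long group_bytes   = 128ULL << 20;	/* default 128 MiB block group */
	unsigned long long bitmap_bytes  = 4096;		/* 4 KiB block bitmap per group */
	unsigned long long groups        = fs_bytes / group_bytes;
	unsigned long long block_bitmaps = groups * bitmap_bytes;
	unsigned long long inode_bitmaps = block_bitmaps / 32;	/* ratio from above */

	printf("block groups:        %llu\n", groups);
	printf("block bitmap memory: %.1f GiB\n",
	       block_bitmaps / (double)(1ULL << 30));
	printf("inode bitmap memory: %.1f MiB\n",
	       inode_bitmaps / (double)(1ULL << 20));
	return 0;
}

That works out to about 2.4M block groups, roughly 9.4GiB of block bitmaps, and about 300MiB of inode bitmaps; nowhere near the ~135GB of anonymous RSS the OOM killer reported for e2freefrag above, which is why the per-thread bitmap duplication theory is worth checking.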

Does "ltrace e2freefrag /dev/XXX" make it beyond ext2fs_open()? If this is where the allocation is failing, it would also be useful to run "{[strace e2freefrag /dev/XXXX}}" to watch for sbrk() calls to see how much memory is being allocated.

One option would be to remove the EXT2_FLAG_THREADS option from the ext2fs_open() call in e2freefrag.c::open_device() to see if this reduces the memory usage.
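An untested sketch of that experiment follows. The exact flag combination in e2freefrag.c::open_device() may differ between e2fsprogs versions (the "previously" line is my reading of the 1.46 code, not a verified quote); the change is simply dropping EXT2_FLAG_THREADS:

#include <stdlib.h>
#include <et/com_err.h>
#include <ext2fs/ext2fs.h>

static void open_device(char *device_name, ext2_filsys *fs)
{
	/* previously: EXT2_FLAG_FORCE | EXT2_FLAG_64BITS | EXT2_FLAG_THREADS */
	int flags = EXT2_FLAG_FORCE | EXT2_FLAG_64BITS;
	errcode_t retval;

	retval = ext2fs_open(device_name, flags, 0, 0, unix_io_manager, fs);
	if (retval) {
		com_err(device_name, retval, "while opening filesystem");
		exit(1);
	}
}

If memory usage drops back to the ~10GB bitmap estimate with this change, that would confirm the multi-threaded bitmap handling as the culprit.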

Comment by Shuichi Ihara [ 21/Mar/22 ]

Here is "ltrace -ttT -f e2freefrag /dev/sda" output. OST is still 300TB and 85% of 300TB keeps filled up.

6173 08:55:49.123347 __libc_start_main(0x400d40, 2, 0x7ffee42df098, 0x401990 <unfinished ...>
6173 08:55:49.124223 add_error_table(0x603120, 0x7ffee42df098, 0x7ffee42df0b0, 0x401990)  = 0 <0.000254>
6173 08:55:49.124501 getopt(2, 0x7ffee42df098, "c:h")                                     = -1 <0.000113>
6173 08:55:49.124642 ext2fs_open(0x7ffee42df3b5, 0x4020400, 0, 0)                         = 0 <3.110846>
6173 08:55:52.235572 __fprintf_chk(0x7f1054c5e400, 1, 0x401c69, 0xdd91b0)                 = 17 <0.000331>
6173 08:55:52.235940 __fprintf_chk(0x7f1054c5e400, 1, 0x401c75, 4096)                     = 22 <0.000153>
6173 08:55:52.236126 ext2fs_check_mount_point(0xdd91b0, 0x7ffee42ddd1c, 0x7ffee42ddf70, 4096) = 0 <0.002036>
6173 08:55:52.238194 ext2fs_free_blocks_count(0xddb4c0, 0x7ffee42ddd30, 0x16070, 12)      = 0x3205a91de <0.000100>
6173 08:55:52.238332 ext2fs_read_block_bitmap(0xdd9030, 0x7ffee42ddd30, 0x205a91de, 12 <no return ...>
6192 09:45:39.446987 +++ killed by SIGKILL +++
6193 09:45:39.450489 +++ killed by SIGKILL +++
6191 09:45:39.450503 +++ killed by SIGKILL +++
6190 09:45:39.450512 +++ killed by SIGKILL +++
6189 09:45:39.450520 +++ killed by SIGKILL +++
6188 09:45:39.450528 +++ killed by SIGKILL +++
6187 09:45:39.450536 +++ killed by SIGKILL +++
6185 09:45:39.450544 +++ killed by SIGKILL +++
6184 09:45:39.450552 +++ killed by SIGKILL +++
6183 09:45:39.450560 +++ killed by SIGKILL +++
6182 09:45:39.450567 +++ killed by SIGKILL +++
6181 09:45:39.450575 +++ killed by SIGKILL +++
6180 09:45:39.450583 +++ killed by SIGKILL +++
6179 09:45:39.450590 +++ killed by SIGKILL +++
6178 09:45:39.450598 +++ killed by SIGKILL +++
6177 09:45:39.450762 +++ killed by SIGKILL +++
6176 09:45:39.450773 +++ killed by SIGKILL +++
6175 09:45:39.450782 +++ killed by SIGKILL +++
6174 09:45:39.450790 +++ killed by SIGKILL +++
6186 09:45:47.466515 +++ killed by SIGKILL +++
6173 09:45:47.466568 +++ killed by SIGKILL +++

Also, here is the collected syscall count:

# strace -fc -o strace.log e2freefrag
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.91 12387.779811        7380   1678443     22485 futex
  1.08  136.446473         155    879711           pread64
  1.01  127.198697           3  37861227           mprotect
  0.00    0.445570          11     37856           fadvise64
  0.00    0.009892           4      2362           mmap
  0.00    0.005309           4      1171           munmap
  0.00    0.000462          23        20           clone
  0.00    0.000365          17        21           set_robust_list
  0.00    0.000195          13        14         4 open
  0.00    0.000179          16        11           read
  0.00    0.000084           7        11           close
  0.00    0.000080           8         9           fstat
  0.00    0.000042           2        20           stat
  0.00    0.000032          16         2           write
  0.00    0.000032           8         4           brk
  0.00    0.000018           9         2           rt_sigaction
  0.00    0.000015          15         1           execve
  0.00    0.000015          15         1           fsync
  0.00    0.000011          11         1         1 access
  0.00    0.000010           5         2           ioctl
  0.00    0.000009           9         1           getuid
  0.00    0.000009           9         1           getgid
  0.00    0.000009           9         1           prctl
  0.00    0.000008           8         1           getegid
  0.00    0.000008           8         1           arch_prctl
  0.00    0.000008           8         1           set_tid_address
  0.00    0.000007           7         1           rt_sigprocmask
  0.00    0.000007           7         1           getrlimit
  0.00    0.000006           6         1           geteuid
  0.00    0.000000           0         2           lstat
  0.00    0.000000           0         4           getdents
  0.00    0.000000           0         3         1 openat
------ ----------- ----------- --------- --------- ----------------
100.00 12651.887363              40460907     22491 total

In both cases, the run eventually failed to complete due to OOM.
I didn't see a -m option in e2freefrag, and CPU usage is still limited to 100% (a single CPU core). Do we have multi-threaded reads in e2freefrag as well?
