[LU-16713] Writeback and commit pages under memory pressure to avoid OOM Created: 05/Apr/23 Updated: 27/Oct/23 Resolved: 26/Sep/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Qian Yingjin | Assignee: | Qian Yingjin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Description |
|
We've tried to solve this in the past by integrating NFS-style unstable page tracking into Lustre, but this is fraught: it treats our uncommitted pages as dirty, which means we get rate-limited on them. The kernel's idea of an appropriate number of outstanding pages is based on local file systems and isn't enough for us, so this causes performance issues. The SOFT_SYNC feature we created to work with unstable pages also just asks the OST nicely to do a commit, and includes no way for the client to be notified quickly. The Linux kernel already has a mature solution for OOM handling with cgroups. Since unstable page accounting in the kernel may hurt performance, we need to optimize the unstable page accounting code in a next phase of work. |
| Comments |
| Comment by Gerrit Updater [ 05/Apr/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50544 | |||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Qian Yingjin [ 12/Apr/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||
|
Some benchmark results (total memory: 512G):

a. Without memcg limits:
stripe_count: 1
cmd: dd if=/dev/zero of=test bs=1M count=$size

b. With memcg limits on the patched master:
stripe_count: 1
cmd: bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=$DIR/$tfile bs=1M count=$((memlimit_mb * time))"
io_size = $time x $memlimit_mb, where $time = {2, 1, 0.5}

The performance shows no obvious degradation with memcg limits.
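For reference, a minimal sketch of the cgroup-v1 memory-limit setup that the memcg-limited command above assumes (the cgroup name, limit and file path here are only examples; the full test scripts in the comments below perform the same steps):

#!/bin/bash
# Create a memory cgroup (cgroup v1) and cap it at 4 GiB (example value).
cgdir=/sys/fs/cgroup/memory/lustre_test
mkdir $cgdir
echo $((4096 * 1024 * 1024)) > $cgdir/memory.limit_in_bytes
# Run dd in a child shell that first moves itself into the cgroup, so all of
# its page cache is charged against the 4 GiB limit.
bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=8192"
# The child has exited, so the cgroup is now empty and can be removed.
rmdir $cgdir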
|
| Comment by Qian Yingjin [ 12/Apr/23 ] |
|
Multiple-cgroup test results (dd write performance). The test script:
error() {
echo "$@"
exit 1
}
DIR="/exafs"
tdir="milti"
tfile="test"
dir=$DIR/$tdir
file=$dir/$tfile
cg_basedir=/sys/fs/cgroup/memory
cgdir=$cg_basedir/$tfile
memlimit_mb=$1
cnt=$2
declare -a pids
rm -rf $dir
sleep 2
mkdir $dir || error "failed to mkdir $dir"
# create one memory cgroup per dd writer and set its memory limit
for i in $(seq 1 $cnt); do
cgdir=$cg_basedir/${tfile}.$i
mkdir $cgdir || error "failed to mkdir $cgdir"
echo $((memlimit_mb * 1024 * 1024)) > $cgdir/memory.limit_in_bytes
cat $cgdir/memory.limit_in_bytes
done
echo 3 > /proc/sys/vm/drop_caches
# start one dd writer per cgroup, each writing twice its memory limit
for i in $(seq 1 $cnt); do
cgdir=$cg_basedir/$tfile.$i
(
bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=$dir/${tfile}.$i bs=1M count=$((memlimit_mb * 2))"
)&
pids[i]=$!
done
# wait for each writer, then remove its cgroup
for i in $(seq 1 $cnt); do
wait ${pids[$i]}
cgdir=$cg_basedir/$tfile.$i
rmdir $cg_basedir/${tfile}.$i || error "failed to rm $cgdir"
done
wait
sleep 3
Results (CMD: ./tmult.sh $memlimit_mb $cgcnt):

==== 4 cgroups ====
[root@ice01 scripts]# ./tmult.sh 1024 4
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.47427 s, 1.5 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.49274 s, 1.4 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.49886 s, 1.4 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.52199 s, 1.4 GB/s
[root@ice01 scripts]# ./tmult.sh 2048 4
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 2.93491 s, 1.5 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 2.94163 s, 1.5 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 2.94337 s, 1.5 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 2.97721 s, 1.4 GB/s
[root@ice01 scripts]# ./tmult.sh 4096 4
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.7354 s, 1.5 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.87343 s, 1.5 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.95922 s, 1.4 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.99732 s, 1.4 GB/s
[root@ice01 scripts]# ./tmult.sh 8192 4
17179869184 bytes (17 GB, 16 GiB) copied, 11.7261 s, 1.5 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 11.8024 s, 1.5 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 11.8868 s, 1.4 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 11.9072 s, 1.4 GB/s

==== 8 cgroups ====
[root@ice01 scripts]# ./tmult.sh 1024 8
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.68561 s, 1.3 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.69721 s, 1.3 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.70013 s, 1.3 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.71561 s, 1.3 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.71978 s, 1.2 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.74053 s, 1.2 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.76275 s, 1.2 GB/s
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.87241 s, 1.1 GB/s
[root@ice01 scripts]# ./tmult.sh 2048 8
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.40484 s, 1.3 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.46257 s, 1.2 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.47629 s, 1.2 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.4952 s, 1.2 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.50229 s, 1.2 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.52185 s, 1.2 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.53337 s, 1.2 GB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.60111 s, 1.2 GB/s
[root@ice01 scripts]# ./tmult.sh 4096 8
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.5593 s, 1.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.60015 s, 1.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.721 s, 1.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.75103 s, 1.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.77716 s, 1.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.85576 s, 1.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.85757 s, 1.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.89447 s, 1.2 GB/s
[root@ice01 scripts]# ./tmult.sh 8192 8
17179869184 bytes (17 GB, 16 GiB) copied, 12.7842 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.7889 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.9504 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.9577 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.4066 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.5397 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.5769 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.6605 s, 1.3 GB/s
|
| Comment by Qian Yingjin [ 13/Apr/23 ] |
|
Two processes, one inside a memcg with a limit and one without. The test script:
error() {
echo "$@"
exit 1
}
DIR="/exafs"
tdir="milti"
tfile="test"
dir=$DIR/$tdir
file=$dir/$tfile
cgfile=$dir/${tfile}.cg
cg_basedir=/sys/fs/cgroup/memory
cgdir=$cg_basedir/$tfile
memlimit_mb=$1
rm -rf $dir
sleep 2
mkdir $dir || error "failed to mkdir $dir"
cgdir=$cg_basedir/${tfile}
mkdir $cgdir || error "failed to mkdir $cgdir"
echo $((memlimit_mb * 1024 * 1024)) > $cgdir/memory.limit_in_bytes
cat $cgdir/memory.limit_in_bytes
echo 3 > /proc/sys/vm/drop_caches
# dd writer running inside the memory cgroup
(
bash -c "echo \$$ > $cgdir/tasks && dd if=/dev/zero of=${cgfile} bs=1M count=128000"
)&
cgpid=$!
# concurrent dd writer outside any memcg
dd if=/dev/zero of=$file bs=1M count=128000 &
pid=$!
wait $cgpid
wait $pid
rmdir $cgdir
The results are as follows (CMD: ./t2p.sh $memlimit_mb):

[root@ice01 scripts]# ./t2p.sh 1024
134217728000 bytes (134 GB, 125 GiB) copied, 61.5799 s, 2.2 GB/s
134217728000 bytes (134 GB, 125 GiB) copied, 103.386 s, 1.3 GB/s
[root@ice01 scripts]# ./t2p.sh 4096
134217728000 bytes (134 GB, 125 GiB) copied, 62.1537 s, 2.2 GB/s
134217728000 bytes (134 GB, 125 GiB) copied, 101.473 s, 1.3 GB/s
[root@ice01 scripts]# ./t2p.sh 16384
134217728000 bytes (134 GB, 125 GiB) copied, 60.7237 s, 2.2 GB/s
134217728000 bytes (134 GB, 125 GiB) copied, 93.3043 s, 1.4 GB/s
[root@ice01 scripts]# ./t2p.sh 32768
134217728000 bytes (134 GB, 125 GiB) copied, 61.221 s, 2.2 GB/s
134217728000 bytes (134 GB, 125 GiB) copied, 88.7582 s, 1.5 GB/s
[root@ice01 scripts]# ./t2p.sh 65536
134217728000 bytes (134 GB, 125 GiB) copied, 62.0085 s, 2.2 GB/s
134217728000 bytes (134 GB, 125 GiB) copied, 75.543 s, 1.8 GB/s
[root@ice01 scripts]# ./t2p.sh 131072
134217728000 bytes (134 GB, 125 GiB) copied, 63.2751 s, 2.1 GB/s
134217728000 bytes (134 GB, 125 GiB) copied, 64.2502 s, 2.1 GB/s

The results demonstrate that the memcg-limited process has nearly no impact on the performance of the process running without memcg limits.
| Comment by Gerrit Updater [ 13/Apr/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50625 | |||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Xing Huang [ 08/May/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||
|
2023-05-13: Two patches for this ticket; one has landed to master, the other is still being worked on. |
| Comment by Gerrit Updater [ 09/May/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50625/ | |||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Xing Huang [ 26/Jul/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||
|
2023-07-26: Two patches for this ticket; one has landed to master, while the patch the other depends on is still being worked on. |
| Comment by Qian Yingjin [ 17/Aug/23 ] |
|
Frequent, expensive fsync() calls would lead to much more frequent journal commits on the storage server, and the journaling overhead becomes rather significant, causing performance drops. We have therefore designed a mechanism called soft sync.

The client accounts the number of unstable pages per client/server pair. Upon completion of a write I/O request, the client adds the corresponding inode, which has pinned uncommitted pages, onto the dirty list of the super block or cgroup, and then increases the unstable page count accordingly. Every reply from the server piggybacks the last committed transaction number (last_committed) on that server; the client then commits write I/O requests whose transno is smaller than last_committed, unpins the corresponding pages and decreases the unstable page count accordingly.

When the system is under memory pressure, the kernel writeback thread is woken up and starts to write out data of the inodes on the dirty list to reclaim pages. If the purpose of the writeback is to commit pinned pages, the client first flushes any dirty pages to the servers. If the unstable page count for this client/server pair is zero, all unstable pages have already been committed and the client returns immediately. Otherwise, the client sends a soft sync request to the server together with a factor that indicates how urgent its memory pressure is. The intention of this operation is to commit pages belonging to a client that has too many outstanding unstable pages in its cache.

The server determines whether to begin an asynchronous journal commit based on the soft sync requests coming from clients and the time since its last commit. The server has a tunable global limit across all clients, named soft_sync_thrsh, which defines how much accumulated soft sync weight is allowed before an asynchronous journal commit is triggered; its value is 16 by default. Every soft sync request from a client contributes to the accumulated soft sync value on the server. The soft sync factor is calculated from the memory usage on the client by the formula:

factor = (1 - free_memory / tot_memory) * soft_sync_thrsh

where free_memory is the free memory size and tot_memory is the total memory size; both can be system-wide or cgroup-based. Once the accumulated soft sync value exceeds the predefined threshold, an asynchronous sync() is called on the server to start a journal commit.

The soft sync mechanism makes a tradeoff between the urgency of relieving memory pressure and server throughput: it dynamically shortens the journal commit interval on the server so that pages are not pinned longer than necessary when a client is under memory pressure.
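As a rough illustration of the factor formula above, here is a minimal sketch using system-wide numbers from /proc/meminfo and shell integer arithmetic (illustration only; the real client-side computation is done in the Lustre kernel code, and the cgroup-based variant would read the cgroup's own counters instead):

#!/bin/bash
# Illustration only: compute the soft sync factor from system-wide memory usage.
soft_sync_thrsh=16   # server-side default threshold described above
tot_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
# factor = (1 - free/total) * soft_sync_thrsh, rearranged for integer math
factor=$(( (tot_kb - free_kb) * soft_sync_thrsh / tot_kb ))
echo "soft sync factor: $factor (threshold: $soft_sync_thrsh)"
# Example: with 512G total and 128G free, (512 - 128) * 16 / 512 = 12, so a
# single request from this heavily loaded client already contributes 12 of
# the 16 units needed to trigger an asynchronous journal commit on the server.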
| Comment by Gerrit Updater [ 23/Sep/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52485 | |||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 26/Sep/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50544/ | |||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 26/Sep/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52485/ | |||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 26/Sep/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||
|
Landed for 2.16 |