[LU-13312] Optimized RA for stride read under memory pressure Created: 29/Feb/20  Updated: 17/Feb/21  Resolved: 17/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Minor
Reporter: Shuichi Ihara Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

master


Attachments: File lctl-dk-ra.txt.gz    
Issue Links:
Related

 Description   

LU-12518 introduced a new readahead (RA) implementation that supports page-unaligned stride IO and significantly improved performance (e.g. IO500 IOR_hard_read). However, it can still be optimized: the current code sometimes does not work well under memory pressure, but performance recovers after dropping page caches before the read. Here is a reproducer and results.

4 x client(1 x Gold 5218, 96GB RAM)
segment=400000 (~300GB per node)

# mpirun -np 64 ior -w -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300

# mpirun -np 64 ior -r -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300
 
Max Read:  5087.32 MiB/sec (5334.44 MB/sec)

RA stats from one of the clients:
# lctl get_param llite.*.read_ahead_stats
llite.fast-ffff99878133d000.read_ahead_stats=
snapshot_time             1582946538.113259755 secs.nsecs
hits                      72125088 samples [pages]
misses                    1686810 samples [pages]
readpage not consecutive  6400000 samples [pages]
miss inside window        3011 samples [pages]
failed grab_cache_page    2945424 samples [pages]
read but discarded        35565 samples [pages]
zero size window          100245 samples [pages]
failed to reach end       73663094 samples [pages]
failed to fast read       6396933 samples [pages]
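The hit ratio implied by this snapshot can be computed directly from the counters (a small sketch; the `hits`/`misses` values are pasted from the output above):

```shell
# compute the readahead hit ratio from the read_ahead_stats snapshot above
hits=72125088
misses=1686810
ratio=$(awk -v h="$hits" -v m="$misses" 'BEGIN{printf "%.1f", 100*h/(h+m)}')
echo "hit ratio: ${ratio}%"
```

Despite a ~97.7% hit ratio, the large "failed grab_cache_page" and "failed to reach end" counters suggest RA windows are being cut short.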

After dropping the page cache on the clients before the read:

# clush -a "echo 3 > /proc/sys/vm/drop_caches "
# mpirun -np 64 ior -r -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300

Max Read:  16244.62 MiB/sec (17033.72 MB/sec)

Client's RA stats:
# lctl get_param llite.*.read_ahead_stats
llite.fast-ffff99878133d000.read_ahead_stats=
snapshot_time             1582947544.040550353 secs.nsecs
hits                      73799940 samples [pages]
misses                    63 samples [pages]
readpage not consecutive  6400000 samples [pages]
failed grab_cache_page    2654231 samples [pages]
read but discarded        1 samples [pages]
zero size window          500 samples [pages]
failed to reach end       402367 samples [pages]
failed to fast read       35075 samples [pages]

 



 Comments   
Comment by Shuichi Ihara [ 29/Feb/20 ]

Attached is a debug=reada log from a bad-performance case.

Comment by Gerrit Updater [ 29/Feb/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37761
Subject: LU-13312 llite: improve RA under memory pressure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a8c77b91b3148aa0dbb9e494f2b771dd38910fd9

Comment by Wang Shilong (Inactive) [ 29/Feb/20 ]

To be clear for the ticket, there might be several problems here:

1) This behavior is not actually a regression from LU-12518; the memory allocation policy will always have this problem with or without LU-12518.

2) There are two main reasons that can make RA stop currently:
2.1 Memory pressure, which requires reclaiming some memory from the FS.
2.2 Lock contention with writes from other clients. This is especially a problem for workloads like
IO500 hard mode, which generates many PW locks from different clients during the write phase;
when the read phase starts, RA may not work well if PR locks cannot be grabbed ahead because
of lock contention detection (here).

We should isolate these problems and at least focus on problem 2.1 in this ticket.

Comment by Wang Shilong (Inactive) [ 29/Feb/20 ]

After checking the debug logs, there are many error lines like:

00020000:00400000:5.0:1582946316.433652:0:18406:0:(lov_io.c:1049:lov_io_read_ahead()) [0x200000404:0x8:0x0] cra_end = 0, stripes = 240, rc = -61

-61 is ENODATA, returned by osc_io_read_ahead(); it means readahead could not grab locks ahead. This might be related to
your "lru_max_age=100", sihara?

So that explains why running ldlm.namespaces.*.lru_size=clear before the read test starts helps: it guarantees there are no PW locks from other clients, so PR locks can be grabbed very aggressively, which makes our readahead work very well.
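The -61 failures can be counted in an `lctl dk` dump with a simple filter (a sketch; the sample line is the one quoted above, and the grep pattern assumes the lov_io_read_ahead() message format shown there):

```shell
# count readahead lock-grab failures (rc = -61, ENODATA) in a lctl dk dump
log='00020000:00400000:5.0:1582946316.433652:0:18406:0:(lov_io.c:1049:lov_io_read_ahead()) [0x200000404:0x8:0x0] cra_end = 0, stripes = 240, rc = -61'
enodata=$(printf '%s\n' "$log" | grep -c 'lov_io_read_ahead.*rc = -61')
echo "ENODATA readahead failures: $enodata"
```

On a real system the input would be the saved dump (e.g. the attached lctl-dk-ra.txt) instead of the inline sample.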

Comment by Wang Shilong (Inactive) [ 29/Feb/20 ]

I guess you set lru_max_age=100 because, after writing, lock cancellation can take quite some time if too many PW locks are cached in memory.

Comment by Gerrit Updater [ 29/Feb/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37762
Subject: LU-13312 ldlm: fix to stop iterating tree early in ldlm_kms_shift_cb()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c3b211134e5021f52db80d00613de8699f805f7c

Comment by Wang Shilong (Inactive) [ 29/Feb/20 ]

Regarding the lock cancel problem, I think we discussed it somewhere but never got a chance to push a fix for that known issue; let's push it in this ticket.

Comment by Shuichi Ihara [ 01/Mar/20 ]
-61 is ENODATA which returned by osc_io_read_ahead(), it means readahead could not grab locks ahead, this might be related to
your "lru_max_age=100" Shuichi Ihara?

Nope, I didn't change lru_max_age when I got this log.

Comment by Shuichi Ihara [ 01/Mar/20 ]

I've also confirmed that canceling all locks before the read always helped a lot, regardless of whether there was memory pressure.

# mpirun -np 64 ior -w -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300

# clush -w ec[01-04] lctl set_param  ldlm.namespaces.*.lru_size=clear > /dev/null

# mpirun -np 64 ior -r -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300

Max Read:  22606.54 MiB/sec (23704.67 MB/sec)

Without canceling locks before read
Max Read:  4241.10 MiB/sec (4447.12 MB/sec)
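For reference, the speedups reported in this ticket work out as follows (a sketch using the Max Read figures quoted above):

```shell
# speedup from dropping page caches before the read (first experiment)
drop_speedup=$(awk 'BEGIN{printf "%.1f", 16244.62/5087.32}')
# speedup from canceling DLM locks before the read (this experiment)
lock_speedup=$(awk 'BEGIN{printf "%.1f", 22606.54/4241.10}')
echo "drop_caches: ${drop_speedup}x  lru_size=clear: ${lock_speedup}x"
```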
Comment by Gerrit Updater [ 24/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37762/
Subject: LU-13312 ldlm: fix to stop iterating tree early in ldlm_kms_shift_cb()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b28b3bd9094ee7be8e3c11a531383246a71d5dec

Comment by Wang Shilong (Inactive) [ 24/Mar/20 ]

This is not actually a memory problem.

Comment by Cory Spitz [ 24/Mar/20 ]

wshilong, you closed this, but https://review.whamcloud.com/#/c/37761/ is still pending for this LU. Do you intend to abandon or re-target that patch? Or, shall we re-open this ticket?

Comment by Wang Shilong (Inactive) [ 25/Mar/20 ]

spitzcor, I'll abandon that patch.
