[LU-12398] sanity test_255b: FAIL: Ladvise willread should use more memory than 76800 KiB Created: 06/Jun/19  Updated: 07/Nov/23  Resolved: 07/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Dongyang Li
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre build: https://build.whamcloud.com/job/lustre-master/3904/ (tag 2.12.54)
Lustre client distro: RHEL 8.0
Lustre server distro: RHEL 7.6


Issue Links:
Related
is related to LU-16127 sanity test_255b: Ladvise dontneed sh... Open
is related to LU-12269 Support RHEL 8.0 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

sanity test 255b failed as follows:

== sanity test 255b: check 'lfs ladvise -a dontneed' ================================================= 11:19:01 (1559845141)
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.586058 s, 179 MB/s
CMD: vm7 cat /proc/meminfo | grep ^MemTotal:
Total memory: 1877564 KiB
CMD: vm7 sync && echo 3 > /proc/sys/vm/drop_caches
CMD: vm7 cat /proc/meminfo | grep ^Cached:
Cache used before read: 145664 KiB
CMD: vm7 cat /proc/meminfo | grep ^Cached:
Cache used after read: 145880 KiB
CMD: vm7 cat /proc/meminfo | grep ^Cached:
Cache used after dontneed ladvise: 145880 KiB
 sanity test_255b: @@@@@@ FAIL: Ladvise willread should use more memory than 76800 KiB

Maloo report: https://testing.whamcloud.com/test_sets/f180e12c-8889-11e9-be83-52540065bddc
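
The failing check boils down to the following measurement; here is a minimal sketch of it in shell, assuming a mounted client at /mnt/lustre and root ssh access to the OSS node (called vm7 in the log above; the file path is a placeholder):

FILE=/mnt/lustre/ladvise_demo
dd if=/dev/zero of=$FILE bs=1M count=100          # 100 MiB test file, as in the test

# Drop the page cache on the OSS and record its baseline Cached value.
ssh vm7 'sync && echo 3 > /proc/sys/vm/drop_caches'
before=$(ssh vm7 "awk '/^Cached:/ {print \$2}' /proc/meminfo")

# Ask the OSS to prefetch the whole file into its cache.
lfs ladvise -a willread "$FILE"
sleep 2    # give the server-side readahead a moment to complete

after=$(ssh vm7 "awk '/^Cached:/ {print \$2}' /proc/meminfo")
echo "OSS cache grew by $((after - before)) KiB; the test expects > 76800"

In the failed run above, the server's Cached value grew by only 216 KiB (145664 to 145880), nowhere near the required 76800 KiB (75% of the 100 MiB file).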



Comments
Comment by Peter Jones [ 06/Jun/19 ]

Dongyang, can you please investigate?

Comment by Dongyang Li [ 11/Jun/19 ]

From the Maloo debug log:

00000080:00200000:0.0:1559845142.798658:0:4733:0:(file.c:3338:ll_file_ioctl()) VFS Op:inode=[0x2000059f3:0x13b7:0x0](000000004ccbd265), cmd=802066fa
00000080:00200000:0.0:1559845143.031197:0:4748:0:(file.c:3338:ll_file_ioctl()) VFS Op:inode=[0x2000059f3:0x13b7:0x0](000000004ccbd265), cmd=802066fa

Looks like the ladvise ioctl was issued from the client, which sends an RPC to the server. It seems the server didn't bring the file into memory?

BTW, using the same set of packages with a RHEL 8 client and a RHEL 7.6 server, the test case works fine on my local boxes.
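
As a side note, the cmd value in those log lines decodes to the ladvise ioctl; here is a quick sketch using the standard Linux _IOC bit layout (dir in bits 31-30, size in 29-16, type in 15-8, nr in 7-0):

cmd=$(( 0x802066fa ))
dir=$((  (cmd >> 30) & 0x3    ))   # 2 = _IOC_READ
size=$(( (cmd >> 16) & 0x3fff ))   # 32-byte argument
typ=$((  (cmd >> 8)  & 0xff   ))   # 0x66 = ASCII 'f'
nr=$((    cmd        & 0xff   ))   # 250
printf 'dir=%d size=%d type=%#x nr=%d\n' "$dir" "$size" "$typ" "$nr"
# -> dir=2 size=32 type=0x66 nr=250, i.e. _IOR('f', 250, <32-byte struct>),
#    consistent with LL_IOC_LADVISE in lustre_user.h.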

Comment by Li Xi [ 18/Jun/19 ]

I noticed that at the end of the previous test (255a), a disconnection happened:

[28154.168929] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 255a: check \'lfs ladvise -a willread\' ================================================= 11:12:37 \(1559844757\)
[28154.352785] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 11:12:37 (1559844757)
[28155.345118] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=4 fail_loc=0x237
[28155.459919] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 sleeping for 4000ms
[28159.462249] LustreError: 3276:0:(fail.c:140:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 awake
[28159.487458] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 sleeping for 4000ms
[28159.487465] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) Skipped 1 previous similar message
[28159.586822] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
[28159.688388] LustreError: 3276:0:(fail.c:135:__cfs_fail_timeout_set()) cfs_fail_timeout interrupted
[28179.112695] Lustre: lustre-OST0001: haven't heard from client fb3fe627-e4c7-4 (at 192.168.0.108@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff9f33c71e6c00, cur 1559827411 expire 1559827381 last 1559827364
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[28536.526341] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
[28536.922444] Lustre: DEBUG MARKER: rc=0;
			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
			if [[ $? -eq 0 && $val -ne 0 ]]; then
				echo $(hostname -s): $val;
				rc=$val;
			fi;
			exit $rc
[28537.157475] Lustre: DEBUG MARKER: dmesg
[28537.714141] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 255b: check \'lfs ladvise -a dontneed\' 
...

I am wondering whether this abnormal condition caused the test failure.

Is there any way to check the failure rate of this test with this system configuration?
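
One rough way (a sketch, assuming the standard Lustre test framework configuration is already in place on the nodes) is to loop the single subtest and count failures:

pass=0; fail=0; runs=20    # iteration count is arbitrary
for i in $(seq 1 "$runs"); do
    if ONLY=255b bash sanity.sh > /tmp/sanity_255b.$i.log 2>&1; then
        pass=$((pass + 1))
    else
        fail=$((fail + 1))
    fi
done
echo "test_255b: $pass passed, $fail failed out of $runs runs"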

Comment by Andreas Dilger [ 07/Nov/23 ]

Haven't seen this in months or years.
