[LU-12398] sanity test_255b: FAIL: Ladvise willread should use more memory than 76800 KiB Created: 06/Jun/19 Updated: 07/Nov/23 Resolved: 07/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jian Yu | Assignee: | Dongyang Li |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre build: https://build.whamcloud.com/job/lustre-master/3904/ (tag 2.12.54) |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity test 255b failed as follows:

== sanity test 255b: check 'lfs ladvise -a dontneed' ================================================= 11:19:01 (1559845141)
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.586058 s, 179 MB/s
CMD: vm7 cat /proc/meminfo | grep ^MemTotal:
Total memory: 1877564 KiB
CMD: vm7 sync && echo 3 > /proc/sys/vm/drop_caches
CMD: vm7 cat /proc/meminfo | grep ^Cached:
Cache used before read: 145664 KiB
CMD: vm7 cat /proc/meminfo | grep ^Cached:
Cache used after read: 145880 KiB
CMD: vm7 cat /proc/meminfo | grep ^Cached:
Cache used after dontneed ladvise: 145880 KiB
 sanity test_255b: @@@@@@ FAIL: Ladvise willread should use more memory than 76800 KiB

Maloo report: https://testing.whamcloud.com/test_sets/f180e12c-8889-11e9-be83-52540065bddc |
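For context, the failing check compares the `Cached:` value from /proc/meminfo before and after a willread advise; in the log above the sampling is done on vm7, the node whose page cache is expected to grow. A minimal sketch of that measurement (file name, size, and the local sampling are illustrative, not the exact sanity.sh code; the real test samples /proc/meminfo remotely on the OSS node):

```bash
#!/bin/bash
# Sketch of the cache-growth check behind test_255b's failure message.
# FILE and SIZE_MB are placeholders; the 76800 KiB threshold in the
# failure corresponds to roughly 75% of a 100 MiB file.
FILE=/mnt/lustre/f255b.sanity
SIZE_MB=100

dd if=/dev/zero of=$FILE bs=1M count=$SIZE_MB
sync && echo 3 > /proc/sys/vm/drop_caches

cached_kb() { awk '/^Cached:/ {print $2}' /proc/meminfo; }

before=$(cached_kb)
# Ask the OSTs to prefetch the whole file into their page cache.
lfs ladvise -a willread -s 0 -e $((SIZE_MB * 1048576)) $FILE
after=$(cached_kb)

# In the real test this delta is measured on the OSS (vm7 above), and the
# test fails if it does not exceed the expected fraction of the file size.
echo "cache grew by $((after - before)) KiB"
```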
| Comments |
| Comment by Peter Jones [ 06/Jun/19 ] |
|
Dongyang can you please investigate? |
| Comment by Dongyang Li [ 11/Jun/19 ] |
|
From the Maloo debug log:

00000080:00200000:0.0:1559845142.798658:0:4733:0:(file.c:3338:ll_file_ioctl()) VFS Op:inode=[0x2000059f3:0x13b7:0x0](000000004ccbd265), cmd=802066fa
00000080:00200000:0.0:1559845143.031197:0:4748:0:(file.c:3338:ll_file_ioctl()) VFS Op:inode=[0x2000059f3:0x13b7:0x0](000000004ccbd265), cmd=802066fa

It looks like the ladvise ioctl was issued from the client, which sends an RPC to the server. It seems the server did not bring the file into memory. BTW, using the same set of packages with a RHEL8 client and a RHEL7.6 server, the test case works fine on my local boxes. |
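For anyone trying to capture the same ll_file_ioctl() trace on a client, a hedged sketch using lctl's standard debug facilities (the debug mask name and the file path here are illustrative choices, not taken from the ticket):

```bash
# Enable the VFS trace mask, clear the debug buffer, issue the advise,
# then dump the buffer and look for the ioctl entry points.
lctl set_param debug=+vfstrace
lctl clear

lfs ladvise -a willread -s 0 -e 104857600 /mnt/lustre/f255b.sanity

lctl dk /tmp/ladvise-debug.log
grep ll_file_ioctl /tmp/ladvise-debug.log
```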
| Comment by Li Xi [ 18/Jun/19 ] |
|
I noticed that at the end of the previous test (255a), a disconnection happened:

[28154.168929] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 255a: check \'lfs ladvise -a willread\' ================================================= 11:12:37 \(1559844757\)
[28154.352785] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 11:12:37 (1559844757)
[28155.345118] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=4 fail_loc=0x237
[28155.459919] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 sleeping for 4000ms
[28159.462249] LustreError: 3276:0:(fail.c:140:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 awake
[28159.487458] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 sleeping for 4000ms
[28159.487465] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) Skipped 1 previous similar message
[28159.586822] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
[28159.688388] LustreError: 3276:0:(fail.c:135:__cfs_fail_timeout_set()) cfs_fail_timeout interrupted
[28179.112695] Lustre: lustre-OST0001: haven't heard from client fb3fe627-e4c7-4 (at 192.168.0.108@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff9f33c71e6c00, cur 1559827411 expire 1559827381 last 1559827364
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[28536.526341] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[28536.922444] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[28537.157475] Lustre: DEBUG MARKER: dmesg
[28537.714141] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 255b: check \'lfs ladvise -a dontneed\' ...

I am wondering whether this abnormal condition causes the test failure. Is there any way to check the failure rate of this test with this system configuration? |
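One way to estimate the failure rate on a given configuration would be to re-run the 255a/255b pair in a loop; a rough sketch, assuming sanity.sh's standard ONLY= test filter (the tests directory, iteration count, and log paths are placeholders):

```bash
# Re-run 255a followed by 255b repeatedly and count failures.
cd /usr/lib64/lustre/tests
fail=0
runs=20
for i in $(seq 1 $runs); do
    ONLY="255a 255b" sh sanity.sh > /tmp/sanity-255-run$i.log 2>&1 || fail=$((fail + 1))
done
echo "255a/255b failed $fail out of $runs runs"
```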
| Comment by Andreas Dilger [ 07/Nov/23 ] |
|
Haven't seen this in months or years. |