Lustre / LU-12398

sanity test_255b: FAIL: Ladvise willread should use more memory than 76800 KiB

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Environment:
      Lustre build: https://build.whamcloud.com/job/lustre-master/3904/ (tag 2.12.54)
      Lustre client distro: RHEL 8.0
      Lustre server distro: RHEL 7.6
    • Severity: 3

    Description

      sanity test 255b failed as follows:

      == sanity test 255b: check 'lfs ladvise -a dontneed' ================================================= 11:19:01 (1559845141)
      100+0 records in
      100+0 records out
      104857600 bytes (105 MB, 100 MiB) copied, 0.586058 s, 179 MB/s
      CMD: vm7 cat /proc/meminfo | grep ^MemTotal:
      Total memory: 1877564 KiB
      CMD: vm7 sync && echo 3 > /proc/sys/vm/drop_caches
      CMD: vm7 cat /proc/meminfo | grep ^Cached:
      Cache used before read: 145664 KiB
      CMD: vm7 cat /proc/meminfo | grep ^Cached:
      Cache used after read: 145880 KiB
      CMD: vm7 cat /proc/meminfo | grep ^Cached:
      Cache used after dontneed ladvise: 145880 KiB
       sanity test_255b: @@@@@@ FAIL: Ladvise willread should use more memory than 76800 KiB
      

      Maloo report: https://testing.whamcloud.com/test_sets/f180e12c-8889-11e9-be83-52540065bddc
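
      For reference, the check that failed compares the "Cached:" value in /proc/meminfo on the OSS before and after an 'lfs ladvise -a willread' call, and expects the cache to grow by more than 3/4 of the 100 MiB test file, i.e. 76800 KiB. A rough sketch of that logic (my reconstruction, not the exact sanity.sh code; $OSS and the file path are placeholders):

          # Reconstruction of the failing check; $OSS and FILE are hypothetical names.
          FILE=/mnt/lustre/f255b.file
          SIZE_KB=$((100 * 1024))            # 100 MiB file written with dd

          cached_kb() { ssh "$OSS" cat /proc/meminfo | grep ^Cached: | awk '{print $2}'; }

          ssh "$OSS" "sync && echo 3 > /proc/sys/vm/drop_caches"
          before=$(cached_kb)

          # ask the servers to prefetch the whole file into their page cache
          lfs ladvise -a willread -s 0 -e $((SIZE_KB * 1024)) "$FILE"
          after=$(cached_kb)

          # 3/4 of 102400 KiB = 76800 KiB, the threshold in the error message
          (( after - before > SIZE_KB * 3 / 4 )) ||
                  echo "FAIL: Ladvise willread should use more memory than $((SIZE_KB * 3 / 4)) KiB"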

          Activity


            adilger Andreas Dilger added a comment -

            Haven't seen this in months or years.
            lixi_wc Li Xi added a comment -

            I noticed that at the end of the previous test (255a), a disconnection (client eviction) happened:

            [28154.168929] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 255a: check \'lfs ladvise -a willread\' ================================================= 11:12:37 \(1559844757\)
            [28154.352785] Lustre: DEBUG MARKER: == sanity test 255a: check 'lfs ladvise -a willread' ================================================= 11:12:37 (1559844757)
            [28155.345118] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=4 fail_loc=0x237
            [28155.459919] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 sleeping for 4000ms
            [28159.462249] LustreError: 3276:0:(fail.c:140:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 awake
            [28159.487458] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 237 sleeping for 4000ms
            [28159.487465] LustreError: 3276:0:(fail.c:129:__cfs_fail_timeout_set()) Skipped 1 previous similar message
            [28159.586822] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
            [28159.688388] LustreError: 3276:0:(fail.c:135:__cfs_fail_timeout_set()) cfs_fail_timeout interrupted
            [28179.112695] Lustre: lustre-OST0001: haven't heard from client fb3fe627-e4c7-4 (at 192.168.0.108@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff9f33c71e6c00, cur 1559827411 expire 1559827381 last 1559827364
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            [28536.526341] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
            [28536.922444] Lustre: DEBUG MARKER: rc=0;
            			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
            			if [[ $? -eq 0 && $val -ne 0 ]]; then
            				echo $(hostname -s): $val;
            				rc=$val;
            			fi;
            			exit $rc
            [28537.157475] Lustre: DEBUG MARKER: dmesg
            [28537.714141] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 255b: check \'lfs ladvise -a dontneed\' 
            ...
            

            I am wondering whether this abnormal condition caused the test failure.

            Is there any way to check the failure rate of this test with this system configuration?
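
            One generic way to check the eviction angle on the client (not something captured in this run) is to look at the OSC import state history for the OST that evicted it, e.g.:

                # hypothetical client-side commands; lustre-OST0001 is the OST from the eviction message
                lctl get_param "osc.*OST0001*.state"
                lctl get_param "osc.*OST0001*.import" | grep -E 'state|connect'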

            dongyang Dongyang Li added a comment -

            from the Maloo debug log:

            00000080:00200000:0.0:1559845142.798658:0:4733:0:(file.c:3338:ll_file_ioctl()) VFS Op:inode=[0x2000059f3:0x13b7:0x0](000000004ccbd265), cmd=802066fa
            00000080:00200000:0.0:1559845143.031197:0:4748:0:(file.c:3338:ll_file_ioctl()) VFS Op:inode=[0x2000059f3:0x13b7:0x0](000000004ccbd265), cmd=802066fa
            

            Looks like the ladvise ioctl was issued from the client, which turns into an RPC to the server. Feels like the server didn't bring the file into memory?

            BTW, using the same set of packages on a RHEL 8 client and a RHEL 7.6 server, the test case works fine on my local boxes.
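
            Since the advice travels as an RPC, one generic way to narrow this down (not commands from this run) would be to trace RPCs around the ladvise call and confirm the request leaves the client and reaches the OSS:

                # client side, with an example file name; the OST_LADVISE request should be logged in the dump
                lctl clear
                lctl set_param debug=+rpctrace
                lfs ladvise -a willread -s 0 -e $((100 * 1048576)) /mnt/lustre/f255b.file
                lctl dk /tmp/ladvise-client.dk

                # the same clear/rpctrace/dk sequence on the OSS shows whether the request arrived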

            pjones Peter Jones added a comment -

            Dongyang, can you please investigate?


            People

              Assignee: dongyang Dongyang Li
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: