[LU-16515] sanity test_118c test_118d: No page in writeback, writeback=0

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Labels: None
    • Affects Versions: Lustre 2.16.0, Lustre 2.15.2, Lustre 2.15.3

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8136068e-67b8-43b8-9a9a-f8d956af9458

      test_118d failed with the following error:

      No page in writeback, writeback=0
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_6.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_lustre.x86_64

      Test output is just:

      == sanity test 118d: Fsync validation inject a delay of the bulk ==================================================================== 14:43:29 (1674830609)
      7+0 records in
      7+0 records out
      458752 bytes (459 kB, 448 KiB) copied, 0.00275827 s, 166 MB/s
      CMD: onyx-117vm3 lctl set_param fail_val=0 fail_loc=0x214
      fail_val=0
      fail_loc=0x214
       sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0
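
The failure message comes from a writeback-count check in the test. A minimal sketch of that check's shape follows; the real logic lives in sanity.sh's test_118 helpers, so the helper name and the stubbed page-cache input here are assumptions for illustration only:

```shell
#!/bin/sh
# Hedged sketch of the failing check, not the real sanity.sh code.
count_writeback_pages() {
    # the real test inspects client page-cache state and counts pages
    # flagged "writeback"; here the page flags are passed in directly
    echo "$1" | grep -c writeback
}

# a page whose writeback flag was already cleared reproduces the failure mode
WRITEBACK=$(count_writeback_pages "page 0: uptodate lru private")
if [ "$WRITEBACK" -eq 0 ]; then
    echo "FAIL: No page in writeback, writeback=$WRITEBACK"
fi
```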
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_118d - No page in writeback, writeback=0

      Activity
            ys Yang Sheng added a comment -

            Also, I found that this issue may be similar to DDN-5058. It looks like it is hit very frequently on kernels 4.18.0-3xx through 4.18.0-425. It cannot be reproduced after 4.18.0-477, since the fixing patch is included there. Kernels before 5.14.0-3xx should also be affected by this issue.

            adilger Andreas Dilger added a comment -

            Yang Sheng reports that this subtest does not fail anymore on master when the test is enabled.

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55077/
            Subject: LU-16515 tests: disable sanity test_118c/118d
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: fdab53a553c278d9b4126ca72ef5a911c7227dbb

            gerrit Gerrit Updater added a comment -

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55077
            Subject: LU-16515 tests: disable sanity test_118c/118d
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: dd76a45cdc5f6d8d1a8eca5b9a5e66be184ecd71
            ys Yang Sheng added a comment -

            I have captured a vmcore for the failure case:

            crash> ll_inode_info ffff8f7192598000 -x
            struct ll_inode_info {
              lli_inode_magic = 0x111d0de5,
            ....
              lli_fid = {
                f_seq = 0x200001b78,
                f_oid = 0xd3a,
                f_ver = 0x0
              },
            .........
                i_op = 0xffffffffc106ffc0 <ll_file_inode_operations>,
                i_sb = 0xffff8f719a8ea800,
                i_mapping = 0xffff8f7192598208,
                i_security = 0xffff8f712ef42d80,
                i_ino = 0x200001b78000d3a,
            

            Looking into the i_mapping:

            crash> address_space 0xffff8f7192598208 -o
            struct address_space {
              [ffff8f7192598208] struct inode *host;
              [ffff8f7192598210] struct xarray i_pages;
              [ffff8f7192598228] atomic_t i_mmap_writable;
            ...........
            crash> xarray ffff8f7192598210
            struct xarray {
              xa_lock = {
                {
                  rlock = {
                    raw_lock = {
                      {
                        val = {
                          counter = 0
                        },
                        {
                          locked = 0 '\000',
                          pending = 0 '\000'
                        },
                        {
                          locked_pending = 0,
                          tail = 0
                        }
                      }
                    }
                  }
                }
              },
              xa_flags = 16777249,
              xa_head = 0xffffd8b842578f80,
              xarray_size_rh = 0,
              _rh = {<No data fields>}
            }
            crash> kmem 0xffffd8b842578f80
                  PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
            ffffd8b842578f80 95e3e000 ffff8f7192598208        0  2 fffffc0005028 uptodate,lru,private,writeback
            

            So everything looks valid. RHEL8 replaced the radix tree with XArray for the page cache, so the only remaining explanation would be a race in the XArray code. But I am not sure which place is the culprit. Further investigation is needed. Any comments are appreciated.

            Thanks,
            YangSheng


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50470/
            Subject: LU-16515 tests: disable sanity test_118c/118d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7c52cbf65218d77c0594f92981173aa7d78f6758

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50470
            Subject: LU-16515 tests: disable sanity test_118c/118d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e3bf4218cb9e1eb68f8ffb9f15161645165b84be

            gerrit Gerrit Updater added a comment -

            "Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50439
            Subject: LU-16515 tests: enable -1 log for 118c & 118d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ddbe69524ab8ec5d2147fed5674eddabdf366422

            paf0186 Patrick Farrell added a comment -

            Interesting about full debug - that's good to know.

            RE: debug_mb=10000: oh yes, absolutely. It's sort of my lazy shorthand for "max"; it's intended for when I want to get as much info as possible out of a specific test. It's not appropriate for full test runs, and it does indeed cause problems if you do a full run with it set like that.

            adilger Andreas Dilger added a comment -

            Sheng, can you please push a patch to enable full debug for this subtest (e.g. add start_full_debug_logging and stop_full_debug_logging calls into test_118c and test_118d) and then add:

            Test-Parameters: trivial testlist=sanity env=ONLY="118c 118d",ONLY_REPEAT=100
            Test-Parameters: trivial testlist=sanity
            Test-Parameters: trivial testlist=sanity
            [repeat 20x]

            If testing on the patch itself does not reproduce the problem (it is failing in about 10% of runs on master this week, but may depend on previous state to fail), then the debug patch could be landed so that it will collect debugging from other patch test runs.
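
The wrapping suggested above can be sketched as follows. This is not the real sanity.sh: the helpers are stubbed here, standing in for test-framework.sh's start_full_debug_logging/stop_full_debug_logging, which raise the client debug mask to -1 around the test body and restore it afterwards:

```shell
#!/bin/sh
# Sketch of wrapping a subtest in full debug logging (stubbed helpers).
start_full_debug_logging() { DEBUG_SAVED=$DEBUG; DEBUG=-1; }
stop_full_debug_logging()  { DEBUG=$DEBUG_SAVED; }

DEBUG=normal

test_118d() {
    start_full_debug_logging
    # ... existing test_118d body runs here with full debug enabled ...
    BODY_RAN=yes
    stop_full_debug_logging
}

test_118d
```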

            People

              Assignee: ys Yang Sheng
              Reporter: maloo Maloo