
[LU-16515] sanity test_118c test_118d: No page in writeback, writeback=0

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.2, Lustre 2.15.3
    • Severity: 3

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8136068e-67b8-43b8-9a9a-f8d956af9458

      test_118d failed with the following error:

      No page in writeback, writeback=0
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_6.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_lustre.x86_64

      Test output is just:

      == sanity test 118d: Fsync validation inject a delay of the bulk ==================================================================== 14:43:29 (1674830609)
      7+0 records in
      7+0 records out
      458752 bytes (459 kB, 448 KiB) copied, 0.00275827 s, 166 MB/s
      CMD: onyx-117vm3 lctl set_param fail_val=0 fail_loc=0x214
      fail_val=0
      fail_loc=0x214
       sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_118d - No page in writeback, writeback=0
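      From the output above, the failing sequence is roughly the following. This is a hedged reconstruction, not the actual sanity.sh code; identifying 0x214 as OBD_FAIL_OST_BRW_PAUSE_BULK is an assumption based on the test title:

      # write 7 x 64 KiB = 458752 bytes, matching the dd output in the log
      dd if=/dev/zero of=$DIR/f118d bs=64k count=7
      # delay the OST bulk transfer so pages linger in writeback
      lctl set_param fail_val=0 fail_loc=0x214
      # fsync the file and check that at least one page is under writeback;
      # the reported failure means that check observed writeback=0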

Attachments

Issue Links

Activity


            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55077
            Subject: LU-16515 tests: disable sanity test_118c/118d
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: dd76a45cdc5f6d8d1a8eca5b9a5e66be184ecd71

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55077 Subject: LU-16515 tests: disable sanity test_118c/118d Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: dd76a45cdc5f6d8d1a8eca5b9a5e66be184ecd71
            ys Yang Sheng added a comment -

            I have captured a vmcore for the failure case:

            crash> ll_inode_info ffff8f7192598000 -x
            struct ll_inode_info {
              lli_inode_magic = 0x111d0de5,
            ....
              lli_fid = {
                f_seq = 0x200001b78,
                f_oid = 0xd3a,
                f_ver = 0x0
              },
            .........
                i_op = 0xffffffffc106ffc0 <ll_file_inode_operations>,
                i_sb = 0xffff8f719a8ea800,
                i_mapping = 0xffff8f7192598208,
                i_security = 0xffff8f712ef42d80,
                i_ino = 0x200001b78000d3a,
            

            Looking into the i_mapping:

            crash> address_space 0xffff8f7192598208 -o
            struct address_space {
              [ffff8f7192598208] struct inode *host;
              [ffff8f7192598210] struct xarray i_pages;
              [ffff8f7192598228] atomic_t i_mmap_writable;
            ...........
            crash> xarray ffff8f7192598210
            struct xarray {
              xa_lock = {
                {
                  rlock = {
                    raw_lock = {
                      {
                        val = {
                          counter = 0
                        },
                        {
                          locked = 0 '\000',
                          pending = 0 '\000'
                        },
                        {
                          locked_pending = 0,
                          tail = 0
                        }
                      }
                    }
                  }
                }
              },
              xa_flags = 16777249,
              xa_head = 0xffffd8b842578f80,
              xarray_size_rh = 0,
              _rh = {<No data fields>}
            }
            crash> kmem 0xffffd8b842578f80
                  PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
            ffffd8b842578f80 95e3e000 ffff8f7192598208        0  2 fffffc0005028 uptodate,lru,private,writeback
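            For reference, the flag bits in that FLAGS word decode to exactly those four states. The bit positions below are taken from the 4.18 enum pageflags and are an assumption here, since the layout varies by kernel and config:

            0xfffffc0005028 & 0xffff = 0x5028
                                     = (1 << 3) | (1 << 5) | (1 << 12) | (1 << 14)
                                     = PG_uptodate | PG_lru | PG_private | PG_writeback

            In particular, PG_writeback is set on the page itself, even though the test observed writeback=0.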
            

            So everything looks valid. RHEL8 replaced the radix tree with XArray, so the only explanation I can see is a race in the XArray code, but I am not sure which place is the culprit. Further investigation is needed. Any comments are appreciated.

            Thanks,
            YangSheng


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50470/
            Subject: LU-16515 tests: disable sanity test_118c/118d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7c52cbf65218d77c0594f92981173aa7d78f6758

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50470/ Subject: LU-16515 tests: disable sanity test_118c/118d Project: fs/lustre-release Branch: master Current Patch Set: Commit: 7c52cbf65218d77c0594f92981173aa7d78f6758

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50470
            Subject: LU-16515 tests: disable sanity test_118c/118d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e3bf4218cb9e1eb68f8ffb9f15161645165b84be

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50470 Subject: LU-16515 tests: disable sanity test_118c/118d Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e3bf4218cb9e1eb68f8ffb9f15161645165b84be

            "Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50439
            Subject: LU-16515 tests: enable -1 log for 118c & 118d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ddbe69524ab8ec5d2147fed5674eddabdf366422

            gerrit Gerrit Updater added a comment - "Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50439 Subject: LU-16515 tests: enable -1 log for 118c & 118d Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ddbe69524ab8ec5d2147fed5674eddabdf366422

            paf0186 Patrick Farrell added a comment -

            Interesting about full debug - that's good to know.

            RE: debug_mb=10000; oh yes, absolutely. It's sort of my lazy shorthand for "max"; it's meant for when I want to get as much info as possible out of a specific test. It's not appropriate for full test runs, and it does indeed cause problems if you do a full run with it set like that.

            adilger Andreas Dilger added a comment -

            Sheng, can you please push a patch to enable full debug for these subtests (e.g. add start_full_debug_logging and stop_full_debug_logging calls into test_118c and test_118d), and then add:

            Test-Parameters: trivial testlist=sanity env=ONLY="118c 118d",ONLY_REPEAT=100
            Test-Parameters: trivial testlist=sanity
            Test-Parameters: trivial testlist=sanity
            [repeat 20x]
            

            If testing on the patch itself does not reproduce the problem (it is failing about 10% of runs on master this week, but may depend on previous state to fail), then the debug patch could be landed so that it will collect debugging from other patch test runs.
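            A minimal sketch of what that debug patch could look like, assuming the start_full_debug_logging/stop_full_debug_logging helpers from test-framework.sh; the elided comment stands in for the existing sanity.sh code, which is not reproduced here:

            test_118d() {
                    # raise debug to -1 and enlarge the debug buffer on clients and servers
                    start_full_debug_logging

                    # ... existing test_118d body (unchanged) ...

                    # restore the previous debug settings
                    stop_full_debug_logging
            }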


            adilger Andreas Dilger added a comment -

            Patrick, for sanity.sh the script definitely sets full debug at the start, so that we can debug test failures more easily:

            OLDDEBUG=$(lctl get_param -n debug 2> /dev/null)
            lctl set_param debug=-1 2> /dev/null || true
            

            but this is lost at some point while the tests are running. That said, it looks like this is set only on the client, so possibly the same needs to be done on the servers.

            Regarding "you can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test": the VM clients only have 3GB of RAM, so 10000 is too large. The limit is 80% of RAM, but a buffer that large would likely impact system operation. Maybe 1GB is the max for testing.
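            A hedged sketch of setting this on the servers as well, using the usual test-framework.sh remote-execution helpers (the exact helper names here are an assumption):

            # on the client: full debug mask, 1024 MB buffer (well under 80% of a 3 GB VM)
            lctl set_param debug=-1 debug_mb=1024

            # on all MDS/OSS nodes
            do_nodes $(comma_list $(mdts_nodes) $(osts_nodes)) \
                    "lctl set_param debug=-1 debug_mb=1024"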


            paf0186 Patrick Farrell added a comment -

            It is not the default, I don't think? We certainly don't normally log with +trace enabled? When we accidentally landed a patch of mine which did debug=-1 in a test, it caused problems for other tests.

            Yang Sheng,

            When I need to do this, I cheat.  You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test.  Unless the test remounts Lustre, which resets to defaults, this level of debug will show up in the debug log for the test failure.  You can even do this in tests you push to Maloo, but I would set a test-parameters to run just this test if you do that.
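            For example, a Test-Parameters line limiting a Maloo run to these subtests could mirror the ones quoted elsewhere in this ticket:

            Test-Parameters: trivial testlist=sanity env=ONLY="118c 118d"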

            adilger Andreas Dilger added a comment - edited

            You could push a patch to set debug=all for this test. This is the default for sanity, but is lost when some subtests umount and remount the filesystem.

            You may also be able to add this to Test-Parameters:

            Test-Parameters: trivial testlist=sanity env=ONLY=118,ONLY_REPEAT=100,PTLDEBUG=all
            
            ys Yang Sheng added a comment -

            I wonder why the sanity tests log without the -1 debug flag? Is there an easy way to change it?


            People

              Assignee: ys Yang Sheng
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 12