Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16515

sanity test_118c test_118d: No page in writeback, writeback=0

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.16.0, Lustre 2.15.2, Lustre 2.15.3
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8136068e-67b8-43b8-9a9a-f8d956af9458

      test_118d failed with the following error:

      No page in writeback, writeback=0
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_6.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_lustre.x86_64

      Test output is just:

      == sanity test 118d: Fsync validation inject a delay of the bulk ==================================================================== 14:43:29 (1674830609)
      7+0 records in
      7+0 records out
      458752 bytes (459 kB, 448 KiB) copied, 0.00275827 s, 166 MB/s
      CMD: onyx-117vm3 lctl set_param fail_val=0 fail_loc=0x214
      fail_val=0
      fail_loc=0x214
       sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_118d - No page in writeback, writeback=0

      Attachments

        Issue Links

          Activity

            [LU-16515] sanity test_118c test_118d: No page in writeback, writeback=0

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50470
            Subject: LU-16515 tests: disable sanity test_118c/118d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e3bf4218cb9e1eb68f8ffb9f15161645165b84be

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50470 Subject: LU-16515 tests: disable sanity test_118c/118d Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e3bf4218cb9e1eb68f8ffb9f15161645165b84be

            "Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50439
            Subject: LU-16515 tests: enable -1 log for 118c & 118d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ddbe69524ab8ec5d2147fed5674eddabdf366422

            gerrit Gerrit Updater added a comment - "Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50439 Subject: LU-16515 tests: enable -1 log for 118c & 118d Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ddbe69524ab8ec5d2147fed5674eddabdf366422

            Interesting about full debug - that's good to know.

            RE: debug_mb=10000; oh yes, absolutely.  It's sort my lazy shorthand for "max", it's intended to be when I want to get as much info as possible out of a specific test.  It's not appropriate for full test runs and it does indeed cause problems if you do one with it set like that.

            paf0186 Patrick Farrell added a comment - Interesting about full debug - that's good to know. RE: debug_mb=10000; oh yes, absolutely.  It's sort my lazy shorthand for "max", it's intended to be when I want to get as much info as possible out of a specific test.  It's not appropriate for full test runs and it does indeed cause problems if you do one with it set like that.

            Sheng, can you please push a patch to enable full debug for this subtest (e.g. add start_full_debug_logging and stop_full_debug_logging calls into test_118c and test_118d and then add:

            Test-Parameters: trivial testlist=sanity env=ONLY="118c 118d",ONLY_REPEAT=100
            Test-Parameters: trivial testlist=sanity
            Test-Parameters: trivial testlist=sanity
            [repeat 20x]
            

            If testing on the patch itself does not reproduce the problem (it is failing about 10% of runs on master this week, but may depend on previous state to fail), then the debug patch could be landed so that it will collect debugging from other patch test runs.

            adilger Andreas Dilger added a comment - Sheng, can you please push a patch to enable full debug for this subtest (e.g. add start_full_debug_logging and stop_full_debug_logging calls into test_118c and test_118d and then add: Test-Parameters: trivial testlist=sanity env=ONLY="118c 118d",ONLY_REPEAT=100 Test-Parameters: trivial testlist=sanity Test-Parameters: trivial testlist=sanity [repeat 20x] If testing on the patch itself does not reproduce the problem (it is failing about 10% of runs on master this week, but may depend on previous state to fail), then the debug patch could be landed so that it will collect debugging from other patch test runs.

            Patrick, for sanity.sh the script definitely sets full debug at the start, so that we can debug test failures more easily:

            OLDDEBUG=$(lctl get_param -n debug 2> /dev/null)
            lctl set_param debug=-1 2> /dev/null || true
            

            but this is lost at some point during the tests running. That said, it looks like this is set only on the client, so possibly the same needs to be done on the servers.

            You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test.

            The VM clients only have 3GB of RAM, so 10000 is too large. The limit is 80% of RAM, but this will likely impact system operation. Maybe 1GB is the max for testing.

            adilger Andreas Dilger added a comment - Patrick, for sanity.sh the script definitely sets full debug at the start, so that we can debug test failures more easily: OLDDEBUG=$(lctl get_param -n debug 2> /dev/null) lctl set_param debug=-1 2> /dev/null || true but this is lost at some point during the tests running. That said, it looks like this is set only on the client, so possibly the same needs to be done on the servers. You can place ' lctl set_param *debug=-1 ' and ' lctl set_param debug_mb=10000 ' (or some other large number) directly in the test, then run the test. The VM clients only have 3GB of RAM, so 10000 is too large. The limit is 80% of RAM, but this will likely impact system operation. Maybe 1GB is the max for testing.

            It is not the default, I don't think?  We certainly don't normally log with +trace enabled?  When we accidentally landed a patch of mine which did debug=-1 in a test, it caused problems for other tests.

            Yang Sheng,

            When I need to do this, I cheat.  You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test.  Unless the test remounts Lustre, which resets to defaults, this level of debug will show up in the debug log for the test failure.  You can even do this in tests you push to Maloo, but I would set a test-parameters to run just this test if you do that.

            paf0186 Patrick Farrell added a comment - It is not the default, I don't think?  We certainly don't normally log with +trace enabled?  When we accidentally landed a patch of mine which did debug=-1 in a test, it caused problems for other tests. Yang Sheng, When I need to do this, I cheat.  You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test.  Unless the test remounts Lustre, which resets to defaults, this level of debug will show up in the debug log for the test failure.  You can even do this in tests you push to Maloo, but I would set a test-parameters to run just this test if you do that.
            adilger Andreas Dilger added a comment - - edited

            You could push a patch to set debug=all for this test. This is the default for sanity, but is lost when some subtests umount and remount the filesystem.

            You may also be able to add this to Test-Parameters:

            Test-Parameters: trivial testlist=sanity env=ONLY=118,ONLY_REPEAT=100,PTLDEBUG=all
            
            adilger Andreas Dilger added a comment - - edited You could push a patch to set debug=all for this test. This is the default for sanity, but is lost when some subtests umount and remount the filesystem. You may also be able to add this to Test-Parameters: Test-Parameters: trivial testlist=sanity env=ONLY=118,ONLY_REPEAT=100,PTLDEBUG=all
            ys Yang Sheng added a comment -

            I wonder why sanity test lustre log without -1 flag? Have a easy way to change it?

            ys Yang Sheng added a comment - I wonder why sanity test lustre log without -1 flag? Have a easy way to change it?

            Yang Sheng,

            Are you interested in looking in to this further?  I haven't been able to reproduce it locally, which has been making looking at it kind of tricky.

            paf0186 Patrick Farrell added a comment - Yang Sheng, Are you interested in looking in to this further?  I haven't been able to reproduce it locally, which has been making looking at it kind of tricky.

            +73 failures in the past week. Please increase priority to fix this issue as it slows down all development efforts.

            adilger Andreas Dilger added a comment - +73 failures in the past week. Please increase priority to fix this issue as it slows down all development efforts.
            ys Yang Sheng added a comment - - edited

            This is really confused me. From log:

            00000080:00200000:0.0:1678874193.081576:0:332595:0:(vvp_io.c:1279:vvp_io_write_start()) f118d.sanity: write [0, 4096)
            00000080:00200000:0.0:1678874193.081581:0:332595:0:(rw26.c:725:ll_write_begin()) Writing 0 of 0 to 4096 bytes
            00000080:00200000:0.0:1678874193.081593:0:332595:0:(rw26.c:908:ll_write_end()) pos 0, len 4096, copied 4096
            00000080:00200000:0.0:1678874193.081597:0:332595:0:(rw26.c:945:ll_write_end()) page@00000000bdbe9cf6[3 0000000097786b56 2 1 0000000095f53b23]
            
            00000080:00200000:0.0:1678874193.081598:0:332595:0:(rw26.c:945:ll_write_end()) vmpage @00000000355f6090 fffffc0002001 3:0 ffff9ff680f81e00 0 lru       
                                                                                                                                  
                                                                                    =================^
            00000080:00200000:0.0:1678874193.081602:0:332595:0:(rw26.c:945:ll_write_end()) osc-page@00000000e81681ab 0: 1< 0 - - > 2< 0 0 0 0x0 0x0 | 0000000000000000 00000000516aba8e 000000004f241537 > 3< 0 0 > 4< 0 0 8 16334848 - | - - - - > 5< - - - - | 0 - | 0 - ->
            
            00000080:00200000:0.0:1678874193.081603:0:332595:0:(rw26.c:945:ll_write_end()) end page@00000000bdbe9cf6
            
            00000080:00200000:0.0:1678874193.081604:0:332595:0:(rw26.c:945:ll_write_end()) queued page: 1.
            00000080:00200000:0.0:1678874193.081605:0:332595:0:(vvp_io.c:1169:vvp_io_write_commit()) commit async pages: 1, from 0, to 4096
            00000008:00010000:0.0:1678874193.081612:0:332595:0:(osc_cache.c:802:osc_extent_find()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
            00000080:00200000:0.0:1678874193.081620:0:332595:0:(vvp_io.c:1091:vvp_set_pagevec_dirty()) mapping 00000000608a50d7, count 1, dirtied 1
            00000080:00200000:0.0:1678874193.081627:0:332595:0:(vvp_io.c:1192:vvp_io_write_commit()) Committed 1 pages 4096 bytes, tot: 4096
            00000080:00200000:0.0:1678874193.081631:0:332595:0:(file.c:1472:ll_merge_attr()) [0x200001b78:0xd35:0x0] updating i_size 4096
            00000080:00200000:0.0:1678874193.081633:0:332595:0:(file.c:4888:ll_fsync()) VFS Op:inode=[0x200001b78:0xd35:0x0](00000000101efe33), start 0, end 4095, datasync 0
            00000080:00200000:0.0:1678874193.081637:0:332595:0:(vvp_io.c:1780:vvp_io_init()) [0x200001b78:0xd35:0x0] ignore/verify layout 1/0, layout version 2 restore needed 0
            00020000:00200000:0.0:1678874193.081640:0:332595:0:(lov_io.c:819:lov_io_iter_init()) component[0] flags 0x10
            00000008:00010000:0.0:1678874193.081645:0:332595:0:(osc_cache.c:1867:try_to_add_extent_for_io()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
            00000008:00010000:0.0:1678874193.081650:0:332595:0:(osc_cache.c:1084:osc_extent_make_ready()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
            00010000:00010000:0.0:1678874193.081662:0:332595:0:(ldlm_lock.c:1507:ldlm_lock_match_with_skip()) ### matched (0 4095) ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 5/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
            00000008:00100000:0.0:1678874193.081675:0:332595:0:(osc_request.c:1948:osc_brw_prep_request()) brw rpc 0000000034743c16 - object 0x0:21173 offset 0<>4096
            00000100:00100000:1.0:1678874193.081701:0:2905:0:(client.c:1737:ptlrpc_send_new_req()) Sending RPC req@0000000034743c16 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_01:8231eeb6-f8da-45b6-923d-908cec5c2274:2905:1760422647264640:10.240.42.118@tcp:4:multiop.0
            00000400:00000080:1.0:1678874194.489851:0:332674:0:(module.c:211:libcfs_ioctl()) libcfs ioctl cmd 3221775648
            00000001:02000400:1.0:1678874194.489853:0:332674:0:(debug.c:725:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0
            
            

            The page flag ffffc0002001 indicts the writeback has been set for this page. The OST_WRITE hasn't received reply until the test case failed, so this flag should not be clear mean while. Then why the test failed? Maybe i miss some thing?

            ys Yang Sheng added a comment - - edited This is really confused me. From log: 00000080:00200000:0.0:1678874193.081576:0:332595:0:(vvp_io.c:1279:vvp_io_write_start()) f118d.sanity: write [0, 4096) 00000080:00200000:0.0:1678874193.081581:0:332595:0:(rw26.c:725:ll_write_begin()) Writing 0 of 0 to 4096 bytes 00000080:00200000:0.0:1678874193.081593:0:332595:0:(rw26.c:908:ll_write_end()) pos 0, len 4096, copied 4096 00000080:00200000:0.0:1678874193.081597:0:332595:0:(rw26.c:945:ll_write_end()) page@00000000bdbe9cf6[3 0000000097786b56 2 1 0000000095f53b23] 00000080:00200000:0.0:1678874193.081598:0:332595:0:(rw26.c:945:ll_write_end()) vmpage @00000000355f6090 fffffc0002001 3:0 ffff9ff680f81e00 0 lru =================^ 00000080:00200000:0.0:1678874193.081602:0:332595:0:(rw26.c:945:ll_write_end()) osc-page@00000000e81681ab 0: 1< 0 - - > 2< 0 0 0 0x0 0x0 | 0000000000000000 00000000516aba8e 000000004f241537 > 3< 0 0 > 4< 0 0 8 16334848 - | - - - - > 5< - - - - | 0 - | 0 - -> 00000080:00200000:0.0:1678874193.081603:0:332595:0:(rw26.c:945:ll_write_end()) end page@00000000bdbe9cf6 00000080:00200000:0.0:1678874193.081604:0:332595:0:(rw26.c:945:ll_write_end()) queued page: 1. 00000080:00200000:0.0:1678874193.081605:0:332595:0:(vvp_io.c:1169:vvp_io_write_commit()) commit async pages: 1, from 0, to 4096 00000008:00010000:0.0:1678874193.081612:0:332595:0:(osc_cache.c:802:osc_extent_find()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1 00000080:00200000:0.0:1678874193.081620:0:332595:0:(vvp_io.c:1091:vvp_set_pagevec_dirty()) mapping 00000000608a50d7, count 1, dirtied 1 00000080:00200000:0.0:1678874193.081627:0:332595:0:(vvp_io.c:1192:vvp_io_write_commit()) Committed 1 pages 4096 bytes, tot: 4096 00000080:00200000:0.0:1678874193.081631:0:332595:0:(file.c:1472:ll_merge_attr()) [0x200001b78:0xd35:0x0] updating i_size 4096 00000080:00200000:0.0:1678874193.081633:0:332595:0:(file.c:4888:ll_fsync()) VFS Op:inode=[0x200001b78:0xd35:0x0](00000000101efe33), start 0, end 4095, datasync 0 00000080:00200000:0.0:1678874193.081637:0:332595:0:(vvp_io.c:1780:vvp_io_init()) [0x200001b78:0xd35:0x0] ignore/verify layout 1/0, layout version 2 restore needed 0 00020000:00200000:0.0:1678874193.081640:0:332595:0:(lov_io.c:819:lov_io_iter_init()) component[0] flags 0x10 00000008:00010000:0.0:1678874193.081645:0:332595:0:(osc_cache.c:1867:try_to_add_extent_for_io()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1 00000008:00010000:0.0:1678874193.081650:0:332595:0:(osc_cache.c:1084:osc_extent_make_ready()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1 00010000:00010000:0.0:1678874193.081662:0:332595:0:(ldlm_lock.c:1507:ldlm_lock_match_with_skip()) ### matched (0 4095) ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 5/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1 00000008:00100000:0.0:1678874193.081675:0:332595:0:(osc_request.c:1948:osc_brw_prep_request()) brw rpc 0000000034743c16 - object 0x0:21173 offset 0<>4096 00000100:00100000:1.0:1678874193.081701:0:2905:0:(client.c:1737:ptlrpc_send_new_req()) Sending RPC req@0000000034743c16 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_01:8231eeb6-f8da-45b6-923d-908cec5c2274:2905:1760422647264640:10.240.42.118@tcp:4:multiop.0 00000400:00000080:1.0:1678874194.489851:0:332674:0:(module.c:211:libcfs_ioctl()) libcfs ioctl cmd 3221775648 00000001:02000400:1.0:1678874194.489853:0:332674:0:(debug.c:725:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0 The page flag ffffc0002001 indicts the writeback has been set for this page. The OST_WRITE hasn't received reply until the test case failed, so this flag should not be clear mean while. Then why the test failed? Maybe i miss some thing?

            People

              ys Yang Sheng
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated: