[LU-16515] sanity test_118c test_118d: No page in writeback, writeback=0 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.16.0, Lustre 2.15.2, Lustre 2.15.3
Labels:
- always_except

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8136068e-67b8-43b8-9a9a-f8d956af9458

test_118d failed with the following error:

No page in writeback, writeback=0

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_6.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_lustre.x86_64

Test output is just:

== sanity test 118d: Fsync validation inject a delay of the bulk ==================================================================== 14:43:29 (1674830609)
7+0 records in
7+0 records out
458752 bytes (459 kB, 448 KiB) copied, 0.00275827 s, 166 MB/s
CMD: onyx-117vm3 lctl set_param fail_val=0 fail_loc=0x214
fail_val=0
fail_loc=0x214
 sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_118d - No page in writeback, writeback=0

Attachments

Issue Links

duplicates

LU-8101 sanity test_118c fails with 'No page in writeback, writeback=0'

Resolved

LU-16285 Prolong the lock BL timeout

Resolved

mentioned in: Page Loading...

Activity

[LU-16515] sanity test_118c test_118d: No page in writeback, writeback=0

Gerrit Updater added a comment - 29/Mar/23 9:48 PM

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50470
Subject: LU-16515 tests: disable sanity test_118c/118d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3bf4218cb9e1eb68f8ffb9f15161645165b84be

Gerrit Updater added a comment - 29/Mar/23 9:48 PM "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50470 Subject: LU-16515 tests: disable sanity test_118c/118d Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e3bf4218cb9e1eb68f8ffb9f15161645165b84be

Gerrit Updater added a comment - 27/Mar/23 7:52 PM

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50439
Subject: LU-16515 tests: enable -1 log for 118c & 118d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ddbe69524ab8ec5d2147fed5674eddabdf366422

Gerrit Updater added a comment - 27/Mar/23 7:52 PM "Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50439 Subject: LU-16515 tests: enable -1 log for 118c & 118d Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ddbe69524ab8ec5d2147fed5674eddabdf366422

Patrick Farrell added a comment - 27/Mar/23 7:02 PM

Interesting about full debug - that's good to know.

RE: debug_mb=10000; oh yes, absolutely. It's sort my lazy shorthand for "max", it's intended to be when I want to get as much info as possible out of a specific test. It's not appropriate for full test runs and it does indeed cause problems if you do one with it set like that.

Patrick Farrell added a comment - 27/Mar/23 7:02 PM Interesting about full debug - that's good to know. RE: debug_mb=10000; oh yes, absolutely. It's sort my lazy shorthand for "max", it's intended to be when I want to get as much info as possible out of a specific test. It's not appropriate for full test runs and it does indeed cause problems if you do one with it set like that.

Andreas Dilger added a comment - 27/Mar/23 6:38 PM

Sheng, can you please push a patch to enable full debug for this subtest (e.g. add start_full_debug_logging and stop_full_debug_logging calls into test_118c and test_118d and then add:

Test-Parameters: trivial testlist=sanity env=ONLY="118c 118d",ONLY_REPEAT=100
Test-Parameters: trivial testlist=sanity
Test-Parameters: trivial testlist=sanity
[repeat 20x]

If testing on the patch itself does not reproduce the problem (it is failing about 10% of runs on master this week, but may depend on previous state to fail), then the debug patch could be landed so that it will collect debugging from other patch test runs.

Andreas Dilger added a comment - 27/Mar/23 6:38 PM Sheng, can you please push a patch to enable full debug for this subtest (e.g. add start_full_debug_logging and stop_full_debug_logging calls into test_118c and test_118d and then add: Test-Parameters: trivial testlist=sanity env=ONLY="118c 118d",ONLY_REPEAT=100 Test-Parameters: trivial testlist=sanity Test-Parameters: trivial testlist=sanity [repeat 20x] If testing on the patch itself does not reproduce the problem (it is failing about 10% of runs on master this week, but may depend on previous state to fail), then the debug patch could be landed so that it will collect debugging from other patch test runs.

Andreas Dilger added a comment - 27/Mar/23 6:16 PM

Patrick, for sanity.sh the script definitely sets full debug at the start, so that we can debug test failures more easily:

OLDDEBUG=$(lctl get_param -n debug 2> /dev/null)
lctl set_param debug=-1 2> /dev/null || true

but this is lost at some point during the tests running. That said, it looks like this is set only on the client, so possibly the same needs to be done on the servers.

You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test.

The VM clients only have 3GB of RAM, so 10000 is too large. The limit is 80% of RAM, but this will likely impact system operation. Maybe 1GB is the max for testing.

Andreas Dilger added a comment - 27/Mar/23 6:16 PM Patrick, for sanity.sh the script definitely sets full debug at the start, so that we can debug test failures more easily: OLDDEBUG=$(lctl get_param -n debug 2> /dev/null) lctl set_param debug=-1 2> /dev/null || true but this is lost at some point during the tests running. That said, it looks like this is set only on the client, so possibly the same needs to be done on the servers. You can place ' lctl set_param *debug=-1 ' and ' lctl set_param debug_mb=10000 ' (or some other large number) directly in the test, then run the test. The VM clients only have 3GB of RAM, so 10000 is too large. The limit is 80% of RAM, but this will likely impact system operation. Maybe 1GB is the max for testing.

Patrick Farrell added a comment - 23/Mar/23 7:07 PM

It is not the default, I don't think? We certainly don't normally log with +trace enabled? When we accidentally landed a patch of mine which did debug=-1 in a test, it caused problems for other tests.

Yang Sheng,

When I need to do this, I cheat. You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test. Unless the test remounts Lustre, which resets to defaults, this level of debug will show up in the debug log for the test failure. You can even do this in tests you push to Maloo, but I would set a test-parameters to run just this test if you do that.

Patrick Farrell added a comment - 23/Mar/23 7:07 PM It is not the default, I don't think? We certainly don't normally log with +trace enabled? When we accidentally landed a patch of mine which did debug=-1 in a test, it caused problems for other tests. Yang Sheng, When I need to do this, I cheat. You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test. Unless the test remounts Lustre, which resets to defaults, this level of debug will show up in the debug log for the test failure. You can even do this in tests you push to Maloo, but I would set a test-parameters to run just this test if you do that.

Andreas Dilger added a comment - 23/Mar/23 6:51 PM - edited

You could push a patch to set debug=all for this test. This is the default for sanity, but is lost when some subtests umount and remount the filesystem.

You may also be able to add this to Test-Parameters:

Test-Parameters: trivial testlist=sanity env=ONLY=118,ONLY_REPEAT=100,PTLDEBUG=all

Andreas Dilger added a comment - 23/Mar/23 6:51 PM - edited You could push a patch to set debug=all for this test. This is the default for sanity, but is lost when some subtests umount and remount the filesystem. You may also be able to add this to Test-Parameters: Test-Parameters: trivial testlist=sanity env=ONLY=118,ONLY_REPEAT=100,PTLDEBUG=all

Yang Sheng added a comment - 23/Mar/23 6:19 PM

I wonder why sanity test lustre log without -1 flag? Have a easy way to change it?

Yang Sheng added a comment - 23/Mar/23 6:19 PM I wonder why sanity test lustre log without -1 flag? Have a easy way to change it?

Patrick Farrell added a comment - 20/Mar/23 5:11 PM

Yang Sheng,

Are you interested in looking in to this further? I haven't been able to reproduce it locally, which has been making looking at it kind of tricky.

Patrick Farrell added a comment - 20/Mar/23 5:11 PM Yang Sheng, Are you interested in looking in to this further? I haven't been able to reproduce it locally, which has been making looking at it kind of tricky.

Andreas Dilger added a comment - 19/Mar/23 7:15 PM

+73 failures in the past week. Please increase priority to fix this issue as it slows down all development efforts.

Andreas Dilger added a comment - 19/Mar/23 7:15 PM +73 failures in the past week. Please increase priority to fix this issue as it slows down all development efforts.

Yang Sheng added a comment - 16/Mar/23 6:18 AM - edited

This is really confused me. From log:

00000080:00200000:0.0:1678874193.081576:0:332595:0:(vvp_io.c:1279:vvp_io_write_start()) f118d.sanity: write [0, 4096)
00000080:00200000:0.0:1678874193.081581:0:332595:0:(rw26.c:725:ll_write_begin()) Writing 0 of 0 to 4096 bytes
00000080:00200000:0.0:1678874193.081593:0:332595:0:(rw26.c:908:ll_write_end()) pos 0, len 4096, copied 4096
00000080:00200000:0.0:1678874193.081597:0:332595:0:(rw26.c:945:ll_write_end()) page@00000000bdbe9cf6[3 0000000097786b56 2 1 0000000095f53b23]

00000080:00200000:0.0:1678874193.081598:0:332595:0:(rw26.c:945:ll_write_end()) vmpage @00000000355f6090 fffffc0002001 3:0 ffff9ff680f81e00 0 lru       
                                                                                                                      
                                                                        =================^
00000080:00200000:0.0:1678874193.081602:0:332595:0:(rw26.c:945:ll_write_end()) osc-page@00000000e81681ab 0: 1< 0 - - > 2< 0 0 0 0x0 0x0 | 0000000000000000 00000000516aba8e 000000004f241537 > 3< 0 0 > 4< 0 0 8 16334848 - | - - - - > 5< - - - - | 0 - | 0 - ->

00000080:00200000:0.0:1678874193.081603:0:332595:0:(rw26.c:945:ll_write_end()) end page@00000000bdbe9cf6

00000080:00200000:0.0:1678874193.081604:0:332595:0:(rw26.c:945:ll_write_end()) queued page: 1.
00000080:00200000:0.0:1678874193.081605:0:332595:0:(vvp_io.c:1169:vvp_io_write_commit()) commit async pages: 1, from 0, to 4096
00000008:00010000:0.0:1678874193.081612:0:332595:0:(osc_cache.c:802:osc_extent_find()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
00000080:00200000:0.0:1678874193.081620:0:332595:0:(vvp_io.c:1091:vvp_set_pagevec_dirty()) mapping 00000000608a50d7, count 1, dirtied 1
00000080:00200000:0.0:1678874193.081627:0:332595:0:(vvp_io.c:1192:vvp_io_write_commit()) Committed 1 pages 4096 bytes, tot: 4096
00000080:00200000:0.0:1678874193.081631:0:332595:0:(file.c:1472:ll_merge_attr()) [0x200001b78:0xd35:0x0] updating i_size 4096
00000080:00200000:0.0:1678874193.081633:0:332595:0:(file.c:4888:ll_fsync()) VFS Op:inode=[0x200001b78:0xd35:0x0](00000000101efe33), start 0, end 4095, datasync 0
00000080:00200000:0.0:1678874193.081637:0:332595:0:(vvp_io.c:1780:vvp_io_init()) [0x200001b78:0xd35:0x0] ignore/verify layout 1/0, layout version 2 restore needed 0
00020000:00200000:0.0:1678874193.081640:0:332595:0:(lov_io.c:819:lov_io_iter_init()) component[0] flags 0x10
00000008:00010000:0.0:1678874193.081645:0:332595:0:(osc_cache.c:1867:try_to_add_extent_for_io()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
00000008:00010000:0.0:1678874193.081650:0:332595:0:(osc_cache.c:1084:osc_extent_make_ready()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1678874193.081662:0:332595:0:(ldlm_lock.c:1507:ldlm_lock_match_with_skip()) ### matched (0 4095) ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 5/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
00000008:00100000:0.0:1678874193.081675:0:332595:0:(osc_request.c:1948:osc_brw_prep_request()) brw rpc 0000000034743c16 - object 0x0:21173 offset 0<>4096
00000100:00100000:1.0:1678874193.081701:0:2905:0:(client.c:1737:ptlrpc_send_new_req()) Sending RPC req@0000000034743c16 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_01:8231eeb6-f8da-45b6-923d-908cec5c2274:2905:1760422647264640:10.240.42.118@tcp:4:multiop.0
00000400:00000080:1.0:1678874194.489851:0:332674:0:(module.c:211:libcfs_ioctl()) libcfs ioctl cmd 3221775648
00000001:02000400:1.0:1678874194.489853:0:332674:0:(debug.c:725:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0

The page flag ffffc0002001 indicts the writeback has been set for this page. The OST_WRITE hasn't received reply until the test case failed, so this flag should not be clear mean while. Then why the test failed? Maybe i miss some thing?

Yang Sheng added a comment - 16/Mar/23 6:18 AM - edited This is really confused me. From log: 00000080:00200000:0.0:1678874193.081576:0:332595:0:(vvp_io.c:1279:vvp_io_write_start()) f118d.sanity: write [0, 4096) 00000080:00200000:0.0:1678874193.081581:0:332595:0:(rw26.c:725:ll_write_begin()) Writing 0 of 0 to 4096 bytes 00000080:00200000:0.0:1678874193.081593:0:332595:0:(rw26.c:908:ll_write_end()) pos 0, len 4096, copied 4096 00000080:00200000:0.0:1678874193.081597:0:332595:0:(rw26.c:945:ll_write_end()) page@00000000bdbe9cf6[3 0000000097786b56 2 1 0000000095f53b23] 00000080:00200000:0.0:1678874193.081598:0:332595:0:(rw26.c:945:ll_write_end()) vmpage @00000000355f6090 fffffc0002001 3:0 ffff9ff680f81e00 0 lru =================^ 00000080:00200000:0.0:1678874193.081602:0:332595:0:(rw26.c:945:ll_write_end()) osc-page@00000000e81681ab 0: 1< 0 - - > 2< 0 0 0 0x0 0x0 | 0000000000000000 00000000516aba8e 000000004f241537 > 3< 0 0 > 4< 0 0 8 16334848 - | - - - - > 5< - - - - | 0 - | 0 - -> 00000080:00200000:0.0:1678874193.081603:0:332595:0:(rw26.c:945:ll_write_end()) end page@00000000bdbe9cf6 00000080:00200000:0.0:1678874193.081604:0:332595:0:(rw26.c:945:ll_write_end()) queued page: 1. 00000080:00200000:0.0:1678874193.081605:0:332595:0:(vvp_io.c:1169:vvp_io_write_commit()) commit async pages: 1, from 0, to 4096 00000008:00010000:0.0:1678874193.081612:0:332595:0:(osc_cache.c:802:osc_extent_find()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1 00000080:00200000:0.0:1678874193.081620:0:332595:0:(vvp_io.c:1091:vvp_set_pagevec_dirty()) mapping 00000000608a50d7, count 1, dirtied 1 00000080:00200000:0.0:1678874193.081627:0:332595:0:(vvp_io.c:1192:vvp_io_write_commit()) Committed 1 pages 4096 bytes, tot: 4096 00000080:00200000:0.0:1678874193.081631:0:332595:0:(file.c:1472:ll_merge_attr()) [0x200001b78:0xd35:0x0] updating i_size 4096 00000080:00200000:0.0:1678874193.081633:0:332595:0:(file.c:4888:ll_fsync()) VFS Op:inode=[0x200001b78:0xd35:0x0](00000000101efe33), start 0, end 4095, datasync 0 00000080:00200000:0.0:1678874193.081637:0:332595:0:(vvp_io.c:1780:vvp_io_init()) [0x200001b78:0xd35:0x0] ignore/verify layout 1/0, layout version 2 restore needed 0 00020000:00200000:0.0:1678874193.081640:0:332595:0:(lov_io.c:819:lov_io_iter_init()) component[0] flags 0x10 00000008:00010000:0.0:1678874193.081645:0:332595:0:(osc_cache.c:1867:try_to_add_extent_for_io()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1 00000008:00010000:0.0:1678874193.081650:0:332595:0:(osc_cache.c:1084:osc_extent_make_ready()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1 00010000:00010000:0.0:1678874193.081662:0:332595:0:(ldlm_lock.c:1507:ldlm_lock_match_with_skip()) ### matched (0 4095) ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 5/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1 00000008:00100000:0.0:1678874193.081675:0:332595:0:(osc_request.c:1948:osc_brw_prep_request()) brw rpc 0000000034743c16 - object 0x0:21173 offset 0<>4096 00000100:00100000:1.0:1678874193.081701:0:2905:0:(client.c:1737:ptlrpc_send_new_req()) Sending RPC req@0000000034743c16 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_01:8231eeb6-f8da-45b6-923d-908cec5c2274:2905:1760422647264640:10.240.42.118@tcp:4:multiop.0 00000400:00000080:1.0:1678874194.489851:0:332674:0:(module.c:211:libcfs_ioctl()) libcfs ioctl cmd 3221775648 00000001:02000400:1.0:1678874194.489853:0:332674:0:(debug.c:725:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0 The page flag ffffc0002001 indicts the writeback has been set for this page. The OST_WRITE hasn't received reply until the test case failed, so this flag should not be clear mean while. Then why the test failed? Maybe i miss some thing?

People

Assignee:: Yang Sheng

Reporter:: Maloo

Votes:: 0 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 30/Jan/23 7:25 AM

Updated:: 12/Jul/24 12:11 AM