[LU-16515] sanity test_118c test_118d: No page in writeback, writeback=0 Created: 30/Jan/23  Updated: 16/Jun/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0, Lustre 2.15.2, Lustre 2.15.3
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Yang Sheng
Resolution: Unresolved Votes: 0
Labels: always_except

Issue Links:
Duplicate
duplicates LU-8101 sanity test_118c fails with 'No page ... Resolved
duplicates LU-16285 Prolong the lock BL timeout Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8136068e-67b8-43b8-9a9a-f8d956af9458

test_118d failed with the following error:

No page in writeback, writeback=0

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_6.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/91915 - 4.18.0-372.32.1.el8_lustre.x86_64

Test output is just:

== sanity test 118d: Fsync validation inject a delay of the bulk ==================================================================== 14:43:29 (1674830609)
7+0 records in
7+0 records out
458752 bytes (459 kB, 448 KiB) copied, 0.00275827 s, 166 MB/s
CMD: onyx-117vm3 lctl set_param fail_val=0 fail_loc=0x214
fail_val=0
fail_loc=0x214
 sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_118d - No page in writeback, writeback=0



 Comments   
Comment by Aurelien Degremont (Inactive) [ 02/Feb/23 ]

+1 https://testing.whamcloud.com/test_sessions/953d8bc0-67fe-41ca-a159-e971ce38e1cf

Comment by Andreas Dilger [ 02/Feb/23 ]

Patrick, could you please take a quick look at this failing subtest to see if this is a concern?

It started failing on master 2023-01-18 and is now failing 10 times per day, so likely relates to a patch that was landed around that time. The filtered list (excluding kfilnd and unrelated test changes) looks like:

# git log --oneline --after 2023-01-16 --before 2023-01-19 | grep -E -v "kfilnd|test"
cb0aa0285b32 LU-2771 ldlm: remove obsolete LDLM_FL_SERVER_LOCK
0b406c91d175 LU-13642 lnet: modify lnet_inetdev to work with large NIDS
43fe6e51804f LU-16380 osd-ldiskfs: race in OI mapping
ae98c5fdaaf3 LU-16335 mdt: skip target check for rm_entry
a883fec55694 LU-16302 llite: Use alloc_inode_sb() to allocate inodes
7bf0e557a2b3 LU-15163 osd: osd_obj_map_recover() to restart transaction
17a3b5688435 LU-16460 lnet: validate data sent from user land properly
84c9618190f9 LU-16160 revert: "llite: clear stale page's uptodate bit"
f66b0c3b22bf LU-6142 osc: tidy up osc_init()
d137e9823ca1 LU-10003 lnet: use Netlink to support LNet ping commands
1632ed0340ad LU-6142 ldlm: use list_first_entry in ldlm_lock
c0aa5d97da57 LU-6142 ldlm: tidy list walking in ldlm_flock()
540c293a4d0f LU-16369 ldiskfs: do not check enc context at lookup
a0132a79df9b LU-16444 enc: null-enc names cannot be digested form
6f490275b0e0 LU-16026 llite: always enable remote subdir mount

so I would guess the likely candidate is the LU-16160 revert.

Comment by Gerrit Updater [ 06/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49918
Subject: LU-16515 clio: Remove cl_page_size()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3b97847db29f85178f3c940332287f5fdcd7bc85

Comment by Gerrit Updater [ 09/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49960
Subject: LU-16515 kfilnd: Allow one HELLO in-flight per peer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 833f554bc9147cf58d8c6213c91f13c0bccb7c06

Comment by Patrick Farrell [ 09/Feb/23 ]

I'm doing a bisection here, so there will be some odd-looking patch uploads.  I'm going to start tagging them with BISECT.

Comment by Gerrit Updater [ 09/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49961
Subject: LU-16515 llite: BISECT SIGBUS is possible on a race with
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 666624c75fd30aed02493d1000c32dceba431f11

Comment by Gerrit Updater [ 10/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49963
Subject: LU-16515 sec: BISECT make nodemap root squash
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4a46141776c641b70521d14ae8b8c99b5f994ecf

Comment by Gerrit Updater [ 10/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49965
Subject: LU-16515 sec: BISECT nodemap root squash independ
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 45e874fcbd59044479c2f51fa6d98dc9f25ca28b

Comment by Patrick Farrell [ 13/Feb/23 ]

So, my bisecting showed that the cause of this is:
https://review.whamcloud.com/c/fs/lustre-release/+/49527

ys, can you take a look?  These tests started failing often after https://review.whamcloud.com/c/fs/lustre-release/+/49527 landed to master; they failed very rarely before.  (I identified this patch by bisecting.)

Comment by Gerrit Updater [ 14/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49918/
Subject: LU-16515 clio: Remove cl_page_size()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 19c38f6c94ae161b1c52dcb02667b4bf06cadc85

Comment by Serguei Smirnov [ 22/Feb/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/e93b1d78-fd45-4845-a0c2-03c3c13d6661

Comment by Chris Horn [ 22/Feb/23 ]

+1 on master https://testing.whamcloud.com/test_sets/f66a30e2-d742-4c8f-acc0-9e66872ab142

Comment by Arshad Hussain [ 28/Feb/23 ]

+1 on Master (https://testing.whamcloud.com/test_sets/3458dfd9-3ce1-4fee-a29d-37a09587b298)

Comment by Gerrit Updater [ 01/Mar/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50169
Subject: LU-16515 tests: See if LU-16285 fixes 118c,118d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8887fa6d22b6bc3bb18aef7f110d24a038f0e018

Comment by Patrick Farrell [ 02/Mar/23 ]

Per results of that testing, it looks like this issue is fixed by https://review.whamcloud.com/c/fs/lustre-release/+/50158 on LU-16285.  We'll wait for that to land then close this out.

Comment by Andreas Dilger [ 11/Mar/23 ]

Still seeing this error pretty regularly on master patches, even though they have the LU-16285 fix.  For example:

https://review.whamcloud.com/#/c/fs/lustre-release/+/48096/5 has v2_15_54-60-g7d05a687ee as the parent, and LU-16285 was landed as v2_15_54-59-g9d79f92076.

Comment by Andreas Dilger [ 11/Mar/23 ]

Looking through the test history again, I see test_118c and test_118d started mostly failing on 2023-01-27, after the following series of patch landings (earliest failure is v2_15_53-110-geac03ca978 which is the last patch of the series):

eac03ca978 LU-16464 osp: fix off-by-one errors in oxe_can_hold()
7e566c6a1f LU-16456 tests: skip conf-sanity test_129/132 in interop
742019b260 LU-16461 kfilnd: Modify peer credits and RX buffers
e500f49c30 LU-16451 kfilnd: Improve CQ error logging
98338572a6 LU-13530 build: Add kernel version to depmod
1335eb1d59 LU-16445 sec: make nodemap root squash independent of map_mode
b65374d96b LU-16285 ldlm: send the cancel RPC asap
f147655c33 LU-16415 quota: enforce project quota for root
96edf50039 LU-16367 misc: remove deprecated code
5efc4c1cb4 LU-16345 ofd: ofd_commitrw_read() with non-existing object
cdcf97e17e LU-16338 readahead: add stats for read-ahead page count
421b8fb4e9 LU-16333 osc: page fault in osc_release_bounce_pages()
01fb7bda97 LU-15495 tests: fixed dbench test
eb6518f7ff LU-16267 lnet: fix missing error check in LUTF
e2812e8773 LU-16228 utils: add lljobstat util
945ab61176 LU-16115 build: Linux 5.17 external module support
3774b6afbe LU-13485 ldiskfs: Parallel configure tests for ldiskfs
a346cf6cf2 LU-13485 kernel: Parallel core configure tests
182fa9be07 LU-13485 libcfs: Parallel configure tests for libcfs
b4da788a81 LU-16160 llite: SIGBUS is possible on a race with page reclaim
829af7b029 LU-16480 lov: fiemap improperly handles fm_extent_count=0
ba0d5ffc1c LU-9680 utils: new llapi_param_display_value().
178988d67a LU-14393 recovery: reply reconstruction for batched RPCs
840274b5c5 LU-14393 protocol: basic batching processing framework

The only other test_118d failure was on 2023-01-17 for patch https://review.whamcloud.com/45317 "LU-15155 llite: Make readahead request locks" which is in the middle of a bunch of other pagecache-modifying patches that haven't landed yet, so is unlikely to be related (this had confused me previously).

I see that "LU-16285 ldlm: send the cancel RPC asap" is in that batch, but we might need more than LU-16285 to fix it.

Comment by Yang Sheng [ 16/Mar/23 ]

This really confuses me. From the log:

00000080:00200000:0.0:1678874193.081576:0:332595:0:(vvp_io.c:1279:vvp_io_write_start()) f118d.sanity: write [0, 4096)
00000080:00200000:0.0:1678874193.081581:0:332595:0:(rw26.c:725:ll_write_begin()) Writing 0 of 0 to 4096 bytes
00000080:00200000:0.0:1678874193.081593:0:332595:0:(rw26.c:908:ll_write_end()) pos 0, len 4096, copied 4096
00000080:00200000:0.0:1678874193.081597:0:332595:0:(rw26.c:945:ll_write_end()) page@00000000bdbe9cf6[3 0000000097786b56 2 1 0000000095f53b23]

00000080:00200000:0.0:1678874193.081598:0:332595:0:(rw26.c:945:ll_write_end()) vmpage @00000000355f6090 fffffc0002001 3:0 ffff9ff680f81e00 0 lru
(note the vmpage flags field: fffffc0002001)
00000080:00200000:0.0:1678874193.081602:0:332595:0:(rw26.c:945:ll_write_end()) osc-page@00000000e81681ab 0: 1< 0 - - > 2< 0 0 0 0x0 0x0 | 0000000000000000 00000000516aba8e 000000004f241537 > 3< 0 0 > 4< 0 0 8 16334848 - | - - - - > 5< - - - - | 0 - | 0 - ->

00000080:00200000:0.0:1678874193.081603:0:332595:0:(rw26.c:945:ll_write_end()) end page@00000000bdbe9cf6

00000080:00200000:0.0:1678874193.081604:0:332595:0:(rw26.c:945:ll_write_end()) queued page: 1.
00000080:00200000:0.0:1678874193.081605:0:332595:0:(vvp_io.c:1169:vvp_io_write_commit()) commit async pages: 1, from 0, to 4096
00000008:00010000:0.0:1678874193.081612:0:332595:0:(osc_cache.c:802:osc_extent_find()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
00000080:00200000:0.0:1678874193.081620:0:332595:0:(vvp_io.c:1091:vvp_set_pagevec_dirty()) mapping 00000000608a50d7, count 1, dirtied 1
00000080:00200000:0.0:1678874193.081627:0:332595:0:(vvp_io.c:1192:vvp_io_write_commit()) Committed 1 pages 4096 bytes, tot: 4096
00000080:00200000:0.0:1678874193.081631:0:332595:0:(file.c:1472:ll_merge_attr()) [0x200001b78:0xd35:0x0] updating i_size 4096
00000080:00200000:0.0:1678874193.081633:0:332595:0:(file.c:4888:ll_fsync()) VFS Op:inode=[0x200001b78:0xd35:0x0](00000000101efe33), start 0, end 4095, datasync 0
00000080:00200000:0.0:1678874193.081637:0:332595:0:(vvp_io.c:1780:vvp_io_init()) [0x200001b78:0xd35:0x0] ignore/verify layout 1/0, layout version 2 restore needed 0
00020000:00200000:0.0:1678874193.081640:0:332595:0:(lov_io.c:819:lov_io_iter_init()) component[0] flags 0x10
00000008:00010000:0.0:1678874193.081645:0:332595:0:(osc_cache.c:1867:try_to_add_extent_for_io()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
00000008:00010000:0.0:1678874193.081650:0:332595:0:(osc_cache.c:1084:osc_extent_make_ready()) ### extent: 0000000023189926 ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 4/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1678874193.081662:0:332595:0:(ldlm_lock.c:1507:ldlm_lock_match_with_skip()) ### matched (0 4095) ns: lustre-OST0006-osc-ffff9ff6a6978800 lock: 00000000c663eb85/0xab5218239206893a lrc: 5/0,1 mode: PW/PW res: [0x52b5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000000000 nid: local remote: 0x643532634dfd448a expref: -99 pid: 332595 timeout: 0 lvb_type: 1
00000008:00100000:0.0:1678874193.081675:0:332595:0:(osc_request.c:1948:osc_brw_prep_request()) brw rpc 0000000034743c16 - object 0x0:21173 offset 0<>4096
00000100:00100000:1.0:1678874193.081701:0:2905:0:(client.c:1737:ptlrpc_send_new_req()) Sending RPC req@0000000034743c16 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_01:8231eeb6-f8da-45b6-923d-908cec5c2274:2905:1760422647264640:10.240.42.118@tcp:4:multiop.0
00000400:00000080:1.0:1678874194.489851:0:332674:0:(module.c:211:libcfs_ioctl()) libcfs ioctl cmd 3221775648
00000001:02000400:1.0:1678874194.489853:0:332674:0:(debug.c:725:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  sanity test_118d: @@@@@@ FAIL: No page in writeback, writeback=0

The page flags fffffc0002001 indicate that writeback has been set for this page. The OST_WRITE had not received a reply by the time the test case failed, so this flag should not have been cleared in the meantime. Then why did the test fail? Maybe I am missing something?

Comment by Andreas Dilger [ 19/Mar/23 ]

+73 failures in the past week. Please increase priority to fix this issue as it slows down all development efforts.

Comment by Patrick Farrell [ 20/Mar/23 ]

Yang Sheng,

Are you interested in looking into this further?  I haven't been able to reproduce it locally, which has made investigating it rather tricky.

Comment by Yang Sheng [ 23/Mar/23 ]

I wonder why the sanity test collects the Lustre log without the -1 debug flag? Is there an easy way to change that?

Comment by Andreas Dilger [ 23/Mar/23 ]

You could push a patch to set debug=all for this test. This is the default for sanity, but is lost when some subtests umount and remount the filesystem.

You may also be able to add this to Test-Parameters:

Test-Parameters: trivial testlist=sanity env=ONLY=118,ONLY_REPEAT=100,PTLDEBUG=all
Comment by Patrick Farrell [ 23/Mar/23 ]

I don't think it is the default?  We certainly don't normally log with +trace enabled.  When we accidentally landed a patch of mine that set debug=-1 in a test, it caused problems for other tests.

Yang Sheng,

When I need to do this, I cheat.  You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test.  Unless the test remounts Lustre, which resets to defaults, this level of debug will show up in the debug log for the test failure.  You can even do this in tests you push to Maloo, but I would set Test-Parameters to run just this test if you do that.

Comment by Andreas Dilger [ 27/Mar/23 ]

Patrick, for sanity.sh the script definitely sets full debug at the start, so that we can debug test failures more easily:

OLDDEBUG=$(lctl get_param -n debug 2> /dev/null)
lctl set_param debug=-1 2> /dev/null || true

but this is lost at some point during the tests running. That said, it looks like this is set only on the client, so possibly the same needs to be done on the servers.

You can place 'lctl set_param *debug=-1' and 'lctl set_param debug_mb=10000' (or some other large number) directly in the test, then run the test.

The VM clients only have 3GB of RAM, so 10000 is too large. The limit is 80% of RAM, but this will likely impact system operation. Maybe 1GB is the max for testing.
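As a rough illustration of that sizing rule, a requested debug_mb value can be capped against 80% of MemTotal before being set (safe_debug_mb is a made-up helper name for this sketch, not anything in test-framework.sh):

```shell
#!/bin/sh
# Illustrative sketch only: cap a requested debug_mb at 80% of RAM, per the
# limit mentioned above.  safe_debug_mb is a hypothetical helper name.
safe_debug_mb() {
    mem_kb=$1    # total RAM in kB (e.g. the MemTotal field of /proc/meminfo)
    want_mb=$2   # the value we would like to set
    cap_mb=$(( mem_kb * 80 / 100 / 1024 ))
    if [ "$want_mb" -lt "$cap_mb" ]; then
        echo "$want_mb"
    else
        echo "$cap_mb"
    fi
}

# A 3GB VM client asking for 10000MB gets capped well below that:
safe_debug_mb $(( 3 * 1024 * 1024 )) 10000   # -> 2457
```

In practice one would also want to stay well under the cap, as noted above, since filling most of RAM with trace buffers will itself perturb the test.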

Comment by Andreas Dilger [ 27/Mar/23 ]

Sheng, can you please push a patch to enable full debug for this subtest (e.g. add start_full_debug_logging and stop_full_debug_logging calls into test_118c and test_118d), and then add:

Test-Parameters: trivial testlist=sanity env=ONLY="118c 118d",ONLY_REPEAT=100
Test-Parameters: trivial testlist=sanity
Test-Parameters: trivial testlist=sanity
[repeat 20x]

If testing on the patch itself does not reproduce the problem (it is failing about 10% of runs on master this week, but may depend on previous state to fail), then the debug patch could be landed so that it will collect debugging from other patch test runs.
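For reference, a minimal sketch of what such wrapping amounts to is below; these function bodies only approximate what the real test-framework.sh helpers do and are not the actual implementation.  LCTL is parameterized here purely so the sketch can be dry-run without a Lustre mount:

```shell
#!/bin/sh
# Sketch of start/stop_full_debug_logging: save the current debug mask,
# switch to full debug with a larger buffer, and restore afterwards.
# NOT the real test-framework.sh code; LCTL defaults to lctl but can be
# overridden (e.g. LCTL=echo) for a dry run.
LCTL=${LCTL:-lctl}

start_full_debug_logging() {
    SAVED_DEBUG=$($LCTL get_param -n debug 2>/dev/null)
    $LCTL set_param debug=-1 debug_mb=1024 2>/dev/null || true
}

stop_full_debug_logging() {
    $LCTL set_param debug="$SAVED_DEBUG" 2>/dev/null || true
}
```

The important caveat from the comments above applies: any subtest that remounts Lustre resets these settings, so the helpers must bracket the subtest body itself.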

Comment by Patrick Farrell [ 27/Mar/23 ]

Interesting about full debug - that's good to know.

RE: debug_mb=10000: oh yes, absolutely.  It's sort of my lazy shorthand for "max", intended for when I want to get as much info as possible out of a specific test.  It's not appropriate for full test runs, and it does indeed cause problems if a run is done with it set like that.

Comment by Gerrit Updater [ 27/Mar/23 ]

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50439
Subject: LU-16515 tests: enable -1 log for 118c & 118d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ddbe69524ab8ec5d2147fed5674eddabdf366422

Comment by Gerrit Updater [ 29/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50470
Subject: LU-16515 tests: disable sanity test_118c/118d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3bf4218cb9e1eb68f8ffb9f15161645165b84be

Comment by Gerrit Updater [ 30/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50470/
Subject: LU-16515 tests: disable sanity test_118c/118d
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7c52cbf65218d77c0594f92981173aa7d78f6758

Comment by Yang Sheng [ 27/Apr/23 ]

I have captured a vmcore for the failure case:

crash> ll_inode_info ffff8f7192598000 -x
struct ll_inode_info {
  lli_inode_magic = 0x111d0de5,
....
  lli_fid = {
    f_seq = 0x200001b78,
    f_oid = 0xd3a,
    f_ver = 0x0
  },
.........
    i_op = 0xffffffffc106ffc0 <ll_file_inode_operations>,
    i_sb = 0xffff8f719a8ea800,
    i_mapping = 0xffff8f7192598208,
    i_security = 0xffff8f712ef42d80,
    i_ino = 0x200001b78000d3a,

Looking into the mapping:

crash> address_space 0xffff8f7192598208 -o
struct address_space {
  [ffff8f7192598208] struct inode *host;
  [ffff8f7192598210] struct xarray i_pages;
  [ffff8f7192598228] atomic_t i_mmap_writable;
...........
crash> xarray ffff8f7192598210
struct xarray {
  xa_lock = {
    {
      rlock = {
        raw_lock = {
          {
            val = {
              counter = 0
            },
            {
              locked = 0 '\000',
              pending = 0 '\000'
            },
            {
              locked_pending = 0,
              tail = 0
            }
          }
        }
      }
    }
  },
  xa_flags = 16777249,
  xa_head = 0xffffd8b842578f80,
  xarray_size_rh = 0,
  _rh = {<No data fields>}
}
crash> kmem 0xffffd8b842578f80
      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffd8b842578f80 95e3e000 ffff8f7192598208        0  2 fffffc0005028 uptodate,lru,private,writeback

So everything looks valid. RHEL8 uses XArray in place of the radix tree, so the only remaining explanation could be a race in the XArray code. But I am not sure which place is the culprit. Further investigation is needed. Any comment is appreciated.

Thanks,
YangSheng
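As a quick cross-check of the kmem output above, the low page-flag bits can be masked by hand.  The bit positions below are inferred from crash's own decode of this RHEL8 4.18 vmcore (fffffc0005028 -> uptodate,lru,private,writeback); the PG_* layout is kernel-specific, so these positions are an assumption, not a reference:

```shell
#!/bin/sh
# Decode the low page-flag bits seen in the crash kmem output above.
# Bit positions (locked=0, uptodate=3, lru=5, private=12, writeback=14)
# are inferred from crash's decode of this particular RHEL8 vmcore and
# will differ on other kernels.
decode_page_flags() {
    flags=$(( $1 ))
    out=""
    [ $(( flags & (1 << 0)  )) -ne 0 ] && out="$out locked"
    [ $(( flags & (1 << 3)  )) -ne 0 ] && out="$out uptodate"
    [ $(( flags & (1 << 5)  )) -ne 0 ] && out="$out lru"
    [ $(( flags & (1 << 12) )) -ne 0 ] && out="$out private"
    [ $(( flags & (1 << 14) )) -ne 0 ] && out="$out writeback"
    echo $out
}

decode_page_flags 0xfffffc0005028   # -> uptodate lru private writeback
```

This matches the crash decode, confirming the page in the vmcore really does still have writeback set when the test reports writeback=0.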

Generated at Sat Feb 10 03:27:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.