[LU-13997] sanity test_418: df and lfs df blocks output mismatch Created: 25/Sep/20  Updated: 02/Dec/21

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: always_except

Issue Links:
Related
is related to LU-15105 sanity test_418: df and lfs df inodes... Open
is related to LU-13125 sanity test_418: df and lfs df inodes... Resolved
is related to LU-15010 grant shrink for MDTs Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangshilong <wshilong@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f01e487a-628b-41e2-9646-67794dd75816

test_418 failed with the following error:

'df and lfs df inodes output mismatch:  df -i: 10.9.4.8@tcp:/lustre 916924 14005 902919 2% /mnt/lustre,  lfs df -i: filesystem_summary: 917922 15003 902919 2% /mnt/lustre'


VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_418 - df and lfs df blocks output mismatch: df -i: 10.9.4.8@tcp:/lustre 916924 14005 902919 2% /mnt/lustre, lfs df -i: filesystem_summary: 917922 15003 902919 2% /mnt/lustre'



 Comments   
Comment by Alex Zhuravlev [ 28/Jul/21 ]

on master: https://testing.whamcloud.com/test_sessions/7efffeb4-d38b-4623-843a-bb9823e14897

Comment by Alex Zhuravlev [ 28/Jul/21 ]

https://testing.whamcloud.com/test_sets/04bd7c7f-cce4-4f68-adfd-81cbfa3062d0

Comment by Andreas Dilger [ 29/Jul/21 ]

+1 on master https://testing.whamcloud.com/test_sets/82e1130a-47d7-4062-ac9e-cd13d44b3b7a

Comment by Emoly Liu [ 30/Jul/21 ]

+1 on master https://testing.whamcloud.com/test_sets/5d0de25e-6cbb-46a8-9c75-2bbd5c350945

Comment by Chris Horn [ 24/Aug/21 ]

+1 on master https://testing.whamcloud.com/test_sets/253d6740-d5c2-4113-8765-731985896f6b

Comment by Qian Yingjin [ 28/Aug/21 ]

+1 on ES6 https://testing.whamcloud.com/test_sets/5635e7ed-ef1a-4c8b-b11b-d3b4d76a91fb

Comment by Alex Zhuravlev [ 30/Aug/21 ]

I noticed that the check_lfs_df() function doesn't wait at all, while it should, to let the cached data expire?

Comment by Andreas Dilger [ 30/Aug/21 ]

I see there is a "sleep 0.2" inside the loop to wait until there is an update:

        for count in {1..100}; do
                cancel_lru_locks
                sync; sleep 0.2
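
For reference, a minimal sketch of the compare-and-retry loop this implements (illustrative only; the variable names and field parsing below are assumptions, not the actual sanity.sh code):

        # illustrative sketch only, not the actual sanity.sh implementation
        opt="-i"                                # or empty to compare blocks
        for count in {1..100}; do
                cancel_lru_locks
                sync; sleep 0.2
                # used/avail columns as the kernel statfs reports them via df ...
                df_out=$(df $opt $MOUNT | awk 'NR == 2 {print $3, $4}')
                # ... and as lfs df sums them from the per-target statfs replies
                lfs_out=$($LFS df $opt $MOUNT |
                          awk '/filesystem_summary:/ {print $3, $4}')
                [ "$df_out" = "$lfs_out" ] && break
        done
        [ "$df_out" = "$lfs_out" ] ||
                error "df and lfs df $opt output mismatch: $df_out vs $lfs_out"
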
Comment by Andreas Dilger [ 01/Sep/21 ]

While this issue has been open since 2020-09-25, it really only started failing after 2021-07-20. There was one failure on 2021-07-20, and another single failure on 2021-07-22, then 4-10 failures per day since 2021-07-25. Patches landed in that date range are:

git log --oneline --before 2021-07-25 --after 2021-07-19
233344d451 LU-13417 test: generate uneven MDTs early for sanity 413
e90794af4b LU-14868 llite: revert 'simplify callback handling for async getattr'
cbc62b0b82 LU-14826 mdt: getattr_name("..") under striped directory
6a4be282bb LU-14833 sec: quiet spurious gss_init_svc_upcall() message
f75ff33d9f LU-14114 lnet: print device status in net show command
601c48f3ec LU-14804 nodemap: do not return error for improper ACL
188112fc80 LU-14300 quota: avoid nested lqe lookup
9f672d8a71 LU-14508 lfs: make mirror operations preserve timestamps
253a9e3bcd LU-12214 build: fix SLES build/install
9e1e29406c LU-12022 tests: error on resync failure sanity-flr
6cba1b949d LU-6142 libcfs: checkpatch cleanup of libcfs fail.c
9976d2c35d LU-6142 lnet: convert kiblnd/ksocknal_thread_start to vararg
a39f078041 LU-6142 lnet: discard lnet_current_net_count
a3f5aa624b LU-14217 osd-zfs: allow SEEK_HOLE/DATA only with sync
0f8db7e06a LU-14805 llite: No locked parallel DIO
1dbe63301b LU-13440 utils: fix handling of lsa_stripe_off = -1
f2a16793fa LU-14541 llite: avoid stale data reading

My guess is that "LU-13417 test: generate uneven MDTs early for sanity 413" or "LU-13440 utils: fix handling of lsa_stripe_off = -1" is causing some kind of difference in space usage for DNE that only shows up in "lfs df -i" (e.g. many remote entries), or possibly stray file/object deletions still happening in the background?

Comment by Alex Zhuravlev [ 01/Sep/21 ]

I see in the logs the following lines, repeating many times during 418:

00000004:00000024:1.0:1630450218.589991:0:617469:0:(mdt_handler.c:547:mdt_statfs()) blocks cached 10731520 granted 354615296 pending 0 free 1165684736 avail 1060012032

i.e. tgd_tot_dirty = 10731520
It looks like cancel_lru_locks in check_lfs_df() doesn't flush dirty DoM pages? Or are the dirty pages on another client?
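
One quick way to narrow this down is to compare the server-side counter with the client-side dirty accounting. The mdt.*.tot_dirty parameter is the counter behind the log line above; the cur_dirty_bytes parameters are the usual client-side counters, assuming the MDC exposes one for DoM the same way the OSC does:

        # on the MDS: dirty bytes the MDT still believes clients are holding
        lctl get_param mdt.*.tot_dirty

        # on each client: dirty bytes actually pending, per OSC and (for DoM) per MDC
        lctl get_param osc.*.cur_dirty_bytes
        lctl get_param mdc.*.cur_dirty_bytes
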

Comment by Gerrit Updater [ 01/Sep/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44803
Subject: LU-13997 tests: sanity/418 to cancel all client locks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 37505dfa7cd3a24e78c07be155481217a2258866

Comment by Andreas Dilger [ 03/Sep/21 ]

This is failing in about 1/25 of the test runs this week (31/695).

Comment by Gerrit Updater [ 04/Sep/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44845
Subject: LU-13997 tests: sanity/418 reproducer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2aa6902a3dd18e9f4b97fafa4e4f82c6bd6c258d

Comment by Andreas Dilger [ 10/Sep/21 ]

I'm not sure if this is relevant or not, but it appears that this problem hits much more frequently on ldiskfs than ZFS. For ZFS there were 3/220 failures (1/73) in the past 4 weeks, and no failures in the past 2 weeks. For ldiskfs, there were 84 / 1403 failures (1/16) with 35 in the past week (including the 95 test passes for the two patches for this issue).

Comment by Andreas Dilger [ 11/Sep/21 ]

Alex wrote:

I noticed the following message on MDT, repeating many times:

00000004:00000024:1.0:1630450218.589991:0:617469:0:(mdt_handler.c:547:mdt_statfs())
   blocks cached 10731520 granted 354615296 pending 0 free 1165684736 avail 1060012032

where "cached" are the blocks reported dirty by the client; my understanding is that it's DoM data still to be flushed

Interesting. Looking at mdt_statfs(), it subtracts (tgd->tgd_tot_dirty + tgd->tgd_tot_pending) from os_bavail before returning it to the client, but when the clients do "lfs df" they ignore the MDT os_bavail value when calculating available space, and only OST os_bavail is used (for better or worse), and that is the inconsistent value...

This really points to some other bug, that MDC/MDT grant is not calculated correctly, or that dirty DoM files are not being flushed from the clients properly, or something, since test_418 is waiting at least 20s for the output to stabilize, plus another 50s or more for test_414..test_417.

It looks like test_413a->generate_uneven_mdts() has been creating a lot of DoM files, and test_413z deleting them, since v2_14_52-97-g233344d451 landed on master (2021-07-25). This aligns with the start of the high-frequency test_418 failures on master. In fact, the first "recent" failure was 2021-07-23 on patch https://review.whamcloud.com/44384 "LU-13417 test: generate uneven MDTs early for sanity 413" before it landed, which is the patch that introduced generate_uneven_mdts(). Before that landed, test_418 only failed 0-1 times per month.

So it looks like patch 44803 may avoid this test failure, but there is something strange happening with dirty DoM files...

Comment by Alex Zhuravlev [ 11/Sep/21 ]

check_lfs_df() function cancels only local locks, but iirc sanity may use two clients?

Comment by Andreas Dilger [ 13/Sep/21 ]

check_lfs_df() function cancels only local locks, but iirc sanity may use two clients?

Sure, but the dirty DoM data should be flushed by itself far sooner than the 60-70s since the files were written. I suppose it might be possible that the data is actually written, but the client did not generate any more RPCs, so tgd_tot_dirty was not updated on the server since the last write? If that is the case, then the fix in patch 44803 is the right one and the problem is only in the test.

I just want to make sure we aren't missing some other problem, where the DoM file writes are never being sent to the MDT unless the lock is cancelled. Similar bugs have happened in the past, where writes from minutes/hours earlier were lost if the client was evicted or failed. It would be possible to create a test that writes some DoM files, waits 60s (2x the vm.dirty_expire_centisecs=3000 interval for data, and 12x the vm.dirty_writeback_centisecs=500 interval for metadata), evicts the client, and then checks that the files contain valid data.
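
A rough sketch of such a test, assuming the usual test-framework.sh helpers (do_facet, client_reconnect) and the mdt.*.evict_client parameter; untested, just to illustrate the idea:

        # write a few DoM files and record their checksums
        for ((i = 0; i < 10; i++)); do
                $LFS setstripe -E 1M -L mdt $DIR/$tfile-$i || error "can't setstripe"
                dd if=/dev/urandom of=$DIR/$tfile-$i bs=128k count=1 || error "can't dd"
                md5sum $DIR/$tfile-$i >> $TMP/$tfile.md5
        done

        # let the files age past the VM writeback windows without touching them
        sleep 60

        # evict this client from the MDT, so any DoM data not yet sent is lost
        nid=$($LCTL list_nids | head -n1)
        do_facet mds1 "$LCTL set_param -n mdt.*.evict_client=$nid"
        client_reconnect

        # after reconnect the files must still contain what was written
        md5sum -c $TMP/$tfile.md5 || error "DoM data lost after eviction"
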

Comment by Alex Zhuravlev [ 14/Sep/21 ]

here is a simple script...

+       for ((i=0; i < 100; i++)); do
+               $LFS setstripe -E 1M -L mdt $DIR/$tfile-$i || error "can't setstripe"
+               dd if=/dev/zero of=$DIR/$tfile-$i bs=128k count=1 || error "can't dd"
+       done
+       start=$SECONDS
+       for ((i=0; i < 100; i++)); do
+               grant_dirty=$($LCTL get_param -n  mdt.*.tot_dirty |
+                       awk '{sum=sum+$1} END {print sum}')
+               vm_dirty=$(awk '/Dirty:/{print $2}' /proc/meminfo)
+               echo "dirty after $((SECONDS-start)): $grant_dirty in grants, $vm_dirty in system"
+               sleep 2
+               (( vm_dirty == 0 )) && break
+       done

... and its output:

dirty after 0: 1966080 in grants, 12292 in system
dirty after 2: 1966080 in grants, 12392 in system
dirty after 4: 1966080 in grants, 12392 in system
dirty after 6: 1966080 in grants, 12424 in system
dirty after 8: 1966080 in grants, 12176 in system
dirty after 10: 1966080 in grants, 12176 in system
dirty after 12: 1966080 in grants, 12176 in system
dirty after 14: 1966080 in grants, 12176 in system
Lustre: lustre-OST0000-osc-ffff8801ee82e000: disconnect after 22s idle
dirty after 16: 1966080 in grants, 11712 in system
dirty after 18: 1966080 in grants, 11648 in system
dirty after 20: 1966080 in grants, 11648 in system
dirty after 22: 1966080 in grants, 11648 in system
dirty after 24: 1966080 in grants, 11664 in system
dirty after 26: 1966080 in grants, 11664 in system
dirty after 28: 1966080 in grants, 11664 in system
dirty after 30: 3579904 in grants, 2064 in system
dirty after 32: 3579904 in grants, 2064 in system
dirty after 34: 155648 in grants, 440 in system
dirty after 37: 155648 in grants, 416 in system
dirty after 39: 155648 in grants, 416 in system
dirty after 41: 155648 in grants, 416 in system
dirty after 43: 155648 in grants, 504 in system
dirty after 45: 155648 in grants, 16 in system
dirty after 47: 155648 in grants, 16 in system
dirty after 49: 155648 in grants, 16 in system
dirty after 51: 155648 in grants, 16 in system
dirty after 53: 155648 in grants, 0 in system

i.e. all dirty pages had been flushed, but the MDT still counts a few pages as dirty

Comment by Andreas Dilger [ 14/Sep/21 ]

Alex wrote:

dirty after 53: 155648 in grants, 0 in system

all dirty pages had been flushed, but the MDT still counts a few pages as dirty

Thanks for testing this. It looks like the test with your patch should fix this problem.

Comment by Alex Zhuravlev [ 15/Sep/21 ]

actually, now I doubt the patch will really help, as nothing I tried (cancel_lru_locks, an extra RPC, waiting some time) changes tot_dirty on the MDT side.

Comment by Alex Zhuravlev [ 15/Sep/21 ]

the problem is that grant info is only passed with OST_READ/OST_WRITE RPCs, so even with no actual dirty pages on the client side the MDT doesn't get any updates and keeps tot_dirty > 0, which in turn affects the statfs calculation.

Comment by Alex Zhuravlev [ 15/Sep/21 ]

another observation is that grant shrink seems to happen (and be handled) only on the OSC-OST connection.
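
This can be checked on an otherwise idle client by watching the per-import grant counters; the OSC grant shrinks back over time while the MDC grant stays put (assuming the MDC exposes the same cur_grant_bytes counter as the OSC):

        # run on an otherwise idle client and watch the grant held per import
        for ((i = 0; i < 30; i++)); do
                lctl get_param osc.*.cur_grant_bytes mdc.*.cur_grant_bytes
                sleep 10
        done
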

Comment by Andreas Dilger [ 16/Sep/21 ]

We need a separate patch (separate ticket?) for grant shrink to the MDTs

Comment by Alex Zhuravlev [ 16/Sep/21 ]

but the grant shrinking mechanism has its own limits too:

        /* Don't shrink if we are already above or below the desired limit
         * We don't want to shrink below a single RPC, as that will negatively
         * impact block allocation and long-term performance. */
        if (target_bytes < cli->cl_max_pages_per_rpc << PAGE_SHIFT)
                target_bytes = cli->cl_max_pages_per_rpc << PAGE_SHIFT;

        if (target_bytes >= cli->cl_avail_grant) {
                spin_unlock(&cli->cl_loi_list_lock);
                RETURN(0);
        }

add an exception for MDT?

Comment by Andreas Dilger [ 16/Sep/21 ]

add an exception for MDT?

No, I'd rather maintain good performance than make "df" totally consistent across all the clients/MDTs. Having some grant shrinking on the MDTs would still be desirable, but is outside the scope of this ticket. I filed LU-15010 for this.

Comment by Gerrit Updater [ 17/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44803/
Subject: LU-13997 tests: sanity/418 to cancel all client locks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9e8ee5aa0339b06aa53dc00f4911e2019a3ef3a0

Comment by Andreas Dilger [ 25/Sep/21 ]

It looks like this is still failing on master, for example:
https://testing.whamcloud.com/test_sets/2121a4d8-1472-473c-875c-4b3b1288b726

Comment by Chris Horn [ 27/Sep/21 ]

+1 on master - https://testing.whamcloud.com/test_sets/1c8664ee-840b-453d-bbcd-c80e7b75d7d4

Comment by Alex Zhuravlev [ 27/Sep/21 ]

I hope https://review.whamcloud.com/#/c/44956/ can help

Comment by Chris Horn [ 07/Oct/21 ]

+1 on master - https://testing.whamcloud.com/test_sets/302baca3-ea35-4824-a7fe-d4a6995111d7

Comment by Gerrit Updater [ 13/Oct/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45231
Subject: LU-13997 tests: fix sanity test_418 lock cancellation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4bb910bba91199af21199080c5f3eb1070c59c9

Comment by Gerrit Updater [ 26/Oct/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45231/
Subject: LU-13997 tests: fix sanity test_418 lock cancellation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c0596c4d8af3bce6565c0e16a1469fcc65191404

Comment by Peter Jones [ 27/Oct/21 ]

Landed for 2.15

Comment by Andreas Dilger [ 27/Oct/21 ]

The landed patch only avoids the test failure; it does not actually fix the underlying issue, which is why the test is marked always_except for now.
