[LU-13997] sanity test_418: df and lfs df blocks output mismatch Created: 25/Sep/20 Updated: 02/Dec/21 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | always_except | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for wangshilong <wshilong@ddn.com>.

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f01e487a-628b-41e2-9646-67794dd75816

test_418 failed with the following error:

'df and lfs df inodes output mismatch: df -i: 10.9.4.8@tcp:/lustre 916924 14005 902919 2% /mnt/lustre, lfs df -i: filesystem_summary: 917922 15003 902919 2% /mnt/lustre' |
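For reference, the mismatch being checked is between the last line of "df -i" for the Lustre mount and the "filesystem_summary:" line of "lfs df -i". A minimal sketch of that comparison (illustrative only, not the actual sanity.sh test_418 code; variable names are assumptions):

	#!/bin/bash
	# Sketch: compare the used-inode count from "df -i" with the
	# filesystem_summary line of "lfs df -i" for the same mount point.
	MOUNT=${MOUNT:-/mnt/lustre}
	LFS=${LFS:-lfs}

	# last line of "df -i": <fs> <inodes> <iused> <ifree> <iuse%> <mnt>
	df_used=$(df -i "$MOUNT" | awk 'END {print $3}')
	# summary line of "lfs df -i": filesystem_summary: <inodes> <iused> ...
	lfs_used=$($LFS df -i "$MOUNT" | awk '/filesystem_summary:/ {print $3}')

	echo "df -i IUsed: $df_used, lfs df -i used: $lfs_used"
	(( df_used == lfs_used )) || echo "inodes output mismatch"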
| Comments |
| Comment by Alex Zhuravlev [ 28/Jul/21 ] |
|
on master: https://testing.whamcloud.com/test_sessions/7efffeb4-d38b-4623-843a-bb9823e14897 |
| Comment by Alex Zhuravlev [ 28/Jul/21 ] |
|
https://testing.whamcloud.com/test_sets/04bd7c7f-cce4-4f68-adfd-81cbfa3062d0 |
| Comment by Andreas Dilger [ 29/Jul/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/82e1130a-47d7-4062-ac9e-cd13d44b3b7a |
| Comment by Emoly Liu [ 30/Jul/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/5d0de25e-6cbb-46a8-9c75-2bbd5c350945 |
| Comment by Chris Horn [ 24/Aug/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/253d6740-d5c2-4113-8765-731985896f6b |
| Comment by Qian Yingjin [ 28/Aug/21 ] |
|
+1 on ES6 https://testing.whamcloud.com/test_sets/5635e7ed-ef1a-4c8b-b11b-d3b4d76a91fb |
| Comment by Alex Zhuravlev [ 30/Aug/21 ] |
|
I noticed that the check_lfs_df() function doesn't wait at all, though it should, to let cached data expire? |
| Comment by Andreas Dilger [ 30/Aug/21 ] |
|
I see there is a "sleep 0.2" inside the loop to wait until there is an update:

	for count in {1..100}; do
		cancel_lru_locks
		sync; sleep 0.2
|
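For context, a minimal sketch of what such a retry loop could look like (this is not the actual check_lfs_df() from sanity.sh; the helper names cancel_lru_locks and error follow sanity.sh conventions, and the comparison details are assumptions):

	check_df_vs_lfs_df() {
		local mnt=$1
		local df_used lfs_used count

		for count in {1..100}; do
			cancel_lru_locks        # drop cached client locks
			sync; sleep 0.2         # give stats a moment to settle

			# compare the "used" 1K-blocks column from both tools
			df_used=$(df -P "$mnt" | awk 'END {print $3}')
			lfs_used=$($LFS df "$mnt" |
				   awk '/filesystem_summary:/ {print $3}')
			(( df_used == lfs_used )) && return 0
		done
		error "df and lfs df blocks output mismatch: $df_used vs $lfs_used"
	}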
| Comment by Andreas Dilger [ 01/Sep/21 ] |
|
While this issue has been open since 2020-09-25, it really only started failing after 2021-07-20. There was one failure on 2021-07-20 and another single failure on 2021-07-22, then between 4-10 per day since 2021-07-25. Patches landed in that date range are:

	git log --oneline --before 2021-07-25 --after 2021-07-19
233344d451 LU-13417 test: generate uneven MDTs early for sanity 413
e90794af4b LU-14868 llite: revert 'simplify callback handling for async getattr'
cbc62b0b82 LU-14826 mdt: getattr_name("..") under striped directory
6a4be282bb LU-14833 sec: quiet spurious gss_init_svc_upcall() message
f75ff33d9f LU-14114 lnet: print device status in net show command
601c48f3ec LU-14804 nodemap: do not return error for improper ACL
188112fc80 LU-14300 quota: avoid nested lqe lookup
9f672d8a71 LU-14508 lfs: make mirror operations preserve timestamps
253a9e3bcd LU-12214 build: fix SLES build/install
9e1e29406c LU-12022 tests: error on resync failure sanity-flr
6cba1b949d LU-6142 libcfs: checkpatch cleanup of libcfs fail.c
9976d2c35d LU-6142 lnet: convert kiblnd/ksocknal_thread_start to vararg
a39f078041 LU-6142 lnet: discard lnet_current_net_count
a3f5aa624b LU-14217 osd-zfs: allow SEEK_HOLE/DATA only with sync
0f8db7e06a LU-14805 llite: No locked parallel DIO
1dbe63301b LU-13440 utils: fix handling of lsa_stripe_off = -1
f2a16793fa LU-14541 llite: avoid stale data reading
My guess is that " |
| Comment by Alex Zhuravlev [ 01/Sep/21 ] |
|
I see in the logs the following line, repeating many times during 418:

	00000004:00000024:1.0:1630450218.589991:0:617469:0:(mdt_handler.c:547:mdt_statfs()) blocks cached 10731520 granted 354615296 pending 0 free 1165684736 avail 1060012032

i.e. tgd_tot_dirty = 10731520 |
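A small hedged sketch for watching those counters from the MDS while test_418 runs (mdt.*.tot_dirty is the parameter used later in this ticket; tot_granted and tot_pending are assumed to be exported under the same prefix):

	# sample the MDT grant counters every couple of seconds
	while sleep 2; do
		echo "=== $(date +%T) ==="
		lctl get_param mdt.*.tot_dirty mdt.*.tot_granted mdt.*.tot_pending
	done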
| Comment by Gerrit Updater [ 01/Sep/21 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44803 |
| Comment by Andreas Dilger [ 03/Sep/21 ] |
|
This is failing about 1/25 of the test runs this week (31/695). |
| Comment by Gerrit Updater [ 04/Sep/21 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44845 |
| Comment by Andreas Dilger [ 10/Sep/21 ] |
|
I'm not sure if this is relevant or not, but it appears that this problem hits much more frequently on ldiskfs than ZFS. For ZFS there were 3/220 failures (1/73) in the past 4 weeks, and no failures in the past 2 weeks. For ldiskfs, there were 84 / 1403 failures (1/16) with 35 in the past week (including the 95 test passes for the two patches for this issue). |
| Comment by Andreas Dilger [ 11/Sep/21 ] |
|
Alex wrote:
Interesting. Looking at mdt_statfs(), it subtracts (tgd->tgd_tot_dirty + tgd->tgd_tot_pending) from os_bavail before returning it to the client, but when the clients do "lfs df" they ignore the MDT os_bavail value when calculating available space, and only OST os_bavail is used (for better or worse), and that is the inconsistent value...

This really points to some other bug, that MDC/MDT grant is not calculated correctly, or that dirty DoM files are not being flushed from the clients properly, or something, since test_418 waits at least 20s for the output to stabilize, plus another 50s or more for test_414..test_417.

It looks like test_413a->generate_uneven_mdts() has been creating a lot of DoM files, and test_413z deleting them, since v2_14_52-97-g233344d451 on master (2021-07-25). This aligns with the start of high-frequency test_418 failures on master. In fact, the first "recent" failure was 2021-07-23 on patch https://review.whamcloud.com/44384.

So it looks like patch 44803 may avoid this test failure, but there is something strange happening with dirty DoM files... |
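A quick way to see which row carries the inconsistent value is to print the "avail" column for each target and the summary next to the df output; a hedged sketch (column positions assumed from the lfs df output format shown in the description):

	# print <target> <avail-KB> for each MDT/OST row and the summary,
	# then the "Available" column of df for comparison
	$LFS df /mnt/lustre | awk '/MDT|OST|filesystem_summary/ {print $1, $4}'
	df -P /mnt/lustre   | awk 'END {print "df avail:", $4}'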
| Comment by Alex Zhuravlev [ 11/Sep/21 ] |
|
the check_lfs_df() function cancels only local locks, but IIRC sanity may use two clients? |
| Comment by Andreas Dilger [ 13/Sep/21 ] |
Sure, but the dirty DoM data should be flushed by itself far sooner than 60-70s after the files were written. I suppose it might be possible that the data was actually written, but the client did not generate any more RPCs, so tgd_tot_dirty was not updated on the server since the last write? If that is the case, then the fix in patch 44384 is the right one and the problem is only in the test.

I just want to make sure we aren't missing some other problem, where the DoM file writes are never being sent to the MDT unless the lock is cancelled. Similar bugs have happened in the past, where writes from minutes/hours earlier are lost if the client is evicted or fails.

It would be possible to create a test that writes some DoM files, waits 60s (2x the vm.dirty_expire_centisecs=3000 interval for data, and 12x the vm.dirty_writeback_centisecs=500 interval for metadata), evicts the client, and then checks that the files contain valid data. |
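A rough sketch of the test described above (illustrative only; the sanity.sh-style helpers do_facet, error, client_reconnect and the mdt.*.evict_client parameter usage are assumptions here, not a finished test):

	test_dom_eviction_sketch() {
		local i
		rm -f $TMP/$tfile.md5

		for ((i = 0; i < 10; i++)); do
			$LFS setstripe -E 1M -L mdt $DIR/$tfile-$i ||
				error "setstripe failed"
			dd if=/dev/urandom of=$DIR/$tfile-$i bs=128k count=1 ||
				error "dd failed"
			md5sum $DIR/$tfile-$i >> $TMP/$tfile.md5
		done

		# 2x vm.dirty_expire_centisecs=3000, so dirty data should have
		# been written back on its own, without any lock cancellation
		sleep 60

		# evict this client from MDT0000 (the NID is taken on the test
		# node, which is assumed to be the client)
		do_facet mds1 "$LCTL set_param -n mdt.${FSNAME}-MDT0000.evict_client nid:$($LCTL list_nids | head -n 1)"
		client_reconnect

		# data written a minute before the eviction must still be valid
		md5sum -c $TMP/$tfile.md5 || error "DoM data lost after eviction"
	}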
| Comment by Alex Zhuravlev [ 14/Sep/21 ] |
|
here is a simple script...

	+	for ((i=0; i < 100; i++)); do
	+		$LFS setstripe -E 1M -L mdt $DIR/$tfile-$i || error "can't setstripe"
	+		dd if=/dev/zero of=$DIR/$tfile-$i bs=128k count=1 || error "can't dd"
	+	done
	+	start=$SECONDS
	+	for ((i=0; i < 100; i++)); do
	+		grant_dirty=$($LCTL get_param -n mdt.*.tot_dirty |
	+			awk '{sum=sum+$1} END {print sum}')
	+		vm_dirty=$(awk '/Dirty:/{print $2}' /proc/meminfo)
	+		echo "dirty after $((SECONDS-start)): $grant_dirty in grants, $vm_dirty in system"
	+		sleep 2
	+		(( vm_dirty == 0 )) && break
	+	done

... and its output:

	dirty after 0: 1966080 in grants, 12292 in system
	dirty after 2: 1966080 in grants, 12392 in system
	dirty after 4: 1966080 in grants, 12392 in system
	dirty after 6: 1966080 in grants, 12424 in system
	dirty after 8: 1966080 in grants, 12176 in system
	dirty after 10: 1966080 in grants, 12176 in system
	dirty after 12: 1966080 in grants, 12176 in system
	dirty after 14: 1966080 in grants, 12176 in system
	Lustre: lustre-OST0000-osc-ffff8801ee82e000: disconnect after 22s idle
	dirty after 16: 1966080 in grants, 11712 in system
	dirty after 18: 1966080 in grants, 11648 in system
	dirty after 20: 1966080 in grants, 11648 in system
	dirty after 22: 1966080 in grants, 11648 in system
	dirty after 24: 1966080 in grants, 11664 in system
	dirty after 26: 1966080 in grants, 11664 in system
	dirty after 28: 1966080 in grants, 11664 in system
	dirty after 30: 3579904 in grants, 2064 in system
	dirty after 32: 3579904 in grants, 2064 in system
	dirty after 34: 155648 in grants, 440 in system
	dirty after 37: 155648 in grants, 416 in system
	dirty after 39: 155648 in grants, 416 in system
	dirty after 41: 155648 in grants, 416 in system
	dirty after 43: 155648 in grants, 504 in system
	dirty after 45: 155648 in grants, 16 in system
	dirty after 47: 155648 in grants, 16 in system
	dirty after 49: 155648 in grants, 16 in system
	dirty after 51: 155648 in grants, 16 in system
	dirty after 53: 155648 in grants, 0 in system

i.e. all dirty pages had been flushed, but the MDT still counts a few pages dirty |
| Comment by Andreas Dilger [ 14/Sep/21 ] |
Thanks for testing this. It looks like the test with your patch should fix this problem. |
| Comment by Alex Zhuravlev [ 15/Sep/21 ] |
|
actually now I doubt the patch will really help, as nothing I tried (cancel_lru_locks, extra RPCs, waiting some time) changes tot_dirty on the MDT side. |
| Comment by Alex Zhuravlev [ 15/Sep/21 ] |
|
the problem is that grant info is passed with OST_READ/OST_WRITE RPCs, so even with no actual dirty pages on the client side the MDT doesn't get any updates and keeps tot_dirty > 0, which in turn affects the statfs calculation. |
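If that is the behaviour, a hedged way to confirm it on a single-node test setup (where the MDT parameters are visible locally) would be to force one more write RPC to the MDT and re-check the counter; the file name and sizes below are illustrative:

	# tot_dirty stays at its stale value while no I/O RPCs reach the MDT
	lctl get_param mdt.*.tot_dirty

	# append a little more data to an existing DoM file and sync;
	# the resulting write RPC should carry current grant/dirty info
	dd if=/dev/zero of=$DIR/$tfile bs=4k count=1 conv=notrunc oflag=append
	sync

	# re-check after the write RPC has been sent
	lctl get_param mdt.*.tot_dirty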
| Comment by Alex Zhuravlev [ 15/Sep/21 ] |
|
another observation is that grant shrinking seems to happen (and be handled) on the OSC-OST connection only. |
| Comment by Andreas Dilger [ 16/Sep/21 ] |
|
We need a separate patch (separate ticket?) for grant shrink to the MDTs |
| Comment by Alex Zhuravlev [ 16/Sep/21 ] |
|
but the grant shrinking mechanism has its own limits too:

	/* Don't shrink if we are already above or below the desired limit
	 * We don't want to shrink below a single RPC, as that will negatively
	 * impact block allocation and long-term performance. */
	if (target_bytes < cli->cl_max_pages_per_rpc << PAGE_SHIFT)
		target_bytes = cli->cl_max_pages_per_rpc << PAGE_SHIFT;

	if (target_bytes >= cli->cl_avail_grant) {
		spin_unlock(&cli->cl_loi_list_lock);
		RETURN(0);
	}

add an exception for the MDT? |
| Comment by Andreas Dilger [ 16/Sep/21 ] |
No, I'd rather maintain good performance than make "df" totally consistent across all the clients/MDTs. Having some grant shrinking on the MDTs would still be desirable, but is outside the scope of this ticket. I filed |
| Comment by Gerrit Updater [ 17/Sep/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44803/ |
| Comment by Andreas Dilger [ 25/Sep/21 ] |
|
It looks like this is still failing on master, for example: |
| Comment by Chris Horn [ 27/Sep/21 ] |
|
+1 on master - https://testing.whamcloud.com/test_sets/1c8664ee-840b-453d-bbcd-c80e7b75d7d4 |
| Comment by Alex Zhuravlev [ 27/Sep/21 ] |
|
I hope https://review.whamcloud.com/#/c/44956/ can help |
| Comment by Chris Horn [ 07/Oct/21 ] |
|
+1 on master - https://testing.whamcloud.com/test_sets/302baca3-ea35-4824-a7fe-d4a6995111d7 |
| Comment by Gerrit Updater [ 13/Oct/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45231 |
| Comment by Gerrit Updater [ 26/Oct/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45231/ |
| Comment by Peter Jones [ 27/Oct/21 ] |
|
Landed for 2.15 |
| Comment by Andreas Dilger [ 27/Oct/21 ] |
|
The landed patch only avoids the test failure; it does not actually fix the underlying problem, which is why the test is marked always_except for now. |