[LU-12572] sanity-pfl test_20b: Delete is not completed in 29 seconds |
| Created: | 22/Jul/19 | Updated: | 28/Oct/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/d66394cc-abc3-11e9-a0be-52540065bddc

test_20b failed with the following error:

CMD: trevis-70vm4 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Delete is not completed in 29 seconds
CMD: trevis-70vm4 /usr/sbin/lctl get_param osc.*MDT*.sync_*
osc.lustre-OST0000-osc-MDT0000.sync_changes=0
osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1
osc.lustre-OST0000-osc-MDT0002.sync_changes=0
osc.lustre-OST0000-osc-MDT0002.sync_in_flight=0
osc.lustre-OST0000-osc-MDT0002.sync_in_progress=0
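
For reference, the condition the test times out on can be checked by hand on the MDS (trevis-70vm4 in this run). The sketch below is illustrative only and is not taken from the test code: it forces an MDT-to-OST sync and polls the same sync_* counters until they all read zero, with an arbitrary 30-second budget.

    # Illustrative only: run as root on the MDS, outside the test suite.
    lctl set_param -n "os[cd]*.*MD*.force_sync" 1

    for i in $(seq 1 30); do
        # Sum every osc.*MDT*.sync_changes/sync_in_flight/sync_in_progress counter;
        # the test treats the delete as complete once this sum reaches zero.
        total=$(lctl get_param -n "osc.*MDT*.sync_*" | awk '{sum += $1} END {print sum + 0}')
        if [ "$total" -eq 0 ]; then
            echo "delete completed after about ${i}s"
            break
        fi
        [ "$i" -eq 30 ] && echo "delete still not complete after 30s"
        sleep 1
    done

In the run above, osc.lustre-OST0000-osc-MDT0000.sync_in_progress was still 1 when the test gave up. |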
| Comments |
| Comment by Jian Yu [ 27/Aug/19 ] |
|
+1 on master branch: https://testing.whamcloud.com/test_sets/5e5172b4-c8e8-11e9-90ad-52540065bddc |
| Comment by James Nunez (Inactive) [ 28/Aug/19 ] |
|
sanity-pfl test 20b is failing for review-dne-zfs-part-4 at a high rate. We've seen this test fail regularly since 12 July 2019, with two earlier failures in June 2019: https://testing.whamcloud.com/test_sets/f5ff99e2-9a1c-11e9-b26a-52540065bddc and https://testing.whamcloud.com/test_sets/19cb1252-90b8-11e9-abe3-52540065bddc .

In some of these failures, for example https://testing.whamcloud.com/test_sets/b04ecaf6-c953-11e9-97d5-52540065bddc, we see an error in the client test_log:

CMD: trevis-44vm4 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
trevis-44vm4: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0000-osc-MDT0000/force_sync=1: Connection timed out
CMD: trevis-44vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*

In other failures, for example https://testing.whamcloud.com/test_sets/2894009c-b56f-11e9-b023-52540065bddc, we see that some of the osc sync parameters never drop to 0:

Delete is not completed in 29 seconds
CMD: trevis-70vm4 /usr/sbin/lctl get_param osc.*MDT*.sync_*
osc.lustre-OST0000-osc-MDT0000.sync_changes=0
osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1

In this case, we also see the following in the OSS console:

[14628.488316] Lustre: DEBUG MARKER: dmesg
[14629.089241] LustreError: 5730:0:(pack_generic.c:425:lustre_msg_buf_v2()) msg ffff9e3e5937f600 buffer[0] size 152 too small (required 184, opc=-1)
[14629.091601] LustreError: 5730:0:(pack_generic.c:425:lustre_msg_buf_v2()) Skipped 815 previous similar messages

It is not clear whether this contributes to the failures, but in some runs, for example https://testing.whamcloud.com/test_sets/1996cd30-b2f5-11e9-b753-52540065bddc, we see the following on the OSS console:

[13183.416656] Lustre: ll_ost00_035: service thread pid 6546 was inactive for 40.145 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
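
The "Connection timed out" case looks like the set_param racing with the OSC import reconnecting after failover. As a purely illustrative check (not something the test framework currently does), retrying the same command for a while would show whether that failure is transient:

    # Hypothetical retry around the command that failed on trevis-44vm4; run as root on the MDS.
    for attempt in $(seq 1 10); do
        if lctl set_param -n "os[cd]*.*MD*.force_sync" 1; then
            echo "force_sync set on attempt $attempt"
            break
        fi
        echo "force_sync attempt $attempt failed (import not ready?), retrying in 5s"
        sleep 5
    done

If the command still times out after roughly 50 seconds of retries, the problem is the connection itself rather than the sync wait. |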
| Comment by Jian Yu [ 26/Jan/20 ] |
|
Still failed on master branch: https://testing.whamcloud.com/test_sets/3c217dac-406f-11ea-9543-52540065bddc |
| Comment by Emoly Liu [ 13/Jun/20 ] |
|
more on master: |
| Comment by Patrick Farrell [ 08/Jul/20 ] |
|
Brief notes from a minimal look: this is failing in the cleanup of the ENOSPC/low-space setup the test does, specifically the ost_watermarks_clear_enospc path. The failure itself is in wait_delete_completed_mds, in this part:

    local WAIT=0
    while [[ $WAIT -ne $max_wait ]]; do
        changes=$(do_nodes $mds2sync \
            "$LCTL get_param -n osc.*MDT*.sync_*" | calc_sum)
        #echo "$node: $changes changes on all"
        if [[ $changes -eq 0 ]]; then
            wait_zfs_commit $SINGLEMDS # the occupied disk space will be released
                                       # only after TXGs are committed
            wait_zfs_commit ost1
            return 0
        fi
        sleep 1
        WAIT=$((WAIT + 1))
    done
On failure, we dump the sync parameters being summed there, and they all show as zero. Most likely that means we were stuck waiting for ZFS commit (unless there is a race between the "while [[ $WAIT .. ]]" timer and this check). Since this appears to fail only in ZFS testing (I think — I am relying on past comments and the most recent failure), I strongly suspect it is the ZFS commit wait. So it looks like, sometimes after failover, we wait more than 30 seconds for ZFS to commit. I have not dug in further.
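
If that reading is right (counters already at zero, so the time is going into wait_zfs_commit), the wait is purely for the backend TXGs to commit. A rough manual check on the OSS/MDS, assuming a ZFS 0.8+ userland where zpool sync exists and using an illustrative pool name, would be:

    # Illustrative: force an immediate TXG commit on the OST/MDT pools (pool name is made up).
    zpool sync lustre-ost1
    # or commit every imported pool:
    zpool sync
    # The interval between automatic TXG commits (5s by default):
    cat /sys/module/zfs/parameters/zfs_txg_timeout

If failing runs pass once TXGs are forced like this, that would support the theory that we are simply waiting on ZFS commit after failover. |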
| Comment by Patrick Farrell [ 08/Jul/20 ] |
|
Note that, in contrast to the earlier failures James highlighted, the most recent failure has no error from the force_sync set_param. That seems highly relevant: such an error would explain getting stuck, but it is absent here. I suspect that in the earlier cases the connection did not come back up after the failover (so we failed to set force_sync, so we failed this check, and so on), whereas here it looks more like a ZFS/storage issue.
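
For future hits, a quick way to keep those two failure modes separate when triaging would be something like the grep below; the log file name is a placeholder for the client test_log/suite_log of a failed run.

    # Hypothetical triage: "could not set force_sync" vs "set it, but still timed out".
    log=suite_log.txt   # placeholder path
    if grep -q "error: set_param: setting .*force_sync" "$log"; then
        echo "connection-type failure: force_sync could not be set after failover"
    else
        echo "sync-type failure: force_sync was set cleanly; likely waiting on ZFS commit"
    fi

That split matches the two groups of failures described in the comments above. |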