[LU-12572] sanity-pfl test_20b: Delete is not completed in 29 seconds Created: 22/Jul/19  Updated: 28/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11987 sanity test 59 fails with “test_59 fa... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d66394cc-abc3-11e9-a0be-52540065bddc

test_20b failed with the following error:

CMD: trevis-70vm4 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Delete is not completed in 29 seconds
CMD: trevis-70vm4 /usr/sbin/lctl get_param osc.*MDT*.sync_*
osc.lustre-OST0000-osc-MDT0000.sync_changes=0
osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1
osc.lustre-OST0000-osc-MDT0002.sync_changes=0
osc.lustre-OST0000-osc-MDT0002.sync_in_flight=0
osc.lustre-OST0000-osc-MDT0002.sync_in_progress=0
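
The check that times out here can be reproduced by hand on the MDS node. A minimal, assumed sketch of what the test effectively does (not taken from sanity-pfl itself):

# Force the MDT->OST sync of pending destroys, then poll the per-OSC sync_*
# counters; the test expects them all to reach 0 within ~30 seconds.
lctl set_param -n os[cd]*.*MD*.force_sync 1
for i in $(seq 30); do
        pending=$(lctl get_param -n osc.*MDT*.sync_* | awk '{sum += $1} END {print sum + 0}')
        [ "$pending" -eq 0 ] && { echo "deletes completed after ${i}s"; break; }
        sleep 1
done
lctl get_param osc.*MDT*.sync_*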

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-pfl test_20b - test_20b returned 1



 Comments   
Comment by Jian Yu [ 27/Aug/19 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/5e5172b4-c8e8-11e9-90ad-52540065bddc

Comment by James Nunez (Inactive) [ 28/Aug/19 ]

sanity-pfl test 20b is failing for review-dne-zfs-part-4 at a high rate. We've seen this test fail since at least 12 July 2019, with two earlier failures in June 2019: https://testing.whamcloud.com/test_sets/f5ff99e2-9a1c-11e9-b26a-52540065bddc and https://testing.whamcloud.com/test_sets/19cb1252-90b8-11e9-abe3-52540065bddc .

In some of these failures, for example at https://testing.whamcloud.com/test_sets/b04ecaf6-c953-11e9-97d5-52540065bddc, we see an error in the client test_log:

CMD: trevis-44vm4 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
trevis-44vm4: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0000-osc-MDT0000/force_sync=1: Connection timed out
CMD: trevis-44vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
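
A transient set_param failure like this would by itself explain the later timeout: the deletes are never force-synced, so the sync_* counters cannot drain. A hypothetical retry around the existing call (retry count, sleep, and $mds_node are placeholders, not current test code) would separate a reconnecting import from a genuinely stuck sync:

# Hypothetical retry wrapper around the force_sync call shown above
for attempt in 1 2 3; do
        do_node $mds_node "$LCTL set_param -n os[cd]*.*MD*.force_sync 1" && break
        echo "force_sync attempt $attempt failed, retrying"
        sleep 2
done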

In some failures, for example https://testing.whamcloud.com/test_sets/2894009c-b56f-11e9-b023-52540065bddc, we see that some of the osc parameters are not changed to 0:

Delete is not completed in 29 seconds
CMD: trevis-70vm4 /usr/sbin/lctl get_param osc.*MDT*.sync_*
osc.lustre-OST0000-osc-MDT0000.sync_changes=0
osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1

In this case, we also see the following on the OSS console:

 [14628.488316] Lustre: DEBUG MARKER: dmesg
[14629.089241] LustreError: 5730:0:(pack_generic.c:425:lustre_msg_buf_v2()) msg ffff9e3e5937f600 buffer[0] size 152 too small (required 184, opc=-1)
[14629.091601] LustreError: 5730:0:(pack_generic.c:425:lustre_msg_buf_v2()) Skipped 815 previous similar messages

It is not clear whether this contributes to the failure, but for some failures, for example https://testing.whamcloud.com/test_sets/1996cd30-b2f5-11e9-b753-52540065bddc, we see the following on the OSS console:

[13183.416656] Lustre: ll_ost00_035: service thread pid 6546 was inactive for 40.145 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Comment by Jian Yu [ 26/Jan/20 ]

Still failed on master branch: https://testing.whamcloud.com/test_sets/3c217dac-406f-11ea-9543-52540065bddc

Comment by Emoly Liu [ 13/Jun/20 ]

more on master: 
https://testing.whamcloud.com/test_sets/53a2b7d2-5056-4261-9ae8-c9cf52f4c47c
https://testing.whamcloud.com/test_sets/65adf9fc-24c8-40f2-b969-4a2a4ffde057

Comment by Patrick Farrell [ 08/Jul/20 ]

Brief notes from a minimal look:

This is failing in the cleanup phase of the test, which sets up ENOSPC/low-space conditions. Specifically, the test registers:

stack_trap "ost_watermarks_clear_enospc $tfile $ost_idx1 $wms" EXIT

so ost_watermarks_clear_enospc runs on exit, and the failure comes from wait_delete_completed_mds, in this part:

        local WAIT=0
        while [[ $WAIT -ne $max_wait ]]; do
                changes=$(do_nodes $mds2sync \
                        "$LCTL get_param -n osc.*MDT*.sync_*" | calc_sum)
                #echo "$node: $changes changes on all"
                if [[ $changes -eq 0 ]]; then
                        wait_zfs_commit $SINGLEMDS

                        # the occupied disk space will be released
                        # only after TXGs are committed
                        wait_zfs_commit ost1
                        return 0
                fi
                sleep 1
                WAIT=$((WAIT + 1))
        done 
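
calc_sum is not shown above; the $changes value is simply the sum of every per-OSC sync_* counter. A rough, assumed equivalent of that computation (treating calc_sum as a per-line sum):

# Assumed equivalent of the $changes computation in the loop above; summing
# the last field tolerates the "node:" prefix that do_nodes may add.
changes=$(do_nodes $mds2sync "$LCTL get_param -n osc.*MDT*.sync_*" |
          awk '{sum += $NF} END {print sum + 0}')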

On failure, we dump the sync parameters we're summing there, and they all show as zero.

Most likely, this means we were waiting for ZFS commit (unless there is a race between the "while [[ $WAIT .. ]]" timer and this check).

Since this appears to be failing only in ZFS testing (I am relying on past comments and the most recent failure), I strongly suspect it's waiting for ZFS commit.

So basically it looks to me like sometimes after failover we're waiting more than 30 seconds for ZFS commit.  I have not dug in further.
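
If slow ZFS TXG commit is the culprit, the txg kstats on the ZFS MDS/OSS would show it directly. A debugging sketch, not part of the test (pool name "lustre-mdt1" is a placeholder):

# Recent TXG history; the time columns show how long each transaction group
# spent in the open/quiesce/sync states (nanoseconds).
cat /proc/spl/kstat/zfs/lustre-mdt1/txgs | tail -n 20
# Force an immediate commit of the current TXG (OpenZFS 0.8+)
zpool sync lustre-mdt1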

Comment by Patrick Farrell [ 08/Jul/20 ]

Note that in contrast to the earlier failures James highlighted, the most recent failure does not have any errors in the force_sync set_param.  That seems highly relevant because such an error would explain getting stuck, but it's not present in this most recent failure.  I suspect in the earlier case we had an issue with the connection coming back up after the failover (so we failed to set sync, so we failed this check, etc), whereas here it looks more like a ZFS/storage issue.
