[LU-3455] Test failure replay-single, test_20b: after 23936 > before 9600 Created: 11/Jun/13  Updated: 26/Feb/15  Resolved: 12/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.2, Lustre 2.5.3
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: HB, prz, zfs

Issue Links:
Related
is related to LU-3626 replay-single test_20b: @@@@@@ FAIL: ... Resolved
is related to LU-4051 Timeout in replay-single, subtest tes... Resolved
Severity: 3
Rank (Obsolete): 8637

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs:
http://maloo.whamcloud.com/test_sets/27d96f00-bf41-11e2-88e0-52540035b04c
https://maloo.whamcloud.com/test_sets/96ebdda0-c483-11e2-bc10-52540035b04c
https://maloo.whamcloud.com/test_sets/7a22e7fe-c762-11e2-ae6b-52540035b04c
https://maloo.whamcloud.com/test_sets/911e66d4-c520-11e2-9b87-52540035b04c
https://maloo.whamcloud.com/test_sets/e4570cf4-c4d3-11e2-ac71-52540035b04c

The sub-test test_20b failed with the following error:

after 23936 > before 9600

Info required for matching: replay-single 20b
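For context, the failing check is a before/after comparison of used blocks on the filesystem. A minimal sketch of the kind of check that emits this message, using hypothetical BEFOREUSED/AFTERUSED/MARGIN names rather than the actual test source:

BEFOREUSED=$(df -P $DIR | tail -1 | awk '{ print $3 }')
# ... create files, fail the OST, replay, delete, wait for destroys ...
AFTERUSED=$(df -P $DIR | tail -1 | awk '{ print $3 }')
[ $AFTERUSED -gt $((BEFOREUSED + MARGIN)) ] &&
        error "after $AFTERUSED > before $BEFOREUSED"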



 Comments   
Comment by Li Wei (Inactive) [ 25/Jul/13 ]

https://maloo.whamcloud.com/test_sets/f84d0b8a-f46e-11e2-b8a2-52540035b04c

Comment by Andreas Dilger [ 13/Sep/13 ]

This is still failing quite a bit.

Comment by Andreas Dilger [ 13/Sep/13 ]

This fails quite a bit more often on review-dne-zfs, but there are still review-zfs failures as well. This might be because the amount of extra data written for DNE is higher than for non-DNE systems. Should the error margin be scaled by MDSCOUNT in the test?
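A hypothetical way to implement that suggestion in the test script (BASE_MARGIN, BEFOREUSED and AFTERUSED are made-up names; MDSCOUNT is the test-framework variable for the number of MDTs):

# Scale the allowed growth by the number of MDTs, since a DNE setup
# writes extra metadata (llogs, OI data) per MDT.
local margin=$((BASE_MARGIN * MDSCOUNT))
[ $AFTERUSED -gt $((BEFOREUSED + margin)) ] &&
        error "after $AFTERUSED > before $BEFOREUSED (allowed margin $margin)"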

Comment by Nathaniel Clark [ 15/Oct/13 ]

MDS debug_log:

00000004:00080000:0.0:1381566255.035796:0:14198:0:(osp_dev.c:1039:osp_import_event()) got inactive
00000100:00080000:0.0:1381566255.035798:0:14198:0:(pinger.c:433:ptlrpc_pinger_del_import()) removing pingable import lustre-MDT0000-mdtlov_UUID->lustre-OST0000_UUID
00000100:00080000:0.0:1381566255.035811:0:14198:0:(import.c:1484:ptlrpc_disconnect_import()) ffff8800298b4800 lustre-OST0000_UUID: changing import state from FULL to CONNECTING
00000100:00100000:0.0:1381566255.035819:0:14198:0:(client.c:1469:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc umount:lustre-MDT0000-mdtlov_UUID:14198:1448660421728700:10.10.16.157@tcp:9
00000100:00100000:0.0:1381566255.035854:0:14198:0:(client.c:2117:ptlrpc_set_wait()) set ffff8800571a9bc0 going to sleep for 5 seconds
00000004:00080000:0.0:1381566255.035961:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880056eb2800/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.035969:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728420:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.035974:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880025f2ec00/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.035977:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728428:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.035980:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880025f2e000/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.035983:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728436:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.035986:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff88006c467800/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.035989:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728444:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.035997:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880056eb2c00/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.036000:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728416:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.036004:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880056eb2000/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.036007:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728424:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.036011:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880025f2e400/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.036014:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728432:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.036018:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff88006c467c00/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.036020:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728440:10.10.16.157@tcp:6
00000100:00100000:0.0:1381566255.036050:0:13350:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc osp-pre-0:lustre-MDT0000-mdtlov_UUID:13350:1448660421728660:10.10.16.157@tcp:5
00000004:00020000:0.0:1381566255.036055:0:13350:0:(osp_precreate.c:481:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -5
00000004:00020000:0.0:1381566255.038361:0:13350:0:(osp_precreate.c:986:osp_precreate_thread()) lustre-OST0000-osc-MDT0000: cannot precreate objects: rc = -5
Comment by Nathaniel Clark [ 01/Nov/13 ]

WAG: Might this be related to http://review.whamcloud.com/6988 (from LU-3230)? That patch would cause the EIO to propagate, and thus fail the precreates and cause the size mismatch.

Update: This bug predates the landing of the accused patch, so it is not the culprit.

Comment by Nathaniel Clark [ 01/Nov/13 ]

The first occurrence of this bug is actually on http://review.whamcloud.com/5666 (LU-2342) which re-enabled this test for ZFS.

Comment by nasf (Inactive) [ 27/Dec/13 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/709b23f6-6e9f-11e3-b713-52540035b04c

Comment by Nathaniel Clark [ 10/Feb/14 ]

review-zfs on master:
https://maloo.whamcloud.com/test_sets/1b83d22e-90b8-11e3-9767-52540035b04c

Comment by Jian Yu [ 18/Feb/14 ]

Another instance on the master branch:
https://maloo.whamcloud.com/test_sets/d4bf6952-9792-11e3-877b-52540035b04c

Comment by Jian Yu [ 16/Jun/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/63/
TEST_GROUP=failover
FSTYPE=zfs

The same failure occurred: https://maloo.whamcloud.com/test_sets/1a08ad42-f374-11e3-86ca-52540035b04c

Comment by Peter Jones [ 29/Jul/14 ]

Hongchao

Could you please look into this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 15/Aug/14 ]

the debug patch is tracked at http://review.whamcloud.com/#/c/11471/

Comment by Jian Yu [ 24/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/84/
TEST_GROUP=failover
FSTYPE=zfs

The same failure occurred: https://testing.hpdd.intel.com/test_sets/e169ad78-2ab5-11e4-aec8-5254006e85c2

Comment by Hongchao Zhang [ 25/Aug/14 ]

This bug appears to be related to ZFS, which only releases occupied disk space once the DMU transactions are committed.

All runs of replay-single test_20b passed when some extra time was allowed after "wait_delete_completed_mds", and all failed without that change (a sketch of such a wait follows the function below):
(wait 5 seconds) https://testing.hpdd.intel.com/test_sessions/d79c9e64-2a69-11e4-8657-5254006e85c2
(no wait) https://testing.hpdd.intel.com/test_sessions/15b6d73a-2c4c-11e4-8362-5254006e85c2

wait_delete_completed_mds() {
        ...

        # sync MDS transactions
        do_nodes $mds2sync "$LCTL set_param -n osd*.*MD*.force_sync 1"

        # wait till all changes are sent and committed by OSTs;
        # for ldiskfs, space is released upon execution, but the
        # DMU only does this upon commit

        local WAIT=0
        while [[ $WAIT -ne $MAX_WAIT ]]; do
                changes=$(do_nodes $mds2sync \
                        "$LCTL get_param -n osc.*MDT*.sync_*" | calc_sum)
                #echo "$node: $changes changes on all"
                if [[ $changes -eq 0 ]]; then
                        etime=$(date +%s)
                        #echo "delete took $((etime - stime)) seconds"
                        return
                fi
                sleep 1
                WAIT=$(( WAIT + 1))
        done
        ...
}
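A hypothetical illustration of that workaround, wired into the point in test_20b where the used space is re-sampled; the 5-second figure comes from the runs above, while the facet_fstype check and variable names are assumptions:

wait_delete_completed_mds || error "delete did not complete"
# The destroys are sent and committed at this point, but on ZFS the
# blocks are only released when the TXG syncs, so allow extra time
# before re-sampling the used space.
if [ "$(facet_fstype ost1)" = "zfs" ]; then
        sleep 5
fi
AFTERUSED=$(df -P $DIR | tail -1 | awk '{ print $3 }')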
Comment by Jian Yu [ 31/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)
FSTYPE=zfs

The same failure occurred: https://testing.hpdd.intel.com/test_sets/158df844-311d-11e4-b503-5254006e85c2

Comment by Hongchao Zhang [ 05/Sep/14 ]

the initial patch is tracked at http://review.whamcloud.com/#/c/11767/

Comment by Hongchao Zhang [ 30/Sep/14 ]

the patch has been updated to sync the underlying OSD when a client is being evicted.

Comment by nasf (Inactive) [ 24/Oct/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/d89c27d6-5b1e-11e4-9c62-5254006e85c2

Comment by Isaac Huang (Inactive) [ 29/Oct/14 ]

This might be related to LU-4265. Free space accounting with ZFS can be quite tricky, as frees can sometimes be delayed for quite a while.
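If a fixed sleep proves fragile, an alternative would be to poll the used-block count until it falls back toward the pre-test value. A rough sketch only; the helper name, margin, and timings are invented here and are not part of the test suite:

wait_space_released() {
        # Poll until used blocks under $dir drop back to $target (plus a
        # small margin), to cope with ZFS freeing space only at TXG sync.
        local dir=$1 target=$2 max_wait=${3:-60}
        local used elapsed=0
        while [ $elapsed -lt $max_wait ]; do
                used=$(df -P $dir | tail -1 | awk '{ print $3 }')
                [ $used -le $((target + 20)) ] && return 0
                sleep 2
                elapsed=$((elapsed + 2))
        done
        return 1
}

# usage: wait_space_released $DIR $BEFOREUSED || error "space not released"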

Comment by Gerrit Updater [ 29/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11767/
Subject: LU-3455 mdt: sync when evicting client
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fabd01968d0fac83e4855da5a5dfe8f68c038ceb

Comment by Gerrit Updater [ 04/Jan/15 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13231
Subject: LU-3455 mdt: sync when evicting client
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: aa79c4974653d101be9d03200914b023970a7d41

Comment by Jodi Levi (Inactive) [ 12/Jan/15 ]

Patch landed to Master. b2_5 patch is being tracked externally.

Comment by Gerrit Updater [ 15/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13231/
Subject: LU-3455 mdt: sync when evicting client
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: fd7d369b7a513fd0b3123d16b8329a0cfc97875c

Comment by nasf (Inactive) [ 26/Feb/15 ]

We still hit it on b2_5 with ZFS backend:
https://testing.hpdd.intel.com/test_sets/05e4fbe4-bd23-11e4-a946-5254006e85c2
