[LU-3455] Test failure replay-single, test_20b: after 23936 > before 9600 Created: 11/Jun/13 Updated: 26/Feb/15 Resolved: 12/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.2, Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, prz, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8637 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:

The sub-test test_20b failed with the following error:
Info required for matching: replay-single 20b |
| Comments |
| Comment by Li Wei (Inactive) [ 25/Jul/13 ] |
|
https://maloo.whamcloud.com/test_sets/f84d0b8a-f46e-11e2-b8a2-52540035b04c |
| Comment by Andreas Dilger [ 13/Sep/13 ] |
|
This is still failing quite a bit. |
| Comment by Andreas Dilger [ 13/Sep/13 ] |
|
This fails quite a bit more often on review-dne-zfs, but there are still review-zfs failures as well. This might be because the amount of extra data written for DNE is higher than on non-DNE systems. Should the error margin be scaled by MDSCOUNT in the test, as sketched below?
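For illustration, a minimal sketch of that idea, assuming the test records the used space in BEFOREUSED/AFTERUSED the way replay-single does (the margin value and variable names here are illustrative, not the actual test code):

	# Hypothetical sketch: scale the allowed slack by the number of
	# MDTs, since each MDT writes extra llog/bookkeeping blocks.
	local margin=$((4096 * MDSCOUNT))
	if [[ $AFTERUSED -gt $((BEFOREUSED + margin)) ]]; then
		error "after $AFTERUSED > before $BEFOREUSED (margin $margin)"
	fi
|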
| Comment by Nathaniel Clark [ 15/Oct/13 ] |
|
MDS debug_log:
00000004:00080000:0.0:1381566255.035796:0:14198:0:(osp_dev.c:1039:osp_import_event()) got inactive
00000100:00080000:0.0:1381566255.035798:0:14198:0:(pinger.c:433:ptlrpc_pinger_del_import()) removing pingable import lustre-MDT0000-mdtlov_UUID->lustre-OST0000_UUID
00000100:00080000:0.0:1381566255.035811:0:14198:0:(import.c:1484:ptlrpc_disconnect_import()) ffff8800298b4800 lustre-OST0000_UUID: changing import state from FULL to CONNECTING
00000100:00100000:0.0:1381566255.035819:0:14198:0:(client.c:1469:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc umount:lustre-MDT0000-mdtlov_UUID:14198:1448660421728700:10.10.16.157@tcp:9
00000100:00100000:0.0:1381566255.035854:0:14198:0:(client.c:2117:ptlrpc_set_wait()) set ffff8800571a9bc0 going to sleep for 5 seconds
00000004:00080000:0.0:1381566255.035961:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880056eb2800/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.035969:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728420:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.035974:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880025f2ec00/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.035977:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728428:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.035980:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880025f2e000/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.035983:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728436:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.035986:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff88006c467800/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.035989:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728444:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.035997:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880056eb2c00/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.036000:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728416:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.036004:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880056eb2000/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.036007:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728424:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.036011:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff880025f2e400/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.036014:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728432:10.10.16.157@tcp:6
00000004:00080000:0.0:1381566255.036018:0:3486:0:(osp_sync.c:348:osp_sync_interpret()) reply req ffff88006c467c00/1, rc -5, transno 0
00000100:00100000:0.0:1381566255.036020:0:3486:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:lustre-MDT0000-mdtlov_UUID:3486:1448660421728440:10.10.16.157@tcp:6
00000100:00100000:0.0:1381566255.036050:0:13350:0:(client.c:1834:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc osp-pre-0:lustre-MDT0000-mdtlov_UUID:13350:1448660421728660:10.10.16.157@tcp:5
00000004:00020000:0.0:1381566255.036055:0:13350:0:(osp_precreate.c:481:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -5
00000004:00020000:0.0:1381566255.038361:0:13350:0:(osp_precreate.c:986:osp_precreate_thread()) lustre-OST0000-osc-MDT0000: cannot precreate objects: rc = -5
|
| Comment by Nathaniel Clark [ 01/Nov/13 ] |
|
WAG: might this be related to http://review.whamcloud.com/6988? Update: this bug predates the landing of the accused patch, which is thus not the culprit. |
| Comment by Nathaniel Clark [ 01/Nov/13 ] |
|
The first occurrence of this bug is actually on http://review.whamcloud.com/5666. |
| Comment by nasf (Inactive) [ 27/Dec/13 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/709b23f6-6e9f-11e3-b713-52540035b04c |
| Comment by Nathaniel Clark [ 10/Feb/14 ] |
|
review-zfs on master: |
| Comment by Jian Yu [ 18/Feb/14 ] |
|
More instances on the master branch: |
| Comment by Jian Yu [ 16/Jun/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/63/ The same failure occurred: https://maloo.whamcloud.com/test_sets/1a08ad42-f374-11e3-86ca-52540035b04c |
| Comment by Peter Jones [ 29/Jul/14 ] |
|
Hongchao, could you please look into this issue? Thanks, Peter |
| Comment by Hongchao Zhang [ 15/Aug/14 ] |
|
The debug patch is tracked at http://review.whamcloud.com/#/c/11471/ |
| Comment by Jian Yu [ 24/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/84/ The same failure occurred: https://testing.hpdd.intel.com/test_sets/e169ad78-2ab5-11e4-aec8-5254006e85c2 |
| Comment by Hongchao Zhang [ 25/Aug/14 ] |
|
This bug is related to ZFS, which only releases occupied disk space once the DMU transactions are committed. All replay-single test_20b runs passed when some extra time was allowed after "wait_delete_completed_mds", and all failed without that change:
wait_delete_completed_mds() {
	...
	# sync MDS transactions
	do_nodes $mds2sync "$LCTL set_param -n osd*.*MD*.force_sync 1"
	# wait till all changes are sent and committed by OSTs;
	# for ldiskfs, space is released upon execution, but the DMU
	# does this upon commit
	local WAIT=0
	while [[ $WAIT -ne $MAX_WAIT ]]; do
		changes=$(do_nodes $mds2sync \
			"$LCTL get_param -n osc.*MDT*.sync_*" | calc_sum)
		#echo "$node: $changes changes on all"
		if [[ $changes -eq 0 ]]; then
			etime=$(date +%s)
			#echo "delete took $((etime - stime)) seconds"
			return
		fi
		sleep 1
		WAIT=$((WAIT + 1))
	done
	...
}
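One way to make that extra wait explicit would be to poll the OST free-space counters until the DMU commit makes them stop changing. A minimal sketch using standard test-framework helpers (wait_zfs_commit itself and the retry/sleep values are hypothetical, not from any landed patch):

wait_zfs_commit() {
	# On ZFS, blocks freed by unlinks only become visible once the
	# transaction group commits, so poll until the sum is stable.
	local prev=-1 cur i
	for i in $(seq 10); do
		cur=$(do_nodes $(comma_list $(osts_nodes)) \
			"$LCTL get_param -n osd-zfs.*.kbytesfree" | calc_sum)
		[[ $cur -eq $prev ]] && return 0
		prev=$cur
		sleep 2
	done
	return 1
}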
|
| Comment by Jian Yu [ 31/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1) The same failure occurred: https://testing.hpdd.intel.com/test_sets/158df844-311d-11e4-b503-5254006e85c2 |
| Comment by Hongchao Zhang [ 05/Sep/14 ] |
|
The initial patch is tracked at http://review.whamcloud.com/#/c/11767/ |
| Comment by Hongchao Zhang [ 30/Sep/14 ] |
|
The patch has been updated to also sync the underlying OSD when a client is being evicted.
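On the test-script side, a rough analogue of that extra sync would be to force the OST-side OSDs to commit as well, mirroring the MDS-side force_sync call in wait_delete_completed_mds (a sketch; it assumes the OST OSDs expose force_sync the same way the MDT OSDs do):

# Hypothetical: force the OST-side OSDs to commit their transaction
# groups so that space freed by unlinks becomes visible immediately.
do_nodes $(comma_list $(osts_nodes)) \
	"$LCTL set_param -n osd*.*OS*.force_sync 1"
|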
| Comment by nasf (Inactive) [ 24/Oct/14 ] |
|
Another failure instance: |
| Comment by Isaac Huang (Inactive) [ 29/Oct/14 ] |
|
This might be related to |
| Comment by Gerrit Updater [ 29/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11767/ |
| Comment by Gerrit Updater [ 04/Jan/15 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13231 |
| Comment by Jodi Levi (Inactive) [ 12/Jan/15 ] |
|
Patch landed to Master. b2_5 patch is being tracked externally. |
| Comment by Gerrit Updater [ 15/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13231/ |
| Comment by nasf (Inactive) [ 26/Feb/15 ] |
|
We still hit it on b2_5 with the ZFS backend: |