[LU-10052] replay-single test_20b fails with 'after 4096 > before 3072' Created: 30/Sep/17  Updated: 29/Mar/18  Resolved: 25/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: zfs

Issue Links:
Duplicate
is duplicated by LU-2012 replay-dual test_14b: after 846984 > ... Reopened
Related
is related to LU-10793 replay-dual test_14b FAIL: after 221... Open
is related to LU-5761 replay-single test_89: @@@@@@ FAIL: 2... Resolved
is related to LU-8672 missing error handling in replay-sing... Resolved
is related to LU-9891 replay-ost-single test_7: 15995648 > ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangshilong <wshilong@ddn.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5f45e920-a61c-11e7-bb19-5254006e85c2.

The sub-test test_20b failed with the following error:

after 4096 > before 3072

Info required for matching: replay-single 20b



 Comments   
Comment by nasf (Inactive) [ 17/Oct/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/246f6226-b2b6-11e7-9eeb-5254006e85c2

Comment by Peter Jones [ 23/Oct/17 ]

Hongchao

Could you please investigate?

Thanks

Peter

Comment by Hongchao Zhang [ 27/Oct/17 ]

For ldiskfs, the error has occurred only 3 times since Oct 1, 2016, and all of them are related to https://review.whamcloud.com/#/c/28847/ for LU-7585.

Comment by nasf (Inactive) [ 01/Dec/17 ]

It has happened on ZFS many times recently. For example:
https://testing.hpdd.intel.com/test_sets/cc484e04-d61a-11e7-a066-52540065bddc

Comment by nasf (Inactive) [ 01/Dec/17 ]

For ldiskfs, the error has occurred only 3 times since Oct 1, 2016, and all of them are related to https://review.whamcloud.com/#/c/28847/ for LU-7585.

The failure also happens on ZFS even without that patch.

Comment by Andreas Dilger [ 21/Dec/17 ]

This has failed 16x in the past week.

Comment by Bob Glossman (Inactive) [ 23/Dec/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/eb616ebc-e79d-11e7-9c63-52540065bddc

Comment by Jian Yu [ 25/Dec/17 ]

This is affecting patch testing on master branch:
https://testing.hpdd.intel.com/test_sets/cbe4e4b4-e76c-11e7-9c63-52540065bddc
https://testing.hpdd.intel.com/test_sets/3ae4250e-e917-11e7-8027-52540065bddc

Comment by Emoly Liu [ 26/Dec/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/444ea2f8-e62a-11e7-a066-52540065bddc

Comment by Mikhail Pershin [ 26/Dec/17 ]

+1, master
https://testing.hpdd.intel.com/test_sets/68ef3b5a-e99e-11e7-a066-52540065bddc

Comment by Hongchao Zhang [ 01/Jan/18 ]

I have looked at the logs of these failed tests. The space usage difference is related to the recordsize of the underlying ZFS dataset:
the maximum difference observed so far is 2*recordsize, and the recordsize is set to 1M (the default is 128K).

Dec 21 2017 02:24:43.426357505 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "lustre-ost2"
        pool_guid = 0x1c4c49fe69e964bf
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "onyx-30vm11.onyx.hpdd.intel.com"
        history_dsname = "lustre-ost2/ost2"
        history_internal_str = "recordsize=1048576"
        history_internal_name = "set"
        history_dsid = 0x88
        history_txg = 0xc
        history_time = 0x5a3b1b6b
        time = 0x5a3b1b6b 0x1969b301 
        eid = 0x20
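
To make the arithmetic concrete, here is a minimal Python sketch of a before/after space check that tolerates growth proportional to recordsize. It is only an illustration, not the code in the test scripts: the function names, the 2-record allowance (taken from the largest difference observed so far), and the assumption that the reported values are in KB are all hypothetical.

```python
# Illustrative only: hypothetical helpers, not the actual test-framework code.
RECORDSIZE_BYTES = 1048576             # from history_internal_str = "recordsize=1048576"
DEFAULT_RECORDSIZE_BYTES = 128 * 1024  # ZFS default mentioned in the comment above


def allowed_growth_kb(recordsize_bytes, records=2):
    """Slack in KB: `records` full ZFS records.

    records=2 is an assumption based on the largest difference observed so far.
    """
    return records * recordsize_bytes // 1024


def space_check(before_kb, after_kb, recordsize_bytes):
    """Return (delta_kb, ok) for a before/after used-space comparison.

    Assumes both values are in KB, like the numbers in the failure message.
    """
    delta_kb = after_kb - before_kb
    return delta_kb, delta_kb <= allowed_growth_kb(recordsize_bytes)


# Values from the failure message: "after 4096 > before 3072"
delta, ok = space_check(3072, 4096, RECORDSIZE_BYTES)
print(delta, ok)  # 1024 True
```

With the 1M recordsize from the event above, the reported growth from 3072 to 4096 is exactly one record and falls within the allowance; with the 128K default, the same growth would exceed it.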

Comment by Gerrit Updater [ 01/Jan/18 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/30678
Subject: LU-10052 test: limit recordsize
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d0e579ad09ac1b61509fa7a57470a48d1aed728d

Comment by Bruno Faccini (Inactive) [ 08/Jan/18 ]

+1 on master at https://testing.hpdd.intel.com/test_sets/577b494a-f40c-11e7-8c43-52540065bddc

Comment by Jian Yu [ 10/Jan/18 ]

+1 on master at https://testing.hpdd.intel.com/test_sets/41d6a8c8-f614-11e7-94c7-52540065bddc

Comment by Gerrit Updater [ 14/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30678/
Subject: LU-10052 tests: wait for OST objects to be deleted
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1eae3bfd5de6eecbe70d24681890ad070e8446f8

Comment by Peter Jones [ 14/Jan/18 ]

Landed for 2.11

Comment by James Nunez (Inactive) [ 16/Jan/18 ]

Hongchao - I think this issue is still open. There are a few replay-single test 20b failures that have the patch for this ticket, https://review.whamcloud.com/30678, applied.

Please see the following logs for one such failure: https://testing.hpdd.intel.com/test_sets/bbe57164-f9e6-11e7-bd00-52540065bddc

Thank you

Comment by Gerrit Updater [ 18/Jan/18 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/30916
Subject: LU-10052 test: relate fs_log_size to recordsize
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c845b4230bbc4d67339c20f64187c6f4ee166275

Comment by Jian Yu [ 21/Jan/18 ]

+1 on master branch:
https://testing.hpdd.intel.com/test_sets/5f5fa6f4-fe4a-11e7-bd00-52540065bddc

Comment by Jinshan Xiong (Inactive) [ 25/Jan/18 ]

https://testing.hpdd.intel.com/test_sets/649ca34c-016a-11e8-bd00-52540065bddc

Comment by Gerrit Updater [ 25/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30916/
Subject: LU-10052 test: relate fs_log_size to recordsize
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 804ea3abc4265c035d2b3941400c842e6d6fdb96

Comment by Gerrit Updater [ 02/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31145
Subject: LU-10052 tests: wait for OST objects to be deleted
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: b8a94fd40ca7b829917ad437995de9091a447001

Comment by Gerrit Updater [ 02/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31146
Subject: LU-10052 test: relate fs_log_size to recordsize
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: c55730b2c3aecedacababebee1437132f2820645

Comment by Patrick Farrell (Inactive) [ 15/Feb/18 ]

Plus one on master:
https://testing.hpdd.intel.com/test_sessions/e93e1854-4731-4a95-b047-fe08c34122a5

Anyone familiar with this able to check if it's the same issue? It is the same test failure reported here.

Comment by Elena Gryaznova [ 20/Feb/18 ]

Patrick,
https://testing.hpdd.intel.com/test_sessions/e93e1854-4731-4a95-b047-fe08c34122a5
fails with LU-10052 because the parent commit (2b13cb3c) of https://review.whamcloud.com/#/c/30405/ does not contain the LU-10052 fix 804ea3.

Comment by Andreas Dilger [ 06/Mar/18 ]

Note that the commit message for the 31146 patch says that it fixes replay-single test_89, but it doesn't. The patch https://review.whamcloud.com/31120 "LU-5761 tests: fix test_89 to use fs_log_size()" is still needed to fix that test to use fs_log_size() so that it allows enough leeway in the space allocation for ZFS log blocks.

Comment by Gerrit Updater [ 08/Mar/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31145/
Subject: LU-10052 tests: wait for OST objects to be deleted
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: e1c3628ae4363cdb3cd28d8dc89459df819d134c

Comment by Gerrit Updater [ 08/Mar/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31146/
Subject: LU-10052 test: relate fs_log_size to recordsize
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 86cea21001f0152f6849fe5710112bced629f5f3
