[LU-14201] replay-single test 89 fails with '3072 blocks leaked' Created: 08/Dec/20  Updated: 22/Nov/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

ZFS


Issue Links:
Related
is related to LU-16271 replay-single test_89 FAIL: 20480 blo... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_89 fails with '3072 blocks leaked'. We have seen this test fail with this error message before, in LU-1867 and LU-5761, but both of those tickets are closed. Since late September 2020, we have seen this failure during both branch testing and patch testing, although in several of those sessions other replay-single tests fail before test 89 does; for example, https://testing.whamcloud.com/test_sets/b83cb774-5cb2-473e-a641-90e5875fe6a6.

For the test failure at https://testing.whamcloud.com/test_sets/a6260ca9-b7a0-4818-9d48-ab79249ba526, the last lines in the suite_log are

Waiting for orphan cleanup...
CMD: trevis-20vm4 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
osp.lustre-OST0000-osc-MDT0000.old_sync_processed
osp.lustre-OST0001-osc-MDT0000.old_sync_processed
osp.lustre-OST0002-osc-MDT0000.old_sync_processed
osp.lustre-OST0003-osc-MDT0000.old_sync_processed
osp.lustre-OST0004-osc-MDT0000.old_sync_processed
osp.lustre-OST0005-osc-MDT0000.old_sync_processed
osp.lustre-OST0006-osc-MDT0000.old_sync_processed
wait 40 secs maximumly for trevis-20vm4 mds-ost sync done.
CMD: trevis-20vm4 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
sleep 5 for ZFS zfs
Waiting for local destroys to complete
 replay-single test_89: @@@@@@ FAIL: 3072 blocks leaked 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5907:error()
  = /usr/lib64/lustre/tests/replay-single.sh:3329:test_89()

There is nothing obviously wrong in the console logs.
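The wait sequence in the suite_log above comes from the test framework's MDS-OST sync and delete-completion helpers. The following is a minimal, hypothetical sketch of the polling pattern those log lines imply, assuming a single MDS facet on trevis-20vm4 and that each osp.*.old_sync_processed parameter reads 1 once the sync has caught up; the real helpers (wait_mds_ost_sync, wait_delete_completed) live in test-framework.sh and differ in detail.

#!/bin/bash
# Sketch only: poll old_sync_processed on the MDS until no OSP device
# still reports 0, or give up after 40 seconds (the "wait 40 secs
# maximumly" limit seen in the log).
wait_mds_ost_sync_sketch() {
        local max_wait=40
        local elapsed=0
        while (( elapsed < max_wait )); do
                local pending
                pending=$(ssh trevis-20vm4 \
                        "/usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed" |
                        grep -c '^0$')
                # No devices left at 0 means the MDS->OST sync is done.
                (( pending == 0 )) && return 0
                sleep 2
                elapsed=$(( elapsed + 2 ))
        done
        echo "mds-ost sync not done after ${max_wait}s" >&2
        return 1
}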



 Comments   
Comment by Andreas Dilger [ 09/Dec/20 ]

I had a quick look at this, and so far it is a one-off test failure. There was one other test_89 failure in the past month, but it looked quite different.

This test verifies that when a file is deleted across both an OSS and an MDS restart, the space on the OSTs is released. The severity is fairly low, since a concurrent MDS and OSS failure while files are also being deleted is rare, and at worst some space on the OSTs would be leaked. It may also be a test script issue (e.g. the delete had not completed yet because "wait_delete_completed_mds()" did not wait long enough).

So I don't think it is a blocker for the 2.12.6 release, but we can keep an eye on whether it is being hit regularly.
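The delete-across-failover check described above can be outlined roughly as below. This is an illustrative sketch only, not the actual test_89() body from replay-single.sh: the function name leak_check_sketch, the use of the "Used" column of lfs df, the zero-leak threshold, and the file name are assumptions, and the OSS/MDS failover steps are elided.

# Sketch of the space-leak check: sample OST usage, write and then
# delete a file around an OSS/MDS restart, wait for destroys to
# complete, and compare usage again.
leak_check_sketch() {
        local before after leaked
        # Blocks used on OST0000 before the workload ("Used" column of
        # lfs df; column choice is an assumption for illustration).
        before=$(lfs df $MOUNT | awk '/OST0000/ { print $3 }')

        dd if=/dev/zero of=$MOUNT/leak-test bs=1M count=10
        sync
        # ... stop the OSS and fail over the MDS here, then: ...
        rm -f $MOUNT/leak-test
        # ... bring the OSS back and remount the client ...

        wait_mds_ost_sync_sketch || return 1   # polling sketch shown earlier
        after=$(lfs df $MOUNT | awk '/OST0000/ { print $3 }')

        leaked=$(( after - before ))
        # If the deleted objects were never destroyed on the OST, the
        # used block count stays high and the test reports a leak
        # (the real test presumably tolerates some slack; a zero
        # threshold is a simplification here).
        (( leaked <= 0 )) || echo "FAIL: ${leaked} blocks leaked"
}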

Comment by Andreas Dilger [ 22/Nov/22 ]

Duplicate of LU-16271, but it seems to be hitting more regularly: only 5 failures in 2022-07 and 2022-08 combined, but 16 in the past 4 weeks.
