[LU-12068] sanity-lfsck test_6b: (7.2) 0x0 is not larger than 0x0 Created: 14/Mar/19  Updated: 10/May/19  Resolved: 08/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.1
Fix Version/s: Lustre 2.13.0, Lustre 2.12.2

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Hongchao Zhang
Resolution: Fixed
Votes: 0
Labels: zfs

Issue Links:
Duplicate
is duplicated by LU-8112 sanity-lfsck test_6b: (7.2) 0x0 is no... Resolved
Related
is related to LU-11330 replay-single test_70d: Directory not... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangshilong <wshilong@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/36dcb5b6-45d2-11e9-9646-52540065bddc

test_6b failed with the following error:

(7.2) 0x0 is not larger than 0x0

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lfsck test_6b - (7.2) 0x0 is not larger than 0x0



 Comments   
Comment by Andreas Dilger [ 14/Mar/19 ]

It looks like this has been happening a lot recently, but the first recent occurrence is on 2019-02-20:
https://testing.whamcloud.com/test_sets/07e0f5ee-3513-11e9-b4f9-52540065bddc

A number of patches landed on 2019-02-18 that may have triggered this issue, but none of them looks like it would be the cause. The most recent patch that changed LFSCK is LU-11111, but it landed on 2019-02-11.

Comment by Gerrit Updater [ 14/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34417
Subject: LU-12068 tests: add debug for sanity-lfsck test_6b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3a81468d0f0fff77de01e67a479c9ea82878d4c8

Comment by Gerrit Updater [ 23/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34417/
Subject: LU-12068 tests: add debug for sanity-lfsck test_6b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 698209700faf51c227f9ba16626de5ed70fa97c8

Comment by James Nunez (Inactive) [ 27/Mar/19 ]

We have at least one sanity-lfsck test 6b failure with the debug information at https://testing.whamcloud.com/test_sets/5c532a1a-4fe0-11e9-9720-52540065bddc. The debug info for this test is:

Additional debug for 6b
CMD: trevis-12vm8 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
name: lfsck_namespace
magic: 0xa06249ff
version: 2
status: scanning-phase1
flags:
param:
last_completed_time: 1553561433
time_since_last_completed: 14 seconds
latest_start_time: 1553561446
time_since_latest_start: 1 seconds
last_checkpoint_time: 1553561444
time_since_last_checkpoint: 3 seconds
latest_start_position: 1943, [0x200000405:0x1:0x0], 0x0
last_checkpoint_position: 1680, [0x200000405:0x1:0x0], 0x0
first_failure_position: N/A, N/A, N/A
checked_phase1: 4
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 2
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 0
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
agent_entries_repaired: 0
success_count: 11
run_time_phase1: 7 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 0 items/sec
average_speed_phase2: N/A
average_speed_total: 0 items/sec
real_time_speed_phase1: 0 items/sec
real_time_speed_phase2: N/A
current_position: 1942, [0x200000405:0x1:0x0], 0x0
 sanity-lfsck test_6b: @@@@@@ FAIL: (7.2) 0x0 is not larger than 0x0 
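
Each position field in the dump above (latest_start_position, last_checkpoint_position, current_position) holds three comma-separated components. Judging from the output alone, these appear to be an object-table index, the FID of the directory being scanned, and a position inside that directory. The values in the failure message (0x0 and 0x0) match the third component. The following is a minimal sketch of splitting such a line in C. The names `struct lfsck_pos` and `parse_pos` are hypothetical and are not taken from the Lustre tree.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one position triple; not a Lustre type. */
struct lfsck_pos {
	unsigned long long oit; /* first component (assumed object-table index) */
	unsigned long long seq; /* FID sequence */
	unsigned int       oid; /* FID object id */
	unsigned int       ver; /* FID version */
	unsigned long long dir; /* third component (assumed directory position) */
};

/*
 * Parse a line such as "1943, [0x200000405:0x1:0x0], 0x0".
 * Returns 0 on success and -1 if the text does not match,
 * which is also the case for "N/A, N/A, N/A".
 */
static int parse_pos(const char *s, struct lfsck_pos *p)
{
	int n;

	assert(s != NULL && p != NULL);
	n = sscanf(s, "%llu, [%llx:%x:%x], %llx",
		   &p->oit, &p->seq, &p->oid, &p->ver, &p->dir);
	return n == 5 ? 0 : -1;
}
```

Applied to the latest_start_position and last_checkpoint_position lines above, this yields first components of 1943 and 1680 and a third component of 0 in both cases.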

Comment by Patrick Farrell (Inactive) [ 27/Mar/19 ]

I dug into this for a while yesterday (I found several examples with debug; the one James posted above is very much representative), and I mostly concluded that I don't understand the lfsck tests very well. It would be good if someone who knows the lfsck architecture could dig in - I was spending most of my time trying to figure out what normal looks like here.

Comment by Andreas Dilger [ 27/Mar/19 ]

HongChao, could you please look into this issue? It is causing a large number of test failures.

Comment by Gerrit Updater [ 28/Mar/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34525
Subject: LU-12068 test: compare position for ZFS dot entry
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 617cd0fa4b53453cd88295a7c2265817ebc38f48

Comment by Hongchao Zhang [ 28/Mar/19 ]

I managed to reproduce the issue locally. The cause is that the LFSCK process is stopped just as it is scanning the "." or ".." entry
of some directory; on ZFS, the position of both entries is zero.

static __u64 osd_dir_it_store(const struct lu_env *env, const struct dt_it *di)
{
        struct osd_zap_it *it = (struct osd_zap_it *)di;
        __u64              pos;
        ENTRY;

        /* "." and ".." are both stored as position 0 */
        if (it->ozi_pos <= OZI_POS_DOTDOT)
                pos = 0;
        else
                pos = osd_zap_cursor_serialize(it->ozi_zc);

        RETURN(pos);
}

The patch is tracked at https://review.whamcloud.com/34525
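
To make the failure mode concrete, here is a minimal standalone model of the branch in osd_dir_it_store() quoted above, paired with a strict greater-than check like the one implied by the "(7.2) 0x0 is not larger than 0x0" message. Everything prefixed with `sketch_`, as well as `strictly_larger`, is a hypothetical stand-in. The real iterator, the ZAP cursor, and the test script are not reproduced, and this is not meant to show what the patch changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator states; only their ordering matters for the sketch. */
enum sketch_pos {
	SKETCH_POS_INIT   = 0,
	SKETCH_POS_DOT    = 1, /* iterator is on "." */
	SKETCH_POS_DOTDOT = 2, /* iterator is on ".." */
	SKETCH_POS_REAL   = 3  /* iterator is on an ordinary entry */
};

struct sketch_it {
	enum sketch_pos pos;    /* which kind of entry the iterator is on */
	uint64_t        cursor; /* stand-in for the serialized ZAP cursor */
};

/* Same shape as the quoted function: "." and ".." are stored as 0. */
static uint64_t sketch_store(const struct sketch_it *it)
{
	assert(it != NULL);
	if (it->pos <= SKETCH_POS_DOTDOT)
		return 0;
	return it->cursor;
}

/* Strict comparison, as suggested by the "is not larger than" message. */
static bool strictly_larger(uint64_t a, uint64_t b)
{
	return a > b;
}
```

If one stored position was taken on "." and the other on "..", both are 0, so `strictly_larger()` returns false even though the scan did move forward. That matches the 0x0 versus 0x0 values in the failure message.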

Comment by Andreas Dilger [ 28/Mar/19 ]

The patch https://review.whamcloud.com/34098 "LU-11330 osd-zfs: hash for ./.. must be 0" only landed on 2019-02-27, while a few tests failed between 2019-02-20 and 2019-02-25, so its landing is close to, but not exactly at, the first date this problem was seen. However, most of the failures started after 2019-02-27, so it is possible that there are a couple of different issues here and that LU-11330 made the problem much worse. There aren't any cases where this test failed during the testing of LU-11330 itself, but the failure is definitely not being hit on ldiskfs, so LU-11330 is the likely cause of most of these failures.

Comment by Hongchao Zhang [ 29/Mar/19 ]

Hi Andreas,

Yes, there is one failed case on 2019-02-20 and another on 2019-02-25. Both are on branch "master-next",
and the version is "2.12.51.85". Is it possible that this version contains some other patches?

https://testing.whamcloud.com/test_sessions/ae657cad-8359-4ff5-a42a-9b0e496a025b
https://testing.whamcloud.com/test_sessions/b49849ff-483d-4982-84c5-92721cae1afe

Comment by Minh Diep [ 02/Apr/19 ]

+1 on b2_12 https://testing.whamcloud.com/test_sets/eb96e502-5523-11e9-9646-52540065bddc

Comment by Gerrit Updater [ 08/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34525/
Subject: LU-12068 test: compare position for ZFS dot entry
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 42adbae36f206a6ed4170e7619cd993c8fa80b1d

Comment by Minh Diep [ 08/Apr/19 ]

Landed in 2.13

Comment by Gerrit Updater [ 17/Apr/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34695
Subject: LU-12068 tests: add debug for sanity-lfsck test_6b
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 02ebeedf6dcd83d11e13e31c06753cfaef5dcbbf

Comment by Gerrit Updater [ 17/Apr/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34696
Subject: LU-12068 test: compare position for ZFS dot entry
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 08a0fa85c6f9987dbd6c3049b46c660a1846a1c3

Comment by Gerrit Updater [ 10/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34695/
Subject: LU-12068 tests: add debug for sanity-lfsck test_6b
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: f5e0c311ae60709070514b951f7d25d537d3dc91

Comment by Gerrit Updater [ 10/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34696/
Subject: LU-12068 test: compare position for ZFS dot entry
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 1e6cd6b21fc37420341fdb0dcec366bb3feb350e
