[LU-12068] sanity-lfsck test_6b: (7.2) 0x0 is not larger than 0x0
Created: 14/Mar/19 Updated: 10/May/19 Resolved: 08/Apr/19
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.1 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | zfs |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for wangshilong <wshilong@ddn.com>.
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/36dcb5b6-45d2-11e9-9646-52540065bddc
test_6b failed with the following error: (7.2) 0x0 is not larger than 0x0 |
| Comments |
| Comment by Andreas Dilger [ 14/Mar/19 ] |
|
It looks like this has been happening a lot recently; the first recent occurrence was on 2019-02-20. A number of patches landed on 2019-02-18 that may have triggered this issue, but none of them look like they would be the cause. The most recent patch that changed LFSCK is |
| Comment by Gerrit Updater [ 14/Mar/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34417 |
| Comment by Gerrit Updater [ 23/Mar/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34417/ |
| Comment by James Nunez (Inactive) [ 27/Mar/19 ] |
|
We have at least one sanity-lfsck test_6b failure with the debug information at https://testing.whamcloud.com/test_sets/5c532a1a-4fe0-11e9-9720-52540065bddc. The debug info for this test is:
Additional debug for 6b
CMD: trevis-12vm8 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
name: lfsck_namespace
magic: 0xa06249ff
version: 2
status: scanning-phase1
flags:
param:
last_completed_time: 1553561433
time_since_last_completed: 14 seconds
latest_start_time: 1553561446
time_since_latest_start: 1 seconds
last_checkpoint_time: 1553561444
time_since_last_checkpoint: 3 seconds
latest_start_position: 1943, [0x200000405:0x1:0x0], 0x0
last_checkpoint_position: 1680, [0x200000405:0x1:0x0], 0x0
first_failure_position: N/A, N/A, N/A
checked_phase1: 4
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 2
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 0
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
agent_entries_repaired: 0
success_count: 11
run_time_phase1: 7 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 0 items/sec
average_speed_phase2: N/A
average_speed_total: 0 items/sec
real_time_speed_phase1: 0 items/sec
real_time_speed_phase2: N/A
current_position: 1942, [0x200000405:0x1:0x0], 0x0
sanity-lfsck test_6b: @@@@@@ FAIL: (7.2) 0x0 is not larger than 0x0 |
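To make the position fields in this output easier to compare, here is a small user-space C sketch that parses one "*_position" value of the form "N, [seq:oid:ver], 0xcookie". The struct, its field names, and cookie_advanced() are hypothetical; reading the trailing hex field as a directory cookie, and treating the (7.2) check as a strict "larger than" comparison of those trailing fields, are assumptions rather than a description of the real test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical holder for one "*_position" value from the
 * lfsck_namespace output, e.g. "1943, [0x200000405:0x1:0x0], 0x0".
 * The meaning assigned to each field is an assumption.
 */
struct lfsck_pos_sketch {
	unsigned long long first;      /* leading decimal field, e.g. 1943 */
	unsigned long long fid_seq;    /* [seq:oid:ver] FID inside brackets */
	unsigned int fid_oid;
	unsigned int fid_ver;
	unsigned long long dir_cookie; /* trailing hex field, e.g. 0x0 */
};

/*
 * Parse "N, [seq:oid:ver], 0xcookie". Returns true only if all five
 * fields are read, so the "N/A, N/A, N/A" form used by
 * first_failure_position is rejected.
 */
static bool parse_lfsck_pos(const char *s, struct lfsck_pos_sketch *p)
{
	assert(s != NULL && p != NULL);
	/* %llx and %x accept an optional "0x" prefix, like strtoul(.., 16) */
	return sscanf(s, "%llu, [%llx:%x:%x], %llx",
		      &p->first, &p->fid_seq, &p->fid_oid,
		      &p->fid_ver, &p->dir_cookie) == 5;
}

/* Hypothetical strict check: did the trailing cookie move forward? */
static bool cookie_advanced(const struct lfsck_pos_sketch *older,
			    const struct lfsck_pos_sketch *newer)
{
	return newer->dir_cookie > older->dir_cookie;
}
```

With the values from this run, latest_start_position and last_checkpoint_position both end in 0x0, so a strict comparison of those trailing fields fails, which is consistent with the reported "0x0 is not larger than 0x0".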
| Comment by Patrick Farrell (Inactive) [ 27/Mar/19 ] |
|
I dug into this for a while yesterday (I found several examples with debug output, and James' example there is very representative), and I mostly concluded that I don't understand the lfsck tests very well. It would be good if someone who knows the LFSCK architecture could dig in; I was spending most of my time trying to figure out what normal behavior looks like here. |
| Comment by Andreas Dilger [ 27/Mar/19 ] |
|
HongChao, could you please look into this issue? It is causing a large number of test failures. |
| Comment by Gerrit Updater [ 28/Mar/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34525 |
| Comment by Hongchao Zhang [ 28/Mar/19 ] |
|
I managed to reproduce the issue locally. The cause is that the LFSCK process is stopped just when it is scanning the "." or ".." entry, and in that case osd_dir_it_store() stores a position of 0:

static __u64 osd_dir_it_store(const struct lu_env *env, const struct dt_it *di)
{
	struct osd_zap_it *it = (struct osd_zap_it *)di;
	__u64 pos;
	ENTRY;

	/* while the iterator is still on "." or "..", no ZAP cursor
	 * position is serialized and 0 is returned instead */
	if (it->ozi_pos <= OZI_POS_DOTDOT)
		pos = 0;
	else
		pos = osd_zap_cursor_serialize(it->ozi_zc);

	RETURN(pos);
}

The patch is tracked at https://review.whamcloud.com/34525 |
| Comment by Andreas Dilger [ 28/Mar/19 ] |
|
The patch https://review.whamcloud.com/34098 " |
| Comment by Hongchao Zhang [ 29/Mar/19 ] |
|
Hi Andreas, yes, there was one failed case on 2019-02-20 and another on 2019-02-25; both failures were on the "master-next" branch: https://testing.whamcloud.com/test_sessions/ae657cad-8359-4ff5-a42a-9b0e496a025b |
| Comment by Minh Diep [ 02/Apr/19 ] |
|
+1 on b2_12 https://testing.whamcloud.com/test_sets/eb96e502-5523-11e9-9646-52540065bddc |
| Comment by Gerrit Updater [ 08/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34525/ |
| Comment by Minh Diep [ 08/Apr/19 ] |
|
Landed in 2.13 |
| Comment by Gerrit Updater [ 17/Apr/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34695 |
| Comment by Gerrit Updater [ 17/Apr/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34696 |
| Comment by Gerrit Updater [ 10/May/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34695/ |
| Comment by Gerrit Updater [ 10/May/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34696/ |