
MDS unable to locate swabbed FID SEQ in FLDB

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 1.8.9

    Description

      Our sysadmins updated one of our Lustre 2.1 filesystems to Lustre 2.4.0-19chaos. Note that this filesystem was likely originally formatted under Lustre 1.8. It looks like OI scrub ran automatically this time, but failed to make any updates:

      > cat osd-ldiskfs/lsd-MDT0000/oi_scrub
      name: OI_scrub
      magic: 0x4c5fd252
      oi_files: 1
      status: completed
      flags:
      param:
      time_since_last_completed: 505891 seconds
      time_since_latest_start: 521998 seconds
      time_since_last_checkpoint: 505891 seconds
      latest_start_position: 12
      last_checkpoint_position: 991133697
      first_failure_position: N/A
      checked: 200636112
      updated: 0
      failed: 0
      prior_updated: 0
      noscrub: 3090
      igif: 15492100
      success_count: 2
      run_time: 16107 seconds
      average_speed: 12456 objects/sec
      real-time_speed: N/A
      current_position: N/A
      

      You'll recall that we had OI scrub problems when we tried to upgrade the first ldiskfs filesystem to 2.4 in LU-3934. This time we are using a version of Lustre with the suggested patches included.

      We are seeing symptoms similar to last time. For example, directory listings show ????????? for the permission flags of some subdirectories, and we are seeing errors like this on the MDS console:

      Nov  7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(fld_handler.c:169:fld_server_lookup()) srv-lsd-MDT0000: Cannot find sequence 0x607000002000000: rc = -5
      Nov  7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(fld_handler.c:169:fld_server_lookup()) Skipped 20 previous similar messages
      Nov  7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(osd_handler.c:2125:osd_fld_lookup()) lsd-MDT0000-osd: cannot find FLD range for [0x607000002000000:0x8a0:0x0]: rc = -5
      Nov  7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(osd_handler.c:2125:osd_fld_lookup()) Skipped 14 previous similar messages
      Nov  7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(osd_handler.c:3317:osd_remote_fid()) lsd-MDT0000-osd: Can not lookup fld for [0x607000002000000:0x8a0:0x0]
      Nov  7 08:06:19 momus-mds1 kernel: LustreError: 7326:0:(osd_handler.c:3317:osd_remote_fid()) Skipped 14 previous similar messages
      

      The filesystem is unusable for many of our users.
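
      As the ticket title suggests, the sequence the MDS cannot find appears to be a byte-swapped (swabbed) copy of an ordinary FID sequence. A minimal bash sketch (not from the ticket) that reverses the byte order of the reported value:

      # Byte-swap the 64-bit SEQ from the console errors (illustrative only).
      seq=$((0x0607000002000000))
      swab=0
      for i in 0 1 2 3 4 5 6 7; do
              swab=$(( swab | (((seq >> (8 * i)) & 0xff) << (8 * (7 - i))) ))
      done
      printf '0x%x\n' "$swab"   # prints 0x200000706, in the normal FID SEQ range
                                # (normal sequences start at 0x200000400)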

      Attachments

        1. checkfid.sh
          0.9 kB
        2. client_log.txt
          701 kB
        3. server_log.txt.bz2
          0.2 kB

        Issue Links

          Activity

            [LU-4226] MDS unable to locate swabbed FID SEQ in FLDB

            time to close this out.

            simmonsja James A Simmons added a comment

            The problem was handled as Andreas explained. If the servers now have code to prevent this problem in the first place, then the ticket is complete.

            morrone Christopher Morrone (Inactive) added a comment

            Di, I don't think there was any way for LFSCK to fix the bad FIDs directly. My understanding is that the LMA xattr was removed from the inodes, and then LFSCK treated this as an upgraded 1.8 filesystem with IGIF FIDs and recreated the LMA.

            adilger Andreas Dilger added a comment
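
            For reference, the recovery described here (and reported below in the thread) boils down to deleting the bad FID metadata from the affected inodes on the ldiskfs-mounted MDT and letting LFSCK rebuild it. A rough sketch, assuming (hypothetically) the MDT is mounted as type ldiskfs on /mnt/mdt and badfiles.txt lists the affected paths relative to ROOT; the mount point and file list are illustrative only:

            # Hypothetical paths and file list; adapt before use.
            while read -r f; do
                    setfattr -x trusted.lma "/mnt/mdt/ROOT/$f"    # drop the LMA xattr
                    setfattr -x trusted.link "/mnt/mdt/ROOT/$f"   # drop the link xattr
            done < badfiles.txt

            # Remount the MDT as Lustre, then let LFSCK recreate the LMA xattrs:
            lctl lfsck_start -M lsd-MDT0000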
            di.wang Di Wang added a comment - edited

            Btw: we will do more FID validation on the server side in https://jira.hpdd.intel.com/browse/LU-4232. Ah, I already attached that ticket to the sub-tasks.

            di.wang Di Wang added a comment -

            Ned, Chris: could you please tell me whether OI_scrub fixed these bad FIDs? Is there anything else I should do for this ticket? Thanks.


            Andreas, in case checkfid.sh is needed again, it needs to handle sequence numbers that compare as negative integers:

            -       [[ ${SFID[1]} -ge $MAXFID ]] && echo "$F: bad SEQ $FFID" && continue
            +       if [[ ${SFID[1]} -ge $MAXFID || ${SFID[1]} -lt 0 ]] ; then
            +               echo "$F: bad SEQ $FFID"
            +               continue
            +       fi
            
            nedbass Ned Bass (Inactive) added a comment
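
            The negative comparison arises because bash arithmetic is signed 64-bit: a swabbed SEQ whose original low byte was 0x80 or higher ends up with its top bit set and evaluates as negative, so an -ge test alone never flags it. A quick illustration (hypothetical values, not from the ticket):

            # bash treats integers as signed 64-bit, so a "huge" SEQ goes negative.
            maxfid=$((2**33))            # stand-in for the script's $MAXFID
            seq=$((0x8000000000000001))  # hypothetical swabbed SEQ, top bit set
            echo "$seq"                  # prints -9223372036854775807
            [[ $seq -ge $maxfid ]] || echo "negative SEQ is not flagged by -ge alone"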

            Yes, it was run (maybe still running) on four of our ldiskfs systems on the SCF. Of the four, only one had bad FIDs, and that filesystem was the one that BG/P used exclusively. That filesystem has in excess of 1 million files/directories with bad FIDs.

            So that would appear to be another strong correlation pointing to PPC clients and lack of checking on the servers.

            morrone Christopher Morrone (Inactive) added a comment

            Any chance to run the checkfid.sh script on any of your other filesystems?

            adilger Andreas Dilger added a comment

            Updated version of the checkfid.sh script. The "restart" mechanism was added at the last minute and looked like it was working, but wasn't.

            adilger Andreas Dilger added a comment
            nedbass Ned Bass (Inactive) added a comment - edited

            Thanks Andreas, that's quite helpful (though I don't think this does what you intended):

            	[[ -n "$LAST" && "$F" == "$LAST" ]] && LAST="" && echo "found" ||
            		continue
            

            Incidentally, we carried out the "remove trusted.{lma,link} from bad files + lfsck" recovery procedure, and it worked pretty much as expected.

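
            For what it's worth, the quoted fragment skips every file whenever $LAST is empty, because the trailing || continue also fires when the leading test fails. A sketch of what was presumably intended (a reconstruction, not from the attached script): while $LAST is set, skip files until the checkpointed one is found, then resume normal processing:

            # Reconstruction only: skip ahead to the restart point, then resume.
            if [[ -n "$LAST" ]]; then
                    if [[ "$F" == "$LAST" ]]; then
                            LAST=""
                            echo "found"
                    fi
                    continue    # skip files up to and including the restart point
            fi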

            People

              Assignee: di.wang Di Wang
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 10
