
Getting "found wrong generation" error for the same inode

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • Environment:
      THIS IS FOR NASA AMES.
      MDS running
      SUSE Linux Enterprise Server 10 (x86_64)
      VERSION = 10
      PATCHLEVEL = 2
    • 3
    • 7038

    Description

      We are getting the following error on our MDS server repeatedly. The error continues to appear even after an MDS reboot.

      Oct 21 12:09:18 nbp30-mds kernel: LustreError: 8021:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:19:18 nbp30-mds kernel: LustreError: 8020:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:19:18 nbp30-mds kernel: LustreError: 8020:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:29:19 nbp30-mds kernel: LustreError: 9206:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529770923
      Oct 21 12:29:19 nbp30-mds kernel: LustreError: 9206:0:(handler.c:275:mds_fid2dentry()) Skipped 598 previous similar messages
      Oct 21 12:39:20 nbp30-mds kernel: LustreError: 9293:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:39:20 nbp30-mds kernel: LustreError: 9293:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:49:20 nbp30-mds kernel: LustreError: 9238:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:49:20 nbp30-mds kernel: LustreError: 9238:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:59:21 nbp30-mds kernel: LustreError: 9204:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:59:21 nbp30-mds kernel: LustreError: 9204:0:(handler.c:275:mds_fid2dentry()) Skipped 598 previous similar messages

      Attachments

        1. debug.log.tgz
          3.44 MB
        2. debug.log.tgz
          3.44 MB

        Issue Links

          Activity

            [LU-785] Getting "found wrong generation" error for the same inode
            pjones Peter Jones added a comment -

            Lai

            Could you please review the latest information from NASA?

            Thanks

            Peter


            mhanafi Mahmoud Hanafi added a comment -

            Found the client that the requests were coming from. Got debug logs, then cleared the cache; the issue remained. Gathered debug logs again after the flush. See attached file debug.log.tgz.


            adilger Andreas Dilger added a comment -

            It looks like some client(s) cache an old version of this information about this directory for some reason.

            If the message was repeating fairly regularly, it would be possible to collect RPCTRACE logs on the MDS (via "lctl set_param debug=+rpctrace") and then dump them quickly after the message appeared (via "lctl dk /tmp/debug.log") to see which client this was coming from. Then that client could collect RPCTRACE and VFSTRACE logs to see what operation it is doing to trigger this (to debug it) or just flush its cache (via "lctl set_param ldlm.namespaces.MDT.lru_size=clear") to get rid of the problem. The problem should also clear up if the clients are restarted.
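            For reference, a consolidated sketch of that capture-and-flush sequence (the wildcard ldlm namespace and the exact grouping of client-side commands are illustrative, not quoted from this ticket):

                # On the MDS: enable RPC tracing, then dump the kernel debug log
                # shortly after the "found wrong generation" message appears.
                lctl set_param debug=+rpctrace
                lctl dk /tmp/debug.log

                # On the suspect client: add RPC/VFS tracing to catch the triggering
                # operation, or simply drop its cached locks to clear the condition.
                lctl set_param debug=+rpctrace
                lctl set_param debug=+vfstrace
                lctl set_param ldlm.namespaces.*.lru_size=clear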


            mhanafi Mahmoud Hanafi added a comment -

            This inode, 209323401, points to an empty directory "/ROOT/yham/output_ran7/Y1994/19941011". This user is not running any jobs and is not logged in, so it would appear that this directory is static.

            drwxr-xr-x 2 yham xxxxx 4096 Oct 11 08:34 /nobackupp30/yham/output_ran7/Y1994/19941011

            pjones Peter Jones added a comment -

            Thanks Jay. When do you/Mahmoud expect to be able to answer Andreas's other questions?


            jaylan Jay Lan (Inactive) added a comment -

            Mahmoud, nbp30-mds runs sles10sp2, lustre-1.8.2-3.4nas_ofed151.
            The Lustre server was built by Jason from git source based on LLNL's 1.8.2.

            The git source can be found at https://github.com/jlan/lustre-nas
            branch b1_8-server-nas, tag 1.8.2-3.4nas.
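            A sketch of fetching that exact source (repository URL, branch, and tag names as given above; nothing else assumed):

                git clone https://github.com/jlan/lustre-nas
                cd lustre-nas
                git checkout b1_8-server-nas    # server branch named above
                git checkout 1.8.2-3.4nas       # or check out the tag directly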


            adilger Andreas Dilger added a comment -

            Do you know of any unusual application running at this time?

            Conceivably this could be an "unusual but harmless" error: if clients are deleting and recreating files, one is being assigned the same inode number but a new generation number (which is the point of the generation number). The clients could be revalidating the attributes of an old version of the inode, but that inode was deleted and a new one is in its place.

            What is unusual here (and the cause of my speculation) is that the "current" inode/generation number (209323401/3543922068) is staying constant, but the generation number that is being requested is changing (either 3529758994 or 3529770923, which are earlier values).

            Also, are you exporting the filesystem via NFS to other clients? They may keep the inode+generation in the NFS file handle for a long time, and this could generate error messages like this also.
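            A quick, illustrative way to check for such re-exports on a client (the /nobackupp30 mount point is taken from the directory listing earlier in this ticket):

                # Run on any client suspected of re-exporting the Lustre filesystem over NFS.
                exportfs -v | grep nobackupp30     # currently active exports of the mount point
                grep nobackupp30 /etc/exports      # or a permanent export entry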

            It may well be that this error message is unnecessary and can be changed to be an internal debugging message.

            What is a bit interesting is that the errors appear to be happening exactly once per second (~600 every 10 minutes), so this may be some kind of application that is polling the filesystem every second? You could determine the filename in question by running "debugfs -c -R 'ncheck 209323401' /dev/{MDSDEV}" on the MDS, and perhaps discover which user and/or application is causing this message, and whether they are experiencing any problems.

            pjones Peter Jones added a comment -

            Understood. Are you running the Oracle 1.8.6 or the Whamcloud 1.8.6-wc1 release and do you have any patches applied?


            mhanafi Mahmoud Hanafi added a comment -

            The filesystem is not down. We just want to know whether this should be cause
            for concern, or whether an fsck might be needed.

            pjones Peter Jones added a comment -

            Mahmoud

            Is the filesystem down or are you just concerned by the error messages?

            Peter


            People

              Assignee: laisiyao Lai Siyao
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: