
Getting "found wrong generation" error for the same inode

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • Environment:
      THIS IS FOR NASA AMES.
      MDS running
      SUSE Linux Enterprise Server 10 (x86_64)
      VERSION = 10
      PATCHLEVEL = 2
    • 3
    • 7038

    Description

      We are getting the following error on our MDS server repeatedly. The error continues to appear even after an MDS reboot.

      Oct 21 12:09:18 nbp30-mds kernel: LustreError: 8021:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:19:18 nbp30-mds kernel: LustreError: 8020:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:19:18 nbp30-mds kernel: LustreError: 8020:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:29:19 nbp30-mds kernel: LustreError: 9206:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529770923
      Oct 21 12:29:19 nbp30-mds kernel: LustreError: 9206:0:(handler.c:275:mds_fid2dentry()) Skipped 598 previous similar messages
      Oct 21 12:39:20 nbp30-mds kernel: LustreError: 9293:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:39:20 nbp30-mds kernel: LustreError: 9293:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:49:20 nbp30-mds kernel: LustreError: 9238:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:49:20 nbp30-mds kernel: LustreError: 9238:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:59:21 nbp30-mds kernel: LustreError: 9204:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:59:21 nbp30-mds kernel: LustreError: 9204:0:(handler.c:275:mds_fid2dentry()) Skipped 598 previous similar messages

      Attachments

        1. debug.log.tgz
          3.44 MB
        2. debug.log.tgz
          3.44 MB

        Issue Links

          Activity

            [LU-785] Getting "found wrong generation" error for the same inode
            pjones Peter Jones added a comment -

            Lai

            Could you please review the latest information from NASA?

            Thanks

            Peter


            mhanafi Mahmoud Hanafi added a comment -

            Found the client that the requests were coming from. Got debug logs, then cleared the cache; the issue remained. Gathered debug logs again after the flush. See attached file debug.log.tgz.


            adilger Andreas Dilger added a comment -

            It looks like some client(s) cache an old version of this information about this directory for some reason.

            If the message was repeating fairly regularly, it would be possible to collect RPCTRACE logs on the MDS (via "lctl set_param debug=+rpctrace") and then dump them quickly after the message appeared (via "lctl dk /tmp/debug.log") to see which client this was coming from. Then that client could collect RPCTRACE and VFSTRACE logs to see what operation it is doing to trigger this (to debug it) or just flush its cache (via "lctl set_param ldlm.namespaces.MDT.lru_size=clear") to get rid of the problem. The problem should also clear up if the clients are restarted.
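            For reference, a consolidated sketch of that capture-and-flush sequence (the wildcard ldlm namespace and the exact grouping of client-side commands are illustrative, not quoted from this ticket):

                # On the MDS: enable RPC tracing, then dump the kernel debug log
                # shortly after the "found wrong generation" message appears.
                lctl set_param debug=+rpctrace
                lctl dk /tmp/debug.log

                # On the suspect client: add RPC/VFS tracing to catch the triggering
                # operation, or simply drop its cached locks to clear the condition.
                lctl set_param debug=+rpctrace
                lctl set_param debug=+vfstrace
                lctl set_param ldlm.namespaces.*.lru_size=clear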


            mhanafi Mahmoud Hanafi added a comment -

            This inode, 209323401, points to an empty directory "/ROOT/yham/output_ran7/Y1994/19941011". This user is not running any jobs and is not logged in, so it would appear that this directory is static.

            drwxr-xr-x 2 yham xxxxx 4096 Oct 11 08:34 /nobackupp30/yham/output_ran7/Y1994/19941011

            pjones Peter Jones added a comment -

            Thanks Jay. When do you/Mahmoud expect to be able to answer Andreas's other questions?


            jaylan Jay Lan (Inactive) added a comment -

            Mahmoud, nbp30-mds runs sles10sp2, lustre-1.8.2-3.4nas_ofed151.
            The Lustre server was built by Jason from git source based on LLNL's 1.8.2.

            The git source can be found at https://github.com/jlan/lustre-nas
            branch b1_8-server-nas, tag 1.8.2-3.4nas.
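            A sketch of fetching that exact source (repository URL, branch, and tag names as given above; nothing else assumed):

                git clone https://github.com/jlan/lustre-nas
                cd lustre-nas
                git checkout b1_8-server-nas    # server branch named above
                git checkout 1.8.2-3.4nas       # or check out the tag directly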


            adilger Andreas Dilger added a comment -

            Do you know of any unusual application running at this time?

            Conceivably this could be an "unusual but harmless" error: if clients are deleting and recreating files, one is being assigned the same inode number but a new generation number (which is the point of the generation number). The clients could be revalidating the attributes of an old version of the inode, but that inode was deleted and a new one is in its place.

            What is unusual here (and the cause of my speculation) is that the "current" inode/generation number (209323401/3543922068) is staying constant, but the generation number that is being requested is changing (either 3529758994 or 3529770923, which are earlier values).

            Also, are you exporting the filesystem via NFS to other clients? They may keep the inode+generation in the NFS file handle for a long time, and this could generate error messages like this also.
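            A quick, illustrative way to check for such re-exports on a client (the /nobackupp30 mount point is taken from the directory listing earlier in this ticket):

                # Run on any client suspected of re-exporting the Lustre filesystem over NFS.
                exportfs -v | grep nobackupp30     # currently active exports of the mount point
                grep nobackupp30 /etc/exports      # or a permanent export entry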

            It may well be that this error message is unnecessary and can be changed to be an internal debugging message.

            What is a bit interesting is that the errors appear to be happening exactly once per second (~600 every 10 minutes), so this may be some kind of application that is polling the filesystem every second? You could determine the filename in question by running "debugfs -c -R 'ncheck 209323401' /dev/{MDSDEV}" on the MDS, and perhaps discover which user and/or application is causing this message, and whether they are experiencing any problems.

            pjones Peter Jones added a comment -

            Understood. Are you running the Oracle 1.8.6 or the Whamcloud 1.8.6-wc1 release and do you have any patches applied?


            mhanafi Mahmoud Hanafi added a comment -

            The filesystem is not down. We just want to know whether this should be cause
            for concern, or whether an fsck might be needed.

            pjones Peter Jones added a comment -

            Mahmoud

            Is the filesystem down or are you just concerned by the error messages?

            Peter


            People

              Assignee: laisiyao Lai Siyao
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: