LU-785: Getting "found wrong generation" error for the same inode

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • Component/s: None
    • Environment: THIS IS FOR NASA AMES.
      MDS running
      SUSE Linux Enterprise Server 10 (x86_64)
      VERSION = 10
      PATCHLEVEL = 2
    • Severity: 3
    • Rank: 7038

    Description

We are getting the following error on our MDS server repeatedly. It continues to appear even after an MDS reboot.

      Oct 21 12:09:18 nbp30-mds kernel: LustreError: 8021:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:19:18 nbp30-mds kernel: LustreError: 8020:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:19:18 nbp30-mds kernel: LustreError: 8020:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:29:19 nbp30-mds kernel: LustreError: 9206:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529770923
      Oct 21 12:29:19 nbp30-mds kernel: LustreError: 9206:0:(handler.c:275:mds_fid2dentry()) Skipped 598 previous similar messages
      Oct 21 12:39:20 nbp30-mds kernel: LustreError: 9293:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:39:20 nbp30-mds kernel: LustreError: 9293:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:49:20 nbp30-mds kernel: LustreError: 9238:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:49:20 nbp30-mds kernel: LustreError: 9238:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
      Oct 21 12:59:21 nbp30-mds kernel: LustreError: 9204:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
      Oct 21 12:59:21 nbp30-mds kernel: LustreError: 9204:0:(handler.c:275:mds_fid2dentry()) Skipped 598 previous similar messages
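
      In each of these messages, the two numbers after "generation" are the generation the MDS holds on disk for the inode followed by the generation the client supplied; note that the first value (3543922068) is constant while the second varies, suggesting one or more clients hold stale copies. A minimal sketch for counting the distinct stale generations clients are sending (the syslog path is an assumption for this system):

      # Extract the client-supplied generation from each "found wrong
      # generation" message and count how many distinct stale values appear.
      # Adjust /var/log/messages to wherever this MDS logs kernel messages.
      grep 'found wrong generation' /var/log/messages | \
          sed -e 's|.*generation [0-9]*/||' | sort | uniq -c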

      Attachments

        1. debug.log.tgz (3.44 MB)
        2. debug.log.tgz (3.44 MB)

          Activity

            pjones Peter Jones added a comment -

            Thanks Mahmoud

            mhanafi Mahmoud Hanafi added a comment -

            This case can be closed
            pjones Peter Jones added a comment -

            Mahmoud

            Are you able to provide any further information about the processes Lai has highlighted?

            Peter

            laisiyao Lai Siyao added a comment -

            Logs of process 21071:

            00000080:00200000:6:1320263638.353865:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            00000080:00200000:6:1320263640.357859:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            00000080:00200000:6:1320263642.362592:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            00000080:00200000:6:1320263644.369859:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            00000080:00200000:6:1320263646.373762:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            00000080:00200000:6:1320263648.377838:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            00000080:00200000:6:1320263650.381701:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            00000080:00200000:6:1320263652.385898:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            00000080:00200000:6:1320263654.389734:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
            

            Logs of process 898:

            00000080:00200000:7:1320263650.433778:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
            00000080:00200000:7:1320263650.433790:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
            00000100:00100000:7:1320263650.433833:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751772954:10.151.25.176@o2ib:101
            00000100:00100000:7:1320263650.433957:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751772954:10.151.25.176@o2ib:101
            00000100:00100000:7:1320263650.433965:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff8101aa92c000 x1380095751772954/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263992 ref 1 fl Rpc:R/0/0 rc 0/301
            00000100:00100000:7:1320263650.433983:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
            00000002:00100000:7:1320263650.434001:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2  req@ffff8101aa92c000 x1380095751772954/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263992 ref 1 fl Interpret:R/0/0 rc 0/301
            00000080:00200000:7:1320263652.437757:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
            00000080:00200000:7:1320263652.437766:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
            00000100:00100000:7:1320263652.437792:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751774603:10.151.25.176@o2ib:101
            00000100:00100000:7:1320263652.437900:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751774603:10.151.25.176@o2ib:101
            00000100:00100000:7:1320263652.437904:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff8101c62cc400 x1380095751774603/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263994 ref 1 fl Rpc:R/0/0 rc 0/301
            00000100:00100000:7:1320263652.437911:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
            00000002:00100000:7:1320263652.437921:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2  req@ffff8101c62cc400 x1380095751774603/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263994 ref 1 fl Interpret:R/0/0 rc 0/301
            00000080:00200000:7:1320263654.441728:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
            00000080:00200000:7:1320263654.441741:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
            00000100:00100000:7:1320263654.441787:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751776847:10.151.25.176@o2ib:101
            00000100:00100000:7:1320263654.441965:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751776847:10.151.25.176@o2ib:101
            00000100:00100000:7:1320263654.441973:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff8101c7f90000 x1380095751776847/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263996 ref 1 fl Rpc:R/0/0 rc 0/301
            00000100:00100000:7:1320263654.441990:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
            00000002:00100000:7:1320263654.442013:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2  req@ffff8101c7f90000 x1380095751776847/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263996 ref 1 fl Interpret:R/0/0 rc 0/301
            

            These logs show some patterns:

            • Process 21071 calls ll_inode_permission() on inode 209323401 (a directory) every 2 seconds, always with i_generation 3529767979, but performs no other filesystem operations.
            • Process 898 calls lookup on a file under inode 209323401, always with i_generation 3529770923, and this lookup fails because the MDS thinks the generation is wrong.

            Mahmoud, if you could give some hint of what these two processes are doing, that would help.
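
            A minimal sketch of how such per-process traces can be pulled out of the raw client debug log, assuming the PID is the sixth colon-separated field as in the lines above:

            # Split a decoded Lustre debug log by PID to follow one process
            # at a time (debug.log is the file dumped with "lctl dk").
            awk -F: '$6 == 21071' debug.log > pid-21071.log
            awk -F: '$6 == 898'   debug.log > pid-898.log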

            pjones Peter Jones added a comment -

            Lai

            Could you please review the latest information from NASA?

            Thanks

            Peter


            mhanafi Mahmoud Hanafi added a comment -

            Found the client that the requests were coming from. Got debug logs, then cleared the cache; the issue remained. Gathered debug logs after the flush. See attached file debug.log.tgz.

            adilger Andreas Dilger added a comment -

            It looks like some client(s) have cached an old version of this directory's metadata for some reason.

            If the message repeats fairly regularly, it would be possible to collect RPCTRACE logs on the MDS (via "lctl set_param debug=+rpctrace") and dump them quickly after the message appears (via "lctl dk /tmp/debug.log") to see which client it is coming from. That client could then collect RPCTRACE and VFSTRACE logs to show what operation is triggering this (for debugging), or simply flush its lock cache (via "lctl set_param ldlm.namespaces.<MDT>.lru_size=clear") to get rid of the problem. The problem should also clear up if the clients are restarted. A consolidated sketch of this procedure follows.
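
            The sketch below assumes the client's MDC lock namespace matches the glob *MDT* (on this system it would be nbp30-MDT0000-mdc-*); the dump file paths are arbitrary:

            # On the MDS: enable RPC tracing, wait for the error to recur,
            # then dump the debug buffer to see which client it came from.
            lctl set_param debug=+rpctrace
            lctl dk /tmp/debug.log

            # On the suspect client: add VFS tracing, reproduce, dump, then
            # flush the MDT lock namespace to drop the stale cached entry.
            lctl set_param debug=+rpctrace
            lctl set_param debug=+vfstrace
            lctl dk /tmp/client-debug.log
            lctl set_param 'ldlm.namespaces.*MDT*.lru_size=clear'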


            mhanafi Mahmoud Hanafi added a comment -

            This inode, 209323401, points to an empty directory, "/ROOT/yham/output_ran7/Y1994/19941011". This user is not running any jobs and is not logged in, so it would appear that this directory is static.

            drwxr-xr-x 2 yham xxxxx 4096 Oct 11 08:34 /nobackupp30/yham/output_ran7/Y1994/19941011
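
            A minimal sketch of mapping an inode number back to a pathname like this, either on the MDT's backing ldiskfs device (the device name is a placeholder) or from a client mount:

            # On the MDS, against the MDT backing device (read-only):
            debugfs -c -R 'ncheck 209323401' /dev/<mdt-device>
            # Or from a client, searching the mount by inode number (slow):
            find /nobackupp30 -inum 209323401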

            pjones Peter Jones added a comment -

            Thanks Jay. When do you/Mahmoud expect to be able to answer Andreas's other questions?


            jaylan Jay Lan (Inactive) added a comment -

            Mahmoud, nbp30-mds runs sles10sp2 with lustre-1.8.2-3.4nas_ofed151.
            The Lustre server was built by Jason from git source based on LLNL's 1.8.2.

            The git source can be found at https://github.com/jlan/lustre-nas
            branch b1_8-server-nas, tag 1.8.2-3.4nas.
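
            A sketch of checking out that exact code, using the URL, branch, and tag given above:

            git clone https://github.com/jlan/lustre-nas
            cd lustre-nas
            git checkout b1_8-server-nas   # the branch
            git checkout 1.8.2-3.4nas      # or the tag for this build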


            People

              Assignee: laisiyao Lai Siyao
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 5
