[LU-785] Getting "found wrong generation" error for the same inode Created: 21/Oct/11  Updated: 29/Oct/13  Resolved: 29/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Mahmoud Hanafi Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

THIS IS FOR NASA AMES.
MDS running
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2


Attachments: File debug.log.tgz     File debug.log.tgz    
Severity: 3
Epic: metadata
Rank (Obsolete): 7038

 Description   

We are getting the following error on our MDS server repeatedly. This error continues to appear even after an MDS reboot.

Oct 21 12:09:18 nbp30-mds kernel: LustreError: 8021:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
Oct 21 12:19:18 nbp30-mds kernel: LustreError: 8020:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
Oct 21 12:19:18 nbp30-mds kernel: LustreError: 8020:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
Oct 21 12:29:19 nbp30-mds kernel: LustreError: 9206:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529770923
Oct 21 12:29:19 nbp30-mds kernel: LustreError: 9206:0:(handler.c:275:mds_fid2dentry()) Skipped 598 previous similar messages
Oct 21 12:39:20 nbp30-mds kernel: LustreError: 9293:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
Oct 21 12:39:20 nbp30-mds kernel: LustreError: 9293:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
Oct 21 12:49:20 nbp30-mds kernel: LustreError: 9238:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
Oct 21 12:49:20 nbp30-mds kernel: LustreError: 9238:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages
Oct 21 12:59:21 nbp30-mds kernel: LustreError: 9204:0:(handler.c:275:mds_fid2dentry()) found wrong generation: inode 209323401, link: 2, count: 1, generation 3543922068/3529758994
Oct 21 12:59:21 nbp30-mds kernel: LustreError: 9204:0:(handler.c:275:mds_fid2dentry()) Skipped 598 previous similar messages



 Comments   
Comment by Peter Jones [ 21/Oct/11 ]

Mahmoud

Is the filesystem down or are you just concerned by the error messages?

Peter

Comment by Mahmoud Hanafi [ 21/Oct/11 ]

The filesystem is not down. Just want to know if this should be cause for concern or whether an fsck might be needed?

Comment by Peter Jones [ 21/Oct/11 ]

Understood. Are you running the Oracle 1.8.6 or the Whamcloud 1.8.6-wc1 release and do you have any patches applied?

Comment by Andreas Dilger [ 21/Oct/11 ]

Do you know of any unusual application running at this time?

Conceivably this could be an "unusual but harmless" error if clients are deleting and recreating files and one is being assigned the same inode number, but a new generation number (which is the point of the generation number). The clients could be revalidating the attributes of an old version of the inode, but that inode was deleted and a new one is in its place.

What is unusual here (and the cause of my speculation) is that the "current" inode/generation number (209323401/3543922068) is staying constant, but the generation number that is being requested is changing (either 3529758994 or 3529770923, which are earlier values).

Also, are you exporting the filesystem via NFS to other clients? They may keep the inode+generation in the NFS file handle for a long time, and this could generate error messages like this also.

It may well be that this error message is unnecessary and can be changed to be an internal debugging message.

What is a bit interesting is that the errors appear to be happening exactly once per second (~600 every 10 minutes), so this may be some kind of application that is polling the filesystem every second? You could determine the filename in question by running "debugfs -c -R 'ncheck 209323401' /dev/{MDSDEV}" on the MDS, and perhaps discover which user and/or application is causing this message, and whether they are experiencing any problems.

Comment by Jay Lan (Inactive) [ 21/Oct/11 ]

Mahmoud, nbp30-mds runs sles10sp2, lustre-1.8.2-3.4nas_ofed151.
The lustre server was built by Jason with git source based on LLNL 1.8.2's.

The git source can be found at https://github.com/jlan/lustre-nas
branch b1_8-server-nas, tag 1.8.2-3.4nas.

Comment by Peter Jones [ 25/Oct/11 ]

Thanks Jay. When do you/Mahmoud expect to be able to answer Andreas's other questions?

Comment by Mahmoud Hanafi [ 27/Oct/11 ]

This inode, 209323401, points to an empty directory, "/ROOT/yham/output_ran7/Y1994/19941011". This user is not running any jobs and is not logged in. So it would appear that this directory is static.

drwxr-xr-x 2 yham xxxxx 4096 Oct 11 08:34 /nobackupp30/yham/output_ran7/Y1994/19941011

Comment by Andreas Dilger [ 28/Oct/11 ]

It looks like some client(s) cache an old version of this directory's information for some reason.

If the message was repeating fairly regularly, it would be possible to collect RPCTRACE logs on the MDS (via "lctl set_param debug=+rpctrace") and then dump them quickly after the message appeared (via "lctl dk /tmp/debug.log") to see which client this was coming from. Then that client could collect RPCTRACE and VFSTRACE logs to see what operation it is doing to trigger this (to debug it) or just flush its cache (via "lctl set_param ldlm.namespaces.MDT.lru_size=clear") to get rid of the problem. The problem should also clear up if the clients are restarted.
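
A hedged sketch of that collection sequence (the lru_size namespace glob below is an assumption; the exact MDC namespace name depends on the filesystem):

# On the MDS: enable RPC tracing, wait for the error to appear, then dump the log.
lctl set_param debug=+rpctrace
# ... wait for the next "found wrong generation" message in syslog ...
lctl dk /tmp/debug.log

# On the suspect client: enable RPC and VFS tracing, dump the log,
# then flush the cached MDT locks.
lctl set_param debug=+rpctrace
lctl set_param debug=+vfstrace
lctl dk /tmp/client-debug.log
lctl set_param ldlm.namespaces.*MDT*.lru_size=clear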

Comment by Mahmoud Hanafi [ 02/Nov/11 ]

Found the client that the requests were coming from. Got debug logs, then cleared the cache; the issue remained. Gathered debug logs after the flush. See attached file debug.log.tgz.

Comment by Peter Jones [ 08/Nov/11 ]

Lai

Could you please review the latest information from NASA?

Thanks

Peter

Comment by Lai Siyao [ 13/Dec/11 ]

Log of process 21071:

00000080:00200000:6:1320263638.353865:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263640.357859:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263642.362592:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263644.369859:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263646.373762:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263648.377838:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263650.381701:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263652.385898:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263654.389734:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1

Logs of process 898:

00000080:00200000:7:1320263650.433778:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
00000080:00200000:7:1320263650.433790:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
00000100:00100000:7:1320263650.433833:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751772954:10.151.25.176@o2ib:101
00000100:00100000:7:1320263650.433957:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751772954:10.151.25.176@o2ib:101
00000100:00100000:7:1320263650.433965:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff8101aa92c000 x1380095751772954/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263992 ref 1 fl Rpc:R/0/0 rc 0/301
00000100:00100000:7:1320263650.433983:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
00000002:00100000:7:1320263650.434001:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2  req@ffff8101aa92c000 x1380095751772954/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263992 ref 1 fl Interpret:R/0/0 rc 0/301
00000080:00200000:7:1320263652.437757:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
00000080:00200000:7:1320263652.437766:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
00000100:00100000:7:1320263652.437792:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751774603:10.151.25.176@o2ib:101
00000100:00100000:7:1320263652.437900:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751774603:10.151.25.176@o2ib:101
00000100:00100000:7:1320263652.437904:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff8101c62cc400 x1380095751774603/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263994 ref 1 fl Rpc:R/0/0 rc 0/301
00000100:00100000:7:1320263652.437911:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
00000002:00100000:7:1320263652.437921:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2  req@ffff8101c62cc400 x1380095751774603/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263994 ref 1 fl Interpret:R/0/0 rc 0/301
00000080:00200000:7:1320263654.441728:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
00000080:00200000:7:1320263654.441741:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
00000100:00100000:7:1320263654.441787:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751776847:10.151.25.176@o2ib:101
00000100:00100000:7:1320263654.441965:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751776847:10.151.25.176@o2ib:101
00000100:00100000:7:1320263654.441973:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff8101c7f90000 x1380095751776847/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263996 ref 1 fl Rpc:R/0/0 rc 0/301
00000100:00100000:7:1320263654.441990:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
00000002:00100000:7:1320263654.442013:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2  req@ffff8101c7f90000 x1380095751776847/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263996 ref 1 fl Interpret:R/0/0 rc 0/301

These logs show some patterns:

  • process 21071 calls ll_inode_permission() on inode 209323401 (a directory) every 2 seconds, and it always uses i_generation 3529767979, but it doesn't do any fs operations other than that.
  • process 898 calls lookup on a file under inode 209323401, and it always uses i_generation 3529770923. This lookup fails because the MDS thinks the generation is wrong.

Mahmoud, if you could give some hint of what these two processes are doing, that would help.
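
One quick way to identify them, assuming the PIDs from the debug log are still alive on that client (a sketch, not specific to this site):

# Show the command line and owner of the two PIDs seen in the debug log.
ps -o pid,user,args -p 21071,898
# Their current working directories often point at the directory being polled.
ls -l /proc/21071/cwd /proc/898/cwd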

Comment by Peter Jones [ 04/Jan/12 ]

Mahmoud

Are you able to provide any further information about the processes Lai has highlighted?

Peter

Comment by Mahmoud Hanafi [ 29/Oct/13 ]

This case can be closed.

Comment by Peter Jones [ 29/Oct/13 ]

Thanks Mahmoud
