[LU-785] Getting "found wrong generation" error for the same inode Created: 21/Oct/11 Updated: 29/Oct/13 Resolved: 29/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mahmoud Hanafi | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
THIS IS FOR NASA AMES. |
||
| Attachments: |
|
| Severity: | 3 |
| Epic: | metadata |
| Rank (Obsolete): | 7038 |
| Description |
|
We are getting the following error on our MDS server repeatedly. This error continues to appear even after an MDS reboot.
Oct 21 12:09:18 nbp30-mds kernel: LustreError: 8021:0:(handler.c:275:mds_fid2dentry()) Skipped 599 previous similar messages |
| Comments |
| Comment by Peter Jones [ 21/Oct/11 ] |
|
Mahmoud, is the filesystem down, or are you just concerned by the error messages? Peter |
| Comment by Mahmoud Hanafi [ 21/Oct/11 ] |
|
The filesystem is not down. Just want to know if this should be a cause for concern. |
| Comment by Peter Jones [ 21/Oct/11 ] |
|
Understood. Are you running the Oracle 1.8.6 or the Whamcloud 1.8.6-wc1 release and do you have any patches applied? |
| Comment by Andreas Dilger [ 21/Oct/11 ] |
|
Do you know of any unusual application running at this time? Conceivably this could be an "unusual but harmless" error if clients are deleting and recreating files, and one is being assigned the same inode number but a new generation number (which is the point of the generation number). The clients could be revalidating the attributes of an old version of the inode, but that inode was deleted and a new one is in its place. What is unusual here (and the cause of my speculation) is that the "current" inode/generation number (209323401/3543922068) is staying constant, but the generation number that is being requested is changing (either 3529758994 or 3529770923, which are earlier values).

Also, are you exporting the filesystem via NFS to other clients? They may keep the inode+generation in the NFS file handle for a long time, and this could also generate error messages like this.

It may well be that this error message is unnecessary and can be changed to be an internal debugging message. What is a bit interesting is that the errors appear to be happening exactly once per second (~600 every 10 minutes), so this may be some kind of application that is polling the filesystem every second?

You could determine the filename in question by running "debugfs -c -R 'ncheck 209323401' /dev/{MDSDEV}" on the MDS, and perhaps discover which user and/or application is causing this message, and whether they are experiencing any problems. |
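A minimal sketch of the checks suggested above, run on the MDS; /dev/MDTDEV and /var/log/messages are placeholders (assumptions), not values taken from this ticket:

# map the inode number from the error message back to a pathname on the MDT
debugfs -c -R 'ncheck 209323401' /dev/MDTDEV

# gauge the message rate by counting occurrences per minute in the console log
grep mds_fid2dentry /var/log/messages | awk '{print $1, $2, substr($3, 1, 5)}' | uniq -c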
| Comment by Jay Lan (Inactive) [ 21/Oct/11 ] |
|
Mahmoud, nbp30-mds runs sles10sp2, lustre-1.8.2-3.4nas_ofed151. The git source can be found at https://github.com/jlan/lustre-nas |
| Comment by Peter Jones [ 25/Oct/11 ] |
|
Thanks Jay. When do you/Mahmoud expect to be able to answer Andreas's other questions? |
| Comment by Mahmoud Hanafi [ 27/Oct/11 ] |
|
This inode, 209323401, points to an empty directory "/ROOT/yham/output_ran7/Y1994/19941011". This user is not running any jobs and is not logged in, so it would appear that this directory is static.
drwxr-xr-x 2 yham xxxxx 4096 Oct 11 08:34 /nobackupp30/yham/output_ran7/Y1994/19941011 |
| Comment by Andreas Dilger [ 28/Oct/11 ] |
|
It looks like some client(s) are caching an old version of this information about this directory for some reason. If the message is repeating fairly regularly, it would be possible to collect RPCTRACE logs on the MDS (via "lctl set_param debug=+rpctrace") and then dump them quickly after the message appears (via "lctl dk /tmp/debug.log") to see which client it is coming from.

Then that client could collect RPCTRACE and VFSTRACE logs to see what operation it is doing to trigger this (to debug it), or just flush its cache (via "lctl set_param ldlm.namespaces.MDT.lru_size=clear") to get rid of the problem. The problem should also clear up if the clients are restarted. |
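A minimal sketch of that sequence, assuming the standard 1.8 lctl parameters; the output paths and the wildcarded namespace pattern (intended to match this filesystem's MDT namespace on the client) are placeholders, not values from this ticket:

# on the MDS: enable RPC tracing, wait for the error to appear, then dump the trace
lctl set_param debug=+rpctrace
lctl dk /tmp/debug-mds.log

# on the suspect client: capture RPC and VFS tracing for comparison
lctl set_param debug=+rpctrace
lctl set_param debug=+vfstrace
lctl dk /tmp/debug-client.log

# on the suspect client: drop the cached locks for the MDT namespace
lctl set_param ldlm.namespaces.*MDT*.lru_size=clear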
| Comment by Mahmoud Hanafi [ 02/Nov/11 ] |
|
Found the client that the requests were coming from. Got debug logs, then cleared the cache; the issue remained. Gathered debug logs after the flush. See attached file debug.log.tgz |
| Comment by Peter Jones [ 08/Nov/11 ] |
|
Lai, could you please review the latest information from NASA? Thanks. Peter |
| Comment by Lai Siyao [ 13/Dec/11 ] |
|
log of process 21071:
00000080:00200000:6:1320263638.353865:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263640.357859:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263642.362592:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263644.369859:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263646.373762:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263648.377838:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263650.381701:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263652.385898:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1
00000080:00200000:6:1320263654.389734:0:21071:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529767979(ffff810189be75a0), inode mode 41c0 mask 1

logs of process 898:
00000080:00200000:7:1320263650.433778:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
00000080:00200000:7:1320263650.433790:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
00000100:00100000:7:1320263650.433833:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751772954:10.151.25.176@o2ib:101
00000100:00100000:7:1320263650.433957:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751772954:10.151.25.176@o2ib:101
00000100:00100000:7:1320263650.433965:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink req@ffff8101aa92c000 x1380095751772954/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263992 ref 1 fl Rpc:R/0/0 rc 0/301
00000100:00100000:7:1320263650.433983:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
00000002:00100000:7:1320263650.434001:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2 req@ffff8101aa92c000 x1380095751772954/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263992 ref 1 fl Interpret:R/0/0 rc 0/301
00000080:00200000:7:1320263652.437757:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
00000080:00200000:7:1320263652.437766:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
00000100:00100000:7:1320263652.437792:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751774603:10.151.25.176@o2ib:101
00000100:00100000:7:1320263652.437900:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751774603:10.151.25.176@o2ib:101
00000100:00100000:7:1320263652.437904:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink req@ffff8101c62cc400 x1380095751774603/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263994 ref 1 fl Rpc:R/0/0 rc 0/301
00000100:00100000:7:1320263652.437911:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
00000002:00100000:7:1320263652.437921:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2 req@ffff8101c62cc400 x1380095751774603/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263994 ref 1 fl Interpret:R/0/0 rc 0/301
00000080:00200000:7:1320263654.441728:0:898:0:(file.c:3500:ll_inode_permission()) VFS Op:inode=209323401/3529770923(ffff810049e77ce0), inode mode 41c0 mask 1
00000080:00200000:7:1320263654.441741:0:898:0:(namei.c:557:ll_lookup_it()) VFS Op:name=COMPLETED.f15_Mach2,dir=209323401/3529770923(ffff810049e77ce0),intent=getattr
00000100:00100000:7:1320263654.441787:0:898:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751776847:10.151.25.176@o2ib:101
00000100:00100000:7:1320263654.441965:0:898:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc runOverflow.csh:2bb186c5-12bb-59bc-5cbf-2eb3fb75f6d0:898:x1380095751776847:10.151.25.176@o2ib:101
00000100:00100000:7:1320263654.441973:0:898:0:(events.c:114:reply_in_callback()) @@@ unlink req@ffff8101c7f90000 x1380095751776847/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263996 ref 1 fl Rpc:R/0/0 rc 0/301
00000100:00100000:7:1320263654.441990:0:898:0:(client.c:1891:ptlrpc_free_committed()) nbp30-MDT0000-mdc-ffff8101e335ac00: skip recheck: last_committed 210979479427
00000002:00100000:7:1320263654.442013:0:898:0:(mdc_locks.c:505:mdc_finish_enqueue()) @@@ op: 8 disposition: 3, status: -2 req@ffff8101c7f90000 x1380095751776847/t0 o101->nbp30-MDT0000_UUID@10.151.25.176@o2ib:12/10 lens 560/3920 e 0 to 1 dl 1320263996 ref 1 fl Interpret:R/0/0 rc 0/301

These logs show some patterns:
Mahmoud, if you could give some hint of what these two processes are doing, that would help. |
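If those processes are still running on that client, a minimal way to identify them (the PIDs 21071 and 898 are taken from the trace above; the loop itself is just an illustrative sketch):

# on the client, show the command line and working directory of each polling process
for pid in 21071 898; do
    ps -o pid,user,args -p $pid
    ls -l /proc/$pid/cwd
done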
| Comment by Peter Jones [ 04/Jan/12 ] |
|
Mahmoud Are you able to provide any further information about the processes Lai has highlighted? Peter |
| Comment by Mahmoud Hanafi [ 29/Oct/13 ] |
|
This case can be closed |
| Comment by Peter Jones [ 29/Oct/13 ] |
|
Thanks Mahmoud |