[LU-4567] NFS exports - The mds_getattr operation failed with -43 Created: 30/Jan/14  Updated: 02/Apr/14  Resolved: 02/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Daire Byrne (Inactive) Assignee: Peter Jones
Resolution: Won't Fix Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 12462

 Description   

Hi,

We are seeing a lot of "timeouts" to the MDS on two round-robin clients/NFS exporters. The issue seems to be unique to going via NFS. I am aware that the "-43" error is related to UID/GID mismatches, but I am almost certain that these are correctly configured to be the same everywhere. Even so, should the Lustre client essentially disconnect and then return I/O errors to the NFS clients for a short period if it can't match a UID/GID?
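For reference, errno 43 on Linux is EIDRM ("Identifier removed"), which is the value the MDS is handing back here as -43. A quick sketch to confirm the symbolic name and message on the exporter (stock Python, nothing Lustre-specific):

```python
import errno, os

# errno 43 on Linux is EIDRM ("Identifier removed"); the MDS
# logs above report it negated, as -43.
print(errno.EIDRM)                    # 43 on Linux
print(errno.errorcode[errno.EIDRM])   # EIDRM
print(os.strerror(errno.EIDRM))       # e.g. "Identifier removed"
```

Note that nfsd also flags it ("non-standard errno: -43") because EIDRM is not an errno NFS clients expect to see.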

Jan 24 01:36:26 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43
Jan 24 01:36:26 lustre1 kernel: LustreError: Skipped 527 previous similar messages
Jan 24 01:36:26 lustre1 kernel: LustreError: 3174:0:(llite_nfs.c:276:ll_get_parent()) failure -43 inode 2524321369 get parent
Jan 24 01:36:26 lustre1 kernel: LustreError: 3174:0:(llite_nfs.c:276:ll_get_parent()) Skipped 58 previous similar messages
Jan 24 01:36:26 lustre1 kernel: nfsd: non-standard errno: -43

Jan 24 01:36:26 mds kernel: LustreError: 5942:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-43)  req@ffff8104ce78b000 x1447826783625469/t0 o34->73d957f1-091b-9ffc-a5db-3402eba274ff@NET_0x200000a151615_UUID:0/0 lens 424/192 e 0 to 0 dl 1390527406 ref 1 fl Interpret:/0/0 rc -43/0
Jan 24 01:36:26 mds kernel: LustreError: 5942:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 319 previous similar messages

They occur at reasonably regular intervals because we have an application that scans various directories every 5 mins over NFS. Looking at the occurrences across both Lustre clients/NFS exporters:

lustre1 /root # tail -f /var/log/messages | grep "10.21.22.10"
Jan 30 12:01:01 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43
Jan 30 12:06:01 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43
Jan 30 12:11:00 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43
Jan 30 12:16:02 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43
Jan 30 12:21:01 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43
Jan 30 12:31:32 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43
Jan 30 12:51:14 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43
Jan 30 12:56:13 lustre1 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr_lock operation failed with -43

lustre2 /root # tail -f /var/log/messages | grep "10.21.22.10"
Jan 30 12:01:06 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:01:06 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:01:07 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:01:08 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:01:26 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:11:25 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:11:42 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:21:01 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:21:52 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43
Jan 30 12:37:03 lustre2 kernel: LustreError: 11-0: an error occurred while communicating with 10.21.22.10@tcp. The mds_getattr operation failed with -43

I am reasonably sure that both the network layer and UID/GIDs are fine.
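One way to double-check the UID/GID claim is to gather `id -u`/`id -g` for the relevant account on each host (e.g. via ssh) and diff the results mechanically. A minimal sketch of the comparison step; the hostnames are taken from the logs above and the sample numbers are purely illustrative:

```python
def find_mismatches(ids_by_host):
    """Return the hosts whose (uid, gid) pair differs from the first host's."""
    hosts = list(ids_by_host)
    baseline = ids_by_host[hosts[0]]
    return [h for h in hosts[1:] if ids_by_host[h] != baseline]

# Values as you might collect them with, say,
#   ssh HOST id -u USER && ssh HOST id -g USER
# (illustrative numbers -- not from this ticket)
sample = {"lustre1": (1001, 100), "lustre2": (1001, 100), "mds": (1002, 100)}
print(find_mismatches(sample))  # ['mds'] -- the MDS disagrees on the UID
```

An empty list for every account that touches the export would back up the "UID/GIDs are fine" statement.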

Regards,

Daire



 Comments   
Comment by Daire Byrne (Inactive) [ 02/Apr/14 ]

Well, we are not seeing this on our 2.4 cluster, so I guess we don't care so much anymore. We will stop using the v1.8 cluster to serve data over NFS.

Comment by Peter Jones [ 02/Apr/14 ]

ok - thanks Daire!
