[LU-863] (upcall_cache.c:342:upcall_cache_get_entry()) acquire timeout exceeded for key 1104 Created: 18/Nov/11  Updated: 15/Dec/11  Resolved: 15/Dec/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Johann Lombardi (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File lustre_error_after_filter.log.gz     File messages_nies_20111110-01.tar.gz    
Severity: 3
Rank (Obsolete): 6519

 Description   

We have a customer who is currently using lustre-1.8.3.ddn3.3.
A single user on a single client writes many files (100K files, average file size 600KB) and then reads them back as the same user on the same client.
When that user reads these files, some of them cannot be read and the error "No such file or directory" is returned. However, if the same user reads these files from other clients, they can be read.

The following error message appears on the MDS when the problem happens:
Nov 3 06:08:19 md1 kernel: LustreError: 9406:0:(upcall_cache.c:342:upcall_cache_get_entry()) acquire timeout exceeded for key 1104

Could you please have a look at the log files and find out whether this is a known issue on 1.8.3 or not?

Thanks



 Comments   
Comment by Shuichi Ihara (Inactive) [ 18/Nov/11 ]

/var/log/messages

Comment by Johann Lombardi (Inactive) [ 18/Nov/11 ]

Hi Ihara,

Is this customer using LDAP or NIS?

Comment by Shuichi Ihara (Inactive) [ 18/Nov/11 ]

Johann,
The customer is using NIS.

Comment by Johann Lombardi (Inactive) [ 18/Nov/11 ]

The default group upcall timeout is 15s, so this means that NIS sometimes takes more than 15s to answer the request.
What you can do is increase the acquire timeout (i.e. group_acquire_expire) as well as the interval at which we refresh the group information (i.e. group_expire_interval, set to 10 mins by default).
I would also recommend checking whether anything can be done to make NIS more responsive on the MDS, since it hurts Lustre (request processing is stuck until NIS replies).
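
One way to check NIS responsiveness on the MDS is to time a group lookup through the system name service and compare it with the 15s acquire timeout. The following is only a minimal sketch, not part of Lustre: the helper names are hypothetical, and it assumes that "key 1104" in the error message is the UID being resolved and that the MDS resolves users and groups through NSS (and therefore NIS).

```python
import os
import pwd
import time

# Default group upcall timeout quoted in this ticket, in seconds.
ACQUIRE_TIMEOUT_S = 15


def timed(fn, *args):
    """Call fn(*args) and return (elapsed_seconds, result)."""
    start = time.monotonic()
    result = fn(*args)
    return time.monotonic() - start, result


def group_lookup(uid):
    """Resolve the group IDs for uid through the system name service."""
    entry = pwd.getpwuid(uid)  # raises KeyError if the UID is unknown
    return os.getgrouplist(entry.pw_name, entry.pw_gid)


def exceeds_timeout(elapsed, limit=ACQUIRE_TIMEOUT_S):
    """Return True if a lookup took longer than the acquire timeout."""
    return elapsed > limit


if __name__ == "__main__":
    uid = 1104  # assumption: the UID behind "key 1104" in the error message
    try:
        elapsed, groups = timed(group_lookup, uid)
    except KeyError:
        print(f"uid {uid} is not known on this host")
    else:
        print(f"uid {uid}: {len(groups)} groups resolved in {elapsed:.2f}s")
        if exceeds_timeout(elapsed):
            print("lookup was slower than the acquire timeout")
```

Running this repeatedly on the MDS while the workload is active would show whether lookups occasionally approach or exceed the timeout. A single fast run proves little, since cached answers can hide the slow path.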

Comment by Shuichi Ihara (Inactive) [ 18/Nov/11 ]

OK, we will suggest increasing group_acquire_expire to 600 and then running the program again to see if the problem goes away.

Comment by Peter Jones [ 15/Dec/11 ]

Ihara

Can this ticket be closed?

Peter

Comment by Shuichi Ihara (Inactive) [ 15/Dec/11 ]

Peter,

Yes, the customer increased group_acquire_expire to 600, and since then they have not seen the same error messages so far.
So, please close this ticket.

Thank you.

Ihara

Comment by Peter Jones [ 15/Dec/11 ]

Thanks Ihara!

Generated at Sat Feb 10 01:11:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.