[LU-4607] OSS servers crashing with error: (ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired Created: 10/Feb/14 Updated: 13/Feb/14 Resolved: 13/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oz Rentas | Assignee: | Cliff White (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
mds1 (MGS + MDS for home): mds1.ibb@o2ib0:mds2.ibb@o2ib0:/home
All client mounts (fstab):
mds1.ibb@o2ib0:mds2.ibb@o2ib0:/scratch /global/scratch lustre rw,user_xattr,localflock,_netdev 0 0 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12604 |
| Description |
|
These messages appear every few hours on the OSS nodes. On the client:
pod24b14 kernel: : LustreError: 11-0: lock@ffff8806063297b8 lock@ffff880606329978 |
| Comments |
| Comment by Cliff White (Inactive) [ 10/Feb/14 ] |
|
These evictions are usually a sign of either a busy server, or a busy/bad network. The clients should recover, perhaps with an error. What does the load look like on the server side? Is there any indication of network error/issue? |
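For reference, a minimal sketch of checks that could be run on the OSS side to look for network trouble on the o2ib fabric. The client NID shown is only an example, and ibstat/perfquery require the infiniband-diags tools, so treat this as illustrative rather than a prescribed procedure:

# Check LNet reachability of a client NID (substitute a real NID from this cluster)
lctl ping 192.168.224.16@o2ib
# Check HCA port state and error counters (infiniband-diags tools)
ibstat
perfquery
# Scan the kernel log for LNet/o2ib/IRQ errors
dmesg | grep -i -E 'lnet|o2ib|error'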
| Comment by Cliff White (Inactive) [ 10/Feb/14 ] |
|
If this is only happening on a few clients every few hours, you may be able to isolate it to a specific task; again, it is best to monitor server load over time (vmstat, top, etc.). |
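As a minimal sketch of the kind of periodic load sampling suggested here, assuming a hypothetical log path /var/tmp/oss-load.log; any equivalent tool (sar, collectl, etc.) would serve the same purpose:

# Sample OSS load every 60 seconds with timestamps (hypothetical log path)
while true; do
    { date; vmstat 1 3; top -b -n 1 | head -n 20; } >> /var/tmp/oss-load.log
    sleep 60
done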
| Comment by Andreas Dilger [ 11/Feb/14 ] |
|
For future reference, the binary debug logs that are dumped by the kernel need to be post-processed before we can use them. You need to run:
lctl debug_file /tmp/lustre-log > /tmp/lustre-log.txt
for them to be very useful. It would also be useful to include a bit more of the console logs from the OSS, since this error message is itself not a sign of a crash, but a normal message indicating that the client was not responsive to the server's request to cancel the lock. |
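For reference, a minimal sketch of the post-processing step described above. The exact dump file name is whatever path the console message reports (typically under /tmp), so the names below are illustrative:

# Convert the binary debug dump to readable text
lctl debug_file /tmp/lustre-log > /tmp/lustre-log.txt
# Optionally dump the current kernel debug buffer to a file as well
lctl dk /tmp/lustre-dk.txt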
| Comment by Oz Rentas [ 12/Feb/14 ] |
|
Noted, thank you Andreas. The processed logs are too large to attach to this ticket but can be downloaded from: http://ddntsr.com/ftp/2014-02-11-R30000_ddn_lustre_processed_logs.tgz

Also, what can this mean? On the client 192.168.224.16 (i.e. pod24b16):
[root@pod24b16 ~]# date ; lfs fid2path /global/scratch [0x5b9afe:0x0:0x0] ; date
Wed Feb 12 03:09:16 PST 2014
ioctl err -22: Invalid argument (22)
fid2path error: Invalid argument
Wed Feb 12 03:09:16 PST 2014

The corresponding message on mds2 (the MDS node for scratch):
Feb 12 03:09:16 mds2 kernel: : |
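The ioctl err -22 (EINVAL) suggests the FID argument was not accepted as a valid FID for that filesystem. A minimal sanity check, using a placeholder file name and an illustrative FID that is not from this system:

# Get the FID of a known file, then map it back to a path
lfs path2fid /global/scratch/<some_existing_file>
lfs fid2path /global/scratch '[0x200000400:0x1:0x0]'   # illustrative FID only
# Quoting the bracketed FID keeps the shell from treating the brackets as glob characters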
| Comment by John Fuchs-Chesney (Inactive) [ 12/Feb/14 ] |
|
Thanks Cliff. |
| Comment by Cliff White (Inactive) [ 12/Feb/14 ] |
|
What that means is quite simple: a client is holding a lock on a resource and has stopped talking to the server. The server waits, in this case 1620 seconds, and then times out; the timeout allows the server to reclaim the lock, so that a dead client does not halt other work on the cluster. As noted earlier, this normally comes down to a busy server, a busy or faulty network, or a dead/hung client.
First, you should determine whether there are corresponding dead clients, or clients with application errors matching the server error timestamps. |
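A minimal sketch of how that correlation could be done, assuming standard syslog locations (/var/log/messages); parameter names such as at_max can vary between Lustre versions:

# On the OSS: list the timeout/eviction events with their timestamps
grep -i -E 'lock callback timer expired|evict' /var/log/messages
# On suspect clients: look for Lustre errors or evictions around the same times
grep -i -E 'lustre|evict' /var/log/messages
# Current timeout-related settings on the server
lctl get_param timeout at_max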
| Comment by Cliff White (Inactive) [ 12/Feb/14 ] |
|
To further clarify, this bit: |
| Comment by Oz Rentas [ 12/Feb/14 ] |
|
Thanks Cliff. Totally understand. In this case, there has been some push back from the customer to collect network / client / OSS stats. This feedback will hopefully help justify the need to collect the data that has already been requested (multiple times). Thanks again. |
| Comment by John Fuchs-Chesney (Inactive) [ 13/Feb/14 ] |
|
Oz – do you want us to keep this ticket open/unresolved, while you try to get your customer data? Or shall we mark this ticket as resolved and wait for you to open a new ticket if the problem reoccurs? Please advise. Thanks. |
| Comment by Oz Rentas [ 13/Feb/14 ] |
|
There are two different issues:
1) Mellanox: OSS HCAs losing access to UFM, and thus losing complete access to the storage. The sequence of errors is attached as do_IRQ-errors.txt. irqbalance has been disabled on all servers, and the system is now being monitored.
2) Quota issue on a single OST (ost_scratch_11).
The first issue is not a Lustre problem, so we can go ahead and close. Thanks all. |
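For the quota issue in item 2, a minimal sketch of how per-OST quota usage could be inspected; <user> is a placeholder and the parameter pattern may differ by Lustre version:

# Per-OST usage and limits for a user on the scratch filesystem
lfs quota -v -u <user> /global/scratch
# Quota slave status on the servers (run on the OSS/MDS)
lctl get_param osd-*.*.quota_slave.info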
| Comment by Peter Jones [ 13/Feb/14 ] |
|
ok thanks Oz |