[LU-5532] LustreError: 77234:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600 Created: 21/Aug/14 Updated: 28/Aug/14 Resolved: 28/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jason Hill (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre 2.4.3, RHEL 6.4, kernel 2.6.32-358.23.2.el6.atlas.x86_64 |
| Attachments: | |
| Severity: | 2 |
| Rank (Obsolete): | 15399 |
| Description |
|
A Lustre OSS reports Lustre as unhealthy after issuing the following sequence of messages. This is the second occurrence in the last 24 hours.

Aug 21 14:00:23 atlas-oss1c7.ccs.ornl.gov kernel: [3794666.155556] Lustre: atlas1-OST0016: Client 1942a1b8-14c2-1c85-f1cc-f5a627755ef9 (at 10.38.145.2@o2ib4) reconnecting

This was then followed by several messages like:

Aug 21 14:54:27 atlas-oss1c7.ccs.ornl.gov kernel: [3797911.436996] Lustre: 33219:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/3), not sending early reply

Full syslog to follow. |
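The at_max and early-reply messages above come from the adaptive timeout machinery on the servers. As a rough sketch (parameter names assumed for a stock Lustre 2.4 OSS; verify locally), the relevant tunables and the per-service timeout estimates can be inspected with:

# Adaptive timeout ceiling that the requested 755s exceeded, plus related tunables
lctl get_param at_max at_min at_history timeout

# Per-service timeout estimates on the OSS (parameter path assumed; varies by version)
lctl get_param -n ost.OSS.ost_io.timeouts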
| Comments |
| Comment by Jason Hill (Inactive) [ 21/Aug/14 ] |
|
Lustre logs from the affected OSS and the MDS for that filesystem for the two occurrences of this particular issue. |
| Comment by Jason Hill (Inactive) [ 21/Aug/14 ] |
|
Also of significance (I think): there is very little back-end IO happening, system load is over 300 in all categories, and memory utilization is extremely high, with less than 650MB of 64GB free.

[root@atlas-oss1c7 ~]# free -m

Yesterday I successfully unmounted all the OSTs from this OSS, removed all Lustre kernel modules, and restarted Lustre with a positive outcome. |
| Comment by Jason Hill (Inactive) [ 21/Aug/14 ] |
|
The clients are both from the same cluster; both are running:

# rpm -qa | grep lustre |
| Comment by Jason Hill (Inactive) [ 21/Aug/14 ] |
|
Please drop the severity; that was my mistake. I tabbed through the field and had not intended to select severity 2. |
| Comment by Peter Jones [ 22/Aug/14 ] |
|
Oleg is looking into this one |
| Comment by Oleg Drokin [ 22/Aug/14 ] |
|
From the logs it seems there is severe disk backend slowness going on. I remember that one of the times we hit something like this before was soon after you restarted an OST. Did the restart you mention also happen before these problems occurred? Is there any interesting data from the DDN side about load on the array? |
| Comment by Jason Hill (Inactive) [ 26/Aug/14 ] |
|
Oleg, no interesting data at the DDN level, but we were able to see some errors on the InfiniBand interface between the OSS and the DDN. We replaced the IB cable and that seems to have quelled the issue. This was just unexpected because I thought the only way to clear an "unhealthy" state in /proc/fs/lustre/health_check was to reboot, or at least to unmount the devices and unload the Lustre modules. I'm fine with closing this issue; thanks for the response, and I apologize for my delay in getting back to you. |
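For reference, a quick way to check that flag without a reboot (standard proc path on 2.4; shown as a sketch) is:

# Reports "healthy" once the backend has recovered
cat /proc/fs/lustre/health_check
lctl get_param -n health_check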
| Comment by Oleg Drokin [ 28/Aug/14 ] |
|
If "unhealthy" state was set due to slowness of the disk, and then the disk performance improves (ie requests no longer take 10 minutes to complete), unhealthy state will clear. |
| Comment by James Nunez (Inactive) [ 28/Aug/14 ] |
|
Per ORNL, the issue is now understood and we can close the ticket. |