[LU-5532] LustreError: 77234:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600 Created: 21/Aug/14  Updated: 28/Aug/14  Resolved: 28/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Jason Hill (Inactive) Assignee: Oleg Drokin
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Lustre 2.4.3, RHEL 6.4, kernel 2.6.32-358.23.2.el6.atlas.x86_64


Attachments: File atlas-mds1.20140820.kern.log     File atlas-mds1.20140821.kern.log     File atlas-oss1c7.20140820.kern.log     File atlas-oss1c7.20140821.kern.log    
Severity: 2
Rank (Obsolete): 15399

 Description   

A Lustre OSS reports that Lustre is unhealthy after issuing the following sequence of messages. This is the second occurrence in the last 24 hours.

Aug 21 14:00:23 atlas-oss1c7.ccs.ornl.gov kernel: [3794666.155556] Lustre: atlas1-OST0016: Client 1942a1b8-14c2-1c85-f1cc-f5a627755ef9 (at 10.38.145.2@o2ib4) reconnecting
Aug 21 14:00:23 atlas-oss1c7.ccs.ornl.gov kernel: [3794666.185134] Lustre: Skipped 6 previous similar messages
Aug 21 14:02:53 atlas-oss1c7.ccs.ornl.gov kernel: [3794816.201589] LustreError: 137-5: atlas1-OST00a2_UUID: not available for connect from 10.38.145.2@o2ib4 (no target)
Aug 21 14:12:20 atlas-oss1c7.ccs.ornl.gov kernel: [3795383.544447] Lustre: atlas1-OST0016: Client 1942a1b8-14c2-1c85-f1cc-f5a627755ef9 (at 10.38.145.2@o2ib4) reconnecting
Aug 21 14:12:20 atlas-oss1c7.ccs.ornl.gov kernel: [3795383.567821] Lustre: Skipped 5 previous similar messages
Aug 21 14:19:20 atlas-oss1c7.ccs.ornl.gov kernel: [3795803.534475] LustreError: 137-5: atlas1-OST00a2_UUID: not available for connect from 10.38.145.2@o2ib4 (no target)
Aug 21 14:19:20 atlas-oss1c7.ccs.ornl.gov kernel: [3795803.566134] LustreError: Skipped 3 previous similar messages
Aug 21 14:26:17 atlas-oss1c7.ccs.ornl.gov kernel: [3796220.710761] Lustre: atlas1-OST01c6: Client 5d5389e1-62ad-c671-5318-48ff669e4a6e (at 10.38.145.2@o2ib4) reconnecting
Aug 21 14:26:17 atlas-oss1c7.ccs.ornl.gov kernel: [3796220.734900] Lustre: Skipped 11 previous similar messages
Aug 21 14:31:17 atlas-oss1c7.ccs.ornl.gov kernel: [3796520.850175] LustreError: 137-5: atlas1-OST00a2_UUID: not available for connect from 10.38.145.2@o2ib4 (no target)
Aug 21 14:31:17 atlas-oss1c7.ccs.ornl.gov kernel: [3796520.878544] LustreError: Skipped 4 previous similar messages
Aug 21 14:38:14 atlas-oss1c7.ccs.ornl.gov kernel: [3796937.996148] Lustre: atlas1-OST0136: Client 5d5389e1-62ad-c671-5318-48ff669e4a6e (at 10.38.145.2@o2ib4) reconnecting
Aug 21 14:38:14 atlas-oss1c7.ccs.ornl.gov kernel: [3796938.026732] Lustre: Skipped 6 previous similar messages
Aug 21 14:44:39 atlas-oss1c7.ccs.ornl.gov kernel: [3797323.165332] LustreError: 33287:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 603, more than at_max 600
Aug 21 14:44:39 atlas-oss1c7.ccs.ornl.gov kernel: [3797323.165334] ns: filter-atlas1-OST0376_UUID lock: ffff88034b20a900/0xecd0a12120f17d4c lrc: 4/0,0 mode: PW/PW res: [0x5c2a0f:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->16383) flags: 0x10020 nid: 10.36.205.208@o2ib remote: 0xc39a22a84c361c61 expref: 26 pid: 14426 timeout: 8090929608 lvb_type: 0
Aug 21 14:44:39 atlas-oss1c7.ccs.ornl.gov kernel: [3797323.292254] LustreError: 33287:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) Skipped 3 previous similar messages
Aug 21 14:44:57 atlas-oss1c7.ccs.ornl.gov kernel: [3797340.909561] Lustre: atlas1-OST0136: Slow creates, 128/256 objects created at a rate of 2/s
Aug 21 14:45:53 atlas-oss1c7.ccs.ornl.gov kernel: [3797397.137591] LustreError: 137-5: atlas1-OST0372_UUID: not available for connect from 10.38.145.2@o2ib4 (no target)
Aug 21 14:45:53 atlas-oss1c7.ccs.ornl.gov kernel: [3797397.161035] LustreError: Skipped 2 previous similar messages
Aug 21 14:48:24 atlas-oss1c7.ccs.ornl.gov kernel: [3797548.207840] Lustre: atlas1-OST01c6: Client 1942a1b8-14c2-1c85-f1cc-f5a627755ef9 (at 10.38.145.2@o2ib4) reconnecting
Aug 21 14:48:24 atlas-oss1c7.ccs.ornl.gov kernel: [3797548.238152] Lustre: Skipped 6 previous similar messages
Aug 21 14:52:26 atlas-oss1c7.ccs.ornl.gov kernel: [3797790.627280] LustreError: 45633:0:(service.c:3216:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 610s

These were then followed by several messages like this:

Aug 21 14:54:27 atlas-oss1c7.ccs.ornl.gov kernel: [3797911.436996] Lustre: 33219:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/3), not sending early reply
Aug 21 14:54:27 atlas-oss1c7.ccs.ornl.gov kernel: [3797911.436998] req@ffff8802f640e400 x1476723491621465/t0(0) o3->6ae34883-e4de-08ed-e3e9-ccdd41e9934d@9719@gni108:0/0 lens 448/432 e 0 to 0 dl 1408647272 ref 2 fl Interpret:/0/0 rc 0/0
Aug 21 14:54:32 atlas-oss1c7.ccs.ornl.gov kernel: [3797916.390933] Lustre: atlas1-OST0136: Bulk IO read error with 711e8a57-ae3b-7204-47a9-6a996887d00c (at 7253@gni108), client will retry: rc -110
Aug 21 14:54:32 atlas-oss1c7.ccs.ornl.gov kernel: [3797916.392844] LustreError: 33261:0:(ldlm_lib.c:2702:target_bulk_io()) @@@ timeout on bulk PUT after 0+0s req@ffff8803bab27000 x1476723483171409/t0(0) o3->711e8a57-ae3b-7204-47a9-6a996887d00c@7253@gni108:0/0 lens 448/432 e 0 to 0 dl 1408647272 ref 1 fl Interpret:/0/0 rc 0/0

Full syslog to follow.
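For reference, the adaptive-timeout and service-health state reported in the messages above can be inspected directly on the OSS. This is only a rough sketch; exact parameter paths vary between Lustre releases.

# at_max caps the adaptive timeout; the ldlm warning above fires when a
# requested lock callback timeout (755s here) exceeds it (600s).
cat /sys/module/ptlrpc/parameters/at_max
cat /sys/module/ptlrpc/parameters/at_min

# Overall server health; reads "healthy" unless a service such as ost_io
# has requests waiting past the watchdog threshold (the "waiting 610s" message).
cat /proc/fs/lustre/health_check

# Per-service request wait statistics for ost_io.
lctl get_param ost.OSS.ost_io.stats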



 Comments   
Comment by Jason Hill (Inactive) [ 21/Aug/14 ]

Lustre logs from the affected OSS and from the MDS for that filesystem, covering the two occurrences of this particular issue.

Comment by Jason Hill (Inactive) [ 21/Aug/14 ]

Also of significance (I think): there is very little back-end I/O happening, system load is over 300 across all load averages, and memory utilization is extremely high; less than 650 MB of 64 GB is free.

[root@atlas-oss1c7 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         64306      64038        268          0        761      51748
-/+ buffers/cache:      11527      52778
Swap:            0          0          0

Yesterday I successfully unmounted all the OSTs from this OSS, removed all Lustre kernel modules, and restarted Lustre with a positive outcome.
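For completeness, that restart procedure amounts to roughly the following; the mount points and device paths below are placeholders, not the actual ones on this OSS.

# Unmount every OST served by this OSS, then unload the Lustre modules.
umount /mnt/atlas1-OST0016          # repeat for each mounted OST
lustre_rmmod                        # unloads the Lustre and LNet kernel modules

# Remount the OSTs to bring the targets back.
mount -t lustre /dev/mapper/ost0016 /mnt/atlas1-OST0016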

Comment by Jason Hill (Inactive) [ 21/Aug/14 ]

The clients are both from the same cluster; both are running:

# rpm -qa | grep lustre
lustre-client-2.4.2-2.6.32_358.23.2.el6.x86_64_g89cc68b.x86_64
lustre-client-modules-2.4.2-2.6.32_358.23.2.el6.x86_64_g89cc68b.x86_64

Comment by Jason Hill (Inactive) [ 21/Aug/14 ]

Please drop the severity; that was my mistake. I tabbed through the field and had not intended to select severity 2.

Comment by Peter Jones [ 22/Aug/14 ]

Oleg is looking into this one

Comment by Oleg Drokin [ 22/Aug/14 ]

So from the logs it seems there is severe disk back-end slowness going on.
We see things like "Aug 21 14:52:26 atlas-oss1c7.ccs.ornl.gov kernel: [3797790.627280] LustreError: 45633:0:(service.c:3216:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 610s" and slow creates at a rate of 2/s.

I remember that one of the times we hit something like this before was soon after you restarted an OST. Did the restart you mention also happen shortly before these problems occurred?

Any interesting data from the DDN side about load on the array?

Comment by Jason Hill (Inactive) [ 26/Aug/14 ]

Oleg,

No interesting data at the DDN level, but we were able to see some errors on the InfiniBand interface between the OSS and the DDN. We replaced the IB cable, and that seems to have quelled the issue. This was just unexpected, because I thought the only way to clear an "unhealthy" state in /proc/fs/lustre/health_check was to reboot, or at least to unmount the devices and unload the Lustre modules.
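(For reference, link errors of this kind show up in the standard InfiniBand diagnostics; the commands below are illustrative, not necessarily the ones used here.)

# Local port error counters (symbol errors, link downed, receive errors).
perfquery

# Sweep the fabric for ports with non-zero error counters.
ibqueryerrors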

I'm good if we close this issue; thanks for the response and I apologize for my delay in getting back to you.


-Jason

Comment by Oleg Drokin [ 28/Aug/14 ]

If "unhealthy" state was set due to slowness of the disk, and then the disk performance improves (ie requests no longer take 10 minutes to complete), unhealthy state will clear.

Comment by James Nunez (Inactive) [ 28/Aug/14 ]

Per ORNL, the issue is now understood and we can close the ticket.
