[LU-1271] client evicted by ost during simul truncate test Created: 30/Mar/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Minh Diep Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Servers: 2.2.0 RC2
Clients: 2.2.0 RC2 rhel5 and rhel6
105 rhel6 clients, 130 rhel5 clients


Severity: 3
Rank (Obsolete): 10430

 Description   

During simul test #36 truncate individual mode, I removed the test directory
Simul was failed as expected but a few clients were evicted by ost

Mar 29 14:41:14 ehyperion-dit34 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.114.15@o2ib1 ns: filter-lustre-OST002f_UUI
D lock: ffff880637e41240/0x841523d0d78a63ce lrc: 3/0,0 mode: PW/PW res: 1408564/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0xcfdc2859f55f67fa expref: 4 pid: 209
65 timeout 4307447235
Mar 29 14:41:14 ehyperion-dit34 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.114.22@o2ib1 ns: filter-lustre-OST001f_UUID lock: ffff8805eb75bb40/0x841523d0d78a63f1 lrc: 3/0,0 mode: PW/PW res: 1406133/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x5d5b157fb75724f0 expref: 4 pid: 209
50 timeout 4307447236
Mar 29 14:41:14 ehyperion-dit30 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.115.128@o2ib1 ns: filter-lustre-OST0032_UU
ID lock: ffff88060c42db40/0x545dd15c52687038 lrc: 3/0,0 mode: PW/PW res: 1443808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x96951ee279363733 expref: 4 pid: 20
778 timeout 4307432200
Mar 29 14:41:16 ehyperion305 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.65@o2ib1. The obd_ping operation failed with -107
Mar 29 14:41:16 ehyperion305 kernel: LustreError: Skipped 35 previous similar messages
Mar 29 14:41:16 ehyperion305 kernel: Lustre: lustre-OST001f-osc-ffff880228f4ac00: Connection to service lustre-OST001f via nid 192.168.127.65@o2ib1 was lost; in progress operations using this service will wait for recovery to complete.
Mar 29 14:41:16 ehyperion305 kernel: Lustre: Skipped 9 previous similar messages
Mar 29 14:41:16 ehyperion305 kernel: LustreError: 167-0: This client was evicted by lustre-OST001f; in progress operations using this service will fail.
Mar 29 14:41:16 ehyperion305 kernel: Lustre: lustre-OST001f-osc-ffff880228f4ac00: Connection restored to service lustre-OST001f using nid 192.168.127.65@o2ib1.
Mar 29 14:41:16 ehyperion305 kernel: Lustre: Skipped 61 previous similar messages
Mar 29 14:41:29 ehyperion-dit30 kernel: LustreError: 20822:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (107) req@ffff8805f00be400 x1397535667303049/t0(0) o400><?>@<?>:0/0 lens 192/0 e 0 to 0
dl 1333057350 ref 1 fl Interpret:H/0/ffffffff rc -107/-1
Mar 29 14:41:29 ehyperion-dit30 kernel: LustreError: 20822:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 10764 previous similar messages
Mar 29 14:41:29 ehyperion557 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.61@o2ib1. The obd_ping operation failed with -107
Mar 29 14:41:29 ehyperion557 kernel: LustreError: Skipped 16 previous similar messages
Mar 29 14:41:29 ehyperion557 kernel: Lustre: lustre-OST0032-osc-ffff880210e9a400: Connection to service lustre-OST0032 via nid 192.168.127.61@o2ib1 was lost; in progress operations using this service will wait for r
ecovery to complete.
Mar 29 14:41:29 ehyperion557 kernel: Lustre: Skipped 14 previous similar messages
Mar 29 14:41:29 ehyperion557 kernel: LustreError: 167-0: This client was evicted by lustre-OST0032; in progress operations using this service will fail.
Mar 29 14:41:29 ehyperion557 kernel: Lustre: lustre-OST0032-osc-ffff880210e9a400: Connection restored to service lustre-OST0032 using nid 192.168.127.61@o2ib1.
Mar 29 14:41:29 ehyperion557 kernel: Lustre: Skipped 61 previous similar messages
Mar 29 14:41:29 ehyperion298 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.65@o2ib1. The obd_ping operation failed with -107
Mar 29 14:41:29 ehyperion298 kernel: LustreError: Skipped 15 previous similar messages
Mar 29 14:41:29 ehyperion298 kernel: Lustre: lustre-OST002f-osc-ffff880226002400: Connection to service lustre-OST002f via nid 192.168.127.65@o2ib1 was lost; in progress operations using this service will wait for r
ecovery to complete.
Mar 29 14:41:29 ehyperion298 kernel: Lustre: Skipped 53 previous similar messages
Mar 29 14:41:29 ehyperion298 kernel: LustreError: 167-0: This client was evicted by lustre-OST002f; in progress operations using this service will fail.
Mar 29 14:41:29 ehyperion298 kernel: Lustre: lustre-OST002f-osc-ffff880226002400: Connection restored to service lustre-OST002f using nid 192.168.127.65@o2ib1.
Mar 29 14:41:29 ehyperion298 kernel: Lustre: Skipped 61 previous similar messages



 Comments   
Comment by Oleg Drokin [ 04/Apr/12 ]

From the logs we can see there is some sort of a signal delivered to the writing thread while it has a lco kenqueued during truncate.

As such the syscall is aborted, and the not yet granted lock is placed onto the LRU.
The lock meanwhile is granted with CBPENDING already set. But since we only check it on lock LRU placement, and the lock is already in LRU, it's never noticed and eventually the client is evicted for not releasing the lock.

The straightforward fix is to check if the lock is in the LRU already when we get the completion AST with CBPENDING set and if so, release it right away.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.

Generated at Sat Feb 10 01:15:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.