Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
3
-
9223372036854775807
Description
Running iorssf (job 33751) on 3 client nodes, one of the clients got evicted. The IB fabric seems to be sane (no downlink, symbolerrors,...) for ports in question. No mds restart or failover operation was ongoing at the time the error occured.
Client error
Jan 22 20:12:35 lola-24 kernel: LustreError: 11-0: soaked-OST000f-osc-ffff880819252c00: Communicating with 192.1 68.1.105@o2ib10, operation obd_ping failed with -107. Jan 22 20:12:35 lola-24 kernel: Lustre: soaked-OST000f-osc-ffff880819252c00: Connection to soaked-OST000f (at 19 2.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Jan 22 20:12:35 lola-24 kernel: Lustre: Skipped 1 previous similar message Jan 22 20:12:35 lola-24 kernel: LustreError: 167-0: soaked-OST000f-osc-ffff880819252c00: This client was evicted by soaked-OST000f; in progress operations using this service will fail. Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) lock@ffff880650 352228[3 3 0 1 1 00000000] W(2):[8702328, 18446744073709551615]@[0x1000f0000:0x1514518:0x0] { Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) lovsub@ffff 8804b7635460: [0 ffff88067225d470 R(1):[8702328, 18446744073709551615]@[0x60002c323:0x96d3:0x0]] Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) osc@ffff880 4fdc1dce8: ffff8803bf23e680 0x20040000000 0x5ec022e3dfcd440e 3 ffff88073a4476f8 size: 53278838784 mtime: 1421986 119 atime: 1421985489 ctime: 1421986114 blocks: 102256017 Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) } lock@ffff8806 50352228 Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) dlmlock returne d -5 Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:809:ldlm_resource_complain()) soaked-OST0 00f-osc-ffff880819252c00: namespace resource [0x1514518:0x0:0x0].0 (ffff8802cef43d00) refcount nonzero (1) after lock cleanup; forcing cleanup. Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x1514518:0x0:0x0].0 (ffff8802cef43d00) refcount = 2 Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order): Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: so aked-OST000f-osc-ffff880819252c00 lock: ffff8803bf23e680/0x5ec022e3dfcd440e lrc: 3/0,0 mode: PW/PW res: [0x15145 18:0x0:0x0].0 rrc: 2 type: EXT [35644735488->18446744073709551615] (req 35644735488->18446744073709551615) flags : 0x126400000000 nid: local remote: 0xe42bc2b426f6efd7 expref: -99 pid: 163262 timeout: 0 lvb_type: 1 Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1422:cl_unuse_try()) result = -5, this is unlik ely! Jan 22 20:12:35 lola-24 kernel: Lustre: soaked-OST000f-osc-ffff880819252c00: Connection restored to soaked-OST00 0f (at 192.168.1.105@o2ib10) Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) lock@ffff880650352660[2 0 0 1 0 00000000] R(1):[8702328, 18446744073709551615]@[0x60002c323:0x96d3:0x0] { Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) vvp@ffff8804e2282c4 0: Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) lov@ffff88067225d47 0: 1 Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) 0 0: --- Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) } lock@ffff880650352660 Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) unuse return -5 Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: error: *** STEP 33751.0 KILLED AT 2015-01-22T20:12:35 WITH SIGNAL 9 *** Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: error: *** STEP 33751.0 KILLED AT 2015-01-22T20:12:35 WITH SIGNAL 9 *** Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: done with job
OSS error:
Jan 22 20:12:31 lola-5 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.1.124@o2ib100 ns: filter-soaked-OST000f_UUID lock: ffff88073a776480/0xe42bc2b426f6ee2c lrc: 3/0,0 mode: PR/PR res: [0x1514518:0x0:0x0].0 rrc: 7 type: EXT [35519225856->35644735487] (req 35519225856->35550605311) flags: 0x60000000000020 nid: 192.168.1.124@o2ib100 remote: 0x5ec022e3dfcd4407 expref: 32 pid: 103043 timeout: 5797977767 lvb_type: 0 Jan 22 20:14:10 lola-5 kernel: LustreError: 93964:0:(ost_handler.c:1775:ost_blocking_ast()) Error -2 syncing data on lock cancel Jan 22 20:14:12 lola-5 kernel: LNet: Service thread pid 102689 completed after 319.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Jan 22 20:14:12 lola-5 kernel: LNet: Skipped 3 previous similar messages Jan 22 20:14:55 lola-5 kernel: LNet: Service thread pid 102685 completed after 362.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Jan 22 20:14:55 lola-5 kernel: LNet: Skipped 1 previous similar message Jan 22 20:14:59 lola-5 kernel: LustreError: 93958:0:(ldlm_lib.c:2679:target_bulk_io()) @@@ bulk PUT failed: rc -107 req@ffff8807306bc800 x1490477310356324/t0(0) o3->76d86919-2338-4877-542c-9e7f34e71591@192.168.1.124@o2ib100:0/0 lens 488/432 e 8 to 0 dl 1421986372 ref 1 fl Interpret:/0/0 rc 0/0 Jan 22 20:14:59 lola-5 kernel: Lustre: soaked-OST000f: Bulk IO read error with 76d86919-2338-4877-542c-9e7f34e71591 (at 192.168.1.124@o2ib100), client will retry: rc -107 Jan 22 20:14:59 lola-5 kernel: Lustre: Skipped 1 previous similar message Jan 22 20:14:59 lola-5 kernel: Lustre: 102688:0:(service.c:2045:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (243:127s); client may timeout. req@ffff880271b8ac00 x1490477310356328/t0(0) o3->76d86919-2338-4877-542c-9e7f34e71591@192.168.1.124@o2ib100:0/0 lens 488/432 e 8 to 0 dl 1421986372 ref 1 fl Complete:/0/ffffffff rc 0/-1 Jan 22 20:14:59 lola-5 kernel: Lustre: 102688:0:(service.c:2045:ptlrpc_server_handle_request()) Skipped 1 previous similar message Jan 22 20:14:59 lola-5 kernel: LustreError: 93958:0:(ldlm_lib.c:2679:target_bulk_io()) Skipped 2 previous similar messages Jan 22 20:18:22 lola-5 kernel: LustreError: 94003:0:(ost_handler.c:1775:ost_blocking_ast()) Error -2 syncing data on lock cancel
Landed for 2.8