[LU-1466] Hyperion DAT - IOR ssf - client eviction Created: 01/Jun/12 Updated: 22/Jun/12 Resolved: 22/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6388 |
| Description |
|
Running IOR, single-shared file, a single client is always evicted for a blocking callback by OSS. OSS and client debug logs attached. |
| Comments |
| Comment by Cliff White (Inactive) [ 01/Jun/12 ] |
|
Debug logs are on the FTP server, under uploads. |
| Comment by Cliff White (Inactive) [ 01/Jun/12 ] |
|
Repeated the tests and hit the same error; full debug logs and messages are uploaded to uploads. |
| Comment by Peter Jones [ 02/Jun/12 ] |
|
Oleg, what do you advise here? Thanks, Peter |
| Comment by Cliff White (Inactive) [ 02/Jun/12 ] |
|
I have repeated the test with the 2.2.54 tag, same errors. |
| Comment by Oleg Drokin [ 04/Jun/12 ] |
|
The second test log set contains the culprit:

00010000:00010000:1.0:1338581263.050526:0:22810:0:(ldlm_lockd.c:1503:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-OST002a-osc-ffff8101a4928800 lock: ffff8101c7918b40/0xc2c2e1a2df0fb82d lrc: 3/0,0 mode: PW/PW res: 36588/0 rrc: 6 type: EXT [4706009088->4739563519] (req 4706009088->4707057663) flags: 0x10100000 remote: 0x1272af3896302f40 expref: -99 pid: 25237 timeout 0
00010000:00010000:1.0:1338581263.050531:0:22810:0:(ldlm_lockd.c:1516:ldlm_handle_bl_callback()) Lock ffff8101c7918b40 already unused, calling callback (ffffffff88837f80)

I assume the BL callback then blocked on cl_lock_mutex_get(env, lock) in osc_dlm_blocking_ast0(), as there is basically nothing else to block on. There is no further activity on this lock until it is finally cancelled:

00010000:00010000:1.0:1338581374.583277:0:22810:0:(ldlm_request.c:1030:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST002a-osc-ffff8101a4928800 lock: ffff8101c7918b40/0xc2c2e1a2df0fb82d lrc: 4/0,0 mode: PW/PW res: 36588/0 rrc: 6 type: EXT [4706009088->4739563519] (req 4706009088->4707057663) flags: 0x10102010 remote: 0x1272af3896302f40 expref: -99 pid: 25237 timeout 0

This cancel only happens after the client was already evicted; in fact, it comes right after the eviction and the cancellation of a bunch of RPCs. None of those RPCs seems to be too old, though: all of them appeared well after we got the BL AST.

This seems to be at least marginally related to LU-1274, which dealt with a similar issue in the glimpse callback. At the time it was decided that the server was slowly progressing through IO, but it might in fact be the client that hogs the lock after all. |
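For reference, the gap between the BL AST callback and the client-side cancel can be read straight off the timestamps in the two log lines quoted above. The following is a minimal sketch, not part of the original analysis. It assumes the debug log fields are laid out as subsystem:mask:cpu:seconds.microseconds:stack:pid:extpid:(file:line:function()) message, which is inferred from the quoted lines. The helper names (`parse`, `lock_events`) are hypothetical, and the sample messages are shortened to end at the lock handle.

```python
import re
from decimal import Decimal

# Assumed field layout, inferred from the log lines quoted in this ticket:
# subsystem:mask:cpu:seconds.microseconds:stack:pid:extpid:(file:line:function()) message
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[\d.]+):"
    r"(?P<ts>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)

def parse(line):
    """Split one debug log line into named fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = Decimal(rec["ts"])  # Decimal keeps the microseconds exact
    return rec

def lock_events(lines, lock_addr):
    """Return parsed records whose message mentions 'lock: <lock_addr>', in input order."""
    needle = "lock: " + lock_addr
    return [r for r in map(parse, lines) if r and needle in r["msg"]]

# The two lines from the comment above, with the message shortened after the lock handle.
SAMPLE = [
    "00010000:00010000:1.0:1338581263.050526:0:22810:0:"
    "(ldlm_lockd.c:1503:ldlm_handle_bl_callback()) ### client blocking AST callback handler "
    "ns: lustre-OST002a-osc-ffff8101a4928800 lock: ffff8101c7918b40/0xc2c2e1a2df0fb82d",
    "00010000:00010000:1.0:1338581374.583277:0:22810:0:"
    "(ldlm_request.c:1030:ldlm_cli_cancel_local()) ### client-side cancel "
    "ns: lustre-OST002a-osc-ffff8101a4928800 lock: ffff8101c7918b40/0xc2c2e1a2df0fb82d",
]

events = lock_events(SAMPLE, "ffff8101c7918b40")
gap = events[-1]["ts"] - events[0]["ts"]
print(events[0]["func"], "->", events[-1]["func"], gap)  # gap is 111.532751 seconds
```

On these two lines the lock sits for roughly 111.5 seconds between the blocking AST and the local cancel, which is consistent with the cancel only happening once the client had already been evicted.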
| Comment by Jinshan Xiong (Inactive) [ 04/Jun/12 ] |
|
The cancel process is not blocked acquiring the lock mutex, because I saw this line in the same log:

00000020:00010000:1.0:1338581263.050536:0:22810:0:(cl_lock.c:143:cl_lock_trace0()) cancel lock: ffff810137d5f4b0@(1 ffff810228bc4080 1 5 0 0 0 0)(ffff81016091ea30/1/1) at cl_lock_cancel():1830

This means it had already grabbed the lock mutex in order to call cl_lock_cancel(). However, I don't know why it can't get through the cancel process. Since no RPC was sent at all, maybe it was blocked on a page. Can you please show me the backtrace, with a higher debug level? |
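One way to check where thread 22810 went quiet is to pull all of its records out of the debug log and look for the largest gap between consecutive entries. This is a minimal sketch under the same assumption about the field layout (pid is the sixth colon-separated field, the timestamp the fourth). The helpers `thread_trace` and `longest_stall` are hypothetical names, and the sample is limited to the four lines quoted in this ticket, with long messages shortened.

```python
import re
from decimal import Decimal

# Matches the "(file:line:function()) message" tail of a debug log line.
FUNC_RE = re.compile(r"^\(([^:]+):(\d+):(\w+)\(\)\)\s*(.*)$")

def thread_trace(lines, pid):
    """Return (timestamp, function, message) tuples for one thread, sorted by time."""
    trace = []
    for line in lines:
        # Assumed layout: subsystem:mask:cpu:timestamp:stack:pid:extpid:(file:line:func()) msg
        parts = line.strip().split(":", 7)
        if len(parts) != 8 or parts[5] != str(pid):
            continue
        m = FUNC_RE.match(parts[7])
        if m:
            trace.append((Decimal(parts[3]), m.group(3), m.group(4)))
    return sorted(trace, key=lambda rec: rec[0])

def longest_stall(trace):
    """Return (gap_seconds, function_before, function_after) for the widest gap, or None."""
    gaps = [(b[0] - a[0], a[1], b[1]) for a, b in zip(trace, trace[1:])]
    return max(gaps, key=lambda g: g[0]) if gaps else None

# The four lines quoted in this ticket for thread 22810 (long messages shortened).
SAMPLE = [
    "00010000:00010000:1.0:1338581263.050526:0:22810:0:"
    "(ldlm_lockd.c:1503:ldlm_handle_bl_callback()) ### client blocking AST callback handler",
    "00010000:00010000:1.0:1338581263.050531:0:22810:0:"
    "(ldlm_lockd.c:1516:ldlm_handle_bl_callback()) Lock ffff8101c7918b40 already unused, "
    "calling callback (ffffffff88837f80)",
    "00000020:00010000:1.0:1338581263.050536:0:22810:0:"
    "(cl_lock.c:143:cl_lock_trace0()) cancel lock: ffff810137d5f4b0 at cl_lock_cancel():1830",
    "00010000:00010000:1.0:1338581374.583277:0:22810:0:"
    "(ldlm_request.c:1030:ldlm_cli_cancel_local()) ### client-side cancel",
]

stall = longest_stall(thread_trace(SAMPLE, 22810))
print(stall)  # widest gap: 111.532741 s, between cl_lock_trace0 and ldlm_cli_cancel_local
```

On the quoted lines, the thread logs the cl_lock_cancel() trace about 10 microseconds after the BL AST arrives and then nothing until the local cancel, so the stall sits after the mutex was taken. A full log may of course contain other records for this thread that are not quoted here.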
| Comment by Cliff White (Inactive) [ 05/Jun/12 ] |
|
That is likely to be difficult in the short term; it may be possible after the next test cycle on Hyperion. |
| Comment by Cliff White (Inactive) [ 22/Jun/12 ] |
|
Current testing with 2.1.2 fails to reproduce this issue. https://maloo.whamcloud.com/test_sets/652c72e0-b9b5-11e1-9392-52540035b04c Closing. |