[LU-1496] Client evicted frequently due to lock callback timer expiration Created: 08/Jun/12 Updated: 08/Oct/13 Resolved: 27/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | Lustre 1.8.9 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mitsuhiro Nishizawa | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6385 |
| Description |
|
Our customer is seeing client eviction due to lock callback timer expiration relatively frequently. << OSS >> << client >> |
| Comments |
| Comment by Mitsuhiro Nishizawa [ 08/Jun/12 ] |
|
Uploaded log file to the FTP server: /uploads/ Thanks, |
| Comment by Peter Jones [ 08/Jun/12 ] |
|
Bobijam Could you please advise on this one? Thanks Peter |
| Comment by Zhenyu Xu [ 10/Jun/12 ] |
|
from uploaded logs, it looks like ost0007 node had some congestions which caused client evictions, can you check it (disk, network interface, etc) and upload lustre debug logs of it here? |
| Comment by Mitsuhiro Nishizawa [ 14/Jun/12 ] |
|
We asked the customer to check the network and capture debug log when eviction occur next time. |
| Comment by Mitsuhiro Nishizawa [ 19/Jun/12 ] |
|
The issue occurred again and we captured debug log. We uploaded it to the FTP server: /uploads/ 2012/06/19 23:52:05 kern.err@oss4 kernel[-]: [1662880.375982] LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.17.11.50@o2ib ns: filter-data-OST0007_UUID lock: ffff8101a5e2fa00/0x2babc16ba243f737 lrc: 3/0,0 mode: PW/PW res: 430494/0 rrc: 8 type: EXT [73728->159743] (req 73728->77823) flags: 0x20 remote: 0x225685b920b04720 expref: 8 pid: 2438 timeout 5961707167 |
| Comment by Zhenyu Xu [ 20/Jun/12 ] |
|
The OSS4 is heavy loaded, and many clients are compete writing to same files. The write request from cnode050 to OST0007 (on OSS4) was delayed too much for the heavy load and got evicted. OSS4's heavy load 2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218357] Lustre: data-OST0007: slow quota init 150s due to heavy IO load cnode050's write request got evicted due to lock's timeout 00000100:00000200:23:1340117424.398493:0:4625:0:(service.c:1046:ptlrpc_hpreq_reorder_nolock()) @@@ high priority req req@ffff8101b6316800 x1403470546287529/t0 o4->2208c67a-8f53-db44-9cbd-e65dba8f887c@NET_0x50000ac110b32_UUID:0/0 lens 448/0 e 0 to 0 dl 1340117467 ref 1 fl New:H/0/0 rc 0/0 got request on 1340117424 and start to handle 00000100:00001000:7:1340117462.705473:0:2587:0:(service.c:762:ptlrpc_at_send_early_reply()) @@@ sending early reply (deadline +5s, margin 38 seconds passed, asked client give OST extra 68 seconds 00010000:00020000:23:1340117525.229707:0:0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.17.11.50@o2ib ns: filter-data-OST0007_UUID lock: ffff8101a5e2fa00/0x2babc16ba243f737 lrc: 3/0,0 mode: PW/PW res: 430494/0 rrc: 8 type: EXT [73728->159743] (req 73728->77823) flags: 0x20 remote: 0x225685b920b04720 expref: 8 pid: 2438 timeout 5961707167 the extent lock protecting the write has timed out and evicted the client |
| Comment by Zhenyu Xu [ 20/Jun/12 ] |
| Comment by Mitsuhiro Nishizawa [ 20/Jun/12 ] |
|
The customer's lustre version is 1.8.4. Is the issue in |
| Comment by Zhenyu Xu [ 20/Jun/12 ] |
|
not directly, we need a patch port for 1.8.x |
| Comment by Zhenyu Xu [ 21/Jun/12 ] |
|
port patch tracking at http://review.whamcloud.com/3157 |
| Comment by Mitsuhiro Nishizawa [ 21/Jun/12 ] |
|
Thanks, Zhenyu. We will discuss if we can apply the patch. Today, the customer encoutered another instance of the issue. We uploaded the debug log to the FTP server:/uploads/ It looks like this is again due to high I/O load on the OSS. |
| Comment by Zhenyu Xu [ 06/Jul/12 ] |
|
Sorry for the late response, yes, I think the phenomenon is the same. |
| Comment by Mitsuhiro Nishizawa [ 09/Jul/12 ] |
|
Thank for checking the log. The customer is currently considering to upgrade lustre version with the patch applied. |
| Comment by Peter Jones [ 27/Aug/12 ] |
|
Landed to b1_8 and already landed to master under |
| Comment by Melinda McDade [ 08/Oct/13 ] |
|
I am running into this problem while running lustre 1.8.8. Is there a simple patch or do I have to upgrade to 1.8.9? Have already increased the osc..max_rpcs_in_flight=32, turned off checksums, and set osc..max_dirty_mb=512. This helped with some evictions. However, I tend to think the timeout=100 still needs to be increased even with adaptive timeouts. Notice the 100sec timeout below when we have at_min=0 and at_max=800. Interesting expiration time since timeout=100. Could also be related to an actual network socket timeout. Please help. Melinda LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 34.239.17.202@tcp ns: filter-dlustre-OST0002_UUID lock: ffff8102636cde00/0x392a313d071a080e lrc: 3/0,0 mode: PW/PW res: 961326/0 rrc: 2 type: EXT [0->18446744073709551615] (req 71299072->71303167) flags: 0x20 remote: 0xa1584afedda7b6c expref: 197 pid: 5369 timeout 10277973712 |