[LU-2499] Help debug waiting_locks_callback causing client eviction Created: 14/Dec/12 Updated: 29/Oct/13 Resolved: 29/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | ptr | ||
| Severity: | 2 |
| Rank (Obsolete): | 5857 |
| Description |
|
We are seeing the following error:

Dec 13 08:35:39 nbp2-oss1 kernel: LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 351s: evicting client at 10.151.34.219@o2ib ns: filter-nbp2-OST0018_UUID lock: ffff8804c55d8480/0x1ca7e7e6c780ff4d lrc: 3/0,0 mode: PW/PW res: 182889173/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xd281632991b12020 expref: 6 pid: 19246 timeout 7391670727

With the client evicted we get dirty page discards like this:

Dec 13 08:35:40 r305i3n1 kernel: [1164772.491928] Lustre: 7178:0:(llite_lib.c:2283:ll_dirty_page_discard_warn()) nbp2: dirty page discard: 10.151.26.5@o2ib:/nbp2/fid: [0x5677ca33040:0x2d5:0x0]//mlellis/RunStilt/runs/20120523-Cherskii-d01-WRF-TEST-20121213.15.46.32.UTC/run_d01/Exe/Copy8/cdump may get corrupted (rc -4)

We have seen this happen at the beginning of a job. We are now running lflush before the start of every job. Could lflush cause this? We are still trying to reproduce the problem and gather additional logs. |
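A minimal sketch, in case it helps, of how DLM trace logs might be captured around such an eviction on the OSS, assuming the standard lctl debug facilities of Lustre 2.1.x (the buffer size and output path are illustrative, not prescribed by this ticket):

    # Add DLM lock tracing to the existing debug mask on the OSS
    lctl set_param debug=+dlmtrace
    # Enlarge the in-kernel debug buffer so the window around the eviction is retained
    lctl set_param debug_mb=256
    # ... reproduce the eviction, then dump the debug buffer to a file for attachment
    lctl dk /tmp/nbp2-oss1.dlmtrace.log
|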
| Comments |
| Comment by Peter Jones [ 14/Dec/12 ] |
|
Bobijam, lflush is a tool produced by LLNL. You may find some information on it by searching online. Could you please look into what conditions would trigger this error and what the possible causes are? Thanks Peter |
| Comment by Peter Jones [ 14/Dec/12 ] |
|
Chris, I think that you were involved in the creation of lflush. Is LLNL still using this tool on your 2.1.x production systems? If so, have you ever seen any errors of this nature as a result? Peter |
| Comment by Christopher Morrone [ 17/Dec/12 ] |
|
See the scripts directory of this project: https://github.com/chaos/lustre-tools-llnl It is fairly simple; these days it could be even shorter if we just used an "lctl set_param". We do still use it in the slurm epilog script at the end of every job. We're not seeing that problem, at least not specifically associated with lflush, to the best of my knowledge. But "lock callback timer expired" is a very common error that we have seen, for many different reasons. Many nodes dropping their locks at the same time could certainly provide the load that uncovers a bug, a network problem, or something else. Full logs will be needed to figure out what happened in this case. |
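For reference, a minimal sketch of the lock-flush step such a script performs, assuming the "lctl set_param" interface mentioned above (the actual lflush script lives in the repository linked above and may differ):

    # Cancel all unused DLM locks cached on this client by clearing every
    # namespace's lock LRU; roughly the operation lflush wraps
    lctl set_param ldlm.namespaces.*.lru_size=clear
|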
| Comment by Zhenyu Xu [ 21/Mar/13 ] |
|
Do you have a detailed log from around the time this issue happens? |
| Comment by Mahmoud Hanafi [ 29/Oct/13 ] |
|
This can be closed. |
| Comment by Peter Jones [ 29/Oct/13 ] |
|
Thanks Mahmoud |