[LU-2499] Help debug waiting_locks_callback causing client eviction Created: 14/Dec/12  Updated: 29/Oct/13  Resolved: 29/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: ptr

Severity: 2
Rank (Obsolete): 5857

 Description   

We are seeing the following error:

Dec 13 08:35:39 nbp2-oss1 kernel: LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 351s: evicting client at 10.151.34.219@o2ib ns: filter-nbp2-OST0018_UUID lock: ffff8804c55d8480/0x1ca7e7e6c780ff4d lrc: 3/0,0 mode: PW/PW res: 182889173/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xd281632991b12020 expref: 6 pid: 19246 timeout 7391670727

When the client is evicted we get dirty page discard warnings like this:

Dec 13 08:35:40 r305i3n1 kernel: [1164772.491928] Lustre: 7178:0:(llite_lib.c:2283:ll_dirty_page_discard_warn()) nbp2: dirty page discard: 10.151.26.5@o2ib:/nbp2/fid: [0x5677ca33040:0x2d5:0x0]//mlellis/RunStilt/runs/20120523-Cherskii-d01-WRF-TEST-20121213.15.46.32.UTC/run_d01/Exe/Copy8/cdump may get corrupted (rc -4)

We have seen this happen at the beginning of a job. We are now running lflush before the start of every job. Could lflush cause this?

We are still trying to reproduce it and gather additional logs.



 Comments   
Comment by Peter Jones [ 14/Dec/12 ]

Bobijam

lflush is a tool produced by LLNL. You may find some information on it by Googling. Could you please see what conditions would trigger this error and what the possible reasons are?

Thanks

Peter

Comment by Peter Jones [ 14/Dec/12 ]

Chris

I think that you were involved in the creation of lflush. Is LLNL still using this tool on your 2.1.x production system? If so, have you ever seen any errors of this nature as a result?

Peter

Comment by Christopher Morrone [ 17/Dec/12 ]

See the scripts directory of this project:

https://github.com/chaos/lustre-tools-llnl

It is fairly simple. These days it could be done even shorter if we just used an "lctl set_param".
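
For reference, a minimal sketch of what such an lctl-based flush could look like on a client (an assumption on my part, not the actual lflush implementation; the exact parameter path can vary between Lustre versions):

```shell
# Drop the unused LDLM locks cached on this client by clearing the
# per-namespace lock LRU lists -- roughly what lflush accomplishes.
# Sketch only; requires a Lustre client mount and root privileges.
lctl set_param ldlm.namespaces.*.lru_size=clear
```

Clearing the LRU this way releases cached locks back to the servers, which is why running it on many nodes at once can generate a burst of lock cancellation traffic.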

We do still use it in the slurm epilog script at the end of every job. We're not seeing that problem. At least not specifically associated with lflush, to the best of my knowledge.

But "lock callback timer expired" is a very, very common error that we have seen, for many different reasons. Many nodes dropping their locks at the same time could certainly provide the load that uncovers a bug, network problem, or something else. Full logs will be needed to figure out what happened in this case.

Comment by Zhenyu Xu [ 21/Mar/13 ]

Do you have a detailed log around the time when this issue happens?

Comment by Mahmoud Hanafi [ 29/Oct/13 ]

This can be closed.

Comment by Peter Jones [ 29/Oct/13 ]

Thanks Mahmoud

Generated at Sat Feb 10 01:25:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.