Details
-
Bug
-
Resolution: Cannot Reproduce
-
Critical
-
None
-
Lustre 2.1.3
-
2
-
5857
Description
We are seeing the following error.
Dec 13 08:35:39 nbp2-oss1 kernel: LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 351s: evicting client at 10.151.34.219@o2ib ns: filter-nbp2-OST0018_UUID lock: ffff8804c55d8480/0x1ca7e7e6c780ff4d lrc: 3/0,0 mode: PW/PW res: 182889173/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xd281632991b12020 expref: 6 pid: 19246 timeout 7391670727
Once the client is evicted, we get dirty page discard warnings on the client like this.
Dec 13 08:35:40 r305i3n1 kernel: [1164772.491928] Lustre: 7178:0:(llite_lib.c:2283:ll_dirty_page_discard_warn()) nbp2: dirty page discard: 10.151.26.5@o2ib:/nbp2/fid: [0x5677ca33040:0x2d5:0x0]//mlellis/RunStilt/runs/20120523-Cherskii-d01-WRF-TEST-20121213.15.46.32.UTC/run_d01/Exe/Copy8/cdump may get corrupted (rc -4)
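To help line up evictions on the OSS with dirty page discards on the clients while gathering logs, a small parser along these lines might be useful. This is only a rough sketch: the regular expressions are written against the two messages quoted above and assume other Lustre versions format them the same way, and the function name parse_line and the returned field names are made up for illustration, not part of any Lustre tool.

```python
import re

# Syslog prefix: "Dec 13 08:35:39 nbp2-oss1 ..." (no year in the stamp).
HEAD_RE = re.compile(r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)")

# OSS-side eviction message, as quoted in this ticket.
EVICT_RE = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)\s+ns: (?P<ns>\S+)"
)

# Client-side dirty page discard warning, as quoted in this ticket.
DISCARD_RE = re.compile(
    r"dirty page discard: (?P<target>\S+?)/fid: \[(?P<fid>[^\]]+)\]"
)


def parse_line(line):
    """Return a dict for an eviction or discard line, or None otherwise."""
    head = HEAD_RE.match(line)
    base = {
        "stamp": head.group("stamp") if head else None,
        "host": head.group("host") if head else None,
    }
    m = EVICT_RE.search(line)
    if m:
        base.update(kind="evict", seconds=int(m.group("secs")),
                    client_nid=m.group("nid"), namespace=m.group("ns"))
        return base
    m = DISCARD_RE.search(line)
    if m:
        base.update(kind="discard", target=m.group("target"),
                    fid=m.group("fid"))
        return base
    return None
```

Running this over the OSS and client syslogs and sorting the results by stamp should show whether each discard follows an eviction within a second or so, as in the pair above.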
We have seen this happen at the beginning of a job. We are now running lflush before the start of every job. Could lflush cause this?
We are still trying to reproduce it and gather additional logs.