[LU-874] Client eviction on lock callback timeout Created: 22/Nov/11 Updated: 28/Feb/18 Resolved: 28/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.4.0, Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.4.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 4 |
| Labels: | llnl, ptr |
| Environment: | |
| Attachments: | |
| Issue Links: | |
| Sub-Tasks: | |
| Severity: | 3 |
| Rank (Obsolete): | 4740 |
| Description |
|
Our testing has revealed that Lustre 2.1 is far more likely than 1.8 to return short reads and writes (the return code says fewer bytes were read/written than requested). So far, the frequent reproducer is IOR shared single file, transfer size 128MB, block size 256MB, 32 client nodes, and 512 tasks evenly spread over the clients. The file is only striped over 2 OSTs. When the read() or write() return value is less than the requested amount, the size is, in every instance that I have seen thus far, a multiple of 1MB. I suspect that other loads will show the same problem. I think that our more common large-transfer-request workloads come from our file-per-process apps though, so we'll run some tests to see if it is easy to reproduce there as well. |
| Comments |
| Comment by Peter Jones [ 22/Nov/11 ] |
|
Niu, could you please look at this one? Thanks, Peter |
| Comment by Jinshan Xiong (Inactive) [ 22/Nov/11 ] |
|
Hi Chris, did you see any error messages when the short read/write happened? |
| Comment by Christopher Morrone [ 22/Nov/11 ] |
|
It is looking like all of the ones I have examined thus far are associated with evictions due to expired lock callback timers (expiration times vary from around 100s to over 4400s). "slow" messages are pretty much constant on the servers when they are under load. Here are some samples:
2011-11-20 21:09:17 Lustre: 9536:0:(filter.c:3455:filter_setattr()) lc2-OST002f: slow i_alloc_sem 223s
And it is not unusual for them to be stuck for 600+ seconds. It looks like a "partial" read or write will usually be accompanied by an EIO at close() time. Occasionally a read or write will get the EIO directly. |
| Comment by Christopher Morrone [ 22/Nov/11 ] |
|
Yes, it is looking like the short reads and writes are due to the evictions, so they are reasonable in that case. We could change this ticket's summary if you like. |
| Comment by Christopher Morrone [ 22/Nov/11 ] |
|
Most of us are going to be gone for the holiday, so we'll investigate further on Monday. We'll need to figure out why we are seeing evictions more often with 2.1 than 1.8 with this workload. |
| Comment by Niu Yawei (Inactive) [ 23/Nov/11 ] |
|
It would be great to collect the debug logs on the clients and the MDS as well. We do reads/writes stripe by stripe, so a short read/write should be a multiple of 1MB on a system with a 1MB stripe size. |
| Comment by Jinshan Xiong (Inactive) [ 23/Nov/11 ] |
|
There is a similar one at
Chris, can you reproduce this problem consistently? If so, please run it again with a larger debug memory, e.g. DEBUG_SIZE=200. It's impractical to transfer a 200MB log to Maloo, so you may upload it to our FTP site. |
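Outside of the test framework, the same effect can be had directly with lctl (a hedged equivalent; 200 mirrors the DEBUG_SIZE value suggested above):
# Enlarge the in-memory Lustre debug buffer to 200 MB before reproducing
lctl set_param debug_mb=200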
| Comment by Ned Bass [ 01/Dec/11 ] |
|
I'm uploading two debug log files with +rpctrace dumped due to client eviction:
One suspicious thing we noticed in the server log: the client is evicted when a timer expires after 257s:
00010000:00020000:9.1:1322741808.246787:0:0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 257s: evicting client at 192.168.121.109@o2ib2 ns: filter-lc3-OST0003_UUID lock: ffff8802b94e5d80/0x36f903476b7f53b8 lrc: 3/0,0 mode: PW/PW res: 6806706/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0xa18374fc160b5b35 expref: 35 pid: 6063 timeout 4353798639
46 seconds later, after the client has reconnected, it is evicted again for a timer that expired after 303s. Shouldn't all callback timers for the client have been cancelled when it was evicted the first time?
00010000:00020000:9.1:1322741854.246954:0:0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 303s: evicting client at 192.168.121.109@o2ib2 ns: filter-lc3-OST0003_UUID lock: ffff88028fb26480/0x36f903476b7f53fe lrc: 3/0,0 mode: PW/PW res: 6806684/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0xa18374fc160b5b58 expref: 34 pid: 6063 timeout 4353844679 |
| Comment by Christopher Morrone [ 01/Dec/11 ] |
|
Jinshan meant |
| Comment by Jinshan Xiong (Inactive) [ 01/Dec/11 ] |
|
Chris, yes, it is. Thanks. Ned: indeed, those locks should have been removed from the waiting list during eviction. You can compress the log so that it's easier to upload. |
| Comment by Jinshan Xiong (Inactive) [ 01/Dec/11 ] |
|
The log didn't provide us enough information, unfortunately. The lock was about to be cancelled at 1322741541 (1322741808 - 267 = 1322741541), but nothing was printed during that time in the client log. Can you please rerun the test with the following debug set at the client side:
[root@mercury tests]# lctl get_param debug
debug=
and on the OST side:
[root@mercury tests]# lctl set_param debug="cache ha info nettrace dlmtrace rpctrace"
Please also make sure enough memory is reserved, with `lctl set_param debug_mb=200' for example. |
| Comment by Niu Yawei (Inactive) [ 02/Dec/11 ] |
|
BTW: it looks like the OSS log wasn't uploaded? |
| Comment by Ned Bass [ 02/Dec/11 ] |
|
The original ftp transfer reported success, but I've compressed and attached it here. |
| Comment by Ned Bass [ 02/Dec/11 ] |
|
We had another occurrence today with the requested debugging options enabled. Uploading client and OSS logs to ftp.whamcloud.com/uploads:
|
| Comment by Christopher Morrone [ 02/Dec/11 ] |
|
The server evicts the client at 192.168.121.146@o2ib2. But I can't see in the server's log any place that it ever sent the lock revocation. Am I missing something? If the logs can be believed, it appears that the server is to blame for the eviction. |
| Comment by Jinshan Xiong (Inactive) [ 05/Dec/11 ] |
|
There were two eviction on the server side: jxiong@mac: tmp$ cat LU-874.lustre-log.oss.1322853642.5882.txt |grep evicting 00010000:00020000:7.1F:1322766963.843609:0:33:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 257s: evicting client at 192.168.121.135@o2ib2 ns: filter-lc3-OST0004_UUID lock: ffff88050a7e1480/0xfa091755b2c622f5 lrc: 3/0,0 mode: PW/PW res: 7524140/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0xa32e1dab97d934b1 expref: 44 pid: 6955 timeout 4378959523 00010000:00020000:1.1F:1322853614.843500:0:0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 235s: evicting client at 192.168.121.146@o2ib2 ns: filter-lc3-OST0002_UUID lock: ffff8805d8352b40/0xfa091755c942d054 lrc: 3/0,0 mode: PW/PW res: 17359051/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0x14c71a0a40d66e1c expref: 14 pid: 5985 timeout 4465610820 Unfortunately I didn't find any information about this lock prior than that time, even there was no IO to this object prior that time either. I was expecting something printed for adding this lock into waiting list. It looks like there were some log missed in the middle, I don't know why? For the 2nd occurrence of eviction at the OST, I can find some information with the lock cookie: 00000100:00100000:12.0:1322853642.174321:0:73988:0:(client.c:2657:ptlrpc_abort_inflight()) @@@ inflight req@ffff880412470800 x1386945716847253/t0(0) o-1->lc3-OST0002_UUID@10.1.1.4@o2ib9:6/4 lens 456/416 e 0 to 0 dl 1322853850 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1 00000100:00000040:10.0:1322853642.174361:0:5115:0:(lustre_net.h:1649:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Interpret" req@ffff880412470800 x1386945716847253/t0(0) o-1->lc3-OST0002_UUID@10.1.1.4@o2ib9:6/4 lens 456/416 e 0 to 0 dl 1322853850 ref 2 fl Rpc:E/ffffffff/ffffffff rc -4/-1 req@ffff880412470800 x1386945716847253/t0(0) o-1->lc3-OST0002_UUID@10.1.1.4@o2ib9:6/4 lens 456/416 e 0 to 0 dl 1322853850 ref 2 fl Interpret:E/ffffffff/ffffffff rc -4/-1 00000100:00000040:10.0:1322853642.174376:0:5115:0:(client.c:2168:__ptlrpc_req_finished()) @@@ refcount now 1 req@ffff880412470800 x1386945716847253/t0(0) o-1->lc3-OST0002_UUID@10.1.1.4@o2ib9:6/4 lens 456/416 e 0 to 0 dl 1322853850 ref 2 fl Interpret:E/ffffffff/ffffffff rc -4/-1 00000008:00000040:4.0:1322853642.174480:0:6243:0:(osc_io.c:133:osc_io_submit()) 72 1 00000100:00000040:10.0:1322853642.174615:0:5115:0:(lustre_net.h:1649:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff880412470800 x1386945716847253/t0(0) o-1->lc3-OST0002_UUID@10.1.1.4@o2ib9:6/4 lens 456/416 e 0 to 0 dl 1322853850 ref 1 fl Interpret:E/ffffffff/ffffffff rc -4/-1 00000100:00000040:10.0:1322853642.174623:0:5115:0:(client.c:2168:__ptlrpc_req_finished()) @@@ refcount now 0 req@ffff880412470800 x1386945716847253/t0(0) o-1->lc3-OST0002_UUID@10.1.1.4@o2ib9:6/4 lens 456/416 e 0 to 0 dl 1322853850 ref 1 fl Complete:E/ffffffff/ffffffff rc -4/-1 00000008:00000040:4.0:1322853642.174627:0:6243:0:(osc_io.c:227:osc_io_submit()) 72/0 0 00000008:00000040:4.0:1322853642.174873:0:6243:0:(osc_io.c:133:osc_io_submit()) 184 1 00000008:00000040:4.0:1322853642.174905:0:6243:0:(osc_io.c:227:osc_io_submit()) 184/0 0 00010000:00010000:4.0:1322853642.175282:0:6243:0:(ldlm_request.c:1030:ldlm_cli_cancel_local()) ### client-side cancel ns: lc3-OST0002-osc-ffff880833be8000 lock: ffff880405f10b40/0x14c71a0a40d66e1c lrc: 4/0,0 mode: PW/PW res: 17359051/0 rrc: 1 type: EXT 
[0->18446744073709551615] (req 0->1048575) flags: 0x10102010 remote: 0xfa091755c942d054 expref: -99 pid: 73862 timeout 0 00000020:00010000:4.0:1322853642.175288:0:6243:0:(cl_lock.c:143:cl_lock_trace0()) cancel lock: ffff880432c24cb8@(2 ffff88083165c100 2 5 0 0 0 1)(ffff880821d0a428/1/1) at cl_lock_cancel():1830 00000020:00010000:4.0:1322853642.175291:0:6243:0:(cl_lock.c:143:cl_lock_trace0()) delete lock: ffff880432c24cb8@(2 ffff88083165c100 2 5 0 0 0 1)(ffff880821d0a428/1/1) at cl_lock_delete():1777 00000020:00010000:4.0:1322853642.175293:0:6243:0:(cl_lock.c:143:cl_lock_trace0()) state signal lock: ffff880432c24d98@(2 ffff88083165c100 1 5 0 0 0 0)(ffff880821d0a428/0/2) at cl_lock_signal():1016 00000008:00000020:4.0:1322853642.175297:0:6243:0:(osc_object.c:156:osc_attr_set()) set kms from 134217728to 0 00000020:00010000:4.0:1322853642.175299:0:6243:0:(cl_lock.c:143:cl_lock_trace0()) cancel lock: ffff880432c24d98@(1 ffff88083165c100 1 0 0 0 0 0)(ffff880821d0a428/0/1) at cl_lock_cancel():1830 00000020:00010000:4.0:1322853642.175301:0:6243:0:(cl_lock.c:143:cl_lock_trace0()) delete lock: ffff880432c24d98@(1 ffff88083165c100 1 0 0 0 0 1)(ffff880821d0a428/0/1) at cl_lock_delete():1777 00000020:00010000:4.0:1322853642.175303:0:6243:0:(cl_lock.c:143:cl_lock_trace0()) free lock: ffff880432c24d98@(0 (null) 0 6 0 0 0 1)(ffff880821d0a428/0/0) at cl_lock_free():268 00010000:00010000:4.0:1322853642.175307:0:6243:0:(ldlm_lockd.c:502:ldlm_del_waiting_lock()) Client lock ffff880405f10b40 : no-op 00010000:00010000:4.0:1322853642.175308:0:6243:0:(ldlm_lockd.c:502:ldlm_del_waiting_lock()) Client lock ffff880405f10b40 : no-op 00000020:00000040:4.0:1322853642.175310:0:6243:0:(lustre_handles.c:135:class_handle_unhash_nolock()) removing object ffff880405f10b40 with handle 0x14c71a0a40d66e1c from hash 00010000:00010000:4.0:1322853642.175313:0:6243:0:(ldlm_request.c:1126:ldlm_cli_cancel_req()) skipping cancel on invalid import ffff880434578000 00000020:00010000:4.0:1322853642.175314:0:6243:0:(cl_lock.c:143:cl_lock_trace0()) delete lock: ffff880432c24cb8@(1 ffff88083165c100 1 6 0 0 0 1)(ffff88042070cef0/1/1) at cl_lock_delete():1777 00000020:00010000:4.0:1322853642.175316:0:6243:0:(cl_lock.c:143:cl_lock_trace0()) free lock: ffff880432c24cb8@(0 (null) 0 6 0 0 0 1)(ffff88042070cef0/1/0) at cl_lock_free():268 00010000:00010000:4.0:1322853642.175319:0:6243:0:(ldlm_lockd.c:1509:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lc3-OST0002-osc-ffff880833be8000 lock: ffff880405f10b40/0x14c71a0a40d66e1c lrc: 1/0,0 mode: --/PW res: 17359051/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x30302090 remote: 0xfa091755c942d054 expref: -99 pid: 73862 timeout 0 00010000:00010000:4.0:1322853642.175322:0:6243:0:(ldlm_lock.c:198:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. 
ns: lc3-OST0002-osc-ffff880833be8000 lock: ffff880405f10b40/0x14c71a0a40d66e1c lrc: 0/0,0 mode: --/PW res: 17359051/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x30302090 remote: 0xfa091755c942d054 expref: -99 pid: 73862 timeout 0 00010000:00000040:4.0:1322853642.175325:0:6243:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff8801ab634e80 count: 0 00010000:00000040:4.0:1322853642.175325:0:6243:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff8801ab634e80 count: 0 From the above log, obviously the process 6243 is in the middle of canceling that lock(with cookie 0x14c71a0a40d66e1c), and the RPC x1386945716847253 was flushing the pages into OST0002, but for unknown reason that RPC never appeared at the OST. I suspect the OST is in extremely busy at that time: jxiong@mac: tmp$ cat LU-874.lustre-log.oss.1322853642.5882.txt |grep mutex 00002000:00000400:4.0:1322766355.641599:0:6033:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.641659:0:6286:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.641711:0:6210:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.641735:0:6640:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.641761:0:6449:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.641799:0:6600:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.641833:0:6378:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.641855:0:6071:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.641908:0:6656:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.641910:0:6119:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.641964:0:6338:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.641984:0:6620:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.642013:0:6634:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.642038:0:6209:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.642078:0:6120:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.642098:0:6062:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.642145:0:6665:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.642171:0:6079:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.642194:0:6255:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:5.0:1322766355.642222:0:6589:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 183s 00002000:00000400:4.0:1322766355.642222:0:6479:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 183s 
00002000:00000400:4.0:1322766355.642271:0:6105:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 183s 00002000:00000400:4.0:1322766355.642316:0:6066:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00000400:4.0:1322766355.642361:0:6279:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 185s 00002000:00200000:10.0:1322853642.256227:0:6062:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:4.0:1322853642.256318:0:6219:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:9.0:1322853642.256343:0:6038:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:0.0:1322853642.256344:0:6662:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:3.0:1322853642.256346:0:6408:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:5.0:1322853642.256352:0:6646:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:7.0:1322853642.256377:0:6100:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 78s 00002000:00200000:6.0:1322853642.256380:0:6469:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:0.0:1322853642.256390:0:6087:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:5.0:1322853642.256392:0:6319:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:4.0:1322853642.256393:0:6598:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:7.0:1322853642.256411:0:6254:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:4.0:1322853642.256425:0:6635:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s 00002000:00200000:6.0:1322853642.256440:0:6444:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 78s 00002000:00200000:9.0:1322853642.256457:0:6442:0:(filter_io_26.c:705:filter_commitrw_write()) lc3-OST0002: slow i_mutex 90s So I guess the root cause would be that the OSS was extremely busy serving IO requests so as to not get any chance to serve that write request. However, before we dive into it, we should make sure that it's the real cause. So let's do the followings: Thanks. |
| Comment by Jinshan Xiong (Inactive) [ 05/Dec/11 ] |
|
BTW, did you see any abnormal dmesg on the client and server nodes? |
| Comment by Niu Yawei (Inactive) [ 05/Dec/11 ] |
Right, the process 6243 is in the middle of handling the blocking AST, but RPC x1386945716847253 does not look like the flush RPC (its opcode is -1), and from the log:
00000008:00000040:4.0:1322853642.174480:0:6243:0:(osc_io.c:133:osc_io_submit()) 72 1
00000008:00000040:4.0:1322853642.174627:0:6243:0:(osc_io.c:227:osc_io_submit()) 72/0 0
00000008:00000040:4.0:1322853642.174873:0:6243:0:(osc_io.c:133:osc_io_submit()) 184 1
00000008:00000040:4.0:1322853642.174905:0:6243:0:(osc_io.c:227:osc_io_submit()) 184/0 0
we can see that the client didn't trigger a flush for this lock cancel (the pages might have been flushed already), and the following log
00010000:00010000:4.0:1322853642.175313:0:6243:0:(ldlm_request.c:1126:ldlm_cli_cancel_req()) skipping cancel on invalid import ffff880434578000
shows that the client import is already invalid before sending the cancel RPC, so the cancel RPC isn't sent either. I guess that the blocking AST handler on the client took too much time, which caused the lock to expire on the server and the client to be evicted. Unfortunately, the client log is too short (only 259 seconds), so we can't figure out where the blocking AST handler was stuck. As Jinshan mentioned, a bigger debug buffer size on the client should be helpful. |
| Comment by Ned Bass [ 05/Dec/11 ] |
OSS dmesg: Dec 2 11:11:01 zwicky4 kernel: Lustre: 6078:0:(filter_io_26.c:775:filter_commitrw_write()) lc3-OST0004: slow direct_io 205s Dec 2 11:19:42 zwicky4 kernel: Lustre: 6133:0:(filter_io_26.c:775:filter_commitrw_write()) lc3-OST0002: slow direct_io 153s Dec 2 11:19:42 zwicky4 kernel: Lustre: 6133:0:(filter_io_26.c:775:filter_commitrw_write()) Skipped 39 previous similar messages Dec 2 11:20:14 zwicky4 kernel: LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 235s: evicting client at 192.168.121.146@o2ib2 ns: filter-lc3-OST0002_UUID lock: ffff8805d8352b40/0xfa091755c942d054 lrc: 3/0,0 mode: PW/PW res: 17359051/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0x14c71a0a40d66e1c expref: 14 pid: 5985 timeout 4465610820 Dec 2 11:20:16 zwicky4 kernel: Lustre: 6095:0:(filter_io_26.c:775:filter_commitrw_write()) lc3-OST0002: slow direct_io 187s Dec 2 11:20:22 zwicky4 kernel: LustreError: 7219:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-107) req@ffff880426f5cc00 x1386945716851486/t0(0) o-1-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1322853645 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1 Dec 2 11:20:22 zwicky4 kernel: Lustre: 7219:0:(ldlm_lib.c:877:target_handle_connect()) lc3-OST0002: connection from 101f29b7-7cd1-ddfb-72de-910040f70618@192.168.121.146@o2ib2 t0 exp (null) cur 1322853622 last 0 Dec 2 11:20:22 zwicky4 kernel: Lustre: 7219:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 32 previous similar messages Dec 2 11:20:25 zwicky4 kernel: Lustre: 6131:0:(filter_io_26.c:775:filter_commitrw_write()) lc3-OST0002: slow direct_io 196s Dec 2 11:20:41 zwicky4 kernel: Lustre: 6115:0:(filter_io_26.c:775:filter_commitrw_write()) lc3-OST0002: slow direct_io 212s Dec 2 11:20:41 zwicky4 kernel: Lustre: 6115:0:(filter_io_26.c:775:filter_commitrw_write()) Skipped 1 previous similar message Dec 2 11:20:42 zwicky4 kernel: Lustre: 6221:0:(lustre_fsfilt.h:332:fsfilt_setattr()) lc3-OST0002: slow setattr 176s Dec 2 11:20:42 zwicky4 kernel: Lustre: 6471:0:(lustre_fsfilt.h:332:fsfilt_setattr()) lc3-OST0002: slow setattr 176s Dec 2 11:20:42 zwicky4 kernel: Lustre: 7219:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lc3-OST0002->NET_0x50002c0a87992_UUID netid 50009: select flavor null Dec 2 11:20:42 zwicky4 kernel: Lustre: 7219:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 41 previous similar messages Dec 2 11:20:42 zwicky4 kernel: LustreError: 5882:0:(ldlm_lockd.c:238:expired_lock_main()) dump the log upon eviction Dec 2 11:20:42 zwicky4 kernel: LustreError: dumping log to /tmp/lustre-log.1322853642.5882 Dec 2 11:20:42 zwicky4 kernel: __ratelimit: 20 callbacks suppressed Dec 2 11:20:42 zwicky4 kernel: cannot allocate a tage (1) Dec 2 11:20:42 zwicky4 kernel: cannot allocate a tage (1) Dec 2 11:20:43 zwicky4 kernel: cannot allocate a tage (72) Dec 2 11:20:43 zwicky4 kernel: cannot allocate a tage (72) Dec 2 11:20:43 zwicky4 kernel: cannot allocate a tage (80) Dec 2 11:20:43 zwicky4 kernel: cannot allocate a tage (88) Dec 2 11:20:43 zwicky4 kernel: cannot allocate a tage (243) Dec 2 11:20:43 zwicky4 kernel: cannot allocate a tage (309) Dec 2 11:20:43 zwicky4 kernel: cannot allocate a tage (105) Dec 2 11:20:43 zwicky4 kernel: cannot allocate a tage (105) Client dmesg: Dec 2 11:20:22 hype345 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.4@o2ib9. The obd_ping operation failed with -107
Dec 2 11:20:22 hype345 kernel: Lustre: lc3-OST0002-osc-ffff880833be8000: Connection to service lc3-OST0002 via nid 10.1.1.4@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
Dec 2 11:20:42 hype345 kernel: LustreError: 167-0: This client was evicted by lc3-OST0002; in progress operations using this service will fail.
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff880167a7f3f8[2 2 0 1 1 00000000] R(1):[0, 255]@[0x100020000:0x108e0d1:0x0] {
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff8803d4697a60: [0 ffff880288b212d0 R(1):[0, 255]@[0x2000006ea:0xa584:0x0]]
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff8803d4698420: ffff8803fc017d80 40120002 0x14c71a0a40d66fce 2 (null) size: 0 mtime: 18446744071562154368 atime: 18446744071562154368 ctime: 18446744071562154368 blocks: 0
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff880167a7f3f8
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff880572cc9e78[2 2 0 1 1 00000000] R(1):[0, 255]@[0x100020000:0x108e0d2:0x0] {
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff8804d0fa54a0: [0 ffff8807f9195f68 R(1):[256, 511]@[0x2000006ea:0xa585:0x0]]
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff8804ddc77b50: ffff880835a8a480 40120002 0x14c71a0a40d67053 2 (null) size: 0 mtime: 18446744071562154368 atime: 18446744071562154368 ctime: 18446744071562154368 blocks: 0
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff880572cc9e78
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
Dec 2 11:20:42 hype345 kernel: LustreError: 5117:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805cea59800 x1386945716851491/t0(0) o-1->lc3-OST0002_UUID@10.1.1.4@o2ib9:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff8804614a1278[2 2 0 1 1 00000000] R(1):[0, 255]@[0x100020000:0x108e0d3:0x0] {
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff8804d0fa5120: [0 ffff8807f9195748 R(1):[0, 255]@[0x2000006ea:0xa586:0x0]]
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff880575844870: ffff88050ef2dd80 40120002 0x14c71a0a40d66fc7 2 (null) size: 0 mtime: 18446744071562154368 atime: 18446744071562154368 ctime: 18446744071562154368 blocks: 0
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff8804614a1278
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff8803c9255b38[2 2 0 1 1 00000000] R(1):[0, 255]@[0x100020000:0x108e0cc:0x0] {
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff8803101564a0: [0 ffff880193118cf8 R(1):[256, 511]@[0x2000006ea:0xa57c:0x0]]
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff880310157a98: ffff880308f98000 40120002 0x14c71a0a40d6705a 2 (null) size: 0 mtime: 18446744071562154368 atime: 18446744071562154368 ctime: 18446744071562154368 blocks: 0
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff8803c9255b38
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff8805527a4978[2 2 0 1 1 00000000] R(1):[0, 255]@[0x100020000:0x108e0ce:0x0] {
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff8808334f77a0: [0 ffff88055e67ca20 R(1):[256, 511]@[0x2000006ea:0xa580:0x0]]
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff880830ca3cc0: ffff880835a8ad80 40120002 0x14c71a0a40d6703e 2 (null) size: 0 mtime: 18446744071562154368 atime: 18446744071562154368 ctime: 18446744071562154368 blocks: 0
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff8805527a4978
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
Dec 2 11:20:42 hype345 kernel: LustreError: 5117:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8801056b2c00 x1386945716851492/t0(0) o-1->lc3-OST0002_UUID@10.1.1.4@o2ib9:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
Dec 2 11:20:42 hype345 kernel: LustreError: 73988:0:(import.c:1228:ptlrpc_invalidate_import_thread()) dump the log upon eviction
Dec 2 11:20:42 hype345 kernel: LustreError: dumping log to /tmp/lustre-log.1322853642.73988
Dec 2 11:20:42 hype345 kernel: Lustre: lc3-OST0002-osc-ffff880833be8000: Connection restored to service lc3-OST0002 using nid 10.1.1.4@o2ib9.
|
| Comment by Jinshan Xiong (Inactive) [ 05/Dec/11 ] |
|
Yes, that can explain why some log lines are missing. We can use the debug log daemon to address that. If performance is an issue, we can write the debug file to a ramdisk. Also, it will help if you can monitor vmstat in addition to RPC stats on the OSS nodes with:
vmstat 1 |while read LINE; do echo "$(date +%s): $LINE"; done |tee -a vmstat-`uname -n`.log
and monitor RPC stats with:
while /bin/true; do cat /proc/fs/lustre/ost/OSS/ost_io/stats |tee -a rpcstats-`uname -n`.log; sleep 1; done |
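A hedged example of the debug-log-daemon approach mentioned above (the file path and size limit are placeholders; point the path at a ramdisk if writing to disk is too slow):
# Stream the kernel debug log continuously to a file (size limit in MB)
lctl debug_daemon start /tmp/lustre-debug.log 1024
# ... reproduce the problem ...
lctl debug_daemon stop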
| Comment by Ned Bass [ 05/Dec/11 ] |
|
Uploaded a tar file
|
| Comment by Jinshan Xiong (Inactive) [ 05/Dec/11 ] |
|
I can make a conclusion that this issue is due to slowness on the OST. Let's take a look at lock with cookie: 0xfa0917560267023f, the corresponding client cookie is: 0xe7cb6d7040dca4a4. Client enqueued this lock at: 00010000:00010000:0.0:1323124831.011420:0:5055:0:(ldlm_request.c:653:ldlm_cli_enqueue_fini()) ### client-side enqueue END ns: lc3-OST0002-osc-ffff880414115c00 lock: ffff880119a2ed80/0xe7cb6d7040dca4a4 lrc: 6/0,2 mode: PW/PW res: 40558883/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x0 remote: 0xfa0917560267023f expref: -99 pid: 43633 timeout 0 and cancel it at: 00010000:00010000:15.0:1323124831.069257:0:101226:0:(ldlm_lockd.c:1487:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lc3-OST0002-osc-ffff880414115c00 lock: ffff880119a2ed80/0xe7cb6d7040dca4a4 lrc: 3/0,0 mode: PW/PW res: 40558883/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10100000 remote: 0xfa0917560267023f expref: -99 pid: 43633 timeout 0 which sent OST_WRITE RPC with xid: 1386945715233692, however, it cost obdfilter 122 seconds to handle this RPC: 00000100:00000400:6.0:1323124962.428840:0:6286:0:(service.c:1731:ptlrpc_server_handle_request()) @@@ Request x1386945715233692 took longer than estimated (122:9s); client may timeout. req@ffff880577585800 x1386945715233692/t90245583(0) o-1->3de73cb5-e8d5-517c-889f-20a68b223109@NET_0x50002c0a87994_UUID:0/0 lens 456/416 e 2 to 0 dl 1323124953 ref 1 fl Complete:H/ffffffff/ffffffff rc 0/-1 Though the lock timeout is only 101 seconds: 00010000:00020000:7.1F:1323124931.843477:0:0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.148@o2ib2 ns: filter-lc3-OST0002_UUID lock: ffff88042cfa6240/0xfa0917560267023f lrc: 3/0,0 mode: PW/PW res: 40558883/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0xe7cb6d7040dca4a4 expref: 55 pid: 7221 timeout 4736927235 The OST log was truncated so we couldn't understand why it took so long time for obdfilter to handle this RPC. 
00000020:00000002:6.0:1323124962.428791:0:6286:0:(obdo.c:73:obdo_from_inode()) valid 41fe, new time 1323124831/1323124831 00002000:00200000:6.0:1323124962.428794:0:6286:0:(filter_io_26.c:775:filter_commitrw_write()) lc3-OST0002: slow direct_io 131s 00002000:00000002:6.0:1323124962.428800:0:6286:0:(filter.c:221:f_dput()) putting 40558883: ffff880638271200, count = 1 00000020:00000040:6.0:1323124962.428804:0:6286:0:(genops.c:758:class_export_put()) PUTting export ffff88062ed8a000 : new refcount 24 00010000:00000040:6.0:1323124962.428807:0:6286:0:(ldlm_lib.c:2252:target_committed_to_req()) last_committed 90245590, transno 90245583, xid 1386945715233692 00000100:00000040:6.0:1323124962.428812:0:6286:0:(connection.c:136:ptlrpc_connection_addref()) conn=ffff8805dbbb8dc0 refcount 8 to 192.168.121.148@o2ib2 00000100:00000040:6.0:1323124962.428816:0:6286:0:(niobuf.c:63:ptl_send_buf()) conn=ffff8805dbbb8dc0 id 12345-192.168.121.148@o2ib2 00000100:00000040:6.0:1323124962.428825:0:6286:0:(connection.c:123:ptlrpc_connection_put()) PUT conn=ffff8805dbbb8dc0 refcount 7 to 192.168.121.148@o2ib2 00000100:00000040:6.0:1323124962.428828:0:6286:0:(lustre_net.h:1649:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff880577585800 x1386945715233692/t90245583(0) o-1->3de73cb5-e8d5-517c-889f-20a68b223109@NET_0x50002c0a87994_UUID:0/0 lens 456/416 e 2 to 0 dl 1323124953 ref 1 fl Interpret:H/ffffffff/ffffffff rc 0/-1 00000100:00000040:6.0:1323124962.428834:0:6286:0:(service.c:1719:ptlrpc_server_handle_request()) RPC PUTting export ffff88062ed8a000 : new rpc_count 6 00000020:00000040:6.0:1323124962.428836:0:6286:0:(genops.c:758:class_export_put()) PUTting export ffff88062ed8a000 : new refcount 22 00000100:00000400:6.0:1323124962.428840:0:6286:0:(service.c:1731:ptlrpc_server_handle_request()) @@@ Request x1386945715233692 took longer than estimated (122:9s); client may timeout. req@ffff880577585800 x1386945715233692/t90245583(0) o-1->3de73cb5-e8d5-517c-889f-20a68b223109@NET_0x50002c0a87994_UUID:0/0 lens 456/416 e 2 to 0 dl 1323124953 ref 1 fl Complete:H/ffffffff/ffffffff rc 0/-1 00000100:00100000:6.0:1323124962.428872:0:6286:0:(service.c:1755:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io_217:3de73cb5-e8d5-517c-889f-20a68b223109+20:5053:x1386945715233692:12345-192.168.121.148@o2ib2:4 Request procesed in 131348976us (131349034us total) trans 90245583 rc 0/0 00000020:00000040:6.0:1323124962.428877:0:6286:0:(genops.c:758:class_export_put()) PUTting export ffff88062ed8a000 : new refcount 19 In the next step, let's apply a new set of debug parameters on the OST, I'll consult ldiskfs expert for suggestion. |
| Comment by Christopher Morrone [ 06/Dec/11 ] |
|
If there is a bug that is causing OST slowness, we can address that in another JIRA ticket. But in my experience, OST slowness is a frequent problem in production. No matter what we do to the OST code, there will be times when it is slow. The client has completely fulfilled its obligation; there is no good reason for evicting it. So in this bug I would like us to look into fixing that basic problem. |
| Comment by Jinshan Xiong (Inactive) [ 06/Dec/11 ] |
|
From my point of view, the client was doing what it is supposed to do. From the log, the OST took longer than expected to handle the page writeback request. It is really difficult to fix this problem on the client side. However, the slowness of the OST is still only a symptom; we should dig out the root cause. |
| Comment by Jinshan Xiong (Inactive) [ 06/Dec/11 ] |
|
Is it possible to disable the OSS read cache to see if we have better luck? |
| Comment by Ned Bass [ 06/Dec/11 ] |
Sure, I'll report back with results. |
| Comment by Christopher Morrone [ 06/Dec/11 ] |
|
As Ned said, we will look into that. But let's track anything about obdfilter slowness or OSS performance problems in another bug. We really need to work on a solution to the problem here. As I understand it, the server has a cancellation from the client sitting in its queue, but evicts the client anyway. That is clearly a bug in the server. I really think that the server needs to either extend the ldlm timeout to an indefinite timeout (-1 or something), or set an explicit flag in the lock that means "do not evict, cancellation is sitting in our queue". We might consider making the cancel message high-priority, but I really don't have a full enough understanding of all of the corner cases to claim that we can do that safely. |
| Comment by Jinshan Xiong (Inactive) [ 06/Dec/11 ] |
I had the same idea to use a new portal for canceling (OST_WRITE) requests, but finally I found this won't work. This is because the lock cancel code has to evict all pages, so it has to lock all covered pages; that means it doesn't help if some of the pages are being written by the page-writeback daemon, because those RPCs will still be on the normal portal at the server side.
| Comment by Christopher Morrone [ 06/Dec/11 ] |
|
Ah, sorry, I wasn't thinking it was the OST_WRITE that is the problem. If I remember correctly, an OST_WRITE will extend a lock timeout, but perhaps only when it is actually processed. Is that correct? If so, we really need to extend the lock timeout while it is sitting in the queue. I would think that an OST_WRITE would be in the AT queue, so we could extend the AT timer extension code to also extend lock timeouts... but the lock timer and the RPC timer are not necessarily related. Do we need an additional queue, like the AT timeout queue, for lock timeout extension? |
| Comment by Christopher Morrone [ 06/Dec/11 ] |
|
I see "ost_rw_prolong_locks()) refresh locks" lines in the OSS debug log, so I am going to start by assuming that the client is correctly attaching the lock handle to the OST_WRITE rpc. If that is the case, the ldlm lock should be extended at "processing" time, just before the rpc is put on the hpreq queue. I am not 100% certain from the logs that this is happening, but I am going to assume that it is. I believe that the only other time that the ldlm lock timeout is extended is at handling time. There seems to be a basic faulty assumption here that the hpreq queue will always be handled in a short, bounded amount of time. This may often be true, but the system provides absolutely no guarantee of queue waiting time. Therefore we MUST extend ldlm locks regularly while rpcs are sitting on the queues. If we do not, there will always be cases when we evict the clients even though the clients have done nothing wrong. |
| Comment by Jinshan Xiong (Inactive) [ 06/Dec/11 ] |
|
Yes, we should take queue waiting time into account. The final solution for this kind of problem is the implementation of a resource control protocol or NRS; that is to say, clients should be told not to send any more requests when the OST queue is overloaded. |
| Comment by Christopher Morrone [ 07/Dec/11 ] |
|
NRS will help, but not solve the underlying problem. It will be capable of limiting the rate at which rpcs are enqueued, but not the rate at which they are dequeued. So NRS still leaves us with a design flaw that we need to address. |
| Comment by Christopher Morrone [ 07/Dec/11 ] |
|
Furthermore, NRS introduces a NEW problem. Currently the clients cannot send any kind of early reply for ldlm lock revocations. The only way to extend an ldlm lock timeout is to put a bulk write into the server's queue. If NRS prevents a client from putting a bulk rpc into the queue for any length of time, it will increase the probability that the client will be evicted. |
| Comment by Jinshan Xiong (Inactive) [ 07/Dec/11 ] |
|
I don't know NRS in detail, but it's good to let the developers know the requirements. CC'ing Liang and Andreas. BTW, no news may be good news: did disabling the OSS read cache fix this problem? |
| Comment by Christopher Morrone [ 08/Dec/11 ] |
|
Disabling read cache is irrelevant as far as I can tell. We are not looking for a work-around, we are looking for a real fix to the problem. |
| Comment by Jinshan Xiong (Inactive) [ 08/Dec/11 ] |
|
There is a lot of discussion about OSS read cache in |
| Comment by Christopher Morrone [ 08/Dec/11 ] |
|
I would prefer that someone work on THIS bug before dealing with OSS read cache problems. |
| Comment by Jinshan Xiong (Inactive) [ 08/Dec/11 ] |
|
Hi Chris, I mentioned the OSS read cache because we've seen this problem before. When the OSS read cache is enabled, it consumes lots of memory, which worsens the quality of IO on the OST side. This is why there were lots of "slow IO" warning messages printed. Unfortunately there is no decent way of fixing it other than disabling the OSS read cache. Sorry for the confusion. I have CC'ed wangdi, who was taking care of this problem before. |
| Comment by Christopher Morrone [ 08/Dec/11 ] |
|
The slowness of the OST can be dealt with elsewhere. There will ALWAYS be times when the OST is slow for some reason, be it OSS read cache, broken drives, whatever. Lustre has 512 IO threads. It is EASY to generate pathological IO patterns that drive the IO wait times into the minutes. Users do it all the time. Lustre must behave correctly when this happens. This is probably the reason (or at least one reason) that we see evictions in production with 1.8 as well. We need to finally address the root problem here! |
| Comment by Di Wang [ 08/Dec/11 ] |
|
I agree that if the eviction is caused by slow IO on the server side, we probably should not punish the client here. Just checking the code: in ost_rw_prolong_locks(), it actually takes into account all of the "handling" time when updating the timeout for the lock. /* prolong locks for the current service time of the corresponding
&req->rq_rqbd->rqbd_service->srv_at_estimate includes all of the "processing" time of the RPC, which runs from arrival to reply sent, so it should include the slow IO time here. I will investigate it further; probably some logic is not right here. Thanks. |
| Comment by Jinshan Xiong (Inactive) [ 09/Dec/11 ] |
|
Hi Chris, This issue is our top priority now. We're discussing solution and will get back to you soon. |
| Comment by Christopher Morrone [ 09/Dec/11 ] |
That is only true at a fixed point in time, when the RPC is initially processed and enqueued. srv_at_estimate is just a guess of how long the RPC processing will take, based on PAST information. The RPC processing time can, and often does, exceed the estimate. With AT, the RPC timeout will be extended and the client will be sent an early reply so that it does not time out on the RPC. However, the server fails to update the associated lock timeout when the RPC timeout is extended by the AT code. We have a patch that we are testing tonight that extends the lock timeout whenever the AT code extends the RPC timeout. It is basically just a one-liner that calls ptlrpc_server_hpreq_check() from ptlrpc_at_check_timed() when it extends the AT timer. We'll let you know how it goes, and I'll push it to gerrit when I'm at work tomorrow. |
| Comment by Christopher Morrone [ 10/Dec/11 ] |
|
Testing didn't show a significant improvement. Since oss read cache seems to be making server-side log collection virtually impossible, we have disabled it. We'll see if the eviction bug is still reproducible without it. |
| Comment by Mike Jacobs [ 12/Dec/11 ] |
|
Just one question: by any chance, are you mounting the Lustre filesystem directly on the OSD node? If that is the case, we ran into similar problems with spontaneous evictions and traced them back to that mount - we hit a deadlock. |
| Comment by Jinshan Xiong (Inactive) [ 12/Dec/11 ] |
|
Hi Chris, I was going to take the lock out of the waiting list if there is a pending IO request for that lock. But I regard your proposal, extending the lock timeout each time early replies are sent, as a better way. Can you please describe the problem you met in testing in a bit more detail? |
| Comment by Christopher Morrone [ 12/Dec/11 ] |
|
Mike, no, our Lustre clients and OSSs are separate nodes. |
| Comment by Andreas Dilger [ 13/Dec/11 ] |
|
In discussion today, the latest news from testing is that disabling the OSS read cache has avoided this problem. However, we'd like to understand the problem better, so additional testing will be done to isolate the root cause. The first test will try to determine whether threads-per-client < rpcs_in_flight will allow each thread to at least get some IO to the server. This can be achieved by increasing the client count and reducing threads per client (still 512 total threads), setting "lctl max_rpcs_in_flight=(2 * cores)", or some combination of both. Sub-task bugs have been filed to investigate the various issues that have been discussed. |
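For reference, a hedged example of adjusting the client-side RPC concurrency (the parameter path matches what Christopher sets later in this ticket; 32 is just that example value, not a recommendation):
# On each client: raise the per-OSC RPC concurrency
lctl set_param osc.*.max_rpcs_in_flight=32
# Verify the current values
lctl get_param osc.*.max_rpcs_in_flight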
| Comment by Christopher Morrone [ 13/Dec/11 ] |
|
I may have been too quick to judge that disabling the OSS read cache avoided the problem. I still see evictions in the logs after I turned it off. I didn't manually flush the cache on the OSS nodes until Monday though, so if the cache doesn't clear out on its own in a reasonable fashion... We need another clean run to make sure it is really unreproducible without the OSS read cache. hype was taken for other testing, so we'll try to get testing running from alternate clients shortly. Also, I should probably clarify the reproducer. While it initially looked like a single-shared-file run of IOR was the reproducer, we have seen it at other times as well. File-per-process IOR has had evictions, and on PW locks spanning the entire file. Also, a single IOR run will probably never reproduce this problem. We only see it when running the IO portion of the SWL suite, which runs many smaller IO jobs at the same time: mostly IOR and simul jobs. |
| Comment by Jinshan Xiong (Inactive) [ 15/Dec/11 ] |
|
Can you please run the test with RPC trace enabled? We're going to analyze the IO model so as to generate similar scenarios to reproduce it in our lab. |
| Comment by Christopher Morrone [ 15/Dec/11 ] |
|
We already have, look at the logs we provided. Unless you mean that you want us to trace every single node in the cluster, which seems rather difficult. Especially since we already know that logging on the server side is no good when problems happen... |
| Comment by Jinshan Xiong (Inactive) [ 15/Dec/11 ] |
|
Indeed, I'll use that log then. Now that we suspect the problem is about IO quality, we should collect brw_stats on the obdfilter as well. There is also another proc entry on the obdfilter, writethrough_cache_enable; we should disable it as well. It has the same effect as read_cache_enable, consuming a lot of memory on the server side. |
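For anyone following along, a hedged sketch of the corresponding OSS-side commands (parameter names as used elsewhere in this ticket; the wildcard covers all obdfilter targets on the node):
# On the OSS: disable both server-side caches
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0
# Collect per-target brw_stats for analysis
lctl get_param obdfilter.*.brw_stats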
| Comment by Christopher Morrone [ 15/Dec/11 ] |
|
Ok, I think I can reproduce the problem like this. I create a job script with these contents:
#!/bin/bash
srun -o job-%J.log /g/g0/morrone/src/ior/src/ior -t 128M -b 256M -e -C -g -o /p/lcrater3/morrone/test.$SLURM_JOBID
Then I queue up many of those jobs by doing:
sbatch -O -N 16 -n 256 mytestscript
On one of our systems, three 16-node jobs can run at the same time. At some point we will have a server start spewing messages about "unable to allocate tage", which is why our logging on the servers is so ineffective. Usually there are also a fair number of slow IO messages of various sorts. Somewhere in there we'll see evictions. We really need to get to the bottom of the tage allocation issues. If we can't get server logs, it is going to be hard to debug this. Clearly there is some kind of memory pressure issue. Note that we have OSS read cache DISabled, so that apparently was not the problem. |
| Comment by Christopher Morrone [ 15/Dec/11 ] |
|
Ah, I did not disable writethrough_cache_enable, only read_cache_enable! That probably explains the huge cache usage I see. |
| Comment by Christopher Morrone [ 16/Dec/11 ] |
|
Disabling writethrough_cache_enable in addition to disabling read_cache_enable does (at least for now) appear to have eliminated the "tage" allocation issues. We still do see the eviction, but now hopefully the server logs will not be missing lines. We'll share logs when we get them. |
| Comment by Christopher Morrone [ 17/Dec/11 ] |
|
I am seeing some decent 400MB logs (with debug_mb set to 1000) on the server side now. Not entirely certain if everything we need is in there though. There seems to be some strangeness. For instance, I am looking at a server log where it evicted a client, but the client seems oblivious to the eviction. Strange. But anyway, maybe someone can explain LDLM lock flags to me. I am seeing the following. 2011-12-17 18:35:07.033418 00010000:00020000:8.1::0:0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 465s: evicting client at 192.168.121.115@o2ib2 ns: filter-lc3-OST0001_UUID lock: ffff8802643ddd80/0xe068980dd4dacbe7 lrc: 3/0,0 mode: PW/PW res: 93591578/0 rrc: 1739 type: EXT [24024973312->24058527743] (req 24024973312->24026021887) flags: 0x20 remote: 0x2b77fe919eba1dfa expref: 101 pid: 5002 timeout 5075017673 The lock callback timer expires at 18:35:07, supposedly after 465s. Now looking back in the logs, I can't really tell where the server ever sends the lock callback. But I DO see plenty of "Luck dump" logs which contain this particular lock. It looks like the first time I see it in a debug log is here: 2011-12-17 18:27:43.531550 00010000:00000040:4.0::0:26699:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff8802643ddd80/0xe068980dd4dacbe7 (rc: 2) (pos: 143) (pid: 5002) 2011-12-17 18:27:43.531551 00010000:00000040:4.0::0:26699:0:(ldlm_lock.c:1902:ldlm_lock_dump()) Node: NID 192.168.121.115@o2ib2 (rhandle: 0x2b77fe919eba1dfa) 2011-12-17 18:27:43.531552 00010000:00000040:4.0::0:26699:0:(ldlm_lock.c:1915:ldlm_lock_dump()) Resource: ffff8805ef88b1c0 (93591578/0/0) 2011-12-17 18:27:43.531553 00010000:00000040:4.0::0:26699:0:(ldlm_lock.c:1920:ldlm_lock_dump()) Req mode: PW, grant mode: PW, rc: 2, read: 0, write: 0 flags: 0x0 2011-12-17 18:27:43.531554 00010000:00000040:4.0::0:26699:0:(ldlm_lock.c:1926:ldlm_lock_dump()) Extent: 24024973312 -> 24058527743 (req 24024973312-24026021887) It appears in various times, with little change but the "pos", which seems normal. the "flags" field remains 0x0. Up through an entry at 18:33:16. The next entry at 18:34:23 shows "flags" as 0x20: 2011-12-17 18:34:23.632487 00010000:00000040:3.0::0:7170:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff8802643ddd80/0xe068980dd4dacbe7 (rc: 3) (pos: 119) (pid: 5002) 2011-12-17 18:34:23.632488 00010000:00000040:3.0::0:7170:0:(ldlm_lock.c:1902:ldlm_lock_dump()) Node: NID 192.168.121.115@o2ib2 (rhandle: 0x2b77fe919eba1dfa) 2011-12-17 18:34:23.632489 00010000:00000040:3.0::0:7170:0:(ldlm_lock.c:1915:ldlm_lock_dump()) Resource: ffff8805ef88b1c0 (93591578/0/0) 2011-12-17 18:34:23.632490 00010000:00000040:3.0::0:7170:0:(ldlm_lock.c:1920:ldlm_lock_dump()) Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 0 flags: 0x20 2011-12-17 18:34:23.632491 00010000:00000040:3.0::0:7170:0:(ldlm_lock.c:1926:ldlm_lock_dump()) Extent: 24024973312 -> 24058527743 (req 24024973312-24026021887) 0x20 is LDLM_FL_AST_SENT. So somewhere between 18:33:16 and 18:34:23 the OSS must have at least thought about sending the lock blocking callback to the lustre client. Am I correct that LDLM_FL_AST_SENT means that the server tried to send an LDLM_BL_CALLBACK? So there would appear to be a couple of problems here: 1) Less than two minutes before the eviction, the LDLM_FL_AST_SENT flag is not set, but the eviction message claims that we timed out after 465 seconds. I uploaded zwicky3-lustre-log.1324175738.4930.txt.bz2. This is log from an OSS after evicting a client. |
| Comment by Christopher Morrone [ 18/Dec/11 ] |
|
I just uploaded another file: This tarball contains OSS and client log after an eviction, and a snippet from the console log. Interestingly, the console log shows the server receiving an ldlm_cancel from the client shortly after the OSS evicted the client and dumped its log. I see several incidents like this in the log. Not all show a late ldlm_cancel, but some do. |
| Comment by Christopher Morrone [ 18/Dec/11 ] |
|
At 4pm today I set osc.*.max_rpcs_in_flight=32 on all of the hype clients. We'll let you know tomorrow if that reduces the eviction rate. |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/11 ] |
This is the best log so far. Can you please show me the brw_stats of this target, if the system is still alive? From what I have seen in the log, it's the same problem. In the OST log:
Xid x1388386122264103 must be the RPC to write this 1-page-sized lock. This RPC was started at time 1324236462, but it wasn't finished even when the client was evicted (from the client log, it was finished around 1324236721). One thing worth mentioning is that the target finished the transaction 20 seconds later, because from the client log:
It saw last committed transno 207101084 at 1324236700. But I still don't understand why the whole transaction took 40 seconds to finish, so brw_stats may help. |
| Comment by Christopher Morrone [ 19/Dec/11 ] |
|
Attached brw_stats from lc3-OST0001. |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/11 ] |
|
Please try this patch: http://review.whamcloud.com/1893 At least this patch will prevent clients from being evicted, but the real cause of why it takes so long to finish an IO is still under investigation. |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/11 ] |
|
The above patch is just a first step; I'm working on simplifying f31b79be5a0380df3ed05c16fa43feca2bf5905c |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/11 ] |
From the stats, as a result of disabling the read/writethrough cache, the fragmented IO numbers are pretty good; though I didn't see this data before, this one looks really good. However, the IO time is unreasonably high:
I guess this is because the disk is quite busy: 50% of the time there are over 32 requests in flight. This is no good; we should work out a flow-control algorithm to stop sending low-priority requests to a target when it is this busy. |
| Comment by Christopher Morrone [ 19/Dec/11 ] |
If I understand correctly, the fix is moving the ptlrpc_hpreq_fini() calls, and the rest is just cleanup. Moving the ptlrpc_hpreq_fini() call leaves the request on the exp_queued_rpc queue, which should allow ldlm_lock_busy() to find the RPC, which in turn allows waiting_locks_callback() to extend the lock rather than timing out. Is that correct? I'll give it a try. |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/11 ] |
|
Hi Chris, Yes, you're right. Another fix in the patch is to match extent exactly. |
| Comment by Eric Barton (Inactive) [ 20/Dec/11 ] |
|
I've graphed the stats from comment-24967 in 874pdf.pdf. If I'm interpreting these stats correctly, all reads and the vast bulk of writes complete within 16 seconds, but some writes are taking many minutes. We really need an explanation for these long-running writes - what is causing the long delay (memory allocation? elevator? storage device?) and why aren't reads similarly affected? As an aside, I'm also surprised at the high proportion of writes that complete in <= 1ms. I assume this means the storage device is write-back rather than write-through? |
| Comment by Eric Barton (Inactive) [ 20/Dec/11 ] |
|
I've also graphed pages per RPC and I/O here to compare the number of pages: 1. You can see many more single-page reads and writes than there were single-page RPCs. |
| Comment by Ned Bass [ 20/Dec/11 ] |
Yes, the storage controller uses a write-back cache with battery protection. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
The patch shows promise, but our testing has been weak so far. We've started hitting ldiskfs panics that we've never seen before. On the assumption that the panic is |
| Comment by Jinshan Xiong (Inactive) [ 20/Dec/11 ] |
|
The purpose of this patch is to let you move forward while we're working on it. We still don't know the root cause of this problem, so the time to a fix is uncertain. You can also try the 2nd version of the patch; it can reduce lock contention on the obdfilter. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
Ok. The SWL is running now, and I want to let that get a few hours in on the previous patch to have a little more confidence in the fix. We are upgrading the client side in a few hours, so I'll prep the second patch to be installed on the servers then. Then we will run the IO portion of the SWL over night. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
It looks like we are still seeing evictions with the first patch from http://review.whamcloud.com/1893. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
Uploaded OSS log zwicky4-lustre-log.1324422741.5007.txt.bz2 to ftp.whamcloud.com. |
| Comment by Di Wang [ 20/Dec/11 ] |
|
Just talked with Jinshan. As for the slow IO, there are still a few things we are not clear on yet.
1. In brw_stats, under disk fragmented I/Os (ios % cum % | ios % cum %), there is still 20% of I/O that is fragmented, and we do not know why. Could you please tell me what storage you use on that OSS? What is the max_hw_segments setting on that OST?
2. In your script, it does
#!/bin/bash srun
Do you mean that if you just keep launching this script, you can see the slow IO and client evictions? Are there any other jobs which drive the OSS busy as well? Could you please tell us more about these jobs. |
| Comment by Jinshan Xiong (Inactive) [ 20/Dec/11 ] |
|
Did you disable the read cache and write-through cache in the latest run? I'm looking at the log; it appears to be the first one where useful information was lost. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
Yes, I double-checked, and both the read cache and the writethrough cache are disabled. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
Yes, that is correct. I issue that several hundred times with the sbatch command that I provided. The SWL I/O load is a more reliable and frequent reproducer of the slow I/O and evictions, so that is mainly what we have running on hype right now. That is a mix of many sizes of IORs, simul, and mdtest, and it may do some logging of its own output into Lustre.
One NetApp enclosure with 2 controllers and 60 drives. The drives are grouped into RAID6 arrays of 10 drives, 6 LUNs total. Two OSSes are each IB-connected to both controllers. Each OSS is primary for 3 OSTs.
max_segments is 255. max_hw_sectors_kb is 32767. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
brw_stats for all three OSTs on zwicky3, about three hours after startup. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
Note that those jobs run IN PARALLEL, at least three at any given moment. If you just run IOR jobs sequentially, you may never see this. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
No, that's pretty normal. Until now I have only been passing on logs that looked like they might be useful. This one I just passed on without looking at it first. Think about what you need to know, add logging, and we'll test it. I think that's the only way we are going to make progress on this. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
Possibly this is just a result of temporary hangs due to slow io on the servers, but I thought I should mention that this is happening fairly frequently on the clients:
Dec 20 16:06:37 hype265 kernel: INFO: task flush-lustre-1:5442 blocked for more than 120 seconds.
Dec 20 16:06:37 hype265 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 16:06:37 hype265 kernel: flush-lustre- D 0000000000000000 0 5442 2 0x00000080
Dec 20 16:06:37 hype265 kernel: ffff8804324cf9a0 0000000000000046 0000000000000000 ffffffffa04aa494
Dec 20 16:06:37 hype265 kernel: 0000000000000000 ffff88041bd60300 ffff8804324cf930 0000000000000000
Dec 20 16:06:37 hype265 kernel: ffff880420e9fb38 ffff8804324cffd8 000000000000f4e8 ffff880420e9fb38
Dec 20 16:06:37 hype265 kernel: Call Trace:
Dec 20 16:06:37 hype265 kernel: [<ffffffffa04aa494>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Dec 20 16:06:37 hype265 kernel: [<ffffffff81110f80>] ? sync_page+0x0/0x50
Dec 20 16:06:37 hype265 kernel: [<ffffffff814eeba3>] io_schedule+0x73/0xc0
Dec 20 16:06:37 hype265 kernel: [<ffffffff81110fbd>] sync_page+0x3d/0x50
Dec 20 16:06:37 hype265 kernel: [<ffffffff814ef40a>] __wait_on_bit_lock+0x5a/0xc0
Dec 20 16:06:37 hype265 kernel: [<ffffffff81110f57>] __lock_page+0x67/0x70
Dec 20 16:06:37 hype265 kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
Dec 20 16:06:37 hype265 kernel: [<ffffffff81125107>] ? __writepage+0x17/0x40
Dec 20 16:06:37 hype265 kernel: [<ffffffff81126662>] write_cache_pages+0x392/0x4a0
Dec 20 16:06:37 hype265 kernel: [<ffffffff811250f0>] ? __writepage+0x0/0x40
Dec 20 16:06:37 hype265 kernel: [<ffffffff81126794>] generic_writepages+0x24/0x30
Dec 20 16:06:37 hype265 kernel: [<ffffffff811267c1>] do_writepages+0x21/0x40
Dec 20 16:06:37 hype265 kernel: [<ffffffff811a166d>] writeback_single_inode+0xdd/0x2c0
Dec 20 16:06:37 hype265 kernel: [<ffffffff811a1aae>] writeback_sb_inodes+0xce/0x180
Dec 20 16:06:37 hype265 kernel: [<ffffffff811a1c0b>] writeback_inodes_wb+0xab/0x1b0
Dec 20 16:06:37 hype265 kernel: [<ffffffff811a1fab>] wb_writeback+0x29b/0x3f0
Dec 20 16:06:37 hype265 kernel: [<ffffffff814ee400>] ? thread_return+0x4e/0x77e
Dec 20 16:06:37 hype265 kernel: [<ffffffff8107cc02>] ? del_timer_sync+0x22/0x30
Dec 20 16:06:37 hype265 kernel: [<ffffffff811a2299>] wb_do_writeback+0x199/0x240
Dec 20 16:06:37 hype265 kernel: [<ffffffff811a23a3>] bdi_writeback_task+0x63/0x1b0
Dec 20 16:06:37 hype265 kernel: [<ffffffff81090ab7>] ? bit_waitqueue+0x17/0xd0
Dec 20 16:06:37 hype265 kernel: [<ffffffff811351d0>] ? bdi_start_fn+0x0/0x100
Dec 20 16:06:37 hype265 kernel: [<ffffffff81135256>] bdi_start_fn+0x86/0x100
Dec 20 16:06:37 hype265 kernel: [<ffffffff811351d0>] ? bdi_start_fn+0x0/0x100
Dec 20 16:06:37 hype265 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
Dec 20 16:06:37 hype265 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Dec 20 16:06:37 hype265 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
Dec 20 16:06:37 hype265 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 |
| Comment by Jinshan Xiong (Inactive) [ 20/Dec/11 ] |
From this log, I didn't see any attempt from the client side to write pages back. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
True. We may have more than one issue here. We'll just have to see what happens overnight. Too soon to say. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
Something is very odd here. I just saw a single OST evict nine different nodes. But none of those nodes appear to know that they were evicted. And yet ANOTHER client did think it was evicted and drop its log. I will attach a tarball. |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
Uploading |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
Sorry, didn't include the decoded hype log. Better tar ball is |
| Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ] |
Sure, please try patch set 4 at http://review.whamcloud.com/#change,1893. In this patch, I removed ldlm_resource_dump() so as to have more room for useful logs on the server side. In the latest run, the strangest thing is that there is no 'slow IO' or 'prolong' information any more, nor was any RPC sent for lock cancel; it looks very much like the client is simply hung. This seems to be a good sign to me. So, please do the following things in the next run: Thanks. |
| Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ] |
|
Hmm, it had the 'prolong the busy lock' message in the log. Is it always the same OSS that runs into the problem? |
| Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ] |
|
For that log: does it mean there is a problem with the network? |
| Comment by Christopher Morrone [ 21/Dec/11 ] |
|
FYI, |
| Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ] |
|
After taking a closer look at the log, the cancel RPC simply was not handled in time. From the log, the last LDLM_CANCEL from this client that was handled is:
It took 50 seconds to handle that request. From the client log, this request was sent at:
This may be due to contention on the resource lock. Can you please monitor the CPU overhead on the OSS? It would also help if you can use oprofile to verify whether there is heavy contention on the resource lock. |
| Comment by Christopher Morrone [ 21/Dec/11 ] |
|
I mentioned this in person, but just to record it here: the panics are believed to be unrelated. |
| Comment by Brian Behlendorf [ 21/Dec/11 ] |
|
This morning I started looking into this issue too. I started by carefully tracing the lock cookie 0x54e8463975d1f617 from your previous comment. What I see in the logs for this request is that the server sent a blocking cancel to the client as xid 1388750491010582 for lock handle 0x54e8463975d1f617. The client cancels this quickly because there is no dirty data, just as you said, and sends its cancel RPC as xid x1388756785471675 back to the server. Now, this second xid doesn't appear in the server logs, but I don't think it's because of a network issue; the server log just doesn't go back far enough in time. If you notice, that first blocking cancel 1388750491010582 doesn't show up in the server log either. Both of these events just occurred before the log starts. Chris is going to apply your latest patch and increase the log size so we can get the critical bit of data about how the server (mis)handled the cancel. |
| Comment by Christopher Morrone [ 21/Dec/11 ] |
|
Patchset 4 of http://review.whamcloud.com/1893 is on, and debug_mb has been increased. Just waiting for the next eviction. |
| Comment by Di Wang [ 21/Dec/11 ] |
|
I just posted the patch http://review.whamcloud.com/#change,1899 for reserving pages for each OSS thread. Could you please add it to see whether it alleviates the fragmented I/O problem, which might be one reason for the slow I/O? Thanks. |
| Comment by Andreas Dilger [ 21/Dec/11 ] |
|
As a general rule, it isn't worthwhile to apply a patch for testing unless it has already passed our own autotest suite. Otherwise it is wasting everyone's time and effort. Please hold off applying change 1899 until it has gotten the OK from Maloo. |
| Comment by Christopher Morrone [ 21/Dec/11 ] |
|
Uploading (it is taking a while) I'll add http://review.whamcloud.com/#change,1899 next. |
| Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ] |
Yes, I made that conclusion too quickly. After taking a closer look, I found this may be a lock contention issue on the server, because it took around 50 seconds to handle the previous LDLM_CANCEL with xid x1388756785471672 from the same client (please take a look at the comment at 21/Dec/11 1:40 PM); the server just didn't get a chance to handle it. So I suspect the server was experiencing the following issue: In the new run, I'm asking you to monitor server CPU usage and RPC stats, and ideally to run oprofile, to see whether there is heavy contention on the ldlm resource lock. Thanks for helping. |
| Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ] |
Take a look at xid x1388756786783757; this is the LDLM_CANCEL request from the client side. It looks like it wasn't handled in time, so an early reply was sent. Unfortunately, the lock soon timed out in the waiting list, at 1324507012.765655.
As with BRW requests, it looks like we should prolong locks if the cancel requests are already in the queue. I'll cook up a patch soon. |
| Comment by Christopher Morrone [ 21/Dec/11 ] |
|
FYI, I don't think we will learn much from oprofile. Just watching the node with top, it rarely uses more than 15% of CPU time, although of course the rest of the idle time can frequently be attributed to iowait. I do note (using "perf record") that of the small amount of CPU time that is used, the ldlm_callback_handler is frequently in ost_brw_lock_get. |
| Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ] |
|
Please try this patch: http://review.whamcloud.com/1900 This patch prioritizes cancel requests, as we have done for BRW requests. Hi Chris, just to be sure: you didn't see significant time being spent in spin_lock, right? Otherwise, I'm wondering why it ever took 50 seconds to handle a DLM cancel request. |
| Comment by Brian Behlendorf [ 21/Dec/11 ] |
|
Yes, I agree. The new server logs pretty clearly show that we just failed to extend the timeout on the server for the LDLM_CANCEL request which was sitting in the queue. I'm somewhat surprised we didn't think of this case when the hpreq handlers were introduced to originally resolve this same basic eviction issue. We only targeted the bulk requests and ignored the actual cancel requests. I suppose at the time this probably addressed the majority of the problem. Regardless, we should register a srv_hpreq_handler for the ldlm service. This would allow us to selectively flag the LDLM_CANCEL request as high priority so it gets added to the request export list. The timer will then get properly extended using the same mechanism the bulks use, which is a pretty clean fix. Alternately, we could extend the timeout when the early reply is sent, but that seems slightly less clean to me since we already have the existing hpreq infrastructure for this. Anyway, it sounds like you're already cooking up a patch. I'll be happy to review whatever you come up with. (A rough, illustrative sketch of this high-priority flagging follows the logs below.)
— Summary —
— Logs —
CLIENT 1324506831.445065:0:5089:0:(service.c:1708:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb_00:LOV_OSC_UUID+5:5011:x1388842889277109:12345-10.1.1.4@o2ib9:104
SERVER 1324506831.432953:0:5011:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ll_ost_73:lc3-OST0004_UUID:5011:1388842889277109:192.168.121.98@o2ib2:104
CLIENT 1324506980.324793:0:5065:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-brw:a6f0cdd7-033b-4d74-afd4-eb1f173a8216:5065:1388756786783710:10.1.1.4@o2ib9:4
SERVER 1324506980.312259:0:4098:0:(events.c:284:request_in_callback()) incoming req@ffff88062737d000 x1388756786783757 msgsize 296
CLIENT 1324507026.675449:0:5248:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_02:a6f0cdd7-033b-4d74-afd4-eb1f173a8216:5248:1388756786783757:10.1.1.4@o2ib9:103 |
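The idea of registering a high-priority check so that LDLM_CANCEL requests are flagged and land on the per-export list that already drives timer extension can be sketched roughly as follows. This is a toy model with invented names (for example hpreq_check_t), not the actual srv_hpreq_handler code or the patch that was eventually written:

#include <stdbool.h>
#include <stdio.h>

enum opcode { OPC_LDLM_CANCEL = 103, OPC_OTHER = 0 };

struct request {
    enum opcode opc;
    bool high_prio;             /* if set, the request joins the export's
                                   queued-RPC list, extending lock timers */
};

typedef bool (*hpreq_check_t)(const struct request *);

struct service {
    hpreq_check_t hpreq_check;  /* per-service high-priority classifier */
};

/* The classifier an ldlm-like service would register. */
static bool ldlm_hpreq_check(const struct request *req)
{
    return req->opc == OPC_LDLM_CANCEL;
}

static void incoming_request(struct service *svc, struct request *req)
{
    req->high_prio = svc->hpreq_check && svc->hpreq_check(req);
    printf("opc %d -> %s queue\n", req->opc,
           req->high_prio ? "high-priority" : "normal");
}

int main(void)
{
    struct service ldlm_svc = { ldlm_hpreq_check };
    struct request cancel = { OPC_LDLM_CANCEL, false };
    struct request other  = { OPC_OTHER, false };

    incoming_request(&ldlm_svc, &cancel);  /* flagged high-priority */
    incoming_request(&ldlm_svc, &other);   /* stays normal          */
    return 0;
}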
| Comment by Brian Behlendorf [ 21/Dec/11 ] |
|
Nice! You implemented what I was asking for even before I could post my comment. We'll give it a spin thanks! |
| Comment by Jinshan Xiong (Inactive) [ 22/Dec/11 ] |
|
How is the latest patch doing? Can I have these stats? Also, can you please tell me which elevator algorithm you're using for the storage? |
| Comment by Eric Barton (Inactive) [ 22/Dec/11 ] |
|
A further note: can you confirm that we have a set of brw_stats taken after both (a) caching was disabled on the OSS and (b) brw_stats were reset or the modules were reloaded? Can you also telnet into the Pike's Peak controller (Chris Morrone knows how), run 'luall' to list volume group average and max latencies and 'luall 3' to list individual volume latencies, and post the output here? Thanks. |
| Comment by Christopher Morrone [ 22/Dec/11 ] |
|
luall doesn't show much, just "OldestCmd Age(ms)". And this command shows stats per DRIVE, not per volume/lun. The worst appears to be 21ms. "luall 3" shows read and write "ART(us)" and "MRT(us)" times for all drives. I am assuming that ART is average, and MRT is max...what "RT" means exactly I am unsure. For reads, worst ART is around 16000, and worst MRT is 1000000 (1s). |
| Comment by Christopher Morrone [ 22/Dec/11 ] |
|
We use the default elevator, whatever that is. Wow, you guys really seem more concerned about the slow io than we are. If Ned or Brian has time tomorrow perhaps they can get brw_stats or jbd2 info. But our motivation is low to figure out the performance with ldiskfs on this hastily thrown together test system. We are really busy, and should spend any performance investigation time on Orion and ZFS.
Yes, the second brw_stats that I attached to this bug was after running for about 3 hours after a fresh reboot of the servers with lustre's writethrough_cache_enable=0 and read_cache_enable=0: http://jira.whamcloud.com/secure/attachment/10694/zwicky3_brw_stats.txt |
| Comment by Christopher Morrone [ 22/Dec/11 ] |
|
We did get an eviction after some testing today with . Unfortunately, they rebooted the clients and I didn't know about it. Debug settings were not enabled on the client, so I only have a server log. I will upload that soon. We'll probably hit it again overnight, and Ned can gather logs for us tomorrow. |
| Comment by Christopher Morrone [ 22/Dec/11 ] |
|
Uploading server log to zwicky4-lustre-log.1324601430.5293.txt.bz2 Note that we had the following two patches applied: http://review.whamcloud.com/1893 (patch set 4) We didn't try 1899 yet. |
| Comment by Jinshan Xiong (Inactive) [ 22/Dec/11 ] |
|
I already know the problem, I'll cook a patch for this. |
| Comment by Christopher Morrone [ 22/Dec/11 ] |
|
It looks like another failure mode where the client already had the writes outstanding before the blocking AST was issued? I see write RPCs that started before the blocking AST, but finished long after it (shortly before the lock timeout). So the client may have had no additional dirty data to flush, so nothing winds up in the hpreq queue. Is that your assessment? Do you think the patch will be ready to try in the next hour or two? If so, I can put it on the servers tonight. Otherwise it will have to wait for Ned or Brian in the morning. |
| Comment by Jinshan Xiong (Inactive) [ 22/Dec/11 ] |
|
The problem is that we should refresh the lock timeout after the request is finished, instead of refreshing it while the request is being handled; otherwise there is a race where the lock times out immediately after we take the request out of the HP list. The correct way is to take the corresponding locks out of the waiting list while HP RPCs are sitting in the queue, and after the HP RPCs have finished, add them back into the waiting list with a new lock timeout so that clients will have enough time to send the DLM cancel request. This patch will be a bit more complex than the others, so I will need more time. |
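A minimal sketch of that ordering, with hypothetical names rather than the real patch: while a high-priority RPC covering a lock is being handled, the lock stays off the waiting list so it cannot expire mid-handling, and it is only re-added, with a fresh deadline, when the RPC finishes:

#include <stdbool.h>
#include <stdio.h>

struct lock {
    bool on_waiting_list;  /* eligible for timeout-based eviction */
    long deadline;         /* absolute expiry when on the list    */
};

static void hp_rpc_start(struct lock *lk)
{
    lk->on_waiting_list = false;           /* cannot expire while busy */
    printf("HP RPC started, lock removed from waiting list\n");
}

static void hp_rpc_finish(struct lock *lk, long now, long timeout)
{
    lk->on_waiting_list = true;
    lk->deadline = now + timeout;          /* fresh window for the cancel */
    printf("HP RPC done, lock re-added, deadline %ld\n", lk->deadline);
}

int main(void)
{
    struct lock lk = { true, 100 };

    hp_rpc_start(&lk);            /* timer cannot fire during handling */
    hp_rpc_finish(&lk, 150, 50);  /* deadline pushed out to 200        */
    return 0;
}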
| Comment by Jinshan Xiong (Inactive) [ 22/Dec/11 ] |
|
I realize I can work out a workaround patch. Stay tuned. |
| Comment by Jinshan Xiong (Inactive) [ 22/Dec/11 ] |
|
Please try this patch if you're still there: http://review.whamcloud.com/#change,1893,patchset=5, plus patch: http://review.whamcloud.com/1900 |
| Comment by Christopher Morrone [ 22/Dec/11 ] |
|
That might be OK, but is it sufficient? In the server log that I supplied, the server appears to have begun handling write RPCs from the client BEFORE the conflicting lock enqueue triggered the blocking AST. So the client can't cancel the lock because it has writes outstanding, but those writes are not on the server's hpreq queue because they were sent before the client knew about the blocking AST. And since the writes are not on the hpreq queue, they don't go on the exp_queued_rpc list. In the most recent server log, extending the lock timeout after the write is "Handled" might have avoided the eviction (but maybe not; without the time extension at "Handling" time, it may have timed out before "Handled"). But what if the handling time was longer still? The lock would still have timed out. We probably need your new patch and our old patch to extend the lock timeout at AT early-reply time as well. I was thinking that we didn't need our patch anymore with your http://review.whamcloud.com/1893 patch, but now I think we need both, since messages in the normal request queue are not on the export list. |
| Comment by Jinshan Xiong (Inactive) [ 22/Dec/11 ] |
exp_queued_rpc contains all BRW, punch, and LDLM_CANCEL requests to this export (please take a look at ptlrpc_hpreq_init() and ost_hpreq_handler(), where rq_ops is assigned if the request is a BRW or punch). For the scenario you mentioned, ldlm_lock_busy() will help us prolong the lock timeout value. |
| Comment by Christopher Morrone [ 23/Dec/11 ] |
|
I see, thanks! So not all BRW will wind up on the hpreq queue, but they all are put on the per-export list. Except for a BRW that is 0 sized or OBD_BRW_SRVLOCK is set, in which case we don't care. In that case, this does seem like the correct approach. |
| Comment by Ned Bass [ 23/Dec/11 ] |
|
We got an eviction after about 4 hours. Actually two clients were evicted, one on 4 different locks, the other on one lock. Uploading lu874-2011-12-23.tar.gz which has logs for the OSS and both clients. Also included are console snippets, rpcstats, brw_stats, and jbd2 stats. |
| Comment by Jinshan Xiong (Inactive) [ 27/Dec/11 ] |
|
I didn't find a lot of useful information in the last log. Please check the new patch at: http://review.whamcloud.com/#change,1893,patchset=6 where I added more debug information. Please make sure you have the following settings before running the patch: Thanks. |
| Comment by Christopher Morrone [ 02/Jan/12 ] |
|
We are back from the holidays. Ned has put that on zwicky, and we'll let you know how it goes. |
| Comment by Christopher Morrone [ 03/Jan/12 ] |
|
The clients on hype are having non-Lustre-related problems, so I used the zwicky clients (46 or so nodes) with my own IOR workload. It ran all night, and I didn't see a single eviction. That looks like an improvement. I have started looking at the logs that Ned posted from Dec 23. It looks to me like this happens:
1324680456.750097: OST0002 sends blocking AST to client at 192.168.121.68, with 100s timeout
1324680453.739698: Client handles blocking AST
Client meanwhile has bulk reads outstanding, and continues issuing new bulk reads
1324680494.632359: Client has "Completed" last bulk read to OST0002
1324680540.738880: Client is bored, pings OST0002
1324680557.576704: Server evicts client
It is suspicious that we do not see the client sending an LDLM_CANCEL in response to the LDLM_BL_CALLBACK. Could this be a client-side problem? |
| Comment by Christopher Morrone [ 03/Jan/12 ] |
|
Hmmm, wait, I think I got confused on that timeline.
1324680165.925630 - server handles ldlm enqueue of lock 0x63c856866362b373
1324680529.576703 - first eviction of 192.168.121.68 for lock 0x63c856866362b373
Oddly, there appears to be no record of the server sending the blocking AST for this particular lock. |
| Comment by Jinshan Xiong (Inactive) [ 04/Jan/12 ] |
|
Yes, so please run the job again with my latest patch where I added more debug information. |
| Comment by Christopher Morrone [ 04/Jan/12 ] |
|
We have been. Like I said we installed that on Jan 2. I cannot currently reproduce the problem from the zwicky clients. We just got time on the hype clients again, so we will see if hype can reproduce it. |
| Comment by Christopher Morrone [ 04/Jan/12 ] |
|
Yes, we got a hit with hype. Uploading this tarball now: lu874-2012-01-04.tar.bz2 |
| Comment by Christopher Morrone [ 04/Jan/12 ] |
|
It looks much like the previous log, but a little clearer because on the server side the RPC that sends the LDLM_BL_CALLBACK is clear, as is the receipt on the client side. It certainly looks like the client fails to asynchronously run the AST handler before it is evicted. |
| Comment by Jinshan Xiong (Inactive) [ 05/Jan/12 ] |
|
Thanks, Chris. From the latest log, it looks like ldlm_bl_XX threads were all busy in canceling locks, you can make it sure by running: cat hype323-lustre-log.1325717857.3860.txt |grep ldlm_bl_ 00000100:00100000:14.0:1325717749.546146:0:3837:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073007:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717854.403072:0:3837:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073007:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717854.424827:0:3837:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073050:10.1.1.4@o2ib9:103 00000100:00100000:12.0:1325717855.078546:0:3837:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073050:10.1.1.4@o2ib9:103 jxiong@mac: lu874-2012-01-04$ cat hype323-lustre-log.1325717857.3860.txt |grep ldlm_bl_ 00000100:00100000:6.0:1325717749.535609:0:3844:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_09:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3844:1390025835073006:10.1.1.4@o2ib9:103 00000100:00100000:14.0:1325717749.546146:0:3837:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073007:10.1.1.4@o2ib9:103 00000100:00100000:5.0:1325717749.546293:0:3838:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_06:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3838:1390025835073008:10.1.1.4@o2ib9:103 00000100:00100000:7.0:1325717749.546327:0:4973:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_00:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:4973:1390025835073009:10.1.1.4@o2ib9:103 00000100:00100000:13.0:1325717749.547063:0:3839:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_07:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3839:1390025835073010:10.1.1.4@o2ib9:103 00000100:00100000:10.0:1325717749.553431:0:3843:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_08:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3843:1390025835073011:10.1.1.4@o2ib9:103 00000100:00100000:8.0:1325717749.553493:0:3835:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_03:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3835:1390025835073012:10.1.1.4@o2ib9:103 00000100:00100000:1.0:1325717749.553835:0:3836:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_04:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3836:1390025835073013:10.1.1.4@o2ib9:103 00000100:00100000:0.0:1325717749.554036:0:4974:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_01:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:4974:1390025835073014:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717749.578488:0:3859:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_12:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3859:1390025835073015:10.1.1.4@o2ib9:103 00000100:00100000:4.0:1325717749.578555:0:3858:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_11:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3858:1390025835073016:10.1.1.4@o2ib9:103 
00000100:00100000:14.0:1325717749.685542:0:3852:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_10:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3852:1390025835073017:10.1.1.4@o2ib9:103 00000100:00100000:6.0:1325717749.690087:0:3827:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_02:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3827:1390025835073018:10.1.1.4@o2ib9:103 00000100:00100000:6.0:1325717830.954193:0:3844:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_09:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3844:1390025835073006:10.1.1.4@o2ib9:103 00000100:00100000:6.0:1325717830.976901:0:3844:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_09:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3844:1390025835073042:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717854.403072:0:3837:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073007:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717854.424827:0:3837:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073050:10.1.1.4@o2ib9:103 00000100:00100000:12.0:1325717855.078546:0:3837:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073050:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717855.079104:0:3838:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_06:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3838:1390025835073008:10.1.1.4@o2ib9:103 00000100:00100000:0.0:1325717855.079116:0:4974:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_01:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:4974:1390025835073014:10.1.1.4@o2ib9:103 00000100:00100000:10.0:1325717855.079119:0:3843:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_08:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3843:1390025835073011:10.1.1.4@o2ib9:103 00000100:00100000:7.0:1325717855.079122:0:4973:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_00:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:4973:1390025835073009:10.1.1.4@o2ib9:103 00000100:00100000:13.0:1325717855.079124:0:3839:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_07:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3839:1390025835073010:10.1.1.4@o2ib9:103 00000100:00100000:1.0:1325717855.079135:0:3859:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_12:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3859:1390025835073015:10.1.1.4@o2ib9:103 00000100:00100000:0.0:1325717855.079167:0:3835:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_03:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3835:1390025835073012:10.1.1.4@o2ib9:103 00000100:00100000:4.0:1325717855.079174:0:3858:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_11:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3858:1390025835073016:10.1.1.4@o2ib9:103 00000100:00100000:6.0:1325717855.079177:0:3827:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_02:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3827:1390025835073018:10.1.1.4@o2ib9:103 00000100:00100000:14.0:1325717855.079179:0:3852:0:(client.c:1729:ptlrpc_check_set()) 
Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_10:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3852:1390025835073017:10.1.1.4@o2ib9:103
00000100:00100000:10.0:1325717855.079180:0:3836:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_04:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3836:1390025835073013:10.1.1.4@o2ib9:103
00000100:00100000:1.0:1325717855.079192:0:3844:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_09:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3844:1390025835073042:10.1.1.4@o2ib9:103
and check the timestamps. RPC type 103 is LDLM_CANCEL. This is the root cause of why it took the client so long to respond to the blocking request for DLM lock 0xcf6cea5a43ee80a. But what made the server so slow to handle the cancel request? Let's take a look at request x1390025835073007. This RPC was handled by ldlm_cn_08 on the server side. It clearly shows that it was blocked in ost_blocking_ast() because it had to write data in synchronous mode. In ost_blocking_ast():
oa->o_id = lock->l_resource->lr_name.name[0];
oa->o_seq = lock->l_resource->lr_name.name[1];
oa->o_valid = OBD_MD_FLID|OBD_MD_FLGROUP;
oinfo->oi_oa = oa;
rc = obd_sync(lock->l_export, oinfo,
lock->l_policy_data.l_extent.start,
lock->l_policy_data.l_extent.end, NULL);
if (rc)
CERROR("Error %d syncing data on lock cancel\n", rc);
This piece of code was imported in bug 16919. Unfortunately it's not easy to fix this problem. I would like to disable sync_on_lock_cancel by writing "never" into the OST's sync_on_lock_cancel proc entry, but this will cause data corruption. So the safer solution is to disable the async journal on the OST side - let's go back to the stone age. |
| Comment by Jinshan Xiong (Inactive) [ 05/Jan/12 ] |
|
Maybe we are going about fixing this whole thing the wrong way. If there were no slow-I/O problem, we wouldn't need those patches at all. We've already got stats for the storage; it would help if someone could take a look at them. |
| Comment by Christopher Morrone [ 05/Jan/12 ] |
|
Well, sure, the client is busy canceling locks. But why did the client not even call ldlm_handle_bl_callback() until after its eviction? There are only a couple of reasons that we would fail to create a new ldlm_bl_* thread to handle this work item and call ldlm_handle_bl_callback():
It doesn't look like blwi->blwi_mem_pressure should ever be set. I don't really understand that yet. Of the clients currently running, the largest count of ldlm_bl_ threads is currently 60. Since these nodes have 16 processors, the max thread count should be 128. If we had more pending ldlm_bl_ work items, why were more threads not created? |
| Comment by Christopher Morrone [ 05/Jan/12 ] |
|
Oh! Never mind, I completely see why more threads aren't starting. In the while(1) loop, if the thread is able to get work it will never try to start more threads. It is only when it has to sleep and wakes up that it tries launching more threads. So that looks like an easy fix. I'll try it out. |
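For illustration, here is a toy, single-threaded model of the flaw being described and of the obvious fix. The names and structure are made up; this is not the actual ldlm_bl thread code. The broken loop only considers spawning a helper after it has slept and woken up, so a thread that always finds work never grows the pool, while the fixed loop checks the backlog before handling each item:

#include <stdbool.h>
#include <stdio.h>

static int pending = 10;   /* queued blocking-AST work items */
static int threads = 1;    /* current pool size              */
static const int max_threads = 4;

static bool dequeue_work(void) { return pending > 0 ? (pending--, true) : false; }

static void maybe_spawn(void)
{
    if (pending > 0 && threads < max_threads) {
        threads++;
        printf("spawned helper, pool now %d\n", threads);
    }
}

static void bl_thread_broken(void)
{
    while (dequeue_work())
        ;              /* busy the whole time: never reaches the    */
    maybe_spawn();     /* spawn check that only runs after sleeping */
}

static void bl_thread_fixed(void)
{
    while (dequeue_work())
        maybe_spawn(); /* grow the pool while the backlog is deep   */
}

int main(void)
{
    bl_thread_broken();          /* pool stays at 1                 */
    printf("after broken loop: %d thread(s)\n", threads);

    pending = 10;
    bl_thread_fixed();           /* pool grows toward max_threads   */
    printf("after fixed loop: %d thread(s)\n", threads);
    return 0;
}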
| Comment by Jinshan Xiong (Inactive) [ 05/Jan/12 ] |
|
it called ldlm_handle_bl_callback() by thread ldlm_bl_05 at the client side: jxiong@mac: lu874-2012-01-04$ cat hype323-lustre-log.1325717857.3860.txt |grep 0xcd56b148db513809 00010000:00000040:8.0:1325717749.305482:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:8.0:1325717749.306337:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.308916:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.309211:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.309477:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.309732:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.309995:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.311192:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.311472:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.311749:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.312027:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.312311:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.312639:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.313815:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.314657:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00000040:15.0:1325717749.319334:0:4956:0:(ldlm_lock.c:1896:ldlm_lock_dump()) -- Lock dump: ffff880281e9f6c0/0xcd56b148db513809 (rc: 2) (pos: 25) (pid: 3792) 00010000:00010000:2.0:1325717854.403101:0:3837:0:(ldlm_lockd.c:1487:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lc3-OST0004-osc-ffff880410c85800 lock: ffff880281e9f6c0/0xcd56b148db513809 lrc: 3/0,0 mode: PW/PW res: 14953882/0 rrc: 70 type: EXT [30198988800->30232543231] (req 30198988800->30200037375) flags: 0x10100000 remote: 0xcf6cea5a43ee80a expref: -99 pid: 3792 timeout 0 00010000:00010000:2.0:1325717854.424783:0:3837:0:(ldlm_request.c:1030:ldlm_cli_cancel_local()) ### client-side cancel ns: lc3-OST0004-osc-ffff880410c85800 lock: ffff880281e9f6c0/0xcd56b148db513809 lrc: 4/0,0 mode: PW/PW res: 14953882/0 rrc: 70 type: EXT [30198988800->30232543231] (req 30198988800->30200037375) flags: 
0x10102010 remote: 0xcf6cea5a43ee80a expref: -99 pid: 3792 timeout 0 00000020:00000040:2.0:1325717854.424808:0:3837:0:(lustre_handles.c:135:class_handle_unhash_nolock()) removing object ffff880281e9f6c0 with handle 0xcd56b148db513809 from hash 00010000:00010000:2.0:1325717854.424817:0:3837:0:(ldlm_request.c:1088:ldlm_cancel_pack()) ### packing ns: lc3-OST0004-osc-ffff880410c85800 lock: ffff880281e9f6c0/0xcd56b148db513809 lrc: 2/0,0 mode: --/PW res: 14953882/0 rrc: 70 type: EXT [30198988800->30232543231] (req 30198988800->30200037375) flags: 0x30302090 remote: 0xcf6cea5a43ee80a expref: -99 pid: 3792 timeout 0 00010000:00010000:12.0:1325717855.109909:0:3837:0:(ldlm_lockd.c:1509:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lc3-OST0004-osc-ffff880410c85800 lock: ffff880281e9f6c0/0xcd56b148db513809 lrc: 1/0,0 mode: --/PW res: 14953882/0 rrc: 59 type: EXT [30198988800->30232543231] (req 30198988800->30200037375) flags: 0x30302090 remote: 0xcf6cea5a43ee80a expref: -99 pid: 3792 timeout 0 00010000:00010000:12.0:1325717855.109912:0:3837:0:(ldlm_lock.c:198:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lc3-OST0004-osc-ffff880410c85800 lock: ffff880281e9f6c0/0xcd56b148db513809 lrc: 0/0,0 mode: --/PW res: 14953882/0 rrc: 59 type: EXT [30198988800->30232543231] (req 30198988800->30200037375) flags: 0x30302090 remote: 0xcf6cea5a43ee80a expref: -99 pid: 3792 timeout 0 cookie 0xcd56b148db513809 is the client dlm cookie caused eviction on the server side. Process 3837 is ldlm_bl_05. However, it run ldlm_handle_bl_callback() at 1325717854.403101 the server has already evicted this client. ldlm_bl_05 was busy in canceling 0xcd56b148db5137df and RPC x1390025835073007 was the LDLM_CANCEL request: jxiong@mac: lu874-2012-01-04$ cat hype323-lustre-log.1325717857.3860.txt |grep x1390025835073007 00000100:00000040:14.0:1325717749.546139:0:3837:0:(lustre_net.h:1649:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc" req@ffff880774a7d800 x1390025835073007/t0(0) o-1->lc3-OST0004_UUID@10.1.1.4@o2ib9:17/18 lens 296/192 e 0 to 0 dl 0 ref 2 fl New:N/ffffffff/ffffffff rc 0/-1 00000100:00000040:14.0:1325717749.546155:0:3837:0:(niobuf.c:681:ptl_send_rpc()) @@@ send flg=0 req@ffff880774a7d800 x1390025835073007/t0(0) o-1->lc3-OST0004_UUID@10.1.1.4@o2ib9:17/18 lens 296/192 e 0 to 0 dl 1325717822 ref 3 fl Rpc:N/ffffffff/ffffffff rc 0/-1 00000100:00000040:6.0:1325717749.546359:0:4541:0:(client.c:2168:__ptlrpc_req_finished()) @@@ refcount now 2 req@ffff880774a7d800 x1390025835073007/t0(0) o-1->lc3-OST0004_UUID@10.1.1.4@o2ib9:17/18 lens 296/192 e 0 to 0 dl 1325717822 ref 3 fl Rpc:N/ffffffff/ffffffff rc 0/-1 00000100:00000040:4.0:1325717854.403018:0:4554:0:(events.c:163:reply_in_callback()) @@@ reply in flags=0 mlen=192 offset=192 replen=192 req@ffff880774a7d800 x1390025835073007/t0(0) o-1->lc3-OST0004_UUID@10.1.1.4@o2ib9:17/18 lens 296/192 e 1 to 0 dl 1325717914 ref 2 fl Rpc:RN/ffffffff/ffffffff rc 0/-1 00000100:00000040:2.0:1325717854.403064:0:3837:0:(lustre_net.h:1649:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Interpret" req@ffff880774a7d800 x1390025835073007/t0(0) o-1->lc3-OST0004_UUID@10.1.1.4@o2ib9:17/18 lens 296/192 e 1 to 0 dl 1325717914 ref 2 fl Rpc:RN/ffffffff/ffffffff rc 0/-1 00000100:00000040:2.0:1325717854.403068:0:3837:0:(lustre_net.h:1649:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff880774a7d800 x1390025835073007/t0(0) o-1->lc3-OST0004_UUID@10.1.1.4@o2ib9:17/18 lens 296/192 e 1 to 0 dl 1325717914 ref 2 fl Interpret:RN/ffffffff/ffffffff rc 
0/-1 00000100:00000040:2.0:1325717854.403074:0:3837:0:(client.c:2168:__ptlrpc_req_finished()) @@@ refcount now 1 req@ffff880774a7d800 x1390025835073007/t0(0) o-1->lc3-OST0004_UUID@10.1.1.4@o2ib9:17/18 lens 296/192 e 1 to 0 dl 1325717914 ref 2 fl Complete:RN/ffffffff/ffffffff rc 0/-1 00000100:00000040:2.0:1325717854.403076:0:3837:0:(client.c:2168:__ptlrpc_req_finished()) @@@ refcount now 0 req@ffff880774a7d800 x1390025835073007/t0(0) o-1->lc3-OST0004_UUID@10.1.1.4@o2ib9:17/18 lens 296/192 e 1 to 0 dl 1325717914 ref 1 fl Complete:RN/ffffffff/ffffffff rc 0/-1 look at the timestamp, it started to send the rpc at 1325717749.546139, and it was finished at 1325717854.40306. (Immediately it then started to handle the dlm lock caused the eviction). Then I checked all other ldlm_bl_ threads and found they were in similar situation at that time: jxiong@mac: lu874-2012-01-04$ cat hype323-lustre-log.1325717857.3860.txt |grep ldlm_bl_ 00000100:00100000:6.0:1325717749.535609:0:3844:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_09:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3844:1390025835073006:10.1.1.4@o2ib9:103 00000100:00100000:14.0:1325717749.546146:0:3837:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073007:10.1.1.4@o2ib9:103 00000100:00100000:5.0:1325717749.546293:0:3838:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_06:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3838:1390025835073008:10.1.1.4@o2ib9:103 00000100:00100000:7.0:1325717749.546327:0:4973:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_00:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:4973:1390025835073009:10.1.1.4@o2ib9:103 00000100:00100000:13.0:1325717749.547063:0:3839:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_07:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3839:1390025835073010:10.1.1.4@o2ib9:103 00000100:00100000:10.0:1325717749.553431:0:3843:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_08:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3843:1390025835073011:10.1.1.4@o2ib9:103 00000100:00100000:8.0:1325717749.553493:0:3835:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_03:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3835:1390025835073012:10.1.1.4@o2ib9:103 00000100:00100000:1.0:1325717749.553835:0:3836:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_04:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3836:1390025835073013:10.1.1.4@o2ib9:103 00000100:00100000:0.0:1325717749.554036:0:4974:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_01:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:4974:1390025835073014:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717749.578488:0:3859:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_12:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3859:1390025835073015:10.1.1.4@o2ib9:103 00000100:00100000:4.0:1325717749.578555:0:3858:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_11:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3858:1390025835073016:10.1.1.4@o2ib9:103 00000100:00100000:14.0:1325717749.685542:0:3852:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_10:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3852:1390025835073017:10.1.1.4@o2ib9:103 
00000100:00100000:6.0:1325717749.690087:0:3827:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_02:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3827:1390025835073018:10.1.1.4@o2ib9:103 00000100:00100000:6.0:1325717830.954193:0:3844:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_09:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3844:1390025835073006:10.1.1.4@o2ib9:103 00000100:00100000:6.0:1325717830.976901:0:3844:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_09:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3844:1390025835073042:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717854.403072:0:3837:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073007:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717854.424827:0:3837:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073050:10.1.1.4@o2ib9:103 00000100:00100000:12.0:1325717855.078546:0:3837:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_05:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3837:1390025835073050:10.1.1.4@o2ib9:103 00000100:00100000:2.0:1325717855.079104:0:3838:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_06:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3838:1390025835073008:10.1.1.4@o2ib9:103 00000100:00100000:0.0:1325717855.079116:0:4974:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_01:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:4974:1390025835073014:10.1.1.4@o2ib9:103 00000100:00100000:10.0:1325717855.079119:0:3843:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_08:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3843:1390025835073011:10.1.1.4@o2ib9:103 00000100:00100000:7.0:1325717855.079122:0:4973:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_00:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:4973:1390025835073009:10.1.1.4@o2ib9:103 00000100:00100000:13.0:1325717855.079124:0:3839:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_07:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3839:1390025835073010:10.1.1.4@o2ib9:103 00000100:00100000:1.0:1325717855.079135:0:3859:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_12:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3859:1390025835073015:10.1.1.4@o2ib9:103 00000100:00100000:0.0:1325717855.079167:0:3835:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_03:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3835:1390025835073012:10.1.1.4@o2ib9:103 00000100:00100000:4.0:1325717855.079174:0:3858:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_11:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3858:1390025835073016:10.1.1.4@o2ib9:103 00000100:00100000:6.0:1325717855.079177:0:3827:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_02:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3827:1390025835073018:10.1.1.4@o2ib9:103 00000100:00100000:14.0:1325717855.079179:0:3852:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_10:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3852:1390025835073017:10.1.1.4@o2ib9:103 00000100:00100000:10.0:1325717855.079180:0:3836:0:(client.c:1729:ptlrpc_check_set()) 
Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_04:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3836:1390025835073013:10.1.1.4@o2ib9:103 00000100:00100000:1.0:1325717855.079192:0:3844:0:(client.c:1729:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_09:baa53be5-81cb-9b8a-be0b-8e5f8468d1d4:3844:1390025835073042:10.1.1.4@o2ib9:103
Check the process names, process IDs, and timestamps and you will see it. Then I searched for and tracked x1390025835073007 in the server log, and drew that conclusion. |
| Comment by Jinshan Xiong (Inactive) [ 05/Jan/12 ] |
Indeed, that implementation is buggy, and it's worth trying to fix it. However, I suspect we will always fall into this situation no matter how many threads you have. |
| Comment by Christopher Morrone [ 06/Jan/12 ] |
Granted, this will just increase the number of simultaneous ASTs that we can handle. But with a handling capacity of 128, we may never hit the limit in practice. I agree that a long-term solution would be nice. Perhaps the server should extend ALL lock timeouts for a client that is actively canceling ANY locks. If it is making progress, then we should give it more time. But in the short term, we set ldlm_num_threads to 128 yesterday at midday (easier than changing the version of Lustre). We have not seen a single lock timeout eviction since making that change. Today we will install my patch to fix dynamic thread creation, and I expect to see the same results. |
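Purely as an illustration of the suggestion above (the names are invented and this is only a sketch of the idea, not any of the patches discussed here), a server-side policy that refreshes the deadlines of all of a client's waiting locks whenever that client has a cancel in flight might look roughly like this:

#include <stdio.h>

#define MAX_LOCKS 4

struct client {
    int cancels_in_flight;        /* LDLM_CANCELs currently being processed */
    long deadline[MAX_LOCKS];     /* per-lock expiry times                  */
    int nlocks;
};

static void refresh_if_canceling(struct client *cl, long now, long grace)
{
    if (cl->cancels_in_flight == 0)
        return;                   /* idle client: let the timers run out    */
    for (int i = 0; i < cl->nlocks; i++)
        if (cl->deadline[i] < now + grace)
            cl->deadline[i] = now + grace;   /* making progress: more time  */
}

int main(void)
{
    struct client cl = { 1, { 100, 120, 500 }, 3 };

    refresh_if_canceling(&cl, 150, 100);
    for (int i = 0; i < cl.nlocks; i++)
        printf("lock %d deadline %ld\n", i, cl.deadline[i]);
    return 0;
}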
| Comment by Christopher Morrone [ 06/Jan/12 ] |
|
Jinshan, if there is anything that you want to do to finalize patches 1893 and 1900, now is the time. We will be putting those in a chaos tag soon. |
| Comment by Jinshan Xiong (Inactive) [ 06/Jan/12 ] |
|
Sure, thanks for the reminder. I'd like to try patch http://review.whamcloud.com/#change,1918 because this is a production patch. It passed the acceptance test, but inspection has not even started. Can you please share your patch with me so that I can bind them together? |
| Comment by Christopher Morrone [ 06/Jan/12 ] |
|
Ok, I need to sanity check it first, but I'll push it up today. It looks like 1918 replaces 1893, and then we'll need 1900 refreshed to work with 1918, correct? |
| Comment by Jinshan Xiong (Inactive) [ 06/Jan/12 ] |
|
I'll push a new patch for 1918 so that 1900 can be used intact. |
| Comment by Jinshan Xiong (Inactive) [ 06/Jan/12 ] |
|
I pushed a new patch at http://review.whamcloud.com/#change,1918 which includes the change for growing bl threads correctly. Also, this patch can work with patch set 1 of 1900. Please take a look. |
| Comment by Christopher Morrone [ 06/Jan/12 ] |
|
My patch is in http://review.whamcloud.com/1926 |
| Comment by Christopher Morrone [ 09/Jan/12 ] |
|
Things appear to be much, much better now. We only had two evictions over the past 2+ days. I am uploading lu874-2012-01-09.tar.bz2. In this log, the client handles the blocking ast 1326115272.450122:0:5024:0:(service.c:1711:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb_01:LOV_OSC_UUID+5:4132:x1390296971148876:12345-10.1.1.3@o2ib9:104 1326115272.450126:0:5024:0:(ldlm_lockd.c:2007:ldlm_callback_handler()) blocking ast 1326115272.450155:0:5024:0:(service.c:1758:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb_01:LOV_OSC_UUID+4:4132:x1390296971148876:12345-10.1.1.3 Less than three seconds later an ldlm_bl thread is able to service the new work item 1326115275.066467:0:9320:0:(ldlm_lockd.c:1487:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lc3-OST0000-osc-ffff880421e00400 lock: ffff88041412f900/0x97ca182c5e08889d lrc: 3/0,0 mode: PW/PW res: 93945984/0 rrc: 13 type: EXT [37446746112->39191576575] (req 37446746112->37447794687) flags: 0x10120010 remote: 0xc9b7d80519c6fdf expref: -99 pid: 25677 timeout 0 1326115275.066478:0:9320:0:(ldlm_lockd.c:1500:ldlm_handle_bl_callback()) Lock ffff88041412f900 already unused, calling callback (ffffffffa081eba0) 1326115275.066497:0:9320:0:(cl_lock.c:143:cl_lock_trace0()) cancel lock: ffff880362bc96d8@(1 ffff88083234cac0 1 5 0 0 0 0)(ffff88042a668200/1/1) at cl_lock_cancel():1830 But from there the trail goes quiet until the client realizes that it has been evicted, at which point it finally performs the local cancel on the lock 1326115633.438936:0:5002:0:(import.c:1277:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lc3-OST0000; in progress operations using this service will fail. 1326115633.455522:0:25813:0:(ldlm_request.c:1030:ldlm_cli_cancel_local()) ### client-side cancel ns: lc3-OST0000-osc-ffff880421e00400 lock: ffff88041412f900/0x97ca182c5e08889d lrc: 5/0,0 mode: PW/PW res: 93945984/0 rrc: 14 type: EXT [37446746112->39191576575] (req 37446746112->37447794687) flags: 0x12122c10 remote: 0xc9b7d80519c6fdf expref: -99 pid: 25677 timeout 0 1326115633.510908:0:9320:0:(ldlm_lockd.c:1509:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lc3-OST0000-osc-ffff880421e00400 lock: ffff88041412f900/0x97ca182c5e08889d lrc: 4/0,0 mode: PW/PW res: 93945984/0 rrc: 14 type: EXT [37446746112->39191576575] (req 37446746112->37447794687) flags: 0x12322c90 remote: 0xc9b7d80519c6fdf expref: -99 pid: 25677 timeout 0 The lock is supposedly "already unused, calling callback". But the LDLM_CANCEL is not sent at that time. Much of the code under that patch seems to be lacking debug messages, and I am not familiar with this part of the code. If the lock is "unused", why was it not canceled immediately? We know that the ldlm_bl thread entered cl_lock_cancel() from the messages in the logs. I am assuming that it got there by way of: ldlm_handle_bl_callback() But note that the cl_lock_cancel is for a DIFFERENT lock. So it seems like the client gets derailed into canceling another lock when it should really, under this situation, just send the LDLM_CANCEL for the lock it already knows is finished. Unless it can't for some reason. That is as far as I have gotten. I don't know this area of code, so an explanation would be appreciated. |
| Comment by Christopher Morrone [ 09/Jan/12 ] |
|
cl_lock_cancel0() is talking about the associated cl_lock, which is not the same as an ldlm_lock. This is still clear as mud beyond that though. |
| Comment by Jinshan Xiong (Inactive) [ 10/Jan/12 ] |
If the lock is "unused", why was it not canceled immediately? If the lock is unused, it means there is no process holding this lock at the client side, this lock is able to be canceled. Before canceling this lock, the client has to evict pages covered by this lock. We should write the dirty pages back and them tear down them. I'll post my findings from this log soon. |
| Comment by Jinshan Xiong (Inactive) [ 10/Jan/12 ] |
|
From the client log, DLM callback process 9320 was assigned to cancel this lock, so it tried to tear down the cached pages covered by this lock.
00000008:00000040:1.0:1326115275.072091:0:9320:0:(osc_io.c:133:osc_io_submit()) 16128 1
00000008:00000002:1.0:1326115275.073863:0:9320:0:(osc_request.c:2761:osc_check_rpcs()) loi ready 1 wr 5632:1 rd 0:0 9 in flight
00000008:00000040:1.0:1326115275.073865:0:9320:0:(osc_io.c:227:osc_io_submit()) 16128/0 0
00000008:00000002:1.0:1326115275.074197:0:9320:0:(osc_request.c:3139:osc_teardown_async_page()) loi ready 1 wr 5632:1 rd 0:0 oap ffff880432696bf0 page ffffea000c4c2ba8 torn down
....
This is the first round: it collected all the unlocked pages, 16128 pages in total. Then it was blocked for a long while, I believe by a locked page, so it had to wait for that page to be unlocked. From the log, I guess the page was at around the last 768th page of a 5632-page queue. Also, this page must have been added by the page writeback thread, so it was already locked and ready for transfer. This points to a problem: if a lock is being canceled, we should raise the priority of those pages which have already been queued by someone else. Let's then take a look at RPC x1390296139335090 to figure out why it couldn't flush those pages in time; this RPC must have been writing pages 2816 through 2560 in the queue.
00000100:00000040:9.0:1326115282.809487:0:4909:0:(lustre_net.h:1653:ptlrpc_rqphase_move()) @@@ move req "New" -> "Interpret" req@ffff88055ed92800 x1390296139335090/t0(0) o-1->6cf0fc27-8b58-f45c-e92e-867bcd05a001@:0/0 lens 456/0 e 0 to 0 dl 1326115440 ref 1 fl New:/ffffffff/ffffffff rc 0/-1
Process 4909 was assigned to handle this RPC, and there it was slow again:
00002000:00020000:6.0:1326115609.350838:0:4909:0:(filter_io_26.c:775:filter_commitrw_write()) lc3-OST0000: slow direct_io 326s |
| Comment by Jinshan Xiong (Inactive) [ 10/Jan/12 ] |
|
Can you please try patch: http://review.whamcloud.com/#change,1938 after sanity test has passed? |
| Comment by Christopher Morrone [ 10/Jan/12 ] |
|
Thanks! We started running with the new patch included at noon today. Of course, evictions are now rare enough that if we see nothing, we will not really know whether it helped for at least a couple of days. |
| Comment by Jinshan Xiong (Inactive) [ 17/Jan/12 ] |
|
We're going to start the landing process because these patches reduce evictions. |
| Comment by Build Master (Inactive) [ 19/Jan/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Andreas Dilger [ 24/Jan/12 ] |
|
Adding this to the 2.1.1 tracker, since this has made a significant improvement on tests run with 2.1.0. |
| Comment by Jay Lan (Inactive) [ 01/Feb/12 ] |
|
Would this commit land to b2_1 tree soon? |
| Comment by Jinshan Xiong (Inactive) [ 01/Feb/12 ] |
|
I will port it to b2_1 after landing this on master. |
| Comment by Jay Lan (Inactive) [ 02/Feb/12 ] |
|
Hi Jinshan, I think it was landed on master on Jan 19? |
| Comment by Jinshan Xiong (Inactive) [ 02/Feb/12 ] |
|
There is a series of patches (4 in total) needed to solve this problem. Besides the one that landed the other day, you also need: http://review.whamcloud.com/#change,1918 There are no further changes coming to these patches. I think it's pretty safe to apply them to a production system since they have been in use at LLNL for a while.
| Comment by Jay Lan (Inactive) [ 03/Feb/12 ] |
|
Hi Jinshan, The #change,1918 caused a compilation error. You added a "static inline int ldlm_res_eq" definition to lustre_idl.h without taking the original definition out of ldlm_resource.c, so the build fails with:

/tmp.chroot/nas-2.1.0-2a/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c:423: error: redefinition of 'ldlm_res_eq'
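The general shape of that kind of break, and of the fix, is sketched below with toy names (not the actual Lustre headers): once the helper becomes static inline in the header, the out-of-line copy in the .c file has to go.

/* toy_header.h -- once a helper is made "static inline" here, the
 * out-of-line copy in the .c file must be deleted, otherwise every
 * translation unit that includes the header and still carries the old
 * definition fails with "redefinition of ...". */
#ifndef TOY_HEADER_H
#define TOY_HEADER_H

struct res_id { unsigned long lo, hi; };

static inline int res_eq(const struct res_id *a, const struct res_id *b)
{
    return a->lo == b->lo && a->hi == b->hi;
}

#endif

/* toy_resource.c -- keeps only callers; the old non-inline res_eq()
 * definition that used to live here has been removed.
 * (Header shown inline above; single file here for brevity.) */
#include <stdio.h>

int main(void)
{
    struct res_id a = { 1, 2 }, b = { 1, 2 };

    printf("equal: %d\n", res_eq(&a, &b));
    return 0;
}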
| Comment by Jinshan Xiong (Inactive) [ 03/Feb/12 ] |
|
Please use these patches for b2_1: http://review.whamcloud.com/2085 |
| Comment by Build Master (Inactive) [ 03/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 13/Feb/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : 15bd23c47332e476b985b220170fd47814ab7f14
|
| Comment by Peter Jones [ 13/Feb/12 ] |
|
All patches landed for 2.2 |
| Comment by Adam DeConinck [ 15/Feb/12 ] |
|
We are pretty sure we are encountering this problem on one of our clusters right now, and it's causing a lot of user pain. Have the patches posted by Jinshan Xiong on 03/Feb made it into b2_1 yet, or should they be applied directly? Should these patches be considered production-ready, or are they unstable enough that we should wait for a release? We'd really like to get rid of this issue asap, but obviously if the patches are still considered unstable we should wait... |
| Comment by Jinshan Xiong (Inactive) [ 15/Feb/12 ] |
|
I think they are pretty safe to be applied to a production system.
| Comment by Jay Lan (Inactive) [ 15/Feb/12 ] |
|
If you apply commit 15bd23 to the b2_1 tree, compilation will fail for the reason I stated in my comment on 03/Feb/12 1:30 PM.
| Comment by Jinshan Xiong (Inactive) [ 15/Feb/12 ] |
|
Patches for b2_1 are at (see my comment from 03/Feb/12 3:45 PM): http://review.whamcloud.com/2085
| Comment by Diego Moreno (Inactive) [ 17/Feb/12 ] |
|
Hi, are the patches for b2_1 ready for production? I see there are two patches (http://review.whamcloud.com/#change,2086 and http://review.whamcloud.com/#change,2087) that still have some failed tests on Maloo.
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = FAILURE
Oleg Drokin : 26de7f886db3724bf6af307c9863e2d723b88b68
Oleg Drokin : 15bd23c47332e476b985b220170fd47814ab7f14
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = ABORTED
Oleg Drokin : 26de7f886db3724bf6af307c9863e2d723b88b68
Oleg Drokin : 15bd23c47332e476b985b220170fd47814ab7f14
|
| Comment by Sebastien Buisson (Inactive) [ 01/Oct/12 ] |
|
Hi, During testing at CEA, we hit a new occurrence of this issue. The bad news is that the test cluster is running Lustre 2.1.2, which already contains the 4 patches from this defect.

Here is what we see with crash:
crash> ps | grep -v " IN "

The dmesg on the server is:
LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client
Lustre: ptmp-OST000e: haven't heard from client 1eb0bec5-18e1-c608-250f-9a0d61bd565f (at JO.BOB.BAL.TT@o2ib1) in 227

And on the client, the dmesg contains:
LustreError: 11-0: an error occurred while communicating with JO.BOB.BAO.AL@o2ib1. The obd_ping operation failed with
LustreError: 481:0:(import.c:326:ptlrpc_invalidate_import()) ptmp-OST000e_UUID: rc = -110 waiting for callback (1 != 0)

But the good news is that we now have a reproducer. I will attach it in my next comment.

Sebastien.
| Comment by Sebastien Buisson (Inactive) [ 01/Oct/12 ] |
|
Here is the reproducer used at CEA. The issue is quite easy to reproduce if the stripe count is 2.

To run the test:

Please note that if we remove the fstat() system call from the reproducer, the issue cannot be reproduced anymore. I hope all this information will help you investigate this issue. TIA,
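For reference, a hypothetical sketch of the access pattern just described (not the actual CEA attachment): one process streams 1 MiB writes to a shared file, ideally striped over 2 OSTs, while another loops on fstat(), which is the call that ends up issuing glimpse lock requests. The file path below is an assumption.

/* Hypothetical sketch of the described pattern, not the CEA reproducer:
 * the child keeps writing 1 MiB chunks to a shared file (create it with
 * "lfs setstripe -c 2" for a 2-stripe layout), while the parent calls
 * fstat() in a tight loop. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/lustre/lu874_testfile";
    static char buf[1 << 20];           /* 1 MiB writes */
    int fd = open(path, O_CREAT | O_WRONLY, 0644);

    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 'x', sizeof(buf));

    if (fork() == 0) {                  /* writer */
        for (int i = 0; i < 4096; i++)
            if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                perror("short write");
        _exit(0);
    }

    for (;;) {                          /* stat loop: triggers glimpses */
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); break; }
        if (waitpid(-1, NULL, WNOHANG) > 0)
            break;                      /* writer finished */
    }
    close(fd);
    return 0;
}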
| Comment by Jinshan Xiong (Inactive) [ 01/Oct/12 ] |
|
Hi Sebastien, Do you have patch a02dbe9d1b14ee67c2982f83804f9baffca59c04( |
| Comment by Sebastien Buisson (Inactive) [ 02/Oct/12 ] |
|
Hi Jinshan, The patch you are pointing to was merged just before 2.1.2-RC1, so yes it is included in our sources. Sebastien. |
| Comment by Peter Jones [ 04/Oct/12 ] |
|
Bobijam, could you please investigate the latest information supplied on this ticket? Thanks, Peter
| Comment by Zhenyu Xu [ 16/Oct/12 ] |
|
Xiong, From the backtraces of thread 10244 (reproducer) and 14196 (ptlrpcd), they were both running on CPU 4.

Thread 10244 is a write operation trying to enqueue a lovsub lock, and it is waiting for an existing conflicting lovsub lock to cancel the pages it covers.

As for thread 14196, ptlrpcd finished handling an osc lock enqueue and called osc_lock_upcall() to notify the top locks. It holds the lovsub lock's mutex, and when the call path goes through cl_lock_state_signal()->lovsub_lock_state(), it tries to get the lov lock's mutex (child mutex taken first, then parent mutex).

Xiong, does this sound like a deadlock?
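For readers less familiar with the cl_lock hierarchy, the ordering described above boils down to a classic ABBA inversion. The standalone pthreads toy below (not Lustre code; all names invented) hangs by design to show it:

/* Toy illustration of the described ordering inversion; not Lustre
 * code.  Thread A models the writer: parent ("lov") mutex first, then
 * it needs the child ("lovsub") to make progress.  Thread B models the
 * ptlrpcd upcall: child mutex first, then the parent.  With the sleeps
 * below the two interleave and the program hangs -- that is the point. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t parent = PTHREAD_MUTEX_INITIALIZER;  /* "lov"    */
static pthread_mutex_t child  = PTHREAD_MUTEX_INITIALIZER;  /* "lovsub" */

static void *writer(void *arg)
{
    pthread_mutex_lock(&parent);        /* enqueue path: top lock first */
    sleep(1);                           /* let the upcall grab the child */
    printf("writer: waiting for child\n");
    pthread_mutex_lock(&child);         /* blocks forever */
    pthread_mutex_unlock(&child);
    pthread_mutex_unlock(&parent);
    return arg;
}

static void *upcall(void *arg)
{
    pthread_mutex_lock(&child);         /* bottom lock already held */
    sleep(1);
    printf("upcall: waiting for parent\n");
    pthread_mutex_lock(&parent);        /* blocks forever: ABBA deadlock */
    pthread_mutex_unlock(&parent);
    pthread_mutex_unlock(&child);
    return arg;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, writer, NULL);
    pthread_create(&b, NULL, upcall, NULL);
    pthread_join(a, NULL);              /* never returns */
    pthread_join(b, NULL);
    return 0;
}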
| Comment by Jinshan Xiong (Inactive) [ 19/Oct/12 ] |
|
Hi Bobi, yes, this is a deadlock case; we should release the parent lock before waiting for the conflicting lock.
| Comment by Zhenyu Xu [ 19/Oct/12 ] |
|
The master patch is tracked at http://review.whamcloud.com/4306 Patch description: we should not get cl_lock mutexes from top to bottom directly; see the comments above cl_lock::cll_guard.
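A minimal sketch of the general pattern behind such a fix, assuming the usual approach of never sleeping on the child's work while the parent mutex is held; this is only an illustration with invented names, not the code in change #4306:

/* One common way out of the inversion above: never block while holding
 * the parent -- drop the parent first, wait for the conflicting state
 * to clear, then re-take the parent and re-check, since the state may
 * have changed in the meantime. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t parent = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t child_state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  child_idle = PTHREAD_COND_INITIALIZER;
static bool child_busy;                 /* "conflicting lock still flushing" */

static void enqueue_with_retry(void)
{
    for (;;) {
        pthread_mutex_lock(&parent);
        pthread_mutex_lock(&child_state_lock);
        if (!child_busy) {              /* safe to proceed under both */
            /* ... do the enqueue work here ... */
            pthread_mutex_unlock(&child_state_lock);
            pthread_mutex_unlock(&parent);
            return;
        }
        /* Conflict: release the parent BEFORE sleeping, so whoever is
         * finishing the child's work can take it without deadlocking. */
        pthread_mutex_unlock(&parent);
        while (child_busy)
            pthread_cond_wait(&child_idle, &child_state_lock);
        pthread_mutex_unlock(&child_state_lock);
        /* Loop: re-take the parent and re-validate from scratch. */
    }
}

int main(void)
{
    enqueue_with_retry();               /* fast path: child not busy */
    return 0;
}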
| Comment by Jinshan Xiong (Inactive) [ 07/Nov/12 ] |
|
After taking a closer look, I think this issue is more about getting stuck at page flush. See process 10243:
It was waiting for the write to finish while holding the cl_lock mutex. This is why you can see other processes blocking on the cl_lock mutex. Then, when the client was evicted, the write request was still on the sending list, which means it had not yet received a reply from the OST.

Can you please take a look at the OST log to check whether there is any information about RPC x1414088799797679? If the patches work, this request should be put onto the high-priority RPC list, and the lock's expiration timeout should be renewed.
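For context, the intended server-side behaviour could be sketched roughly as follows; all names are invented and this is a concept sketch, not the actual OST request handler:

/* Concept sketch: when a bulk write arrives for a resource whose lock
 * already has a cancellation pending, the request is moved to a
 * high-priority queue and the lock's callback timer is pushed out, so
 * the server does not evict a client that is busily flushing exactly
 * the pages the cancel is waiting for. */
#include <stdbool.h>
#include <time.h>

struct rpc {
    unsigned long resource_id;
    struct rpc *next;
};

struct lock {
    unsigned long resource_id;
    bool cancel_pending;
    time_t deadline;                    /* callback timer expiry */
};

static struct rpc *normal_q, *hiprio_q;

static void enqueue(struct rpc **q, struct rpc *r)
{
    r->next = *q;
    *q = r;
}

/* Called when a write RPC "r" is received while lock "lk" covers it. */
static void classify_write(struct rpc *r, struct lock *lk, int extend_secs)
{
    if (lk->cancel_pending && lk->resource_id == r->resource_id) {
        enqueue(&hiprio_q, r);                    /* serve it first   */
        lk->deadline = time(NULL) + extend_secs;  /* renew the timer  */
    } else {
        enqueue(&normal_q, r);
    }
}

int main(void)
{
    struct rpc  w  = { 42, 0 };
    struct lock lk = { 42, true, 0 };

    classify_write(&w, &lk, 100);
    return hiprio_q == &w ? 0 : 1;      /* sanity check */
}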
| Comment by Jinshan Xiong (Inactive) [ 07/Nov/12 ] |
|
Hi Sebastien Buisson, can you please rerun the test and collect and provide us the logs? Thanks in advance.
| Comment by Diego Moreno (Inactive) [ 08/Nov/12 ] |
|
Jinshan, the patch at http://review.whamcloud.com/#change,4306 hasn't been approved yet. Despite that, do you think it's safe to give it a try in production?
| Comment by Jinshan Xiong (Inactive) [ 08/Nov/12 ] |
|
Please don't try that one. I just want to understand the problem better. |
| Comment by Lustre Bull [ 12/Nov/12 ] |
|
Hi, All available logs from the OST were provided on October 1st (server dmesg), and unfortunately there is no mention of RPC x1414088799797679. I think you can try to rerun the reproducer yourself if you need information that our logs lack. Cheers,
| Comment by Bruno Faccini (Inactive) [ 27/Nov/12 ] |
|
Jinshan, The Lustre-2.3 stack (taken via Alt+SysRq+T) is just a little bit different at the end/top (after the osc_lock_flush() call), due to the client write-cache feature, I assume:

So the run/task is stuck forever, and when we try to reboot the Lustre-2.3 Node/Client we get:

Call Trace: [<ffffffffa0e0cf17>] lbug_with_loc+0x47/0xb0 [libcfs]
Kernel panic - not syncing: LBUG

When this occurs we can find the same message sequence in the Client syslog ... But the reproducer does not trigger very reliably in such a configuration, so I will try to confirm/reproduce with a full Lustre-2.1 config. Will let you know.
| Comment by Bruno Faccini (Inactive) [ 27/Nov/12 ] |
|
Hmm, forget the 2nd part of my earlier comment about the LBUG upon restart, "(cl_lock.c:1960:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed"; I think I triggered it because I was using a Client running a Lustre-2.3 build from Johann and for
| Comment by Jinshan Xiong (Inactive) [ 27/Nov/12 ] |
|
Hi Bruno, the hang at osc_extent_wait() should have been fixed with patches: 300526627be80c16be05dd1abd9a05b563dc0763 of |
| Comment by Bruno Faccini (Inactive) [ 30/Nov/12 ] |
|
Hello Jinshan, |
| Comment by Bruno Faccini (Inactive) [ 30/Nov/12 ] |
|
And BTW, I just went to Toro to see how my current reproducer session was doing, and I discovered that it triggered after more than a full day of running the reproducer with the following configuration:
_ client-17 as the Server for MDT+2xOSTs (v2.1 lustre-reviews build 10648)

And again the same situation/stack and msgs on the concerned Client:
.....
LustreError: 24977:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0001_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
.....
reproducer S 0000000000000000 0 24964 24962 0x00000080
===========================================

I still haven't been able to run with a full Lustre-2.1 Clients/Servers config, due to Toro's current over-booking and because I am not yet able to run my own builds on Toro's VMs (but I am working on that!!). Do you want to have a look at the current situation I left as-is on client-12vm[1-4], client-[15,17]?
| Comment by Bruno Faccini (Inactive) [ 30/Nov/12 ] |
|
Tell me because I need to give back my reservations on Toro ... |
| Comment by Jinshan Xiong (Inactive) [ 30/Nov/12 ] |
|
It looks like the client was evicted and then hung at unregistering the reply; this meant the write RPC couldn't be errored out, so it was stuck there. It shows two problems here:

Thanks,
| Comment by Jinshan Xiong (Inactive) [ 30/Nov/12 ] |
|
FYI: this problem happened on client-12vm4... |
| Comment by Doug Oucharek (Inactive) [ 30/Nov/12 ] |
|
I just looked at the logs on client-12vm4. Unfortunately, there are no LNet logs. If this can be easily reproduced |
| Comment by Jinshan Xiong (Inactive) [ 30/Nov/12 ] |
|
Hi Bruno, can you please run the test again and turn on neterror? |
| Comment by Bruno Faccini (Inactive) [ 04/Dec/12 ] |
|
Ok it is running as you requested, I will let you know as soon as it triggers ... |
| Comment by Bruno Faccini (Inactive) [ 06/Dec/12 ] |
|
Hmm, it looks not so easy to reproduce; the reproducer has actually been running/looping for more than 2 days now ... I am trying to introduce some disturbance, like umount/mount and "echo 3 > drop_caches", between each run.
| Comment by Bruno Faccini (Inactive) [ 10/Dec/12 ] |
Ok, hang/problem reproduced with the same configuration on "Dec 7 05:41:25", and this time Client-12vm3 is the node where the task hangs:

bruno@brent:~$ pdsh -l root -w client-15,client-17,client-12vm[1-4] ps -ef\|grep reprod | grep -v grep
client-15: root 17903 17902 99 Dec07 ? 2-21:45:43 /home/bruno/LU-874/reproducer
client-15: root 17904 17902 99 Dec07 ? 2-21:45:40 /home/bruno/LU-874/reproducer
client-15: root 17905 17902 99 Dec07 ? 2-21:42:27 /home/bruno/LU-874/reproducer
client-17: root 27924 6700 0 Dec07 pts/0 00:00:05 mpirun -bynode -n 15 --mca orte_rsh_agent rsh:ssh -H client-15 -H client-12vm1 -H client-12vm2 -H client-12vm3 -H client-12vm4 --mca btl_tcp_if_include eth0 /home/bruno/LU-874/reproducer
client-12vm3: root 12853 12852 49 Dec07 ? 1-10:55:51 /home/bruno/LU-874/reproducer
client-12vm3: root 12854 12852 0 Dec07 ? 00:00:18 /home/bruno/LU-874/reproducer
client-12vm3: root 12855 12852 49 Dec07 ? 1-10:55:47 /home/bruno/LU-874/reproducer
client-12vm1: root 11255 11254 33 Dec07 ? 23:17:43 /home/bruno/LU-874/reproducer
client-12vm1: root 11256 11254 33 Dec07 ? 23:17:42 /home/bruno/LU-874/reproducer
client-12vm1: root 11257 11254 33 Dec07 ? 23:16:50 /home/bruno/LU-874/reproducer
client-12vm2: root 11243 11242 33 Dec07 ? 23:17:25 /home/bruno/LU-874/reproducer
client-12vm2: root 11244 11242 33 Dec07 ? 23:17:26 /home/bruno/LU-874/reproducer
client-12vm2: root 11245 11242 33 Dec07 ? 23:17:27 /home/bruno/LU-874/reproducer
client-12vm4: root 22121 22120 33 Dec07 ? 23:17:42 /home/bruno/LU-874/reproducer
client-12vm4: root 22122 22120 33 Dec07 ? 23:16:54 /home/bruno/LU-874/reproducer
client-12vm4: root 22123 22120 33 Dec07 ? 23:17:44 /home/bruno/LU-874/reproducer
bruno@brent:~$

Not sure that we got much more with +neterror, but anyway you can have a look at the situation I left as-is. The hung task stack always looks the same:
======================================
reproducer S 0000000000000000 0 12854 12852 0x00000080
ffff88003f827608 0000000000000082 ffff88003f827578 ffffffffa082f2ac
ffff88003f827588 0000000000000069 0000000000000000 ffff88007c7c27c0
ffff88007260a638 ffff88003f827fd8 000000000000fb88 ffff88007260a638
Call Trace:
[<ffffffffa082f2ac>] ? lustre_msg_get_opc+0x9c/0x110 [ptlrpc]
[<ffffffffa053577e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa0a3213f>] osc_extent_wait+0xef/0x2a0 [osc]
[<ffffffff81060250>] ? default_wake_function+0x0/0x20
[<ffffffffa0a3264c>] osc_cache_wait_range+0x35c/0x9a0 [osc]
[<ffffffffa06c796e>] ? cl_object_put+0xe/0x10 [obdclass]
[<ffffffffa0a2fd78>] ? osc_io_unplug0+0x748/0x12a0 [osc]
[<ffffffffa0a378e1>] osc_cache_writeback_range+0xd71/0x1180 [osc]
[<ffffffff81060262>] ? default_wake_function+0x12/0x20
[<ffffffffa054adb7>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
[<ffffffffa054a2f2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
[<ffffffffa06c8fcd>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
[<ffffffffa054dcdb>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs]
[<ffffffffa0a1f4d6>] osc_lock_flush+0x86/0x200 [osc]
[<ffffffffa0a1f6a9>] osc_lock_cancel+0x59/0x1a0 [osc]
[<ffffffffa06cef95>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa06d0bd6>] cl_lock_hold_release+0x1c6/0x2b0 [obdclass]
[<ffffffffa06d2727>] cl_lock_unhold+0x37/0x130 [obdclass]
[<ffffffffa0aae598>] lov_sublock_release+0x1a8/0x2a0 [lov]
[<ffffffffa0ab119b>] lov_lock_enqueue+0x2cb/0x830 [lov]
[<ffffffffa06d338c>] cl_enqueue_try+0xfc/0x310 [obdclass]
[<ffffffffa06d484d>] cl_enqueue_locked+0x6d/0x210 [obdclass]
[<ffffffffa06d54ee>] cl_lock_request+0x7e/0x280 [obdclass]
[<ffffffffa0b6f6bb>] cl_glimpse_lock+0x17b/0x4a0 [lustre]
[<ffffffffa0b6ff47>] cl_glimpse_size0+0x187/0x190 [lustre]
[<ffffffffa0b2cf62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
[<ffffffffa054a154>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffffa0b2d079>] ll_getattr_it+0x49/0x170 [lustre]
[<ffffffffa0b2d1d7>] ll_getattr+0x37/0x40 [lustre]
[<ffffffff812143e3>] ? security_inode_getattr+0x23/0x30
[<ffffffff81180571>] vfs_getattr+0x51/0x80
[<ffffffff8118082f>] vfs_fstat+0x3f/0x60
[<ffffffff81180874>] sys_newfstat+0x24/0x40
[<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Tell me when you are done so that I can return the Nodes reservation ...
| Comment by Jinshan Xiong (Inactive) [ 10/Dec/12 ] |
|
Hi Doug, can you please take a look at it? |
| Comment by Doug Oucharek (Inactive) [ 10/Dec/12 ] |
|
Took a look at the logs for client-12vm3. Bruno is right, there were no LNet logs, so packet loss was not an issue/cause here.

Question: I'm not familiar with this test, but why does ptlrpc continuously complain in the logs about the network being sluggish? I went back in the logs to just before the sluggish messages started. Here is what started the complaining:

Dec 7 05:41:25 client-12vm3 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.17@tcp. The obd_ping operation failed with -107
Dec 7 05:41:25 client-12vm3 kernel: Lustre: lustre-OST0001-osc-ffff88007c62e000: Connection to lustre-OST0001 (at 10.10.4.17@tcp) was lost; in progress operations using this service will wait for recovery to complete
Dec 7 05:41:25 client-12vm3 kernel: LustreError: 167-0: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
Dec 7 05:41:45 client-12vm3 kernel: LustreError: 12873:0:(import.c:324:ptlrpc_invalidate_import()) lustre-OST0001_UUID: rc = -110 waiting for callback (1 != 0)
Dec 7 05:41:45 client-12vm3 kernel: LustreError: 12873:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff88007c58a400 x1420507316779022/t0(0) o4->lustre-OST0001-osc-ffff88007c62e000@10.10.4.17@tcp:6/4 lens 488/448 e 0 to 0 dl 1354887590 ref 2 fl Rpc:RE/0/ffffffff rc -5/-1
Dec 7 05:41:45 client-12vm3 kernel: LustreError: 12873:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0001_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.

The sluggish messages repeat until Dec 10 04:09:37, at which time the task is detected as hung. Is this expected behaviour of this test?
| Comment by Bruno Faccini (Inactive) [ 14/Dec/12 ] |
|
Sorry Doug, you had to wait a bit for some feedback here. So yes, these are the messages that we see when we reproduce the hung situation, along with the current hung thread stack on the Client.
| Comment by Bruno Faccini (Inactive) [ 17/Dec/12 ] |
|
Jinshan, As I have been keeping Toro reservations for quite a long time for this problem/reproducer, I need to free them now due to higher-priority tests to be done on the cluster. So I decided to freeze the Nodes' state by first taking a "dump live" image of the 4 client-12vm[1-4] VMs with virsh, and then I crashed all Nodes+VMs with the Alt+SysRq sequence. All this information is now available under brent:~bruno/
| Comment by Jinshan Xiong (Inactive) [ 17/Dec/12 ] |
|
Thank you Bruno. |
| Comment by Oleg Drokin [ 23/Jan/13 ] |
|
Bruno, it would have been great if, together with the vmcore files, you'd put up the lustre modules rpm and the kernel-debuginfo so that they are useful for everybody. From my look it seems Bobijam's analysis from Oct 16 is pretty much what I would think. There's an extra piece here that explains it all: the reply for the RPC is not processed because ptlrpcd is locked up trying to get a mutex; as a result, the RPC (whose reply has arrived, based on the flags) is stuck on the sending list and cannot be taken off (also a job for the ptlrpcd that is locked up).
| Comment by Bruno Faccini (Inactive) [ 24/Jan/13 ] |
|
You are right, I had to be quick and did not save the necessary symbol material; at the time it was not intended to be of use more than a month later, and I also thought everything could easily be retrieved from the build versions (saved in the job_hang.txt file) ... But if that turns out not to be the case, you are right that it's a good experience to learn from for next time. If you want, I can reproduce it again on Toro? And according to your analysis, we need to pursue http://review.whamcloud.com/4306?
| Comment by Oleg Drokin [ 24/Jan/13 ] |
|
I am not working on this issue myself; I had a discussion with Jinshan yesterday and I believe we now agree that the reason for this particular problem is understood. The old rule that "ptlrpcd should not block" is still very important: when it is violated, all async ptlrpcd-handled RPCs are at risk of getting stuck because their replies cannot be handled.
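A toy model of that rule, with invented names rather than the real ptlrpc structures: the completion daemon only ever trylocks, so a waiter that holds the mutex while sleeping on its own reply cannot wedge the whole pipeline.

/* Toy model of "the completion thread must never block"; not ptlrpc
 * code.  The daemon delivers completions for all in-flight requests.
 * If it did a blocking lock on a mutex held by a task that is itself
 * waiting for one of those completions, everything would stop; with
 * trylock + requeue the daemon keeps making progress on the rest. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t topology_mutex = PTHREAD_MUTEX_INITIALIZER;

struct completion_work {
    int id;
    struct completion_work *next;
};

static struct completion_work *pending;

/* Daemon loop body: never pthread_mutex_lock() here -- a waiter that
 * holds topology_mutex may be sleeping until we deliver its reply. */
static void daemon_poll(void)
{
    struct completion_work **pp = &pending;

    while (*pp) {
        struct completion_work *w = *pp;

        if (pthread_mutex_trylock(&topology_mutex) == 0) {
            printf("delivered completion %d\n", w->id);
            pthread_mutex_unlock(&topology_mutex);
            *pp = w->next;              /* done, drop from the list */
        } else {
            pp = &w->next;              /* busy: skip, retry next poll */
        }
    }
}

int main(void)
{
    struct completion_work a = { 1, NULL }, b = { 2, &a };

    pending = &b;
    daemon_poll();
    return 0;
}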
| Comment by Jinshan Xiong (Inactive) [ 29/Jan/13 ] |
|
I will try to make a patch tomorrow. |
| Comment by Jinshan Xiong (Inactive) [ 29/Jan/13 ] |
|
Hi Bruno Faccini, can you please try to check if this works: http://review.whamcloud.com/5208 |
| Comment by Bruno Faccini (Inactive) [ 04/Feb/13 ] |
|
Not reproduced during the weekend; I keep trying with different Clients/Servers scenarios/configs.
| Comment by Bruno Faccini (Inactive) [ 05/Feb/13 ] |
|
Jinshan, we now have 2 more days (a total of more than 5 now!) of testing this latest patch on 2 different clusters running multiple configurations, with heavy reproducer exposure. Still no hang reproduction.
| Comment by Jinshan Xiong (Inactive) [ 05/Feb/13 ] |
|
Thank you Bruno, let's start the landing process then. |
| Comment by Jodi Levi (Inactive) [ 06/Feb/13 ] |
|
|
| Comment by Bruno Faccini (Inactive) [ 15/Feb/13 ] |
|
Jinshan, I am posting this here since I think it belongs to this JIRA ticket!!
| Comment by Jinshan Xiong (Inactive) [ 15/Feb/13 ] |
|
Thank you Bruno. Let's land 5208 in this case. |
| Comment by Stephen Champion [ 09/Aug/13 ] |
|
I noticed that http://review.whamcloud.com/5208 was committed to master but not to b2_1, while all of the other patches committed for |
| Comment by Bruno Faccini (Inactive) [ 16/Nov/13 ] |
|
Stephen, I just back-ported Change #5208 (patch-set #2) as part of |
| Comment by James A Simmons [ 16/Aug/16 ] |
|
Can we close this, please?