[LU-649] client receives reply to lock cancel yet server evicts it 100 seconds later Created: 30/Aug/11 Updated: 21/Mar/13 Resolved: 21/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.2.0, Lustre 1.8.8 |
| Type: | Bug | Priority: | Major |
| Reporter: | James Karellas | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | lustre 1.8.5 |
| Severity: | 3 |
| Bugzilla ID: | 24531 |
| Rank (Obsolete): | 5617 |
| Description |
|
NASA Ames is testing a cluster in preparation for upgrades. They are currently running a SLES10 SP3 variant plus Lustre 1.8.2 plus patches on the client (client: kernel 2.6.16.60-0.68.1.20100916-nasa, lustre-1.8.2-4nas_ofed151). The general problem is that they are seeing a lot of client evictions for no obvious reason. For this particular bug, here is the customer's description:
"These errors are preventing us from completing a clean migration of the
client: Jun 21 11:04:36 hw-burly1 kernel: LustreError: 11-0:
server: Jun 21 11:04:36 service178 kernel: LustreError:
I have debug logs with 'ioctl neterror net warning other dlmtrace error emerg
Here's a quick rundown of the events for this particular eviction:
00000100:00000200:21:1308679375.166321:0:9771:0:(client.c:925:ptlrpc_check_reply
(or I'm misinterpreting the message). Since the server never received the lock cancellation for the blocking AST it
I checked the ib fabric and HCAs on each end, and didn't find anything
I upped at_min from the default to 45, then 150, in case the LNet messages were
I upgraded the client to 1.8.5-4nas about an hour ago, we'll see if that helps.
Since the filesystem isn't in production, I have quite a bit of flexibility in
I looked over this as well and all I've come up with so far is that perhaps
attachment 31479 [details] bug
I'll attach the aforementioned lustre debug logs."
===see bugzilla 24531 for logs=== |
| Comments |
| Comment by Peter Jones [ 31/Aug/11 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Lai Siyao [ 01/Sep/11 ] |
|
The logs show that the client sent request x1371481946874671 and got a reply, but the server never received it (the server did receive x1371481946874670 and x1371481946874672). This is really weird! Johann, have you seen this before? |
| Comment by Johann Lombardi (Inactive) [ 02/Sep/11 ] |
|
I don't think that you are looking at the right lock. From the server log, the eviction is due to this lock:
00010000:00020000:12:1308679476.836124:0:0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.151.61.171@o2ib ns: filter-nbp1-OST004f_UUID lock: ffff810a03cb3a00/0x50a73ed8006fda0b lrc: 3/0,0 mode: PW/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0x4cdfae7ceeea95bc expref: 9364 pid: 13239 timeout 5323842234
So that's a write lock on res 233731. From the client log:
00010000:00010000:21:1308679375.165776:0:9771:0:(ldlm_request.c:953:ldlm_cancel_pack()) ### packing ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810905620200/0x4cdfae7ceeea94ce lrc: 2/0,0 mode: --/PR res: 233731/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x30302090 remote: 0x50a73ed8006fda04 expref: -99 pid: 330 timeout 0
00010000:00010000:21:1308679375.165784:0:9771:0:(ldlm_request.c:957:ldlm_cancel_pack()) 1 locks packed
00000100:00100000:21:1308679375.165789:0:9771:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_00:1ae3c145-b2f5-384f-a14f-5af93cbbb774:9771:x1371481946874671:10.151.26.34@o2ib:103
So x1371481946874671 is the cancellation of a read lock on res 233731. Let's look at what really happened to the write lock on the client side:
00010000:00010000:7:1308679375.277593:0:10078:0:(ldlm_lockd.c:1414:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,1 mode: PW/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x10120000 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
=> client receives blocking callback
00010000:00010000:7:1308679375.277601:0:10078:0:(ldlm_lockd.c:1433:ldlm_handle_bl_callback()) Lock ffff810b37fd2c00 is referenced, will be cancelled later
=> but lock is busy
00010000:00010000:4:1308679477.046560:0:335:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PW) ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,1 mode: PW/PW res: 233731/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x12120c10 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
=> user drops lock reference 102s later, that's way too much.
00010000:00010000:4:1308679477.046576:0:335:0:(ldlm_lock.c:637:ldlm_lock_decref_internal()) ### final decref done on cbpending lock ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 2/0,0 mode: PW/PW res: 233731/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x12120c10 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
00010000:00010000:18:1308679477.046622:0:9772:0:(ldlm_request.c:895:ldlm_cli_cancel_local()) ### client-side cancel ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,0 mode: PW/PW res: 233731/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x12122c10 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
00010000:00010000:18:1308679477.046635:0:9772:0:(ldlm_request.c:908:ldlm_cli_cancel_local()) not sending request (at caller's instruction)
=> we got evicted already, too late.
Let's look at why the lock was busy for more than 100s:
10000:00010000:4:1308679477.241858:0:335:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: nbp10-OST004d-osc-ffff811448d7ec00 lock: ffff810a7bf4f400/0x4cdfae7ceeec06a8 lrc: 3/1,0 mode: PR/PR res: 177671/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x100000 remote: 0xd50b3cbef74e49d3 expref: -99 pid: 330 timeout 0
00010000:00010000:4:1308679477.241871:0:335:0:(ldlm_lock.c:580:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: nbp10-OST0051-osc-ffff811448d7ec00 lock: ffff81089861f000/0x4cdfae7ceeec06af lrc: 2/1,0 mode: PR/PR res: 176677/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x100000 remote: 0xc9fc043754a71d5a expref: -99 pid: 330 timeout 0
00010000:0001000 ...
The process is taking multiple extent locks on different OSTs, which is quite bad since it can cause cascading evictions.
00000100:00100000:4:1308679477.242107:0:335:0:(client.c:1168:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mcp:f28f51b8-bf6c-8367-c592-0bab79ead394:335:x1371481946929889:10.151.25.202@o2ib:3
00000100:00000200:4:1308679477.242116:0:335:0:(niobuf.c:280:ptlrpc_register_bulk()) Setup bulk put-sink buffers: 1 pages 1455 bytes, xid 1371481946929889, portal 8
And it sends a bulk RPC itself, so my guess is that it is running large direct I/O. James, could you please confirm that you have jobs using direct I/O when the problem happens? What I really don't understand is that 1.8.2 uses server-side locking by default for direct I/O (unless you explicitly use LL_FILE_LOCKED_DIRECTIO in your application), so the process should not grab any extent locks. That's weird. Do you happen to use LL_FILE_LOCKED_DIRECTIO somehow? |
| Comment by James Karellas [ 29/Sep/11 ] |
|
Yes, we are using direct I/O. We don't use LL_FILE_LOCKED_DIRECTIO. The only Lustre-specific things the code uses are llapi_file_get_stripe, llapi_file_create, and the O_LOV_DELAY_CREATE flag in open. I don't know whether any of those would set that flag behind the scenes. |
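For reference, here is a minimal sketch of the striping calls the application describes, using liblustreapi. This is an illustration only, not code from the ticket: the file path, stripe parameters and buffer size are made up, and the header name varies between Lustre releases (1.8 shipped <lustre/liblustreapi.h>, newer trees use <lustre/lustreapi.h>).

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <lustre/lustreapi.h>   /* assumption: <lustre/liblustreapi.h> on 1.8 */

int main(void)
{
        const char *path = "/mnt/nbp1/testfile";   /* hypothetical path */

        /* Create the file with explicit striping: 1 MiB stripe size,
         * default starting OST (-1), 4 stripes, default pattern. */
        if (llapi_file_create(path, 1 << 20, -1, 4, 0) != 0) {
                perror("llapi_file_create");
                return 1;
        }

        /* Read the layout back.  llapi_file_get_stripe() fills a
         * struct lov_user_md followed by per-object entries, so hand it
         * a buffer with room for both (4 KiB is plenty for 4 stripes). */
        struct lov_user_md *lum = calloc(1, 4096);
        if (lum != NULL && llapi_file_get_stripe(path, lum) == 0)
                printf("stripe_count=%u stripe_size=%u\n",
                       (unsigned)lum->lmm_stripe_count,
                       (unsigned)lum->lmm_stripe_size);
        free(lum);

        /* Plain open afterwards; striping is already set, so the
         * O_LOV_DELAY_CREATE open flag (used when an application sets the
         * stripe itself after open) is not needed on this path. */
        int fd = open(path, O_RDWR);
        if (fd >= 0)
                close(fd);
        return 0;
}
```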
| Comment by Lai Siyao [ 21/Oct/11 ] |
|
Thread 335 looks to be the process doing direct reads (many OST_READ requests issued without a lock), but strangely it also tried to write the file (see below: lov_brw_check() is called, which is called from ll_prepare_write()):
00000080:00010000:4:1308679375.164792:0:335:0:(file.c:1045:ll_extent_lock()) Locking inode 75245881, start 0 end 1048575
...
00020000:00010000:4:1308679375.166276:0:335:0:(lov_request.c:176:lov_update_enqueue_set()) ### lock acquired, setting rss=0, kms=0 ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,1 mode: PW/PW res: 233731/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20000 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
00020000:00080000:4:1308679477.046499:0:335:0:(lov_obd.c:1730:lov_brw_check()) lov idx 79 inactive
00010000:00010000:4:1308679477.046576:0:335:0:(ldlm_lock.c:637:ldlm_lock_decref_internal()) ### final decref done on cbpending lock ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 2/0,0 mode: PW/PW res: 233731/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x12120c10 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
The client receives the lock cancel (blocking AST) for this lock ffff810b37fd2c00:
00010000:00010000:7:1308679375.277593:0:10078:0:(ldlm_lockd.c:1414:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,1 mode: PW/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x10120000 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
00010000:00010000:7:1308679375.277601:0:10078:0:(ldlm_lockd.c:1433:ldlm_handle_bl_callback()) Lock ffff810b37fd2c00 is referenced, will be cancelled later
Another thread, 330, is calling setattr and requests another lock that conflicts with the lock above:
00000080:00010000:6:1308679375.114742:0:330:0:(file.c:1045:ll_extent_lock()) Locking inode 75245881, start 0 end 18446744073709551615
...
00010000:00010000:6:1308679375.278215:0:330:0:(ldlm_request.c:168:ldlm_completion_ast()) ### waiting indefinitely because of NO_TIMEOUT ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810905620000/0x4cdfae7ceeea94ff lrc: 4/0,1 mode: --/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20000 remote: 0x50a73ed8006fda12 expref: -99 pid: 330 timeout 0
00010000:00010000:6:1308679477.046302:0:330:0:(ldlm_request.c:194:ldlm_completion_ast()) ### client-side enqueue waking up: destroyed ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810905620000/0x4cdfae7ceeea94ff lrc: 4/0,1 mode: --/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x2020c10 remote: 0x50a73ed8006fda12 expref: -99 pid: 330 timeout 0
00010000:00010000:6:1308679477.046318:0:330:0:(ldlm_request.c:370:failed_lock_cleanup()) ### lock was granted or failed in race ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810905620000/0x4cdfae7ceeea94ff lrc: 4/0,1 mode: --/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x2020c10 remote: 0x50a73ed8006fda12 expref: -99 pid: 330 timeout 0
...
00020000:00080000:6:1308679477.046420:0:330:0:(lov_request.c:1213:lov_prep_setattr_set()) lov idx 79 inactive
00000080:00020000:6:1308679477.046429:0:330:0:(llite_lib.c:1752:ll_setattr_raw()) obd_setattr_async fails: rc=-5
00000080:00020000:6:1308679477.046450:0:330:0:(llite_lib.c:1761:ll_setattr_raw()) extent unlock failed: err=-5
Because thread 335 didn't release lock ffff810b37fd2c00, the client was evicted, and in the end thread 330's setattr failed as well. It looks like there is a deadlock here, but I haven't found the cause yet. If this can be reproduced, it would be better to make the test simpler: one test file, with D_VFSTRACE debug enabled. |
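To make the conflicting access pattern above concrete, here is a hedged, hypothetical reproducer sketch; it is not taken from the ticket or from NASA's application, and the file path and sizes are made up. One thread mixes buffered writes with O_DIRECT reads on the same file while another thread repeatedly truncates it, which produces the kind of conflicting extent-lock and setattr traffic seen in the logs. Compile with -pthread.

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PATH  "/mnt/nbp1/mixed_io_testfile"   /* hypothetical Lustre file */
#define IOSZ  (1 << 20)                       /* 1 MiB, multiple of 512 */

static void *io_thread(void *arg)
{
        (void)arg;
        void *buf;
        if (posix_memalign(&buf, 4096, IOSZ))  /* O_DIRECT needs an aligned buffer */
                return NULL;
        memset(buf, 'x', IOSZ);

        int wfd = open(PATH, O_WRONLY | O_CREAT, 0644);   /* buffered descriptor */
        int rfd = open(PATH, O_RDONLY | O_DIRECT);        /* direct-I/O descriptor */
        for (int i = 0; i < 1000 && wfd >= 0 && rfd >= 0; i++) {
                (void)pwrite(wfd, buf, IOSZ, 0);  /* buffered write to the range */
                (void)pread(rfd, buf, IOSZ, 0);   /* direct read of the same range */
        }
        if (wfd >= 0) close(wfd);
        if (rfd >= 0) close(rfd);
        free(buf);
        return NULL;
}

static void *setattr_thread(void *arg)
{
        (void)arg;
        for (int i = 0; i < 1000; i++)
                (void)truncate(PATH, 0);   /* setattr path; the logs above show it
                                            * locking the whole file */
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        pthread_create(&t1, NULL, io_thread, NULL);
        pthread_create(&t2, NULL, setattr_thread, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}
```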
| Comment by Jay Lan (Inactive) [ 08/Dec/11 ] |
|
Oracle claimed to have a fix to this problem on BZ 24531. |
| Comment by Lai Siyao [ 09/Dec/11 ] |
|
The deadlock looks to be the same one, but the fix may not be complete. I'll leave a comment there. Besides, according to `man 2 open`:
"Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the file system correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files."
Each direct I/O causes all dirty pages to be flushed to the backend, and the logs show lots of lock conflicts; I don't see any benefit to mixing these two I/O modes, although mixing them should work correctly. |
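As an illustration of the man-page advice, here is a minimal sketch that does all I/O on the file through O_DIRECT alone, with the buffer, offset and length aligned. The 4 KiB alignment and the file path are assumptions; the real alignment requirement depends on the kernel and filesystem.

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        enum { ALIGN = 4096, LEN = 1 << 20 };   /* both multiples of 512 */
        void *buf;
        if (posix_memalign(&buf, ALIGN, LEN))   /* aligned buffer for O_DIRECT */
                return 1;
        memset(buf, 0, LEN);

        int fd = open("/mnt/nbp1/direct_only", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
                return 1;
        /* Aligned offset and length; no buffered read/write/mmap on this file. */
        (void)pwrite(fd, buf, LEN, 0);
        (void)pread(fd, buf, LEN, 0);
        close(fd);
        free(buf);
        return 0;
}
```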
| Comment by Jay Lan (Inactive) [ 09/Dec/11 ] |
|
I agree with Lai's comment that applications should avoid mixing O_DIRECT and normal I/O. |
| Comment by Lai Siyao [ 12/Dec/11 ] |
|
The review is at http://review.whamcloud.com/#change,1829 |
| Comment by Build Master (Inactive) [ 04/Jan/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Jay Lan (Inactive) [ 27/Jan/12 ] |
|
The commit landed on b1_8. Should it go to b2_1 also? It is not in the master branch either. |
| Comment by Peter Jones [ 27/Jan/12 ] |
|
Yes, this should land on master at some point. The logic is quite different on master due to CLIO, so this will require a different approach. |
| Comment by Johann Lombardi (Inactive) [ 28/Mar/12 ] |
|
There is a concern that this patch isn't right, so I'm going to revert it from b1_8 until a new version is ready. |
| Comment by Build Master (Inactive) [ 28/Mar/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Nathan Rutman [ 05/Apr/12 ] |
|
It looks like Oracle is going a different route in fixing this:
------- Comment #51 From Vladimir V. Saveliev 2012-02-07 01:46:54 -------
Yes, although there are two fixes. |
| Comment by Peter Jones [ 21/Mar/13 ] |
|
As per NASA, it is OK to close this ticket. |