[LU-649] client receives reply to lock cancel yet server evicts it 100 seconds later Created: 30/Aug/11  Updated: 21/Mar/13  Resolved: 21/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.2.0, Lustre 1.8.8

Type: Bug Priority: Major
Reporter: James Karellas Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre 1.8.5


Severity: 3
Bugzilla ID: 24531
Rank (Obsolete): 5617

 Description   

NASA AMES is testing a cluster in preparation for upgrades.

"These errors are preventing us from completing a clean migration of the
filesystem, and we're not going to put it into production until they're
eliminated. We're planning to move filesystems on six more DDN 9900s to SFA
hardware and we'd like to complete the process as soon as possible."

Right now they are running a SLES10 SP3 variant with Lustre 1.8.2 plus patches on the clients, and
CentOS 5.5 with Lustre 1.8.5 plus patches on the servers. I'll add the patch list when I get it.

client: kernel 2.6.16.60-0.68.1.20100916-nasa, lustre-1.8.2-4nas_ofed151
server: kernel 2.6.18-194.3.1.el5.20110322lustre185, lustre-1.8.5-4nas_ofed152

The general problem is that they are seeing a lot of client evictions for no obvious reason.

For this bug in particular, here is the customer's description:

"client:

Jun 21 11:04:36 hw-burly1 kernel: LustreError: 11-0:
nbp1-OST004f-osc-ffff810c48fc7000: An error occurred while communicating with
NID 10.151.26.34@o2ib; the ost_statfs operation failed with -107
Jun 21 11:04:36 hw-burly1 kernel: Lustre: nbp1-OST004f-osc-ffff810c48fc7000:
Connection to nbp1-OST004f (at 10.151.26.34@o2ib) was lost; in progress
operations using this service will wait for recovery to complete
Jun 21 11:04:36 hw-burly1 kernel: LustreError: 167-0: This client was evicted
by nbp1-OST004f; in progress operations using this service will fail.
Jun 21 11:04:37 hw-burly1 kernel: LustreError:
27834:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace
nbp1-OST004f-osc-ffff810c48fc7000 resource refcount nonzero (2) after lock
cleanup; forcing cleanup.
Jun 21 11:04:37 hw-burly1 kernel: LustreError:
330:0:(llite_lib.c:1752:ll_setattr_raw()) obd_setattr_async fails: rc=-5
Jun 21 11:04:37 hw-burly1 kernel: LustreError:
330:0:(llite_lib.c:1761:ll_setattr_raw()) extent unlock failed: err=-5
Jun 21 11:04:37 hw-burly1 kernel: LustreError:
27834:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource:
ffff81092f055500 (233731/0/0/0) (rc: 1)
Jun 21 11:04:38 hw-burly1 kernel: LustreError:
27834:0:(import.c:1279:ptlrpc_invalidate_import_thread()) dump the log upon
eviction
Jun 21 11:04:38 hw-burly1 kernel: LustreError: dumping log to
/var/dumps/lustre-log.1308679478.27834
Jun 21 11:04:38 hw-burly1 kernel: Lustre: nbp1-OST004f-osc-ffff810c48fc7000:
Connection restored to nbp1-OST004f (at 10.151.26.34@o2ib)

server:

Jun 21 11:04:36 service178 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 100s: evicting client at 10.151.61.171@o2ib ns: filter-nbp1-OST004f_UUID
lock: ffff810a03cb3a00/0x50a73ed8006fda0b lrc: 3/0,0 mode: PW/PW res: 233731/0
rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote:
0x4cdfae7ceeea95bc expref: 9364 pid: 13239 timeout 5323842234
Jun 21 11:04:36 service178 kernel: LustreError: dumping log to
/var/dumps/lustre-log.1308679476.13105

I have debug logs with 'ioctl neterror net warning other dlmtrace error emerg
ha rpctrace config console' enabled for both the client and server, which I'll
upload once I have a case number. Unfortunately ntp wasn't running on the
server at the time, so the times are a bit off. I have logs from several other
events as well, though I have yet to examine them to see if the pattern is the
same in every case.

Here's a quick rundown of the events for this particular eviction:

  • The client sends a PW lock enqueue request at 1308679375.111977 in
    x1371481946874614, which the server grants.
  • At 1308679375.164893 it wants to upgrade the lock to a PW, in
    x1371481946874670. Since it conflicts with the existing lock, the server sends
    back a blocking AST at 1308679375.958555 in x1371171161806020, which the client
    responds to.
  • The client sends the lock cancellation at 1308679375.165789
    x1371481946874671. However, there's no record of the server receiving it.
    Oddly, the client seems to think it got a response:

00000100:00000200:21:1308679375.166321:0:9771:0:(client.c:925:ptlrpc_check_reply()) @@@ REPLIED: rc = 1 for req@ffff811f4080dc00 x1371481946874671/t0 o103->nbp1-OST004f_UUID@10.151.26.34@o2ib:17/18 lens 296/384 e 0 to 1 dl 1308679413 ref 1 fl Rpc:RN/0/0 rc 0/0
00000100:00100000:21:1308679375.166329:0:9771:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_00:1ae3c145-b2f5-384f-a14f-5af93cbbb774:9771:x1371481946874671:10.151.26.34@o2ib:103

(or I'm misinterpreting the message).

Since the server never received the lock cancellation for the blocking AST it
issued, it evicts the client.

I checked the ib fabric and HCAs on each end, and didn't find anything
obviously wrong. I didn't catch any network-related errors in the Lustre logs.

I upped at_min from the default to 45, then 150, in case the LNet messages were
taking their time getting to the server. That didn't help.

I upgraded the client to 1.8.5-4nas about an hour ago, we'll see if that helps.
If it doesn't, I'm going to consider downgrading the server to a 1.8.4-based
build, which we've been running on a production filesystem for a while.

Since the filesystem isn't in production, I have quite a bit of flexibility in
terms of being able to apply patches - unlike the rest of our filesystems."

I looked over this as well, and all I've come up with so far is that perhaps attachment 31479 from bug
23352 might be of some use.

I'll attach the aforementioned Lustre debug logs.

===see bugzilla 24531 for logs===



 Comments   
Comment by Peter Jones [ 31/Aug/11 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 01/Sep/11 ]

The logs show that the client sent request x1371481946874671 and got a reply, but the server never received it (the server did receive x1371481946874670 and x1371481946874672). This is really weird!

Johann, have you seen this before?

Comment by Johann Lombardi (Inactive) [ 02/Sep/11 ]

I don't think that you are looking at the right lock.

From the server log, the eviction is due to this lock:

00010000:00020000:12:1308679476.836124:0:0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.151.61.171@o2ib  ns: filter-nbp1-OST004f_UUID lock: ffff810a03cb3a00/0x50a73ed8006fda0b lrc: 3/0,0 mode: PW/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote: 0x4cdfae7ceeea95bc expref: 9364 pid: 13239 timeout 5323842234

So that's a write lock on res 233731.

From the client log:

00010000:00010000:21:1308679375.165776:0:9771:0:(ldlm_request.c:953:ldlm_cancel_pack()) ### packing ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810905620200/0x4cdfae7ceeea94ce lrc: 2/0,0 mode: --/PR res: 233731/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x30302090 remote: 0x50a73ed8006fda04 expref: -99 pid: 330 timeout 0
00010000:00010000:21:1308679375.165784:0:9771:0:(ldlm_request.c:957:ldlm_cancel_pack()) 1 locks packed
00000100:00100000:21:1308679375.165789:0:9771:0:(client.c:2066:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_00:1ae3c145-b2f5-384f-a14f-5af93cbbb774:9771:x1371481946874671:10.151.26.34@o2ib:103

So x1371481946874671 is the cancellation of a read lock on res 233731.

Let's look at what really happened to the write lock on the client side:

00010000:00010000:7:1308679375.277593:0:10078:0:(ldlm_lockd.c:1414:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,1 mode: PW/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x10120000 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0

=> client receives blocking callback

00010000:00010000:7:1308679375.277601:0:10078:0:(ldlm_lockd.c:1433:ldlm_handle_bl_callback()) Lock ffff810b37fd2c00 is referenced, will be cancelled later

=> but lock is busy

00010000:00010000:4:1308679477.046560:0:335:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PW) ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,1 mode: PW/PW res: 233731/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x12120c10 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0

=> the user drops the lock reference 102s later, which is far too long.

00010000:00010000:4:1308679477.046576:0:335:0:(ldlm_lock.c:637:ldlm_lock_decref_internal()) ### final decref done on cbpending lock ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 2/0,0 mode: PW/PW res: 233731/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x12120c10 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
00010000:00010000:18:1308679477.046622:0:9772:0:(ldlm_request.c:895:ldlm_cli_cancel_local()) ### client-side cancel ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,0 mode: PW/PW res: 233731/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x12122c10 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
00010000:00010000:18:1308679477.046635:0:9772:0:(ldlm_request.c:908:ldlm_cli_cancel_local()) not sending request (at caller's instruction)

=> we got evicted already, too late.

Let's look at why the lock was busy for more than 100s:

10000:00010000:4:1308679477.241858:0:335:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: nbp10-OST004d-osc-ffff811448d7ec00 lock: ffff810a7bf4f400/0x4cdfae7ceeec06a8 lrc: 3/1,0 mode: PR/PR res: 177671/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x100000 remote: 0xd50b3cbef74e49d3 expref: -99 pid: 330 timeout 0
00010000:00010000:4:1308679477.241871:0:335:0:(ldlm_lock.c:580:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: nbp10-OST0051-osc-ffff811448d7ec00 lock: ffff81089861f000/0x4cdfae7ceeec06af lrc: 2/1,0 mode: PR/PR res: 176677/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x100000 remote: 0xc9fc043754a71d5a expref: -99 pid: 330 timeout 0
00010000:0001000
...

The process is taking multiple extent locks on different OSTs, which is quite bad since it can cause cascading evictions.

00000100:00100000:4:1308679477.242107:0:335:0:(client.c:1168:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mcp:f28f51b8-bf6c-8367-c592-0bab79ead394:335:x1371481946929889:10.151.25.202@o2ib:3
00000100:00000200:4:1308679477.242116:0:335:0:(niobuf.c:280:ptlrpc_register_bulk()) Setup bulk put-sink buffers: 1 pages 1455 bytes, xid 1371481946929889, portal 8

And it sends the bulk RPC itself, so my guess is that it is running large direct I/O.

James, could you please confirm that you have jobs using direct I/O when the problem happens?

What I really don't understand is that 1.8.2 uses server-side locking by default for direct I/O (unless you explicitly use LL_FILE_LOCKED_DIRECTIO in your application), so the process should not grab any extent locks. That's weird. Do you happen to use LL_FILE_LOCKED_DIRECTIO somehow?
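
For reference, an application would opt in to LL_FILE_LOCKED_DIRECTIO along the lines of the sketch below. This is a hypothetical illustration assuming the 1.8-era LL_IOC_SETFLAGS ioctl and flag definitions from lustre/lustre_user.h; the function and path handling are made up.

/* Hypothetical illustration: opting in to locked direct I/O.
 * The ioctl and flag names are assumed from the 1.8-era
 * lustre/lustre_user.h and may differ between versions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <lustre/lustre_user.h>

int open_locked_dio(const char *path)
{
        int fd, flags;

        fd = open(path, O_RDWR | O_DIRECT);
        if (fd < 0)
                return -1;

        /* ask llite to take client-side extent locks around direct I/O
         * instead of using the default server-side locking */
        flags = LL_FILE_LOCKED_DIRECTIO;
        if (ioctl(fd, LL_IOC_SETFLAGS, &flags) < 0)
                perror("LL_IOC_SETFLAGS");

        return fd;
}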

Comment by James Karellas [ 29/Sep/11 ]

Yes, we are using direct I/O. We don't use LL_FILE_LOCKED_DIRECTIO. The only Lustre-specific things in our code are llapi_file_get_stripe, llapi_file_create, and the O_LOV_DELAY_CREATE flag in open(). I don't know whether any of those would set that flag behind the scenes.
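
For context, the usage pattern described above typically looks something like the sketch below. This is illustrative only, assuming the 1.8-era liblustreapi and lustre_user.h interfaces; the paths and striping values are made up.

/* Illustrative sketch of llapi_file_create, O_LOV_DELAY_CREATE and
 * llapi_file_get_stripe usage; paths and striping values are made up. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <lustre/lustre_user.h>
#include <lustre/liblustreapi.h>

int main(void)
{
        struct lov_user_md lum = {
                .lmm_magic         = LOV_USER_MAGIC,
                .lmm_stripe_size   = 1048576,   /* 1 MiB stripes */
                .lmm_stripe_count  = 4,
                .lmm_stripe_offset = 0xffff,    /* -1: any starting OST */
        };
        struct lov_user_md *lump;
        int fd;

        /* create a file with an explicit layout via liblustreapi:
         * 1 MiB stripes, any starting OST, 4 stripes, default pattern */
        llapi_file_create("/nbp1/striped_file", 1048576, -1, 4, 0);

        /* or defer layout creation with O_LOV_DELAY_CREATE and set the
         * layout afterwards with LL_IOC_LOV_SETSTRIPE */
        fd = open("/nbp1/delayed_file",
                  O_CREAT | O_RDWR | O_LOV_DELAY_CREATE, 0644);
        if (fd >= 0)
                ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum);

        /* read a layout back; the buffer needs room for the per-stripe
         * objects (160 is assumed here as an upper bound) */
        lump = calloc(1, sizeof(*lump) +
                      160 * sizeof(struct lov_user_ost_data));
        if (lump != NULL)
                llapi_file_get_stripe("/nbp1/striped_file", lump);

        return 0;
}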

Comment by Lai Siyao [ 21/Oct/11 ]

Thread 335 looks to be the process doing direct reads (many OST_READ requests are issued without a lock), but strangely it also tried to write the file (see below: lov_brw_check() is called, which is called from ll_prepare_write()):

00000080:00010000:4:1308679375.164792:0:335:0:(file.c:1045:ll_extent_lock()) Locking inode 75245881, start 0 end 1048575
...
00020000:00010000:4:1308679375.166276:0:335:0:(lov_request.c:176:lov_update_enqueue_set()) ### lock acquired, setting rss=0, kms=0 ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,1 mode: PW/PW res: 233731/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20000 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
00020000:00080000:4:1308679477.046499:0:335:0:(lov_obd.c:1730:lov_brw_check()) lov idx 79 inactive
00010000:00010000:4:1308679477.046576:0:335:0:(ldlm_lock.c:637:ldlm_lock_decref_internal()) ### final decref done on cbpending lock ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 2/0,0 mode: PW/PW res: 233731/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x12120c10 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0

The client receives the blocking callback (cancel request) for this lock ffff810b37fd2c00:

00010000:00010000:7:1308679375.277593:0:10078:0:(ldlm_lockd.c:1414:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810b37fd2c00/0x4cdfae7ceeea95bc lrc: 3/0,1 mode: PW/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x10120000 remote: 0x50a73ed8006fda0b expref: -99 pid: 335 timeout 0
00010000:00010000:7:1308679375.277601:0:10078:0:(ldlm_lockd.c:1433:ldlm_handle_bl_callback()) Lock ffff810b37fd2c00 is referenced, will be cancelled later

Another thread, 330, is calling setattr and requests a lock that conflicts with the lock above:

00000080:00010000:6:1308679375.114742:0:330:0:(file.c:1045:ll_extent_lock()) Locking inode 75245881, start 0 end 18446744073709551615
...
00010000:00010000:6:1308679375.278215:0:330:0:(ldlm_request.c:168:ldlm_completion_ast()) ### waiting indefinitely because of NO_TIMEOUT ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810905620000/0x4cdfae7ceeea94ff lrc: 4/0,1 mode: --/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20000 remote: 0x50a73ed8006fda12 expref: -99 pid: 330 timeout 0
00010000:00010000:6:1308679477.046302:0:330:0:(ldlm_request.c:194:ldlm_completion_ast()) ### client-side enqueue waking up: destroyed ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810905620000/0x4cdfae7ceeea94ff lrc: 4/0,1 mode: --/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x2020c10 remote: 0x50a73ed8006fda12 expref: -99 pid: 330 timeout 0
00010000:00010000:6:1308679477.046318:0:330:0:(ldlm_request.c:370:failed_lock_cleanup()) ### lock was granted or failed in race ns: nbp1-OST004f-osc-ffff810c48fc7000 lock: ffff810905620000/0x4cdfae7ceeea94ff lrc: 4/0,1 mode: --/PW res: 233731/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x2020c10 remote: 0x50a73ed8006fda12 expref: -99 pid: 330 timeout 0
...
00020000:00080000:6:1308679477.046420:0:330:0:(lov_request.c:1213:lov_prep_setattr_set()) lov idx 79 inactive
00000080:00020000:6:1308679477.046429:0:330:0:(llite_lib.c:1752:ll_setattr_raw()) obd_setattr_async fails: rc=-5
00000080:00020000:6:1308679477.046450:0:330:0:(llite_lib.c:1761:ll_setattr_raw()) extent unlock failed: err=-5

Because thread 335 did not release lock ffff810b37fd2c00 in time, the client was evicted, and as a result the setattr in thread 330 failed as well.

It looks like there is a deadlock here:
thread 335:
ll_file_write() --> ll_extent_lock() --> generic_file_write() ...
thread 330:
ll_setattr_raw() --> ll_extent_lock()

But I haven't found the root cause yet. If this can be reproduced, it would be better to make the test simpler: a single test file, with D_VFSTRACE debugging enabled.
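
A minimal sketch of such a simplified reproducer is below, under the assumption that the trigger is direct-I/O writes racing with setattr/truncate on a single file, as the traces above suggest; the path, block size and iteration counts are arbitrary.

/* Hypothetical simplified reproducer: one process writes a single file
 * with O_DIRECT while another repeatedly truncates it, mimicking the
 * ll_file_write() vs ll_setattr_raw() interaction above. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK (1 << 20)   /* 1 MiB, matches the 0->1048575 extent in the logs */

int main(void)
{
        const char *path = "/nbp1/dio_vs_truncate";
        void *buf;
        int fd, i;

        posix_memalign(&buf, 4096, BLK);        /* O_DIRECT needs alignment */
        memset(buf, 0xab, BLK);

        fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0)
                return 1;

        if (fork() == 0) {
                /* child: keep truncating the file (the setattr path) */
                for (i = 0; i < 10000; i++)
                        truncate(path, 0);
                _exit(0);
        }

        /* parent: direct-I/O writes to the same file */
        for (i = 0; i < 10000; i++)
                pwrite(fd, buf, BLK, 0);

        close(fd);
        return 0;
}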

Comment by Jay Lan (Inactive) [ 08/Dec/11 ]

Oracle claimed to have a fix for this problem in BZ 24531.

Comment by Lai Siyao [ 09/Dec/11 ]

That looks to be the same deadlock, but the fix may not be complete. I'll leave a comment there.

Besides, according to `man 2 open`:

Applications should avoid mixing O_DIRECT and normal I/O to the same file, and
especially to overlapping byte regions in the same file.  Even when the file
system correctly handles the coherency issues in this situation, overall I/O
throughput is likely to be slower than using either mode alone.  Likewise,
applications should avoid mixing mmap(2) of files with direct I/O to the same
files.

Each direct I/O causes all dirty pages to be flushed to the backend, and the logs show lots of lock conflicts; I don't see any benefit in mixing these two I/O modes, though mixing them should work correctly.
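
For reference, the mixing the man page warns about looks roughly like the sketch below (illustrative only; the path and sizes are made up): one descriptor writes through the page cache while another writes the same region with O_DIRECT, so every direct write first forces the dirty pages to be flushed.

/* Illustrative only: mixing buffered and O_DIRECT I/O on the same
 * file, the pattern `man 2 open` advises against. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/nbp1/mixed_io";
        char page[4096];
        void *dio_buf;
        int buffered, direct, i;

        buffered = open(path, O_CREAT | O_RDWR, 0644);
        direct   = open(path, O_RDWR | O_DIRECT);
        posix_memalign(&dio_buf, 4096, sizeof(page));
        memset(page, 'a', sizeof(page));
        memset(dio_buf, 'b', sizeof(page));

        for (i = 0; i < 1000; i++) {
                /* buffered write dirties the page cache ... */
                pwrite(buffered, page, sizeof(page), 0);
                /* ... and the overlapping direct write forces those
                 * dirty pages to be flushed before it goes to the OST */
                pwrite(direct, dio_buf, sizeof(page), 0);
        }

        close(buffered);
        close(direct);
        return 0;
}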

Comment by Jay Lan (Inactive) [ 09/Dec/11 ]

I agree with Lai's comment that applications should avoid mixing O_DIRECT and normal I/O.
Now that the deadlock has been identified, if Oracle chose to take the easy way out, can you provide a proper fix?

Comment by Lai Siyao [ 12/Dec/11 ]

The review is at http://review.whamcloud.com/#change,1829

Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #163
LU-649 io: DIO doesn't need lock i_mutex (Revision 4ebec99e309fcafd1be8a0b10673d69a083ae87f)

Result = SUCCESS
Johann Lombardi : 4ebec99e309fcafd1be8a0b10673d69a083ae87f
Files :

  • lustre/llite/rw26.c
Comment by Jay Lan (Inactive) [ 27/Jan/12 ]

The commit landed on b1_8. Should it go to b2_1 as well? It is not in the master branch either.

Comment by Peter Jones [ 27/Jan/12 ]

Yes, this should land on master at some point. The logic is quite different on master due to CLIO, so this will require a different approach.

Comment by Johann Lombardi (Inactive) [ 28/Mar/12 ]

There is a concern that this patch isn't right, so I'm going to revert it from b1_8 until a new version is ready.

Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #183
LU-649 io: DIO doesn't need lock i_mutex (Revision 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c)

Result = SUCCESS
Johann Lombardi : 01fcfed5fcc841903012e4e6cbc8da7c2c49c38c
Files :

  • lustre/llite/rw26.c
Comment by Nathan Rutman [ 05/Apr/12 ]

It looks like Oracle is going a different route to fix this:

------- Comment #51 From Vladimir V. Saveliev 2012-02-07 01:46:54 -------
Tag: Engineer Comment
(In reply to comment #50)
> Looking at the comments in http://jira.whamcloud.com/browse/LU-649 is it reasonable to say that
> this
> problem is resolved?
>

Yes, although there are two fixes.
Whamcloud has landed their own fix. We are going to land
https://bugzilla.lustre.org/attachment.cgi?id=33151 which actually fixes the deadlock.

Comment by Peter Jones [ 21/Mar/13 ]

As per NASA, it is OK to close this ticket.
