Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.3
    • Labels: None
    • Environment: RHEL 6

    Description

      Can you confirm whether, when a client is writing to a binary file on the Lustre file system, that file is available to be read for user processing, or is it locked while being written to? If it is locked, how do we disable locking?

      Attachments

        Activity

          [LU-4827] Would like to disable file locking

          Since there are no further requests, I am going to close this issue; please reopen if you have more questions.

          cliffw Cliff White (Inactive) added a comment

          Can we do anything further for you on this issue? Did the mount options help?

          cliffw Cliff White (Inactive) added a comment
          green Oleg Drokin added a comment - edited

          As per our discussion today.

          You can mount with "mount -o nolock" to disable client-side locking for I/O on the client where you performed the mount with this option.
          Note that this mode of operation was relatively little tested in the 2.1.x releases.
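
          For instance, a client mount might look like this (a sketch only; the MGS NID and filesystem name are placeholders, assuming the standard Lustre client mount syntax):

          mount -t lustre -o nolock mgsnode@tcp0:/lustre /mnt/lustre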

          Lustre debugging is controlled by the proc file /proc/sys/lnet/debug. You can cat this file to get the current debugging mask. The Lustre messages matched by the mask are written to a special debug buffer whose size is controlled by /proc/sys/lnet/debug_mb; the default is fine for light debugging, but you might want to raise it to, say, 2048 or even 4090 if you have a lot of extensive debugging enabled.
          Note that extended debugging takes quite a bit of CPU.
          If the debug buffer proves insufficient to hold the entire run even after you raise it to 4090, there is also the option of the so-called debug daemon, where the debug buffer is flushed to disk storage (you need non-Lustre storage on the node for this mode).
          These variables are all separate per node, so you need to set them separately on every node where you plan to gather this data.
          The command "lctl dk >/some/path/somewhere.txt" collects the current debug buffer contents and writes them to the specified file (only needed if you do not use debug daemon mode).
          The debug daemon, on the other hand, is operated as "lctl debug_daemon start /path/to/file 40960" (the last number is the size of the file in megabytes). To stop the debug daemon, issue the "lctl debug_daemon stop" command.
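
          Putting it together, a typical capture on one node might look like this (a sketch using the values and paths mentioned above; adjust to your setup):

          cat /proc/sys/lnet/debug                      # show the current debugging mask
          echo 2048 >/proc/sys/lnet/debug_mb            # enlarge the debug buffer before heavy tracing
          # ... run the workload you want to trace ...
          lctl dk >/some/path/somewhere.txt             # dump the buffer to a file (non-daemon mode)
          # or, instead of lctl dk, use the debug daemon:
          lctl debug_daemon start /path/to/file 40960
          lctl debug_daemon stop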

          Good additional debug settings for initial investigations are rpctrace (to show RPC traffic), dlmtrace (to show locking operations) and vfstrace (to show VFS operations; this one is only needed on clients). To enable them, execute a command like this:
          "echo +rpctrace +dlmtrace +vfstrace >/proc/sys/lnet/debug" on your clients and "echo +rpctrace +dlmtrace >/proc/sys/lnet/debug" on your servers.

          For example, I ran the command "dd if=/dev/zero of=/mnt/lustre2/zeros bs=1024k count=10".
          Client-side log excerpts of interest:

          00000080:00200000:7.0:1396480657.791825:0:4197:0:(namei.c:568:ll_lookup_it()) VFS Op:name=zeros, dir=[0x200000007:0x1:0x0](ffff880082986a58), intent=open|creat
          00000002:00010000:7.0:1396480657.791832:0:4197:0:(mdc_locks.c:1173:mdc_intent_lock()) (name: zeros,[0x200000402:0x3:0x0]) in obj [0x200000007:0x1:0x0], intent: open|creat flags 0101102
          00010000:00010000:7.0:1396480657.791865:0:4197:0:(ldlm_request.c:988:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff88007a7ab7f0 lock: ffff880067c63db8/0xbefcd51a570cb0fd lrc: 3/0,1 mode: --/CW res: [0x200000007:0x1:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 4197 timeout: 0 lvb_type: 0
          00000100:00100000:7.0:1396480657.791875:0:4197:0:(client.c:1478:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc dd:a7681c72-f95a-a5e9-7434-cc6004e1c10d:4197:1464203148756228:0@lo:101
          ...
          00000100:00100000:7.0:1396480657.792320:0:4197:0:(client.c:1849:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc dd:a7681c72-f95a-a5e9-7434-cc6004e1c10d:4197:1464203148756228:0@lo:101
          00010000:00010000:7.0:1396480657.792340:0:4197:0:(ldlm_request.c:554:ldlm_cli_enqueue_fini()) ### client-side enqueue END (ABORTED) ns: lustre-MDT0000-mdc-ffff88007a7ab7f0 lock: ffff880067c63db8/0xbefcd51a570cb0fd lrc: 4/0,1 mode: --/CW res: [0x200000007:0x1:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 4197 timeout: 0 lvb_type: 0
          ...
          00000080:00200000:7.0:1396480657.792399:0:4197:0:(file.c:559:ll_file_open()) VFS Op:inode=[0x200000402:0x2:0x0](ffff880079dc2a58), flags 101101
          

          Fields are separated by colons. The third field is the CPU this is executed on, the fourth field is a Unix timestamp, the sixth field is the pid of the process, and then there is the file name and line number for the message, followed by the message itself.
          Here we see a file open operation being performed. It starts with a lookup (we see the filename and the intent that describes the operation to be done).
          The "ldlm_request.c:988:ldlm_cli_enqueue()) ### sending request" line is the client preparing to ask the server for a lock. This one is a metadata lock; an example of a data lock follows below.
          The "Sending RPC pname" message tells us the client is sending an RPC to the server. Along with various other useful bits of info such as the process pid and name, there is also something called the xid, which is the Lustre request id. In this case it is 1464203148756228. We can use it to find what happened to this request in the server logs. "opc" (in this case 101) is the operation code (101 = lock enqueue).
          "Completed RPC pname" is when the request reply has been received (useful for seeing how long the request took, including round-trip time; you might only be interested in checking server-side requests that took a long time to execute).
          The "client-side enqueue END (ABORTED)" line means that the server refused to provide a lock to the client (this happens frequently with metadata locks, but less so with data locks).

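          As a rough illustration of how to slice such a dump (the log file names here are placeholders), you can filter by pid or follow an xid across nodes:

          awk -F: '$6 == 4197' client-debug.txt      # everything done by the dd process (pid 4197)
          grep 1464203148756228 server-debug.txt     # the same request, found by its xid on the server
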
          The server-side counterpart of this open call (found by its xid):

          00000100:00100000:5.0F:1396480657.791989:0:3694:0:(service.c:1939:ptlrpc_server_handle_req_in()) got req x1464203148756228
          ...
          00000100:00100000:5.0:1396480657.792039:0:3694:0:(service.c:2082:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_004:a7681c72-f95a-a5e9-7434-cc6004e1c10d+7:4197:x1464203148756228:12345-0@lo:101
          ...
          00000100:00100000:5.0:1396480657.792292:0:3694:0:(service.c:2132:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_004:a7681c72-f95a-a5e9-7434-cc6004e1c10d+7:4197:x1464203148756228:12345-0@lo:101 Request procesed in 254us (357us total) trans 4294967329 rc 301/301
          

          The first line (ptlrpc_server_handle_req_in) is when the message actually arrived at the server.
          The "Handling RPC pname" line is when the message actually entered processing (these two times can be quite far apart if the server is overloaded and has amassed a large backlog of requests to serve).
          "Handled RPC pname" is when the request finished processing (and shows how long it took, not counting the initial waiting). Everything in between is the processing of this request, if you need a finer timeline (filter by pid, in this case 3694).

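          As a quick sanity check on those timestamps (a back-of-the-envelope calculation from the excerpt above; the small mismatch is just rounding):

          1396480657.792039 - 1396480657.791989 = ~50us    wait between "got req" and "Handling RPC"
          1396480657.792292 - 1396480657.792039 = ~253us   processing time, matching the reported "254us"
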
          Now we look at the client logs again:

          00000080:00200000:7.0:1396480657.799063:0:4197:0:(file.c:1138:ll_file_io_generic()) file: zeros, type: 1 ppos: 0, count: 1048576
          

          This is a write syscall writing 1M (count: 1048576) of data at offset 0 (ppos: 0); we also get to see the file name. (Actually, I just double-checked, and the Lustre version you use would not print a line like this; it was introduced in a later release.)

          00000020:00200000:7.0:1396480657.799071:0:4197:0:(cl_io.c:237:cl_io_rw_init()) header@ffff88007a713ef0[0x0, 2, [0x200000402:0x2:0x0] hash]
          00000020:00200000:7.0:1396480657.799072:0:4197:0:(cl_io.c:237:cl_io_rw_init()) io range: 1 [0, 1048576) 0 0
          00000080:00200000:7.0:1396480657.799074:0:4197:0:(vvp_io.c:1154:vvp_io_init()) [0x200000402:0x2:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
          00020000:00200000:7.0:1396480657.799087:0:4197:0:(lov_io.c:419:lov_io_iter_init()) shrink: 0 [0, 1048576)
          

          These lines tell us essentially the same info as the generic_io line:
          The "cl_io_rw_init" lines inform us that we have a write coming (the number after "range:" is 1 = write, 0 = read) and that the range runs from byte 0 to byte 1048576 (which is 1M).
          All of this will later repeat, with different offsets, 9 more times, since we asked dd to write 10 blocks of 1M each.

          After a lot more chatter, during which we find that we already have a suitable lock on this file to perform writes (it was obtained while performing the initial truncate; I am not going to cover that here), we get to the actual writes. After a lot more chatter about individual page population, we eventually accumulate enough pages to start asynchronous writeout of the dirty pages:

          00010000:00010000:7.0:1396480657.799203:0:4197:0:(ldlm_lock.c:797:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PW) ns: lustre-OST0000-osc-ffff88007a7ab7f0 lock: ffff88008bd9fdb8/0xbefcd51a570cb12e lrc: 4/0,1 mode: PW/PW res: [0x5:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020080020000 nid: local remote: 0xbefcd51a570cb135 expref: -99 pid: 4197 timeout: 0 lvb_type: 1
          ...
          00000080:00200000:7.0:1396480657.808735:0:4197:0:(vvp_io.c:717:vvp_io_write_commit()) commit async pages: 256, from 0, to 4096
          00000080:00200000:7.0:1396480657.808910:0:4197:0:(vvp_io.c:739:vvp_io_write_commit()) Committed 256 pages 1048576 bytes, tot: 1048576
          00000080:00200000:7.0:1396480657.808913:0:4197:0:(vvp_io.c:823:vvp_io_write_start()) write: nob 1048576, result: 1048576
          00000080:00200000:7.0:1396480657.809492:0:4197:0:(vvp_io.c:153:vvp_io_fini()) [0x200000402:0x2:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
          00000080:00200000:7.0:1396480657.809493:0:4197:0:(file.c:1215:ll_file_io_generic()) iot: 1, result: 1048576
          00000100:00100000:2.0:1396480657.809500:0:2031:0:(client.c:1478:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_6:a7681c72-f95a-a5e9-7434-cc6004e1c10d:2031:1464203148756276:0@lo:4
          

          "ldlm_lock_addref_internal_nolock" line is the lock we matched. I only list it here because it has a bit of useful data - the lock "handle that's listed after the lock address before "lrc" string, in this case 0xbefcd51a570cb12e and "remote handle" - the handle used on the server for this same lock: 0xbefcd51a570cb135 - these two handles could be used to find this lock on the server which is handy.
          Next three lines would look a bit different in your case. This is the ends of the syscall and sending of the accumulated data to the server asynchronously at the same time (notice there's a different pid in the "Sending RPC" line. opc 4 = WRITE.
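
          Since both handles also show up in the server's debug log (the client's local handle appears there in the "remote:" field), a simple grep is enough to locate this lock on the server (the log file name is a placeholder):

          grep -E '0xbefcd51a570cb12e|0xbefcd51a570cb135' server-debug.txt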

          The write RPC eventually completes:

          00000100:00100000:2.0:1396480657.813317:0:2031:0:(client.c:1849:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_6:a7681c72-f95a-a5e9-7434-cc6004e1c10d:2031:1464203148756276:0@lo:4
          

          We can also trace how it went on the server side using its xid of 1464203148756276 in the server logs, but there is nothing too interesting there.

          Now another command, from a second client, to read the entire file we just wrote: "cat /mnt/lustre/zeros". (We use a second client here because if we read from the same client, it would just get the data from its local cache without ever contacting the server.)

          We'll skip the open, as it is the same as before, and jump straight to the read:

          00000020:00200000:1.0:1396480673.044082:0:4199:0:(cl_io.c:237:cl_io_rw_init()) header@ffff88007e673ef0[0x0, 1, [0x200000402:0x2:0x0] hash]
          00000020:00200000:1.0:1396480673.044083:0:4199:0:(cl_io.c:237:cl_io_rw_init()) io range: 0 [0, 4194304) 0 0
          ...
          00000080:00200000:1.0:1396480673.044101:0:4199:0:(lcommon_cl.c:694:ccc_io_one_lock_index()) lock: 1 [0, 1023]
          ...
          00010000:00010000:1.0:1396480673.044210:0:4199:0:(ldlm_request.c:926:ldlm_cli_enqueue()) ### client-side enqueue START, flags 40000000
           ns: lustre-OST0000-osc-ffff8800947307f0 lock: ffff880065565db8/0xbefcd51a570cb189 lrc: 3/1,0 mode: --/PR res: [0x5:0x0:0x0].0 rrc: 1 type: EXT [0->4194303] (req 0->4194303) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 4199 timeout: 0 lvb_type: 1
          00000100:00100000:3.0:1396480673.044290:0:2027:0:(client.c:1478:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_2:47a13ec0-2a9e-a542-7690-d156dc774737:2027:1464203148756556:0@lo:101
          

          So here cat decides to read the first 4M of the file (from 0 to 4194304).
          The client needs a lock to read this data, and it does not have one, so it prepares the lock request (the "client-side enqueue START" line). This lists the same usual lock info we saw during the write: the local handle (0xbefcd51a570cb189) and the desired lock region (req 0->4194303); the remote handle is still 0x0 because the server has not responded to this request yet (once it does, the handle will be populated). There is also the lock mode --/PR, meaning we want a read lock (PR is protected read; PW is protected write), but it has not been granted yet (it becomes PR/PR once granted, as you will see later).
          Then we send the request to the server (xid 1464203148756556).

          Now let's switch to the OST log and see what happens to this lock request there:

          00000100:00100000:2.0:1396480673.044531:0:3282:0:(service.c:1939:ptlrpc_server_handle_req_in()) got req x1464203148756556
          00000100:00100000:2.0:1396480673.044564:0:3282:0:(service.c:2082:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:47a13ec0-2a9e-a542-7690-d156dc774737+5:2027:x1464203148756556:12345-0@lo:101
          00010000:00010000:2.0:1396480673.044567:0:3282:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START
          00010000:00010000:2.0:1396480673.044575:0:3282:0:(ldlm_lockd.c:1253:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: filter-lustre-OST0000_UUID lock: ffff8800b424fdb8/0xbefcd51a570cb190 lrc: 2/0,0 mode: --/PR res: [0x5:0x0:0x0].0 rrc: 2 type: EXT [0->0] (req 0->0) flags: 0x40000000000000 nid: local remote: 0xbefcd51a570cb189 expref: -99 pid: 3282 timeout: 0 lvb_type: 0
          

          So, the usual request-starting preamble, and then the lock manager prints that it is starting processing for the requested lock and prints the lock info. Note how on the server side the handle that was local on the client is now printed in the remote field, whereas the local handle is populated with a different value (this one will later become the remote handle on the client). The lock is still ungranted and the requested extent is not yet populated either, so it looks like 0->0 at this time, but not for long.

          After some processing the server notices that this lock cannot be granted because there is a conflicting lock:

          00010000:00010000:2.0:1396480673.044585:0:3282:0:(ldlm_lock.c:715:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-lustre-OST0000_UUID lock: ffff880085b2cdb8/0xbefcd51a570cb135 lrc: 2/0,0 mode: PW/PW res: [0x5:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000080000000 nid: 0@lo remote: 0xbefcd51a570cb12e expref: 4 pid: 3695 timeout: 0 lvb_type: 0
          

          Note how this is the same PW lock (judging by the handle) that the first writing client has.

          The server sets the second client's lock aside on the waiting list:

          00010000:00010000:2.0:1396480673.044597:0:3282:0:(ldlm_resource.c:1282:ldlm_resource_add_lock()) ### About to add this lock:
           ns: filter-lustre-OST0000_UUID lock: ffff8800b424fdb8/0xbefcd51a570cb190 lrc: 4/0,0 mode: --/PR res: [0x5:0x0:0x0].0 rrc: 2 type: EXT [0->4194303] (req 0->4194303) flags: 0x50000000000000 nid: 0@lo remote: 0xbefcd51a570cb189 expref: 6 pid: 3282 timeout: 0 lvb_type: 0
          

          It then prepares a "blocking AST", that is, a message to the first client asking it to drop the lock and all associated cache, and adds the first client's lock to a different list to keep tabs on it in case the client does not reply in time (e.g. due to its death).

          00010000:00010000:2.0:1396480673.044613:0:3282:0:(ldlm_lockd.c:848:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lustre-OST0000_UUID lock: ffff880085b2cdb8/0xbefcd51a570cb135 lrc: 3/0,0 mode: PW/PW res: [0x5:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x50000080000020 nid: 0@lo remote: 0xbefcd51a570cb12e expref: 4 pid: 3695 timeout: 0 lvb_type: 0
          00010000:00010000:2.0:1396480673.044616:0:3282:0:(ldlm_lockd.c:459:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 150, AT: on) ns: filter-lustre-OST0000_UUID lock: ffff880085b2cdb8/0xbefcd51a570cb135 lrc: 4/0,0 mode: PW/PW res: [0x5:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x70000080000020 nid: 0@lo remote: 0xbefcd51a570cb12e expref: 4 pid: 3695 timeout: 4332366772 lvb_type: 0
          

          Then it sends the AST RPC to the client (note how this is the same message we usually see on clients when a client initiates RPCs):

          00000100:00100000:2.0:1396480673.044618:0:3282:0:(client.c:1478:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_ost00_002:lustre-OST0000_UUID:3282:1464203148756580:0@lo:104
          

          Using the xid from this RPC, 1464203148756580, we now look at the first client's log again to see how it is handled. opc 104 = BLOCKING AST.

          00000100:00100000:3.0:1396480673.044707:0:4049:0:(service.c:2082:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_002:LOV_OSC_UUID+4:3282:x1464203148756580:12345-0@lo:104
          00000100:00100000:3.0:1396480673.044865:0:4049:0:(service.c:2132:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_002:LOV_OSC_UUID+4:3282:x1464203148756580:12345-0@lo:104 Request procesed in 155us (214us total) trans 0 rc 0/0
          

          This request was processed super fast, but that is just the notification; now the first client starts preparing to actually cancel the conflicting lock. We keep looking at the first client's log. (While all of this is happening, the server sent a "please wait" message to the second client and the second client entered the wait; we are not covering that here.)

          00010000:00010000:5.0:1396480673.045034:0:3131:0:(ldlm_lockd.c:1654:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-OST0000-osc-ffff88007a7ab7f0 lock: ffff88008bd9fdb8/0xbefcd51a570cb12e lrc: 4/0,0 mode: PW/PW res: [0x5:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x420480020000 nid: local remote: 0xbefcd51a570cb135 expref: -99 pid: 4197 timeout: 0 lvb_type: 1
          00010000:00010000:5.0:1396480673.045037:0:3131:0:(ldlm_lockd.c:1667:ldlm_handle_bl_callback()) Lock ffff88008bd9fdb8 already unused, calling callback (ffffffffa0abdde0)
          

          This is the first client noticing it needs to drop the lock.
          After that there is a flurry of entries in the log about the client preparing the remaining dirty pages for writing and then dropping the now-clean pages from the cache.
          The important part is that all of this takes nonzero time, especially the writes, which involve RPCs. And the busier the server and the network, the longer it takes.

          Eventually all of that is done and we are ready to cancel the lock:

          00010000:00010000:4.0:1396480673.121597:0:3132:0:(ldlm_request.c:1186:ldlm_cancel_pack()) ### packing ns: lustre-OST0000-osc-ffff88007a7ab7f0 lock: ffff88008bd9fdb8/0xbefcd51a570cb12e lrc: 3/0,0 mode: --/PW res: [0x5:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4c69480020000 nid: local remote: 0xbefcd51a570cb135 expref: -99 pid: 4197 timeout: 0 lvb_type: 1
          00010000:00010000:4.0:1396480673.121601:0:3132:0:(ldlm_request.c:1190:ldlm_cancel_pack()) 1 locks packed
          ..
          00000100:00100000:7.0:1396480673.121658:0:2031:0:(client.c:1478:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_6:a7681c72-f95a-a5e9-7434-cc6004e1c10d:2031:1464203148766664:0@lo:103
          

          Note how the lock state changed to --/PW, meaning this lock is now cancelled locally and we are informing the server. And we are sending a cancel RPC (opc 103 = CANCEL) to the server, xid 1464203148766664.
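
          For scale, here is how long that cleanup took in this trace (timestamps taken from the blocking AST callback above and the cancel RPC being sent):

          1396480673.121658 - 1396480673.045034 = ~0.077s  spent flushing dirty pages before the lock could be cancelled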

          Now we switch to the OST server to follow the processing of this CANCEL request.

          00000100:00100000:5.0:1396480673.121903:0:3130:0:(service.c:2082:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn01_002:a7681c72-f95a-a5e9-7434-cc6004e1c10d+5:2031:x1464203148766664:12345-0@lo:103
          00010000:00010000:5.0:1396480673.121913:0:3130:0:(ldlm_lockd.c:1566:ldlm_request_cancel()) ### server-side cancel handler START: 1 locks, starting at 0
          00010000:00010000:5.0:1396480673.122086:0:3130:0:(ldlm_lockd.c:523:ldlm_del_waiting_lock()) ### removed ns: filter-lustre-OST0000_UUID lock: ffff880085b2cdb8/0xbefcd51a570cb135 lrc: 3/0,0 mode: PW/PW res: [0x5:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x70000080000020 nid: 0@lo remote: 0xbefcd51a570cb12e expref: 5 pid: 3695 timeout: 4332366772 lvb_type: 0
          

          The third line here, about removing the lock from the waiting list, is there because we finally got the cancel we had been waiting for.

          00010000:00010000:5.0:1396480673.122093:0:3130:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: filter-lustre-OST0000_UUID lock: ffff880085b2cdb8/0xbefcd51a570cb135 lrc: 0/0,0 mode: --/PW res: [0x5:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x64801080000020 nid: 0@lo remote: 0xbefcd51a570cb12e expref: 5 pid: 3695 timeout: 4332366772 lvb_type: 0
          ...
          00010000:00010000:5.0:1396480673.122098:0:3130:0:(ldlm_resource.c:1282:ldlm_resource_add_lock()) ### About to add this lock:
           ns: filter-lustre-OST0000_UUID lock: ffff8800b424fdb8/0xbefcd51a570cb190 lrc: 2/0,0 mode: PR/PR res: [0x5:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x40000000000000 nid: 0@lo remote: 0xbefcd51a570cb189 expref: 5 pid: 3282 timeout: 0 lvb_type: 0
          

          So the lock is cancelled server-side, and that allows the other lock that was waiting for it to be granted (see the mode changing to PR/PR to indicate it is now granted).

          Now we need to send this great news to the client using COMPLETION AST:

          00010000:00010000:5.0:1396480673.122100:0:3130:0:(ldlm_lock.c:736:ldlm_add_cp_work_item()) ### lock granted; sending completion AST. ns: filter-lustre-OST0000_UUID lock: ffff8800b424fdb8/0xbefcd51a570cb190 lrc: 2/0,0 mode: PR/PR res: [0x5:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x40080000000000 nid: 0@lo remote: 0xbefcd51a570cb189 expref: 5 pid: 3282 timeout: 0 lvb_type: 0
          00010000:00010000:5.0:1396480673.122122:0:3130:0:(ldlm_lockd.c:955:ldlm_server_completion_ast()) ### server preparing completion AST (after 0s wait) ns: filter-lustre-OST0000_UUID lock: ffff8800b424fdb8/0xbefcd51a570cb190 lrc: 3/0,0 mode: PR/PR res: [0x5:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x40000000000000 nid: 0@lo remote: 0xbefcd51a570cb189 expref: 5 pid: 3282 timeout: 0 lvb_type: 0
          00000100:00100000:5.0:1396480673.122125:0:3130:0:(client.c:1478:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_cn01_002:lustre-OST0000_UUID:3130:1464203148766668:0@lo:105
          

          opc 105 = COMPLETION AST. We use xid 1464203148766668 to track progress on the second client now.

          00000100:00100000:5.0:1396480673.122201:0:3124:0:(service.c:2082:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb01_001:LOV_OSC_UUID+4:3130:x1464203148766668:12345-0@lo:105
          00010000:00010000:5.0:1396480673.122258:0:3124:0:(ldlm_lockd.c:1696:ldlm_handle_cp_callback()) ### client completion callback handler START ns: lustre-OST0000-osc-ffff8800947307f0 lock: ffff880065565db8/0xbefcd51a570cb189 lrc: 4/1,0 mode: --/PR res: [0x5:0x0:0x0].0 rrc: 1 type: EXT [0->4194303] (req 0->4194303) flags: 0x20000 nid: local remote: 0xbefcd51a570cb190 expref: -99 pid: 4199 timeout: 0 lvb_type: 1
          00010000:00010000:5.0:1396480673.122261:0:3124:0:(ldlm_lockd.c:1765:ldlm_handle_cp_callback()) ### completion AST, new policy data ns: lustre-OST0000-osc-ffff8800947307f0 lock: ffff880065565db8/0xbefcd51a570cb189 lrc: 4/1,0 mode: --/PR res: [0x5:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x10000000020000 nid: local remote: 0xbefcd51a570cb190 expref: -99 pid: 4199 timeout: 0 lvb_type: 1
          

          The client got the AST and updated its local lock information: even though it requested only offset 0 to 4M, the server granted the lock with a range of 0 to EOF (18446744073709551615) because there were no other conflicts, and this helps save on lock traffic later when (if) the client needs to read beyond the initial 4M.

          The timestamp of the lock-grant completion (there is some extra chatter in the logs about various internal state being updated after the grant), compared to when we first requested this PR lock, tells you how long it took to obtain the lock, and therefore how long the reading process was actually blocked (and you can see where it was blocked along the way, too).
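
          For this example the numbers work out as follows (enqueue START on the second client versus the completion AST being handled there, timestamps from the excerpts above):

          1396480673.122261 - 1396480673.044210 = ~0.078s  from requesting the PR lock to having it granted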

          The next steps are to read the data, and they are similar to the writes (except that, unlike writes, reads are synchronous for the main read; then there is readahead, done asynchronously, which reads more data in the hope it will be needed).

          This is a somewhat simplistic example for a single-striped file. For a multi-striped file there would be multiple locks to multiple servers, operated on in parallel, but still, if something blocks on a lock, it blocks against a single server, and this is the outline you will see.
          The only difference is that for a file written with the O_APPEND flag, writes take locks on all stripes, not just the one corresponding to the I/O area.


          No, this is not the same issue as LU-4080. I only referenced that ticket to give insight on how we are using Lustre. Bottom line is, we have an application on the clients that is designed to read the binary data files collected on the file system from an external source while the files are being written to. We suspect the application and/or data collection process is delayed because the area of the file that is locked for writing cannot be read by the client application. All we want to know is: How do we disable or at least reduce file locking? If that is not possible, then we need to move forward with evaluating a different file system other than Lustre for our environment. Thank you, George Jackson

          jacksong George Jackson (Inactive) added a comment

          If this is the same issue as LU-4080, it may be better to re-open that issue, to keep things together. Have you changed anything beyond the striping recommendations for LU-4080? I am consulting with engineers on this issue. Can you explain any further why you believe locking to be the issue here? It is not apparent from the description in LU-4080.

          cliffw Cliff White (Inactive) added a comment

          Cliff, unfortunately we are a classified DOD site and can give you only so much info. Is there a way to disable or reduce the amount of file locking?

          For a description of what we do, please reference George Jackson's ticket LU-4080, titled "Trouble shooting poor filesystem performance".

          douglas.cain Douglas Allen Cain (Inactive) added a comment

          All data in Lustre is protected by locks; this ensures that all operations are atomic and prevents corruption. The file itself is not locked; the locks apply to ranges of blocks within the file. This locking should be transparent to userspace in most cases. Clients are allowed to open files for reading while another process is writing. Lustre locks the region of the file currently undergoing modification for the period of time needed to actually carry out the modification. A process reading the file in parallel, should it attempt to read this locked region, would block briefly until the modification is complete and would then be able to see the modified data. Reads to other portions of the file will not block. The Linux kernel does something similar on a per-page basis, whereas Lustre usually does it on a per-megabyte basis. Can you give us more specifics on your use case and the issues you are experiencing?

          cliffw Cliff White (Inactive) added a comment

          People

            cliffw Cliff White (Inactive)
            douglas.cain Douglas Allen Cain (Inactive)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: