
Client evicted frequently due to lock callback timer expiration

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 1.8.9
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • 3
    • 6385

    Description

      Our customer is seeing client evictions due to lock callback timer expiration relatively frequently.
      The client is not always the same, but it occurred 3 times on Jun 3rd. As far as the customer has
      checked the network, no errors are reported.

      << OSS >>
      2012/06/03 17:32:06 kern.err@oss4 kernel[-]: [261289.824573] LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.17.11.9@o2ib ns: filter-data-OST0007_UUID lock: ffff8100debd3000/0x2babc16b9ceef3a2 lrc: 3/0,0 mode: PW/PW res: 345385/0 rrc: 8 type: EXT [32768->159743] (req 32768->36863) flags: 0x20 remote: 0xd4d1e1a63a1900b8 expref: 14 pid: 2520 timeout 4556489496
      2012/06/03 17:32:08 kern.err@oss4 kernel[-]: [261292.335250] LustreError: 26916:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (107) req@ffff81049eabb000 x1403470492409861/t0 o13><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1338712334 ref 1 fl Interpret:/0/0 rc -107/0
      2012/06/03 17:32:22 kern.err@oss4 kernel[-]: [261306.299203] LustreError: 2435:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (114) req@ffff8102ac2e3c00 x1403470492409878/t0 o8><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1338712442 ref 1 fl Interpret:/0/0 rc -114/0

      << client >>
      2012/06/03 17:32:08 kern.err@cnode009 kernel[-]: kernel: LustreError: 11-0: an error occurred while communicating with 172.17.13.36@o2ib. The ost_statfs operation failed with -107
      2012/06/03 17:32:08 kern.warning@cnode009 kernel[-]: kernel: Lustre: data-OST0007-osc-ffff81063bc60800: Connection to service data-OST0007 via nid 172.17.13.36@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      2012/06/03 17:32:14 kern.warning@cnode009 kernel[-]: kernel: Lustre: 3960:0:(client.c:1482:ptlrpc_expire_one_request()) @@@ Request x1403470492409862 sent from data-OST0007-osc-ffff81063bc60800 to NID 172.17.13.36@o2ib 6s ago has timed out (6s prior to deadline).
      2012/06/03 17:32:14 kern.warning@cnode009 kernel[-]: kernel: req@ffff810256624800 x1403470492409862/t0 o8->data-OST0007_UUID@172.17.13.36@o2ib:28/4 lens 368/584 e 0 to 1 dl 1338712334 ref 1 fl Rpc:N/0/0 rc 0/0
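
      For reference, a minimal triage sketch for spotting these events on an OSS (assumptions: the messages land in /var/log/messages, and the format matches the eviction lines quoted above):

      # Count lock-callback evictions per client NID in the OSS syslog
      grep 'lock callback timer expired' /var/log/messages \
        | sed -n 's/.*evicting client at \([^ ]*\) .*/\1/p' \
        | sort | uniq -c | sort -rn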

      Attachments

        Issue Links

          Activity

            [LU-1496] Client evicted frequently due to lock callback timer expiration

            I am running into this problem while running Lustre 1.8.8. Is there a simple patch, or do I have to upgrade to 1.8.9? I have already increased osc.*.max_rpcs_in_flight to 32, turned off checksums, and set osc.*.max_dirty_mb to 512. This helped with some evictions. However, I tend to think that timeout=100 still needs to be increased even with adaptive timeouts. Notice the 100s expiration below, even though we have at_min=0 and at_max=800; it is an interesting expiration time, given that timeout=100. It could also be related to an actual network socket timeout. Please help.

            Melinda

            LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 34.239.17.202@tcp ns: filter-dlustre-OST0002_UUID lock: ffff8102636cde00/0x392a313d071a080e lrc: 3/0,0 mode: PW/PW res: 961326/0 rrc: 2 type: EXT [0->18446744073709551615] (req 71299072->71303167) flags: 0x20 remote: 0xa1584afedda7b6c expref: 197 pid: 5369 timeout 10277973712
            LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) Skipped 39 previous similar messages
            LustreError: 4057:0:(ldlm_lib.c:1919:target_send_reply_msg())
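
            For reference, the tuning described above maps onto lctl parameters roughly as follows. This is only a sketch: the values are the ones quoted in this comment (not recommendations), the osc.* wildcard applies the setting to all OSC devices on the client, and exact parameter paths may differ between 1.8.x releases.

            # Client-side settings mentioned above (run on the client)
            lctl set_param osc.*.max_rpcs_in_flight=32
            lctl set_param osc.*.max_dirty_mb=512
            lctl set_param osc.*.checksums=0        # "turned off checksums"
            # Timeout settings referenced above; adaptive timeouts are governed by at_min/at_max
            lctl get_param timeout at_min at_max    # e.g. timeout=100, at_min=0, at_max=800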

            mmcdade Melinda McDade added a comment -
            pjones Peter Jones added a comment -

            Landed to b1_8 and already landed to master under LU-874


            Thanks for checking the log. The customer is currently considering upgrading to a Lustre version with the patch applied.
            Please close this ticket. Thank you.

            mnishizawa Mitsuhiro Nishizawa added a comment -
            bobijam Zhenyu Xu added a comment -

            Sorry for the late response. Yes, I think the phenomenon is the same.


            Thanks, Zhenyu. We will discuss if we can apply the patch.

            Today, the customer encountered another instance of the issue.
            2012/06/21 01:04:29 kern.err@oss2 kernel[-]: [1753382.589040] LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 172.17.11.29@o2ib ns: filter-data-OST0003_UUID lock: ffff8101127e8c00/0xb2893ade320a8743 lrc: 3/0,0 mode: PW/PW res: 431401/0 rrc: 8 type: EXT [86016->155647] (req 86016->90111) flags: 0x20 remote: 0xa4f5ddef5f4b10d8 expref: 10 pid: 2462 timeout 6052442992

            We uploaded the debug log to the FTP server:/uploads/LU-1496/LU-1496_20120621.tar.gz
            debug_kernel_20120621_010429.gz ... oss2 debug log
            debug_kernel_cnode029.gz ... client debug log
            log-20120621/cnode029_kernel.txt ... client messages
            log-20120621/oss2_kernel.txt ... oss2 messages

            It looks like this is again due to high I/O load on the OSS.
            Would you please check if this is the same cause?
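
            For reference, debug logs like the ones listed above are typically captured by dumping the Lustre kernel debug buffer; a minimal sketch (the output path, and the assumption that this is how these particular files were produced, are illustrative only):

            # Dump the Lustre kernel debug buffer on the node of interest (OSS or client)
            lctl dk /tmp/debug_kernel_$(hostname)_$(date +%Y%m%d_%H%M%S)
            gzip /tmp/debug_kernel_*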

            mnishizawa Mitsuhiro Nishizawa added a comment -
            bobijam Zhenyu Xu added a comment -

            The port of the patch is being tracked at http://review.whamcloud.com/3157

            bobijam Zhenyu Xu added a comment -

            Not directly; we need the patch ported to 1.8.x.


            The customer's Lustre version is 1.8.4. Is the issue in LU-874 applicable to 1.8.4?
            Can the patch released there be applied?

            mnishizawa Mitsuhiro Nishizawa added a comment -
            bobijam Zhenyu Xu added a comment -

            I think LU-918 and LU-874 addressed this issue.

            bobijam Zhenyu Xu added a comment -

            OSS4 is heavily loaded, and many clients are competing to write to the same files. The write request from cnode050 to OST0007 (on OSS4) was delayed too long by the heavy load, and the client got evicted.

            OSS4's heavy load

            2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218357] Lustre: data-OST0007: slow quota init 150s due to heavy IO load
            2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218364] Lustre: Skipped 4 previous similar messages
            2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218535] Lustre: data-OST0007: slow quota init 150s due to heavy IO load
            2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218541] Lustre: Skipped 1 previous similar message
            2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.224625] Lustre: data-OST0007: slow direct_io 166s due to heavy IO load
            2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.964476] Lustre: data-OST0006: slow quota init 101s due to heavy IO load
            2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.964479] Lustre: data-OST0006: slow quota init 80s due to heavy IO load
            2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.964485] Lustre: Skipped 48 previous similar messages
            2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.964489] Lustre: Skipped 48 previous similar messages
            2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.965584] Lustre: data-OST0006: slow direct_io 131s due to heavy IO load
            2012/06/19 21:42:41 kern.info@oss4 kernel[-]: [1655136.966417] Lustre: data-OST0006: slow quota init 31s due to heavy IO load
            2012/06/19 21:42:41 kern.info@oss4 kernel[-]: [1655136.966421] Lustre: Skipped 7 previous similar messages

            cnode050's write request was delayed, and the client got evicted due to the lock timeout

            00000100:00000200:23:1340117424.398493:0:4625:0:(service.c:1046:ptlrpc_hpreq_reorder_nolock()) @@@ high priority req req@ffff8101b6316800 x1403470546287529/t0 o4->2208c67a-8f53-db44-9cbd-e65dba8f887c@NET_0x50000ac110b32_UUID:0/0 lens 448/0 e 0 to 0 dl 1340117467 ref 1 fl New:H/0/0 rc 0/0
            00000100:00100000:23:1340117424.398534:0:4625:0:(service.c:1369:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io_171:2208c67a-8f53-db44-9cbd-e65dba8f887c+9:3937:x1403470546287529:12345-172.17.11.50@o2ib:4

            The OSS received the request at 1340117424 and started handling it.

            00000100:00001000:7:1340117462.705473:0:2587:0:(service.c:762:ptlrpc_at_send_early_reply()) @@@ sending early reply (deadline +5s, margin 63s) for 68+30 req@ffff8101b6316800 x1403470546287529/t0 o4>2208c67a-8f53-db44-9cbd-e65dba8f887c@NET_0x50000ac110b32_UUID:0/0 lens 448/416 e 0 to 0 dl 1340117467 ref 2 fl Interpret:H/0/0 rc 0/0
            00000100:00001000:7:1340117462.705480:0:2587:0:(import.c:1468:at_measured()) add 68 to ffff8102ee7ab860 time=60 v=68 (68 1 31 1)

            38 seconds had passed; the server sent an early reply asking the client to give the OST an extra 68 seconds.

            00010000:00020000:23:1340117525.229707:0:0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.17.11.50@o2ib ns: filter-data-OST0007_UUID lock: ffff8101a5e2fa00/0x2babc16ba243f737 lrc: 3/0,0 mode: PW/PW res: 430494/0 rrc: 8 type: EXT [73728->159743] (req 73728->77823) flags: 0x20 remote: 0x225685b920b04720 expref: 8 pid: 2438 timeout 5961707167

            The extent lock protecting the write timed out, and the client was evicted (see the timeline sketch below).
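
            For clarity, the timeline above works out as follows; the timestamps are the epoch seconds from the debug log prefixes, and the arithmetic is only illustrative:

            # Timeline reconstructed from the debug log excerpts above
            echo $(( 1340117462 - 1340117424 ))   # 38  -> early reply sent ~38s after the write RPC arrived
            echo $(( 1340117525 - 1340117424 ))   # 101 -> lock callback timer expired ~101s later, matching
                                                  #        "expired after 101s" in the eviction message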


            People

              bobijam Zhenyu Xu
              mnishizawa Mitsuhiro Nishizawa
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: