
Client evicted frequently due to lock callback timer expiration

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 1.8.9
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • 3
    • 6385

    Description

      Our customer is seeing client evictions due to lock callback timer expiration relatively frequently.
      The client is not always the same, and it occurred 3 times on Jun 3rd. As far as the customer has
      checked the network, no errors are reported.

      << OSS >>
      2012/06/03 17:32:06 kern.err@oss4 kernel[-]: [261289.824573] LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.17.11.9@o2ib ns: filter-data-OST0007_UUID lock: ffff8100debd3000/0x2babc16b9ceef3a2 lrc: 3/0,0 mode: PW/PW res: 345385/0 rrc: 8 type: EXT [32768->159743] (req 32768->36863) flags: 0x20 remote: 0xd4d1e1a63a1900b8 expref: 14 pid: 2520 timeout 4556489496
      2012/06/03 17:32:08 kern.err@oss4 kernel[-]: [261292.335250] LustreError: 26916:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (107) req@ffff81049eabb000 x1403470492409861/t0 o13><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1338712334 ref 1 fl Interpret:/0/0 rc -107/0
      2012/06/03 17:32:22 kern.err@oss4 kernel[-]: [261306.299203] LustreError: 2435:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (114) req@ffff8102ac2e3c00 x1403470492409878/t0 o8><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1338712442 ref 1 fl Interpret:/0/0 rc -114/0

      << client >>
      2012/06/03 17:32:08 kern.err@cnode009 kernel[-]: kernel: LustreError: 11-0: an error occurred while communicating with 172.17.13.36@o2ib. The ost_statfs operation failed with -107
      2012/06/03 17:32:08 kern.warning@cnode009 kernel[-]: kernel: Lustre: data-OST0007-osc-ffff81063bc60800: Connection to service data-OST0007 via nid 172.17.13.36@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      2012/06/03 17:32:14 kern.warning@cnode009 kernel[-]: kernel: Lustre: 3960:0:(client.c:1482:ptlrpc_expire_one_request()) @@@ Request x1403470492409862 sent from data-OST0007-osc-ffff81063bc60800 to NID 172.17.13.36@o2ib 6s ago has timed out (6s prior to deadline).
      2012/06/03 17:32:14 kern.warning@cnode009 kernel[-]: kernel: req@ffff810256624800 x1403470492409862/t0 o8->data-OST0007_UUID@172.17.13.36@o2ib:28/4 lens 368/584 e 0 to 1 dl 1338712334 ref 1 fl Rpc:N/0/0 rc 0/0
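
      For triage it can help to pull the client NID and the elapsed time out of the OSS console lines above mechanically. A minimal sketch (the helper name and regex are ours, not part of Lustre):

      ```python
      import re

      # Match the eviction message emitted by waiting_locks_callback() in
      # ldlm_lockd.c, e.g. "... lock callback timer expired after 101s:
      # evicting client at 172.17.11.9@o2ib ns: ..."
      EVICT_RE = re.compile(
          r"lock callback timer expired after (?P<secs>\d+)s: "
          r"evicting client at (?P<nid>\S+)"
      )

      def parse_eviction(line):
          """Return (nid, seconds) for an eviction line, or None if no match."""
          m = EVICT_RE.search(line)
          if m is None:
              return None
          return m.group("nid"), int(m.group("secs"))

      line = ("LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) "
              "### lock callback timer expired after 101s: "
              "evicting client at 172.17.11.9@o2ib ns: filter-data-OST0007_UUID")
      print(parse_eviction(line))  # ('172.17.11.9@o2ib', 101)
      ```

      Run over a day of OSS syslog, this makes it easy to see whether evictions cluster on one client or are spread across the fabric.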

    Attachments

    Issue Links

    Activity


            I am running into this problem while running Lustre 1.8.8. Is there a simple patch, or do I have to upgrade to 1.8.9? I have already increased osc.*.max_rpcs_in_flight=32, turned off checksums, and set osc.*.max_dirty_mb=512. This helped with some evictions. However, I tend to think that timeout=100 still needs to be increased even with adaptive timeouts. Notice the 100s timeout below when we have at_min=0 and at_max=800. The expiration time is interesting, since timeout=100. It could also be related to an actual network socket timeout. Please help.

            Melinda

            LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 34.239.17.202@tcp ns: filter-dlustre-OST0002_UUID lock: ffff8102636cde00/0x392a313d071a080e lrc: 3/0,0 mode: PW/PW res: 961326/0 rrc: 2 type: EXT [0->18446744073709551615] (req 71299072->71303167) flags: 0x20 remote: 0xa1584afedda7b6c expref: 197 pid: 5369 timeout 10277973712
            LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) Skipped 39 previous similar messages
            LustreError: 4057:0:(ldlm_lib.c:1919:target_send_reply_msg())

            mmcdade Melinda McDade added a comment -
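
            The tunables mentioned in the comment above can be inspected and set at runtime with lctl. A hedged sketch for Lustre 1.8 (parameter names vary between versions, so confirm them with lctl list_param on your system first; changes via set_param do not persist across remounts):

            ```shell
            # Inspect the current client-side values
            lctl get_param osc.*.max_rpcs_in_flight
            lctl get_param osc.*.max_dirty_mb
            lctl get_param timeout at_min at_max

            # Apply the settings discussed in the comment above
            lctl set_param osc.*.max_rpcs_in_flight=32
            lctl set_param osc.*.max_dirty_mb=512
            lctl set_param osc.*.checksums=0   # disable client data checksums
            ```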
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 1.8.9 [ 10204 ]
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed to b1_8 and already landed to master under LU-874


            Thanks for checking the log. The customer is currently considering upgrading to a Lustre version with the patch applied.
            Please close this ticket. Thank you.

            mnishizawa Mitsuhiro Nishizawa added a comment -
            bobijam Zhenyu Xu added a comment -

            Sorry for the late response, yes, I think the phenomenon is the same.


            Thanks, Zhenyu. We will discuss if we can apply the patch.

            Today, the customer encountered another instance of the issue:
            2012/06/21 01:04:29 kern.err@oss2 kernel[-]: [1753382.589040] LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 172.17.11.29@o2ib ns: filter-data-OST0003_UUID lock: ffff8101127e8c00/0xb2893ade320a8743 lrc: 3/0,0 mode: PW/PW res: 431401/0 rrc: 8 type: EXT [86016->155647] (req 86016->90111) flags: 0x20 remote: 0xa4f5ddef5f4b10d8 expref: 10 pid: 2462 timeout 6052442992

            We uploaded the debug log to the FTP server:/uploads/LU-1496/LU-1496_20120621.tar.gz
            debug_kernel_20120621_010429.gz ... oss2 debug log
            debug_kernel_cnode029.gz ... client debug log
            log-20120621/cnode029_kernel.txt ... client messages
            log-20120621/oss2_kernel.txt ... oss2 messages

            It looks like this is again due to high I/O load on the OSS.
            Would you please check if this is the same cause?

            mnishizawa Mitsuhiro Nishizawa added a comment -
            bobijam Zhenyu Xu added a comment -

            port patch tracking at http://review.whamcloud.com/3157

            bobijam Zhenyu Xu added a comment -

            not directly, we need a patch port for 1.8.x


            The customer's Lustre version is 1.8.4. Is the issue in LU-874 applicable to 1.8.4?
            Can the patch released there be applied?

            mnishizawa Mitsuhiro Nishizawa added a comment -

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: mnishizawa Mitsuhiro Nishizawa
              Votes: 0
              Watchers: 8
