Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4112

Random eviction of clients on lock callback timeout

Details

    • Bug
    • Resolution: Won't Fix
    • Blocker
    • None
    • Lustre 2.1.6
    • None
    • 3
    • 11068

    Description

      Hi,

      At Meteo France, Lustre clients are randomly evicted with the following message:

      1381739209 2013 Oct 14 08:26:49 beaufixoss9 kern err kernel LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.67.157@o2ib ns: filter-scratch-OST0024_UUID lock: ffff880215b3c6c0/0xe3c417cfbe584025 lrc: 3/0,0 mode: PW/PW res: 87670641/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x964c642755ae9fd expref: 4 pid: 20133 timeout 4581190301

      Some attempts were made to increase the ldlm_timeout, but even with a timeout of 300 seconds, the timer finally expires and evicts the client.

      After collecting some debug logs, it seems that there is a communication problem: the client sends the lock callback RPC, but it is never received by the OSS, leading it to timeout.

      Here is an example:

      [root@beaufixmngt0 beaufix1046]# grep evict lustre-logs-20131011-082535
      00000100:00080000:17.0:1381479935.274633:0:4383:0:(import.c:915:ptlrpc_connect_interpret()) @@@ scratch-OST0017-osc-ffff880454e9b400: evicting (reconnect/recover flags not set: 4) req@ffff88087945e000 x1448401857359985/t0(0) o8->scratch-OST0017-osc-ffff880454e9b400@172.23.80.69@o2ib1:28/4 lens 368/264 e 0 to 0 dl 1381479946 ref 1 fl Interpret:RN/0/0 rc 0/0
      00000100:02020000:17.0:1381479935.274641:0:4383:0:(import.c:1299:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by scratch-OST0017; in progress operations using this service will fail.
      00000100:00080000:17.0:1381479935.285351:0:4383:0:(import.c:1303:ptlrpc_import_recovery_state_machine()) evicted from scratch-OST0017_UUID@172.23.80.69@o2ib1; invalidating

      172.23.80.69 = beaufixoss15-ib1

      [root@beaufixmngt0 beaufix1046]# head -n1 ../beaufixoss15/lustre-logs-20131011-082458
      00010000:00000010:14.1F:1381479186.063701:0:0:0:(ldlm_lock.c:198:ldlm_lock_free()) slab-freed 'lock': 560 at ffff880067e6c480.
      [root@beaufixmngt0 beaufix1046]# tail -n 2 ../beaufixoss15/lustre-logs-20131011-082458 | head -n1
      00002000:00000001:2.0:1381479898.458969:0:20716:0:(filter_io_26.c:325:filter_do_bio()) Process entered

      OSS debug log covers from 1381479186 to 1381479898

      Here is the list of RPCs sent during this time (total of 5 RPCs):
      [root@beaufixmngt0 beaufix1046]# grep -f requests lustre-logs-20131011-082535 | grep ptl_send_buf
      00000100:00000200:22.0:1381479685.229939:0:4404:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857358474, offset 0
      00000100:00000200:46.0:1381479697.155610:0:4424:0:(niobuf.c:87:ptl_send_buf()) Sending 296 bytes to portal 28, xid 1448401857358807, offset 0
      00000100:00000200:6.0:1381479797.753246:0:4384:0:(niobuf.c:87:ptl_send_buf()) Sending 456 bytes to portal 6, xid 1448401857359681, offset 0
      00000100:00000200:27.0:1381479835.230381:0:4411:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359731, offset 0
      00000100:00000200:6.0:1381479885.229892:0:4409:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359816, offset 0

      [root@beaufixmngt0 beaufix1046]# grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
      00000100:00000040:0.0:1381479885.230810:0:10038:0:(events.c:284:request_in_callback()) incoming req@ffff88005b69f400 x1448401857359816 msgsize 192

      -> Only 1 RPC received out of 5 sent by the clients

      This example only shows what happens on top of LNet, but we can see no error at the Infiniband level yet.

      This is a blocking problem as it is preventing the cluster from being used for production.

      I will upload traces to help analysis.

      Thanks for your help,
      Sebastien.

      Attachments

        1. 2014-02-11-run6-client.out.gz
          970 kB
        2. LU-4112-reproducer.tar.gz
          2 kB
        3. LU-4112-reproducer-v2.tar.gz
          7 kB
        4. lustre-logs.tar.bz2.aa
          0.2 kB
        5. lustre-logs.tar.bz2.ab
          0.2 kB
        6. server_log_2_10_2014.log
          11 kB
        7. xfer_client_2_10_2014.log
          13 kB

        Activity

          People

            bfaccini Bruno Faccini (Inactive)
            sebastien.buisson Sebastien Buisson (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            14 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: