
LU-4112: Random eviction of clients on lock callback timeout

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Component/s: None
    • Severity: 3
    • Rank: 11068

    Description

      Hi,

      At Meteo France, Lustre clients are randomly evicted with the following message:

      1381739209 2013 Oct 14 08:26:49 beaufixoss9 kern err kernel LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.67.157@o2ib ns: filter-scratch-OST0024_UUID lock: ffff880215b3c6c0/0xe3c417cfbe584025 lrc: 3/0,0 mode: PW/PW res: 87670641/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x964c642755ae9fd expref: 4 pid: 20133 timeout 4581190301

      We made several attempts to increase ldlm_timeout, but even with a timeout of 300 seconds the timer eventually expires and the client is evicted.
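
      For reference, ldlm_timeout is a server-side tunable; a minimal sketch of how it can be inspected and raised on an OSS, assuming the tunable is exposed as /proc/sys/lustre/ldlm_timeout (which lctl get_param/set_param should reach on 2.1.x; adjust to your version):

      lctl get_param ldlm_timeout            # current lock callback timeout, in seconds
      lctl set_param ldlm_timeout=300        # raise it for testing
      # equivalently: echo 300 > /proc/sys/lustre/ldlm_timeout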

      After collecting debug logs, it looks like a communication problem: the client sends the lock callback RPC, but it is never received by the OSS, so the callback timer on the OSS eventually expires.
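
      Debug logs of this kind are typically collected with the Lustre debug buffer; a rough sketch (the exact debug flags may differ from what was used here):

      lctl set_param debug=+rpctrace         # add RPC tracing (dlmtrace is also useful for lock issues)
      lctl set_param debug_mb=512            # enlarge the debug buffer so it is not overwritten too quickly
      # ... reproduce the eviction ...
      lctl dk /tmp/lustre-logs-$(hostname)-$(date +%Y%m%d-%H%M%S)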

      Here is an example:

      [root@beaufixmngt0 beaufix1046]# grep evict lustre-logs-20131011-082535
      00000100:00080000:17.0:1381479935.274633:0:4383:0:(import.c:915:ptlrpc_connect_interpret()) @@@ scratch-OST0017-osc-ffff880454e9b400: evicting (reconnect/recover flags not set: 4) req@ffff88087945e000 x1448401857359985/t0(0) o8->scratch-OST0017-osc-ffff880454e9b400@172.23.80.69@o2ib1:28/4 lens 368/264 e 0 to 0 dl 1381479946 ref 1 fl Interpret:RN/0/0 rc 0/0
      00000100:02020000:17.0:1381479935.274641:0:4383:0:(import.c:1299:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by scratch-OST0017; in progress operations using this service will fail.
      00000100:00080000:17.0:1381479935.285351:0:4383:0:(import.c:1303:ptlrpc_import_recovery_state_machine()) evicted from scratch-OST0017_UUID@172.23.80.69@o2ib1; invalidating

      172.23.80.69 = beaufixoss15-ib1

      [root@beaufixmngt0 beaufix1046]# head -n1 ../beaufixoss15/lustre-logs-20131011-082458
      00010000:00000010:14.1F:1381479186.063701:0:0:0:(ldlm_lock.c:198:ldlm_lock_free()) slab-freed 'lock': 560 at ffff880067e6c480.
      [root@beaufixmngt0 beaufix1046]# tail -n 2 ../beaufixoss15/lustre-logs-20131011-082458 | head -n1
      00002000:00000001:2.0:1381479898.458969:0:20716:0:(filter_io_26.c:325:filter_do_bio()) Process entered

      The OSS debug log covers the period from 1381479186 to 1381479898.
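
      (These timestamps are epoch seconds; GNU date converts them, and the second one matches the log file name:)

      date -u -d @1381479186     # Fri Oct 11 08:13:06 UTC 2013
      date -u -d @1381479898     # Fri Oct 11 08:24:58 UTC 2013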

      Here is the list of RPCs sent during this time (total of 5 RPCs):
      [root@beaufixmngt0 beaufix1046]# grep -f requests lustre-logs-20131011-082535 | grep ptl_send_buf
      00000100:00000200:22.0:1381479685.229939:0:4404:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857358474, offset 0
      00000100:00000200:46.0:1381479697.155610:0:4424:0:(niobuf.c:87:ptl_send_buf()) Sending 296 bytes to portal 28, xid 1448401857358807, offset 0
      00000100:00000200:6.0:1381479797.753246:0:4384:0:(niobuf.c:87:ptl_send_buf()) Sending 456 bytes to portal 6, xid 1448401857359681, offset 0
      00000100:00000200:27.0:1381479835.230381:0:4411:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359731, offset 0
      00000100:00000200:6.0:1381479885.229892:0:4409:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359816, offset 0

      [root@beaufixmngt0 beaufix1046]# grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
      00000100:00000040:0.0:1381479885.230810:0:10038:0:(events.c:284:request_in_callback()) incoming req@ffff88005b69f400 x1448401857359816 msgsize 192

      -> Only 1 of the 5 RPCs sent by the client was received by the OSS.
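
      The same cross-check can be scripted; a minimal sketch, assuming the same file names as above and that the 'requests' file lists the xids of interest:

      grep -f requests lustre-logs-20131011-082535 | grep ptl_send_buf |
      sed 's/.*xid \([0-9]*\),.*/\1/' |
      while read xid; do
          # look for the matching request_in_callback line in the OSS debug log
          if grep -q "request_in_callback.*x${xid} " ../beaufixoss15/lustre-logs-20131011-082458; then
              echo "xid $xid: received by OSS"
          else
              echo "xid $xid: NOT received by OSS"
          fi
      done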

      This example only shows what happens at the LNet level and above; so far we have seen no errors at the InfiniBand level.
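
      LNet-level connectivity between the client and the OSS can also be checked directly, independently of the filesystem; for example, using the NIDs from the logs above:

      lctl ping 172.23.80.69@o2ib1           # from the client, ping the OSS NID over LNet
      lctl ping 172.23.67.157@o2ib           # from the OSS, ping the client NID
      perfquery                              # IB port counters, to spot symbol/link errors or retries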

      This is a blocking problem, as it prevents the cluster from being used in production.

      I will upload traces to help analysis.

      Thanks for your help,
      Sebastien.

      Attachments

        1. 2014-02-11-run6-client.out.gz (970 kB, Manish Patel)
        2. LU-4112-reproducer.tar.gz (2 kB, Sebastien Piechurski)
        3. LU-4112-reproducer-v2.tar.gz (7 kB, Sebastien Piechurski)
        4. lustre-logs.tar.bz2.aa (0.2 kB, Sebastien Buisson)
        5. lustre-logs.tar.bz2.ab (0.2 kB, Sebastien Buisson)
        6. server_log_2_10_2014.log (11 kB, Manish Patel)
        7. xfer_client_2_10_2014.log (13 kB, Manish Patel)

        Activity

          James A Simmons closed the issue as Won't Fix.

          Sebastien Piechurski:
          When the issue shows up, before the client is evicted, all processes are stuck with the following stack:
          [<ffffffffa055dd87>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
          [<ffffffffa05610e6>] cl_lock_hold_mutex+0x96/0x6b0 [obdclass]
          [<ffffffffa056185e>] cl_lock_request+0x5e/0x1c0 [obdclass]
          [<ffffffffa0566bc5>] cl_io_lock+0x3d5/0x570 [obdclass]
          [<ffffffffa0566e1a>] cl_io_loop+0xba/0x190 [obdclass]
          [<ffffffffa0a2c097>] ll_file_io_generic+0x3a7/0x560 [lustre]
          [<ffffffffa0a2c389>] ll_file_aio_read+0x139/0x2c0 [lustre]
          [<ffffffffa0a2c849>] ll_file_read+0x169/0x2a0 [lustre]
          [<ffffffff811679b5>] vfs_read+0xb5/0x1a0
          [<ffffffff81167af1>] sys_read+0x51/0x90

          Except for one process, which shows this stack:
          [<ffffffffa055ccce>] cl_lock_state_signal+0x7e/0x160 [obdclass]
          [<ffffffffa055ce34>] cl_lock_state_set+0x84/0x1d0 [obdclass]
          [<ffffffffa055e5da>] cl_lock_intransit+0x5a/0x150 [obdclass]
          [<ffffffffa055f53b>] cl_use_try+0x13b/0x300 [obdclass]
          [<ffffffffa055f8a5>] cl_enqueue_try+0x1a5/0x300 [obdclass]
          [<ffffffffa0560bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
          [<ffffffffa056187e>] cl_lock_request+0x7e/0x1c0 [obdclass]
          [<ffffffffa0566bc5>] cl_io_lock+0x3d5/0x570 [obdclass]
          [<ffffffffa0566e1a>] cl_io_loop+0xba/0x190 [obdclass]
          [<ffffffffa09d2097>] ll_file_io_generic+0x3a7/0x560 [lustre]
          [<ffffffffa09d2389>] ll_file_aio_read+0x139/0x2c0 [lustre]
          [<ffffffffa09d2849>] ll_file_read+0x169/0x2a0 [lustre]
          [<ffffffff811679b5>] vfs_read+0xb5/0x1a0
          [<ffffffff81167af1>] sys_read+0x51/0x90

          Hope this helps.

          People

            Assignee: Bruno Faccini (Inactive)
            Reporter: Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 14
