Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4963

client eviction during IOR test - lock callback timer expired

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.5.1
    • None
    • servers: 2.5.1
      clients: 2.5.1, 2.4.x
    • 3
    • 13726

    Description

      During the performance testing on 64 core AMD nodes we have observed client evictions in IOR.

      Problem occurs in the read phase of the test for 2.4.x and 2.5.x clients

      Message on the server side:
      LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 203s: evicting client at 172.16.204.67@o2ib ns: filter-scratch2-OST000f_UUID lock: ffff880427dd4bc0/0xa7267be7d79bca20 lrc: 3/0,0 mode: PW/PW res: [0x2997:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x60000000010020 nid: 172.16.204.67@o2ib remote: 0x2c2795a988d99fa0 expref: 152 pid: 20319 timeout: 6755680954 lvb_type: 0

      Example ior run for problem reproduction:
      ior -b 4g -e -C -F -i 10 -k -M 4g -N 24 -o /mnt/lustre/scratch2/test -O lustreStripeCount=1 -t 1m -w -r

      Problem does not appear with thread count lower than 24 per node.
      From 24 and up it is always occuring in the read phase:

      IOR-3.0.1: MPI Coordinated Test of Parallel I/O

      Began: Sat Apr 26 01:18:33 2014
      Command line used: /people/x/jor/IOR/ior-3.0.1/src/ior -b 4g -e -C -F -i 10 -k -M 4g -N 24 -o /mnt/lustre/scratch2/people/x/ior-test/test -O lustreStripeCount=1 -t 1m -w -r
      Machine: Linux n1085-amd.zeus

      Test 0 started: Sat Apr 26 01:18:33 2014
      Summary:
      api = POSIX
      test filename = /mnt/lustre/scratch2/people/x/ior-test/test
      access = file-per-process
      ordering in a file = sequential offsets
      ordering inter file= constant task offsets = 1
      clients = 24 (24 per node)
      memoryPerNode = 10.09 GiB
      repetitions = 10
      xfersize = 1 MiB
      blocksize = 4 GiB
      aggregate filesize = 96 GiB
      Lustre stripe size = Use default
      stripe count = 1

      access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
      ------ --------- ---------- --------- -------- -------- -------- -------- ----
      write 2242.27 4194304 1024.00 0.037550 43.84 2.69 43.84 0
      read 3710 4194304 1024.00 0.005054 26.50 21.84 26.50 0
      write 2234.48 4194304 1024.00 0.022456 43.99 1.78 43.99 1
      read 8040 4194304 1024.00 0.019482 12.23 5.21 12.23 1
      WARNING: Task 19, partial write(), 4096 of 1048576 bytes at offset 3892314112
      ior ERROR: write() failed, errno 5, Input/output error (aiori-POSIX.c:236)
      ...

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              lflis Lukasz Flis
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: