Details
- Bug
- Resolution: Duplicate
- Major
- None
- Lustre 2.5.1
- None
- servers: 2.5.1
  clients: 2.5.1, 2.4.x
- 3
- 13726
Description
During performance testing on 64-core AMD nodes, we observed client evictions while running IOR.
The problem occurs in the read phase of the test, for both 2.4.x and 2.5.x clients.
Message on the server side:
LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 203s: evicting client at 172.16.204.67@o2ib ns: filter-scratch2-OST000f_UUID lock: ffff880427dd4bc0/0xa7267be7d79bca20 lrc: 3/0,0 mode: PW/PW res: [0x2997:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x60000000010020 nid: 172.16.204.67@o2ib remote: 0x2c2795a988d99fa0 expref: 152 pid: 20319 timeout: 6755680954 lvb_type: 0
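When many such messages have to be compared across servers, it can help to pull the key fields out of each line. The following is a minimal sketch, not part of the original report: the regular expression is derived only from the single sample line above, and the dictionary keys (`seconds`, `client`, `namespace`, `mode`, `type`, `start`, `end`) are names chosen here for illustration. Other Lustre versions may format this message differently.

```python
import re

# Sample line copied from the server log quoted in this report.
SAMPLE = (
    "LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer "
    "expired after 203s: evicting client at 172.16.204.67@o2ib ns: filter-scratch2-OST000f_UUID "
    "lock: ffff880427dd4bc0/0xa7267be7d79bca20 lrc: 3/0,0 mode: PW/PW res: [0x2997:0x0:0x0].0 "
    "rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x60000000010020 "
    "nid: 172.16.204.67@o2ib remote: 0x2c2795a988d99fa0 expref: 152 pid: 20319 "
    "timeout: 6755680954 lvb_type: 0"
)

# Pattern based on the one sample above (an assumption, not a documented format).
EVICTION_RE = re.compile(
    r"lock callback timer expired after (?P<seconds>\d+)s: "
    r"evicting client at (?P<client>\S+)\s+"
    r"ns: (?P<namespace>\S+)\s+"
    r".*?mode: (?P<mode>\S+)\s+"
    r".*?type: (?P<type>\S+) \[(?P<start>\d+)->(?P<end>\d+)\]"
)


def parse_eviction(line):
    """Return a dict of fields from an eviction message, or None if no match."""
    match = EVICTION_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted.
    for key in ("seconds", "start", "end"):
        info[key] = int(info[key])
    return info


if __name__ == "__main__":
    for key, value in parse_eviction(SAMPLE).items():
        print(f"{key}: {value}")
```

For the sample line, the extent end equals 2**64 - 1, i.e. the lock in the message covers the whole object rather than only the 1 MiB range shown after `req`.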
Example IOR run that reproduces the problem:
ior -b 4g -e -C -F -i 10 -k -M 4g -N 24 -o /mnt/lustre/scratch2/test -O lustreStripeCount=1 -t 1m -w -r
The problem does not appear with fewer than 24 threads per node.
With 24 or more threads it always occurs in the read phase:
IOR-3.0.1: MPI Coordinated Test of Parallel I/O
Began: Sat Apr 26 01:18:33 2014
Command line used: /people/x/jor/IOR/ior-3.0.1/src/ior -b 4g -e -C -F -i 10 -k -M 4g -N 24 -o /mnt/lustre/scratch2/people/x/ior-test/test -O lustreStripeCount=1 -t 1m -w -r
Machine: Linux n1085-amd.zeus
Test 0 started: Sat Apr 26 01:18:33 2014
Summary:
api = POSIX
test filename = /mnt/lustre/scratch2/people/x/ior-test/test
access = file-per-process
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients = 24 (24 per node)
memoryPerNode = 10.09 GiB
repetitions = 10
xfersize = 1 MiB
blocksize = 4 GiB
aggregate filesize = 96 GiB
Lustre stripe size = Use default
stripe count = 1
access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------    ---------  ----------  ---------  --------  --------  --------  --------  ----
write     2242.27    4194304     1024.00    0.037550  43.84     2.69      43.84     0
read      3710       4194304     1024.00    0.005054  26.50     21.84     26.50     0
write     2234.48    4194304     1024.00    0.022456  43.99     1.78      43.99     1
read      8040       4194304     1024.00    0.019482  12.23     5.21      12.23     1
WARNING: Task 19, partial write(), 4096 of 1048576 bytes at offset 3892314112
ior ERROR: write() failed, errno 5, Input/output error (aiori-POSIX.c:236)
...
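As a sanity check on the numbers in the IOR output above, the sketch below recomputes the aggregate file size and the per-iteration bandwidth. It assumes that the reported bandwidth is simply the aggregate file size divided by the total(s) column; that is an assumption made here, not something stated in the report, although the figures agree to within rounding. The variable and function names are illustrative.

```python
# Values taken from the IOR summary in this report.
TASKS = 24          # "clients = 24 (24 per node)"
BLOCK_GIB = 4       # "blocksize = 4 GiB", one file per process

# Aggregate data moved per phase, in MiB (24 * 4 GiB = 96 GiB).
AGGREGATE_MIB = TASKS * BLOCK_GIB * 1024

# (access, total seconds, reported MiB/s) copied from the results table.
ROWS = [
    ("write", 43.84, 2242.27),
    ("read", 26.50, 3710.0),
    ("write", 43.99, 2234.48),
    ("read", 12.23, 8040.0),
]


def bandwidth_mib_s(total_seconds):
    """Bandwidth under the assumption bw = aggregate size / total time."""
    return AGGREGATE_MIB / total_seconds


if __name__ == "__main__":
    print(f"aggregate size: {AGGREGATE_MIB / 1024:.0f} GiB")
    for access, seconds, reported in ROWS:
        computed = bandwidth_mib_s(seconds)
        print(f"{access:5s} computed {computed:8.1f} MiB/s, reported {reported:8.2f} MiB/s")
```

The aggregate size (96 GiB) is far larger than the 10.09 GiB memoryPerNode shown in the summary.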
Issue Links
- is duplicated by LU-4584 Lock revocation process fails consistently (Resolved)