[LU-4963] client eviction during IOR test - lock callback timer expired Created: 25/Apr/14  Updated: 28/Apr/14  Resolved: 28/Apr/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Lukasz Flis Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

servers: 2.5.1
clients: 2.5.1, 2.4.x


Issue Links:
Duplicate
is duplicated by LU-4584 Lock revocation process fails consist... Resolved
Severity: 3
Rank (Obsolete): 13726

 Description   

During performance testing on 64-core AMD nodes we observed client evictions during IOR runs.

The problem occurs in the read phase of the test for both 2.4.x and 2.5.x clients.

Message on the server side:
LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 203s: evicting client at 172.16.204.67@o2ib ns: filter-scratch2-OST000f_UUID lock: ffff880427dd4bc0/0xa7267be7d79bca20 lrc: 3/0,0 mode: PW/PW res: [0x2997:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x60000000010020 nid: 172.16.204.67@o2ib remote: 0x2c2795a988d99fa0 expref: 152 pid: 20319 timeout: 6755680954 lvb_type: 0
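For reference, the 203s expiry reported here is driven by the server's timeout tunables; a minimal sketch of how they can be inspected on the OSS, assuming standard lctl parameters (values and the exact relationship are not stated in this ticket):

# Run on the OSS: inspect the timeout tunables behind the lock callback timer
lctl get_param timeout at_min at_max
# Per-namespace lock count for the affected OST, useful while reproducing
lctl get_param ldlm.namespaces.filter-scratch2-OST000f*.lock_count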

Example ior run for problem reproduction:
ior -b 4g -e -C -F -i 10 -k -M 4g -N 24 -o /mnt/lustre/scratch2/test -O lustreStripeCount=1 -t 1m -w -r
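Since the eviction only shows up above a per-node task threshold (see the note below), a simple sweep over task counts can pin it down; a minimal sketch, assuming IOR is launched through mpirun and using placeholder paths rather than the exact invocation from this ticket:

#!/bin/sh
# Sweep per-node task counts to find where evictions start (placeholder paths)
for np in 8 16 20 24 28 32; do
    mpirun -np $np ior -b 4g -e -C -F -i 2 -k -M 4g \
        -o /mnt/lustre/scratch2/test.$np -O lustreStripeCount=1 -t 1m -w -r
done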

The problem does not appear with a thread count lower than 24 per node.
With 24 or more threads per node it always occurs in the read phase:

IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Sat Apr 26 01:18:33 2014
Command line used: /people/x/jor/IOR/ior-3.0.1/src/ior -b 4g -e -C -F -i 10 -k -M 4g -N 24 -o /mnt/lustre/scratch2/people/x/ior-test/test -O lustreStripeCount=1 -t 1m -w -r
Machine: Linux n1085-amd.zeus

Test 0 started: Sat Apr 26 01:18:33 2014
Summary:
api = POSIX
test filename = /mnt/lustre/scratch2/people/x/ior-test/test
access = file-per-process
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients = 24 (24 per node)
memoryPerNode = 10.09 GiB
repetitions = 10
xfersize = 1 MiB
blocksize = 4 GiB
aggregate filesize = 96 GiB
Lustre stripe size = Use default
stripe count = 1

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 2242.27 4194304 1024.00 0.037550 43.84 2.69 43.84 0
read 3710 4194304 1024.00 0.005054 26.50 21.84 26.50 0
write 2234.48 4194304 1024.00 0.022456 43.99 1.78 43.99 1
read 8040 4194304 1024.00 0.019482 12.23 5.21 12.23 1
WARNING: Task 19, partial write(), 4096 of 1048576 bytes at offset 3892314112
ior ERROR: write() failed, errno 5, Input/output error (aiori-POSIX.c:236)
...



 Comments   
Comment by Lukasz Flis [ 26/Apr/14 ]

My mistake. The problem occurs in the write phase, not the read one.

Comment by James A Simmons [ 28/Apr/14 ]

I have been having the same problem. This issue is being tracked under LU-4584.

Comment by Lukasz Flis [ 28/Apr/14 ]

Debug log with dlmtrace

ftp.whamclud.com/uploads/LU-4963/lustre-client-2.4.1-6.cyfronet.log.gz
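
For reference, a dlmtrace debug log like this is typically captured along these lines (a sketch assuming the standard lctl debug commands; the exact procedure used here is not stated in the ticket):

# On the client being evicted: enable ldlm tracing, clear the buffer,
# reproduce the IOR run, then dump the kernel debug log to a file
lctl set_param debug=+dlmtrace
lctl clear
# ... run the IOR reproducer ...
lctl dk /tmp/lustre-client-debug.log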
