[LU-2719] Lustre slowdown, errors with IOR: INFO: task IOR: blocked for more than 120 seconds. Created: 31/Jan/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Malcolm Cowe (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

All servers have identical h/w and s/w config:

Packages:

lustre-2.1.4-2.6.32_279.14.1.el6_lustre.x86_64
lustre-modules-2.1.4-2.6.32_279.14.1.el6_lustre.x86_64
lustre-ldiskfs-3.3.0-2.6.32_279.14.1.el6_lustre.x86_64
e2fsprogs-1.42.3.wc3-7.el6
kernel-2.6.32-279.14.1.el6_lustre
kernel-firmware-2.6.32-279.14.1.el6_lustre

Kernel IB stack

uname -a:

Linux oss1 2.6.32-279.14.1.el6_lustre.x86_64 #1 SMP Fri Dec 14 23:22:17 PST 2012 x86_64 x86_64 x86_64 GNU/Linux

128GB RAM
2 x 8-core Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
MLX FDR IB

MDS is running the IB subnet manager.

Clients have identical h/w config and the following s/w:

kernel-2.6.32-279.2.1.el6.x86_64
kernel-firmware-2.6.32-279.14.1.el6.noarch
lustre-client-2.1.3-2.6.32_279.2.1.el6.x86_64.x86_64
lustre-client-modules-2.1.3-2.6.32_279.2.1.el6.x86_64.x86_64

Kernel IB Stack

Linux hd-client-00 2.6.32-279.2.1.el6.x86_64 #1 SMP Fri Jul 20 01:55:29 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

One of the clients has reported h/w issues. These appear to have been resolved.


Attachments: hd-client-excerpt.txt, messages-client-00-trunc, messages-client-01-trunc, messages-client-02-trunc, messages-hd-oss-00, messages-hd-oss-01, messages-mds-00-trunc
Severity: 3
Rank (Obsolete): 6612

 Description   

Observed a slowdown in performance in the test environment that they are using to benchmark HD video streaming. After an initial PoC before Christmas, they have restarted their evaluation and are trying to re-establish the baseline performance, but are seeing poor results compared with their original testing.

The system was completely rebooted earlier in the week and is, so far, much improved. Still waiting on confirmation that the numbers are in line with expectations, but I would like to make sure that we haven't missed anything.

There is a possibility that the problems experienced were environment or configuration related (e.g. network instability or a configuration error). Since rebooting the servers, the system appears to be more stable.

Ideally, looking for consensus or confirmation that the issue is not systemic.

Syslogs attached. They have been scrubbed a bit to remove some very verbose and extraneous informational entries from unrelated software.

