[LU-9089] performance-sanity test_4 OpenFabrics vendor limiting the amount of physical memory Created: 08/Feb/17  Updated: 21/May/21  Resolved: 21/May/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Casper
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

onyx-64-67, Full Group test,
master branch, v2.9.52, b3499, zfs,
CentOS Linux 7 clients


Severity: 3
Rank (Obsolete): 9223372036854775807

Description

performance-sanity, test_4 TIMEOUT

Access to logs: https://testing.hpdd.intel.com/test_sets/095753c6-e5e9-11e6-b6d4-5254006e85c2

Also seen in November 2016 (DCO-6144).

Note: Timeout issues for this test have been seen since 2011 and have most frequently been associated with LU-1357, which attributes the timeout to the use of VMs. In this ticket, however, the testing was done on physical hardware.

From test_log:

+ su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun -mca boot ssh -machinefile /tmp/mdsrate-create-large.machines -np 1 /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 52671 --dir /mnt/lustre/mdsrate/single --filefmt 'f%%d' "
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter
file.  Deprecated MCA parameters should be avoided; they may disappear
in future releases.

  Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              onyx-66.onyx.hpdd.intel.com
  Registerable memory:     32768 MiB
  Total memory:            49110 MiB

Your MPI job will continue, but may behave poorly and/or hang.
--------------------------------------------------------------------------
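
The deprecated-parameter warning is cosmetic, but it can be silenced by renaming the parameter in the MCA parameter file that sets it. A minimal sketch, assuming Open MPI 1.6 semantics (the compat-openmpi16 runtime invoked above), where plm_rsh_agent was superseded by orte_rsh_agent; the current name on a given installation can be confirmed with ompi_info --all | grep rsh_agent:

    # In whichever MCA parameter file sets the old name, e.g.
    # $HOME/.openmpi/mca-params.conf or $prefix/etc/openmpi-mca-params.conf,
    # replace the deprecated entry:
    #   plm_rsh_agent = ssh
    # with its Open MPI 1.6 equivalent:
    orte_rsh_agent = ssh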
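
For the registration-limit warning itself, the FAQ linked in the log attributes the cap to the kernel driver's memory translation table (MTT) parameters. A sketch of the arithmetic and the corresponding fix, assuming Mellanox mlx4 HCAs and a 4 KiB page size (the actual fabric hardware is not stated in this ticket):

    # Per the Open MPI FAQ, registerable memory for mlx4 is:
    #   max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE
    # onyx-66 can register 32768 MiB (2^23 MTT entries * 4 KiB) against
    # 49110 MiB of RAM; raising the entry count to 2^24 covers all of it:
    #   2^24 * 2^0 * 4 KiB = 64 GiB
    # e.g. in /etc/modprobe.d/mlx4_core.conf:
    options mlx4_core log_num_mtt=24 log_mtts_per_seg=0
    # Reload the module (or reboot), and also make sure the locked-memory
    # ulimit is not the bottleneck, e.g. in /etc/security/limits.conf:
    #   * soft memlock unlimited
    #   * hard memlock unlimited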
