[LU-14292] metadata-updates test 3 fails with 'mpi_run failed'
Created: 04/Jan/21
Updated: 22/Jun/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

metadata-updates test_3 started failing with 'mpi_run failed' on 2020-08-08, in a Lustre 2.13.55.9 full-patchless test session at https://testing.whamcloud.com/test_sets/abe45967-5b59-4042-88eb-ff34e5e658ad. Since then, metadata-updates test_3 has failed 27 times, with the latest failure at https://testing.whamcloud.com/test_sets/f550175e-d2d7-4ba2-a9fc-0316726010be.

Looking at the failure in the suite_log, we see that write_disjoint aborted on rank 0 because the disk quota was exceeded:

== metadata-updates test 3: write_disjoint test ====================================================== 07:33:16 (1608881596)
+ chmod 0777 /mnt/lustre
drwxrwxrwx 6 root root 65536 Dec 25 07:32 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -machinefile /tmp/auster.machines -np 2 /usr/lib64/openmpi/bin/write_disjoint -f /mnt/lustre/d0.metadata-updates/f3.metadata-updates -n 1000 "
Warning: Permanently added 'trevis-202vm2,10.9.7.130' (ECDSA) to the list of known hosts.
random seed: 1608881598
loop 0: chunk_size 15794568
rank 0, loop 8: write() returned Disk quota exceeded
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    trevis-202vm1
  Remote host:   trevis-202vm2
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
 metadata-updates test_3: @@@@@@ FAIL: mpi_run failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib64/lustre/tests/metadata-updates.sh:269:test_3()
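
To check whether the other failures of this test show the same write error, the suite_log can be scanned for the two relevant lines. The following is only a minimal sketch: the regular expressions are derived from the log excerpt above, and the function name summarize and the inline sample are illustrative assumptions, not part of the Lustre test framework.

```python
import re

# Patterns inferred from the suite_log excerpt above (an assumption about
# the general format, not a documented write_disjoint output contract).
WRITE_ERR = re.compile(r"rank (\d+), loop (\d+): (\w+)\(\) returned (.+)")
FAIL = re.compile(r"@@@@@@ FAIL: (.+?)\s*$")


def summarize(log_text):
    """Collect (rank, loop, call, error) tuples and FAIL messages from a log."""
    errors = []
    failures = []
    for line in log_text.splitlines():
        m = WRITE_ERR.search(line)
        if m:
            errors.append(
                (int(m.group(1)), int(m.group(2)), m.group(3), m.group(4).strip())
            )
        m = FAIL.search(line)
        if m:
            failures.append(m.group(1))
    return errors, failures


# Three lines copied from the suite_log above, used as sample input.
sample = "\n".join([
    "loop 0: chunk_size 15794568",
    "rank 0, loop 8: write() returned Disk quota exceeded",
    " metadata-updates test_3: @@@@@@ FAIL: mpi_run failed ",
])

errors, failures = summarize(sample)
print(errors)
print(failures)
```

Running the same function over the suite_log of each failed session would show whether every 'mpi_run failed' is preceded by a 'Disk quota exceeded' write error or whether some failures have a different cause.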

In the client1 (vm1) dmesg output, we see:

[111671.914784] Lustre: DEBUG MARKER: == metadata-updates test 3: write_disjoint test ====================================================== 07:33:16 (1608881596)
[111673.143081] hugetlbfs: write_disjoint (1566296): Using mlock ulimits for SHM_HUGETLB is deprecated
[111678.973977] systemd-coredump[1566303]: Not enough arguments passed by the kernel (0, expected 7).
[111679.381003] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  metadata-updates test_3: @@@@@@ FAIL: mpi_run failed 

There’s nothing else that indicates a problem in the console and dmesg logs.


Generated at Sat Feb 10 03:08:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.