[LU-10221] recovery-mds-scale test_failover_mds: onyx-40vm1:LBUG/LASSERT detected Created: 09/Nov/17  Updated: 17/May/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Casper Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: None
Environment:

onyx, failover
servers: el7.4, ldiskfs, branch master, v2.10.55, b3667
clients: sles12sp3, branch master, v2.10.55, b3667


Issue Links:
Related
is related to LU-9601 recovery-mds-scale test_failover_mds:... Reopened
is related to LU-10319 recovery-random-scale, test_fail_clie... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

session: https://testing.hpdd.intel.com/test_sessions/2c5b36b7-c2a1-4e4a-89e7-993f6d6350b5
test set: https://testing.hpdd.intel.com/test_sets/db9bc182-c4e6-11e7-9c63-52540065bddc

May be related to LU-9601, in which client 1 is lost (due to OOM).

From test_log:

==== Checking the clients loads AFTER failover -- failure NOT OK
11:36:51 (1510169811) waiting for onyx-40vm1 network 5 secs ...
11:36:51 (1510169811) network interface is UP
CMD: onyx-40vm1 rc=0;
			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
				echo \$(hostname -s): \$val;
				rc=\$val;
			fi;
			exit \$rc
pdsh@onyx-40vm5: onyx-40vm1: mcmd: connect failed: Connection refused
 recovery-mds-scale test_failover_mds: @@@@@@ FAIL: onyx-40vm1:LBUG/LASSERT detected 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5289:error()
  = /usr/lib64/lustre/tests/test-framework.sh:6285:check_node_health()
  = /usr/lib64/lustre/tests/test-framework.sh:2277:check_client_load()
  = /usr/lib64/lustre/tests/test-framework.sh:2322:check_client_loads()
  = /usr/lib64/lustre/tests/recovery-mds-scale.sh:170:failover_target()
  = /usr/lib64/lustre/tests/recovery-mds-scale.sh:236:test_failover_mds()
  = /usr/lib64/lustre/tests/test-framework.sh:5565:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5604:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5451:run_test()
  = /usr/lib64/lustre/tests/recovery-mds-scale.sh:238:main()
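
The check being run above is the "catastrophe" health probe from check_node_health(): /usr/sbin/lctl get_param -n catastrophe reports a non-zero value once an LBUG/LASSERT has fired on a node. In this run the probe could not even reach onyx-40vm1 (connection refused). A minimal sketch for re-running the probe by hand, assuming pdsh access to the client (the hostname is taken from the failure above):

  node=onyx-40vm1                                  # client reported in the failure above
  pdsh -w "$node" '/usr/sbin/lctl get_param -n catastrophe'
  # non-zero output means an LBUG/LASSERT was hit on that node;
  # "Connection refused", as seen here, means the node itself is unreachable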


 Comments   
Comment by Peter Jones [ 04/Dec/17 ]

Hongchao

Could you please advise on this one?

Thanks

Peter

Comment by Hongchao Zhang [ 05/Dec/17 ]

The log shows that node onyx-40vm1 hit a problem, but there are no logs from it (except for the dd debug log).
Are you still able to get the syslog/console log from this node?

Comment by James Casper [ 05/Dec/17 ]

I don't see any syslogs, but consoles are available at the session level under Session logs.

Comment by Hongchao Zhang [ 06/Dec/17 ]

This should be the same as LU-9601; there is also an OOM in the log:

Lustre: 1884:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[ 1085.759409] irqbalance invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[ 1085.759416] irqbalance cp[    1.744383] virtio-pci 0000:00:04.0: virtio_pci: leaving for legacy driver
[    1.745550] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
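
For reference, a quick way to confirm the OOM when reviewing a saved console log (the file name below is hypothetical; the actual logs are under Session logs on the test session page):

  # scan a locally saved copy of the onyx-40vm1 console log for OOM and
  # LBUG/LASSERT markers (file name is an assumption; adjust to the downloaded log)
  grep -nE 'invoked oom-killer|Out of memory|LBUG|ASSERTION' onyx-40vm1-console.log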