[LU-10221] recovery-mds-scale test_failover_mds: onyx-40vm1:LBUG/LASSERT detected Created: 09/Nov/17 Updated: 17/May/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Casper | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | onyx, failover |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
session: https://testing.hpdd.intel.com/test_sessions/2c5b36b7-c2a1-4e4a-89e7-993f6d6350b5

May be related to LU-9601, which loses client 1 (due to OOM).

From test_log:

==== Checking the clients loads AFTER failover -- failure NOT OK
11:36:51 (1510169811) waiting for onyx-40vm1 network 5 secs ...
11:36:51 (1510169811) network interface is UP
CMD: onyx-40vm1 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc
pdsh@onyx-40vm5: onyx-40vm1: mcmd: connect failed: Connection refused
recovery-mds-scale test_failover_mds: @@@@@@ FAIL: onyx-40vm1:LBUG/LASSERT detected
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5289:error()
  = /usr/lib64/lustre/tests/test-framework.sh:6285:check_node_health()
  = /usr/lib64/lustre/tests/test-framework.sh:2277:check_client_load()
  = /usr/lib64/lustre/tests/test-framework.sh:2322:check_client_loads()
  = /usr/lib64/lustre/tests/recovery-mds-scale.sh:170:failover_target()
  = /usr/lib64/lustre/tests/recovery-mds-scale.sh:236:test_failover_mds()
  = /usr/lib64/lustre/tests/test-framework.sh:5565:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5604:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5451:run_test()
  = /usr/lib64/lustre/tests/recovery-mds-scale.sh:238:main() |
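For context, the FAIL above comes from the catastrophe check in the CMD line: the trace dump shows check_node_health() running that command on the client via pdsh, and a non-zero catastrophe value means an LBUG/LASSERT has fired on that node (the "Connection refused" here means the check could not even reach onyx-40vm1). Below is a minimal sketch of that check, assuming pdsh connectivity to the node and lctl installed at /usr/sbin/lctl; the helper name check_catastrophe_on is illustrative and is not the actual test-framework.sh function.

#!/bin/bash
# Minimal sketch of the health check quoted above (illustrative; the real
# implementation is check_node_health() in test-framework.sh). Assumes pdsh
# connectivity to the node and lctl installed at /usr/sbin/lctl.
check_catastrophe_on() {
    local node=$1
    # "catastrophe" becomes non-zero once an LBUG/LASSERT has fired on the node;
    # -S makes pdsh propagate the remote command's exit status.
    pdsh -S -w "$node" 'rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
        if [[ $? -eq 0 && $val -ne 0 ]]; then echo "$(hostname -s): $val"; rc=$val; fi;
        exit $rc'
}

check_catastrophe_on onyx-40vm1 ||
    echo "FAIL: onyx-40vm1: LBUG/LASSERT detected (or node unreachable)"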
| Comments |
| Comment by Peter Jones [ 04/Dec/17 ] |
|
Hongchao, could you please advise on this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 05/Dec/17 ] |
|
The log shows that the node onyx-40vm1 encountered a problem, but there are no logs from it (except for the dd debug log). |
| Comment by James Casper [ 05/Dec/17 ] |
|
I don't see any syslogs, but consoles are available at the session level under Session logs. |
| Comment by Hongchao Zhang [ 06/Dec/17 ] |
|
This should be the same as LU-9601; there is also an OOM in the log:

Lustre: 1884:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[ 1085.759409] irqbalance invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[ 1085.759416] irqbalance cp[ 1.744383] virtio-pci 0000:00:04.0: virtio_pci: leaving for legacy driver
[ 1.745550] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10 |