[LU-13323] sanity test_160a: timeout Created: 04/Mar/20  Updated: 04/Mar/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f0d9c514-dca5-4f09-8180-068456c0d0c8

test_160a failed with the following error:

Timeout occurred after 160 mins, last suite running was sanity

After the MDT failover, it appears that the clients never reconnect to the MDT.
On the MDS side:

[ 5987.459946] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 5987.641616] Lustre: Failing over lustre-MDT0000
[ 5988.179879] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.43@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 5988.182823] LustreError: Skipped 6 previous similar messages
[ 5993.848997] Lustre: 25449:0:(client.c:2228:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1583256866/real 1583256866]  req@ffff939a9e2c4900 x1660159018546432/t0(0) o251->MGC10.9.6.10@tcp@10.9.6.10@tcp:26/25 lens 224/224 e 0 to 1 dl 1583256872 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'umount.0'
[ 5994.048821] Lustre: server umount lustre-MDT0000 complete
[ 5994.292111] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[ 5994.292111] lctl dl | grep ' ST ' || true
[ 5994.650156] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 5994.650156] 			 dmsetup targets | grep -q flakey
[ 5995.000133] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1
[ 5995.350091] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 5995.350091] 			 dmsetup targets | grep -q flakey
[ 5995.696047] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
[ 5996.039360] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1
[ 5996.384483] Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
[ 5996.727194] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
[ 5997.067827] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -olocalrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
[ 5997.294752] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 5997.326213] Lustre: osd-ldiskfs create tunables for lustre-MDT0000
[ 5997.721570] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 5997.773182] Lustre: lustre-MDD0000: changelog on
[ 5997.774238] Lustre: 26177:0:(mdd_device.c:545:mdd_changelog_llog_init()) lustre-MDD0000 : orphan changelog records found, starting from index 0 to index 16, being cleared now
[ 5997.778242] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
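For reference, recovery progress on the restarted MDT can be checked on the MDS with the standard Lustre parameters (the target name lustre-MDT0000 is taken from the log above); this is an illustrative sketch, not output from the failed run:

  # on the MDS, once /mnt/lustre-mds1 is mounted again
  lctl dl | grep MDT                                    # confirm the MDT device is set up
  lctl get_param mdt.lustre-MDT0000.recovery_status     # status should go RECOVERING -> COMPLETE once clients reconnect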

On client1:

[ 6075.180830] Lustre: lustre-MDT0000-mdc-ffff9513656af800: Connection to lustre-MDT0000 (at 10.9.6.10@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6082.181294] Lustre: 19983:0:(client.c:2228:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1583256870/real 1583256870]  req@ffff951364fd4480 x1660164743953984/t0(0) o400->MGC10.9.6.10@tcp@10.9.6.10@tcp:26/25 lens 224/224 e 0 to 1 dl 1583256877 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/0:1.0'
[ 6082.186243] LustreError: 166-1: MGC10.9.6.10@tcp: Connection to MGS (at 10.9.6.10@tcp) was lost; in progress operations using this service will fail

On client2:

[ 6072.132648] Lustre: lustre-MDT0000-mdc-ffff92943a687000: Connection to lustre-MDT0000 (at 10.9.6.10@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6079.133314] Lustre: 11180:0:(client.c:2228:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1583256867/real 1583256867]  req@ffff929423be0900 x1660159108778048/t0(0) o400->MGC10.9.6.10@tcp@10.9.6.10@tcp:26/25 lens 224/224 e 0 to 1 dl 1583256874 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/1:2.0'
[ 6079.136248] LustreError: 166-1: MGC10.9.6.10@tcp: Connection to MGS (at 10.9.6.10@tcp) was lost; in progress operations using this service will fail
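If this reproduces, the client-side import state would show whether reconnect attempts are being made at all. An illustrative check (standard lctl parameters; the mdc device name and MDS NID are taken from the logs above):

  # on either client
  lctl get_param mdc.lustre-MDT0000-mdc-*.import        # "state:" should return to FULL once recovery completes
  lctl ping 10.9.6.10@tcp                               # verify the MDS NID is still reachable over LNet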

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_160a - Timeout occurred after 160 mins, last suite running was sanity

