Details
Bug
Resolution: Unresolved
Minor
Description
This issue was created by maloo for S Buisson <sbuisson@ddn.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f0d9c514-dca5-4f09-8180-068456c0d0c8
test_160a failed with the following error:
Timeout occurred after 160 mins, last suite running was sanity
After the MDT failover, the clients appear never to reconnect: the MDS stays in recovery, waiting for the first client to connect.
On the MDS side:
[ 5987.459946] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 5987.641616] Lustre: Failing over lustre-MDT0000
[ 5988.179879] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.43@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 5988.182823] LustreError: Skipped 6 previous similar messages
[ 5993.848997] Lustre: 25449:0:(client.c:2228:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1583256866/real 1583256866] req@ffff939a9e2c4900 x1660159018546432/t0(0) o251->MGC10.9.6.10@tcp@10.9.6.10@tcp:26/25 lens 224/224 e 0 to 1 dl 1583256872 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'umount.0'
[ 5994.048821] Lustre: server umount lustre-MDT0000 complete
[ 5994.292111] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[ 5994.292111] lctl dl | grep ' ST ' || true
[ 5994.650156] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 5994.650156] dmsetup targets | grep -q flakey
[ 5995.000133] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1
[ 5995.350091] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 5995.350091] dmsetup targets | grep -q flakey
[ 5995.696047] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
[ 5996.039360] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1
[ 5996.384483] Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
[ 5996.727194] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
[ 5997.067827] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -olocalrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
[ 5997.294752] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 5997.326213] Lustre: osd-ldiskfs create tunables for lustre-MDT0000
[ 5997.721570] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 5997.773182] Lustre: lustre-MDD0000: changelog on
[ 5997.774238] Lustre: 26177:0:(mdd_device.c:545:mdd_changelog_llog_init()) lustre-MDD0000 : orphan changelog records found, starting from index 0 to index 16, being cleared now
[ 5997.778242] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
On client1:
[ 6075.180830] Lustre: lustre-MDT0000-mdc-ffff9513656af800: Connection to lustre-MDT0000 (at 10.9.6.10@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6082.181294] Lustre: 19983:0:(client.c:2228:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1583256870/real 1583256870] req@ffff951364fd4480 x1660164743953984/t0(0) o400->MGC10.9.6.10@tcp@10.9.6.10@tcp:26/25 lens 224/224 e 0 to 1 dl 1583256877 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/0:1.0'
[ 6082.186243] LustreError: 166-1: MGC10.9.6.10@tcp: Connection to MGS (at 10.9.6.10@tcp) was lost; in progress operations using this service will fail
On client2:
[ 6072.132648] Lustre: lustre-MDT0000-mdc-ffff92943a687000: Connection to lustre-MDT0000 (at 10.9.6.10@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6079.133314] Lustre: 11180:0:(client.c:2228:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1583256867/real 1583256867] req@ffff929423be0900 x1660159108778048/t0(0) o400->MGC10.9.6.10@tcp@10.9.6.10@tcp:26/25 lens 224/224 e 0 to 1 dl 1583256874 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/1:2.0'
[ 6079.136248] LustreError: 166-1: MGC10.9.6.10@tcp: Connection to MGS (at 10.9.6.10@tcp) was lost; in progress operations using this service will fail
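To check the "clients never reconnect" observation against a full client console log, a small script along these lines might help. It is only a sketch: the function names are hypothetical, and it assumes that a successful reconnect is logged as "<import>: Connection restored ...", which may not hold for every Lustre version. It pairs each "Connection to ... was lost" entry with a later "Connection restored" entry for the same import and reports the imports that were never restored.

```python
import re

# Matches a kernel timestamp such as "[ 6075.180830]".
# "[sent 1583256870/real ...]" does not match because it has no decimal number.
ENTRY_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s*")

# Import name is the token right before ": Connection to ... was lost".
LOST_RE = re.compile(r"(\S+): Connection to \S+ \(at [^)]+\) was lost")

# Assumption: a reconnect is logged as "<import>: Connection restored ...".
RESTORED_RE = re.compile(r"(\S+): Connection restored")


def split_entries(log_text):
    """Split a (possibly flattened) console log into (timestamp, message) pairs."""
    parts = ENTRY_RE.split(log_text)
    # parts is [prefix, ts1, msg1, ts2, msg2, ...]
    return [(float(ts), msg.strip()) for ts, msg in zip(parts[1::2], parts[2::2])]


def unrestored_imports(log_text):
    """Return {import_name: time_lost} for imports lost and never restored."""
    lost = {}
    for ts, msg in split_entries(log_text):
        m = LOST_RE.search(msg)
        if m:
            lost[m.group(1)] = ts
            continue
        m = RESTORED_RE.search(msg)
        if m:
            lost.pop(m.group(1), None)
    return lost


if __name__ == "__main__":
    import sys

    # Usage: python check_reconnect.py < client-console.log
    for name, ts in sorted(unrestored_imports(sys.stdin.read()).items()):
        print(f"{name}: lost at {ts}, no restore seen")
```

Because entries are split on the timestamp pattern rather than on newlines, the script should accept both the one-entry-per-line form and logs that were flattened onto a single line.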
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_160a - Timeout occurred after 160 mins, last suite running was sanity