[LU-503] replay-single test_70b: FAIL: post-failover df: 1
Created: 14/Jul/11  Updated: 29/Jun/15  Resolved: 07/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 1.8.9
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Nathaniel Clark
Resolution: Cannot Reproduce
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-951 Test failure on test suite replay-sin... Resolved
Severity: 3
Rank (Obsolete): 4153

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/a7e1f190-ada0-11e0-b33f-52540025f9af.

Unfortunately, I cannot reproduce it to fetch more logs.



 Comments   
Comment by Jian Yu [ 24/Aug/11 ]

Lustre Tag: v2_1_0_0_RC0
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/267/
Distro/Arch: RHEL6/x86_64(server), SLES11/x86_64(client)

replay-single test 70b failed as follows:

<~snip~>
Failing mds1 on node client-15-ib
Stopping /mnt/mds1 (opts:)
affected facets: mds1
Failover mds1 to client-15-ib
09:13:25 (1314029605) waiting for client-15-ib network 900 secs ...
09:13:25 (1314029605) network interface is UP
Starting mds1: -o user_xattr,acl  /dev/sda5 /mnt/mds1
client-15-ib: debug=0x33f0404
client-15-ib: subsystem_debug=0xffb7e3ff
client-15-ib: debug_mb=48
Started lustre-MDT0000
client-2-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-5-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
 replay-single test_70b: @@@@@@ FAIL: post-failover df: 1 

Dmesg on the client node showed that:

[ 3969.930998] Lustre: MGC192.168.4.15@o2ib: Connection restored to service MGS using nid 192.168.4.15@o2ib.
[ 3969.967639] LustreError: 11-0: an error occurred while communicating with 192.168.4.15@o2ib. The mds_connect operation failed with -11
[ 3969.967643] LustreError: Skipped 30 previous similar messages
[ 3974.940274] LustreError: 3946:0:(client.c:2573:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88031cac3000 x1377855983327289/t300647711259(300647711259) o-1->lustre-MDT0000_UUID@192.168.4.15@o2ib:12/10 lens 552/544 e 0 to 0 dl 1314029642 ref 2 fl Interpret:RP/ffffffff/ffffffff rc 301/-1
[ 4183.930433] LustreError: 3946:0:(client.c:2518:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
[ 4183.930821] LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[ 4185.527554] LustreError: 9122:0:(lmv_obd.c:1201:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff88033e7d4400), error -4
[ 4185.527561] LustreError: 9122:0:(llite_lib.c:1431:ll_statfs_internal()) md_statfs fails: rc = -4
[ 4185.528223] LustreError: 9156:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8802b431bc00 x1377855984068278/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.15@o2ib:23/10 lens 360/1048 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
[ 4185.528228] LustreError: 9156:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
[ 4185.528240] LustreError: 9156:0:(file.c:158:ll_close_inode_openhandle()) inode 144115473691181069 mdc close failed: rc = -108
[ 4185.595589] Lustre: DEBUG MARKER: replay-single test_70b: @@@@@@ FAIL: post-failover df: 1

Maloo report: https://maloo.whamcloud.com/test_sets/be1fd32a-cd38-11e0-8d02-52540025f9af
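
For readers following the logs, the negative return codes that appear in the client logs in this ticket (-4, -5, -11, -19, -108) follow the kernel convention of negated errno values. The sketch below is only a reading aid, not part of the test framework: the table is hard-coded with Linux errno numbers, and the helper name `describe_rc` is hypothetical.

```python
# Reading aid only: map the negative return codes seen in this ticket's
# client logs to Linux errno names. The table is hard-coded with Linux
# errno numbering so the result does not depend on the local platform.
LINUX_ERRNO = {
    4: "EINTR",        # Interrupted system call
    5: "EIO",          # Input/output error
    11: "EAGAIN",      # Resource temporarily unavailable
    19: "ENODEV",      # No such device
    108: "ESHUTDOWN",  # Cannot send after transport endpoint shutdown
}


def describe_rc(rc: int) -> str:
    """Return a label such as 'EINTR (rc = -4)' for a negative return code.

    describe_rc is a hypothetical helper name used for illustration.
    """
    name = LINUX_ERRNO.get(-rc, "unknown")
    return f"{name} (rc = {rc})"


if __name__ == "__main__":
    for rc in (-4, -5, -11, -19, -108):
        print(describe_rc(rc))
```

Read this way, the `rc = -4` from `ll_statfs_internal()` matches the "Interrupted system call" that `stat` reported, and `rc = -5` matches the "Input/output error" in the later failover run.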

Comment by Sarah Liu [ 02/Dec/11 ]

Hit a similar issue when running replay-single test_52 during 1.8<->2.2 interop testing. Here is the Maloo link:

https://maloo.whamcloud.com/test_sets/e7e3060e-1596-11e1-b189-52540025f9af

Comment by Oleg Drokin [ 03/Jan/12 ]

Only 1.8 client1 logs are available?
Having server logs would be great too.
It's also possible that the issue is entirely on the 1.8 side.

Comment by Jian Yu [ 13/Feb/12 ]

Lustre Tag: v2_1_1_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/41/
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-220.el6)
Network: TCP (1GigE)
FAILURE_MODE=HARD

The replay-single test 44c failed as follows:

<~snip~>
client-27vm1: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
 replay-single test_44c: @@@@@@ FAIL: post-failover df: 1

The console log on client-27vm1 showed that:

09:32:22:LustreError: 166-1: MGC10.10.4.164@tcp: Connection to service MGS via nid 10.10.4.164@tcp was lost; in progress operations using this service will fail.
09:32:57:LustreError: 11-0: an error occurred while communicating with 10.10.4.160@tcp. The obd_ping operation failed with -19
09:32:57:LustreError: Skipped 15 previous similar messages
09:32:57:LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
09:32:57:LustreError: 6692:0:(lmv_obd.c:1201:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff880050090800), error -4
09:32:57:LustreError: 6692:0:(llite_lib.c:1432:ll_statfs_internal()) md_statfs fails: rc = -4
09:32:57:Lustre: lustre-MDT0000-mdc-ffff880050090800: Connection restored to service lustre-MDT0000 using nid 10.10.4.160@tcp.
09:32:57:Lustre: Skipped 11 previous similar messages
09:32:57:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_44c: @@@@@@ FAIL: post-failover df: 1 
09:32:57:Lustre: DEBUG MARKER: replay-single test_44c: @@@@@@ FAIL: post-failover df: 1

Maloo report: https://maloo.whamcloud.com/test_sets/bbbed6ae-55b3-11e1-9aa8-5254004bbbd3
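
When comparing many of these console logs, it can help to pull the source location and return code out of each `LustreError` line. The following is a minimal sketch under stated assumptions: the pattern is inferred only from the lines quoted in this ticket, the first number is assumed to be the process ID, and the function name `parse_lustre_error` is hypothetical.

```python
import re

# Pattern inferred from lines quoted in this ticket, e.g.
#   LustreError: 6692:0:(lmv_obd.c:1201:lmv_statfs()) can't stat MDS #0 (...), error -4
# Assumption: the first number is the process ID.
LINE_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)
# Return codes appear as "rc = -4" or "error -4" in the quoted messages.
RC_RE = re.compile(r"(?:rc = |error )(-\d+)")


def parse_lustre_error(text):
    """Return a dict of fields for a located LustreError line, else None."""
    match = LINE_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    rc = RC_RE.search(fields["msg"])
    fields["rc"] = int(rc.group(1)) if rc else None
    return fields


if __name__ == "__main__":
    sample = (
        "09:32:57:LustreError: 6692:0:(lmv_obd.c:1201:lmv_statfs()) "
        "can't stat MDS #0 (lustre-MDT0000-mdc-ffff880050090800), error -4"
    )
    print(parse_lustre_error(sample))
```

Lines without a source location, such as the `167-0` eviction message, do not match the pattern and return `None`.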

Comment by Jian Yu [ 04/Jun/12 ]

Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/87/
e2fsprogs Build: http://build.whamcloud.com/job/e2fsprogs-master/314/
Distro/Arch: RHEL6.2/x86_64(server), SLES11SP1/x86_64(client)
Network: IB (in-kernel OFED)
ENABLE_QUOTA=yes

replay-single test_70b failed with the same issue: https://maloo.whamcloud.com/test_sets/ab9f7a52-adf7-11e1-b2f9-52540035b04c
replay-dual: https://maloo.whamcloud.com/test_sets/b8a29928-ae10-11e1-ae0d-52540035b04c

Comment by Sarah Liu [ 12/Jun/12 ]

Another failure on the master branch, subtest 52: https://maloo.whamcloud.com/test_sets/51db2c58-b18c-11e1-bb61-52540035b04c

Comment by Jian Yu [ 02/Sep/12 ]

Another instance on b2_1 branch:
replay-dual test 10: https://maloo.whamcloud.com/test_sets/1607ebb8-f452-11e1-b3b2-52540035b04c

Comment by Sarah Liu [ 11/Sep/12 ]

Another instance on b2_3-tag2.2.94 during failover testing:
https://maloo.whamcloud.com/test_sets/e302b38e-f92d-11e1-a1b8-52540035b04c

client 1 console log:

10:39:27:Lustre: DEBUG MARKER: == replay-single test 44c: race in target handle connect ============================================= 10:39:22 (1347039562)
10:39:27:Lustre: DEBUG MARKER: f=/mnt/lustre/fsa-$(hostname); mcreate $f; rm $f
10:39:38:Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
10:39:49:LustreError: 166-1: MGC10.10.4.166@tcp: Connection to MGS (at 10.10.4.166@tcp) was lost; in progress operations using this service will fail
10:40:51:Lustre: Evicted from MGS (at 10.10.4.170@tcp) after server handle changed from 0x75660b397d4087ff to 0xcb53557584749609
10:40:51:Lustre: Skipped 2 previous similar messages
10:40:51:Lustre: MGC10.10.4.166@tcp: Reactivating import
10:41:02:Lustre: lustre-MDT0000-mdc-ffff810058fc9800: Connection to lustre-MDT0000 (at 10.10.4.166@tcp) was lost; in progress operations using this service will wait for recovery to complete
10:41:02:Lustre: Skipped 9 previous similar messages
10:41:33:LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
10:41:33:LustreError: 27916:0:(lmv_obd.c:1197:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff810058fc9800), error -5
10:41:33:LustreError: 27916:0:(llite_lib.c:1546:ll_statfs_internal()) md_statfs fails: rc = -5
10:41:54:LustreError: 166-1: MGC10.10.4.166@tcp: Connection to MGS (at 10.10.4.170@tcp) was lost; in progress operations using this service will fail
10:42:56:Lustre: Evicted from MGS (at MGC10.10.4.166@tcp_0) after server handle changed from 0xcb53557584749609 to 0x75660b397d409240
10:42:56:Lustre: MGC10.10.4.166@tcp: Reactivating import
10:42:56:LustreError: 28275:0:(lmv_obd.c:1197:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff810058fc9800), error -5
10:42:56:LustreError: 28275:0:(llite_lib.c:1546:ll_statfs_internal()) md_statfs fails: rc = -5
10:42:56:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_44c: @@@@@@ FAIL: post-failover df: 1 
10:42:57:Lustre: DEBUG MARKER: replay-single test_44c: @@@@@@ FAIL: post-failover df: 1

client 1 dmesg:

client-28vm1: stat: cannot read file system information for `/mnt/lustre': Input/output error
 replay-single test_44c: @@@@@@ FAIL: post-failover df: 1 

Comment by Jian Yu [ 12/Oct/12 ]

Lustre Tag: v2_3_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/32
Distro/Arch: RHEL6.3/x86_64(server), RHEL5.8/x86_64(client)
Test Group: failover

replay-single test 44c also failed: https://maloo.whamcloud.com/test_sets/63efb1d0-146e-11e2-af8d-52540035b04c

Comment by Jian Yu [ 21/Dec/12 ]

Lustre Tag: v2_1_4_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/159/
Distro/Arch: RHEL6.3/x86_64
Test Group: failover

replay-single test 44c still failed: https://maloo.whamcloud.com/test_sets/5d18ad4c-4bb6-11e2-aa80-52540035b04c

Comment by Keith Mannthey (Inactive) [ 10/Jan/13 ]

On Master: https://maloo.whamcloud.com/test_sessions/02bc9462-5b97-11e2-b205-52540035b04c
test_62

Error: 'post-failover df: 1'
Failure Rate: 2.00% of last 100 executions [all branches]

Same exact error.

Comment by Jian Yu [ 15/Feb/13 ]

Lustre Tag: v1_8_9_WC1_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
Distro/Arch: RHEL5.9/x86_64(server)
Network: TCP (1GigE)
Test Group: failover

The replay-single test_20b also failed with the same issue:
https://maloo.whamcloud.com/test_sets/429815e2-76c8-11e2-bc2f-52540035b04c

Comment by Andreas Dilger [ 07/May/15 ]

Haven't seen this in a couple of years.
