[LU-503] replay-single test_70b: FAIL: post-failover df: 1 Created: 14/Jul/11 Updated: 29/Jun/15 Resolved: 07/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 4153 | ||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/a7e1f190-ada0-11e0-b33f-52540025f9af. Unfortunately, I cannot reproduce it to fetch more logs. |
| Comments |
| Comment by Jian Yu [ 24/Aug/11 ] |
|
Lustre Tag: v2_1_0_0_RC0

replay-single test 70b failed as follows:

<~snip~>
Failing mds1 on node client-15-ib
Stopping /mnt/mds1 (opts:)
affected facets: mds1
Failover mds1 to client-15-ib
09:13:25 (1314029605) waiting for client-15-ib network 900 secs ...
09:13:25 (1314029605) network interface is UP
Starting mds1: -o user_xattr,acl /dev/sda5 /mnt/mds1
client-15-ib: debug=0x33f0404
client-15-ib: subsystem_debug=0xffb7e3ff
client-15-ib: debug_mb=48
Started lustre-MDT0000
client-2-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-5-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
replay-single test_70b: @@@@@@ FAIL: post-failover df: 1

Dmesg on the client node showed that:

[ 3969.930998] Lustre: MGC192.168.4.15@o2ib: Connection restored to service MGS using nid 192.168.4.15@o2ib.
[ 3969.967639] LustreError: 11-0: an error occurred while communicating with 192.168.4.15@o2ib. The mds_connect operation failed with -11
[ 3969.967643] LustreError: Skipped 30 previous similar messages
[ 3974.940274] LustreError: 3946:0:(client.c:2573:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88031cac3000 x1377855983327289/t300647711259(300647711259) o-1->lustre-MDT0000_UUID@192.168.4.15@o2ib:12/10 lens 552/544 e 0 to 0 dl 1314029642 ref 2 fl Interpret:RP/ffffffff/ffffffff rc 301/-1
[ 4183.930433] LustreError: 3946:0:(client.c:2518:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
[ 4183.930821] LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[ 4185.527554] LustreError: 9122:0:(lmv_obd.c:1201:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff88033e7d4400), error -4
[ 4185.527561] LustreError: 9122:0:(llite_lib.c:1431:ll_statfs_internal()) md_statfs fails: rc = -4
[ 4185.528223] LustreError: 9156:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8802b431bc00 x1377855984068278/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.15@o2ib:23/10 lens 360/1048 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
[ 4185.528228] LustreError: 9156:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
[ 4185.528240] LustreError: 9156:0:(file.c:158:ll_close_inode_openhandle()) inode 144115473691181069 mdc close failed: rc = -108
[ 4185.595589] Lustre: DEBUG MARKER: replay-single test_70b: @@@@@@ FAIL: post-failover df: 1

Maloo report: https://maloo.whamcloud.com/test_sets/be1fd32a-cd38-11e0-8d02-52540025f9af |
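For readers triaging these logs: the negative return codes are negated Linux errno values, which is why the `stat` failure reads "Interrupted system call" next to `rc = -4`. The following is only a minimal sketch for decoding them, not part of the Lustre test framework. The helper names (`decode_rc`, `decode_lines`) and the regular expression are hypothetical, and the symbolic names it prints depend on the platform's errno table, so it assumes it is run on Linux. On Linux, -11 is EAGAIN, -4 is EINTR, and -108 is ESHUTDOWN; the -5 and -19 seen in later comments are EIO and ENODEV.

```python
import errno
import os
import re

# Message fragments copied from the client dmesg in the comment above.
LOG_LINES = [
    "The mds_connect operation failed with -11",
    "can't stat MDS #0 (lustre-MDT0000-mdc-ffff88033e7d4400), error -4",
    "md_statfs fails: rc = -4",
    "mdc close failed: rc = -108",
]

# Hypothetical pattern: matches the three phrasings used in these logs.
RC_PATTERN = re.compile(r"(?:failed with|error|rc =)\s*(-\d+)")


def decode_rc(rc):
    """Map a negative kernel return code to (symbolic name, description)."""
    code = -rc
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def decode_lines(lines):
    """Return (rc, name, description) for every line carrying a return code."""
    decoded = []
    for line in lines:
        match = RC_PATTERN.search(line)
        if match:
            rc = int(match.group(1))
            decoded.append((rc, *decode_rc(rc)))
    return decoded


if __name__ == "__main__":
    for rc, name, description in decode_lines(LOG_LINES):
        print(f"{rc:>5}  {name:<10}  {description}")
```

Read this way, the sequence in the log is a connect attempt that is asked to retry (-11), a `statfs` interrupted by the eviction (-4), and a close against an import that has already been shut down (-108).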
| Comment by Sarah Liu [ 02/Dec/11 ] |
|
Hit a similar issue when running replay-single test_52 during 1.8<->2.2 interop testing. Here is the Maloo link: https://maloo.whamcloud.com/test_sets/e7e3060e-1596-11e1-b189-52540025f9af |
| Comment by Oleg Drokin [ 03/Jan/12 ] |
|
Only 1.8 client1 logs are available? |
| Comment by Jian Yu [ 13/Feb/12 ] |
|
Lustre Tag: v2_1_1_0_RC2

The replay-single test 44c failed as follows:

<~snip~>
client-27vm1: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
replay-single test_44c: @@@@@@ FAIL: post-failover df: 1

The console log on client-27vm1 showed that:

09:32:22:LustreError: 166-1: MGC10.10.4.164@tcp: Connection to service MGS via nid 10.10.4.164@tcp was lost; in progress operations using this service will fail.
09:32:57:LustreError: 11-0: an error occurred while communicating with 10.10.4.160@tcp. The obd_ping operation failed with -19
09:32:57:LustreError: Skipped 15 previous similar messages
09:32:57:LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
09:32:57:LustreError: 6692:0:(lmv_obd.c:1201:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff880050090800), error -4
09:32:57:LustreError: 6692:0:(llite_lib.c:1432:ll_statfs_internal()) md_statfs fails: rc = -4
09:32:57:Lustre: lustre-MDT0000-mdc-ffff880050090800: Connection restored to service lustre-MDT0000 using nid 10.10.4.160@tcp.
09:32:57:Lustre: Skipped 11 previous similar messages
09:32:57:Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_44c: @@@@@@ FAIL: post-failover df: 1
09:32:57:Lustre: DEBUG MARKER: replay-single test_44c: @@@@@@ FAIL: post-failover df: 1

Maloo report: https://maloo.whamcloud.com/test_sets/bbbed6ae-55b3-11e1-9aa8-5254004bbbd3 |
| Comment by Jian Yu [ 04/Jun/12 ] |
|
Lustre Tag: v2_1_2_RC2 replay-single test_70b failed with the same issue: https://maloo.whamcloud.com/test_sets/ab9f7a52-adf7-11e1-b2f9-52540035b04c |
| Comment by Sarah Liu [ 12/Jun/12 ] |
|
Another failure on the master branch, subtest 52: https://maloo.whamcloud.com/test_sets/51db2c58-b18c-11e1-bb61-52540035b04c |
| Comment by Jian Yu [ 02/Sep/12 ] |
|
Another instance on b2_1 branch: |
| Comment by Sarah Liu [ 11/Sep/12 ] |
|
Another instance on b2_3-tag2.2.94 during failover testing.

client 1 console log:

10:39:27:Lustre: DEBUG MARKER: == replay-single test 44c: race in target handle connect ============================================= 10:39:22 (1347039562)
10:39:27:Lustre: DEBUG MARKER: f=/mnt/lustre/fsa-$(hostname); mcreate $f; rm $f
10:39:38:Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
10:39:49:LustreError: 166-1: MGC10.10.4.166@tcp: Connection to MGS (at 10.10.4.166@tcp) was lost; in progress operations using this service will fail
10:40:51:Lustre: Evicted from MGS (at 10.10.4.170@tcp) after server handle changed from 0x75660b397d4087ff to 0xcb53557584749609
10:40:51:Lustre: Skipped 2 previous similar messages
10:40:51:Lustre: MGC10.10.4.166@tcp: Reactivating import
10:41:02:Lustre: lustre-MDT0000-mdc-ffff810058fc9800: Connection to lustre-MDT0000 (at 10.10.4.166@tcp) was lost; in progress operations using this service will wait for recovery to complete
10:41:02:Lustre: Skipped 9 previous similar messages
10:41:33:LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
10:41:33:LustreError: 27916:0:(lmv_obd.c:1197:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff810058fc9800), error -5
10:41:33:LustreError: 27916:0:(llite_lib.c:1546:ll_statfs_internal()) md_statfs fails: rc = -5
10:41:54:LustreError: 166-1: MGC10.10.4.166@tcp: Connection to MGS (at 10.10.4.170@tcp) was lost; in progress operations using this service will fail
10:42:56:Lustre: Evicted from MGS (at MGC10.10.4.166@tcp_0) after server handle changed from 0xcb53557584749609 to 0x75660b397d409240
10:42:56:Lustre: MGC10.10.4.166@tcp: Reactivating import
10:42:56:LustreError: 28275:0:(lmv_obd.c:1197:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff810058fc9800), error -5
10:42:56:LustreError: 28275:0:(llite_lib.c:1546:ll_statfs_internal()) md_statfs fails: rc = -5
10:42:56:Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_44c: @@@@@@ FAIL: post-failover df: 1
10:42:57:Lustre: DEBUG MARKER: replay-single test_44c: @@@@@@ FAIL: post-failover df: 1

client 1 dmesg:

client-28vm1: stat: cannot read file system information for `/mnt/lustre': Input/output error
replay-single test_44c: @@@@@@ FAIL: post-failover df: 1 |
| Comment by Jian Yu [ 12/Oct/12 ] |
|
Lustre Tag: v2_3_0_RC2 replay-single test 44c also failed: https://maloo.whamcloud.com/test_sets/63efb1d0-146e-11e2-af8d-52540035b04c |
| Comment by Jian Yu [ 21/Dec/12 ] |
|
Lustre Tag: v2_1_4_RC1 replay-single test 44c still failed: https://maloo.whamcloud.com/test_sets/5d18ad4c-4bb6-11e2-aa80-52540035b04c |
| Comment by Keith Mannthey (Inactive) [ 10/Jan/13 ] |
|
On Master: https://maloo.whamcloud.com/test_sessions/02bc9462-5b97-11e2-b205-52540035b04c Error: 'post-failover df: 1' Same exact error. |
| Comment by Jian Yu [ 15/Feb/13 ] |
|
Lustre Tag: v1_8_9_WC1_RC1 The replay-single test_20b also failed with the same issue: |
| Comment by Andreas Dilger [ 07/May/15 ] |
|
Haven't seen this in a couple of years. |