Lustre / LU-2824

recovery-mds-scale test_failover_ost: tar: etc/localtime: Cannot open: Input/output error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.1.5, Lustre 1.8.9
    • Affects Version/s: Lustre 1.8.9
    • None
    • 3
    • 6837

    Description

      While running recovery-mds-scale test_failover_ost, it failed as follows after running for 15 hours:

      ==== Checking the clients loads AFTER failover -- failure NOT OK
      Client load failed on node client-31vm6, rc=1
      Client load failed during failover. Exiting...
      Found the END_RUN_FILE file: /home/autotest/.autotest/shared_dir/2013-02-12/172229-70152412386500/end_run_file
      client-31vm6.lab.whamcloud.com
      Client load  failed on node client-31vm6.lab.whamcloud.com:
      /logdir/test_logs/2013-02-12/lustre-b1_8-el5-x86_64-vs-lustre-b1_8-el6-x86_64--review--1_1_1__13121__-70152412386500-172228/recovery-mds-scale.test_failover_ost.run__stdout.client-31vm6.lab.whamcloud.com.log
      /logdir/test_logs/2013-02-12/lustre-b1_8-el5-x86_64-vs-lustre-b1_8-el6-x86_64--review--1_1_1__13121__-70152412386500-172228/recovery-mds-scale.test_failover_ost.run__debug.client-31vm6.lab.whamcloud.com.log
      2013-02-13 10:50:55 Terminating clients loads ...
      Duration:               86400
      Server failover period: 900 seconds
      Exited after:           56345 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 8 times
      ost2: 3 times
      ost3: 13 times
      ost4: 7 times
      ost5: 10 times
      ost6: 10 times
      ost7: 12 times
      Status: FAIL: rc=1
      

      The output of "tar" operation on client-31vm6 showed that:

      tar: etc/chef/solo.rb: Cannot open: Input/output error
      tar: etc/chef/client.rb: Cannot open: Input/output error
      tar: etc/prelink.cache: Cannot open: Input/output error
      tar: etc/readahead.conf: Cannot open: Input/output error
      tar: etc/localtime: Cannot open: Input/output error
      tar: Exiting with failure status due to previous errors
      

      Dmesg on the MDS node (client-31vm3) showed:

      Lustre: DEBUG MARKER: Starting failover on ost2
      Lustre: 7396:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1426822752375573 sent from lustre-OST0000-osc to NID 10.10.4.195@tcp 7s ago has timed out (7s prior to deadline).
        req@ffff810048893000 x1426822752375573/t0 o13->lustre-OST0000_UUID@10.10.4.195@tcp:7/4 lens 192/528 e 0 to 1 dl 1360781077 ref 1 fl Rpc:N/0/0 rc 0/0
      Lustre: 7396:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 117 previous similar messages
      Lustre: lustre-OST0000-osc: Connection to service lustre-OST0000 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
      Lustre: lustre-OST0005-osc: Connection to service lustre-OST0005 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
      Lustre: Skipped 4 previous similar messages
      Lustre: lustre-OST0006-osc: Connection to service lustre-OST0006 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
      Lustre: 7398:0:(import.c:517:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 2s
      Lustre: 7398:0:(import.c:517:import_select_connection()) Skipped 59 previous similar messages
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -11
      LustreError: 7725:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -5
      LustreError: 7725:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7725:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 1/1: rc = -11
      LustreError: 7750:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7750:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      Lustre: MGS: haven't heard from client 71727980-7899-77af-8af0-a42b0349985a (at 10.10.4.195@tcp) in 49 seconds. I think it's dead, and I am evicting it.
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 2/1: rc = -11
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
      LustreError: 7745:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7745:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 3/1: rc = -11
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
      LustreError: 7739:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7739:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 4/1: rc = -11
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
      LustreError: 7731:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7731:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      Lustre: 7397:0:(quota_master.c:1724:mds_quota_recovery()) Only 6/7 OSTs are active, abort quota recovery
      Lustre: 7397:0:(quota_master.c:1724:mds_quota_recovery()) Skipped 6 previous similar messages
      Lustre: lustre-OST0000-osc: Connection restored to service lustre-OST0000 using nid 10.10.4.191@tcp.
      Lustre: Skipped 6 previous similar messages
      Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
      Lustre: Skipped 6 previous similar messages
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 5/1: rc = -11
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
      Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark Client load failed on node client-31vm6, rc=1
      

      Maloo report: https://maloo.whamcloud.com/test_sets/40f45b4c-760f-11e2-b5e2-52540035b04c

      Attachments

        Activity

          [LU-2824] recovery-mds-scale test_failover_ost: tar: etc/localtime: Cannot open: Input/output error
          nozaki Hiroya Nozaki (Inactive) added a comment (edited)

          Hi all, sorry to jump in suddenly, but I'd like to add a few comments.

          1) In b2_1, IMHO, we should handle the -ENODEV case in the same way as the -EBUSY case, because when obd_stopping has already been set, mdtlov might already have been released (a sketch of this idea follows the two listings below).

          "__mds_lov_synchronize()"
          static int __mds_lov_synchronize(void *data)
          {
                  struct mds_lov_sync_info *mlsi = data;
                  struct obd_device *obd = mlsi->mlsi_obd;
                  struct obd_device *watched = mlsi->mlsi_watched;
                  struct mds_obd *mds = &obd->u.mds;
                  struct obd_uuid *uuid;
                  __u32  idx = mlsi->mlsi_index;
                  struct mds_group_info mgi;
                  struct llog_ctxt *ctxt;
                  int rc = 0;
                  ENTRY;
          
                  OBD_FREE_PTR(mlsi);
          
                  LASSERT(obd);
                  LASSERT(watched);
                  uuid = &watched->u.cli.cl_target_uuid;
                  LASSERT(uuid);
          
                  cfs_down_read(&mds->mds_notify_lock);          ///////////////////
                  if (obd->obd_stopping || obd->obd_fail) <----- // when the case //
                          GOTO(out, rc = -ENODEV);               ///////////////////
          
                  ....
          
                  EXIT;
          out:
                  cfs_up_read(&mds->mds_notify_lock);
                  if (rc) {
                          /* Deactivate it for safety */
                          CERROR("%s sync failed %d, deactivating\n", obd_uuid2str(uuid),
                                 rc);                                          ///////////////////////////////////////////////////////////////////////////////
                  if (!obd->obd_stopping && mds->mds_lov_obd && <----- // there must be an address, but we don't know whether it's been freed or not //
                              !mds->mds_lov_obd->obd_stopping && !watched->obd_stopping)
                                  obd_notify(mds->mds_lov_obd, watched,
                                             OBD_NOTIFY_INACTIVE, NULL);
                  }
          
                  class_decref(obd, "mds_lov_synchronize", obd);
                  return rc;
          }
          
          "mds_precleanup"
          static int mds_precleanup(struct obd_device *obd, enum obd_cleanup_stage stage)
          {
                  struct mds_obd *mds = &obd->u.mds;
                  struct llog_ctxt *ctxt;
                  int rc = 0;
                  ENTRY;
          
                  switch (stage) {
                  case OBD_CLEANUP_EARLY:
                          break;
                  case OBD_CLEANUP_EXPORTS:
                          mds_lov_early_clean(obd);
                          cfs_down_write(&mds->mds_notify_lock);
                          mds_lov_disconnect(obd);
                          mds_lov_clean(obd);
                          ctxt = llog_get_context(obd, LLOG_CONFIG_ORIG_CTXT);
                          if (ctxt)
                                  llog_cleanup(ctxt);
                          ctxt = llog_get_context(obd, LLOG_LOVEA_ORIG_CTXT);
                          if (ctxt)
                                  llog_cleanup(ctxt);
                          rc = obd_llog_finish(obd, 0);
                          mds->mds_lov_exp = NULL;
                          cfs_up_write(&mds->mds_notify_lock);
                          break;
                  }
                  RETURN(rc);
          }
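
          A minimal, hypothetical sketch of the idea in 1): treat -ENODEV from the obd_stopping/obd_fail check the same way as -EBUSY, i.e. skip the "deactivate for safety" notification when the device is already being torn down. This is only an illustration based on the listing above, not the actual b2_1 patch:

          if (rc) {
                  if (rc == -EBUSY || rc == -ENODEV) {
                          /* Shutdown, or a sync already in progress: mdtlov may
                           * already be released, so do not touch mds->mds_lov_obd
                           * and do not deactivate the OSC. */
                          CDEBUG(D_HA, "%s sync skipped: rc = %d\n",
                                 obd_uuid2str(uuid), rc);
                  } else if (!obd->obd_stopping && mds->mds_lov_obd &&
                             !mds->mds_lov_obd->obd_stopping &&
                             !watched->obd_stopping) {
                          /* Deactivate it for safety */
                          CERROR("%s sync failed %d, deactivating\n",
                                 obd_uuid2str(uuid), rc);
                          obd_notify(mds->mds_lov_obd, watched,
                                     OBD_NOTIFY_INACTIVE, NULL);
                  }
          }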
          

          2) I'm not entirely sure, but the backported patch of LU-1291 (http://review.whamcloud.com/#change,2708) looks necessary to me.
          3) Don't we need the mds_notify_lock logic for b1_8 too?

          Anyway, I'll be glad if you find my comments helpful.


          niu Niu Yawei (Inactive) added a comment -

          Peter, we need to port it to b2_1. Master has a different orphan cleanup mechanism, so it doesn't need such a fix.

          patch for b2_1: http://review.whamcloud.com/5462
          pjones Peter Jones added a comment -

          Niu

          Do we need this same change on master and/or b2_1?

          Peter

          yujian Jian Yu added a comment -

          Hi Oleg,

          Sorry for the confusion, the original issue in this ticket occurred after failing over an OST, and the MDS dmesg log showed:

          LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -11
          LustreError: 7725:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -5
          LustreError: 7725:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
          LustreError: 7725:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
          

          Maloo report: https://maloo.whamcloud.com/test_sets/40f45b4c-760f-11e2-b5e2-52540035b04c

          So, it seems the original issue is a duplicate of LU-1475.

          The second issue in this ticket occurred after failing over the MDS, and the MDS dmesg log showed:

          LustreError: 3182:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 1/7: rc = -16
          LustreError: 2768:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 2 llog-records failed: -2
          LustreError: 3182:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0001_UUID failed at mds_lov_clear_orphans: -16
          LustreError: 3182:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0001_UUID sync failed -16, deactivating
          Lustre: Service thread pid 2786 completed after 102.31s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
          LustreError: 2793:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 2622037: rc = -5
          LustreError: 2793:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
          

          Maloo report: https://maloo.whamcloud.com/test_sets/61cf03e0-76c7-11e2-bc2f-52540035b04c

          For the second issue, as you suggested, we need to figure out why the OST failed to clear orphans. After looking into the console and debug logs on the OSS node, I did not see any failure messages. However, on the MDS, I found:

          Lustre: Service thread pid 2786 was inactive for 60.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
          Pid: 2786, comm: ll_mdt_01
          
          Call Trace:
           [<ffffffff8b03f220>] lustre_pack_request+0x630/0x6f0 [ptlrpc]
           [<ffffffff8006389f>] schedule_timeout+0x8a/0xad
           [<ffffffff8009a950>] process_timeout+0x0/0x5
           [<ffffffff8b105bd5>] osc_create+0xc75/0x13d0 [osc]
           [<ffffffff8005c3be>] cache_alloc_refill+0x108/0x188
           [<ffffffff8008f372>] default_wake_function+0x0/0xe
           [<ffffffff8b1b4edb>] qos_remedy_create+0x45b/0x570 [lov]
           [<ffffffff8002e4db>] __wake_up+0x38/0x4f
           [<ffffffff8008eb6b>] dequeue_task+0x18/0x37
           [<ffffffff8b1aedf3>] lov_fini_create_set+0x243/0x11e0 [lov]
           [<ffffffff8b1a2b72>] lov_create+0x1552/0x1860 [lov]
           [<ffffffff8b2952e5>] ldiskfs_mark_iloc_dirty+0x4a5/0x540 [ldiskfs]
           [<ffffffff8008f372>] default_wake_function+0x0/0xe
           [<ffffffff8b38eb8a>] mds_finish_open+0x1fea/0x43e0 [mds]
           [<ffffffff80019d5a>] __getblk+0x25/0x22c
           [<ffffffff8b28c16b>] __ldiskfs_handle_dirty_metadata+0xdb/0x110 [ldiskfs]
           [<ffffffff8b2952e5>] ldiskfs_mark_iloc_dirty+0x4a5/0x540 [ldiskfs]
           [<ffffffff8b295b07>] ldiskfs_mark_inode_dirty+0x187/0x1e0 [ldiskfs]
           [<ffffffff80063af9>] mutex_lock+0xd/0x1d
           [<ffffffff8b395e11>] mds_open+0x2f01/0x386b [mds]
           [<ffffffff8b03d5f1>] lustre_swab_buf+0x81/0x170 [ptlrpc]
           [<ffffffff8000d585>] dput+0x2c/0x114
           [<ffffffff8b36c0c5>] mds_reint_rec+0x365/0x550 [mds]
           [<ffffffff8b396d3e>] mds_update_unpack+0x1fe/0x280 [mds]
           [<ffffffff8b35eeda>] mds_reint+0x35a/0x420 [mds]
           [<ffffffff8b35ddea>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
           [<ffffffff8b368c0e>] mds_intent_policy+0x49e/0xc10 [mds]
           [<ffffffff8affe270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
           [<ffffffff8affbeb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
           [<ffffffff8aff87fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
           [<ffffffff8b020870>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
           [<ffffffff8b01db29>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
           [<ffffffff8b367b40>] mds_handle+0x40e0/0x4d10 [mds]
           [<ffffffff8aed5868>] libcfs_ip_addr2str+0x38/0x40 [libcfs]
           [<ffffffff8aed5c7e>] libcfs_nid2str+0xbe/0x110 [libcfs]
           [<ffffffff8b048af5>] ptlrpc_server_log_handling_request+0x105/0x130 [ptlrpc]
           [<ffffffff8b04b874>] ptlrpc_server_handle_request+0x984/0xe00 [ptlrpc]
           [<ffffffff8b04bfd5>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
           [<ffffffff8008d7a6>] __wake_up_common+0x3e/0x68
           [<ffffffff8b04cf16>] ptlrpc_main+0xf16/0x10e0 [ptlrpc]
           [<ffffffff8005dfc1>] child_rip+0xa/0x11
          

          And the debug log on the MDS showed:

          00000004:02000000:0:1360747003.206214:0:3136:0:(mds_lov.c:1049:__mds_lov_synchronize()) MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
          00020000:00080000:0:1360747003.206224:0:3136:0:(lov_obd.c:1116:lov_clear_orphans()) clearing orphans only for lustre-OST0000_UUID
          00020000:01000000:0:1360747003.206225:0:3136:0:(lov_obd.c:1141:lov_clear_orphans()) Clear orphans for 0:lustre-OST0000_UUID
          00000008:00080000:0:1360747003.206229:0:3136:0:(osc_create.c:556:osc_create()) lustre-OST0000-osc: oscc recovery started - delete to 23015
          00000008:00080000:0:1360747003.206238:0:3136:0:(osc_request.c:406:osc_real_create()) @@@ delorphan from OST integration  req@ffff810068000c00 x1426846474323651/t0 o5->lustre-OST0000_UUID@10.10.4.199@tcp:28/4 lens 400/592 e 0 to 1 dl 0 ref 1 fl New:/0/0 rc 0/0
          ......
          00000004:02000000:0:1360747003.206849:0:3181:0:(mds_lov.c:1049:__mds_lov_synchronize()) MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
          00020000:00080000:0:1360747003.206852:0:3181:0:(lov_obd.c:1116:lov_clear_orphans()) clearing orphans only for lustre-OST0000_UUID
          00020000:01000000:0:1360747003.206853:0:3181:0:(lov_obd.c:1141:lov_clear_orphans()) Clear orphans for 0:lustre-OST0000_UUID
          00020000:00020000:0:1360747003.206855:0:3181:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 0/7: rc = -16
          00000004:00020000:0:1360747003.207975:0:3181:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0000_UUID failed at mds_lov_clear_orphans: -16
          00000004:00020000:0:1360747003.209032:0:3181:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0000_UUID sync failed -16, deactivating
          00020000:01000000:0:1360747003.209906:0:3181:0:(lov_obd.c:589:lov_set_osc_active()) Marking OSC lustre-OST0000_UUID inactive
          

          The clear-orphans operation was performed twice for the same OST, and in the second run osc_create() returned -EBUSY because the OSCC_FLAG_SYNC_IN_PROGRESS flag was still set by the slow ll_mdt service thread. __mds_lov_synchronize() then deactivated the OSC. Other OSCs hit the same issue and were also deactivated.
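
          For reference, a simplified, hypothetical sketch of the kind of guard in osc_create() that produces the -EBUSY above (assuming oscc_flags is protected by an oscc_lock spinlock, as in the osc create-recovery code; the exact b1_8 code may differ):

          spin_lock(&oscc->oscc_lock);
          if (oscc->oscc_flags & OSCC_FLAG_SYNC_IN_PROGRESS) {
                  /* Another thread (here, the slow ll_mdt one) is still clearing
                   * orphans for this OST: bail out instead of issuing a second
                   * delorphan request.  The -EBUSY propagates up through
                   * lov_clear_orphans() to __mds_lov_synchronize(). */
                  spin_unlock(&oscc->oscc_lock);
                  return -EBUSY;
          }
          oscc->oscc_flags |= OSCC_FLAG_SYNC_IN_PROGRESS;
          spin_unlock(&oscc->oscc_lock);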

          Niu proposed a patch to fix this issue in http://review.whamcloud.com/5450. Could you please inspect it? Thanks.

          green Oleg Drokin added a comment -

          This seems to be a very similar problem, though unlike LU-1475, here the MDS gets an error from the OST (in clear orphans) that prompts it to deactivate the export, which leads to the object creation failures, whereas in LU-1475 the failure is sporadic, without any real error visible other than the OST restarting.

          So for this bug, I imagine somebody needs to figure out why the OST failed to clear orphans as requested.

          yujian Jian Yu added a comment -

          Hi Oleg,

          Is the original issue in this ticket a duplicate of LU-1475?

          yujian Jian Yu added a comment -

          Lustre Tag: v1_8_9_WC1_RC1
          Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
          Distro/Arch: RHEL5.9/x86_64
          Network: TCP (1GigE)
          ENABLE_QUOTA=yes
          FAILURE_MODE=HARD

          While running recovery-mds-scale test_failover_mds, it also failed with a similar issue after running for 6 hours (the MDS failed over 26 times).

          The output of "tar" operation on client node (client-32vm6) showed that:

          <~snip~>
          tar: etc/default/nss: Cannot open: Input/output error
          tar: etc/default/useradd: Cannot open: Input/output error
          tar: Error exit delayed from previous errors
          

          Dmesg on the MDS node (client-32vm3) showed:

          Lustre: lustre-OST0000-osc: Connection restored to service lustre-OST0000 using nid 10.10.4.199@tcp.
          Lustre: lustre-OST0002-osc: Connection restored to service lustre-OST0002 using nid 10.10.4.199@tcp.
          Lustre: Skipped 1 previous similar message
          Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
          LustreError: 3181:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 0/7: rc = -16
          LustreError: 3181:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0000_UUID failed at mds_lov_clear_orphans: -16
          LustreError: 3181:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0000_UUID sync failed -16, deactivating
          Lustre: MDS lustre-MDT0000: lustre-OST0001_UUID now active, resetting orphans
          Lustre: Skipped 2 previous similar messages
          LustreError: 3182:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 1/7: rc = -16
          LustreError: 2768:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 2 llog-records failed: -2
          LustreError: 3182:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0001_UUID failed at mds_lov_clear_orphans: -16
          LustreError: 3182:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0001_UUID sync failed -16, deactivating
          Lustre: Service thread pid 2786 completed after 102.31s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
          LustreError: 2793:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 2622037: rc = -5
          LustreError: 2793:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
          LustreError: 2794:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 2622037: rc = -5
          LustreError: 2794:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
          LustreError: 2792:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 2622052: rc = -5
          LustreError: 2792:0:(mds_open.c:442:mds_create_objects()) Skipped 581 previous similar messages
          LustreError: 2792:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
          LustreError: 2792:0:(mds_open.c:827:mds_finish_open()) Skipped 581 previous similar messages
          LustreError: 2769:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 2 of 2 llog-records failed: -2
          LustreError: 2769:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 59 of 122 llog-records failed: -2
          LustreError: 2769:0:(llog_server.c:466:llog_origin_handle_cancel()) Skipped 6 previous similar messages
          LustreError: 3189:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 51 of 122 llog-records failed: -2
          

          Maloo report: https://maloo.whamcloud.com/test_sets/61cf03e0-76c7-11e2-bc2f-52540035b04c

          yujian Jian Yu added a comment -

          We used to hit LU-463 while running recovery-mds-scale test_failover_ost on the Lustre b1_8 build.


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: yujian Jian Yu
            Votes: 0
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved: