Details
Type: Bug
Resolution: Fixed
Priority: Blocker
Affects Version/s: Lustre 1.8.9
Fix Version/s: None
Lustre Tag: v1_8_9_WC1_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
MGS/MDS Nodes: client-31vm3(active), client-31vm7(passive)
                 \            /
              1 combined MGS/MDT
OSS Nodes: client-31vm4(active), client-31vm8(passive)
                 \            /
                    7 OSTs
Client Nodes: client-31vm[1,5,6]
IP Addresses:
client-31vm1: 10.10.4.196
client-31vm3: 10.10.4.190
client-31vm4: 10.10.4.191
client-31vm5: 10.10.4.192
client-31vm6: 10.10.4.193
client-31vm7: 10.10.4.194
client-31vm8: 10.10.4.195
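For reference, the node layout above corresponds roughly to a Lustre test-framework configuration file like the following. This is only an illustrative sketch: the variable names follow the usual lustre/tests cfg conventions as best recalled, and none of the values below are taken from the actual configuration files used in this run.

# Hypothetical cfg snippet describing the failover setup above (illustrative only).
FSNAME=lustre
NETTYPE=tcp
mds_HOST=client-31vm3            # active combined MGS/MDS
mdsfailover_HOST=client-31vm7    # passive MGS/MDS failover node
ost_HOST=client-31vm4            # active OSS
ostfailover_HOST=client-31vm8    # passive OSS failover node
OSTCOUNT=7
CLIENTS="client-31vm1,client-31vm5,client-31vm6"
ENABLE_QUOTA=yes
FAILURE_MODE=HARD                # power-cycle the failed node rather than restarting Lustre in place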
Severity: 3
Rank: 6837
Description
While running recovery-mds-scale test_failover_ost, the test failed as follows after running for about 15 hours:
==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node client-31vm6, rc=1
Client load failed during failover. Exiting...
Found the END_RUN_FILE file: /home/autotest/.autotest/shared_dir/2013-02-12/172229-70152412386500/end_run_file
client-31vm6.lab.whamcloud.com
Client load failed on node client-31vm6.lab.whamcloud.com:
/logdir/test_logs/2013-02-12/lustre-b1_8-el5-x86_64-vs-lustre-b1_8-el6-x86_64--review--1_1_1__13121__-70152412386500-172228/recovery-mds-scale.test_failover_ost.run__stdout.client-31vm6.lab.whamcloud.com.log
/logdir/test_logs/2013-02-12/lustre-b1_8-el5-x86_64-vs-lustre-b1_8-el6-x86_64--review--1_1_1__13121__-70152412386500-172228/recovery-mds-scale.test_failover_ost.run__debug.client-31vm6.lab.whamcloud.com.log
2013-02-13 10:50:55 Terminating clients loads ...
Duration: 86400
Server failover period: 900 seconds
Exited after: 56345 seconds
Number of failovers before exit:
mds: 0 times
ost1: 8 times
ost2: 3 times
ost3: 13 times
ost4: 7 times
ost5: 10 times
ost6: 10 times
ost7: 12 times
Status: FAIL: rc=1
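For context, test_failover_ost keeps client loads running while it fails over a randomly chosen OST facet every SERVER_FAILOVER_PERIOD seconds until DURATION expires. The following is a simplified, paraphrased sketch of that loop, not the actual recovery-mds-scale.sh code; fail and check_client_loads stand in for the test-framework helpers and the script is assumed to have sourced lustre/tests/test-framework.sh:

# Simplified paraphrase of the test_failover_ost main loop (sketch only).
DURATION=${DURATION:-86400}                            # total run time in seconds
SERVER_FAILOVER_PERIOD=${SERVER_FAILOVER_PERIOD:-900}  # delay between failovers

start=$SECONDS
while (( SECONDS - start < DURATION )); do
    sleep "$SERVER_FAILOVER_PERIOD"
    facet=ost$(( (RANDOM % OSTCOUNT) + 1 ))            # pick a random OST facet
    echo "Starting failover on $facet"
    fail "$facet"                                      # power-cycle and fail over to the passive OSS
    # The client loads (tar, dd, dbench, ...) run continuously; if any of them
    # exits non-zero, an end_run_file is written and the test is aborted.
    if ! check_client_loads; then
        echo "Client load failed during failover. Exiting..."
        break
    fi
done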
The output of "tar" operation on client-31vm6 showed that:
tar: etc/chef/solo.rb: Cannot open: Input/output error
tar: etc/chef/client.rb: Cannot open: Input/output error
tar: etc/prelink.cache: Cannot open: Input/output error
tar: etc/readahead.conf: Cannot open: Input/output error
tar: etc/localtime: Cannot open: Input/output error
tar: Exiting with failure status due to previous errors
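The failing load here is the tar client load, which repeatedly archives /etc and unpacks it into a per-client directory on the Lustre mount, so an EIO from open() while creating files on Lustre surfaces exactly as the "Cannot open: Input/output error" messages above. A rough sketch of such a load follows; the directory layout and loop structure are assumptions for illustration, not the actual run_tar.sh script:

# Rough sketch of a tar client load running on a Lustre client (illustrative).
MOUNT=${MOUNT:-/mnt/lustre}
TESTDIR=$MOUNT/d0.tar-$(hostname)
mkdir -p "$TESTDIR"
cd "$TESTDIR" || exit 1
while true; do
    # Read /etc from local disk and extract the archive onto Lustre; creating
    # files on Lustre fails with EIO if the MDS cannot create OST objects.
    tar cf - /etc 2>/dev/null | tar xf -
    rc=$?
    if (( rc != 0 )); then
        echo "tar load failed with rc=$rc"
        exit $rc
    fi
    rm -rf "$TESTDIR"/etc
done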
Dmesg on the MDS node (client-31vm3) showed:
Lustre: DEBUG MARKER: Starting failover on ost2
Lustre: 7396:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1426822752375573 sent from lustre-OST0000-osc to NID 10.10.4.195@tcp 7s ago has timed out (7s prior to deadline).
  req@ffff810048893000 x1426822752375573/t0 o13->lustre-OST0000_UUID@10.10.4.195@tcp:7/4 lens 192/528 e 0 to 1 dl 1360781077 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 7396:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 117 previous similar messages
Lustre: lustre-OST0000-osc: Connection to service lustre-OST0000 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: lustre-OST0005-osc: Connection to service lustre-OST0005 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Skipped 4 previous similar messages
Lustre: lustre-OST0006-osc: Connection to service lustre-OST0006 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 7398:0:(import.c:517:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 2s
Lustre: 7398:0:(import.c:517:import_select_connection()) Skipped 59 previous similar messages
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -11
LustreError: 7725:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -5
LustreError: 7725:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7725:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 1/1: rc = -11
LustreError: 7750:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7750:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
Lustre: MGS: haven't heard from client 71727980-7899-77af-8af0-a42b0349985a (at 10.10.4.195@tcp) in 49 seconds. I think it's dead, and I am evicting it.
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 2/1: rc = -11
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
LustreError: 7745:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7745:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 3/1: rc = -11
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
LustreError: 7739:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7739:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 4/1: rc = -11
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
LustreError: 7731:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7731:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
Lustre: 7397:0:(quota_master.c:1724:mds_quota_recovery()) Only 6/7 OSTs are active, abort quota recovery
Lustre: 7397:0:(quota_master.c:1724:mds_quota_recovery()) Skipped 6 previous similar messages
Lustre: lustre-OST0000-osc: Connection restored to service lustre-OST0000 using nid 10.10.4.191@tcp.
Lustre: Skipped 6 previous similar messages
Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
Lustre: Skipped 6 previous similar messages
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 5/1: rc = -11
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Client load failed on node client-31vm6, rc=1
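In the messages above, rc = -11 is -EAGAIN (the object create fails while the OST import is still recovering) and rc = -5 is -EIO, which mds_finish_open() then returns to the client open and which the tar load reports as Input/output errors. When reproducing this, the MDS-side view of its OST imports can be inspected with lctl, for example as below; parameter names are as recalled for Lustre 1.8, so treat them as assumptions and expect the exact output format to vary:

# Run on the MDS (client-31vm3) to see how its OSC imports view the OSTs.
lctl dl                                                 # list devices, including the lustre-OSTxxxx-osc imports
lctl get_param osc.lustre-OST0000-osc.ost_server_uuid   # shows FULL once recovery has completed
lctl get_param osc.*.state                              # import state history for each OST import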
Maloo report: https://maloo.whamcloud.com/test_sets/40f45b4c-760f-11e2-b5e2-52540035b04c