
IO Errors during the failover - SLES 11 SP2 - Lustre 2.4.2

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.2
    • SLES 11 SP2
      Lustre 2.4.2
    • 3
    • 12978

    Description

      We have applied the patch provided in LU-3645, and the customer still reports that the issue can be reproduced.

      Attaching the latest set of logs.

      The issue recurred on 18th Feb.

      Attachments

        Activity

          [LU-4722] IO Errors during the failover - SLES 11 SP2 - Lustre 2.4.2

          In pfscn3,

          Apr 14 16:20:04 pfscn3 kernel: : LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. quota=on. Opts:
          Apr 14 16:20:05 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
          Apr 14 16:20:14 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Will be in recovery for at least 2:30, or until 27 clients reconnect
          Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: recovery is timed out, evict stale exports
          Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: disconnecting 26 stale clients
          Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 2:30, of 27 clients 1 recovered and 26 were evicted.
          Apr 14 16:22:44 pfscn3 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29711772 to 0x0:29714801
          Apr 14 16:25:28 pfscn3 kernel: : Lustre: Failing over pfscdat2-OST0000
          Apr 14 16:25:29 pfscn3 kernel: : Lustre: server umount pfscdat2-OST0000 complete
          Apr 14 16:25:29 pfscn3 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.17.2@o2ib (no target)
          

          In pfscn4

          Apr 11 08:40:02 pfscn4 kernel: : LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. quota=on. Opts:
          Apr 11 08:40:02 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.16.28@o2ib (no target)
          Apr 11 08:40:07 pfscn4 kernel: : Lustre: 3897:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent
          Apr 11 08:41:17 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 0:36, of 36 clients 36 recovered and 0 were evicted.
          Apr 11 08:41:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29380160 to 0x0:29380273
          
          ...
          
          Apr 14 16:19:52 pfscn4 kernel: : Lustre: Failing over pfscdat2-OST0000
          Apr 14 16:19:53 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Not available for connect from 172.26.20.7@o2ib (stopping)
          Apr 14 16:19:53 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Not available for connect from 172.26.20.8@o2ib (stopping)
          Apr 14 16:19:54 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.20.3@o2ib (no target)
          Apr 14 16:19:57 pfscn4 kernel: : LustreError: 137-5: pfscdat2-OST0000_UUID: not available for connect from 172.26.17.2@o2ib (no target)
          Apr 14 16:19:57 pfscn4 kernel: : LustreError: Skipped 1 previous similar message
          Apr 14 16:19:58 pfscn4 kernel: : Lustre: server umount pfscdat2-OST0000 complete
          
          ...
          
          Apr 14 16:25:43 pfscn4 kernel: : LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. quota=on. Opts:
          Apr 14 16:25:44 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
          Apr 14 16:25:47 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects
          Apr 14 16:25:47 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 90192127-1d80-d056-258f-193df5a6691b (at 172.26.4.4@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:30
          Apr 14 16:25:49 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 9b4ef354-d4a2-79a9-196f-2666496727d6 (at 172.26.20.9@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:28
          Apr 14 16:25:49 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:27
          Apr 14 16:25:49 pfscn4 kernel: : Lustre: Skipped 2 previous similar messages
          Apr 14 16:25:51 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:25
          Apr 14 16:25:54 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client 42dd99f1-05b3-2c90-ca73-e27b00e04746 (at 172.26.4.8@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:23
          Apr 14 16:25:54 pfscn4 kernel: : Lustre: Skipped 10 previous similar messages
          Apr 14 16:26:14 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Denying connection for new client b1ee0c39-c7c4-6f09-d8ec-5bf4d696e919 (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:02
          Apr 14 16:26:14 pfscn4 kernel: : Lustre: Skipped 1 previous similar message
          Apr 14 16:26:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: Recovery over after 0:33, of 1 clients 1 recovered and 0 were evicted.
          Apr 14 16:26:20 pfscn4 kernel: : Lustre: pfscdat2-OST0000: deleting orphan objects from 0x0:29711772 to 0x0:29714833
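          For triage, the "Denying connection" messages in the excerpt above can be tallied with a small helper. This is a hypothetical convenience function, not part of Lustre; the message format is taken from the excerpt, and the log path you pass it is up to you.

```shell
# Hypothetical helper: count the distinct client UUIDs that were denied a
# connection during a recovery window, from a syslog file whose messages
# look like the excerpt above.
denied_clients() {
    grep 'Denying connection for new client' "$1" \
        | sed -E 's/.*new client ([0-9a-f-]+) .*/\1/' \
        | sort -u | wc -l | tr -d ' '
}
# e.g. denied_clients /var/log/messages
```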
          

          In pfscn3
          the device "dm-9" was mounted at 16:20:04 as pfscdat2-OST0000. During recovery, Lustre indeed found there were 27 clients (26 normal clients and
          1 client from the MDT), but these 26 normal clients did not recover with pfscn3 (the eviction condition after a recovery timeout is that the client
          either did not need recovery or had no queued replay request). These clients were then deleted, and pfscdat2-OST0000 was unmounted at 16:25:29.

          In pfscn4
          the device "dm-11" was mounted at 16:25:43 as pfscdat2-OST0000, but it did not contain these clients' records, so the clients found themselves evicted.

          The question, then, is why these clients did not connect to pfscn3 to recover.
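          When comparing failover attempts like the two above, the "Recovery over after ..." console messages carry the key numbers. A hypothetical helper (not a Lustre tool; message format taken from the excerpts) reduces each one to its fields:

```shell
# Hypothetical helper: extract window length, known/recovered/evicted client
# counts from "Recovery over after ..." messages in a syslog file.
recovery_summary() {
    grep 'Recovery over after' "$1" \
        | sed -E 's/.*Recovery over after ([0-9:]+), of ([0-9]+) clients ([0-9]+) recovered and ([0-9]+) were evicted.*/window=\1 known=\2 recovered=\3 evicted=\4/'
}
# e.g. recovery_summary /var/log/messages
```

          Run against the pfscn3 log above, this would show the anomaly at a glance: 27 known clients but only 1 recovered.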

          hongchao.zhang Hongchao Zhang added a comment

          Hi Rajesh,

          the uploaded logs do not seem to be compressed correctly:

          hongchao@eric:/scratch/ftp/uploads/LU-4722$ file 2014-04-14-client_lctl_dk_20140414.tgz 
          2014-04-14-client_lctl_dk_20140414.tgz: HTML document text
          hongchao@eric:/scratch/ftp/uploads/LU-4722$ file client_messages_20140414.tgz 
          client_messages_20140414.tgz: HTML document text
          hongchao@eric:/scratch/ftp/uploads/LU-4722$ file server_lctl_dk_20140414.tgz 
          server_lctl_dk_20140414.tgz: HTML document text
          hongchao@eric:/scratch/ftp/uploads/LU-4722$ 
          

          Could you please check? Thanks.

          hongchao.zhang Hongchao Zhang added a comment

          The patch has been applied and we have re-run the test; the results are the same. Attaching the logs for your reference.

          rganesan@ddn.com Rajeshwaran Ganesan added a comment

          This is an EXAScaler setup and we use MOFED. Could you please provide the RPM built against MOFED?

          rganesan@ddn.com Rajeshwaran Ganesan added a comment
          Hi Rajesh,

          you can download the RPMs from http://build.whamcloud.com/job/lustre-reviews/22788/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/ and it is only for the server side.

          Regards,
          Hongchao

          hongchao.zhang Hongchao Zhang added a comment - edited

          Hello - Could you please provide me the RPM with the patch, and also let me know whether it is a client patch or a server patch.

          Thanks,
          Rajesh

          rganesan@ddn.com Rajeshwaran Ganesan added a comment

          Hello - Could you please provide us the source RPM.

          Does it need to be applied on all the MDSes/OSSes, or only on the Lustre clients?

          Thanks

          rganesan@ddn.com Rajeshwaran Ganesan added a comment

          Hi,

          the debug patch is at http://review.whamcloud.com/#/c/9845/; it prints more logs related to client allocation, deletion, and initialization.
          The "ha" and "inode" flags should be added to the debug mask with "lctl set_param debug='+ha +inode'".
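          A sketch of that debug setup, guarded so it is a no-op on machines without the Lustre tools. The debug_mb value is an assumption; size the kernel debug buffer for your reproduction window.

```shell
# Sketch: enable the extra debug flags requested above on the server.
if command -v lctl >/dev/null 2>&1; then
    lctl set_param debug='+ha +inode'   # add "ha" and "inode" to the debug mask
    lctl set_param debug_mb=256         # hypothetical buffer size in MB
    debug_setup=applied
else
    debug_setup=skipped                 # not a Lustre node
fi
echo "debug setup: $debug_setup"
```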

          Thanks

          hongchao.zhang Hongchao Zhang added a comment

          Hello,

          You commented that "I am trying to write a debug patch to collect some debug info to track the issue now." May I know when the debug patch will be available?

          I can install the debug patch and collect the debug logs, including the lctl dk output.

          rganesan@ddn.com Rajeshwaran Ganesan added a comment

          Hi,

          Do you use both pfscn3 and pfscn4 to run the "pfscdat2-OST0000" service, and do these two nodes use the same device "/dev/mapper/ost_pfscdat2_0"?

          According to the logs, pfscn4 found only one client (the connection from the MDT) saved on the OST disk (ost_pfscdat2_0):

          00010000:02000400:7.0:1395331134.233862:0:4638:0:(ldlm_lib.c:1581:target_start_recovery_timer()) pfscdat2-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects
          
          00010000:00080000:23.0:1395331183.996474:0:4723:0:(ldlm_lib.c:748:target_handle_reconnect()) connect export for UUID 'pfscdat2-MDT0000-mdtlov_UUID' at ffff880ffbe7a400, cookie 0x6b5630daed2ba3e
          00010000:00080000:23.0:1395331183.996481:0:4723:0:(ldlm_lib.c:1033:target_handle_connect()) pfscdat2-OST0000: connection from pfscdat2-MDT0000-mdtlov_UUID@172.26.17.2@o2ib recovering/t150323924993 exp ffff880ffbe7a400 cur 1395331183 last 1395331132
          00002000:00080000:23.0:1395331183.996492:0:4723:0:(ofd_obd.c:145:ofd_parse_connect_data()) pfscdat2-OST0000: Received MDS connection for group 0
          

          there are some write operations from iccn2 (172.26.20.2), but none from iccn1 (and only two write operations from iccn3).

          could you please help to get the debug log (lctl dk) before rebooting the OSS (pfscn4)?
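          The capture requested above can be sketched as follows; the output path is illustrative, and the guard makes it a no-op on nodes without the Lustre tools. The point is to drain the kernel debug buffer to a file that survives the reboot.

```shell
# Sketch: dump the Lustre debug buffer to a timestamped file before the OSS
# is rebooted.
out="/tmp/lctl_dk_$(date +%Y%m%d_%H%M%S).log"
if command -v lctl >/dev/null 2>&1; then
    lctl dk > "$out"    # drain the kernel debug buffer to the file
else
    : > "$out"          # placeholder on nodes without Lustre tools
fi
echo "debug log written to $out"
```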

          hongchao.zhang Hongchao Zhang added a comment
          hongchao.zhang Hongchao Zhang added a comment - edited

          Hi,

          yes, we have checked the logs, and this does not appear to be a duplicate of LU-3645.

          This issue is related to the eviction of clients by the OST:

          00010000:02000400:7.0:1395331134.233862:0:4638:0:(ldlm_lib.c:1581:target_start_recovery_timer()) pfscdat2-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects
          00010000:02000400:7.0:1395331134.233874:0:4638:0:(ldlm_lib.c:1077:target_handle_connect()) pfscdat2-OST0000: Denying connection for new client 01e5a9c0-2c56-a938-4eea-7658375f5c04 (at 172.26.20.2@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:30
          00010000:00080000:18.0F:1395331134.557763:0:4632:0:(ldlm_lib.c:1033:target_handle_connect()) pfscdat2-OST0000: connection from 69c34fd9-73e4-94d1-e43e-5c26d4b1ac9f@172.26.20.5@o2ib recovering/t0 exp (null) cur 1395331134 last 0
          00010000:02000400:18.0:1395331134.557774:0:4632:0:(ldlm_lib.c:1077:target_handle_connect()) pfscdat2-OST0000: Denying connection for new client 69c34fd9-73e4-94d1-e43e-5c26d4b1ac9f (at 172.26.20.5@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:29
          00010000:00080000:21.0F:1395331137.127888:0:4723:0:(ldlm_lib.c:1033:target_handle_connect()) pfscdat2-OST0000: connection from 0067c3cd-6643-b8bb-721b-08b1b92dccde@172.26.4.1@o2ib recovering/t0 exp (null) cur 1395331137 last 0
          00010000:02000400:21.0:1395331137.127899:0:4723:0:(ldlm_lib.c:1077:target_handle_connect()) pfscdat2-OST0000: Denying connection for new client 0067c3cd-6643-b8bb-721b-08b1b92dccde (at 172.26.4.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:27
          00010000:00080000:7.0:1395331137.164613:0:4638:0:(ldlm_lib.c:1033:target_handle_connect()) pfscdat2-OST0000: connection from 798993fc-cca8-3e6f-6b5d-5ce89dd9836b@172.26.20.1@o2ib recovering/t146029313516 exp (null) cur 1395331137 last 0
          00010000:02000400:7.0:1395331137.164620:0:4638:0:(ldlm_lib.c:1077:target_handle_connect()) pfscdat2-OST0000: Denying connection for new client 798993fc-cca8-3e6f-6b5d-5ce89dd9836b (at 172.26.20.1@o2ib), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:27
          00010000:00080000:7.0:1395331137.212815:0:4638:0:(ldlm_lib.c:1033:target_handle_connect()) pfscdat2-OST0000: connection from 49ae061e-87a8-feb4-0a3c-96e0e66336ca@172.26.20.6@o2ib recovering/t0 exp (null) cur 1395331137 last 0
          

          pfscdat2-OST0000 found only one client needing recovery; after recovery completed, the other clients were evicted by the OST.
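          The client table that recovery consults lives in the target's last_rcvd file. A hedged sketch of inspecting it on an ldiskfs target: dump the file read-only with debugfs and decode it with lr_reader (shipped with the Lustre utilities). The device path is the one from this ticket; the exact lr_reader invocation varies between Lustre versions, so treat this as an outline, and run it on the OSS with the target quiesced.

```shell
# Sketch: dump and decode the on-disk client records of the OST target.
dev=/dev/mapper/ost_pfscdat2_0
if [ -b "$dev" ] && command -v lr_reader >/dev/null 2>&1; then
    debugfs -c -R 'dump last_rcvd /tmp/last_rcvd' "$dev"   # read-only extract
    lr_reader /tmp/last_rcvd                               # decode client records
    lr_check=ran
else
    lr_check=skipped    # not on the OSS, or device/tools missing
fi
echo "last_rcvd check: $lr_check"
```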

          I am now trying to write a debug patch to collect some debug info to track the issue.


          People

            hongchao.zhang Hongchao Zhang
            rganesan@ddn.com Rajeshwaran Ganesan
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: