[LU-7938] Recovery on secondary OSS node stalled

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Blocker
    • Fix Version: Lustre 2.9.0
    • Affects Versions: Lustre 2.8.0, Lustre 2.9.0
    • Environment: lola
      build: 2.8 GA + patches
    • Severity: 3

    Description

      Error happens during soak testing of build '20160324' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160324). DNE is enabled. MDTs had been formatted with ldiskfs, OSTs with zfs. MDS and OSS nodes are configured in an HA active-active failover configuration. MDS nodes operated with 1 MDT per MDS, while OSS nodes ran 4 OSTs per node.
      Nodes lola-4 and lola-5 form a HA cluster.

      Event history

      • 2016-03-29 07:48:04,307:fsmgmt.fsmgmt:INFO triggering fault oss_failover of node lola-4
        powercycle node
      • 2016-03-29 07:52:27,876:fsmgmt.fsmgmt:INFO lola-4 is up
      • 2016-03-29 07:53:09,557:fsmgmt.fsmgmt:INFO zpool import and mount of lola-4's OSTs complete
      • Recovery did not complete even after recovery_time reached zero
      • 2016-03-29 08:03 (approximately) Aborted recovery manually (lctl --device ... abort_recovery; see the sketch below)
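
      For reference, a minimal sketch of how such a per-target recovery abort is issued on the OSS; the device name below is illustrative (taken from OST names that appear later in this ticket), not the exact device used at the time:

      # list the local OBD devices to find the name/number of the stalled target
      lctl dl
      # abort recovery on one stalled target
      lctl --device soaked-OST0002 abort_recovery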

      Attached files:
      messages, console and debug logs taken before the recovery was aborted (lustre-log-20160329-0759-recovery-stalled) and after it was aborted (lustre-log-20160329-0803-recovery-aborted)

      Attachments

        Activity

          [LU-7938] Recovery on secondary OSS node stalled

          Failover works after reformat of Lustre FS. Ticket can be closed.

          heckes Frank Heckes (Inactive) added a comment

          failover.node is specified for the mkfs.lustre command via the option --servicenode

          heckes Frank Heckes (Inactive) added a comment
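
          For illustration only, a sketch of what a format command of this form could look like for one of the zfs OSTs, using the dataset name, index and NIDs that appear in the reformat output quoted elsewhere in this ticket; it assumes the zpool already exists and is not the exact command used on lola:

          # format OST0012 (index 18) on an existing zpool, registering both HA peers as service nodes
          mkfs.lustre --ost --backfstype=zfs --fsname=soaked --index=18 \
              --mgsnode=192.168.1.108@o2ib10 --mgsnode=192.168.1.109@o2ib10 \
              --servicenode=192.168.1.102@o2ib10 --servicenode=192.168.1.103@o2ib10 \
              soaked-ost18/ost18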

          Reformat is ongoing. I'm going to start soak again once it has completed.

          heckes Frank Heckes (Inactive) added a comment

          I set the parameter 'lustre:failover.node' on all OSTs to the appropriate values for the failover nodes (see attached file ost-failover-setttings-20160726_0702).
          Unfortunately this didn't solve the problem: recovery still stalls on the failover node.

          I also reformatted the soak FS. During the reformat the parameters get written with, for example, the following content:

          192.168.1.102: mkfs_cmd = zfs create -o canmount=off -o xattr=sa soaked-ost18/ost18
          192.168.1.102: Writing soaked-ost18/ost18 properties
          192.168.1.102:   lustre:version=1
          192.168.1.102:   lustre:flags=4194
          192.168.1.102:   lustre:index=18
          192.168.1.102:   lustre:fsname=soaked
          192.168.1.102:   lustre:svname=soaked:OST0012
          192.168.1.102:   lustre:mgsnode=192.168.1.108@o2ib10:192.168.1.109@o2ib10
          192.168.1.102:   lustre:failover.node=192.168.1.102@o2ib10:192.168.1.103@o2ib10
          

          Additionally I used the option '--force-nohostid' for the mkfs.lustre command. I think the patch enabling hostid hasn't landed yet, and SPL fails during reformat if /etc/hostid isn't available.

          heckes Frank Heckes (Inactive) added a comment
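
          As a side note, a minimal sketch of the usual alternative to '--force-nohostid', assuming the util-linux genhostid tool is available on the servers:

          # create /etc/hostid with a random host id (run once per server) so SPL/ZFS can find one
          genhostid
          # print the resulting host id to verify
          hostid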

          Indeed it's a configuration error. It looks like the NID of the secondary node (192.168.1.103@o2ib) wasn't written during FS format.
          I set it explicitly again using zfs set lustre:failover.node, and will check whether this resolves the issue.

          heckes Frank Heckes (Inactive) added a comment
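
          A minimal sketch of that manual fix for one OST, using the dataset name and NIDs shown elsewhere in this ticket (illustrative only):

          # set both HA peer NIDs as failover nodes on one OST dataset (run on the OSS that owns the pool)
          zfs set lustre:failover.node=192.168.1.102@o2ib10:192.168.1.103@o2ib10 soaked-ost18/ost18
          # read the property back to verify
          zfs get -H -o name,value lustre:failover.node soaked-ost18/ost18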

          Thanks Frank for the detailed logs. According to the client-side log, it always tried to connect to lola-2 for the failed OSTs (OST0012 OST0000 OST0006 OST000c):

          00000100:00000001:30.0:1469462043.991030:0:125723:0:(import.c:508:import_select_connection()) Process entered
          00000100:00080000:30.0:1469462043.991033:0:125723:0:(import.c:523:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: connect to NID 192.168.1.102@o2ib10 last attempt 5189761512
          00000100:00080000:30.0:1469462043.991036:0:125723:0:(import.c:567:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: tried all connections, increasing latency to 50s
          00000100:00080000:30.0:1469462043.991099:0:125723:0:(import.c:601:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: import ffff880ff6706000 using connection 192.168.1.102@o2ib10/192.168.1.102@o2ib10
          00000100:00000001:30.0:1469462043.991104:0:125723:0:(import.c:605:import_select_connection()) Process leaving (rc=0 : 0 : 0)
          ...

          That means there is only one candidate that can be selected to connect to OST0000, but according to your description there should be another candidate, 192.168.1.103@o2ib, for the connection. So I am wondering how you configured lola-2 and lola-3 as a failover pair?

          yong.fan nasf (Inactive) added a comment
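
          One way to confirm this from a client, sketched here with the OSC device name taken from the log above: dump the import state and check whether the connection section lists both HA peer NIDs.

          # on a Lustre client: show the import for the affected OSC; the connection list
          # should include both 192.168.1.102@o2ib10 and 192.168.1.103@o2ib10
          lctl get_param osc.soaked-OST0000-osc-*.import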

          debug files have been uploaded. Roles:
          lola-8: MGS
          lola-3: secondary OSS
          lola-30: Lustre client

          heckes Frank Heckes (Inactive) added a comment
          heckes Frank Heckes (Inactive) added a comment (edited)

          Resources of lola-2 (OST0012 OST0000 OST0006 OST000c) failed over to lola-3

          • 2016-07-25 08:49:12,572:fsmgmt.fsmgmt:INFO Wait for recovery to complete
          • 2016-07-25 08:55 created debug logs
          • 2016-07-25 08:57:51 aborted recovery
          • 2016-07-25 08:59 created debug logs

          Upload of debug logs has been started, but it might take some time.

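          For context, a sketch of how such debug logs are typically captured on a node (the debug mask setting and output file name are illustrative):

          # optionally widen the debug mask first
          lctl set_param debug=-1
          # dump the in-kernel Lustre debug buffer to a file
          lctl dk /tmp/lustre-log-$(hostname)-$(date +%Y%m%d-%H%M).txt
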
          yong.fan nasf (Inactive) added a comment (edited)

          According to the recovery status, no clients connected to the newly mounted OSTs on lola-5, and there are no client connection records in the lola-5 debug logs (I am sure that this is NOT because the logs were overwritten). So it seems the clients were not aware of the OST load switching from lola-4 to lola-5; with no connection to lola-5, there is no recovery. We need the full debug logs of the MGS and of a client (one active client is enough). Thanks!

          heckes Frank Heckes (Inactive) added a comment (edited)

          The comments from Cliff above seem to be related to LU-8089.

          The error for the OSTs reported in the description above also occurred for build https://build.hpdd.intel.com/job/lustre-master/3406 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713)
          and is reproducible for any failover initiated.
          The test environment configuration is:
          4 MDSes with 1 MDT each, formatted with ldiskfs, in active-active failover configuration
          6 OSSes with 4 OSTs each, formatted with zfs, in active-active failover configuration
          16 clients
          The stalled recovery process can be interrupted (aborted) at any time.

          I attached new message, console and debug log files (taken during recovery and after the recovery abort) of secondary node lola-5.
          Debug files of all other nodes, taken at the same time, are available and can be uploaded on demand.

          Sequence of events:

          • 2016-07-22 04:21:54,665:fsmgmt.fsmgmt:INFO triggering fault oss_failover lola-4
          • 2016-07-22 04:26:37,971:fsmgmt.fsmgmt:INFO lola-4 is up
          • 2016-07-22 04:27:01,278:fsmgmt.fsmgmt:INFO ... soaked-OST0008 mounted successfully on lola-5 (NOTE: decimal id for OST)
          • 2016-07-22 04:27:12,348:fsmgmt.fsmgmt:INFO ... soaked-OST0020 mounted successfully on lola-5
          • 2016-07-22 04:27:23,515:fsmgmt.fsmgmt:INFO ... soaked-OST0002 mounted successfully on lola-5
          • 2016-07-22 04:27:36,047:fsmgmt.fsmgmt:INFO ... soaked-OST0014 mounted successfully on lola-5
          • 2016-07-22 04:27:36 Recovery started
          • 2016-07-22 04:37 Recovery aborted

          Status of recovery on OSS node lola-5 and client status were:

          [root@lola-5 ~]# lctl get_param *.*.recovery_status
          obdfilter.soaked-OST0002.recovery_status=
          status: RECOVERING
          recovery_start: 0
          time_remaining: 0
          connected_clients: 0/23
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 0
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 68719725397
          obdfilter.soaked-OST0003.recovery_status=
          status: COMPLETE 
          recovery_start: 1468949759
          recovery_duration: 30
          completed_clients: 23/23
          replayed_requests: 0
          last_transno: 34368384861
          VBR: DISABLED
          IR: ENABLED
          obdfilter.soaked-OST0008.recovery_status=
          status: RECOVERING
          recovery_start: 0
          time_remaining: 0
          connected_clients: 0/23
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 0
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 68719723367
          obdfilter.soaked-OST0009.recovery_status=
          status: COMPLETE
          recovery_start: 1468949751
          recovery_duration: 38
          completed_clients: 23/23
          replayed_requests: 0
          last_transno: 34368183131
          VBR: DISABLED
          IR: ENABLED
          obdfilter.soaked-OST000e.recovery_status=
          status: RECOVERING
          recovery_start: 0
          time_remaining: 0
          connected_clients: 0/23
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 0
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 68719751551
          obdfilter.soaked-OST000f.recovery_status=
          status: COMPLETE 
          recovery_start: 1468949763
          recovery_duration: 26
          completed_clients: 23/23
          replayed_requests: 0
          last_transno: 34368269722
          VBR: DISABLED
          IR: ENABLED
          obdfilter.soaked-OST0014.recovery_status=
          status: RECOVERING
          recovery_start: 0
          time_remaining: 0
          connected_clients: 0/23
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 0
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 68719756594
          obdfilter.soaked-OST0015.recovery_status=
          status: COMPLETE 
          recovery_start: 1468949751
          recovery_duration: 38
          completed_clients: 23/23
          replayed_requests: 0
          last_transno: 34368229218
          VBR: DISABLED
          IR: ENABLED
          

          In a previous session the recovery stalled overnight for several hours without completing. That's why I aborted the recovery after ten minutes this time; I wasn't sure the debug logs would contain relevant information, but I was confident the recovery would never end.


          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 6
