Best Practices for Run Time High Availability and Disaster Recovery

Document created by Adam Arrowsmith on Sep 30, 2016.  Last modified by Adam Arrowsmith on Jan 24, 2018.
Version 10

This article provides best practices for managing highly available atom/molecule/cloud runtimes and strategies for disaster recovery.

 

 

Overview

When configuring the Dell Boomi run time infrastructure, it is important to understand the Fault Tolerance, Availability, and Disaster Recovery requirements within your organization.

 

Based on those requirements, you may choose to run in the cloud, within your own data centers, or in a hybrid environment that uses both cloud and on-premise resources.

 

This article describes best practices for managing the various Dell Boomi Run Time Engines, so that they meet the availability requirements of your business.

 

High Availability vs Disaster Recovery

High availability refers to maintaining service within a given data center.

 

Disaster recovery refers to the loss of an entire data center and the need to shift operations to a completely separate environment.

 

Enterprise solutions typically require an integration platform that is both highly available and capable of being restored after a catastrophic event.

 

Run Time Options

 

Dell Boomi Hosted

The Dell Boomi Public Cloud is a stable, robust, and scalable run time engine, available to each of our customers.  Our Operations team manages the availability and DR of those systems, in order to reduce our customers' operational costs.

 

Customer Hosted

If customers have performance or connectivity concerns, and have determined that they need to host their own Run Time, there are several options that must be considered.

 

Cloud vs On-Prem

The Atom, Molecule, and Cloud run time engines can be hosted in a variety of ways.  They can be installed on personal laptops, on physical servers, or within corporate VM systems, which may offer some fault tolerance and fail-over functionality.

 

More and more, we are seeing customers install our applications in cloud services such as Amazon, Azure, or Google.  Each of these solutions offers its own options for meeting the required availability numbers, and each offers different mechanisms for backups and DR.

 

Atoms, Molecules, and Clouds

An Atom consists of a single Java instance running on a single server.  This configuration is sufficient for many of our customers, but by definition the Atom is a single point of failure.  If you are hosting a customer facing web service, or have time sensitive integrations, you may require a more highly available system.

 

In contrast, the Molecule and Cloud options run as a cluster of Java instances and therefore offer a higher level of availability.  If an individual node fails, the remaining nodes in the cluster are available to run subsequent executions.  ( Note: Executions that are running on a server when that server fails are marked as failed, and do not automatically resume on the remaining nodes. )

 

Because Molecules and Clouds are configured with similar network and hardware configurations, for the remainder of this article, we will talk about Molecules.  The same concepts generally apply to Clouds as well.

 

The clustered systems do depend upon shared network storage.  When we talk about HA and DR for Molecules, we will generally concern ourselves with capabilities around that network storage.

 

Hybrid Environments

Many of our customers maintain multiple environments, where different integrations may run in different types of environments.

 

Perhaps production processes are split between the Dell Boomi Public Cloud, and an on-premise Molecule.

 

Or perhaps a customer wishes to save on Molecule licensing costs by asking the development team to do their connectivity tests against a local Atom, while the production integrations run against a more robust Molecule.

 

In hybrid environments like these, availability and DR must be maintained appropriately for each of the environments.

 

Backing up an Atom

When we refer to backups, we usually mean a backup of an entire server/VM, or a backup of an Atom's installation directory.  Backing up individual servers is not necessary in a Molecule environment, because all of the information needed to restore the Molecule is contained on the Molecule's shared network drive.

 

If there are DR requirements around an Atom configuration, you may choose to make backups of the server, or of the installation directory, to be restored as needed.  As a best practice, configure the replacement server with the same IP address and hostname as the original, and with a similar file system structure.

 

If you are not restoring an entire system image, then post restore, you will need to re-create the Windows service or Linux startup scripts ( articles linked below ).

 

High Availability for a Molecule or Cloud

As mentioned earlier, the availability of your Molecule often depends on that of the underlying network storage.

 

If you are hosting your infrastructure within your own data center, whether using VMs or physical machines, the NFS drive associated with that environment will typically be hosted by a network-attached SAN device.  These devices often have built-in fault tolerance and redundancy features, and often have some backup capability as well.

 

If you are hosting a virtual environment, there are a variety of options for redundant storage.

 

In the event of a failure of a particular Molecule Node, a new Node can be created with the same configuration, and added to the cluster.

 

Common Approaches for Disaster Recovery of a Molecule or Cloud

 

Single Cluster Spanning Data Centers

Because of network latency, we do not recommend sharing the Molecule's network storage across data centers.  Under production load, this setup tends to produce file I/O issues, which manifest themselves as clustering problems.

 

Likewise, we do not recommend configuring bi-directional replication across two data centers.  Because the Molecule tends to utilize a high number of very small files, and because of the file locking requirement, this solution will also begin to fail under production load.

 

Active/Active

It is possible to stand up distinct Molecules in both data centers.  These would each consist of their own set of nodes, their own NFS drive, and will each require a Molecule license and the corresponding connector licenses.

 

In this configuration, both Molecules appear as individual elements on the Atom Management page, and can be grouped together within the same AtomSphere Environment.  In this way, changes to Deployments and to Environment Extensions are automatically managed across both Molecules.  Effectively, processes can be run in either environment independently.

 

Depending on the network topology, a Load Balancer might be configured to distribute inbound real time requests to both Molecules.

 

Schedules, however, are maintained on each Molecule independently.  If jobs are scheduled to run at the top of the hour, they might run on both Molecules at the same time.  To avoid this, we recommend staggering the schedules across the Molecules.  Note that during an outage of one of the Molecules, your processes will run only per the schedule of the remaining Molecule.
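
As a rough illustration of staggering, the sketch below spreads a recurring schedule evenly across several Molecules so that no two start in the same minute.  The function name and the Molecule names are hypothetical, and the sketch assumes there are no more Molecules than minutes in the interval; the actual schedules are still configured per Molecule in Atom Management.

```python
def staggered_minutes(molecules, interval_minutes=60):
    """Assign each Molecule a distinct start minute within the interval."""
    if not molecules:
        return {}
    step = interval_minutes // len(molecules)
    # Molecule i starts i * step minutes after the top of the interval.
    return {name: i * step for i, name in enumerate(molecules)}


# Hypothetical Molecule names, one per data center.
offsets = staggered_minutes(["molecule-dc1", "molecule-dc2"])
print(offsets)  # {'molecule-dc1': 0, 'molecule-dc2': 30}
```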

 

Also, because each execution runs on only one of the clusters, an outage of one of the Molecules might cost you some of the execution logs associated with the failed Molecule.

 

Because of the additional license cost, the operational overhead, the complexity of managing the schedules, and the possibility of data loss, we do not generally recommend this active/active approach.

 

Active/Passive

The recommended approach for Disaster Recovery of a Molecule or Cloud is to maintain a passive backup of the production Molecule at the DR site.

 

All of the information needed to restore a Molecule is contained within the installation directory on that NFS drive.  Node servers do not contain any persistent data.  For this reason, we simply need to take periodic backups of that network share, and store them at the DR site.  In the event of an outage, restore the backup, stand up new Nodes against the restored share, and start up the Java services.  Effectively, we are "extending" or "adding nodes to" the existing Molecule.

 

Only a single Molecule will be displayed within Atom Management, and you will not need additional licenses.  Data loss is limited to whatever changed after the last successful backup.

 

This is the approach maintained by the Dell Boomi Operations team for our Public Cloud.

 

Active/Passive DR Considerations

 

Replication vs Backup of the Network Storage

We recommend taking "Point in Time" backups of the Network storage device.  This method reduces the risk of a disk corruption being replicated into the DR system.
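
To illustrate why point-in-time copies help, the sketch below picks the newest snapshot taken before a known corruption time.  With plain replication there would be only one (already corrupted) copy to choose from.  The function name and the timestamps are hypothetical.

```python
from datetime import datetime


def latest_clean_snapshot(snapshot_times, corruption_time):
    """Return the newest snapshot taken strictly before the corruption, or None."""
    clean = [t for t in snapshot_times if t < corruption_time]
    return max(clean) if clean else None


# Hypothetical snapshot schedule: every six hours.
snapshots = [
    datetime(2018, 1, 24, 0, 0),
    datetime(2018, 1, 24, 6, 0),
    datetime(2018, 1, 24, 12, 0),
]
restore_point = latest_clean_snapshot(snapshots, datetime(2018, 1, 24, 9, 30))
print(restore_point)  # 2018-01-24 06:00:00
```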

 

Block Level Backups vs File Backups and Performance

Regardless of the type of backup solution, you will need to provision a location at the DR site, to store the backed up information.

 

Many storage solutions offer built-in backup capability, typically at the block or device level ( NetApp, Amazon EBS, GlusterFS, etc ).  These vendor provided tools will typically not use system resources ( CPU ) of the Molecule itself, and should also have less impact on the IOPS of the storage device.

 

If necessary, the Molecule's installation directory can be backed up at the file level ( using a tool like rsync ).  File level backups could have a dramatic effect on your performance, so the disks must be monitored closely for IOPS and throughput.  Tools that have to traverse the file system will also consume CPU resources.  You must also make sure that the backup tools do not cause locking contention with the Molecule nodes.
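
A minimal sketch of a throttled file level copy is shown below.  It only builds and prints the command; it does not run it.  The paths, the backup host, and the bandwidth cap are hypothetical values, and the sketch assumes that `nice` and `rsync` are installed on the machine performing the backup.

```python
import shlex


def build_rsync_command(source_dir, dest, bwlimit_kib_per_s=20000):
    """Build an rsync invocation that copies the contents of source_dir to dest."""
    # Trailing slash: copy the directory's contents rather than the directory itself.
    src = source_dir.rstrip("/") + "/"
    return [
        "nice", "-n", "19",                  # run at the lowest CPU priority
        "rsync",
        "-a",                                # archive mode: recurse, keep permissions and times
        f"--bwlimit={bwlimit_kib_per_s}",    # cap transfer rate (a bare number means KiB/s)
        src,
        dest,
    ]


# Hypothetical installation directory and backup target.
cmd = build_rsync_command("/mnt/boomi/molecule", "backup-host:/backups/molecule/")
print(shlex.join(cmd))
```

Capping bandwidth and CPU priority addresses throughput and CPU only; it does not by itself prevent locking contention, which still needs to be verified against the running Molecule.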

 

When choosing a file level backup solution, take into consideration that the Molecule creates "lots of little files" at a relatively high rate.
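
To get a feel for how many small files a backup tool would have to traverse, a sketch like the following can be run against a copy of the installation directory.  The 4 KiB threshold for "small" and the function name are assumptions made for illustration.

```python
import os


def small_file_stats(root, small_bytes=4096):
    """Return (total_files, small_files) for the tree under root."""
    total = small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file disappeared between listing and stat
            total += 1
            if size <= small_bytes:
                small += 1
    return total, small


# Usage (hypothetical path):
#   total, small = small_file_stats("/mnt/boomi/molecule")
#   print(f"{small} of {total} files are 4 KiB or smaller")
```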

 

Depending on the technology available, consider incremental backups vs full backups.  This decision, however, could also affect your recovery time when failures do occur.

 

RPO and Data Loss

When we discuss Recovery Point Objectives, we are simply discussing how often a snapshot of the Molecule's NFS should be taken.  We must weigh the performance cost of running more frequent backups against the potential for data loss.

 

Typically we recommend starting with 2 to 4 snapshots taken daily, but depending on your business needs, you certainly can do them more often.
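
The trade-off can be put in numbers with a small calculation.  The sketch below assumes snapshots are evenly spaced over 24 hours and ignores the time a snapshot takes to complete; the function name is hypothetical.

```python
def worst_case_data_loss_hours(snapshots_per_day):
    """Longest possible gap between a failure and the previous snapshot."""
    if snapshots_per_day <= 0:
        raise ValueError("snapshots_per_day must be positive")
    return 24 / snapshots_per_day


for n in (2, 4, 24):
    print(f"{n} snapshots/day -> up to {worst_case_data_loss_hours(n)} hours of data at risk")
```

With 2 snapshots per day, up to 12 hours of run time data is at risk; with 4, up to 6 hours.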

 

When discussing RPO with your team, consider the different types of data that are stored within the Molecule installation directory, compared with the data that is stored in the platform.  For example, Execution History / Metadata is stored in the Platform database to be presented via the Process Reporting UI, but the data files and process logs associated with each individual execution are stored only on the run time itself.

 

In addition to execution related data, you should consider other Platform activity that has a corresponding update on the Molecule.  This includes Deployment activity (Processes, Certificates, Jar files, etc), changes to Atom Management and Shared Web Server settings, Environment Extensions, and Process Schedules.

 

When the DR system is brought online, any recent activity of the kinds listed above must be manually verified, because the Boomi Platform DB could contain newer information than that captured by the most recent NFS backup.  In your production environment, these types of actions should be relatively infrequent, and should be under configuration management and therefore logged.  Redeploy, or re-save Atom Properties or Environment Extensions, in order to re-sync the Platform with the Atom configuration.
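
If your configuration management log can be exported, the list of items to re-verify can be derived mechanically.  The sketch below assumes a hypothetical log format (a list of entries with a time and a description); it is not a Boomi API.

```python
from datetime import datetime


def changes_to_reapply(change_log, last_backup_time):
    """Return the log entries recorded after the backup, oldest first."""
    recent = [entry for entry in change_log if entry["time"] > last_backup_time]
    return sorted(recent, key=lambda entry: entry["time"])


# Hypothetical entries from a change log kept by the operations team.
change_log = [
    {"time": datetime(2018, 1, 24, 5, 0), "change": "deploy process A"},
    {"time": datetime(2018, 1, 24, 8, 15), "change": "update environment extensions"},
    {"time": datetime(2018, 1, 24, 7, 30), "change": "change process schedule"},
]
todo = changes_to_reapply(change_log, last_backup_time=datetime(2018, 1, 24, 6, 0))
for entry in todo:
    print(entry["time"], entry["change"])
```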

 

Finally, you may need to account for any persistent process data after cutting over to the DR system.  This can include any counters, AS2 ISA/GS information, and Persisted Process Properties.  This also includes "Last Run Dates".  Depending on how processes are configured, when they run against the DR system they will use the values that were included in the most recent backup.

 

Recovery Time Objective (RTO)

The Recovery Time Objective will be determined by how long the business can withstand the run time outage.  Effectively, you need to be able to restore the Molecule at the DR site, within this time period.

 

Your recovery time will depend upon how quickly you can restore the backup of that NFS within the DR site, and how quickly you can build, configure, and start the Java service on the DR Molecule nodes.
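
A back-of-the-envelope estimate can show whether a plan fits the RTO.  The sketch below makes the conservative assumption that the NFS restore, the node builds, and the service start happen one after another; all durations and names are hypothetical.

```python
def estimated_recovery_minutes(nfs_restore, node_build, node_count,
                               service_start, parallel_nodes=True):
    """Estimate total recovery time in minutes."""
    # Nodes built in parallel cost one build; built serially, one build each.
    build = node_build if parallel_nodes else node_build * node_count
    return nfs_restore + build + service_start


def meets_rto(estimate_minutes, rto_minutes):
    """True if the estimated recovery time fits within the RTO."""
    return estimate_minutes <= rto_minutes


estimate = estimated_recovery_minutes(nfs_restore=45, node_build=20,
                                      node_count=3, service_start=5)
print(estimate, meets_rto(estimate, rto_minutes=120))  # 70 True
```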

 

Restoration of the NFS depends on the tools used to take the backup.  We therefore typically spend more time discussing options for building and configuring the new nodes.

 

External Connectivity

External dependencies should also be identified during the DR planning stages.

 

If there are other on-premise systems within your production environment, you will need to consider whether or not they may fail over to the DR site as well.  If they do, there may be connection parameters or other information that will need to change in order to reach the new endpoints.

 

If your integration processes connect to any external endpoints that have whitelisted your production servers, those endpoints may also need to whitelist your DR servers.  If DR Node IP addresses can be reserved ahead of time, consider asking your clients to whitelist them from the start.

 

If you are hosting Web Services or AS2 inbound processes, you will need to determine whether or not to fail over your load balancer.  Your production load balancer may need to be configured with the IP addresses of the DR nodes.  If you have public facing URLs or similar public endpoints, you may need to make DNS updates in order to redirect your clients' traffic.
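
During DR planning or testing, a simple TCP check run from a DR node can confirm that the identified dependencies are reachable (and, indirectly, that whitelisting is in place).  This is a sketch only: it tests that a TCP connection can be opened, not that the application behind it works, and the endpoints in the usage comment are hypothetical.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def check_endpoints(endpoints):
    """Map 'host:port' to reachability for each (host, port) pair."""
    return {f"{host}:{port}": is_reachable(host, port) for host, port in endpoints}


# Usage (hypothetical endpoints):
#   print(check_endpoints([("erp.internal.example", 1521), ("partner.example", 443)]))
```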

 

Initial Hosts for Unicast

If the IP addresses for the DR nodes are known ahead of time, they can be added to the Initial Hosts for Unicast property ( Atom Management or via container.properties ), at setup time.  If they are not known, then they will need to be added to this Property during the event.
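
The combined list of production and DR addresses can be prepared ahead of time with a small helper such as the one below.  It only validates and de-duplicates the addresses and joins them with commas; the exact value format expected by the Initial Hosts for Unicast property (for example, whether a port must follow each address) is not covered here and should be taken from the product documentation.  The addresses shown are hypothetical.

```python
import ipaddress


def merged_host_list(prod_ips, dr_ips):
    """Validate, de-duplicate (keeping order), and comma-join the addresses."""
    merged = []
    for raw in list(prod_ips) + list(dr_ips):
        ip = str(ipaddress.ip_address(raw.strip()))  # raises ValueError if malformed
        if ip not in merged:
            merged.append(ip)
    return ",".join(merged)


# Hypothetical production and DR node addresses.
print(merged_host_list(["10.0.1.11", "10.0.1.12"], ["10.1.1.11", "10.0.1.12"]))
# 10.0.1.11,10.0.1.12,10.1.1.11
```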

 

Options for Configuring DR Nodes

When the nodes are stood up in the DR environment, they must be configured the same way as the production nodes.  The hardware characteristics, OS, security patches, available port numbers, certificates, and Java path and version should all match.  The Boomi user needs to be available, the NFS mount point needs to be the same, and the Local Working and Local Temp directories need to be created with the same user permissions.  There are a few options for making this process easier.

 

Note: The primary and DR run time instances must NOT be run at the same time, to avoid conflicts when communicating with the AtomSphere platform.
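
One way to check that a DR node matches production is to record the relevant settings for each node and compare them.  The sketch below assumes the settings have already been collected into dictionaries; the keys and values are hypothetical examples, not a required format.

```python
def config_differences(prod, dr):
    """Return {setting: (prod_value, dr_value)} for every setting that differs."""
    keys = sorted(set(prod) | set(dr))
    return {k: (prod.get(k), dr.get(k)) for k in keys if prod.get(k) != dr.get(k)}


# Hypothetical settings gathered from one production node and one DR node.
prod_node = {
    "java_version": "1.8.0_151",
    "nfs_mount": "/mnt/boomi",
    "work_dir": "/opt/boomi/work",
    "temp_dir": "/opt/boomi/tmp",
    "boomi_user": "boomi",
}
dr_node = {
    "java_version": "1.8.0_144",
    "nfs_mount": "/mnt/boomi",
    "work_dir": "/opt/boomi/work",
    "boomi_user": "boomi",
}
for setting, (prod_value, dr_value) in config_differences(prod_node, dr_node).items():
    print(f"{setting}: prod={prod_value} dr={dr_value}")
```

An empty result means the recorded settings match; a missing setting on either side shows up as None.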

 

Backing up Nodes

Particularly in a VM environment, you may have the ability to take full backups of each node.  When restoring these at the DR site, you may even be able to restore hostnames and IP addresses.

 

In the event of a failure, restore the backups of those nodes, and as long as the DR NFS was shared in the same way, the Molecule should come back online.

 

Rather than maintaining multiple backups for each node, you may also keep a single image, that can be used to create multiple nodes.

 

Before restoring an image, or building a new Molecule Node in the DR site, it is important to ensure that the Production Molecule instance has been completely stopped.

 

Configuring Nodes During Initial Setup

Some customers prefer to provision and configure their DR nodes as part of the initial Molecule setup.  With this method, there is no need to build or configure the servers during the DR event.  However, there is additional operational and maintenance cost associated with maintaining these extra servers.

 

In this scenario, any server maintenance, patching, or updates that are made to the production environment, must also be made to the DR nodes.

 

You can choose to install the Molecule Node ( node_install script ) as part of the setup, or during the DR event.  If you install it as part of the setup, do so during a maintenance window in which you can make sure there is no load on the production Molecule and that no processes are running.  While the DR nodes are online, they may inadvertently take executions, which could cause discrepancies in Process Reporting.

 

If you do install the Java service, it may be a good idea to manually disable it until those nodes are brought online.

 

If you run the node_install scripts and join the DR nodes to the cluster during initial setup, the DR nodes will appear in the Atom Management UI under Cluster Status.  Once the DR nodes have been shut back down, if desired, you can use the gear icon on that UI to remove them from the platform.

 

DR Node Install Check List (Used when Pre-Installing DR)

  1. Add DR Nodes to Initial Hosts for Unicast
    1. Identify IP addresses of DR nodes
    2. Log in to AtomSphere
    3. Manage->Atom Management->Properties->Advanced->Initial Hosts for Unicast
    4. Update the list of Nodes to include both Production and DR nodes
      • Note: This property does not need to include ALL the nodes in the cluster, but one node in this list must be online at all times.
  2. Take Prod Molecule offline by stopping all node services
  3. Fail over/activate file share backup in DR site
  4. Start VMs of additional DR nodes. Verify:
    • Working Directory path is set up and is the same
    • Temp directory path is set up and is the same
    • Java is installed and is the same build version
  5. Install additional nodes per the articles referenced below
  6. Confirm DR nodes are installed successfully by starting the service ( no cluster errors )
  7. Stop the DR node services (can bring down the DR machines as well)
  8. Restart node services in prod environment
  9. Optional: From Atom Management->Cluster Status, remove the DR Nodes from the cluster view.  This only updates the UI, and does not delete the service from the corresponding DR nodes.

 

Provisioning Nodes as Needed

The recommended approach, because of resource costs and the operational overhead of configuring DR nodes ahead of time, is to only stand up DR nodes in the event of a production outage.

 

Generally speaking, the configuration of NFS mounts, local directory storage and Java installation can be scripted in order to improve recovery time.
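
As one example of such scripting, the sketch below creates the local working and temp directories with fixed permissions and is safe to run repeatedly.  The paths in the usage comment and the 0o750 mode are assumptions; setting ownership to the Boomi user (for example with `os.chown`) normally requires root and is left out.

```python
import os


def ensure_local_dirs(paths, mode=0o750):
    """Create each directory if missing and apply the given permissions."""
    for path in paths:
        os.makedirs(path, exist_ok=True)
        # makedirs honours the umask, so set the mode explicitly afterwards.
        os.chmod(path, mode)
    return [p for p in paths if os.path.isdir(p)]


# Usage (hypothetical paths; use the same ones as on the production nodes):
#   ensure_local_dirs(["/opt/boomi/work", "/opt/boomi/tmp"])
```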

 

A system image, that already includes baseline OS, security, and Java versions, can serve as the foundation for that scripted functionality.  Note: It may be possible to use the same process to create production and DR nodes.

 

Active/Passive DR Procedure

These are the general steps to follow during a DR event:

  1. Verify production data center is down (stop services if not)
  2. Failover file share to DR site/ Activate file share backup
  3. Build/Configure DR servers, or start pre-built ones.
  4. Start the Java services on those nodes
  5. If necessary, confirm load balancer/reverse proxy failover now points to DR nodes ( update DNS or Client applications as necessary )
  6. Audit Platform activity that occurred since the last NFS backup ( See the RPO section above ).  Double check the following:
    • Deployments ( Processes, Certificates, APIs, Jars, etc )
    • Atom Management ( Extensions, Properties, Shared Web Server Settings, Schedules )
    • As necessary, Last Run Dates, Persisted Properties, and Counters
  7. Optional: Remove the offline Production nodes from Atom Management->Cluster Status
  8. Verify that processes are executing as expected
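
The ordering of these steps matters, so an automated runbook should stop at the first step that fails rather than continue.  The sketch below shows that control flow only; each action is a placeholder callable returning True or False, standing in for whatever script or manual confirmation your team uses.

```python
def run_runbook(steps):
    """Run (name, action) steps in order; stop at the first failing step.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder actions; the third one simulates a failure.
steps = [
    ("verify production is down", lambda: True),
    ("activate file share backup", lambda: True),
    ("start DR nodes", lambda: False),
    ("start Java services", lambda: True),
]
completed, failed = run_runbook(steps)
print("completed:", completed)
print("failed at:", failed)
```

In this run the Java services are never started, because the step before them failed.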

 

Restoration of Original Data Center

Once the DR site has gone live and processes have executed there, then you can no longer simply cut back over to your original data center, without first copying back the DR NFS drive.  Any changes in the platform ( discussed above ), are now reflected on the new NFS drive.

 

If you do need to return operations to the original data center, it is best to schedule the migration during a planned system downtime.

 

Steps:

  1. Shut down the DR Molecule by stopping the Java services
  2. Copy the DR NFS drive to overwrite the NFS at the original data center
  3. Start the Molecule nodes in the original data center

 

Note: While working in the original environment, make sure not to start Molecule nodes against the older NFS drive.
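
Before starting nodes in the original data center, it can be worth confirming that the copied NFS contents match the DR copy.  The sketch below compares per-file SHA-256 hashes of two directory trees.  It assumes the Molecule is stopped so the files are not changing, reads each file fully into memory, and may take a long time on a tree with many files; the paths in the usage comment are hypothetical.

```python
import hashlib
import os


def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            result[os.path.relpath(path, root)] = digest
    return result


def copies_match(source_root, copy_root):
    """True if both trees contain the same files with the same contents."""
    return manifest(source_root) == manifest(copy_root)


# Usage (hypothetical mount points):
#   copies_match("/mnt/dr_nfs/molecule", "/mnt/prod_nfs/molecule")
```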

 

How to Test DR

 

POC Prior to Go Live

We recommend walking through the DR procedure, as a POC activity, during the initial stages of Molecule creation.  This allows the infrastructure team to become familiar with the concepts, and to annotate the instructions above per their specific requirements.

 

This option is sufficient for many of our customers.

 

Use a Test Instance as an Example

In this scenario, customers choose to create a test Molecule specifically for the DR testing.  ( Work with your Sales team to procure any necessary licenses )

 

You will need to build and deploy some test processes, or identify some production processes that can be run on the temporary Molecule.  Consider deploying some web services if appropriate.

 

Once you create a backup of that test Molecule, you can then follow the DR procedure to move the Molecule into the DR site.  You can then run those test processes, re-deploy them if necessary, change schedules, etc.  Finally, you can follow the steps to move the Molecule back into the production data center.

 

This is a simple use case that does not impact your production environment, but allows you to verify your automation and procedures.  It does not test doing the move under production load, and it does not test the impact on your external systems ( Load Balancer, Trading Partners, Client Applications, etc ).

 

Cutover

The preferred approach is to move the Molecule to the DR site and to continue production processing there.  Instead of using the backup site only for DR, consider it a secondary production site.

 

The feasibility of this approach will depend greatly on the connectivity of on-prem and external endpoints, but for most cloud based hosting environments, and for most cloud based applications, there should be no difference.

 

This method is also useful because it can allow you to do large scale system maintenance of your primary site.

 

Referenced Articles

New nodes can be added to the cluster using the following procedures:
