This four-part series discusses various methods and considerations for keeping multiple applications in sync, including one-way, bidirectional, real-time/event-based, and hub-and-spoke.
- Sync Strategies Part 1: One-Way Syncs
- Sync Strategies Part 2: Two-Way Syncs
- Sync Strategies Part 3: Real-Time Syncs
- Sync Strategies Part 4: Syncing Multiple Applications
- One-Way Sync Strategies
- 1. BEST: Extract by “Flag” Field Value
- 2. GOOD: Extract by Last Modified Date
- 3. LAST RESORT: Change Data Capture
In this article we’ll talk about strategies for syncing records incrementally in one direction between two applications, a very common scenario. But before looking at some specific strategies, let’s take a moment to clarify what “incremental” really means. An incremental (or “delta”) sync is one that only processes the data records that have changed (created or modified) since the last time the integration ran as opposed to processing the entire data set every time. It is the preferred approach for most point-to-point integration scenarios because by limiting the number of records being processed to only what has changed, it allows the integration to run more quickly and efficiently which in turn allows it to run more frequently to keep systems up to date.
Designing and implementing an incremental sync does require more effort than simply syncing all records every time. However if you’ve got thousands and thousands or even hundreds of records, when you try to sift through error logs, or you start exceeding API limits, or the integration takes hours to run, that extra effort up front will definitely pay off.
One-Way Sync Strategies
When it comes to integration, not all applications are created equally. Depending on the availability and capabilities of an API and the configurability of the application itself, your sync options may be limited. However if you do have a choice, use the recommendations below to identify the best strategy for your situation.
The three strategies we will review are:
- BEST: Extract by “Flag” Field Value
- GOOD: Extract by Last Modified Date
- LAST RESORT: Change Data Capture
1. BEST: Extract by “Flag” Field Value
In this approach, records are extracted based on the value of some status or “flag” field. Upon successful completion, the integration updates this field to a different value so it will not be extracted again.This option provides a lot of flexibility and can end users to easily control the sync behavior by modifying values within the end application. Bottom line: if you have this option, use it.
- The flag field can take on many different forms and will depend on the application. It could be an actual “status” field (integration-specific or business-related), a true/false “ready to sync” field, the existence of an “external ID”, etc. It could also be a combination of fields and values.
- Tip: If the application allows it, create a custom field specifically for “integration status”. If not, look for a standard “status” field or even an unused field that could be used for this purpose.
- If flag field value is not set “naturally” as part of the normal business process, you may need to implement custom workflow logic in the end application to initialize the value accordingly.
- Ideal for when a record should be synced once and only once, such as transactions.
- If you need to sync subsequent changes as well--which is common for “master data” records like Customer or Contact--consider combining with the last modified date approach below or with application logic to detect end-user changes using scripting or triggers within the application.
- Allows records to be synced indeterminate of one another. If one record in the group fails, the others can be updated successfully and will not be reprocessed.
- Assuming the field is exposed to end users, it allows them to easily reset, retry, or ignore individual records in a self-service fashion instead of relying on the integration team.
- Keeps the integration stateless, meaning nothing is stored at the integration layer such as the last run date.
- May require end-application customizations or logic to create flag field.
- If the source record does not have a field to use as the flag and you cannot create custom fields, or the application’s API does not permit you to query by this field, this approach is not an option.
Taking it to AtomSphere
- Configure the operation query filters in the start step to only select records with certain values for the flag field(s).
- After successfully syncing the record, ensure the process updates the flag fields in the source application appropriately.
- Tip: Use a User Defined Document Property to retain the source record’s internal ID throughout the process so you can update the original record at the end of the process.
2. GOOD: Extract by Last Modified Date
Many applications automatically capture the timestamp of when records are created or modified. This is great news when you only want to get records that have been created or modified since the last time the integration ran.In this approach, records are extracted based on the value of their last modified date field being greater than some date, such as the last time the integration ran or the most recent record synced previously.This is the next best option and may be the only option if there are no flag fields available in the source application or its API.
- What date value should be used when extracting records?
- 1) Capture the current system timestamp when the integration begins and save it upon successful completion. However if the integration client does not run on the same server as the application (which is almost always the case for cloud applications), there can be small discrepancies in the timestamps due to server times, opening the possibility to overlook records.
- 2) Use a relative date, such as “extract changes made within the last 24 hours”. However this can be problematic a) by syncing the same records multiple times if the integration runs more than once within the relative date range and b) records will be missed if the integration does not run for an extended period of time (e.g. server maintenance). This option is highly discouraged.
- 3) The most robust strategy is to capture the most recent last modified date from the records themselves, and persist that value upon successful completion. Because the timestamp is from the source application itself you can be sure no records will be missed due to server time differences.
- It is important to be vigilant in monitoring and correcting errored records to prevent the number of records being resynced from accumulating too high.
- Integration design usually simpler than flag field approach because you do not need to update source application at the end of the sync.
- No end-application customizations or logic necessary.
- Makes the integration “stateful”. Some troubleshooting efforts may require modifying the last record date captured within the integration.
- Because the timestamp should only be persisted upon successful completion of the entire batch, if one record fails, all the records will be extracted again the next time the sync runs. This means you will need logic in the integration to determine if a record has already been synced (for example, querying the destination application to check if a record already exists).
- Tip: If the destination application supports an “upsert” operation and overwriting the records with the same values is not a problem, you can skip the existence logic to simplify the integration.
- Some applications may only provide the last modified date and not date AND TIME. This means you have to extract records modified since the beginning of that day. In this situation, adapt your integration logic to either perform existence lookups against the destination application or simply update the records repeatedly. Although undesirable to sync the records unnecessarily, this latter approach is often sufficient for infrequent (e.g. daily), lower-volume integrations.
- If mass updates are performed in the source application as part of normal business processes, this will result in all records being synced the next time the integration runs.
Taking it to AtomSphere
- If capturing the date from the records themselves (preferred), use a Map function to evaluate and save the most recent date encountered as a Process Property. Then at the very end of the process, if there were no errors or warnings, use a Set Properties step to persist the Process Property.
- If using the process’ built-in Last Run Date or Last Successful Run Date, you don’t need to do anything other than ensure the process status is successful/failed appropriately (e.g. use an Exception step if it reaches a point where you do not want the Last Successful Run Date to be updated).
- Be aware that using the Retry Documents feature within Process Monitoring to resync failed documents will result in the persisted date being updated which will cause additional records to be resynced or missed. For example, when capturing the date from the record itself, if you retry a document from last week, upon successful completion it will store that date and the next time the sync runs, it will extract all records since last week. If this is a concern, avoid using Retry Documents and always correct the records in the source application to have them to be extracted during the next sync.
3. LAST RESORT: Change Data Capture
But what if you can't use a flag field and there’s no last modified date and the application or API only gives you the entire data set every time.In this option, the integration must remember the entire data set from the last execution in a “cache” to be able to compare the current data set against it. Through this comparison the records that were added, modified, or removed can be identified.
- Fortunately this situation is rare for modern application-based scenarios and is more common in working with legacy applications and file-based integrations, such as a nightly product catalog extract for example.
- Need to ensure the entire data set is obtained from the source application every time.
- Should only be used for “master data” records, not transactional records.
- No end-application customizations or logic necessary.
- Makes the integration very stateful because the entire data set is stored in the integration.
- Requires the extraction and processing of the entire data set, which could be a very large number of records. This will typically take a longer time to run.
- The cache can be “corrupted” if a “bad” data set is processed, with unintentional results. For example, if you accidentally synced a small “test” data set against your full production cache, it would look like a large number of records were deleted which could be very bad for the destination application.
Taking it to AtomSphere
- Ensure the start step operation is configured to retrieve the entire data set every time, whether this is an application connector query or a file-based connector.
- Use the Find Changes step to determine which records should be added, updated, or deleted in the destination application.
Getting data from point A to point B comes with a number of options and considerations. Keep these incremental strategies in mind when you’re designing your next one-way integration and you’ll be well on your way to building a lean, mean, efficient sync. Happy syncing!
Stay tuned for Sync Strategies Part 2: Bidirectional Syncs....