This article provides an overview of the options and best practices for monitoring your integration processes.
Knowing when integration processes fail or do not complete as expected is a critical part of managing the integration platform. AtomSphere provides a number of options to suit your needs, depending on the volume of executions, desired communication method, and sophistication of your notification routing and delivery requirements.
For monitoring the general availability and health of your Atoms and Molecules, see Operational Monitoring Techniques for Atom, Molecule, and Atom Cloud Runtimes.
Functional Monitoring Options
The following sections describe the various process monitoring options and best practices for using each.
The Process Reporting console provides detailed information about process executions including process logs and documents.
Manage > Process Reporting
The Process Reporting console can be used to manually monitor low volumes of executions, but as activity increases, handling Events on an exception basis becomes necessary. However, regardless of how you are notified about an error, you will most likely use the Process Reporting console to research logs and document-level information, troubleshoot, and rerun documents. See also Overview of AtomSphere Logs for Troubleshooting.
Use Document Tracking to capture key values from document data to be able to quickly find and trace a given record through multiple process executions.
The Dashboards provide historical and summary information about process executions and inbound requests.
Dashboard > choose dashboard
There are three dashboards that can be used for trending and historical analysis:
AtomSphere users can choose to receive platform Events via email, subscribing to Events by Log Level and Type.
Account menu > Setup > Email Alerts
Email alerts are best used for exception-based notifications that will be received and acted upon by individuals. As a general recommendation, configure the Log Level to WARNING to receive alerts for WARNING and ERROR events but not expected successful activity such as "process started" and "process completed". See Email alert management.
For higher execution volumes, or if more advanced subscription/routing rules are desired, use the Event API to consume and handle Events as needed. See the Platform API section below.
Platform Events are available via RSS feed per Atom or at the account level.
Account menu > Setup > Account Information
Manage > Atom Management > choose Atom > Atom Information
RSS feeds are the simplest approach for receiving platform Events. If you have an incident reporting system or monitoring tool that can consume RSS feeds, consider using them.
For exception-based monitoring, use the "Alerts Only" feeds to receive only Warning- and Error-level Events. There are no other filtering options.
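If your monitoring tool cannot consume RSS feeds directly, a small polling script can filter a feed for alerts. The sketch below parses a sample feed with Python's standard library; the feed structure and entry titles are illustrative assumptions, so adapt the parsing to the actual feed available from your Atom Information page.

```python
# Sketch: filter entries from an AtomSphere "Alerts Only" RSS feed.
# The feed contents below are illustrative assumptions; the real feed
# URL comes from your account's Atom Information page.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Atom Alerts</title>
  <item><title>ERROR: Order Sync failed</title>
        <pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate></item>
  <item><title>WARNING: Slow response from endpoint</title>
        <pubDate>Mon, 01 Jan 2024 12:05:00 GMT</pubDate></item>
</channel></rss>"""

def alert_titles(feed_xml, levels=("ERROR", "WARNING")):
    """Return titles of feed items that begin with one of the given levels."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("title")
            for item in root.iter("item")
            if item.findtext("title", "").startswith(levels)]

print(alert_titles(SAMPLE_FEED))
```

A script like this could feed matching entries into an incident reporting system that lacks native RSS support.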
Events can be consumed programmatically via the Event API.
There is no UI for the list of Events.
The Event object provides the most flexibility for routing rules, message formatting, and delivery options. It is the best option for exception-based monitoring of larger numbers of executions.
For execution-related Events, you can use the Execution Record object and the Download process log operation to retrieve additional details about the execution. Note that the API does not provide access to document-level information, including tracked fields.
You can create another AtomSphere process (or use an external monitoring tool) to periodically extract new Events and route and deliver them as required. For example, Events could be sent to your incident reporting application to leverage existing communication channels and escalation rules. Note that the connection to the destination system (including your company's SMTP mail server) will consume a connection license.
IMPORTANT: Event records are available for 7 days.
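As a rough illustration of the extract-and-route approach, the sketch below builds a platform API query for recent error-level Events. The endpoint path and filter property names (`eventLevel`, `eventDate`) are assumptions based on the Event object's documented fields; verify them against the API reference for your account before use.

```python
# Sketch: build a platform API query for new error Events.
# Endpoint shape and property names are assumptions; confirm against
# the Event object reference in the AtomSphere API documentation.
import json

def build_event_query(account_id, since_iso, level="ERROR"):
    """Return a (url, json_body) pair for querying Events since a timestamp."""
    url = f"https://api.boomi.com/api/rest/v1/{account_id}/Event/query"
    payload = {
        "QueryFilter": {
            "expression": {
                "operator": "and",
                "nestedExpression": [
                    {"operator": "EQUALS",
                     "property": "eventLevel", "argument": [level]},
                    {"operator": "GREATER_THAN",
                     "property": "eventDate", "argument": [since_iso]},
                ],
            }
        }
    }
    return url, json.dumps(payload)

url, body = build_event_query("my-account-123", "2024-01-01T00:00:00Z")
print(url)
```

A scheduled job could POST this body (with appropriate authentication), then hand matching Events to your incident reporting application. Remember the 7-day retention window when choosing a polling interval.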
(Built into Process Flow)
Custom error handling and notification logic can be built into the process design/flow as required.
If truly custom notification logic is required, you can design this into the process flow itself by capturing errors and validation warnings using Try/Catch and Business Rules shapes and connector operation response handling. For example, runtime errors could be caught, formatted with a Message shape, and emailed directly to recipients using a Mail connector and your company's SMTP mail server.
Alternatively, the errors could be mapped to custom alert message XML format, and written to an external database or queue for subsequent processing. This option may be desired when integrating with an existing notification/auditing/monitoring framework or to provide custom attributes in the notification message. It requires additional upfront design, development, and testing effort. It is still advised to use the platform Event subscriptions to be notified of problems that cannot be handled within the context of a process execution.
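The custom alert message approach might look like the following sketch, which maps a caught error to a simple XML structure before it is written to an external queue or database. The `<Alert>` schema here is entirely hypothetical; use whatever format your notification or auditing framework expects.

```python
# Sketch: map a caught process error to a custom alert XML message.
# The <Alert> schema is hypothetical, not a platform-defined format.
import xml.etree.ElementTree as ET

def build_alert_xml(process_name, severity, message, execution_id):
    """Serialize an error into a custom alert XML string."""
    alert = ET.Element("Alert")
    ET.SubElement(alert, "Process").text = process_name
    ET.SubElement(alert, "Severity").text = severity
    ET.SubElement(alert, "Message").text = message
    ET.SubElement(alert, "ExecutionId").text = execution_id
    return ET.tostring(alert, encoding="unicode")

xml = build_alert_xml("Order Sync", "ERROR",
                      "Connection timed out", "execution-12345")
print(xml)
```

Within a process, the equivalent mapping would typically be done with a Map or Message shape rather than external code; this sketch simply shows the shape of the output message.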
A Note about Execution History Retention and Purge Policies
The Process Reporting console displays execution results for the past 30 days. This period cannot be changed. The results include process- and document-level metadata such as execution statistics, document counts, and document tracked fields.
The Process Reporting console will always retain the summary execution history for 30 days; however, the execution artifacts such as process logs and document data can be purged more frequently. This may be desirable for Atoms/runtimes processing a large volume of data. Note that purge schedules are configured for the entire runtime, not per process.
The runtime can also be configured to "Purge Immediately" so that execution artifacts (most notably document data) are removed immediately upon completion of the execution. Again, keep in mind that the Purge Immediately setting applies to the entire runtime, not individual processes.
Execution artifacts for processes running on the Dell Boomi Atom Cloud are purged after 14 days. This can be decreased but not increased beyond 14 days.
While it is technically possible to increase the purge history beyond 30 days for local runtimes, it is not practical because the execution summary information in the Process Reporting console is only available for 30 days.
The purge frequency can be modified per runtime in Atom Management > Atom Properties (or Molecule or Cloud Properties). The Basic Property sets a "global" purge frequency for all execution artifacts. Additionally or alternatively, you can override this setting by configuring purge frequencies for logs (process and container), documents (also applies to Atom Queues), and temp data individually with Advanced Properties.
For more information see Purging of Atom, Molecule, or Atom Cloud logs and data.
Considerations for Monitoring High Volume Low Latency Processes
Processes configured to use Low Latency mode require special considerations for monitoring. Low Latency process executions do not generate Events for process errors, and therefore you cannot rely on email alerts, RSS feeds, or even the Event API. User notifications (from Notify shapes) are created and therefore could be received via email alerts; however, because Low Latency mode is typically used for high-volume scenarios such as inbound web services and JMS messages, receiving an Event or email alert for each individual error can be impractical given the sheer volume of potential messages. Instead, the strategy is to detect a general, persistent issue with the service and then use the available process logs and dashboards to investigate further.
Below are some additional considerations and recommendations for monitoring errors for Low Latency processes.
- Determine whether error reporting is even required. This is especially relevant for web service requests, for which some types of failures may be transient. In fact, the client may have already resubmitted the request. Similarly, determine if you can mitigate errors--especially those caused by "bad data" in the request--by implementing validation logic within the process to reject the request and return an appropriate response to the client. In those situations, there is no need to report an error in the integration layer because the service behaved as intended and successfully rejected an invalid request.
- Design for resiliency. For critical integrations, resiliency should be built into the overall workflow to overcome potential connectivity and system availability issues. This typically involves a decoupled process design, message queuing, and retry mechanisms to ensure all messages are received, processed, and delivered.
- Implement a custom error handling framework. To capture and report issues during a Low Latency execution, design a custom error handling framework as part of the process flow, as discussed above. Using techniques such as proactive validation (e.g., Business Rules, Decision, and Cleanse shapes) and error handling (e.g., Try/Catch shapes and connector response inspection), capture and generate your own event messages and submit them to an appropriate repository such as a message queue, database, or custom log file to be handled asynchronously. However, make sure the repository and downstream framework are robust enough to handle your anticipated volume of events.
- Leverage the ExecutionSummaryRecord API. The Execution Summary Record object provides a pre-packaged, summarized view of a given Low Latency process's execution history for a given time block. This is the same information displayed in the Real-time Dashboard. The status field indicates whether there was an issue with one or more executions during the time block. The summary record also provides useful statistics, such as execution count and execution duration standard deviation, that can be used to detect unexpected execution patterns. Note that this data is collected and made available roughly every 5 minutes, so it is not immediate.
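One way to implement the repository step of a custom error handling framework is to append error events as JSON lines to a local file for asynchronous pickup by a delivery process. The file path and event fields in the sketch below are illustrative assumptions, not a platform feature.

```python
# Sketch: buffer error events as JSON lines in a local file so a
# separate process can deliver them asynchronously. The path and
# event fields are illustrative assumptions.
import json
import os
import tempfile
import time

def record_error_event(path, process_name, message):
    """Append one error event as a JSON line to the given file."""
    event = {"ts": time.time(), "process": process_name, "error": message}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")

path = os.path.join(tempfile.gettempdir(), "error_events.jsonl")
record_error_event(path, "Inbound Orders", "schema validation failed")
```

An append-only file keeps the in-process overhead minimal, which matters in high-volume Low Latency scenarios; a message queue or database offers stronger durability if the volume justifies it.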
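A monitoring script that consumes Execution Summary Records might flag time blocks worth investigating. In this sketch the record field names (`status`, `elapsedTimeStdDev`) and the `COMPLETE` status value are assumptions; confirm them against the ExecutionSummaryRecord object reference for your account.

```python
# Sketch: scan Execution Summary Records for time blocks that warrant
# a closer look. Field names and status values are assumptions; check
# the ExecutionSummaryRecord object reference for the real ones.

def suspicious_blocks(records, max_stddev_ms=5000):
    """Return time blocks with errors or unusually variable durations."""
    return [r["timeBlock"] for r in records
            if r.get("status") != "COMPLETE"
            or r.get("elapsedTimeStdDev", 0) > max_stddev_ms]

sample = [
    {"timeBlock": "2024-01-01T12:00Z", "status": "COMPLETE",
     "executionCount": 900, "elapsedTimeStdDev": 120},
    {"timeBlock": "2024-01-01T12:05Z", "status": "ERROR",
     "executionCount": 870, "elapsedTimeStdDev": 150},
]
print(suspicious_blocks(sample))
```

Because summary data lags by roughly 5 minutes, a check like this suits detection of persistent service issues rather than real-time alerting, which matches the strategy described above.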