Private Atom Cloud Configuration and Monitoring Best Practices

Document created by rich_patterson (Employee) on Feb 8, 2016. Last modified by Adam Arrowsmith on Aug 31, 2016.
Version 5
This article describes best practices for configuring your Private Atom Cloud, along with suggestions for using JMX, Zabbix, and Splunk to monitor and maintain it.

Configuration

Below are some best practices for Private Atom Cloud management with respect to setting execution governance limits and infrastructure setup.

  • Configure Max Execution Time
    • Limit forked executions to 24 hours by setting com.boomi.container.maxExecutionTime, per the reference guide article (see Properties panel).
    • In addition, configure an OS-level script that alerts your system admins when any Java process runs longer than the allotted 24 hours.
    • If a JVM process runs longer than maxExecutionTime, alert the account that owns it and terminate the process manually. Review such processes for possible misconfiguration.
  • Limit each customer's executions to a 512 MB heap
    • This is the default in the procrunner script.
    • This way, if a customer creates a poorly designed AtomSphere process, that process is effectively isolated, reducing the likelihood that other customers are impacted.
  • Enable Disk Quota per account
  • Enable Account Concurrent Execution Limit
  • Set Max Number of Forked Executions for all accounts
  • Set the ULimit values at the OS level.
    • We recommend the following ULimit values:
      • ulimit -n is set to 8192
      • ulimit -v is set to unlimited
      • ulimit -c is set to 0 (zero)
    • To determine whether these limits are sufficient:
      • Periodically run the “lsof” command, which lists open files (see https://en.wikipedia.org/wiki/Lsof).
      • Or, monitor the JMX attribute jmx["java.lang:type=OperatingSystem", "OpenFileDescriptorCount"].
      • Also, monitor Splunk for errors such as “file not found” or “process discarded due to system load”. If they occur, or occur with increasing frequency, check the OS-level attributes to determine whether limits are being reached.
  • Network and storage considerations:
    • For our cloud, we have NetApp mounted via NFS.
    • We also recommend a separate, dedicated network for NFS traffic: each node has a separate NIC on a gigabit network.
  • Managing Keystores across multiple cluster nodes:
    • Java should be installed locally to each node.
    • For ease of configuration, we use Puppet to manage certificates in the Java keystores across all nodes.
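
The OS-level check described under Configure Max Execution Time could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not Boomi tooling: it assumes a Linux host whose ps supports the etimes (elapsed seconds) column and that forked executions appear with the command name java. The function names find_long_running, read_process_table, and report are hypothetical. It only reports offenders; notifying account owners and terminating processes remain manual steps.

```python
import subprocess

# Mirrors the 24-hour maxExecutionTime discussed above.
MAX_EXECUTION_SECONDS = 24 * 60 * 60


def find_long_running(ps_output, limit=MAX_EXECUTION_SECONDS, name="java"):
    """Return (pid, elapsed_seconds) for processes named `name` older than `limit`.

    `ps_output` is expected to hold lines of "pid elapsed_seconds command".
    """
    offenders = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, elapsed, command = fields
        if not (pid.isdigit() and elapsed.isdigit()):
            continue  # skip a header row or malformed line
        if command.strip() == name and int(elapsed) > limit:
            offenders.append((int(pid), int(elapsed)))
    return offenders


def read_process_table():
    """Assumption: procps-style ps; the trailing '=' suppresses column headers."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def report():
    """Print one alert line per Java process that exceeds the limit."""
    for pid, elapsed in find_long_running(read_process_table()):
        print(f"ALERT: java pid {pid} has run for {elapsed // 3600} hours")
```

Calling report() from a scheduled job (cron, for example) and routing its output to your alerting channel would be one way to use it. Keeping the parsing in find_long_running separate from the ps call makes the logic easy to check against sample output.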

 

Monitoring

Here are some best practices for using JMX, Zabbix, and Splunk to monitor system health:
  • There are two different JMX items that could be queried:
    • If you want to know “how many executions are running right now”, use jmx["com.boomi.container.services:type=ExecutionManager", "RunningExecutionCount"] (running, non-queued executions). This value is point-in-time only. If you query every 60 seconds, you might miss an execution that ran for ~15 seconds between two captures.
    • The other one we suggest is jmx["com.boomi.container.services:type=ExecutionManager", "Stats.localExecutionCount"] (total number of executions). Because this is a running tally, you must calculate a delta value. Zabbix lets you do that with an “Item” of type “Calculated” (instead of JMX Agent).
  • Below is an example item we named boomi.stats.localexecutioncount.delta, which captures the number of executions run on the node since the last capture (we capture every 60 seconds).
    • Formula = last("jmx[\"com.boomi.container.services:type=ExecutionManager\", \"Stats.localExecutionCount\"]",0)
    • Store Value “Delta (simple change)”
    • It requires that you also separately capture the “JMX Agent” item jmx["com.boomi.container.services:type=ExecutionManager", "Stats.localExecutionCount"]
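
To illustrate what a simple-change delta yields from a running tally, here is a minimal Python sketch. It is not Zabbix code; the function name simple_change and the sample readings are hypothetical, and it assumes the counter only grows between captures (it does not handle a counter reset, for example after a node restart).

```python
def simple_change(samples):
    """Return current value minus previous value for each consecutive pair."""
    return [curr - prev for prev, curr in zip(samples, samples[1:])]


# Hypothetical readings of Stats.localExecutionCount taken 60 seconds apart.
readings = [1200, 1215, 1215, 1242]
print(simple_change(readings))  # [15, 0, 27]
```

Each delta is the number of executions that completed counting between two captures, so short executions that a point-in-time query would miss are still included.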

  • Create a Zabbix trigger to alert the on-call admin when there may be issues with the cloud:
    • For example: “The number of executions running in the last 5 minutes is less than or equal to 5”
  • To identify when a particular issue occurs, it’s best to implement Graphs in Zabbix.
    • If you create a graph for the item, then you can zoom in/out on specific times to see where statistics have changed dramatically.
  • Consider using Splunk to capture container and/or shared server logs on each server.
    • Although the log itself does not contain the node name or ID, Splunk assigns it when the log is ingested.
    • You can also see the node ID in the file name.
    • Here are some Splunk search queries that could work:
      • host=*dfwatom*
      • source=*xxx_xxx_xxx*.log
      • source=*container.xxx_xxx_xxx_*.log
    • If you are using a single server to capture all of the container logs from the central NFS share, you can hardcode the host name for each log, as in the example inputs.conf below.
      • Note that you would have to call out each filename individually instead of using a wildcard.
      • In this example we use the * wildcard for the date portion of the filename.
# Example Splunk inputs.conf
# xxx_xxx_xxx_xxx is a placeholder for the node ID as it appears in each log file name.
# Atom Container Log (dfwatom01)
[monitor:///usr/local/boomi/cloud/bod/logs/*.container.xxx_xxx_xxx_xxx.log]
host=dfwatom01.dfw.boomi.com

# Atom Container Log (dfwatom02)
[monitor:///usr/local/boomi/cloud/bod/logs/*.container.xxx_xxx_xxx_xxx.log]
host=dfwatom02.dfw.boomi.com
index=main
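
Because each file name has to be called out individually, the stanzas could be generated rather than hand-written. The following Python sketch shows one way to do that. The function name build_inputs_conf and the NODE_ID_01 / NODE_ID_02 values are hypothetical placeholders, and writing index=main into every stanza is an assumption of this sketch, not something the example above requires.

```python
def build_inputs_conf(log_dir, nodes, index="main"):
    """Render one [monitor://...] stanza per (host, node_id) pair."""
    stanzas = []
    for host, node_id in nodes:
        short_name = host.split(".")[0]
        stanzas.append(
            f"# Atom Container Log ({short_name})\n"
            # '*' covers the date portion of the file name
            f"[monitor://{log_dir}/*.container.{node_id}.log]\n"
            f"host={host}\n"
            f"index={index}\n"
        )
    return "\n".join(stanzas)


# NODE_ID_01 / NODE_ID_02 are hypothetical stand-ins for real node IDs.
nodes = [
    ("dfwatom01.dfw.boomi.com", "NODE_ID_01"),
    ("dfwatom02.dfw.boomi.com", "NODE_ID_02"),
]
print(build_inputs_conf("/usr/local/boomi/cloud/bod/logs", nodes))
```

The output can be reviewed and then placed in inputs.conf on the server that reads the shared log directory.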