Molecule node restarted due to clustering issues

Document created by mike_aronson Employee on Jun 30, 2015. Last modified by vreddy on Mar 1, 2016
You observed that a molecule node restarted and would like to diagnose what happened.

 


Download the container logs and check whether any of the following messages appear:
  • “…after 2000ms, missing ACKs from [XX.XX.XXX.XX:7800], local_addr=XX.XX.XXX.XX:7800”
    • This indicates that the node at IP address XX.XX.XXX.XX is no longer communicating over port 7800, which points to a network issue with that node.
  • "...WARNING [org.jgroups.protocols.FD_SOCK up] I was suspected by XX.XX.XXX.XX:7800; ignoring the SUSPECT message"
    • If this warning persists and the IP address XX.XX.XXX.XX is part of the cluster, it could indicate a configuration issue.  If the IP address is not part of the cluster, ignoring the SUSPECT message is expected.  If the IP address is part of the cluster but the warning occurs only intermittently, especially around the shutdown time, it could indicate a network issue, or it could be expected during normal startup until the cluster stabilizes.
  • "...WARNING [org.jgroups.protocols.pbcast.NAKACK handleMessage] XX.XX.XXX.XX:7800] discarded message from non-member XX.XX.XXX.XX:7800, my view is [XX.XX.XXX.XX:7800] [XX.XX.XXX.XX:7800]"
    • If the IP address is not part of the cluster, the message is expected to be discarded.  If it is part of the cluster, this could indicate a configuration issue or a temporary clustering issue.
  • "...SEVERE [org.jgroups.protocols.pbcast.NAKACK getEntry] sender XX.XX.XXX.X:7800 not found in xmit_table"
    • This indicates that the node has left the cluster.
  • "...INFO [com.boomi.container.cloudlet.Container stopHeadServices] Container is no longer the head container"
    • The head node in the cluster has lost headship, possibly intentionally as a failover if network issues were observed.
  • "...WARNING [org.jgroups.protocols.pbcast.GMS castViewChangeWithDest] 10.16.190.27:7800 failed to collect all ACKs (1) for mcasted view MergeView..."
    • This again indicates network connectivity issues.  One or more of the nodes is not acknowledging the inter-node communication messages required for a healthy cluster.
  • "....INFO [com.boomi.container.core.BaseContainer restart] Atom restart requested in 5000 milliseconds: Atom restarting due to loss of headship..."
    • This is a failover action that the molecule node takes to attempt to restore operations and network connectivity.
If the above types of messages occur, one or more of the other nodes in the cluster may report an error such as the following in their container logs, indicating that their attempts to communicate with the affected node failed:
  • "...SEVERE [org.jgroups.blocks.BasicConnectionTable$Connection _send] failed sending data to XX.XX.XXX.XX:7800: java.net.SocketException: Broken pipe"
    • This indicates that the node at IP address XX.XX.XXX.XX and port 7800 is having a network connection issue.
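
Checking a large container log for the messages listed above can be scripted. The following is a minimal sketch, not a Boomi-provided tool: it assumes the container log is a plain-text file, matches only on substrings quoted in the messages above, and uses hypothetical names (SIGNATURES, scan_lines, scan_file) and labels chosen for illustration.

```python
import re
from collections import Counter

# Substrings copied from the messages listed above, each mapped to a short
# label. The labels are illustrative summaries, not official terminology.
SIGNATURES = [
    ("missing ACKs from", "peer not acknowledging (network issue)"),
    ("I was suspected by", "suspected by peer"),
    ("discarded message from non-member", "message from non-member"),
    ("not found in xmit_table", "sender left the cluster"),
    ("Container is no longer the head container", "headship lost"),
    ("failed to collect all ACKs", "view change ACKs missing (network issue)"),
    ("Atom restarting due to loss of headship", "failover restart"),
    ("failed sending data to", "send to peer failed"),
]

# Matches an IPv4 address followed by a port, e.g. 10.16.190.27:7800.
ADDRESS = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}:\d+\b")


def scan_lines(lines):
    """Count known clustering messages and the addresses they mention."""
    counts = Counter()
    peers = Counter()
    for line in lines:
        for needle, label in SIGNATURES:
            if needle in line:
                counts[label] += 1
                peers.update(ADDRESS.findall(line))
                break  # count each log line under one label only
    return counts, peers


def scan_file(path):
    """Scan a container log file; undecodable bytes are replaced."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_lines(handle)
```

Calling scan_file on a downloaded container log returns two counters: how often each kind of message occurred, and which address:port pairs appear in those messages. An address that dominates the second counter is a reasonable first candidate for the node with connectivity problems.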

 

In addition to the above, you may also find messages in the container logs indicating that duplicate or missed schedules were occurring at the time, or that communication attempts to our platform were failing.

 

To avoid running duplicate schedules or missing schedules, another node that is still healthy may "take over" as the head.

The relinquishing of headship and restart of a troublesome node may be intended behavior to help prevent prolonged issues.

 

If you would like to investigate why the network connection issues occurred, you may also wish to review any logs that are available at the network level.
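
As a quick check alongside the network-level logs, you can test whether a node currently accepts TCP connections on the clustering port. This is a minimal sketch under stated assumptions: port 7800 is taken from the log messages above, the function name can_connect and the 2-second timeout are illustrative choices, and a successful connection only shows reachability at that moment, not that the cluster is healthy.

```python
import socket


def can_connect(host, port=7800, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run this from one molecule node against the address of another node. A False result while the target node's container is running suggests a firewall, routing, or interface problem worth raising with your network team.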
