Fault Tolerance Features
Fault tolerance is built into every level of ELM Enterprise Manager, making it one of the most robust log management and server monitoring solutions available. We take the collection, protection, and validation of data seriously, and that shows in the variety of approaches built into our products.
At the agent (monitored system) level, ELM provides two forms of fault tolerance.
Caching - When Service Agents are unable to connect to an ELM Server, they cache data until the connection is re-established, so that every event configured for monitoring is still collected. The cache size can be configured as needed.
Agent Monitor - This monitor item performs regular checks on the installed ELM Service Agents. If a Service Agent fails to respond or responds slowly, an action such as a restart or a notification can be triggered so that monitoring resumes as quickly as possible.
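The caching behavior described above can be illustrated with a minimal sketch. This is not ELM's implementation: the `EventCache` class, the `send` callback, and the choice to discard the oldest event when the cache is full are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque


class EventCache:
    """Hypothetical agent-side buffer; not ELM's actual implementation."""

    def __init__(self, max_events: int) -> None:
        # Assumption: when the cache is full, the oldest event is discarded.
        # deque(maxlen=...) does this automatically on append.
        self._buffer: Deque[str] = deque(maxlen=max_events)

    def submit(self, event: str, send: Callable[[str], bool]) -> None:
        """Queue the event, then try to deliver everything in order."""
        self._buffer.append(event)
        self.flush(send)

    def flush(self, send: Callable[[str], bool]) -> int:
        """Send cached events oldest-first; stop at the first failure."""
        sent = 0
        while self._buffer:
            if not send(self._buffer[0]):
                break  # server still unreachable, keep the rest cached
            self._buffer.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._buffer)
```

Here `send` stands in for whatever transport the agent uses and returns `True` only when the server has accepted the event. Events leave the cache only after a successful send, so an outage delays delivery rather than dropping data, up to the configured cache size.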
Point to Point Verification
ELM includes monitoring features that go beyond a simple ping status check. An Event Writer used in conjunction with Correlation Views can verify the complete cycle: Agents collecting events and delivering them to the ELM Server as expected. If a predetermined stop or start event is not detected within a specified interval, ELM can trigger actions such as a notification, a dashboard alert, or a restart script.
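The idea of detecting a missing event within an interval can be sketched as a simple heartbeat check. The `HeartbeatCheck` class and its method names are hypothetical and only illustrate the timing logic, not how Event Writers or Correlation Views are configured in ELM.

```python
class HeartbeatCheck:
    """Hypothetical end-to-end check: a test event must arrive at least
    once per interval, otherwise the collection path is considered broken."""

    def __init__(self, interval_seconds: float, started_at: float) -> None:
        self.interval_seconds = interval_seconds
        self.last_seen = started_at

    def record(self, timestamp: float) -> None:
        """Call when the expected event is received at the server."""
        self.last_seen = max(self.last_seen, timestamp)

    def is_overdue(self, now: float) -> bool:
        """True when no expected event arrived within the interval."""
        return now - self.last_seen > self.interval_seconds
```

A scheduler would call `is_overdue` periodically and, when it returns `True`, fire whichever action applies: a notification, a dashboard alert, or a restart script.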
ELM is deployed with both a Primary and a Failover database. The Primary database stores the most recent event, performance, SNMP, and Syslog data.
The Failover database prevents loss of monitoring and alerting while the Primary is unavailable, for example during maintenance. Once the connection to the Primary database is re-established, data from the Failover automatically populates the Primary and merges seamlessly, so that all views and reports perform as expected without gaps.
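As a rough illustration of the Primary and Failover pattern, the sketch below uses two in-memory lists in place of real databases. The `DualStore` class, the `(timestamp, message)` record shape, and the merge by timestamp are assumptions for the example; the source does not describe how ELM performs the merge internally.

```python
from typing import List, Tuple

Record = Tuple[float, str]  # (timestamp, message), a hypothetical shape


class DualStore:
    """Hypothetical stand-in for a Primary/Failover database pair."""

    def __init__(self) -> None:
        self.primary: List[Record] = []
        self.failover: List[Record] = []
        self.primary_available = True

    def write(self, record: Record) -> None:
        # Writes go to the Failover only while the Primary is unavailable.
        target = self.primary if self.primary_available else self.failover
        target.append(record)

    def restore_primary(self) -> int:
        """Merge Failover records back into the Primary in time order."""
        self.primary_available = True
        moved = len(self.failover)
        self.primary.extend(self.failover)
        self.primary.sort(key=lambda record: record[0])
        self.failover.clear()
        return moved
```

After `restore_primary` runs, a query against the Primary sees one continuous timeline, which is what lets views and reports show no gap for the outage window.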
For additional fault tolerance, ELM offers the option of a Standby ELM Server, which accepts data (Events, Performance Data) from Agents if the primary, or home, ELM Server becomes unavailable for an extended period of time.
When this condition is detected, Agents automatically switch over and report to the Standby Server.
Once the issues are resolved, Agents return to reporting to their home server.
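The switch to a Standby Server and the return to the home server can be sketched as follows. The `ServerSelector` class and the `grace_seconds` threshold are hypothetical; the source only says the home server must be unavailable "for an extended period of time" and does not state how that period is measured.

```python
from typing import Optional


class ServerSelector:
    """Hypothetical agent-side logic for choosing where to report."""

    def __init__(self, home: str, standby: str, grace_seconds: float) -> None:
        self.home = home
        self.standby = standby
        self.grace_seconds = grace_seconds  # assumed outage threshold
        self._home_down_since: Optional[float] = None

    def choose(self, home_reachable: bool, now: float) -> str:
        """Return the server the agent should report to right now."""
        if home_reachable:
            self._home_down_since = None  # outage over, return home
            return self.home
        if self._home_down_since is None:
            self._home_down_since = now  # outage just started
        if now - self._home_down_since >= self.grace_seconds:
            return self.standby
        return self.home  # brief outage: keep retrying the home server
```

The grace period keeps a short network interruption from causing a switch, while a sustained outage moves reporting to the standby until the home server responds again.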
Active Server Redundancy
ELM is one of the few log management and server monitoring solutions that support Agents reporting to more than one centralized server. In this scenario, Agents in Location A can report to both Server A and Server B, just as Agents in Location B can report to both Server A and Server B, for mirrored redundancy.
The centralized servers also monitor one another. If Server A goes offline, Server B takes over the notifications that were originating from Server A, and vice versa.
This redundancy feature supports environments where continuous event monitoring and uninterrupted notifications are absolutely critical.
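The mutual takeover of notifications can be sketched with a small lookup. The `PEER` mapping and the `responsible_server` function are hypothetical names for illustration and assume exactly two servers that watch each other.

```python
from typing import Optional, Set

# Hypothetical pairing: each server's peer takes over when it is offline.
PEER = {"A": "B", "B": "A"}


def responsible_server(origin: str, online: Set[str]) -> Optional[str]:
    """Return the server that should send a notification owned by `origin`."""
    if origin in online:
        return origin  # normal case: the originating server sends it
    peer = PEER[origin]
    # Takeover: the peer sends on behalf of the offline server.
    return peer if peer in online else None
```

For example, `responsible_server("A", {"B"})` returns `"B"`, reflecting Server B taking over the notifications that originated from Server A.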