Creating observability for uptime-critical services

1. About the Company

Xenoview’s Deep Visual-AI platform collects and analyzes images from electronic components in the production line. 

The platform ensures product quality, authenticity, and traceability for OEMs and EMSs.

Any interruption in its services could directly affect the quality assessment and authenticity verification of the electronic components in production.


2. Project Objective

Monitor critical services to ensure continuous platform uptime.

3. Requirements

Xenoview’s platform is a distributed application with two parts:

1. Xenoview on-site applications. These run on clients’ on-premises Windows servers, located on the production lines (the “Edge location”).

        Requirements

  • Monitor Windows machine metrics such as network throughput, CPU and memory utilization, and disk information;
  • Monitor cloud services such as S3 Buckets and PostgreSQL to identify potential data-flow failures from the on-site application.

2. Xenoview cloud applications. These run on AWS.

        Requirements

  • Critical services to be monitored: RDS, ECS services, S3, and SQS.

4. IAMOPS Monitoring Solution - High Level Design

Based on the requirements, IAMOPS designed and configured two monitoring systems: one for the edge location and one for the cloud infrastructure.

Both are built on the Grafana stack, a suite of open-source tools: Grafana for dashboards and visualizations, Prometheus for metrics collection, and Loki for log collection and querying.

Technologies configured:

  • Lambda Functions – Metrics and logs creation
  • Prometheus – Used as a mediator to transfer metrics from Lambda to Mimir
  • Mimir (Grafana) – Scalable long-term storage for Prometheus
  • Loki (Grafana) – Multi-tenant log aggregation
  • Grafana – Visualization and alert triggering
  • Slack, Teams, and email – Alerts and notifications

4.1 Edge Location - Monitoring Solution

1. Resource Monitoring (L1):

A Grafana dashboard was created with predefined date ranges using a customized Docker image. Grafana Agent was installed on the Windows machines to collect metrics.

The collected metrics allow visualization of space utilization, network throughput, CPU and memory utilization, disk information, and CPU details for each service.


2. Site Monitoring (L2):

Lambda functions scan the S3 buckets per customer, based on each customer’s path. The functions collect object sizes and counts, and if specific conditions are met (e.g., the object count is zero or the objects are smaller than 1 GB), alerts are sent to the Xenoview technology team. This automated monitoring ensures that data in the S3 buckets is regularly checked for anomalies and issues.
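
A minimal Python sketch of such a check, under stated assumptions: the function and class names are hypothetical, and the 1 GB condition is interpreted here as the total size under a customer path (the text does not say whether the threshold applies per object or in total). Only `scan_prefix` talks to AWS, using the boto3 `list_objects_v2` paginator.

```python
from dataclasses import dataclass

ONE_GB = 1024 ** 3  # size threshold from the text, in bytes


@dataclass
class PrefixStats:
    prefix: str
    object_count: int
    total_bytes: int


def summarize(prefix, sizes):
    """Aggregate object sizes (in bytes) found under one customer path."""
    sizes = list(sizes)
    return PrefixStats(prefix, len(sizes), sum(sizes))


def find_anomalies(stats, min_bytes=ONE_GB):
    """Return alert reasons; an empty list means the path looks healthy."""
    reasons = []
    if stats.object_count == 0:
        reasons.append(f"{stats.prefix}: no objects found")
    elif stats.total_bytes < min_bytes:  # assumption: threshold on total size
        reasons.append(f"{stats.prefix}: {stats.total_bytes} bytes is below {min_bytes}")
    return reasons


def scan_prefix(bucket, prefix):
    """List one customer path with boto3 (imported lazily; needs AWS credentials)."""
    import boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    sizes = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        sizes.extend(obj["Size"] for obj in page.get("Contents", []))
    return summarize(prefix, sizes)
```

Keeping the threshold logic in `find_anomalies`, separate from the S3 listing, lets the alert conditions be tested without AWS access.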


3. Production Line Monitoring (L3):

Lambda functions query PostgreSQL databases based on customer types and send the collected information to Grafana.

These functions run multiple times a day, ensuring that data about customer progress is up to date and available for visualization.
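
The text does not describe how the query results are formatted before they reach the metrics pipeline. As one possible sketch, the function below renders rows as Prometheus text exposition lines, a format that a Prometheus Pushgateway accepts. The metric name and labels in the usage note are hypothetical, and the use of a Pushgateway is an assumption.

```python
def escape_label(value):
    """Escape backslash, double quote and newline, as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_exposition(metric, rows, help_text=""):
    """Render (labels, value) pairs as Prometheus text exposition lines."""
    lines = []
    if help_text:
        lines.append(f"# HELP {metric} {help_text}")
    lines.append(f"# TYPE {metric} gauge")
    for labels, value in rows:
        label_str = ",".join(
            f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{metric}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"
```

For example, `to_exposition("boards_processed", [({"customer": "acme"}, 42)])` yields a `# TYPE` line followed by `boards_processed{customer="acme"} 42`.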


4.2.1 Cloud Infrastructure Monitoring Solution - ECS Logging Dashboard:

Dashboard Visuals:

  • Logs from applications hosted on AWS ECS are available in one place.
  • Logs can be filtered by ECS cluster name, service name, container ID, and stream, and searched by any keyword (e.g., “conflict”).
  • Total log counts and the percentage of error logs are shown.
  • The difference in log volume between stdout and stderr is shown.
  • The rate at which a particular keyword appears in the logs is shown.
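
The filters and keyword rate described above map naturally onto LogQL, Loki's query language. The sketch below builds such queries as strings; the label names (`ecs_cluster`, `source`) are assumptions, since the actual labels depend on how the log shipper is configured.

```python
def selector(**labels):
    """Build a LogQL stream selector from label=value pairs."""
    pairs = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return "{" + pairs + "}"


def keyword_filter(stream_selector, keyword):
    """Keep only log lines containing the keyword (LogQL line filter |=)."""
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return f'{stream_selector} |= "{escaped}"'


def keyword_rate(stream_selector, keyword, window="5m"):
    """Per-second rate of matching lines, summed over all matching streams."""
    return f"sum(rate({keyword_filter(stream_selector, keyword)} [{window}]))"
```

A dashboard panel could use `keyword_filter(selector(ecs_cluster="prod"), "conflict")` for the log view and `keyword_rate(...)` with the same arguments for the rate graph.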

Technical Configurations:

  • We enabled the AWS ECS FireLens feature, which sends ECS container logs to a custom destination (in our case, Loki).
  • Once the FireLens integration is enabled, a sidecar container is automatically created alongside the application container in the ECS service.
  • This sidecar container acts as a log-shipping agent for ECS: it takes logs from the application containers where FireLens is enabled and sends them to Loki using Loki’s write endpoint URL.
  • Loki is a log database that stores logs in S3 for long-term storage.
  • To visualize these logs, we added Loki as a data source in Grafana and created a dashboard that queries Loki and presents the results.
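
As a rough illustration of the FireLens setup, the sketch below builds the `containerDefinitions` part of an ECS task definition as a Python structure. It assumes the Grafana Fluent Bit plugin image and its `grafana-loki` output; the source does not say which router image or options were used, so the image, label keys, and function name here are assumptions. When a task definition is written by hand, the log router container is declared explicitly, as shown.

```python
def firelens_container_definitions(app_name, app_image, loki_push_url):
    """Build ECS containerDefinitions: a Fluent Bit log router plus the app container."""
    log_router = {
        "name": "log_router",
        "image": "grafana/fluent-bit-plugin-loki:latest",  # assumed router image
        "essential": True,
        "firelensConfiguration": {
            "type": "fluentbit",
            "options": {"enable-ecs-log-metadata": "true"},
        },
    }
    app = {
        "name": app_name,
        "image": app_image,
        "essential": True,
        "logConfiguration": {
            "logDriver": "awsfirelens",
            "options": {
                "Name": "grafana-loki",  # assumed Fluent Bit output plugin
                "Url": loki_push_url,  # Loki's write endpoint URL
                "Labels": '{job="firelens"}',
                "LabelKeys": "container_name,ecs_cluster,source",  # assumed labels
                "LineFormat": "key_value",
            },
        },
    }
    return [log_router, app]
```

The returned list could be passed as `containerDefinitions` when registering a task definition.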

4.2.2 Cloud Infrastructure Monitoring Solution - AWS RDS Dashboard:

Dashboard Visualizations:

  • Panels in the AWS RDS dashboard mostly show the utilization of RDS compute resources such as CPU, memory, and disk.
  • Data can be filtered by AWS Region and DB instance name.
  • The dashboard also shows additional metrics, such as the number of active connections, failed SQL jobs, and read/write IOPS, which can be used during troubleshooting.

Technical Configurations:

  • Because AWS RDS is an Amazon-managed service, it does not provide a host on which we can install our agents. However, all of its metrics can be monitored in CloudWatch.
  • To keep monitoring centralized in Grafana, we added CloudWatch as a data source, so all data available in CloudWatch can be visualized in Grafana dashboards.
  • Grafana queries this data from CloudWatch.
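
In the solution described above, Grafana's CloudWatch data source issues the queries itself. To show what such a query looks like, here is a sketch of the equivalent request using the boto3 `get_metric_data` call. The function names and the choice of metrics are illustrative assumptions; `AWS/RDS` and `DBInstanceIdentifier` are the standard CloudWatch namespace and dimension for RDS.

```python
from datetime import datetime, timedelta, timezone


def rds_metric_query(db_instance, metric_name, stat="Average", period=300, query_id="m1"):
    """Build one MetricDataQueries entry for an RDS instance metric."""
    return {
        "Id": query_id,  # must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/RDS",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance}],
            },
            "Period": period,  # seconds per datapoint
            "Stat": stat,
        },
        "ReturnData": True,
    }


def fetch_last_hour(db_instance, region):
    """Fetch CPU and connection metrics for the last hour (needs AWS credentials)."""
    import boto3

    end = datetime.now(timezone.utc)
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            rds_metric_query(db_instance, "CPUUtilization", query_id="cpu"),
            rds_metric_query(db_instance, "DatabaseConnections", query_id="connections"),
        ],
        StartTime=end - timedelta(hours=1),
        EndTime=end,
    )
    return {
        result["Id"]: list(zip(result["Timestamps"], result["Values"]))
        for result in response["MetricDataResults"]
    }
```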

5. Best Practices Implemented

1. Cost-Efficiency:

  • By using Grafana and Lambda functions, Xenoview aims to optimize its expenses.
  • Grafana allows customized monitoring dashboards to be created without additional licensing costs. This open-source solution helps reduce the expenses associated with proprietary monitoring software.
  • Running Lambda functions on AWS EventBridge schedules executes scans only at specific intervals, which avoids unnecessary compute costs.
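
A sketch of how a scheduled scan might be wired up with boto3, assuming a rate-based schedule; the rule name, target ID, and interval are hypothetical, and the source does not state the actual schedules. EventBridge `rate()` expressions use a singular unit only when the value is 1.

```python
def rate_expression(value, unit):
    """Build an EventBridge rate() expression, e.g. rate(6 hours)."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError(f"unsupported unit: {unit}")
    if value < 1:
        raise ValueError("value must be a positive integer")
    return f"rate({value} {unit if value == 1 else unit + 's'})"


def schedule_lambda(rule_name, function_arn, expression):
    """Create or update a scheduled rule that invokes the Lambda (needs AWS credentials)."""
    import boto3

    events = boto3.client("events")
    events.put_rule(Name=rule_name, ScheduleExpression=expression, State="ENABLED")
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "scan-lambda", "Arn": function_arn}],  # hypothetical target ID
    )
```

The Lambda function also needs a resource-based permission that allows EventBridge to invoke it, which is omitted here.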

2. Comprehensive Visibility:

  • Resource and site monitoring together give Xenoview a holistic view of the environment.
  • In resource monitoring, Grafana is customized to display metrics relevant to Xenoview’s Windows servers, including CPU, disk, memory, and more.
  • The site monitoring solution is highly customized to scan the S3 bucket data per customer. DynamoDB is used to store and retrieve customer-specific data, allowing dynamic monitoring based on different customer profiles. The Lambda function code is developed to adapt to these custom requirements.

3. Monitoring the managed services:

  • AWS managed services such as S3 and RDS offer no direct way to install monitoring agents inside the service.
  • Instead, using “AWS Lambda as an agent”, the necessary customizations are made to collect telemetry data from these AWS services.
  • Lambda can inspect the parameters of these services based on specific requirements, format the data for visualization, and transmit it to monitoring backends such as Loki and Mimir, from which Grafana visualizes it.
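
To illustrate the "format and transmit" step, the sketch below builds a payload for Loki's HTTP push API (`/loki/api/v1/push`) and posts it with the standard library. The label names and function names are hypothetical; the `X-Scope-OrgID` header is only relevant when Loki runs in multi-tenant mode.

```python
import json
import time
import urllib.request


def loki_payload(labels, lines, now_ns=None):
    """Build a Loki push payload: one stream, one entry per log line."""
    timestamp = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [
            {"stream": dict(labels), "values": [[timestamp, line] for line in lines]}
        ]
    }


def push_to_loki(base_url, payload, tenant=None):
    """POST the payload to Loki's push endpoint and return the HTTP status."""
    headers = {"Content-Type": "application/json"}
    if tenant:
        headers["X-Scope-OrgID"] = tenant
    request = urllib.request.Request(
        base_url.rstrip("/") + "/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Loki expects timestamps as strings of nanoseconds since the Unix epoch, which is why `loki_payload` converts the value with `str`.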

6. Summary

To create observability and maximize the uptime of critical services, IAMOPS conducted a detailed requirements analysis, reviewed several alternatives, and designed and developed a customized monitoring solution based on open-source tools combined with self-developed Lambda functions.

The solution is designed to be cost-effective and to provide the required observability.
