Creating observability for
1. About the Company
Xenoview’s Deep Visual-AI platform collects and analyzes images from electronic components in the production line.
The platform ensures product quality, authenticity, and traceability for OEMs and EMSs.
Any interruption in its services could directly impact the quality assessment and authenticity verification of the electronic components in production.
2. Project Objective
Monitor critical services to ensure continuous platform uptime.
Xenoview’s platform is implemented as a distributed application architecture:
1. Xenoview on-site applications. Run on clients’ on-premises Windows servers, located on the production lines (“Edge location”).
- Monitor Windows machine metrics such as network throughput, CPU and memory utilization, and disk information;
- Monitor cloud services such as S3 buckets and PostgreSQL to identify potential data-flow failures from the on-site application.
2. Xenoview cloud applications. Run on AWS.
- Critical services to be monitored: RDS, ECS services, S3, and SQS.
4. IAMOPS Monitoring Solution - High Level Design
Based on the requirements, IAMOPS designed and configured two monitoring systems.
It was decided to use the Grafana stack, a suite of open-source tools: Grafana for dashboards and visualizations, Prometheus for metrics collection, and Loki for log collection and querying. The solution uses the following components:
- Lambda Functions – Metrics and logs creation
- Prometheus – Used as a mediator to transfer metrics from Lambda to Mimir
- Mimir (Grafana) – Scalable long-term storage for Prometheus
- Loki (Grafana) – Multi-tenant log aggregation
- Grafana – Visualization and triggering alerts
- Slack, Teams and Emails – Alerts and notifications
4.1 Edge Location - Monitoring Solution
1. Resource Monitoring (L1):
A Grafana dashboard was created with predefined date ranges using a customized Docker image. Grafana Agent was installed on the Windows machines to collect metrics.
The collected metrics allow visualization of space utilization, network throughput, CPU and memory utilization, disk information, and CPU details for each service.
2. Site Monitoring (L2):
Lambda functions scan the S3 buckets per customer, based on the object path. They collect object sizes and counts, and if specific conditions are met (e.g., the object count is zero or objects are smaller than 1 GB), alerts are sent to Xenoview technology’s team. This automated monitoring ensures that the data in the S3 buckets is regularly checked for anomalies and issues.
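The scan-and-check logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual Lambda code: the names `scan_prefix`, `find_anomalies`, and `PrefixStats` are hypothetical, and reading "smaller than 1 GB" as the total size under a customer prefix is an assumption. The S3 client is passed in, so the same code works with a `boto3.client("s3")` instance or a test double.

```python
from dataclasses import dataclass
from typing import List

ONE_GB = 1024 ** 3  # assumed threshold, taken from the "1 GB" example above


@dataclass
class PrefixStats:
    prefix: str
    object_count: int
    total_bytes: int


def scan_prefix(s3_client, bucket: str, prefix: str) -> PrefixStats:
    """Count objects and sum their sizes under one customer prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    count = 0
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no keys match the prefix.
        for obj in page.get("Contents", []):
            count += 1
            total += obj["Size"]
    return PrefixStats(prefix, count, total)


def find_anomalies(stats: PrefixStats, min_bytes: int = ONE_GB) -> List[str]:
    """Return alert reasons for a prefix; an empty list means no alert."""
    reasons = []
    if stats.object_count == 0:
        reasons.append(f"{stats.prefix}: no objects found")
    elif stats.total_bytes < min_bytes:
        reasons.append(
            f"{stats.prefix}: total size {stats.total_bytes} bytes "
            f"is below {min_bytes} bytes"
        )
    return reasons
```

A Lambda handler would call `scan_prefix` once per customer path and forward any non-empty result of `find_anomalies` to the alerting channel.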
3. Production Line Monitoring (L3):
Lambda functions query PostgreSQL databases based on customer types and send the collected information to Grafana.
These functions run multiple times a day, ensuring that data about customer progress is up to date and available for visualization.
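One way such query results could be turned into metrics is to render them in the Prometheus text exposition format, which Prometheus can then forward to Mimir. The sketch below assumes each database row reduces to a (customer, customer type, value) tuple; the function names and label names are hypothetical, and how the text actually reaches Prometheus (for example, through a push gateway) is not specified here.

```python
from typing import Iterable, Tuple


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline for a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str,
                 rows: Iterable[Tuple[str, str, float]]) -> str:
    """Render (customer, customer_type, value) rows as one gauge family."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for customer, customer_type, value in rows:
        labels = (
            f'customer="{escape_label_value(customer)}",'
            f'customer_type="{escape_label_value(customer_type)}"'
        )
        # Doubled braces produce the literal { } around the label set.
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"
```

Keeping the customer and customer type as labels lets a single Grafana panel be filtered or grouped per customer without defining a separate metric for each one.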
4.2.1 Cloud Infrastructure Monitoring Solution - ECS Logging Dashboard:
- We can view logs from applications hosted on AWS ECS in a single dashboard.
- We can filter logs by ECS cluster name, service name, container ID, and stream. We can also search the logs for any keyword (e.g., “conflict”).
- We can see the total log count and the percentage of error logs.
- We can see the difference in log volume between stdout and stderr.
- We can see the rate at which a particular keyword appears in the logs.
- We enabled the AWS ECS FireLens feature, which sends ECS container logs to a custom destination (in our case, Loki).
- Once the FireLens integration is enabled, a sidecar container is automatically added alongside the application container in the ECS service.
- This sidecar container acts as a log-shipping agent for ECS: it takes logs from the application containers where FireLens is enabled and sends them to Loki through Loki’s write endpoint URL.
- Loki is a database for logs, which stores them in S3 for long-term storage.
- To visualize these logs, we added Loki as a data source in Grafana and created a dashboard that queries the logs from Loki and presents the results.
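For illustration, the FireLens wiring described above could look roughly like the container definitions built below. This is a sketch under assumptions, not the deployed configuration: the function name and its parameters are hypothetical, the output option names (`Name`, `Url`, `Labels`, `LabelKeys`, `RemoveKeys`, `LineFormat`) follow the Grafana Fluent Bit plugin for Loki and should be verified against the plugin version in use, and the log router image is left as a parameter. When a task definition is registered through the API rather than the console, the log router container is declared explicitly, as shown here.

```python
import json
from typing import Dict, List


def firelens_container_definitions(app_name: str, app_image: str,
                                   log_router_image: str,
                                   loki_push_url: str) -> List[Dict]:
    """Build ECS container definitions that route app logs to Loki."""
    log_router = {
        "name": "log_router",
        "image": log_router_image,  # a Fluent Bit image that bundles the Loki output
        "essential": True,
        "firelensConfiguration": {
            "type": "fluentbit",
            # Adds ECS metadata (cluster, task, container) to each record.
            "options": {"enable-ecs-log-metadata": "true"},
        },
    }
    app = {
        "name": app_name,
        "image": app_image,
        "essential": True,
        "logConfiguration": {
            "logDriver": "awsfirelens",
            "options": {
                "Name": "grafana-loki",
                "Url": loki_push_url,
                "Labels": '{job="firelens"}',
                # Promote these record keys to Loki labels for filtering.
                "LabelKeys": "container_name,ecs_task_definition,source,ecs_cluster",
                "RemoveKeys": "container_id,ecs_task_arn",
                "LineFormat": "key_value",
            },
        },
    }
    return [log_router, app]
```

The resulting list can be serialized with `json.dumps` into a task definition file or passed as `containerDefinitions` when registering a task definition. The `source` label carries the stdout/stderr distinction used by the dashboard's stream filter.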
4.2.2 Cloud Infrastructure Monitoring Solution - AWS RDS Dashboard:
- Panels in the AWS RDS Dashboard mostly present the utilization of RDS compute resources such as CPU, memory, and disk.
- We can filter the data by AWS Region and DB instance name.
- The dashboard also shows additional metrics, such as the number of active connections, failed SQL jobs, and read/write IOPS, which can be used during troubleshooting.
- Because AWS RDS is an Amazon-managed service, it does not provide a platform where we can install our agents. However, all of its metrics can be monitored in CloudWatch.
- To build a centralized monitoring system in Grafana, we added CloudWatch as a data source, so all data available in CloudWatch can be visualized in Grafana dashboards.
- Grafana queries this data from CloudWatch.
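To make the kind of CloudWatch query behind these panels concrete, the sketch below builds a `MetricDataQueries` list for the CloudWatch `GetMetricData` API, using standard `AWS/RDS` metric names. In the solution itself Grafana issues the queries through its CloudWatch data source; this code is only an equivalent illustration. The function name, query IDs, and chosen statistics are assumptions, and the failed SQL jobs panel is omitted because its source metric is not described here.

```python
from typing import Dict, List

# Query id -> (CloudWatch metric name, statistic); statistics are assumed.
RDS_METRICS = {
    "cpu": ("CPUUtilization", "Average"),
    "memory": ("FreeableMemory", "Average"),
    "disk": ("FreeStorageSpace", "Average"),
    "connections": ("DatabaseConnections", "Maximum"),
    "read_iops": ("ReadIOPS", "Average"),
    "write_iops": ("WriteIOPS", "Average"),
}


def rds_metric_queries(db_instance_id: str,
                       period_seconds: int = 300) -> List[Dict]:
    """Build one GetMetricData query per RDS metric for a DB instance."""
    queries = []
    for query_id, (metric_name, stat) in RDS_METRICS.items():
        queries.append({
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
                    ],
                },
                "Period": period_seconds,
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries
```

With boto3 installed, the list can be passed as `MetricDataQueries` to `get_metric_data` on a `boto3.client("cloudwatch", region_name=...)` client, together with `StartTime` and `EndTime`. The region and the DB instance identifier correspond to the two dashboard filters.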
5. Best Practices Implemented
1. Cost Optimization:
- By utilizing Grafana and Lambda functions, Xenoview technology aims to optimize its expenses.
- The use of Grafana allows for the creation of customized monitoring dashboards without incurring additional licensing costs. This open-source solution helps reduce expenses associated with proprietary monitoring software.
- Running Lambda functions on AWS EventBridge schedules ensures efficient resource utilization by executing scans at specific intervals, preventing unnecessary computing costs.
2. Comprehensive Visibility:
- Resource and site monitoring give Xenoview technology a holistic view of the environment.
- In resource monitoring, Grafana is customized to display metrics relevant to Xenoview technology’s Windows servers, including CPU, Disk, Memory, and more.
- The site monitoring solution is customized to scan the S3 bucket data on a per-customer basis. DynamoDB is used to store and retrieve customer-specific data, allowing for dynamic monitoring based on different customer profiles. The Lambda function code is developed to adapt to these custom requirements.
3. Monitoring the managed services:
- AWS managed services such as S3 and RDS offer no direct way to install monitoring agents within the services themselves.
- Instead, by using “AWS Lambda as an Agent,” the necessary customizations are made to collect telemetry data from these AWS services.
- Lambda can inspect the parameters of these services based on specific requirements, format the data for visualization, and transmit it to monitoring backends such as Loki and Mimir, where it is visualized in Grafana.
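As an example of the "format and transmit" step, a Lambda function could send its findings to Loki as log lines through Loki's HTTP push API (`POST /loki/api/v1/push`). The sketch below only builds the JSON request body; the function name and the label values in the usage note are hypothetical, and authentication and the actual endpoint are left out.

```python
import json
import time
from typing import Dict, Iterable, Optional


def build_loki_push_body(labels: Dict[str, str], lines: Iterable[str],
                         timestamp_ns: Optional[int] = None) -> bytes:
    """Build a JSON body for Loki's push API with a single stream."""
    # Loki expects each timestamp as a string of nanoseconds since the epoch.
    ts = str(timestamp_ns if timestamp_ns is not None else time.time_ns())
    payload = {
        "streams": [
            {
                "stream": labels,
                "values": [[ts, line] for line in lines],
            }
        ]
    }
    return json.dumps(payload).encode("utf-8")
```

The body can then be posted with the `Content-Type: application/json` header, for instance using `urllib.request`. Labels such as `{"job": "s3-scan", "customer": "..."}` would let Grafana filter the resulting logs per customer.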
To create observability and maximize the uptime of critical services, IAMOPS conducted a detailed requirements analysis, reviewed several alternatives, and designed and developed a customized monitoring solution based on open-source tools combined with self-developed Lambda functions.
The solution is designed to be cost-effective and to provide the required observability.