This repository implements an efficient report analyzer that utilizes an asynchronous task model, dividing the system into two main components: the service and the parser. The asynchronous approach allows the client to receive a quick response without waiting for the entire report to be processed synchronously.
The service exposes HTTP endpoints to the client, allowing it to send attachments along with unique identifiers such as traceId, companyId, customerId, and contentType.
- For attachments of 256KB or smaller (the SQS message size limit), the service stores them directly in Amazon SQS.
- For attachments larger than 256KB, the service uploads them to Amazon S3 and stores the S3 URL in SQS.
- After storing the attachment, the service immediately returns a response to the client.
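The size-based routing above can be sketched as a pure decision function. The constant and names here are illustrative, not taken from the repository; the real service would wrap the AWS SDK calls (SQS SendMessage, S3 PutObject) around this check:

```go
package main

import "fmt"

// maxSQSMessageSize is the SQS message body limit (256 KB).
const maxSQSMessageSize = 256 * 1024

// Destination indicates where the service stores an attachment.
type Destination int

const (
	DestSQS Destination = iota // body fits in an SQS message
	DestS3                     // body goes to S3; only the URL goes to SQS
)

// routeAttachment decides whether an attachment body is small enough to be
// sent directly to SQS or must be uploaded to S3 first.
func routeAttachment(body []byte) Destination {
	if len(body) <= maxSQSMessageSize {
		return DestSQS
	}
	return DestS3
}

func main() {
	small := make([]byte, 100*1024)
	large := make([]byte, 300*1024)
	fmt.Println(routeAttachment(small) == DestSQS) // true
	fmt.Println(routeAttachment(large) == DestS3)  // true
}
```

Because the routing decision depends only on the payload size, the client-facing handler can return as soon as the message or URL is enqueued.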
The parser component employs a pipeline processing model with three goroutines:
- The first goroutine retrieves attachments from SQS.
- The second goroutine parses the attachments according to RFC standards and formats the data for logging.
- The third goroutine stores the valid data in S3, keyed by companyId and customerId to support later detail retrieval, and sends the aggregated data to the log.
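The three-stage pipeline can be sketched with channels standing in for SQS and S3. The stage bodies below are placeholders, not the repository's actual parsing logic:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// runPipeline wires the three parser stages together with channels; the
// inputs slice stands in for SQS, and the returned slice for S3/log output.
func runPipeline(inputs []string) []string {
	raw := make(chan string)    // stage 1 -> stage 2: raw attachments
	parsed := make(chan string) // stage 2 -> stage 3: parsed records

	// Stage 1: retrieve attachments (here, from the inputs slice).
	go func() {
		defer close(raw)
		for _, a := range inputs {
			raw <- a
		}
	}()

	// Stage 2: parse each attachment and format it for logging.
	go func() {
		defer close(parsed)
		for a := range raw {
			parsed <- strings.ToUpper(a) // placeholder for the real parsing
		}
	}()

	// Stage 3: collect valid data (instead of writing to S3 and the log).
	var out []string
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for p := range parsed {
			out = append(out, p)
		}
	}()
	wg.Wait()
	return out
}

func main() {
	fmt.Println(runPipeline([]string{"report-a", "report-b"}))
	// → [REPORT-A REPORT-B]
}
```

Because each stage runs in its own goroutine, stage 1 can already be fetching the next attachment while stages 2 and 3 work on earlier ones.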
This pipeline architecture improves processing efficiency by at least two-thirds compared to sequential processing. If processing fails, the message is sent to a Dead Letter Queue (DLQ) and retried after the visibility timeout.
For related pipeline diagrams, please refer to: https://github.com/iamyeswc/monitor-service/blob/main/doc/pipeline.pdf
To use Pprof for tracing errors, visit: https://github.com/iamyeswc/monitor-service/blob/main/doc/pprof.md
Due to the complexity of raw reports, there is a need for an efficient and accurate report analyzer that supports the analysis of reports within managed domains. By analyzing report results, clients can easily monitor trends and identify anomalies in emails sent from their managed domains. The analyzer also allows clients to view report data by sending sources, including email services, hostnames, and IP addresses. Additionally, it presents more detailed information from the original reports in a readable format, enabling clients to quickly delve deeper and identify potential threats.
- Asynchronous Processing: The system is designed to provide fast responses to clients without requiring them to wait for all reports to be fully processed.
- Pipeline Architecture: The parser uses a pipeline model to efficiently process attachments, improving overall throughput.
- Robust Error Handling: The system ensures that every batch of a large report is processed successfully before the corresponding SQS message is deleted.
- Low Performance:
- Limited Concurrency: Although multiple goroutines handle task processing, a single pipeline limited overall concurrency. A multi-pipeline approach was therefore implemented, allowing tasks to be processed in parallel and increasing throughput.
- Uneven Task Processing Speed: The first and third goroutines processed tasks quickly, while the second, which handles decompression, DNS queries, and report processing, took much longer. Multiple goroutines were therefore assigned to the second stage, so its time-consuming operations run in parallel.
- Handling Large Reports:
- Due to log processing constraints, large reports must be divided into multiple batches. If one batch fails, the third task attempts to delete the same SQS message multiple times. Although AWS SQS does not return an error when deleting an already deleted message, this behavior is problematic because it can silently mask unprocessed batches.
- To resolve this, counters for the number of batches to process and the number of successfully processed batches were introduced. Only when these counts match will the message be deleted from the queue, ensuring that all batches have been successfully processed. If at least one batch fails, the message remains in the queue and is sent to the DLQ for retry after the visibility timeout.
This project represents a comprehensive report analyzer that leverages asynchronous processing and a pipeline architecture to enhance performance and user experience. By addressing the challenges of processing large reports and ensuring robust error handling, the system provides a reliable solution for analyzing complex reports efficiently.
For further details, please refer to the code and documentation within this repository.