Skip to content

Implement crawling controller to fetch directory URLs containing OpenAPI definitions. - Implementing Queue-Based Architecture of Downloading Index Files from Common Crawl Server Using RabbitMQ NOTE: All code is contained within the downloadAndProcessIndexFilesInBackground() function #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: develop
Choose a base branch
from

Conversation

priyanshu-kun
Copy link
Contributor

@priyanshu-kun priyanshu-kun commented Jun 22, 2023

In this pull request, a new crawling controller is presented, whose job it is to fetch directory URLs that are particularly linked to OpenAPI specifications. The controller makes it easier to get the desired OpenAPI definitions by retrieving the directories that contain them.


In this commit, a queue-based architecture is implemented to handle the downloading of index files from the Common Crawl server. RabbitMQ is utilized as the message broker for managing the queue. The downloadAndProcessIndexFilesInBackground() function contains all the necessary code for performing the background download and processing of the index files.

This implementation ensures a more efficient and scalable approach to handle long-running operations while keeping the server responsive and preventing overloading. The queue-based architecture allows for asynchronous processing of index files, providing better performance and fault tolerance.

By leveraging RabbitMQ and encapsulating the functionality within the downloadAndProcessIndexFilesInBackground() function, the codebase is organized and modular, making it easier to maintain and extend in the future."

Copy link
Contributor

@vinitshahdeo vinitshahdeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyanshu-kun Please move the Dummy App to a separate branch - feature/backup-dummy-app

Copy link
Contributor

@vinitshahdeo vinitshahdeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyanshu-kun Have completed initial review, please take a look.

@vinitshahdeo
Copy link
Contributor

@priyanshu-kun In order to prevent rate-limiting issues, you can explore back-off and sleep methods.

priyanshu-kun added 2 commits July 9, 2023 17:24
… Common Crawl Server Using RabbitMQ NOTE: All code is contained within the downloadAndProcessIndexFilesInBackground() function.
@priyanshu-kun priyanshu-kun changed the title Implement crawling controller to fetch directory URLs containing OpenAPI definitions. Implement crawling controller to fetch directory URLs containing OpenAPI definitions. - Implementing Queue-Based Architecture of Downloading Index Files from Common Crawl Server Using RabbitMQ NOTE: All code is contained within the downloadAndProcessIndexFilesInBackground() function Jul 10, 2023
return res.badRequest('Data source not provided');
}

try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for this try catch as we have one already and we are not making any explicit handling for this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants