feat: Add specialized downloaders achieving 86.95% success rate #57

Open
wants to merge 7 commits into
base: master
7 changes: 6 additions & 1 deletion .gitignore
@@ -36,4 +36,9 @@ wheels/
*.egg-info/
.installed.cfg
*.egg

.idea
venv
logs
*.log
processed_output
__pycache__
12 changes: 12 additions & 0 deletions .idea/glossAPI.iml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml

10 changes: 10 additions & 0 deletions .idea/misc.xml

8 changes: 8 additions & 0 deletions .idea/modules.xml

6 changes: 6 additions & 0 deletions .idea/vcs.xml

26 changes: 26 additions & 0 deletions build.gradle
@@ -0,0 +1,26 @@
plugins {
    id 'java'
    id 'idea'
}

group = 'com.example'
version = '1.0-SNAPSHOT'

repositories {
    mavenCentral()
}

dependencies {
    // MCP dependencies would go here
    implementation 'org.apache.logging.log4j:log4j-api:2.17.1'
    implementation 'org.apache.logging.log4j:log4j-core:2.17.1'

    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.9.2'
    testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.9.2'
}

test {
    useJUnitPlatform()
}

// MCP-specific configurations would go here
Empty file added concurrent_download.log
Empty file.
Empty file added conversion.log
Empty file.
101 changes: 101 additions & 0 deletions downloader.log
@@ -0,0 +1,101 @@
2025-03-29 03:14:37,949 - INFO - System info: {'platform': 'Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35', 'python': '3.10.12', 'time': '2025-03-29 03:14:37'}
2025-03-29 03:14:37,950 - INFO - Arguments received: JSON file: scraping/json_sitemaps/kallipos_pdf.json, Sleep time: 1.0, File type: pdf, Request type: get, Output path: downloads/kallipos, Progress report path: downloads/kallipos, Batch size: 100, Max retries: 3
2025-03-29 03:14:37,959 - INFO - Loaded 5169 metadata entries from scraping/json_sitemaps/kallipos_pdf.json
2025-03-29 03:14:37,959 - INFO - Using 3 concurrent requests for batch size 100
2025-03-29 03:14:37,959 - INFO - Loaded existing progress report with 224 entries
2025-03-29 03:14:37,960 - INFO - Starting PDF downloads. 4945 files remaining.
2025-03-29 03:14:37,961 - INFO - Scheduled 10/100 downloads
2025-03-29 03:14:37,961 - INFO - Scheduled 20/100 downloads
2025-03-29 03:14:37,961 - INFO - Scheduled 30/100 downloads
2025-03-29 03:14:37,962 - INFO - Scheduled 40/100 downloads
2025-03-29 03:14:37,962 - INFO - Scheduled 50/100 downloads
2025-03-29 03:14:37,962 - INFO - Scheduled 60/100 downloads
2025-03-29 03:14:37,963 - INFO - Scheduled 70/100 downloads
2025-03-29 03:14:37,963 - INFO - Scheduled 80/100 downloads
2025-03-29 03:14:37,964 - INFO - Scheduled 90/100 downloads
2025-03-29 03:14:37,964 - INFO - Scheduled 100/100 downloads
2025-03-29 03:14:38,915 - ERROR - Server error 500 when downloading https://repository.kallipos.gr/retrieve/257938a8-2fba-4151-8100-5c0342d8ff71/295-TRIANTAFYLLOU-Information-Retrieval-and-Search-Techniques.pdf
2025-03-29 03:14:38,916 - INFO - Retry attempt 1/3 - waiting 2.52 seconds
2025-03-29 03:14:39,328 - ERROR - Server error 500 when downloading https://repository.kallipos.gr/retrieve/b5d477dd-7b8b-485e-8140-41fb1a7e5595/20230714_%CE%A0%CE%91_%CE%92%CE%B1%CE%B3%CE%B9%CE%B1%CC%81%CE%BD%CE%BF%CF%82_%CE%93%CF%81%CE%B1%CF%86%CE%B9%CF%83%CF%84%CE%B9%CE%BA%CE%B7%CC%81%20%CE%95%CF%80%CE%B9%CE%BC%CE%B5%CC%81%CE%BB%CE%B5%CE%B9%CE%B1.pdf
2025-03-29 03:14:39,328 - WARNING - Increased request delay to 1.50s after 2 consecutive errors
2025-03-29 03:14:39,329 - INFO - Retry attempt 1/3 - waiting 3.18 seconds
2025-03-29 03:14:39,419 - ERROR - Server error 500 when downloading https://repository.kallipos.gr/retrieve/e51ef661-b962-4b35-b170-f4ecd02a3188/562-DASSIOS-Partial-Differential-Equations.pdf
2025-03-29 03:14:39,420 - WARNING - Increased request delay to 3.38s after 3 consecutive errors
2025-03-29 03:14:39,420 - INFO - Retry attempt 1/3 - waiting 2.40 seconds
2025-03-29 03:31:05,024 - INFO - System info: {'platform': 'Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35', 'python': '3.10.12', 'time': '2025-03-29 03:31:05'}
2025-03-29 03:31:05,025 - INFO - Arguments received: JSON file: scraping/json_sitemaps/kallipos_pdf.json, Sleep time: 3.0, File type: pdf, Request type: get, Output path: downloads/kallipos, Progress report path: downloads/kallipos, Batch size: 100, Max retries: 3
2025-03-29 03:31:05,033 - INFO - Loaded 5169 metadata entries from scraping/json_sitemaps/kallipos_pdf.json
2025-03-29 03:31:05,033 - INFO - 🚀 KALLIPOS MODE ENABLED: Using optimized settings for Kallipos repository
2025-03-29 03:31:05,033 - INFO - Using 1 concurrent requests for batch size 100
2025-03-29 03:31:05,033 - INFO - Using Kallipos-specific settings: max_retries=6, base_wait=5.0s, max_wait=120.0s
2025-03-29 03:31:05,034 - INFO - Loaded existing progress report with 224 entries
2025-03-29 03:31:05,034 - INFO - Starting PDF downloads. 4945 files remaining.
2025-03-29 03:31:05,035 - INFO - Scheduled 10/100 downloads
2025-03-29 03:31:05,035 - INFO - Scheduled 20/100 downloads
2025-03-29 03:31:05,036 - INFO - Scheduled 30/100 downloads
2025-03-29 03:31:05,036 - INFO - Scheduled 40/100 downloads
2025-03-29 03:31:05,036 - INFO - Scheduled 50/100 downloads
2025-03-29 03:31:05,037 - INFO - Scheduled 60/100 downloads
2025-03-29 03:31:05,037 - INFO - Scheduled 70/100 downloads
2025-03-29 03:31:05,037 - INFO - Scheduled 80/100 downloads
2025-03-29 03:31:05,038 - INFO - Scheduled 90/100 downloads
2025-03-29 03:31:05,038 - INFO - Scheduled 100/100 downloads
2025-03-29 03:31:05,038 - INFO - Adding 7.19s extra delay for repository.kallipos.gr (every 10 requests)
2025-03-29 03:31:13,138 - ERROR - Server error 500 when downloading https://repository.kallipos.gr/retrieve/257938a8-2fba-4151-8100-5c0342d8ff71/295-TRIANTAFYLLOU-Information-Retrieval-and-Search-Techniques.pdf from repository.kallipos.gr
2025-03-29 03:31:13,139 - WARNING - Kallipos server returned 500 error - this is common and usually temporary
2025-03-29 03:31:13,139 - INFO - Retry attempt 1/6 - waiting 33.16 seconds
2025-03-29 03:31:34,395 - INFO - Progress report written to downloads/kallipos/progress_report.json
2025-03-29 03:31:34,396 - INFO - Download summary: 224/5169 files (4.3%) processed
2025-03-29 03:31:34,396 - INFO - Total elapsed time: 1016.45 seconds
2025-03-29 03:31:34,396 - INFO - Average time per downloaded file: 4.54 seconds
2025-03-29 03:31:34,396 - INFO - Download rate: 0.22 files/second
2025-03-29 03:31:34,398 - INFO - Program terminated by user
2025-03-29 03:44:24,237 - INFO - System info: {'platform': 'Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35', 'python': '3.10.12', 'time': '2025-03-29 03:44:24'}
2025-03-29 03:45:14,236 - INFO - System info: {'platform': 'Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35', 'python': '3.10.12', 'time': '2025-03-29 03:45:14'}
2025-03-29 03:45:47,122 - INFO - System info: {'platform': 'Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35', 'python': '3.10.12', 'time': '2025-03-29 03:45:47'}
2025-03-29 03:45:47,123 - INFO - Arguments received: JSON file: scraping/json_sitemaps/kallipos_pdf.json, Sleep time: 3.0, File type: pdf, Request type: get, Output path: downloads/kallipos, Progress report path: downloads/kallipos, Batch size: 100, Max retries: 3
2025-03-29 03:45:47,133 - INFO - Loaded 5169 metadata entries from scraping/json_sitemaps/kallipos_pdf.json
2025-03-29 03:45:47,133 - INFO - 🚀 KALLIPOS MODE ENABLED: Using optimized settings for Kallipos repository
2025-03-29 03:45:47,133 - INFO - Using 1 concurrent requests for batch size 100
2025-03-29 03:45:47,133 - INFO - Using Kallipos-specific settings: max_retries=6, base_wait=5.0s, max_wait=120.0s
2025-03-29 03:45:47,134 - INFO - Loaded existing progress report with 224 entries
2025-03-29 03:45:47,134 - INFO - Starting PDF downloads. 4945 files remaining.
2025-03-29 03:45:47,135 - INFO - Scheduled 10/100 downloads
2025-03-29 03:45:47,135 - INFO - Scheduled 20/100 downloads
2025-03-29 03:45:47,136 - INFO - Scheduled 30/100 downloads
2025-03-29 03:45:47,136 - INFO - Scheduled 40/100 downloads
2025-03-29 03:45:47,136 - INFO - Scheduled 50/100 downloads
2025-03-29 03:45:47,137 - INFO - Scheduled 60/100 downloads
2025-03-29 03:45:47,137 - INFO - Scheduled 70/100 downloads
2025-03-29 03:45:47,137 - INFO - Scheduled 80/100 downloads
2025-03-29 03:45:47,138 - INFO - Scheduled 90/100 downloads
2025-03-29 03:45:47,138 - INFO - Scheduled 100/100 downloads
2025-03-29 03:45:47,138 - INFO - Adding 3.12s extra delay for repository.kallipos.gr (every 10 requests)
2025-03-29 03:45:51,170 - ERROR - Server error 500 when downloading https://repository.kallipos.gr/retrieve/257938a8-2fba-4151-8100-5c0342d8ff71/295-TRIANTAFYLLOU-Information-Retrieval-and-Search-Techniques.pdf from repository.kallipos.gr
2025-03-29 03:45:51,171 - WARNING - Kallipos server returned 500 error - this is common and usually temporary
2025-03-29 03:45:51,171 - INFO - Retry attempt 1/6 - waiting 32.74 seconds
2025-03-29 03:46:30,798 - INFO - Progress report written to downloads/kallipos/progress_report.json
2025-03-29 03:46:30,798 - INFO - Download summary: 224/5169 files (4.3%) processed
2025-03-29 03:46:30,799 - INFO - Total elapsed time: 43.68 seconds
2025-03-29 03:46:30,799 - INFO - Average time per downloaded file: 0.19 seconds
2025-03-29 03:46:30,799 - INFO - Download rate: 5.13 files/second
2025-03-29 03:46:30,801 - INFO - Program terminated by user
2025-03-29 03:47:04,073 - INFO - System info: {'platform': 'Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35', 'python': '3.10.12', 'time': '2025-03-29 03:47:04'}
2025-03-29 03:47:04,074 - INFO - Arguments received: JSON file: scraping/json_sitemaps/kallipos_pdf.json, Sleep time: 5.0, File type: pdf, Request type: get, Output path: downloads/kallipos, Progress report path: downloads/kallipos, Batch size: 10, Max retries: 8
2025-03-29 03:47:04,081 - INFO - Loaded 5169 metadata entries from scraping/json_sitemaps/kallipos_pdf.json
2025-03-29 03:47:04,081 - INFO - 🚀 KALLIPOS MODE ENABLED: Using optimized settings for Kallipos repository
2025-03-29 03:47:04,081 - INFO - Using 1 concurrent requests for batch size 10
2025-03-29 03:47:04,081 - INFO - Using Kallipos-specific settings: max_retries=6, base_wait=5.0s, max_wait=120.0s
2025-03-29 03:47:04,081 - INFO - Loaded existing progress report with 224 entries
2025-03-29 03:47:04,081 - INFO - Starting PDF downloads. 4945 files remaining.
2025-03-29 03:47:04,083 - INFO - Scheduled 10/10 downloads
2025-03-29 03:47:04,083 - INFO - Adding 7.59s extra delay for repository.kallipos.gr (every 10 requests)
2025-03-29 03:47:12,621 - ERROR - Server error 500 when downloading https://repository.kallipos.gr/retrieve/257938a8-2fba-4151-8100-5c0342d8ff71/295-TRIANTAFYLLOU-Information-Retrieval-and-Search-Techniques.pdf from repository.kallipos.gr
2025-03-29 03:47:12,621 - WARNING - Kallipos server returned 500 error - this is common and usually temporary
2025-03-29 03:47:12,621 - INFO - Retry attempt 1/6 - waiting 23.08 seconds
2025-03-29 04:10:48,703 - INFO - Progress report written to downloads/kallipos/progress_report.json
2025-03-29 04:10:48,704 - INFO - Download summary: 224/5169 files (4.3%) processed
2025-03-29 04:10:48,704 - INFO - Total elapsed time: 1424.63 seconds
2025-03-29 04:10:48,704 - INFO - Average time per downloaded file: 6.36 seconds
2025-03-29 04:10:48,704 - INFO - Download rate: 0.16 files/second
2025-03-29 04:10:48,707 - INFO - Program terminated by user
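The runs above log `max_retries=6, base_wait=5.0s, max_wait=120.0s` and a randomized wait before each retry. The log does not reveal the formula the downloader uses, and the logged first-attempt waits (roughly 23 to 33 seconds) suggest it differs from the one below. As a minimal sketch of one common approach, capped exponential backoff with jitter, assuming the hypothetical helper names `backoff_wait` and `retry_schedule`:

```python
import random
from typing import List, Optional


def backoff_wait(attempt: int, base_wait: float = 5.0, max_wait: float = 120.0,
                 rng: Optional[random.Random] = None) -> float:
    """Return the seconds to wait before retry number `attempt` (1-based).

    Hypothetical helper: the defaults mirror the values in the log, but the
    formula itself is an assumption, not the PR's implementation.
    """
    rng = rng or random.Random()
    # Exponential growth: base_wait, 2*base_wait, 4*base_wait, ... capped at max_wait.
    ceiling = min(max_wait, base_wait * 2 ** (attempt - 1))
    # Jitter of +/-50% so that retries do not arrive in lockstep.
    jittered = ceiling * rng.uniform(0.5, 1.5)
    # Re-apply the cap, since jitter can push the value above max_wait.
    return min(max_wait, jittered)


def retry_schedule(max_retries: int = 6, base_wait: float = 5.0,
                   max_wait: float = 120.0, seed: Optional[int] = None) -> List[float]:
    """Return the wait before each of `max_retries` attempts."""
    rng = random.Random(seed)
    return [backoff_wait(a, base_wait, max_wait, rng)
            for a in range(1, max_retries + 1)]


if __name__ == "__main__":
    for attempt, wait in enumerate(retry_schedule(seed=42), start=1):
        print(f"Retry attempt {attempt}/6 - waiting {wait:.2f} seconds")
```

Capping after the jitter keeps every wait within `max_wait`, which matters when a server that returns repeated 500 errors would otherwise push the delay past the configured limit.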
11 changes: 11 additions & 0 deletions logs/download_20250329_043409.log
@@ -0,0 +1,11 @@
2025-03-29 04:34:09,007 - INFO - Logging initialized. Log file: logs/download_20250329_043409.log
2025-03-29 04:34:09,008 - INFO - PDF Downloader starting with arguments: {'json': 'scraping/json_sitemaps/kallipos_pdf.json', 'type': 'pdf', 'req': 'get', 'output': 'downloads/kallipos', 'little_potato': 'downloads/kallipos', 'batch': 2, 'sleep': 5.0, 'timeout': 60, 'max_retries': 3, 'concurrent': 1, 'log_dir': 'logs', 'kallipos_mode': True}
2025-03-29 04:34:09,015 - INFO - Loaded 5169 items from scraping/json_sitemaps/kallipos_pdf.json
2025-03-29 04:34:09,015 - INFO - 🚀 KALLIPOS MODE ENABLED: Using optimized settings for Kallipos repository
2025-03-29 04:34:09,015 - INFO - Using 1 concurrent downloads, 5.0s delay, 8 max retries
2025-03-29 04:34:09,015 - INFO - Loaded progress report with 224 entries
2025-03-29 04:34:09,016 - INFO - Total files: 5169, Already processed: 224, Remaining: 4945
2025-03-29 04:34:09,016 - INFO - Scheduled 2/2 downloads
2025-03-29 04:34:15,629 - ERROR - Server error 500 for https://repository.kallipos.gr/retrieve/257938a8-2fba-4151-8100-5c0342d8ff71/295-TRIANTAFYLLOU-Information-Retrieval-and-Search-Techniques.pdf (common with Kallipos repository)
2025-03-29 04:34:15,629 - INFO - Adding 10.5s extra delay for Kallipos 500 error
2025-03-29 04:34:26,129 - INFO - Retry 1/8 - waiting 14.7s
11 changes: 11 additions & 0 deletions logs/download_20250329_043614.log
@@ -0,0 +1,11 @@
2025-03-29 04:36:14,476 - INFO - Logging initialized. Log file: logs/download_20250329_043614.log
2025-03-29 04:36:14,477 - INFO - PDF Downloader starting with arguments: {'json': 'scraping/json_sitemaps/kallipos_pdf.json', 'type': 'pdf', 'req': 'get', 'output': 'downloads/kallipos', 'little_potato': 'downloads/kallipos', 'batch': 5, 'sleep': 8.0, 'timeout': 120, 'max_retries': 5, 'concurrent': 1, 'log_dir': 'logs', 'kallipos_mode': True}
2025-03-29 04:36:14,484 - INFO - Loaded 5169 items from scraping/json_sitemaps/kallipos_pdf.json
2025-03-29 04:36:14,484 - INFO - 🚀 KALLIPOS MODE ENABLED: Using optimized settings for Kallipos repository
2025-03-29 04:36:14,484 - INFO - Using 1 concurrent downloads, 8.0s delay, 5 max retries
2025-03-29 04:36:14,484 - INFO - Loaded progress report with 224 entries
2025-03-29 04:36:14,485 - INFO - Total files: 5169, Already processed: 224, Remaining: 4945
2025-03-29 04:36:14,485 - INFO - Scheduled 5/5 downloads
2025-03-29 04:36:23,027 - ERROR - Server error 500 for https://repository.kallipos.gr/retrieve/257938a8-2fba-4151-8100-5c0342d8ff71/295-TRIANTAFYLLOU-Information-Retrieval-and-Search-Techniques.pdf (common with Kallipos repository)
2025-03-29 04:36:23,027 - INFO - Adding 18.2s extra delay for Kallipos 500 error
2025-03-29 04:36:41,210 - INFO - Retry 1/5 - waiting 16.1s
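Each run reports `Total files: 5169, Already processed: 224, Remaining: 4945`, so the downloader resumes from a progress report instead of starting over. The report's on-disk format is not visible in this diff. The sketch below assumes a JSON object keyed by URL and items that carry a `url` field; `load_json` and `remaining_items` are hypothetical names used only for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_json(path: str, default: Any) -> Any:
    """Load a JSON file, returning `default` if the file does not exist yet."""
    p = Path(path)
    if not p.exists():
        return default
    with p.open(encoding="utf-8") as fh:
        return json.load(fh)


def remaining_items(items: List[Dict[str, Any]],
                    progress: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the items whose URL is not yet recorded in the progress report.

    Assumption: `progress` maps URL -> status and each item has a "url" key.
    """
    done = set(progress)
    return [item for item in items if item["url"] not in done]


if __name__ == "__main__":
    # Synthetic data sized like the log: 5169 entries, 224 already processed.
    items = [{"url": f"https://example.invalid/{i}.pdf"} for i in range(5169)]
    progress = {item["url"]: "ok" for item in items[:224]}
    todo = remaining_items(items, progress)
    print(f"Total files: {len(items)}, Already processed: {len(progress)}, "
          f"Remaining: {len(todo)}")
```

Filtering by URL rather than subtracting counts keeps the remaining total correct even if the input list is reordered or truncated between runs.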
30 changes: 30 additions & 0 deletions logs/download_20250329_050054.log
@@ -0,0 +1,30 @@
2025-03-29 05:00:54,822 - INFO - Logging initialized. Log file: /home/alexa/development/glossAPI/logs/download_20250329_050054.log
2025-03-29 05:00:54,822 - INFO - PDF Downloader starting with arguments: {'json': '/home/alexa/development/glossAPI/scraping/json_sitemaps/kallipos_pdf.json', 'type': 'pdf', 'req': 'get', 'output': '/home/alexa/development/glossAPI/downloads/kallipos', 'little_potato': '/home/alexa/development/glossAPI/downloads/kallipos', 'batch': 5, 'sleep': 15.0, 'timeout': 180, 'max_retries': 10, 'concurrent': 1, 'log_dir': '/home/alexa/development/glossAPI/logs', 'kallipos_mode': True, 'start_at': 224, 'skip_every': 0, 'randomize': True}
2025-03-29 05:00:54,830 - INFO - Loaded 5169 items from /home/alexa/development/glossAPI/scraping/json_sitemaps/kallipos_pdf.json
2025-03-29 05:00:54,831 - INFO - Starting at position 224, skipped 224 items
2025-03-29 05:00:54,831 - INFO - Randomizing download order to avoid server detection patterns
2025-03-29 05:00:54,832 - INFO - Final processing list contains 4945 items
2025-03-29 05:00:54,833 - INFO - 🚀 KALLIPOS MODE ENABLED: Using optimized settings for Kallipos repository
2025-03-29 05:00:54,833 - INFO - Using 1 concurrent downloads, 15.0s delay, 10 max retries
2025-03-29 05:00:54,833 - INFO - Loaded progress report with 224 entries
2025-03-29 05:00:54,833 - INFO - Total files: 4945, Already processed: 224, Remaining: 4721
2025-03-29 05:00:54,834 - INFO - Scheduled 5/5 downloads
2025-03-29 05:00:54,834 - DEBUG - Waiting 23.2s before next request
2025-03-29 05:01:19,037 - WARNING - Response claimed to be 200 OK but content doesn't appear to be a PDF (type: text/html;charset=utf-8, size: 56125)
2025-03-29 05:01:19,038 - ERROR - Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f10b2226a10>
2025-03-29 05:01:19,038 - DEBUG - Waiting 22.6s before next request
2025-03-29 05:01:42,805 - WARNING - Response claimed to be 200 OK but content doesn't appear to be a PDF (type: text/html;charset=utf-8, size: 33790)
2025-03-29 05:01:42,806 - ERROR - Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f10b22265c0>
2025-03-29 05:01:42,806 - DEBUG - Waiting 22.0s before next request
2025-03-29 05:02:08,883 - WARNING - Response claimed to be 200 OK but content doesn't appear to be a PDF (type: text/html;charset=utf-8, size: 52253)
2025-03-29 05:02:08,883 - ERROR - Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f10b22245b0>
2025-03-29 05:02:08,884 - DEBUG - Waiting 23.0s before next request
2025-03-29 05:02:33,190 - WARNING - Response claimed to be 200 OK but content doesn't appear to be a PDF (type: text/html;charset=utf-8, size: 30537)
2025-03-29 05:02:33,190 - ERROR - Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f10b2225120>
2025-03-29 05:02:33,191 - DEBUG - Waiting 22.0s before next request
2025-03-29 05:02:56,761 - ERROR - Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f10b22265c0>
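The warnings in this log show responses that return 200 OK but carry `text/html` instead of a PDF. How the downloader detects this is not shown in the diff. A minimal sketch of one way to check, assuming a hypothetical helper `looks_like_pdf` that inspects the declared media type and the `%PDF-` signature that PDF files start with:

```python
def looks_like_pdf(body: bytes, content_type: str = "") -> bool:
    """Heuristically decide whether a response body is a PDF.

    Hypothetical helper, not the PR's code. `content_type` is the raw
    Content-Type header value, e.g. "text/html;charset=utf-8".
    """
    # Strip parameters such as ";charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/"):
        # An HTML or plain-text page served in place of the file.
        return False
    # PDF files begin with "%PDF-"; search a short prefix to tolerate
    # a few leading junk bytes.
    return b"%PDF-" in body[:1024]


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n", "application/pdf"))
    print(looks_like_pdf(b"<!DOCTYPE html>", "text/html;charset=utf-8"))
```

Checking the body signature as well as the header catches servers that label an error page as `application/pdf`. The `Unclosed client session` errors are a separate issue: aiohttp emits that message when a `ClientSession` is discarded without being closed, which wrapping the session in `async with aiohttp.ClientSession() as session:` avoids.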