News Aggregator

1.Notes

This project allowed me to experiment different strategies for parallelism. However, the large dataset gave me some trouble when trying to debug my code, as I had to constantly reduce the dataset size to test my functions or look for the cause of problems in thousands of files. Yet it was needed to actually test scalability and performance.

2. Parallel Processing Strategy

The goal was to parallelize as much of the code as possible. The processing pipeline consists of multiple stages with synchronization barriers between them.

2.1 Initialization Stage

The DataBaseInit class reads small configuration files sequentially:

Categories
Languages
Linking words
List of article files to process

Worker threads are then created and initialized.

2.2 File Reading Stage

Each thread is statically assigned a subset of files based on its thread index.

Sequential Database Structure:

public class SequentialDataBase {
private final Set<NewsArticle> articleSet;
// category mappings
private final Map<String, List<String>> categoryToArticle;

// language mappings
private final Map<String, List<String>> languageToArticle;

// statistics
private final Map<String, Integer> keyWordsOccurrences;
private final Map<String, Integer> authorOccurrences;

// most recent article
private NewsArticle mostRecentArticle;
}

Reading:

Each thread reads its assigned files using Jackson JSON parser Articles are added to the shared a local db Duplicate tracking: titles and UUIDs are mapped to occurrence counts

Synchronization: Threads wait at a barrier until all files are read, as local dbs are thread-safe. The barrier is need to ensure all threads complete reading before next stage.

2.3 Duplicate Removal Stage

The master thread reduces all duplicate counts from local dbs into global maps for titles and UUIDs.

Synchronization: Threads wait at a barrier to ensure duplicates are removed.

2.4 Processing Stage

Each thread will process its local database to build mappings and remove duplicates.

Processing:

Each threads removes duplicates from its local db based on global counts Each thread processes its article subset (files previously read) to build: Categorizes articles by category and language Counts keyword and author occurrences Tracks most recent article Sorts category and language lists within its subset as well as the article set

2.5 Data Merge Stage

Partial databases are merged into the Main Database statically(each thread merges a subset of tasks).

Merging:

public List<MergeFunction> getMergeOperations() {
    return List.of(
        this::mergeCategories,
        this::mergeAuthor,
        this::mergeMostRecentArticle,
        this::mergeLanguages,
        this::mergeKeyWords
        );
}

Category and Language Merging: Merges sorted lists using k-way merge algorithm Keyword and Author Merging: Merges occurrence counts by summing values for each key Most Recent Article: Compares and updates the most recent article

Synchronization: Threads wait at a barrier to ensure merging is complete.

2.6 Writing Stage

Embarrassingly Parallel:

Category files: Each thread writes a subset of category files
Language files: Each thread writes a subset of language files

Partial Parallelization:

all_articles.txt and keywords_count.txt cannot be fully parallelized
Each thread writes its own partial file
articles are already in sorted order (due to ConcurrentSkipListSet)
keywords are sorted using k-way merge algorithm during previous stage
files prefixed with thread index (e.g., 0_all_articles.txt) are created with partial results Synchronization: Threads wait at a barrier to ensure all writing is complete.

2.7 File Merge Stage

The master thread performs final operations:

Concatenates partial i_all_articles.txt files
Concatenates partial i_keywords_count.txt files
Writes final reports

Results

Test Environment

    Component Specifications
    CPU AMD Ryzen 7 4700U with Radeon Graphics
    Total CPU(s) 8
    Physical Cores 8
    Threads per Core 1
    Virtualization Support AMD-V
    Hypervisor Vendor Microsoft (WSL2 environment)
    Total RAM 3.6 GiB
    Free RAM ~2.6 GiB
    Operating System Ubuntu 24.04.2 LTS
    Java Version Java™ SE 25.0.1 LTS (build 25.0.1+8-LTS-27)

Test Dataset

The datatset used was TEST_4_DATASET(test_4/articles.txt and test_4/input.txt): 11031 article files.

Execution Time and Speedup

Observations

The performance increases rapidly from 1 - 4 threads, after which the improvement starts to slow down
Beyond 6 threads, execution time only improves slightly—indicating that parallelism starts reaching its limit
This plateau may be caused by heavy file I/O, the overhead from inserting the initial data in the db, as well as having more threads than merge operations in the merging stage, therefore, some threads remain idle, while others are busy.
The best overall execution time is achieved with 8 threads , but the most efficient balance between speed and overhead occurs at 4 threads on this system.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
graphs		graphs
input		input
src/main/java		src/main/java
.DS_Store		.DS_Store
Makefile		Makefile
README.pdf		README.pdf
pom.xml		pom.xml
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

News Aggregator

1.Notes

2. Parallel Processing Strategy

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

News Aggregator

1.Notes

2. Parallel Processing Strategy

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages