Conversation

@therealnb
Contributor

@therealnb therealnb commented Dec 5, 2025

This PR adds an example IPython notebook demonstrating how to download and analyze log files from the staging-eks-otel-logging S3 bucket. It's just an example; I'm not sure we even want to check it into this repo, but I wanted to try it. I think the logs path might be too granular: if we accumulate lots of these files, downloading them all will take a while. We could condense them into Parquet files.

What's included

  • Notebook: scripts/download_s3_logs.ipynb - A complete example showing:

    • How to list all JSON files in the S3 bucket
    • How to download and parse OpenTelemetry log records
    • How to extract individual logRecords from nested structures
    • How to extract body and attributes from log records
    • How to create a pandas DataFrame for analysis
  • Dependencies: Added boto3, pandas, and ipykernel to dev dependencies
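The listing and parsing steps above can be sketched roughly as follows. The helper names (`list_json_keys`, `iter_log_records`) are mine, not the notebook's, and the nesting walked here (resourceLogs -> scopeLogs -> logRecords) is the standard OTLP JSON layout the notebook appears to traverse:

```python
def list_json_keys(bucket: str, prefix: str = "logs/") -> list[str]:
    """List every .json object key under the prefix, paginating as needed."""
    import boto3  # imported here so the pure helpers below work without AWS

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                keys.append(obj["Key"])
    return keys


def iter_log_records(payload: dict):
    """Walk the OTLP JSON nesting: resourceLogs -> scopeLogs -> logRecords."""
    for resource_log in payload.get("resourceLogs", []):
        for scope_log in resource_log.get("scopeLogs", []):
            yield from scope_log.get("logRecords", [])
```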

Features

The notebook handles the partitioned S3 structure (logs/year=2025/month=12/day=04/hour=15/minute=21/) and automatically extracts partition information as separate columns for easier filtering and analysis.
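Extracting those partition values from a key like the example above might look like this sketch; the regex and function name are mine, not the notebook's actual code:

```python
import re

# Matches Hive-style partition segments such as
# logs/year=2025/month=12/day=04/hour=15/minute=21/...
PARTITION_RE = re.compile(
    r"year=(?P<year>\d+)/month=(?P<month>\d+)/day=(?P<day>\d+)/"
    r"hour=(?P<hour>\d+)/minute=(?P<minute>\d+)"
)


def partition_fields(key: str) -> dict[str, int]:
    """Extract partition values from an S3 key, or {} if none are present."""
    m = PARTITION_RE.search(key)
    return {k: int(v) for k, v in m.groupdict().items()} if m else {}
```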

Usage

  1. Install dependencies: uv sync --group dev
  2. Configure AWS credentials (environment variables, AWS config file, or IAM role)
  3. Open the notebook and run the cells sequentially

This is intended as an example/reference for anyone who needs to download and analyze logs from the S3 bucket.

@claude

claude bot commented Dec 5, 2025

PR Review: Add example notebook for downloading S3 logs

Summary

This PR adds a Jupyter notebook for downloading and analyzing S3 logs. While functional, there are several improvements needed before merging.

Issues Found

🔴 Critical: Repository Scope Mismatch

This notebook analyzes logs from staging-eks-otel-logging S3 bucket, which appears unrelated to Stacklok documentation. This repository is specifically for "semantic search over Stacklok documentation" per the project overview. The notebook deals with OpenTelemetry logs from what appears to be an internal service, not documentation processing.

Action Required: Please clarify why this belongs in the toolhive-doc-mcp repository or move it to the appropriate repo.

🔴 Hardcoded Staging Resources

  • Line 52: BUCKET_NAME = 'staging-eks-otel-logging' - Hardcoded staging bucket name
  • Consider making this configurable via environment variables or notebook parameters
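A minimal way to do that, assuming an environment variable name of my own choosing (`S3_LOG_BUCKET`), falling back to the current staging default:

```python
import os

# Read the bucket from the environment so the notebook is not tied to staging;
# S3_LOG_BUCKET is a suggested name, not an existing project convention.
BUCKET_NAME = os.environ.get("S3_LOG_BUCKET", "staging-eks-otel-logging")
```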

🟡 Code Quality Issues

Inefficient pandas operations (lines 192-193):

df['log_body'] = df.apply(lambda row: extract_body(row.to_dict()), axis=1)
attributes_list = df.apply(lambda row: extract_attributes(row.to_dict()), axis=1)
  • Converting entire rows to dicts is inefficient
  • Should pass the record dict directly or use vectorized operations where possible
  • Each row.to_dict() creates unnecessary copies
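One way to avoid the per-row `to_dict()` copies: if the DataFrame is built from a list of record dicts, derive the new columns from those dicts directly. The extractors below are simplified stand-ins for the notebook's `extract_body`/`extract_attributes`, and the sample record is illustrative:

```python
import pandas as pd


def extract_body(record: dict) -> str:
    """Pull the string body out of an OTLP-style log record (stand-in)."""
    return record.get("body", {}).get("stringValue", "")


def extract_attributes(record: dict) -> dict:
    """Flatten the key/value attribute list into a plain dict (stand-in)."""
    return {
        a["key"]: a.get("value", {}).get("stringValue")
        for a in record.get("attributes", [])
    }


records = [
    {
        "body": {"stringValue": "started"},
        "attributes": [{"key": "k8s.pod", "value": {"stringValue": "web-1"}}],
    },
]
df = pd.DataFrame(records)
# Build the columns from the source dicts once, instead of
# df.apply(lambda row: ...(row.to_dict()), axis=1).
df["log_body"] = [extract_body(r) for r in records]
attributes_list = [extract_attributes(r) for r in records]
```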

Unused imports (line 42):

from pathlib import Path  # Not used
import sys  # Not used
from collections.abc import Iterator  # Not used

Line length violations:
Multiple lines exceed the 100-character limit set in pyproject.toml (line-length = 100), e.g., line 49

🟡 Documentation & Usability

Missing information:

  • No explanation of what the log data represents
  • No example use cases or analysis patterns
  • The notebook outputs are committed (47 rows of sample data) - consider clearing them before committing

Security consideration:

  • AWS credentials management is mentioned but no guidance on IAM permissions needed
  • Consider documenting minimum required S3 permissions
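A least-privilege policy for the notebook's reads might look like the sketch below. This is unvetted; the bucket ARN assumes the staging bucket named in this PR:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::staging-eks-otel-logging"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::staging-eks-otel-logging/*"
    }
  ]
}
```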

Suggestions

  1. Configuration: Move bucket name to config or env var
  2. Optimize pandas: Avoid row.to_dict() in apply() calls
  3. Clean imports: Remove unused imports
  4. Line length: Fix lines exceeding 100 chars
  5. Clear outputs: Run "Clear All Outputs" before committing
  6. Documentation: Add section explaining what this data is used for and example analyses

Verdict

Request Changes - Primary concern is repository scope alignment. Please clarify the purpose and relevance to this project.

Review completed in <2 min

@therealnb therealnb requested a review from ptelang December 5, 2025 10:27
- Remove unused imports
- Fix f-string formatting
- Add noqa comments for complexity
- Fix line length issues
