Conversation

@therealnb
Contributor

@therealnb therealnb commented Dec 5, 2025

This PR adds an example IPython notebook demonstrating how to download and analyze log files from the staging-eks-otel-logging S3 bucket. It's just an example; I'm not sure we even want to check it into this repo, but I wanted to try it. I think the logs path might be too granular: if we accumulate lots of these files, downloading them all will take a while. We could condense them into Parquet files.

What's included

  • Notebook: scripts/download_s3_logs.ipynb - A complete example showing:

    • How to list all JSON files in the S3 bucket
    • How to download and parse OpenTelemetry log records
    • How to extract individual logRecords from nested structures
    • How to extract body and attributes from log records
    • How to create a pandas DataFrame for analysis
  • Dependencies: Added boto3, pandas, and ipykernel to dev dependencies
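The listing and parsing steps above can be sketched roughly as follows. The helper names (`list_json_keys`, `iter_log_records`) are mine, not the notebook's, and the nesting walked here (resourceLogs -> scopeLogs -> logRecords) is the standard OTLP JSON layout the notebook appears to traverse:

```python
def list_json_keys(bucket: str, prefix: str = "logs/") -> list[str]:
    """List every .json object key under the prefix, paginating as needed."""
    import boto3  # imported here so the pure helpers below work without AWS

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                keys.append(obj["Key"])
    return keys


def iter_log_records(payload: dict):
    """Walk the OTLP JSON nesting: resourceLogs -> scopeLogs -> logRecords."""
    for resource_log in payload.get("resourceLogs", []):
        for scope_log in resource_log.get("scopeLogs", []):
            yield from scope_log.get("logRecords", [])
```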

Features

The notebook handles the partitioned S3 structure (logs/year=2025/month=12/day=04/hour=15/minute=21/) and automatically extracts partition information as separate columns for easier filtering and analysis.
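Extracting those partition values from a key like the example above might look like this sketch; the regex and function name are mine, not the notebook's actual code:

```python
import re

# Matches Hive-style partition segments such as
# logs/year=2025/month=12/day=04/hour=15/minute=21/...
PARTITION_RE = re.compile(
    r"year=(?P<year>\d+)/month=(?P<month>\d+)/day=(?P<day>\d+)/"
    r"hour=(?P<hour>\d+)/minute=(?P<minute>\d+)"
)


def partition_fields(key: str) -> dict[str, int]:
    """Extract partition values from an S3 key, or {} if none are present."""
    m = PARTITION_RE.search(key)
    return {k: int(v) for k, v in m.groupdict().items()} if m else {}
```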

Usage

  1. Install dependencies: uv sync --group dev
  2. Configure AWS credentials (environment variables, AWS config file, or IAM role)
  3. Open the notebook and run the cells sequentially

This is intended as an example/reference for anyone who needs to download and analyze logs from the S3 bucket.

@claude

claude bot commented Dec 5, 2025

PR Review: Add example notebook for downloading S3 logs

Summary

This PR adds a Jupyter notebook for downloading and analyzing S3 logs. While functional, there are several improvements needed before merging.

Issues Found

🔴 Critical: Repository Scope Mismatch

This notebook analyzes logs from staging-eks-otel-logging S3 bucket, which appears unrelated to Stacklok documentation. This repository is specifically for "semantic search over Stacklok documentation" per the project overview. The notebook deals with OpenTelemetry logs from what appears to be an internal service, not documentation processing.

Action Required: Please clarify why this belongs in the toolhive-doc-mcp repository or move it to the appropriate repo.

🔴 Hardcoded Staging Resources

  • Line 52: BUCKET_NAME = 'staging-eks-otel-logging' - Hardcoded staging bucket name
  • Consider making this configurable via environment variables or notebook parameters
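A minimal way to do that, assuming an environment variable name of my own choosing (`S3_LOG_BUCKET`), falling back to the current staging default:

```python
import os

# Read the bucket from the environment so the notebook is not tied to staging;
# S3_LOG_BUCKET is a suggested name, not an existing project convention.
BUCKET_NAME = os.environ.get("S3_LOG_BUCKET", "staging-eks-otel-logging")
```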

🟡 Code Quality Issues

Inefficient pandas operations (lines 192-193):

df['log_body'] = df.apply(lambda row: extract_body(row.to_dict()), axis=1)
attributes_list = df.apply(lambda row: extract_attributes(row.to_dict()), axis=1)
  • Converting entire rows to dicts is inefficient
  • Should pass the record dict directly or use vectorized operations where possible
  • Each row.to_dict() creates unnecessary copies
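One way to avoid the per-row `to_dict()` copies: if the DataFrame is built from a list of record dicts, derive the new columns from those dicts directly. The extractors below are simplified stand-ins for the notebook's `extract_body`/`extract_attributes`, and the sample record is illustrative:

```python
import pandas as pd


def extract_body(record: dict) -> str:
    """Pull the string body out of an OTLP-style log record (stand-in)."""
    return record.get("body", {}).get("stringValue", "")


def extract_attributes(record: dict) -> dict:
    """Flatten the key/value attribute list into a plain dict (stand-in)."""
    return {
        a["key"]: a.get("value", {}).get("stringValue")
        for a in record.get("attributes", [])
    }


records = [
    {
        "body": {"stringValue": "started"},
        "attributes": [{"key": "k8s.pod", "value": {"stringValue": "web-1"}}],
    },
]
df = pd.DataFrame(records)
# Build the columns from the source dicts once, instead of
# df.apply(lambda row: ...(row.to_dict()), axis=1).
df["log_body"] = [extract_body(r) for r in records]
attributes_list = [extract_attributes(r) for r in records]
```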

Unused imports (line 42):

from pathlib import Path  # Not used
import sys  # Not used
from collections.abc import Iterator  # Not used

Line length violations:
Multiple lines exceed the 100-character limit set in pyproject.toml (line-length = 100), e.g., line 49

🟡 Documentation & Usability

Missing information:

  • No explanation of what the log data represents
  • No example use cases or analysis patterns
  • The notebook outputs are committed (47 rows of sample data) - consider clearing them before committing

Security consideration:

  • AWS credentials management is mentioned but no guidance on IAM permissions needed
  • Consider documenting minimum required S3 permissions
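A least-privilege policy for the notebook's reads might look like the sketch below. This is unvetted; the bucket ARN assumes the staging bucket named in this PR:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::staging-eks-otel-logging"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::staging-eks-otel-logging/*"
    }
  ]
}
```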

Suggestions

  1. Configuration: Move bucket name to config or env var
  2. Optimize pandas: Avoid row.to_dict() in apply() calls
  3. Clean imports: Remove unused imports
  4. Line length: Fix lines exceeding 100 chars
  5. Clear outputs: Run "Clear All Outputs" before committing
  6. Documentation: Add section explaining what this data is used for and example analyses

Verdict

Request Changes - Primary concern is repository scope alignment. Please clarify the purpose and relevance to this project.

Review completed in <2 min

@therealnb therealnb requested a review from ptelang December 5, 2025 10:27
- Remove unused imports
- Fix f-string formatting
- Add noqa comments for complexity
- Fix line length issues
