diff --git a/code_puppy/bundled_skills/.gitkeep b/code_puppy/bundled_skills/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/code_puppy/bundled_skills/Data/data-context-extractor/SKILL.md b/code_puppy/bundled_skills/Data/data-context-extractor/SKILL.md new file mode 100644 index 00000000..1e065695 --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-context-extractor/SKILL.md @@ -0,0 +1,227 @@ +--- +name: data-context-extractor +description: > + Generate or improve a company-specific data analysis skill by extracting tribal knowledge from analysts. + + BOOTSTRAP MODE - Triggers: "Create a data context skill", "Set up data analysis for our warehouse", + "Help me create a skill for our database", "Generate a data skill for [company]" + → Discovers schemas, asks key questions, generates initial skill with reference files + + ITERATION MODE - Triggers: "Add context about [domain]", "The skill needs more info about [topic]", + "Update the data skill with [metrics/tables/terminology]", "Improve the [domain] reference" + → Loads existing skill, asks targeted questions, appends/updates reference files + + Use when data analysts want Claude to understand their company's specific data warehouse, + terminology, metrics definitions, and common query patterns. +--- + +# Data Context Extractor + +A meta-skill that extracts company-specific data knowledge from analysts and generates tailored data analysis skills. + +## How It Works + +This skill has two modes: + +1. **Bootstrap Mode**: Create a new data analysis skill from scratch +2. **Iteration Mode**: Improve an existing skill by adding domain-specific reference files + +--- + +## Bootstrap Mode + +Use when: User wants to create a new data context skill for their warehouse. + +### Phase 1: Database Connection & Discovery + +**Step 1: Identify the database type** + +Ask: "What data warehouse are you using?" + +Common options: +- **BigQuery** +- **Snowflake** +- **PostgreSQL/Redshift** +- **Databricks** + +Use `~~data warehouse` tools (query and schema) to connect. If unclear, check available MCP tools in the current session. + +**Step 2: Explore the schema** + +Use `~~data warehouse` schema tools to: +1. List available datasets/schemas +2. Identify the most important tables (ask user: "Which 3-5 tables do analysts query most often?") +3. Pull schema details for those key tables + +Sample exploration queries by dialect: +```sql +-- BigQuery: List datasets +SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA + +-- BigQuery: List tables in a dataset +SELECT table_name FROM `project.dataset.INFORMATION_SCHEMA.TABLES` + +-- Snowflake: List schemas +SHOW SCHEMAS IN DATABASE my_database + +-- Snowflake: List tables +SHOW TABLES IN SCHEMA my_schema +``` + +### Phase 2: Core Questions (Ask These) + +After schema discovery, ask these questions conversationally (not all at once): + +**Entity Disambiguation (Critical)** +> "When people here say 'user' or 'customer', what exactly do they mean? Are there different types?" + +Listen for: +- Multiple entity types (user vs account vs organization) +- Relationships between them (1:1, 1:many, many:many) +- Which ID fields link them together + +**Primary Identifiers** +> "What's the main identifier for a [customer/user/account]? Are there multiple IDs for the same entity?" + +Listen for: +- Primary keys vs business keys +- UUID vs integer IDs +- Legacy ID systems + +**Key Metrics** +> "What are the 2-3 metrics people ask about most? How is each one calculated?" 
+ +Listen for: +- Exact formulas (ARR = monthly_revenue × 12) +- Which tables/columns feed each metric +- Time period conventions (trailing 7 days, calendar month, etc.) + +**Data Hygiene** +> "What should ALWAYS be filtered out of queries? (test data, fraud, internal users, etc.)" + +Listen for: +- Standard WHERE clauses to always include +- Flag columns that indicate exclusions (is_test, is_internal, is_fraud) +- Specific values to exclude (status = 'deleted') + +**Common Gotchas** +> "What mistakes do new analysts typically make with this data?" + +Listen for: +- Confusing column names +- Timezone issues +- NULL handling quirks +- Historical vs current state tables + +### Phase 3: Generate the Skill + +Create a skill with this structure: + +``` +[company]-data-analyst/ +├── SKILL.md +└── references/ + ├── entities.md # Entity definitions and relationships + ├── metrics.md # KPI calculations + ├── tables/ # One file per domain + │ ├── [domain1].md + │ └── [domain2].md + └── dashboards.json # Optional: existing dashboards catalog +``` + +**SKILL.md Template**: See `references/skill-template.md` + +**SQL Dialect Section**: See `references/sql-dialects.md` and include the appropriate dialect notes. + +**Reference File Template**: See `references/domain-template.md` + +### Phase 4: Package and Deliver + +1. Create all files in the skill directory +2. Package as a zip file +3. Present to user with summary of what was captured + +--- + +## Iteration Mode + +Use when: User has an existing skill but needs to add more context. + +### Step 1: Load Existing Skill + +Ask user to upload their existing skill (zip or folder), or locate it if already in the session. + +Read the current SKILL.md and reference files to understand what's already documented. + +### Step 2: Identify the Gap + +Ask: "What domain or topic needs more context? What queries are failing or producing wrong results?" + +Common gaps: +- A new data domain (marketing, finance, product, etc.) +- Missing metric definitions +- Undocumented table relationships +- New terminology + +### Step 3: Targeted Discovery + +For the identified domain: + +1. **Explore relevant tables**: Use `~~data warehouse` schema tools to find tables in that domain +2. **Ask domain-specific questions**: + - "What tables are used for [domain] analysis?" + - "What are the key metrics for [domain]?" + - "Any special filters or gotchas for [domain] data?" + +3. **Generate new reference file**: Create `references/[domain].md` using the domain template + +### Step 4: Update and Repackage + +1. Add the new reference file +2. Update SKILL.md's "Knowledge Base Navigation" section to include the new domain +3. Repackage the skill +4. 
Present the updated skill to user + +--- + +## Reference File Standards + +Each reference file should include: + +### For Table Documentation +- **Location**: Full table path +- **Description**: What this table contains, when to use it +- **Primary Key**: How to uniquely identify rows +- **Update Frequency**: How often data refreshes +- **Key Columns**: Table with column name, type, description, notes +- **Relationships**: How this table joins to others +- **Sample Queries**: 2-3 common query patterns + +### For Metrics Documentation +- **Metric Name**: Human-readable name +- **Definition**: Plain English explanation +- **Formula**: Exact calculation with column references +- **Source Table(s)**: Where the data comes from +- **Caveats**: Edge cases, exclusions, gotchas + +### For Entity Documentation +- **Entity Name**: What it's called +- **Definition**: What it represents in the business +- **Primary Table**: Where to find this entity +- **ID Field(s)**: How to identify it +- **Relationships**: How it relates to other entities +- **Common Filters**: Standard exclusions (internal, test, etc.) + +--- + +## Quality Checklist + +Before delivering a generated skill, verify: + +- [ ] SKILL.md has complete frontmatter (name, description) +- [ ] Entity disambiguation section is clear +- [ ] Key terminology is defined +- [ ] Standard filters/exclusions are documented +- [ ] At least 2-3 sample queries per domain +- [ ] SQL uses correct dialect syntax +- [ ] Reference files are linked from SKILL.md navigation section diff --git a/code_puppy/bundled_skills/Data/data-context-extractor/references/domain-template.md b/code_puppy/bundled_skills/Data/data-context-extractor/references/domain-template.md new file mode 100644 index 00000000..3aa83f58 --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-context-extractor/references/domain-template.md @@ -0,0 +1,147 @@ +# Domain Reference File Template + +Use this template when creating reference files for specific data domains (e.g., revenue, users, marketing). + +--- + +```markdown +# [DOMAIN_NAME] Tables + +This document contains [domain]-related tables, metrics, and query patterns. + +--- + +## Quick Reference + +### Business Context + +[2-3 sentences explaining what this domain covers and key concepts] + +### Entity Clarification + +**"[AMBIGUOUS_TERM]" can mean:** +- **[MEANING_1]**: [DEFINITION] ([TABLE]: [ID_FIELD]) +- **[MEANING_2]**: [DEFINITION] ([TABLE]: [ID_FIELD]) + +Always clarify which one before querying. 
+ +### Standard Filters + +For [domain] queries, always: +```sql +WHERE [STANDARD_FILTER_1] + AND [STANDARD_FILTER_2] +``` + +--- + +## Key Tables + +### [TABLE_1_NAME] +**Location**: `[project.dataset.table]` or `[schema.table]` +**Description**: [What this table contains, when to use it] +**Primary Key**: [COLUMN(S)] +**Update Frequency**: [Daily/Hourly/Real-time] ([LAG] lag) +**Partitioned By**: [PARTITION_COLUMN] (if applicable) + +| Column | Type | Description | Notes | +|--------|------|-------------|-------| +| **[column_1]** | [TYPE] | [DESCRIPTION] | [GOTCHA_OR_CONTEXT] | +| **[column_2]** | [TYPE] | [DESCRIPTION] | | +| **[column_3]** | [TYPE] | [DESCRIPTION] | Nullable | + +**Relationships**: +- Joins to `[OTHER_TABLE]` on `[JOIN_KEY]` +- Parent of `[CHILD_TABLE]` via `[FOREIGN_KEY]` + +**Nested/Struct Fields** (if applicable): +- `[struct_name].[field_1]`: [DESCRIPTION] +- `[struct_name].[field_2]`: [DESCRIPTION] + +--- + +### [TABLE_2_NAME] +[REPEAT FORMAT] + +--- + +## Key Metrics + +| Metric | Definition | Table | Formula | Notes | +|--------|------------|-------|---------|-------| +| [METRIC_1] | [DEFINITION] | [TABLE] | `[FORMULA]` | [CAVEATS] | +| [METRIC_2] | [DEFINITION] | [TABLE] | `[FORMULA]` | | + +--- + +## Sample Queries + +### [QUERY_PURPOSE_1] +```sql +-- [Brief description of what this query does] +SELECT + [columns] +FROM [table] +WHERE [standard_filters] +GROUP BY [grouping] +ORDER BY [ordering] +``` + +### [QUERY_PURPOSE_2] +```sql +[ANOTHER_COMMON_QUERY] +``` + +### [QUERY_PURPOSE_3]: [More Complex Pattern] +```sql +WITH [cte_name] AS ( + [CTE_LOGIC] +) +SELECT + [final_columns] +FROM [cte_name] +[joins_and_filters] +``` + +--- + +## Common Gotchas + +1. **[GOTCHA_1]**: [EXPLANATION] + - Wrong: `[INCORRECT_APPROACH]` + - Right: `[CORRECT_APPROACH]` + +2. **[GOTCHA_2]**: [EXPLANATION] + +--- + +## Related Dashboards (if applicable) + +| Dashboard | Link | Use For | +|-----------|------|---------| +| [DASHBOARD_1] | [URL] | [DESCRIPTION] | +| [DASHBOARD_2] | [URL] | [DESCRIPTION] | +``` + +--- + +## Tips for Creating Domain Files + +1. **Start with the most-queried tables** - Don't try to document everything +2. **Include column-level detail only for important columns** - Skip obvious ones like `created_at` +3. **Real query examples > abstract descriptions** - Show don't tell +4. **Document the gotchas prominently** - These save the most time +5. **Keep sample queries runnable** - Use real table/column names +6. **Note nested/struct fields explicitly** - These trip people up + +## Suggested Domain Files + +Common domains to document (create separate files for each): + +- `revenue.md` - Billing, subscriptions, ARR, transactions +- `users.md` - Accounts, authentication, user attributes +- `product.md` - Feature usage, events, sessions +- `growth.md` - DAU/WAU/MAU, retention, activation +- `sales.md` - CRM, pipeline, opportunities +- `marketing.md` - Campaigns, attribution, leads +- `support.md` - Tickets, CSAT, response times diff --git a/code_puppy/bundled_skills/Data/data-context-extractor/references/example-output.md b/code_puppy/bundled_skills/Data/data-context-extractor/references/example-output.md new file mode 100644 index 00000000..5ed24baf --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-context-extractor/references/example-output.md @@ -0,0 +1,198 @@ +# Example: Generated Skill + +This is an example of what a generated skill looks like after the bootstrap process. 
This example is for a fictional e-commerce company called "ShopCo" using Snowflake. + +--- + +## Example SKILL.md + +```markdown +--- +name: shopco-data-analyst +description: "ShopCo data analysis skill for Snowflake. Provides context for querying e-commerce data including customer, order, and product analytics. Use when analyzing ShopCo data for: (1) Revenue and order metrics, (2) Customer behavior and retention, (3) Product performance, or any data questions requiring ShopCo-specific context." +--- + +# ShopCo Data Analysis + +## SQL Dialect: Snowflake + +- **Table references**: `SHOPCO_DW.SCHEMA.TABLE` or with quotes for case-sensitive: `"Column_Name"` +- **Safe division**: `DIV0(a, b)` returns 0, `DIV0NULL(a, b)` returns NULL +- **Date functions**: + - `DATE_TRUNC('MONTH', date_col)` + - `DATEADD(DAY, -1, date_col)` + - `DATEDIFF(DAY, start_date, end_date)` +- **Column exclusion**: `SELECT * EXCLUDE (column_to_exclude)` + +--- + +## Entity Disambiguation + +**"Customer" can mean:** +- **User**: A login account that can browse and save items (CORE.DIM_USERS: user_id) +- **Customer**: A user who has made at least one purchase (CORE.DIM_CUSTOMERS: customer_id) +- **Account**: A billing entity, can have multiple users in B2B (CORE.DIM_ACCOUNTS: account_id) + +**Relationships:** +- User → Customer: 1:1 (customer_id = user_id for purchasers) +- Account → User: 1:many (join on account_id) + +--- + +## Business Terminology + +| Term | Definition | Notes | +|------|------------|-------| +| GMV | Gross Merchandise Value - total order value before returns/discounts | Use for top-line reporting | +| NMV | Net Merchandise Value - GMV minus returns and discounts | Use for actual revenue | +| AOV | Average Order Value - NMV / order count | Exclude $0 orders | +| LTV | Lifetime Value - total NMV per customer since first order | Rolling calc, updates daily | +| CAC | Customer Acquisition Cost - marketing spend / new customers | By cohort month | + +--- + +## Standard Filters + +Always apply these filters unless explicitly told otherwise: + +```sql +-- Exclude test and internal orders +WHERE order_status != 'TEST' + AND customer_type != 'INTERNAL' + AND is_employee_order = FALSE + +-- Exclude cancelled orders for revenue metrics + AND order_status NOT IN ('CANCELLED', 'FRAUDULENT') +``` + +--- + +## Key Metrics + +### Gross Merchandise Value (GMV) +- **Definition**: Total value of all orders placed +- **Formula**: `SUM(order_total_gross)` +- **Source**: `CORE.FCT_ORDERS.order_total_gross` +- **Time grain**: Daily, aggregated to weekly/monthly +- **Caveats**: Includes orders that may later be cancelled or returned + +### Net Revenue +- **Definition**: Actual revenue after returns and discounts +- **Formula**: `SUM(order_total_gross - return_amount - discount_amount)` +- **Source**: `CORE.FCT_ORDERS` +- **Caveats**: Returns can occur up to 90 days post-order; use settled_revenue for finalized numbers + +--- + +## Knowledge Base Navigation + +| Domain | Reference File | Use For | +|--------|----------------|---------| +| Orders | `references/orders.md` | Order tables, GMV/NMV calculations | +| Customers | `references/customers.md` | User/customer entities, LTV, cohorts | +| Products | `references/products.md` | Catalog, inventory, categories | + +--- + +## Common Query Patterns + +### Daily GMV by Channel +```sql +SELECT + DATE_TRUNC('DAY', order_timestamp) AS order_date, + channel, + SUM(order_total_gross) AS gmv, + COUNT(DISTINCT order_id) AS order_count +FROM SHOPCO_DW.CORE.FCT_ORDERS +WHERE 
order_status NOT IN ('TEST', 'CANCELLED', 'FRAUDULENT') + AND order_timestamp >= DATEADD(DAY, -30, CURRENT_DATE()) +GROUP BY 1, 2 +ORDER BY 1 DESC, 3 DESC +``` + +### Customer Cohort Retention +```sql +WITH cohorts AS ( + SELECT + customer_id, + DATE_TRUNC('MONTH', first_order_date) AS cohort_month + FROM SHOPCO_DW.CORE.DIM_CUSTOMERS +) +SELECT + c.cohort_month, + DATEDIFF(MONTH, c.cohort_month, DATE_TRUNC('MONTH', o.order_timestamp)) AS months_since_first, + COUNT(DISTINCT c.customer_id) AS active_customers +FROM cohorts c +JOIN SHOPCO_DW.CORE.FCT_ORDERS o ON c.customer_id = o.customer_id +WHERE o.order_status NOT IN ('TEST', 'CANCELLED') +GROUP BY 1, 2 +ORDER BY 1, 2 +``` +``` + +--- + +## Example references/orders.md + +```markdown +# Orders Tables + +Order and transaction data for ShopCo. + +--- + +## Key Tables + +### FCT_ORDERS +**Location**: `SHOPCO_DW.CORE.FCT_ORDERS` +**Description**: Fact table of all orders. One row per order. +**Primary Key**: `order_id` +**Update Frequency**: Hourly (15 min lag) +**Partitioned By**: `order_date` + +| Column | Type | Description | Notes | +|--------|------|-------------|-------| +| **order_id** | VARCHAR | Unique order identifier | | +| **customer_id** | VARCHAR | FK to DIM_CUSTOMERS | NULL for guest checkout | +| **order_timestamp** | TIMESTAMP_NTZ | When order was placed | UTC | +| **order_date** | DATE | Date portion of order_timestamp | Partition column | +| **order_status** | VARCHAR | Current status | PENDING, SHIPPED, DELIVERED, CANCELLED, RETURNED | +| **channel** | VARCHAR | Acquisition channel | WEB, APP, MARKETPLACE | +| **order_total_gross** | DECIMAL(12,2) | Pre-discount total | | +| **discount_amount** | DECIMAL(12,2) | Total discounts applied | | +| **return_amount** | DECIMAL(12,2) | Value of returned items | Updates async | + +**Relationships**: +- Joins to `DIM_CUSTOMERS` on `customer_id` +- Parent of `FCT_ORDER_ITEMS` via `order_id` + +--- + +## Sample Queries + +### Orders with Returns Rate +```sql +SELECT + DATE_TRUNC('WEEK', order_date) AS week, + COUNT(*) AS total_orders, + SUM(CASE WHEN return_amount > 0 THEN 1 ELSE 0 END) AS orders_with_returns, + DIV0(SUM(CASE WHEN return_amount > 0 THEN 1 ELSE 0 END), COUNT(*)) AS return_rate +FROM SHOPCO_DW.CORE.FCT_ORDERS +WHERE order_status NOT IN ('TEST', 'CANCELLED') + AND order_date >= DATEADD(MONTH, -3, CURRENT_DATE()) +GROUP BY 1 +ORDER BY 1 +``` +``` + +--- + +This example demonstrates: +- Complete frontmatter with triggering description +- Dialect-specific SQL notes +- Clear entity disambiguation +- Terminology glossary +- Standard filters as copy-paste SQL +- Metric definitions with formulas +- Navigation to reference files +- Real, runnable query examples diff --git a/code_puppy/bundled_skills/Data/data-context-extractor/references/skill-template.md b/code_puppy/bundled_skills/Data/data-context-extractor/references/skill-template.md new file mode 100644 index 00000000..58449883 --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-context-extractor/references/skill-template.md @@ -0,0 +1,148 @@ +# Generated Skill Template + +Use this template when generating a new data analysis skill. Replace all `[PLACEHOLDER]` values. + +--- + +```markdown +--- +name: [company]-data-analyst +description: "[COMPANY] data analysis skill. Provides context for querying [WAREHOUSE_TYPE] including entity definitions, metric calculations, and common query patterns. 
Use when analyzing [COMPANY] data for: (1) [PRIMARY_USE_CASE_1], (2) [PRIMARY_USE_CASE_2], (3) [PRIMARY_USE_CASE_3], or any data questions requiring [COMPANY]-specific context." +--- + +# [COMPANY] Data Analysis + +## SQL Dialect: [WAREHOUSE_TYPE] + +[INSERT APPROPRIATE DIALECT SECTION FROM sql-dialects.md] + +--- + +## Entity Disambiguation + +When users mention these terms, clarify which entity they mean: + +[EXAMPLE FORMAT - customize based on discovery:] + +**"User" can mean:** +- **Account**: An individual login/profile ([PRIMARY_TABLE]: [ID_FIELD]) +- **Organization**: A billing entity that can have multiple accounts ([ORG_TABLE]: [ORG_ID]) +- **[OTHER_TYPE]**: [DEFINITION] ([TABLE]: [ID]) + +**Relationships:** +- [ENTITY_1] → [ENTITY_2]: [RELATIONSHIP_TYPE] (join on [JOIN_KEY]) + +--- + +## Business Terminology + +| Term | Definition | Notes | +|------|------------|-------| +| [TERM_1] | [DEFINITION] | [CONTEXT/GOTCHA] | +| [TERM_2] | [DEFINITION] | [CONTEXT/GOTCHA] | +| [ACRONYM] | [FULL_NAME] - [EXPLANATION] | | + +--- + +## Standard Filters + +Always apply these filters unless explicitly told otherwise: + +```sql +-- Exclude test/internal data +WHERE [TEST_FLAG_COLUMN] = FALSE + AND [INTERNAL_FLAG_COLUMN] = FALSE + +-- Exclude invalid/fraud + AND [STATUS_COLUMN] != '[EXCLUDED_STATUS]' + +-- [OTHER STANDARD EXCLUSIONS] +``` + +**When to override:** +- [SCENARIO_1]: Include [NORMALLY_EXCLUDED] when [CONDITION] + +--- + +## Key Metrics + +### [METRIC_1_NAME] +- **Definition**: [PLAIN_ENGLISH_EXPLANATION] +- **Formula**: `[EXACT_CALCULATION]` +- **Source**: `[TABLE_NAME].[COLUMN_NAME]` +- **Time grain**: [DAILY/WEEKLY/MONTHLY] +- **Caveats**: [EDGE_CASES_OR_GOTCHAS] + +### [METRIC_2_NAME] +[REPEAT FORMAT] + +--- + +## Data Freshness + +| Table | Update Frequency | Typical Lag | +|-------|------------------|-------------| +| [TABLE_1] | [FREQUENCY] | [LAG] | +| [TABLE_2] | [FREQUENCY] | [LAG] | + +To check data freshness: +```sql +SELECT MAX([DATE_COLUMN]) as latest_data FROM [TABLE] +``` + +--- + +## Knowledge Base Navigation + +Use these reference files for detailed table documentation: + +| Domain | Reference File | Use For | +|--------|----------------|---------| +| [DOMAIN_1] | `references/[domain1].md` | [BRIEF_DESCRIPTION] | +| [DOMAIN_2] | `references/[domain2].md` | [BRIEF_DESCRIPTION] | +| Entities | `references/entities.md` | Entity definitions and relationships | +| Metrics | `references/metrics.md` | KPI calculations and formulas | + +--- + +## Common Query Patterns + +### [PATTERN_1_NAME] +```sql +[SAMPLE_QUERY] +``` + +### [PATTERN_2_NAME] +```sql +[SAMPLE_QUERY] +``` + +--- + +## Troubleshooting + +### Common Mistakes +- **[MISTAKE_1]**: [EXPLANATION] → [CORRECT_APPROACH] +- **[MISTAKE_2]**: [EXPLANATION] → [CORRECT_APPROACH] + +### Access Issues +- If you encounter permission errors on `[TABLE]`: [WORKAROUND] +- For PII-restricted columns: [ALTERNATIVE_APPROACH] + +### Performance Tips +- Filter by `[PARTITION_COLUMN]` first to reduce data scanned +- For large tables, use `LIMIT` during exploration +- Prefer `[AGGREGATED_TABLE]` over `[RAW_TABLE]` when possible +``` + +--- + +## Customization Notes + +When generating a skill: + +1. **Fill all placeholders** - Don't leave any `[PLACEHOLDER]` text +2. **Remove unused sections** - If they don't have dashboards, remove that section +3. **Add specificity** - Generic advice is less useful than specific column names and values +4. **Include real examples** - Sample queries should use actual table/column names +5. 
**Keep it scannable** - Use tables and code blocks liberally diff --git a/code_puppy/bundled_skills/Data/data-context-extractor/references/sql-dialects.md b/code_puppy/bundled_skills/Data/data-context-extractor/references/sql-dialects.md new file mode 100644 index 00000000..8b513a6c --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-context-extractor/references/sql-dialects.md @@ -0,0 +1,121 @@ +# SQL Dialect Reference + +Include the appropriate section in generated skills based on the user's data warehouse. + +--- + +## BigQuery + +```markdown +## SQL Dialect: BigQuery + +- **Table references**: Use backticks: \`project.dataset.table\` +- **Safe division**: `SAFE_DIVIDE(a, b)` returns NULL instead of error +- **Date functions**: + - `DATE_TRUNC(date_col, MONTH)` + - `DATE_SUB(date_col, INTERVAL 1 DAY)` + - `DATE_DIFF(end_date, start_date, DAY)` +- **Column exclusion**: `SELECT * EXCEPT(column_to_exclude)` +- **Arrays**: `UNNEST(array_column)` to flatten +- **Structs**: Access with dot notation `struct_col.field_name` +- **Timestamps**: `TIMESTAMP_TRUNC()`, times in UTC by default +- **String matching**: `LIKE`, `REGEXP_CONTAINS(col, r'pattern')` +- **NULLs in aggregations**: Most functions ignore NULLs; use `IFNULL()` or `COALESCE()` +``` + +--- + +## Snowflake + +```markdown +## SQL Dialect: Snowflake + +- **Table references**: `DATABASE.SCHEMA.TABLE` or with quotes for case-sensitive: `"Column_Name"` +- **Safe division**: `DIV0(a, b)` returns 0, `DIV0NULL(a, b)` returns NULL +- **Date functions**: + - `DATE_TRUNC('MONTH', date_col)` + - `DATEADD(DAY, -1, date_col)` + - `DATEDIFF(DAY, start_date, end_date)` +- **Column exclusion**: `SELECT * EXCLUDE (column_to_exclude)` +- **Arrays**: `FLATTEN(array_column)` to flatten, access with `value` +- **Variants/JSON**: Access with colon notation `variant_col:field_name` +- **Timestamps**: `TIMESTAMP_NTZ` (no timezone), `TIMESTAMP_TZ` (with timezone) +- **String matching**: `LIKE`, `REGEXP_LIKE(col, 'pattern')` +- **Case sensitivity**: Identifiers are uppercase by default unless quoted +``` + +--- + +## PostgreSQL / Redshift + +```markdown +## SQL Dialect: PostgreSQL/Redshift + +- **Table references**: `schema.table` (lowercase convention) +- **Safe division**: `NULLIF(b, 0)` pattern: `a / NULLIF(b, 0)` +- **Date functions**: + - `DATE_TRUNC('month', date_col)` + - `date_col - INTERVAL '1 day'` + - `DATE_PART('day', end_date - start_date)` +- **Column selection**: No EXCEPT; must list columns explicitly +- **Arrays**: `UNNEST(array_column)` (PostgreSQL), limited in Redshift +- **JSON**: `json_col->>'field_name'` for text, `json_col->'field_name'` for JSON +- **Timestamps**: `AT TIME ZONE 'UTC'` for timezone conversion +- **String matching**: `LIKE`, `col ~ 'pattern'` for regex +- **Boolean**: Native BOOLEAN type; use `TRUE`/`FALSE` +``` + +--- + +## Databricks / Spark SQL + +```markdown +## SQL Dialect: Databricks/Spark SQL + +- **Table references**: `catalog.schema.table` (Unity Catalog) or `schema.table` +- **Safe division**: Use `NULLIF`: `a / NULLIF(b, 0)` or `TRY_DIVIDE(a, b)` +- **Date functions**: + - `DATE_TRUNC('MONTH', date_col)` + - `DATE_SUB(date_col, 1)` + - `DATEDIFF(end_date, start_date)` +- **Column exclusion**: `SELECT * EXCEPT (column_to_exclude)` (Databricks SQL) +- **Arrays**: `EXPLODE(array_column)` to flatten +- **Structs**: Access with dot notation `struct_col.field_name` +- **JSON**: `json_col:field_name` or `GET_JSON_OBJECT()` +- **String matching**: `LIKE`, `RLIKE` for regex +- **Delta features**: `DESCRIBE 
HISTORY`, time travel with `VERSION AS OF` +``` + +--- + +## MySQL + +```markdown +## SQL Dialect: MySQL + +- **Table references**: \`database\`.\`table\` with backticks +- **Safe division**: Manual: `IF(b = 0, NULL, a / b)` or `a / NULLIF(b, 0)` +- **Date functions**: + - `DATE_FORMAT(date_col, '%Y-%m-01')` for truncation + - `DATE_SUB(date_col, INTERVAL 1 DAY)` + - `DATEDIFF(end_date, start_date)` +- **Column selection**: No EXCEPT; must list columns explicitly +- **Arrays**: Limited native support; often stored as JSON +- **JSON**: `JSON_EXTRACT(col, '$.field')` or `col->>'$.field'` +- **Timestamps**: `CONVERT_TZ()` for timezone conversion +- **String matching**: `LIKE`, `REGEXP` for regex +- **Case sensitivity**: Table names case-sensitive on Linux, not on Windows +``` + +--- + +## Common Patterns Across Dialects + +| Operation | BigQuery | Snowflake | PostgreSQL | Databricks | +|-----------|----------|-----------|------------|------------| +| Current date | `CURRENT_DATE()` | `CURRENT_DATE()` | `CURRENT_DATE` | `CURRENT_DATE()` | +| Current timestamp | `CURRENT_TIMESTAMP()` | `CURRENT_TIMESTAMP()` | `NOW()` | `CURRENT_TIMESTAMP()` | +| String concat | `CONCAT()` or `\|\|` | `CONCAT()` or `\|\|` | `CONCAT()` or `\|\|` | `CONCAT()` or `\|\|` | +| Coalesce | `COALESCE()` | `COALESCE()` | `COALESCE()` | `COALESCE()` | +| Case when | `CASE WHEN` | `CASE WHEN` | `CASE WHEN` | `CASE WHEN` | +| Count distinct | `COUNT(DISTINCT x)` | `COUNT(DISTINCT x)` | `COUNT(DISTINCT x)` | `COUNT(DISTINCT x)` | diff --git a/code_puppy/bundled_skills/Data/data-context-extractor/scripts/package_data_skill.py b/code_puppy/bundled_skills/Data/data-context-extractor/scripts/package_data_skill.py new file mode 100644 index 00000000..d8ebff7e --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-context-extractor/scripts/package_data_skill.py @@ -0,0 +1,126 @@ +#!/usr/bin/env python3 +""" +Package a generated data analysis skill into a distributable .skill file (zip format). + +Usage: + python package_data_skill.py [output-directory] + +Example: + python package_data_skill.py /home/claude/acme-data-analyst + python package_data_skill.py /home/claude/acme-data-analyst /tmp/outputs +""" + +import sys +import zipfile +from pathlib import Path + + +def validate_skill(skill_path: Path) -> tuple[bool, str]: + """Basic validation of skill structure.""" + + # Check SKILL.md exists + skill_md = skill_path / "SKILL.md" + if not skill_md.exists(): + return False, "Missing SKILL.md" + + # Check SKILL.md has frontmatter + content = skill_md.read_text() + if not content.startswith("---"): + return False, "SKILL.md missing YAML frontmatter" + + # Check for required frontmatter fields + if "name:" not in content[:500]: + return False, "SKILL.md missing 'name' in frontmatter" + if "description:" not in content[:1000]: + return False, "SKILL.md missing 'description' in frontmatter" + + # Check for placeholder text that wasn't filled in + if "[PLACEHOLDER]" in content or "[COMPANY]" in content: + return False, "SKILL.md contains unfilled placeholder text" + + return True, "Validation passed" + + +def package_skill(skill_path: str, output_dir: str = None) -> Path | None: + """ + Package a skill folder into a .skill file. 
+ + Args: + skill_path: Path to the skill folder + output_dir: Optional output directory + + Returns: + Path to the created .skill file, or None if error + """ + skill_path = Path(skill_path).resolve() + + # Validate folder exists + if not skill_path.exists(): + print(f"Error: Skill folder not found: {skill_path}") + return None + + if not skill_path.is_dir(): + print(f"Error: Path is not a directory: {skill_path}") + return None + + # Run validation + print("Validating skill...") + valid, message = validate_skill(skill_path) + if not valid: + print(f"Validation failed: {message}") + return None + print(f"{message}\n") + + # Determine output location + skill_name = skill_path.name + if output_dir: + output_path = Path(output_dir).resolve() + else: + output_path = Path.cwd() + + output_path.mkdir(parents=True, exist_ok=True) + skill_filename = output_path / f"{skill_name}.zip" + + # Create the zip file + try: + with zipfile.ZipFile(skill_filename, "w", zipfile.ZIP_DEFLATED) as zipf: + for file_path in skill_path.rglob("*"): + if file_path.is_file(): + # Skip hidden files and common junk + if any(part.startswith(".") for part in file_path.parts): + continue + if file_path.name in ["__pycache__", ".DS_Store", "Thumbs.db"]: + continue + + # Calculate relative path within the zip + arcname = file_path.relative_to(skill_path.parent) + zipf.write(file_path, arcname) + print(f" Added: {arcname}") + + print(f"\nSuccessfully packaged skill to: {skill_filename}") + return skill_filename + + except Exception as e: + print(f"Error creating zip file: {e}") + return None + + +def main(): + if len(sys.argv) < 2: + print(__doc__) + sys.exit(1) + + skill_path = sys.argv[1] + output_dir = sys.argv[2] if len(sys.argv) > 2 else None + + print(f"Packaging skill: {skill_path}") + if output_dir: + print(f" Output directory: {output_dir}") + print() + + result = package_skill(skill_path, output_dir) + sys.exit(0 if result else 1) + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Data/data-exploration/SKILL.md b/code_puppy/bundled_skills/Data/data-exploration/SKILL.md new file mode 100644 index 00000000..cf4b3029 --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-exploration/SKILL.md @@ -0,0 +1,231 @@ +--- +name: data-exploration +description: Profile and explore datasets to understand their shape, quality, and patterns before analysis. Use when encountering a new dataset, assessing data quality, discovering column distributions, identifying nulls and outliers, or deciding which dimensions to analyze. +--- + +# Data Exploration Skill + +Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas. + +## Data Profiling Methodology + +### Phase 1: Structural Understanding + +Before analyzing any data, understand its structure: + +**Table-level questions:** +- How many rows and columns? +- What is the grain (one row per what)? +- What is the primary key? Is it unique? +- When was the data last updated? +- How far back does the data go? 
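+
+A minimal pandas sketch of these table-level checks, assuming the data fits in memory and using placeholder column names `id` and `created_at`:
+
+```python
+import pandas as pd
+
+def structural_summary(df: pd.DataFrame, key_col: str, ts_col: str) -> dict:
+    """Answer the table-level questions for one DataFrame."""
+    return {
+        "rows": len(df),
+        "columns": df.shape[1],
+        # Grain check: if the candidate key is unique, the grain is one row per key value.
+        "key_is_unique": bool(df[key_col].is_unique),
+        "duplicate_keys": int(df[key_col].duplicated().sum()),
+        # How fresh the data is and how far back it goes.
+        "earliest": df[ts_col].min(),
+        "latest": df[ts_col].max(),
+    }
+
+# Toy example:
+df = pd.DataFrame({
+    "id": [1, 2, 3, 3],
+    "created_at": pd.to_datetime(["2024-01-01", "2024-02-15", "2024-03-01", "2024-03-01"]),
+})
+print(structural_summary(df, key_col="id", ts_col="created_at"))
+```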
+ +**Column classification:** +Categorize each column as one of: +- **Identifier**: Unique keys, foreign keys, entity IDs +- **Dimension**: Categorical attributes for grouping/filtering (status, type, region, category) +- **Metric**: Quantitative values for measurement (revenue, count, duration, score) +- **Temporal**: Dates and timestamps (created_at, updated_at, event_date) +- **Text**: Free-form text fields (description, notes, name) +- **Boolean**: True/false flags +- **Structural**: JSON, arrays, nested structures + +### Phase 2: Column-Level Profiling + +For each column, compute: + +**All columns:** +- Null count and null rate +- Distinct count and cardinality ratio (distinct / total) +- Most common values (top 5-10 with frequencies) +- Least common values (bottom 5 to spot anomalies) + +**Numeric columns (metrics):** +``` +min, max, mean, median (p50) +standard deviation +percentiles: p1, p5, p25, p75, p95, p99 +zero count +negative count (if unexpected) +``` + +**String columns (dimensions, text):** +``` +min length, max length, avg length +empty string count +pattern analysis (do values follow a format?) +case consistency (all upper, all lower, mixed?) +leading/trailing whitespace count +``` + +**Date/timestamp columns:** +``` +min date, max date +null dates +future dates (if unexpected) +distribution by month/week +gaps in time series +``` + +**Boolean columns:** +``` +true count, false count, null count +true rate +``` + +### Phase 3: Relationship Discovery + +After profiling individual columns: + +- **Foreign key candidates**: ID columns that might link to other tables +- **Hierarchies**: Columns that form natural drill-down paths (country > state > city) +- **Correlations**: Numeric columns that move together +- **Derived columns**: Columns that appear to be computed from others +- **Redundant columns**: Columns with identical or near-identical information + +## Quality Assessment Framework + +### Completeness Score + +Rate each column: +- **Complete** (>99% non-null): Green +- **Mostly complete** (95-99%): Yellow -- investigate the nulls +- **Incomplete** (80-95%): Orange -- understand why and whether it matters +- **Sparse** (<80%): Red -- may not be usable without imputation + +### Consistency Checks + +Look for: +- **Value format inconsistency**: Same concept represented differently ("USA", "US", "United States", "us") +- **Type inconsistency**: Numbers stored as strings, dates in various formats +- **Referential integrity**: Foreign keys that don't match any parent record +- **Business rule violations**: Negative quantities, end dates before start dates, percentages > 100 +- **Cross-column consistency**: Status = "completed" but completed_at is null + +### Accuracy Indicators + +Red flags that suggest accuracy issues: +- **Placeholder values**: 0, -1, 999999, "N/A", "TBD", "test", "xxx" +- **Default values**: Suspiciously high frequency of a single value +- **Stale data**: Updated_at shows no recent changes in an active system +- **Impossible values**: Ages > 150, dates in the far future, negative durations +- **Round number bias**: All values ending in 0 or 5 (suggests estimation, not measurement) + +### Timeliness Assessment + +- When was the table last updated? +- What is the expected update frequency? +- Is there a lag between event time and load time? +- Are there gaps in the time series? 
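+
+One way to compute the column-level profile above with pandas -- a rough sketch rather than a full profiler; the example frame and the top-5 cutoff are illustrative:
+
+```python
+import pandas as pd
+
+def profile_column(s: pd.Series) -> dict:
+    """Nulls, cardinality, top values, and numeric stats for one column."""
+    n = len(s)
+    profile = {
+        "null_rate": float(s.isna().mean()) if n else 0.0,
+        "distinct": int(s.nunique(dropna=True)),
+        "cardinality_ratio": s.nunique(dropna=True) / n if n else 0.0,
+        "top_values": s.value_counts(dropna=True).head(5).to_dict(),
+    }
+    if pd.api.types.is_numeric_dtype(s):
+        profile.update({
+            "min": s.min(), "max": s.max(),
+            "mean": s.mean(), "median": s.median(),
+            "p95": s.quantile(0.95), "p99": s.quantile(0.99),
+            "zeros": int((s == 0).sum()),
+            "negatives": int((s < 0).sum()),
+        })
+    return profile
+
+df = pd.DataFrame({
+    "revenue": [10.0, 0.0, None, 250.0],
+    "region": ["US", "US", "EU", None],
+})
+print({col: profile_column(df[col]) for col in df.columns})
+```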
+ +## Pattern Discovery Techniques + +### Distribution Analysis + +For numeric columns, characterize the distribution: +- **Normal**: Mean and median are close, bell-shaped +- **Skewed right**: Long tail of high values (common for revenue, session duration) +- **Skewed left**: Long tail of low values (less common) +- **Bimodal**: Two peaks (suggests two distinct populations) +- **Power law**: Few very large values, many small ones (common for user activity) +- **Uniform**: Roughly equal frequency across range (often synthetic or random) + +### Temporal Patterns + +For time series data, look for: +- **Trend**: Sustained upward or downward movement +- **Seasonality**: Repeating patterns (weekly, monthly, quarterly, annual) +- **Day-of-week effects**: Weekday vs. weekend differences +- **Holiday effects**: Drops or spikes around known holidays +- **Change points**: Sudden shifts in level or trend +- **Anomalies**: Individual data points that break the pattern + +### Segmentation Discovery + +Identify natural segments by: +- Finding categorical columns with 3-20 distinct values +- Comparing metric distributions across segment values +- Looking for segments with significantly different behavior +- Testing whether segments are homogeneous or contain sub-segments + +### Correlation Exploration + +Between numeric columns: +- Compute correlation matrix for all metric pairs +- Flag strong correlations (|r| > 0.7) for investigation +- Note: Correlation does not imply causation -- flag this explicitly +- Check for non-linear relationships (e.g., quadratic, logarithmic) + +## Schema Understanding and Documentation + +### Schema Documentation Template + +When documenting a dataset for team use: + +```markdown +## Table: [schema.table_name] + +**Description**: [What this table represents] +**Grain**: [One row per...] 
+**Primary Key**: [column(s)] +**Row Count**: [approximate, with date] +**Update Frequency**: [real-time / hourly / daily / weekly] +**Owner**: [team or person responsible] + +### Key Columns + +| Column | Type | Description | Example Values | Notes | +|--------|------|-------------|----------------|-------| +| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id | +| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values | +| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events | +| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column | + +### Relationships +- Joins to `users` on `user_id` +- Joins to `products` on `product_id` +- Parent of `event_details` (1:many on event_id) + +### Known Issues +- [List any known data quality issues] +- [Note any gotchas for analysts] + +### Common Query Patterns +- [Typical use cases for this table] +``` + +### Schema Exploration Queries + +When connected to a data warehouse, use these patterns to discover schema: + +```sql +-- List all tables in a schema (PostgreSQL) +SELECT table_name, table_type +FROM information_schema.tables +WHERE table_schema = 'public' +ORDER BY table_name; + +-- Column details (PostgreSQL) +SELECT column_name, data_type, is_nullable, column_default +FROM information_schema.columns +WHERE table_name = 'my_table' +ORDER BY ordinal_position; + +-- Table sizes (PostgreSQL) +SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) +FROM pg_catalog.pg_statio_user_tables +ORDER BY pg_total_relation_size(relid) DESC; + +-- Row counts for all tables (general pattern) +-- Run per-table: SELECT COUNT(*) FROM table_name +``` + +### Lineage and Dependencies + +When exploring an unfamiliar data environment: + +1. Start with the "output" tables (what reports or dashboards consume) +2. Trace upstream: What tables feed into them? +3. Identify raw/staging/mart layers +4. Map the transformation chain from raw data to analytical tables +5. Note where data is enriched, filtered, or aggregated diff --git a/code_puppy/bundled_skills/Data/data-validation/SKILL.md b/code_puppy/bundled_skills/Data/data-validation/SKILL.md new file mode 100644 index 00000000..06a4d6fa --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-validation/SKILL.md @@ -0,0 +1,233 @@ +--- +name: data-validation +description: QA an analysis before sharing with stakeholders — methodology checks, accuracy verification, and bias detection. Use when reviewing an analysis for errors, checking for survivorship bias, validating aggregation logic, or preparing documentation for reproducibility. +--- + +# Data Validation Skill + +Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility. + +## Pre-Delivery QA Checklist + +Run through this checklist before sharing any analysis with stakeholders. + +### Data Quality Checks + +- [ ] **Source verification**: Confirmed which tables/data sources were used. Are they the right ones for this question? +- [ ] **Freshness**: Data is current enough for the analysis. Noted the "as of" date. +- [ ] **Completeness**: No unexpected gaps in time series or missing segments. +- [ ] **Null handling**: Checked null rates in key columns. Nulls are handled appropriately (excluded, imputed, or flagged). +- [ ] **Deduplication**: Confirmed no double-counting from bad joins or duplicate source records. 
+- [ ] **Filter verification**: All WHERE clauses and filters are correct. No unintended exclusions. + +### Calculation Checks + +- [ ] **Aggregation logic**: GROUP BY includes all non-aggregated columns. Aggregation level matches the analysis grain. +- [ ] **Denominator correctness**: Rate and percentage calculations use the right denominator. Denominators are non-zero. +- [ ] **Date alignment**: Comparisons use the same time period length. Partial periods are excluded or noted. +- [ ] **Join correctness**: JOIN types are appropriate (INNER vs LEFT). Many-to-many joins haven't inflated counts. +- [ ] **Metric definitions**: Metrics match how stakeholders define them. Any deviations are noted. +- [ ] **Subtotals sum**: Parts add up to the whole where expected. If they don't, explain why (e.g., overlap). + +### Reasonableness Checks + +- [ ] **Magnitude**: Numbers are in a plausible range. Revenue isn't negative. Percentages are between 0-100%. +- [ ] **Trend continuity**: No unexplained jumps or drops in time series. +- [ ] **Cross-reference**: Key numbers match other known sources (dashboards, previous reports, finance data). +- [ ] **Order of magnitude**: Total revenue is in the right ballpark. User counts match known figures. +- [ ] **Edge cases**: What happens at the boundaries? Empty segments, zero-activity periods, new entities. + +### Presentation Checks + +- [ ] **Chart accuracy**: Bar charts start at zero. Axes are labeled. Scales are consistent across panels. +- [ ] **Number formatting**: Appropriate precision. Consistent currency/percentage formatting. Thousands separators where needed. +- [ ] **Title clarity**: Titles state the insight, not just the metric. Date ranges are specified. +- [ ] **Caveat transparency**: Known limitations and assumptions are stated explicitly. +- [ ] **Reproducibility**: Someone else could recreate this analysis from the documentation provided. + +## Common Data Analysis Pitfalls + +### Join Explosion + +**The problem**: A many-to-many join silently multiplies rows, inflating counts and sums. + +**How to detect**: +```sql +-- Check row count before and after join +SELECT COUNT(*) FROM table_a; -- 1,000 +SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id; -- 3,500 (uh oh) +``` + +**How to prevent**: +- Always check row counts after joins +- If counts increase, investigate the join relationship (is it really 1:1 or 1:many?) +- Use `COUNT(DISTINCT a.id)` instead of `COUNT(*)` when counting entities through joins + +### Survivorship Bias + +**The problem**: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed. + +**Examples**: +- Analyzing user behavior of "current users" misses churned users +- Looking at "companies using our product" ignores those who evaluated and left +- Studying properties of "successful" outcomes without "unsuccessful" ones + +**How to prevent**: Ask "who is NOT in this dataset?" before drawing conclusions. + +### Incomplete Period Comparison + +**The problem**: Comparing a partial period to a full period. + +**Examples**: +- "January revenue is $500K vs. December's $800K" -- but January isn't over yet +- "This week's signups are down" -- checked on Wednesday, comparing to a full prior week + +**How to prevent**: Always filter to complete periods, or compare same-day-of-month / same-number-of-days. + +### Denominator Shifting + +**The problem**: The denominator changes between periods, making rates incomparable. 
+ +**Examples**: +- Conversion rate improves because you changed how you count "eligible" users +- Churn rate changes because the definition of "active" was updated + +**How to prevent**: Use consistent definitions across all compared periods. Note any definition changes. + +### Average of Averages + +**The problem**: Averaging pre-computed averages gives wrong results when group sizes differ. + +**Example**: +- Group A: 100 users, average revenue $50 +- Group B: 10 users, average revenue $200 +- Wrong: Average of averages = ($50 + $200) / 2 = $125 +- Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64 + +**How to prevent**: Always aggregate from raw data. Never average pre-aggregated averages. + +### Timezone Mismatches + +**The problem**: Different data sources use different timezones, causing misalignment. + +**Examples**: +- Event timestamps in UTC vs. user-facing dates in local time +- Daily rollups that use different cutoff times + +**How to prevent**: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used. + +### Selection Bias in Segmentation + +**The problem**: Segments are defined by the outcome you're measuring, creating circular logic. + +**Examples**: +- "Users who completed onboarding have higher retention" -- obviously, they self-selected +- "Power users generate more revenue" -- they became power users BY generating revenue + +**How to prevent**: Define segments based on pre-treatment characteristics, not outcomes. + +## Result Sanity Checking + +### Magnitude Checks + +For any key number in your analysis, verify it passes the "smell test": + +| Metric Type | Sanity Check | +|---|---| +| User counts | Does this match known MAU/DAU figures? | +| Revenue | Is this in the right order of magnitude vs. known ARR? | +| Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? | +| Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? | +| Averages | Is the average reasonable given what you know about the distribution? | +| Percentages | Do segment percentages sum to ~100%? | + +### Cross-Validation Techniques + +1. **Calculate the same metric two different ways** and verify they match +2. **Spot-check individual records** -- pick a few specific entities and trace their data manually +3. **Compare to known benchmarks** -- match against published dashboards, finance reports, or prior analyses +4. **Reverse engineer** -- if total revenue is X, does per-user revenue times user count approximately equal X? +5. **Boundary checks** -- what happens when you filter to a single day, a single user, or a single category? Are those micro-results sensible? 
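+
+A short pandas illustration of the join-explosion check and the average-of-averages correction described above -- the join data is made up, and the second example reuses the figures from the pitfall section:
+
+```python
+import pandas as pd
+
+# Join explosion: compare row counts before and after the join.
+orders = pd.DataFrame({"order_id": [1, 2], "user_id": ["a", "b"]})
+events = pd.DataFrame({"user_id": ["a", "a", "a", "b"], "event": ["view", "click", "view", "view"]})
+joined = orders.merge(events, on="user_id", how="left")
+print(len(orders), "->", len(joined))        # 2 -> 4: the 1:many join multiplied order rows
+print(joined["order_id"].nunique())          # 2: count distinct entities, not raw rows
+
+# Average of averages vs. the correct weighted average.
+groups = pd.DataFrame({"users": [100, 10], "avg_revenue": [50.0, 200.0]})
+wrong = groups["avg_revenue"].mean()                                              # 125.0
+right = (groups["users"] * groups["avg_revenue"]).sum() / groups["users"].sum()   # ~63.64
+print(round(wrong, 2), round(right, 2))
+```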
+ +### Red Flags That Warrant Investigation + +- Any metric that changed by more than 50% period-over-period without an obvious cause +- Counts or sums that are exact round numbers (suggests a filter or default value issue) +- Rates exactly at 0% or 100% (may indicate incomplete data) +- Results that perfectly confirm the hypothesis (reality is usually messier) +- Identical values across time periods or segments (suggests the query is ignoring a dimension) + +## Documentation Standards for Reproducibility + +### Analysis Documentation Template + +Every non-trivial analysis should include: + +```markdown +## Analysis: [Title] + +### Question +[The specific question being answered] + +### Data Sources +- Table: [schema.table_name] (as of [date]) +- Table: [schema.other_table] (as of [date]) +- File: [filename] (source: [where it came from]) + +### Definitions +- [Metric A]: [Exactly how it's calculated] +- [Segment X]: [Exactly how membership is determined] +- [Time period]: [Start date] to [end date], [timezone] + +### Methodology +1. [Step 1 of the analysis approach] +2. [Step 2] +3. [Step 3] + +### Assumptions and Limitations +- [Assumption 1 and why it's reasonable] +- [Limitation 1 and its potential impact on conclusions] + +### Key Findings +1. [Finding 1 with supporting evidence] +2. [Finding 2 with supporting evidence] + +### SQL Queries +[All queries used, with comments] + +### Caveats +- [Things the reader should know before acting on this] +``` + +### Code Documentation + +For any code (SQL, Python) that may be reused: + +```python +""" +Analysis: Monthly Cohort Retention +Author: [Name] +Date: [Date] +Data Source: events table, users table +Last Validated: [Date] -- results matched dashboard within 2% + +Purpose: + Calculate monthly user retention cohorts based on first activity date. + +Assumptions: + - "Active" means at least one event in the month + - Excludes test/internal accounts (user_type != 'internal') + - Uses UTC dates throughout + +Output: + Cohort retention matrix with cohort_month rows and months_since_signup columns. + Values are retention rates (0-100%). +""" +``` + +### Version Control for Analyses + +- Save queries and code in version control (git) or a shared docs system +- Note the date of the data snapshot used +- If an analysis is re-run with updated data, document what changed and why +- Link to prior versions of recurring analyses for trend comparison diff --git a/code_puppy/bundled_skills/Data/data-visualization/SKILL.md b/code_puppy/bundled_skills/Data/data-visualization/SKILL.md new file mode 100644 index 00000000..d3c0ace9 --- /dev/null +++ b/code_puppy/bundled_skills/Data/data-visualization/SKILL.md @@ -0,0 +1,304 @@ +--- +name: data-visualization +description: Create effective data visualizations with Python (matplotlib, seaborn, plotly). Use when building charts, choosing the right chart type for a dataset, creating publication-quality figures, or applying design principles like accessibility and color theory. +--- + +# Data Visualization Skill + +Chart selection guidance, Python visualization code patterns, design principles, and accessibility considerations for creating effective data visualizations. 
+ +## Chart Selection Guide + +### Choose by Data Relationship + +| What You're Showing | Best Chart | Alternatives | +|---|---|---| +| **Trend over time** | Line chart | Area chart (if showing cumulative or composition) | +| **Comparison across categories** | Vertical bar chart | Horizontal bar (many categories), lollipop chart | +| **Ranking** | Horizontal bar chart | Dot plot, slope chart (comparing two periods) | +| **Part-to-whole composition** | Stacked bar chart | Treemap (hierarchical), waffle chart | +| **Composition over time** | Stacked area chart | 100% stacked bar (for proportion focus) | +| **Distribution** | Histogram | Box plot (comparing groups), violin plot, strip plot | +| **Correlation (2 variables)** | Scatter plot | Bubble chart (add 3rd variable as size) | +| **Correlation (many variables)** | Heatmap (correlation matrix) | Pair plot | +| **Geographic patterns** | Choropleth map | Bubble map, hex map | +| **Flow / process** | Sankey diagram | Funnel chart (sequential stages) | +| **Relationship network** | Network graph | Chord diagram | +| **Performance vs. target** | Bullet chart | Gauge (single KPI only) | +| **Multiple KPIs at once** | Small multiples | Dashboard with separate charts | + +### When NOT to Use Certain Charts + +- **Pie charts**: Avoid unless <6 categories and exact proportions matter less than rough comparison. Humans are bad at comparing angles. Use bar charts instead. +- **3D charts**: Never. They distort perception and add no information. +- **Dual-axis charts**: Use cautiously. They can mislead by implying correlation. Clearly label both axes if used. +- **Stacked bar (many categories)**: Hard to compare middle segments. Use small multiples or grouped bars instead. +- **Donut charts**: Slightly better than pie charts but same fundamental issues. Use for single KPI display at most. 
+ +## Python Visualization Code Patterns + +### Setup and Style + +```python +import matplotlib.pyplot as plt +import matplotlib.ticker as mticker +import seaborn as sns +import pandas as pd +import numpy as np + +# Professional style setup +plt.style.use('seaborn-v0_8-whitegrid') +plt.rcParams.update({ + 'figure.figsize': (10, 6), + 'figure.dpi': 150, + 'font.size': 11, + 'axes.titlesize': 14, + 'axes.titleweight': 'bold', + 'axes.labelsize': 11, + 'xtick.labelsize': 10, + 'ytick.labelsize': 10, + 'legend.fontsize': 10, + 'figure.titlesize': 16, +}) + +# Colorblind-friendly palettes +PALETTE_CATEGORICAL = ['#4C72B0', '#DD8452', '#55A868', '#C44E52', '#8172B3', '#937860'] +PALETTE_SEQUENTIAL = 'YlOrRd' +PALETTE_DIVERGING = 'RdBu_r' +``` + +### Line Chart (Time Series) + +```python +fig, ax = plt.subplots(figsize=(10, 6)) + +for label, group in df.groupby('category'): + ax.plot(group['date'], group['value'], label=label, linewidth=2) + +ax.set_title('Metric Trend by Category', fontweight='bold') +ax.set_xlabel('Date') +ax.set_ylabel('Value') +ax.legend(loc='upper left', frameon=True) +ax.spines['top'].set_visible(False) +ax.spines['right'].set_visible(False) + +# Format dates on x-axis +fig.autofmt_xdate() + +plt.tight_layout() +plt.savefig('trend_chart.png', dpi=150, bbox_inches='tight') +``` + +### Bar Chart (Comparison) + +```python +fig, ax = plt.subplots(figsize=(10, 6)) + +# Sort by value for easy reading +df_sorted = df.sort_values('metric', ascending=True) + +bars = ax.barh(df_sorted['category'], df_sorted['metric'], color=PALETTE_CATEGORICAL[0]) + +# Add value labels +for bar in bars: + width = bar.get_width() + ax.text(width + 0.5, bar.get_y() + bar.get_height()/2, + f'{width:,.0f}', ha='left', va='center', fontsize=10) + +ax.set_title('Metric by Category (Ranked)', fontweight='bold') +ax.set_xlabel('Metric Value') +ax.spines['top'].set_visible(False) +ax.spines['right'].set_visible(False) + +plt.tight_layout() +plt.savefig('bar_chart.png', dpi=150, bbox_inches='tight') +``` + +### Histogram (Distribution) + +```python +fig, ax = plt.subplots(figsize=(10, 6)) + +ax.hist(df['value'], bins=30, color=PALETTE_CATEGORICAL[0], edgecolor='white', alpha=0.8) + +# Add mean and median lines +mean_val = df['value'].mean() +median_val = df['value'].median() +ax.axvline(mean_val, color='red', linestyle='--', linewidth=1.5, label=f'Mean: {mean_val:,.1f}') +ax.axvline(median_val, color='green', linestyle='--', linewidth=1.5, label=f'Median: {median_val:,.1f}') + +ax.set_title('Distribution of Values', fontweight='bold') +ax.set_xlabel('Value') +ax.set_ylabel('Frequency') +ax.legend() +ax.spines['top'].set_visible(False) +ax.spines['right'].set_visible(False) + +plt.tight_layout() +plt.savefig('histogram.png', dpi=150, bbox_inches='tight') +``` + +### Heatmap + +```python +fig, ax = plt.subplots(figsize=(10, 8)) + +# Pivot data for heatmap format +pivot = df.pivot_table(index='row_dim', columns='col_dim', values='metric', aggfunc='sum') + +sns.heatmap(pivot, annot=True, fmt=',.0f', cmap='YlOrRd', + linewidths=0.5, ax=ax, cbar_kws={'label': 'Metric Value'}) + +ax.set_title('Metric by Row Dimension and Column Dimension', fontweight='bold') +ax.set_xlabel('Column Dimension') +ax.set_ylabel('Row Dimension') + +plt.tight_layout() +plt.savefig('heatmap.png', dpi=150, bbox_inches='tight') +``` + +### Small Multiples + +```python +categories = df['category'].unique() +n_cats = len(categories) +n_cols = min(3, n_cats) +n_rows = (n_cats + n_cols - 1) // n_cols + +fig, axes = plt.subplots(n_rows, n_cols, 
figsize=(5*n_cols, 4*n_rows), sharex=True, sharey=True) +axes = axes.flatten() if n_cats > 1 else [axes] + +for i, cat in enumerate(categories): + ax = axes[i] + subset = df[df['category'] == cat] + ax.plot(subset['date'], subset['value'], color=PALETTE_CATEGORICAL[i % len(PALETTE_CATEGORICAL)]) + ax.set_title(cat, fontsize=12) + ax.spines['top'].set_visible(False) + ax.spines['right'].set_visible(False) + +# Hide empty subplots +for j in range(i+1, len(axes)): + axes[j].set_visible(False) + +fig.suptitle('Trends by Category', fontsize=14, fontweight='bold', y=1.02) +plt.tight_layout() +plt.savefig('small_multiples.png', dpi=150, bbox_inches='tight') +``` + +### Number Formatting Helpers + +```python +def format_number(val, format_type='number'): + """Format numbers for chart labels.""" + if format_type == 'currency': + if abs(val) >= 1e9: + return f'${val/1e9:.1f}B' + elif abs(val) >= 1e6: + return f'${val/1e6:.1f}M' + elif abs(val) >= 1e3: + return f'${val/1e3:.1f}K' + else: + return f'${val:,.0f}' + elif format_type == 'percent': + return f'{val:.1f}%' + elif format_type == 'number': + if abs(val) >= 1e9: + return f'{val/1e9:.1f}B' + elif abs(val) >= 1e6: + return f'{val/1e6:.1f}M' + elif abs(val) >= 1e3: + return f'{val/1e3:.1f}K' + else: + return f'{val:,.0f}' + return str(val) + +# Usage with axis formatter +ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, p: format_number(x, 'currency'))) +``` + +### Interactive Charts with Plotly + +```python +import plotly.express as px +import plotly.graph_objects as go + +# Simple interactive line chart +fig = px.line(df, x='date', y='value', color='category', + title='Interactive Metric Trend', + labels={'value': 'Metric Value', 'date': 'Date'}) +fig.update_layout(hovermode='x unified') +fig.write_html('interactive_chart.html') +fig.show() + +# Interactive scatter with hover data +fig = px.scatter(df, x='metric_a', y='metric_b', color='category', + size='size_metric', hover_data=['name', 'detail_field'], + title='Correlation Analysis') +fig.show() +``` + +## Design Principles + +### Color + +- **Use color purposefully**: Color should encode data, not decorate +- **Highlight the story**: Use a bright accent color for the key insight; grey everything else +- **Sequential data**: Use a single-hue gradient (light to dark) for ordered values +- **Diverging data**: Use a two-hue gradient with neutral midpoint for data with a meaningful center +- **Categorical data**: Use distinct hues, maximum 6-8 before it gets confusing +- **Avoid red/green only**: 8% of men are red-green colorblind. Use blue/orange as primary pair + +### Typography + +- **Title states the insight**: "Revenue grew 23% YoY" beats "Revenue by Month" +- **Subtitle adds context**: Date range, filters applied, data source +- **Axis labels are readable**: Never rotated 90 degrees if avoidable. Shorten or wrap instead +- **Data labels add precision**: Use on key points, not every single bar +- **Annotation highlights**: Call out specific points with text annotations + +### Layout + +- **Reduce chart junk**: Remove gridlines, borders, backgrounds that don't carry information +- **Sort meaningfully**: Categories sorted by value (not alphabetically) unless there's a natural order (months, stages) +- **Appropriate aspect ratio**: Time series wider than tall (3:1 to 2:1); comparisons can be squarer +- **White space is good**: Don't cram charts together. Give each visualization room to breathe + +### Accuracy + +- **Bar charts start at zero**: Always. 
A bar from 95 to 100 exaggerates a 5% difference +- **Line charts can have non-zero baselines**: When the range of variation is meaningful +- **Consistent scales across panels**: When comparing multiple charts, use the same axis range +- **Show uncertainty**: Error bars, confidence intervals, or ranges when data is uncertain +- **Label your axes**: Never make the reader guess what the numbers mean + +## Accessibility Considerations + +### Color Blindness + +- Never rely on color alone to distinguish data series +- Add pattern fills, different line styles (solid, dashed, dotted), or direct labels +- Test with a colorblind simulator (e.g., Coblis, Sim Daltonism) +- Use the colorblind-friendly palette: `sns.color_palette("colorblind")` + +### Screen Readers + +- Include alt text describing the chart's key finding +- Provide a data table alternative alongside the visualization +- Use semantic titles and labels + +### General Accessibility + +- Sufficient contrast between data elements and background +- Text size minimum 10pt for labels, 12pt for titles +- Avoid conveying information only through spatial position (add labels) +- Consider printing: does the chart work in black and white? + +### Accessibility Checklist + +Before sharing a visualization: +- [ ] Chart works without color (patterns, labels, or line styles differentiate series) +- [ ] Text is readable at standard zoom level +- [ ] Title describes the insight, not just the data +- [ ] Axes are labeled with units +- [ ] Legend is clear and positioned without obscuring data +- [ ] Data source and date range are noted diff --git a/code_puppy/bundled_skills/Data/interactive-dashboard-builder/SKILL.md b/code_puppy/bundled_skills/Data/interactive-dashboard-builder/SKILL.md new file mode 100644 index 00000000..6ccc7fcb --- /dev/null +++ b/code_puppy/bundled_skills/Data/interactive-dashboard-builder/SKILL.md @@ -0,0 +1,786 @@ +--- +name: interactive-dashboard-builder +description: Build self-contained interactive HTML dashboards with Chart.js, dropdown filters, and professional styling. Use when creating dashboards, building interactive reports, or generating shareable HTML files with charts and filters that work without a server. +--- + +# Interactive Dashboard Builder Skill + +Patterns and techniques for building self-contained HTML/JS dashboards with Chart.js, filters, interactivity, and professional styling. + +## HTML/JS Dashboard Patterns + +### Base Template + +Every dashboard follows this structure: + +```html + + + + + + Dashboard Title + + + + + +
<!-- <head> above loads Chart.js and contains the CSS from the styling sections below -->
<body>
  <div class="dashboard-container">

    <!-- Header: title plus global filters -->
    <div class="dashboard-header">
      <h1>Dashboard Title</h1>
      <div class="filters">
        <!-- filter-group elements (see Filter and Interactivity Implementation below) -->
      </div>
    </div>

    <!-- KPI cards (see KPI Card Pattern below) -->
    <div class="kpi-row"></div>

    <!-- Charts (see Chart Container Pattern below) -->
    <div class="chart-row"></div>

    <!-- Sortable data table (rendered by renderTable below) -->
    <div class="table-section" id="data-table-container"></div>

    <!-- Footer -->
    <div class="dashboard-footer">
      Data as of: <span id="data-as-of"></span>
    </div>
  </div>

  <script>
    // Embedded (pre-aggregated) data and dashboard logic go here
  </script>
</body>
</html>
```

### KPI Card Pattern

```html
<div class="kpi-card">
  <div class="kpi-label">Total Revenue</div>
  <div class="kpi-value" id="kpi-revenue">$0</div>
  <div class="kpi-change" id="kpi-revenue-change">+0%</div>
</div>
+``` + +```javascript +function renderKPI(elementId, value, previousValue, format = 'number') { + const el = document.getElementById(elementId); + const changeEl = document.getElementById(elementId + '-change'); + + // Format the value + el.textContent = formatValue(value, format); + + // Calculate and display change + if (previousValue && previousValue !== 0) { + const pctChange = ((value - previousValue) / previousValue) * 100; + const sign = pctChange >= 0 ? '+' : ''; + changeEl.textContent = `${sign}${pctChange.toFixed(1)}% vs prior period`; + changeEl.className = `kpi-change ${pctChange >= 0 ? 'positive' : 'negative'}`; + } +} + +function formatValue(value, format) { + switch (format) { + case 'currency': + if (value >= 1e6) return `$${(value / 1e6).toFixed(1)}M`; + if (value >= 1e3) return `$${(value / 1e3).toFixed(1)}K`; + return `$${value.toFixed(0)}`; + case 'percent': + return `${value.toFixed(1)}%`; + case 'number': + if (value >= 1e6) return `${(value / 1e6).toFixed(1)}M`; + if (value >= 1e3) return `${(value / 1e3).toFixed(1)}K`; + return value.toLocaleString(); + default: + return value.toString(); + } +} +``` + +### Chart Container Pattern + +```html +
<div class="chart-container">
  <h3>Monthly Revenue Trend</h3>
  <!-- the canvas id is what gets passed to the Chart.js helper functions below -->
  <canvas id="revenue-trend-chart"></canvas>
</div>
+``` + +## Chart.js Integration + +### Line Chart + +```javascript +function createLineChart(canvasId, labels, datasets) { + const ctx = document.getElementById(canvasId).getContext('2d'); + return new Chart(ctx, { + type: 'line', + data: { + labels: labels, + datasets: datasets.map((ds, i) => ({ + label: ds.label, + data: ds.data, + borderColor: COLORS[i % COLORS.length], + backgroundColor: COLORS[i % COLORS.length] + '20', + borderWidth: 2, + fill: ds.fill || false, + tension: 0.3, + pointRadius: 3, + pointHoverRadius: 6, + })) + }, + options: { + responsive: true, + maintainAspectRatio: false, + interaction: { + mode: 'index', + intersect: false, + }, + plugins: { + legend: { + position: 'top', + labels: { usePointStyle: true, padding: 20 } + }, + tooltip: { + callbacks: { + label: function(context) { + return `${context.dataset.label}: ${formatValue(context.parsed.y, 'currency')}`; + } + } + } + }, + scales: { + x: { + grid: { display: false } + }, + y: { + beginAtZero: true, + ticks: { + callback: function(value) { + return formatValue(value, 'currency'); + } + } + } + } + } + }); +} +``` + +### Bar Chart + +```javascript +function createBarChart(canvasId, labels, data, options = {}) { + const ctx = document.getElementById(canvasId).getContext('2d'); + const isHorizontal = options.horizontal || labels.length > 8; + + return new Chart(ctx, { + type: 'bar', + data: { + labels: labels, + datasets: [{ + label: options.label || 'Value', + data: data, + backgroundColor: options.colors || COLORS.map(c => c + 'CC'), + borderColor: options.colors || COLORS, + borderWidth: 1, + borderRadius: 4, + }] + }, + options: { + responsive: true, + maintainAspectRatio: false, + indexAxis: isHorizontal ? 'y' : 'x', + plugins: { + legend: { display: false }, + tooltip: { + callbacks: { + label: function(context) { + return formatValue(context.parsed[isHorizontal ? 'x' : 'y'], options.format || 'number'); + } + } + } + }, + scales: { + x: { + beginAtZero: true, + grid: { display: isHorizontal }, + ticks: isHorizontal ? { + callback: function(value) { + return formatValue(value, options.format || 'number'); + } + } : {} + }, + y: { + beginAtZero: !isHorizontal, + grid: { display: !isHorizontal }, + ticks: !isHorizontal ? 
{ + callback: function(value) { + return formatValue(value, options.format || 'number'); + } + } : {} + } + } + } + }); +} +``` + +### Doughnut Chart + +```javascript +function createDoughnutChart(canvasId, labels, data) { + const ctx = document.getElementById(canvasId).getContext('2d'); + return new Chart(ctx, { + type: 'doughnut', + data: { + labels: labels, + datasets: [{ + data: data, + backgroundColor: COLORS.map(c => c + 'CC'), + borderColor: '#ffffff', + borderWidth: 2, + }] + }, + options: { + responsive: true, + maintainAspectRatio: false, + cutout: '60%', + plugins: { + legend: { + position: 'right', + labels: { usePointStyle: true, padding: 15 } + }, + tooltip: { + callbacks: { + label: function(context) { + const total = context.dataset.data.reduce((a, b) => a + b, 0); + const pct = ((context.parsed / total) * 100).toFixed(1); + return `${context.label}: ${formatValue(context.parsed, 'number')} (${pct}%)`; + } + } + } + } + } + }); +} +``` + +### Updating Charts on Filter Change + +```javascript +function updateChart(chart, newLabels, newData) { + chart.data.labels = newLabels; + + if (Array.isArray(newData[0])) { + // Multiple datasets + newData.forEach((data, i) => { + chart.data.datasets[i].data = data; + }); + } else { + chart.data.datasets[0].data = newData; + } + + chart.update('none'); // 'none' disables animation for instant update +} +``` + +## Filter and Interactivity Implementation + +### Dropdown Filter + +```html +
<div class="filter-group">
  <label for="filter-region">Region</label>
  <select id="filter-region">
    <option value="all">All</option>
    <!-- remaining options are added by populateFilter() -->
  </select>
</div>
+``` + +```javascript +function populateFilter(selectId, data, field) { + const select = document.getElementById(selectId); + const values = [...new Set(data.map(d => d[field]))].sort(); + + // Keep the "All" option, add unique values + values.forEach(val => { + const option = document.createElement('option'); + option.value = val; + option.textContent = val; + select.appendChild(option); + }); +} + +function getFilterValue(selectId) { + const val = document.getElementById(selectId).value; + return val === 'all' ? null : val; +} +``` + +### Date Range Filter + +```html +
<div class="filter-group">
  <label>Date Range</label>
  <input type="date" id="filter-date-start">
  to
  <input type="date" id="filter-date-end">
</div>
```

```javascript
function filterByDateRange(data, dateField, startDate, endDate) {
  return data.filter(row => {
    const rowDate = new Date(row[dateField]);
    if (startDate && rowDate < new Date(startDate)) return false;
    if (endDate && rowDate > new Date(endDate)) return false;
    return true;
  });
}
```

### Combined Filter Logic

```javascript
applyFilters() {
  const region = getFilterValue('filter-region');
  const category = getFilterValue('filter-category');
  const startDate = document.getElementById('filter-date-start').value;
  const endDate = document.getElementById('filter-date-end').value;

  this.filteredData = this.rawData.filter(row => {
    if (region && row.region !== region) return false;
    if (category && row.category !== category) return false;
    if (startDate && row.date < startDate) return false;
    if (endDate && row.date > endDate) return false;
    return true;
  });

  this.renderKPIs();
  this.updateCharts();
  this.renderTable();
}
```

### Sortable Table

```javascript
function renderTable(containerId, data, columns) {
  const container = document.getElementById(containerId);
  let sortCol = null;
  let sortDir = 'desc';

  function render(sortedData) {
    let html = '<table class="data-table">';

    // Header row with clickable sort handlers
    html += '<thead><tr>';
    columns.forEach(col => {
      const arrow = sortCol === col.field
        ? (sortDir === 'asc' ? ' ▲' : ' ▼')
        : '';
      html += `<th onclick="sortTable('${col.field}')">${col.label}${arrow}</th>`;
    });
    html += '</tr></thead>';

    // Body rows
    html += '<tbody>';
    sortedData.forEach(row => {
      html += '<tr>';
      columns.forEach(col => {
        const value = col.format ? formatValue(row[col.field], col.format) : row[col.field];
        html += `<td>${value}</td>`;
      });
      html += '</tr>';
    });
    html += '</tbody></table>
'; + + container.innerHTML = html; + } + + window.sortTable = function(field) { + if (sortCol === field) { + sortDir = sortDir === 'asc' ? 'desc' : 'asc'; + } else { + sortCol = field; + sortDir = 'desc'; + } + const sorted = [...data].sort((a, b) => { + const aVal = a[field], bVal = b[field]; + const cmp = aVal < bVal ? -1 : aVal > bVal ? 1 : 0; + return sortDir === 'asc' ? cmp : -cmp; + }); + render(sorted); + }; + + render(data); +} +``` + +## CSS Styling for Dashboards + +### Color System + +```css +:root { + /* Background layers */ + --bg-primary: #f8f9fa; + --bg-card: #ffffff; + --bg-header: #1a1a2e; + + /* Text */ + --text-primary: #212529; + --text-secondary: #6c757d; + --text-on-dark: #ffffff; + + /* Accent colors for data */ + --color-1: #4C72B0; + --color-2: #DD8452; + --color-3: #55A868; + --color-4: #C44E52; + --color-5: #8172B3; + --color-6: #937860; + + /* Status colors */ + --positive: #28a745; + --negative: #dc3545; + --neutral: #6c757d; + + /* Spacing */ + --gap: 16px; + --radius: 8px; +} +``` + +### Layout + +```css +* { + margin: 0; + padding: 0; + box-sizing: border-box; +} + +body { + font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; + background: var(--bg-primary); + color: var(--text-primary); + line-height: 1.5; +} + +.dashboard-container { + max-width: 1400px; + margin: 0 auto; + padding: var(--gap); +} + +.dashboard-header { + background: var(--bg-header); + color: var(--text-on-dark); + padding: 20px 24px; + border-radius: var(--radius); + margin-bottom: var(--gap); + display: flex; + justify-content: space-between; + align-items: center; + flex-wrap: wrap; + gap: 12px; +} + +.dashboard-header h1 { + font-size: 20px; + font-weight: 600; +} +``` + +### KPI Cards + +```css +.kpi-row { + display: grid; + grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); + gap: var(--gap); + margin-bottom: var(--gap); +} + +.kpi-card { + background: var(--bg-card); + border-radius: var(--radius); + padding: 20px 24px; + box-shadow: 0 1px 3px rgba(0, 0, 0, 0.08); +} + +.kpi-label { + font-size: 13px; + color: var(--text-secondary); + text-transform: uppercase; + letter-spacing: 0.5px; + margin-bottom: 4px; +} + +.kpi-value { + font-size: 28px; + font-weight: 700; + color: var(--text-primary); + margin-bottom: 4px; +} + +.kpi-change { + font-size: 13px; + font-weight: 500; +} + +.kpi-change.positive { color: var(--positive); } +.kpi-change.negative { color: var(--negative); } +``` + +### Chart Containers + +```css +.chart-row { + display: grid; + grid-template-columns: repeat(auto-fit, minmax(400px, 1fr)); + gap: var(--gap); + margin-bottom: var(--gap); +} + +.chart-container { + background: var(--bg-card); + border-radius: var(--radius); + padding: 20px 24px; + box-shadow: 0 1px 3px rgba(0, 0, 0, 0.08); +} + +.chart-container h3 { + font-size: 14px; + font-weight: 600; + color: var(--text-primary); + margin-bottom: 16px; +} + +.chart-container canvas { + max-height: 300px; +} +``` + +### Filters + +```css +.filters { + display: flex; + gap: 12px; + align-items: center; + flex-wrap: wrap; +} + +.filter-group { + display: flex; + align-items: center; + gap: 6px; +} + +.filter-group label { + font-size: 12px; + color: rgba(255, 255, 255, 0.7); +} + +.filter-group select, +.filter-group input[type="date"] { + padding: 6px 10px; + border: 1px solid rgba(255, 255, 255, 0.2); + border-radius: 4px; + background: rgba(255, 255, 255, 0.1); + color: var(--text-on-dark); + font-size: 13px; +} + +.filter-group select option { + background: 
var(--bg-header); + color: var(--text-on-dark); +} +``` + +### Data Table + +```css +.table-section { + background: var(--bg-card); + border-radius: var(--radius); + padding: 20px 24px; + box-shadow: 0 1px 3px rgba(0, 0, 0, 0.08); + overflow-x: auto; +} + +.data-table { + width: 100%; + border-collapse: collapse; + font-size: 13px; +} + +.data-table thead th { + text-align: left; + padding: 10px 12px; + border-bottom: 2px solid #dee2e6; + color: var(--text-secondary); + font-weight: 600; + font-size: 12px; + text-transform: uppercase; + letter-spacing: 0.5px; + white-space: nowrap; + user-select: none; +} + +.data-table thead th:hover { + color: var(--text-primary); + background: #f8f9fa; +} + +.data-table tbody td { + padding: 10px 12px; + border-bottom: 1px solid #f0f0f0; +} + +.data-table tbody tr:hover { + background: #f8f9fa; +} + +.data-table tbody tr:last-child td { + border-bottom: none; +} +``` + +### Responsive Design + +```css +@media (max-width: 768px) { + .dashboard-header { + flex-direction: column; + align-items: flex-start; + } + + .kpi-row { + grid-template-columns: repeat(2, 1fr); + } + + .chart-row { + grid-template-columns: 1fr; + } + + .filters { + flex-direction: column; + align-items: flex-start; + } +} + +@media print { + body { background: white; } + .dashboard-container { max-width: none; } + .filters { display: none; } + .chart-container { break-inside: avoid; } + .kpi-card { border: 1px solid #dee2e6; box-shadow: none; } +} +``` + +## Performance Considerations for Large Datasets + +### Data Size Guidelines + +| Data Size | Approach | +|---|---| +| <1,000 rows | Embed directly in HTML. Full interactivity. | +| 1,000 - 10,000 rows | Embed in HTML. May need to pre-aggregate for charts. | +| 10,000 - 100,000 rows | Pre-aggregate server-side. Embed only aggregated data. | +| >100,000 rows | Not suitable for client-side dashboard. Use a BI tool or paginate. | + +### Pre-Aggregation Pattern + +Instead of embedding raw data and aggregating in the browser: + +```javascript +// DON'T: embed 50,000 raw rows +const RAW_DATA = [/* 50,000 rows */]; + +// DO: pre-aggregate before embedding +const CHART_DATA = { + monthly_revenue: [ + { month: '2024-01', revenue: 150000, orders: 1200 }, + { month: '2024-02', revenue: 165000, orders: 1350 }, + // ... 12 rows instead of 50,000 + ], + top_products: [ + { product: 'Widget A', revenue: 45000 }, + // ... 10 rows + ], + kpis: { + total_revenue: 1980000, + total_orders: 15600, + avg_order_value: 127, + } +}; +``` + +### Chart Performance + +- Limit line charts to <500 data points per series (downsample if needed) +- Limit bar charts to <50 categories +- For scatter plots, cap at 1,000 points (use sampling for larger datasets) +- Disable animations for dashboards with many charts: `animation: false` in Chart.js options +- Use `Chart.update('none')` instead of `Chart.update()` for filter-triggered updates + +### DOM Performance + +- Limit data tables to 100-200 visible rows. Add pagination for more. 
+- Use `requestAnimationFrame` for coordinated chart updates +- Avoid rebuilding the entire DOM on filter change -- update only changed elements + +```javascript +// Efficient table pagination +function renderTablePage(data, page, pageSize = 50) { + const start = page * pageSize; + const end = Math.min(start + pageSize, data.length); + const pageData = data.slice(start, end); + // Render only pageData + // Show pagination controls: "Showing 1-50 of 2,340" +} +``` diff --git a/code_puppy/bundled_skills/Data/sql-queries/SKILL.md b/code_puppy/bundled_skills/Data/sql-queries/SKILL.md new file mode 100644 index 00000000..e2225c2b --- /dev/null +++ b/code_puppy/bundled_skills/Data/sql-queries/SKILL.md @@ -0,0 +1,427 @@ +--- +name: sql-queries +description: Write correct, performant SQL across all major data warehouse dialects (Snowflake, BigQuery, Databricks, PostgreSQL, etc.). Use when writing queries, optimizing slow SQL, translating between dialects, or building complex analytical queries with CTEs, window functions, or aggregations. +--- + +# SQL Queries Skill + +Write correct, performant, readable SQL across all major data warehouse dialects. + +## Dialect-Specific Reference + +### PostgreSQL (including Aurora, RDS, Supabase, Neon) + +**Date/time:** +```sql +-- Current date/time +CURRENT_DATE, CURRENT_TIMESTAMP, NOW() + +-- Date arithmetic +date_column + INTERVAL '7 days' +date_column - INTERVAL '1 month' + +-- Truncate to period +DATE_TRUNC('month', created_at) + +-- Extract parts +EXTRACT(YEAR FROM created_at) +EXTRACT(DOW FROM created_at) -- 0=Sunday + +-- Format +TO_CHAR(created_at, 'YYYY-MM-DD') +``` + +**String functions:** +```sql +-- Concatenation +first_name || ' ' || last_name +CONCAT(first_name, ' ', last_name) + +-- Pattern matching +column ILIKE '%pattern%' -- case-insensitive +column ~ '^regex_pattern$' -- regex + +-- String manipulation +LEFT(str, n), RIGHT(str, n) +SPLIT_PART(str, delimiter, position) +REGEXP_REPLACE(str, pattern, replacement) +``` + +**Arrays and JSON:** +```sql +-- JSON access +data->>'key' -- text +data->'nested'->'key' -- json +data#>>'{path,to,key}' -- nested text + +-- Array operations +ARRAY_AGG(column) +ANY(array_column) +array_column @> ARRAY['value'] +``` + +**Performance tips:** +- Use `EXPLAIN ANALYZE` to profile queries +- Create indexes on frequently filtered/joined columns +- Use `EXISTS` over `IN` for correlated subqueries +- Partial indexes for common filter conditions +- Use connection pooling for concurrent access + +--- + +### Snowflake + +**Date/time:** +```sql +-- Current date/time +CURRENT_DATE(), CURRENT_TIMESTAMP(), SYSDATE() + +-- Date arithmetic +DATEADD(day, 7, date_column) +DATEDIFF(day, start_date, end_date) + +-- Truncate to period +DATE_TRUNC('month', created_at) + +-- Extract parts +YEAR(created_at), MONTH(created_at), DAY(created_at) +DAYOFWEEK(created_at) + +-- Format +TO_CHAR(created_at, 'YYYY-MM-DD') +``` + +**String functions:** +```sql +-- Case-insensitive by default (depends on collation) +column ILIKE '%pattern%' +REGEXP_LIKE(column, 'pattern') + +-- Parse JSON +column:key::string -- dot notation for VARIANT +PARSE_JSON('{"key": "value"}') +GET_PATH(variant_col, 'path.to.key') + +-- Flatten arrays/objects +SELECT f.value FROM table, LATERAL FLATTEN(input => array_col) f +``` + +**Semi-structured data:** +```sql +-- VARIANT type access +data:customer:name::STRING +data:items[0]:price::NUMBER + +-- Flatten nested structures +SELECT + t.id, + item.value:name::STRING as item_name, + item.value:qty::NUMBER as quantity 
+FROM my_table t, +LATERAL FLATTEN(input => t.data:items) item +``` + +**Performance tips:** +- Use clustering keys on large tables (not traditional indexes) +- Filter on clustering key columns for partition pruning +- Set appropriate warehouse size for query complexity +- Use `RESULT_SCAN(LAST_QUERY_ID())` to avoid re-running expensive queries +- Use transient tables for staging/temp data + +--- + +### BigQuery (Google Cloud) + +**Date/time:** +```sql +-- Current date/time +CURRENT_DATE(), CURRENT_TIMESTAMP() + +-- Date arithmetic +DATE_ADD(date_column, INTERVAL 7 DAY) +DATE_SUB(date_column, INTERVAL 1 MONTH) +DATE_DIFF(end_date, start_date, DAY) +TIMESTAMP_DIFF(end_ts, start_ts, HOUR) + +-- Truncate to period +DATE_TRUNC(created_at, MONTH) +TIMESTAMP_TRUNC(created_at, HOUR) + +-- Extract parts +EXTRACT(YEAR FROM created_at) +EXTRACT(DAYOFWEEK FROM created_at) -- 1=Sunday + +-- Format +FORMAT_DATE('%Y-%m-%d', date_column) +FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', ts_column) +``` + +**String functions:** +```sql +-- No ILIKE, use LOWER() +LOWER(column) LIKE '%pattern%' +REGEXP_CONTAINS(column, r'pattern') +REGEXP_EXTRACT(column, r'pattern') + +-- String manipulation +SPLIT(str, delimiter) -- returns ARRAY +ARRAY_TO_STRING(array, delimiter) +``` + +**Arrays and structs:** +```sql +-- Array operations +ARRAY_AGG(column) +UNNEST(array_column) +ARRAY_LENGTH(array_column) +value IN UNNEST(array_column) + +-- Struct access +struct_column.field_name +``` + +**Performance tips:** +- Always filter on partition columns (usually date) to reduce bytes scanned +- Use clustering for frequently filtered columns within partitions +- Use `APPROX_COUNT_DISTINCT()` for large-scale cardinality estimates +- Avoid `SELECT *` -- billing is per-byte scanned +- Use `DECLARE` and `SET` for parameterized scripts +- Preview query cost with dry run before executing large queries + +--- + +### Redshift (Amazon) + +**Date/time:** +```sql +-- Current date/time +CURRENT_DATE, GETDATE(), SYSDATE + +-- Date arithmetic +DATEADD(day, 7, date_column) +DATEDIFF(day, start_date, end_date) + +-- Truncate to period +DATE_TRUNC('month', created_at) + +-- Extract parts +EXTRACT(YEAR FROM created_at) +DATE_PART('dow', created_at) +``` + +**String functions:** +```sql +-- Case-insensitive +column ILIKE '%pattern%' +REGEXP_INSTR(column, 'pattern') > 0 + +-- String manipulation +SPLIT_PART(str, delimiter, position) +LISTAGG(column, ', ') WITHIN GROUP (ORDER BY column) +``` + +**Performance tips:** +- Design distribution keys for collocated joins (DISTKEY) +- Use sort keys for frequently filtered columns (SORTKEY) +- Use `EXPLAIN` to check query plan +- Avoid cross-node data movement (watch for DS_BCAST and DS_DIST) +- `ANALYZE` and `VACUUM` regularly +- Use late-binding views for schema flexibility + +--- + +### Databricks SQL + +**Date/time:** +```sql +-- Current date/time +CURRENT_DATE(), CURRENT_TIMESTAMP() + +-- Date arithmetic +DATE_ADD(date_column, 7) +DATEDIFF(end_date, start_date) +ADD_MONTHS(date_column, 1) + +-- Truncate to period +DATE_TRUNC('MONTH', created_at) +TRUNC(date_column, 'MM') + +-- Extract parts +YEAR(created_at), MONTH(created_at) +DAYOFWEEK(created_at) +``` + +**Delta Lake features:** +```sql +-- Time travel +SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15' +SELECT * FROM my_table VERSION AS OF 42 + +-- Describe history +DESCRIBE HISTORY my_table + +-- Merge (upsert) +MERGE INTO target USING source +ON target.id = source.id +WHEN MATCHED THEN UPDATE SET * +WHEN NOT MATCHED THEN INSERT * +``` + +**Performance 
tips:** +- Use Delta Lake's `OPTIMIZE` and `ZORDER` for query performance +- Leverage Photon engine for compute-intensive queries +- Use `CACHE TABLE` for frequently accessed datasets +- Partition by low-cardinality date columns + +--- + +## Common SQL Patterns + +### Window Functions + +```sql +-- Ranking +ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) +RANK() OVER (PARTITION BY category ORDER BY revenue DESC) +DENSE_RANK() OVER (ORDER BY score DESC) + +-- Running totals / moving averages +SUM(revenue) OVER (ORDER BY date_col ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total +AVG(revenue) OVER (ORDER BY date_col ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_avg_7d + +-- Lag / Lead +LAG(value, 1) OVER (PARTITION BY entity ORDER BY date_col) as prev_value +LEAD(value, 1) OVER (PARTITION BY entity ORDER BY date_col) as next_value + +-- First / Last value +FIRST_VALUE(status) OVER (PARTITION BY user_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) +LAST_VALUE(status) OVER (PARTITION BY user_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) + +-- Percent of total +revenue / SUM(revenue) OVER () as pct_of_total +revenue / SUM(revenue) OVER (PARTITION BY category) as pct_of_category +``` + +### CTEs for Readability + +```sql +WITH +-- Step 1: Define the base population +base_users AS ( + SELECT user_id, created_at, plan_type + FROM users + WHERE created_at >= DATE '2024-01-01' + AND status = 'active' +), + +-- Step 2: Calculate user-level metrics +user_metrics AS ( + SELECT + u.user_id, + u.plan_type, + COUNT(DISTINCT e.session_id) as session_count, + SUM(e.revenue) as total_revenue + FROM base_users u + LEFT JOIN events e ON u.user_id = e.user_id + GROUP BY u.user_id, u.plan_type +), + +-- Step 3: Aggregate to summary level +summary AS ( + SELECT + plan_type, + COUNT(*) as user_count, + AVG(session_count) as avg_sessions, + SUM(total_revenue) as total_revenue + FROM user_metrics + GROUP BY plan_type +) + +SELECT * FROM summary ORDER BY total_revenue DESC; +``` + +### Cohort Retention + +```sql +WITH cohorts AS ( + SELECT + user_id, + DATE_TRUNC('month', first_activity_date) as cohort_month + FROM users +), +activity AS ( + SELECT + user_id, + DATE_TRUNC('month', activity_date) as activity_month + FROM user_activity +) +SELECT + c.cohort_month, + COUNT(DISTINCT c.user_id) as cohort_size, + COUNT(DISTINCT CASE + WHEN a.activity_month = c.cohort_month THEN a.user_id + END) as month_0, + COUNT(DISTINCT CASE + WHEN a.activity_month = c.cohort_month + INTERVAL '1 month' THEN a.user_id + END) as month_1, + COUNT(DISTINCT CASE + WHEN a.activity_month = c.cohort_month + INTERVAL '3 months' THEN a.user_id + END) as month_3 +FROM cohorts c +LEFT JOIN activity a ON c.user_id = a.user_id +GROUP BY c.cohort_month +ORDER BY c.cohort_month; +``` + +### Funnel Analysis + +```sql +WITH funnel AS ( + SELECT + user_id, + MAX(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) as step_1_view, + MAX(CASE WHEN event = 'signup_start' THEN 1 ELSE 0 END) as step_2_start, + MAX(CASE WHEN event = 'signup_complete' THEN 1 ELSE 0 END) as step_3_complete, + MAX(CASE WHEN event = 'first_purchase' THEN 1 ELSE 0 END) as step_4_purchase + FROM events + WHERE event_date >= CURRENT_DATE - INTERVAL '30 days' + GROUP BY user_id +) +SELECT + COUNT(*) as total_users, + SUM(step_1_view) as viewed, + SUM(step_2_start) as started_signup, + SUM(step_3_complete) as completed_signup, + SUM(step_4_purchase) as purchased, + ROUND(100.0 * 
SUM(step_2_start) / NULLIF(SUM(step_1_view), 0), 1) as view_to_start_pct, + ROUND(100.0 * SUM(step_3_complete) / NULLIF(SUM(step_2_start), 0), 1) as start_to_complete_pct, + ROUND(100.0 * SUM(step_4_purchase) / NULLIF(SUM(step_3_complete), 0), 1) as complete_to_purchase_pct +FROM funnel; +``` + +### Deduplication + +```sql +-- Keep the most recent record per key +WITH ranked AS ( + SELECT + *, + ROW_NUMBER() OVER ( + PARTITION BY entity_id + ORDER BY updated_at DESC + ) as rn + FROM source_table +) +SELECT * FROM ranked WHERE rn = 1; +``` + +## Error Handling and Debugging + +When a query fails: + +1. **Syntax errors**: Check for dialect-specific syntax (e.g., `ILIKE` not available in BigQuery, `SAFE_DIVIDE` only in BigQuery) +2. **Column not found**: Verify column names against schema -- check for typos, case sensitivity (PostgreSQL is case-sensitive for quoted identifiers) +3. **Type mismatches**: Cast explicitly when comparing different types (`CAST(col AS DATE)`, `col::DATE`) +4. **Division by zero**: Use `NULLIF(denominator, 0)` or dialect-specific safe division +5. **Ambiguous columns**: Always qualify column names with table alias in JOINs +6. **Group by errors**: All non-aggregated columns must be in GROUP BY (except in BigQuery which allows grouping by alias) diff --git a/code_puppy/bundled_skills/Data/statistical-analysis/SKILL.md b/code_puppy/bundled_skills/Data/statistical-analysis/SKILL.md new file mode 100644 index 00000000..c408d856 --- /dev/null +++ b/code_puppy/bundled_skills/Data/statistical-analysis/SKILL.md @@ -0,0 +1,244 @@ +--- +name: statistical-analysis +description: Apply statistical methods including descriptive stats, trend analysis, outlier detection, and hypothesis testing. Use when analyzing distributions, testing for significance, detecting anomalies, computing correlations, or interpreting statistical results. +--- + +# Statistical Analysis Skill + +Descriptive statistics, trend analysis, outlier detection, hypothesis testing, and guidance on when to be cautious about statistical claims. + +## Descriptive Statistics Methodology + +### Central Tendency + +Choose the right measure of center based on the data: + +| Situation | Use | Why | +|---|---|---| +| Symmetric distribution, no outliers | Mean | Most efficient estimator | +| Skewed distribution | Median | Robust to outliers | +| Categorical or ordinal data | Mode | Only option for non-numeric | +| Highly skewed with outliers (e.g., revenue per user) | Median + mean | Report both; the gap shows skew | + +**Always report mean and median together for business metrics.** If they diverge significantly, the data is skewed and the mean alone is misleading. + +### Spread and Variability + +- **Standard deviation**: How far values typically fall from the mean. Use with normally distributed data. +- **Interquartile range (IQR)**: Distance from p25 to p75. Robust to outliers. Use with skewed data. +- **Coefficient of variation (CV)**: StdDev / Mean. Use to compare variability across metrics with different scales. +- **Range**: Max minus min. Sensitive to outliers but gives a quick sense of data extent. 
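A minimal pandas sketch of these spread measures (assuming a numeric column `df['value']`, as used in the examples below):

```python
import pandas as pd

def spread_summary(s: pd.Series) -> dict:
    """Compute the spread measures described above for one numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    mean = s.mean()
    return {
        'mean': mean,
        'median': s.median(),           # a large mean-median gap signals skew
        'std_dev': s.std(),
        'iqr': q3 - q1,                 # robust spread for skewed data
        'cv': s.std() / mean if mean else float('nan'),
        'range': s.max() - s.min(),
    }

summary = spread_summary(df['value'])
```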
+ +### Percentiles for Business Context + +Report key percentiles to tell a richer story than mean alone: + +``` +p1: Bottom 1% (floor / minimum typical value) +p5: Low end of normal range +p25: First quartile +p50: Median (typical user) +p75: Third quartile +p90: Top 10% / power users +p95: High end of normal range +p99: Top 1% / extreme users +``` + +**Example narrative**: "The median session duration is 4.2 minutes, but the top 10% of users spend over 22 minutes per session, pulling the mean up to 7.8 minutes." + +### Describing Distributions + +Characterize every numeric distribution you analyze: + +- **Shape**: Normal, right-skewed, left-skewed, bimodal, uniform, heavy-tailed +- **Center**: Mean and median (and the gap between them) +- **Spread**: Standard deviation or IQR +- **Outliers**: How many and how extreme +- **Bounds**: Is there a natural floor (zero) or ceiling (100%)? + +## Trend Analysis and Forecasting + +### Identifying Trends + +**Moving averages** to smooth noise: +```python +# 7-day moving average (good for daily data with weekly seasonality) +df['ma_7d'] = df['metric'].rolling(window=7, min_periods=1).mean() + +# 28-day moving average (smooths weekly AND monthly patterns) +df['ma_28d'] = df['metric'].rolling(window=28, min_periods=1).mean() +``` + +**Period-over-period comparison**: +- Week-over-week (WoW): Compare to same day last week +- Month-over-month (MoM): Compare to same month prior +- Year-over-year (YoY): Gold standard for seasonal businesses +- Same-day-last-year: Compare specific calendar day + +**Growth rates**: +``` +Simple growth: (current - previous) / previous +CAGR: (ending / beginning) ^ (1 / years) - 1 +Log growth: ln(current / previous) -- better for volatile series +``` + +### Seasonality Detection + +Check for periodic patterns: +1. Plot the raw time series -- visual inspection first +2. Compute day-of-week averages: is there a clear weekly pattern? +3. Compute month-of-year averages: is there an annual cycle? +4. When comparing periods, always use YoY or same-period comparisons to avoid conflating trend with seasonality + +### Forecasting (Simple Methods) + +For business analysts (not data scientists), use straightforward methods: + +- **Naive forecast**: Tomorrow = today. Use as a baseline. +- **Seasonal naive**: Tomorrow = same day last week/year. +- **Linear trend**: Fit a line to historical data. Only for clearly linear trends. +- **Moving average forecast**: Use trailing average as the forecast. + +**Always communicate uncertainty**. Provide a range, not a point estimate: +- "We expect 10K-12K signups next month based on the 3-month trend" +- NOT "We will get exactly 11,234 signups next month" + +**When to escalate to a data scientist**: Non-linear trends, multiple seasonalities, external factors (marketing spend, holidays), or when forecast accuracy matters for resource allocation. 
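A minimal sketch of these baseline forecasts, assuming a daily `df['metric']` series as in the moving-average examples above (column names and windows are illustrative):

```python
# Baseline forecasts -- always report a range, not a single number
naive = df['metric'].iloc[-1]                # naive: tomorrow = today
seasonal_naive = df['metric'].iloc[-7]       # seasonal naive: same weekday last week
trailing_avg = df['metric'].tail(28).mean()  # moving-average forecast

# One simple way to express uncertainty: a band based on recent variability
recent_std = df['metric'].tail(28).std()
low, high = trailing_avg - recent_std, trailing_avg + recent_std
print(f"Expect roughly {low:,.0f} to {high:,.0f} next day (28-day average ± 1 std dev)")
```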
+ +## Outlier and Anomaly Detection + +### Statistical Methods + +**Z-score method** (for normally distributed data): +```python +z_scores = (df['value'] - df['value'].mean()) / df['value'].std() +outliers = df[abs(z_scores) > 3] # More than 3 standard deviations +``` + +**IQR method** (robust to non-normal distributions): +```python +Q1 = df['value'].quantile(0.25) +Q3 = df['value'].quantile(0.75) +IQR = Q3 - Q1 +lower_bound = Q1 - 1.5 * IQR +upper_bound = Q3 + 1.5 * IQR +outliers = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)] +``` + +**Percentile method** (simplest): +```python +outliers = df[(df['value'] < df['value'].quantile(0.01)) | + (df['value'] > df['value'].quantile(0.99))] +``` + +### Handling Outliers + +Do NOT automatically remove outliers. Instead: + +1. **Investigate**: Is this a data error, a genuine extreme value, or a different population? +2. **Data errors**: Fix or remove (e.g., negative ages, timestamps in year 1970) +3. **Genuine extremes**: Keep them but consider using robust statistics (median instead of mean) +4. **Different population**: Segment them out for separate analysis (e.g., enterprise vs. SMB customers) + +**Report what you did**: "We excluded 47 records (0.3%) with transaction amounts >$50K, which represent bulk enterprise orders analyzed separately." + +### Time Series Anomaly Detection + +For detecting unusual values in a time series: + +1. Compute expected value (moving average or same-period-last-year) +2. Compute deviation from expected +3. Flag deviations beyond a threshold (typically 2-3 standard deviations of the residuals) +4. Distinguish between point anomalies (single unusual value) and change points (sustained shift) + +## Hypothesis Testing Basics + +### When to Use + +Use hypothesis testing when you need to determine whether an observed difference is likely real or could be due to random chance. Common scenarios: + +- A/B test results: Is variant B actually better than A? +- Before/after comparison: Did the product change actually move the metric? +- Segment comparison: Do enterprise customers really have higher retention? + +### The Framework + +1. **Null hypothesis (H0)**: There is no difference (the default assumption) +2. **Alternative hypothesis (H1)**: There is a difference +3. **Choose significance level (alpha)**: Typically 0.05 (5% chance of false positive) +4. **Compute test statistic and p-value** +5. **Interpret**: If p < alpha, reject H0 (evidence of a real difference) + +### Common Tests + +| Scenario | Test | When to Use | +|---|---|---| +| Compare two group means | t-test (independent) | Normal data, two groups | +| Compare two group proportions | z-test for proportions | Conversion rates, binary outcomes | +| Compare paired measurements | Paired t-test | Before/after on same entities | +| Compare 3+ group means | ANOVA | Multiple segments or variants | +| Non-normal data, two groups | Mann-Whitney U test | Skewed metrics, ordinal data | +| Association between categories | Chi-squared test | Two categorical variables | + +### Practical Significance vs. Statistical Significance + +**Statistical significance** means the difference is unlikely due to chance. + +**Practical significance** means the difference is large enough to matter for business decisions. + +A difference can be statistically significant but practically meaningless (common with large samples). Always report: +- **Effect size**: How big is the difference? 
(e.g., "Variant B improved conversion by 0.3 percentage points") +- **Confidence interval**: What's the range of plausible true effects? +- **Business impact**: What does this translate to in revenue, users, or other business terms? + +### Sample Size Considerations + +- Small samples produce unreliable results, even with significant p-values +- Rule of thumb for proportions: Need at least 30 events per group for basic reliability +- For detecting small effects (e.g., 1% conversion rate change), you may need thousands of observations per group +- If your sample is small, say so: "With only 200 observations per group, we have limited power to detect effects smaller than X%" + +## When to Be Cautious About Statistical Claims + +### Correlation Is Not Causation + +When you find a correlation, explicitly consider: +- **Reverse causation**: Maybe B causes A, not A causes B +- **Confounding variables**: Maybe C causes both A and B +- **Coincidence**: With enough variables, spurious correlations are inevitable + +**What you can say**: "Users who use feature X have 30% higher retention" +**What you cannot say without more evidence**: "Feature X causes 30% higher retention" + +### Multiple Comparisons Problem + +When you test many hypotheses, some will be "significant" by chance: +- Testing 20 metrics at p=0.05 means ~1 will be falsely significant +- If you looked at many segments before finding one that's different, note that +- Adjust for multiple comparisons with Bonferroni correction (divide alpha by number of tests) or report how many tests were run + +### Simpson's Paradox + +A trend in aggregated data can reverse when data is segmented: +- Always check whether the conclusion holds across key segments +- Example: Overall conversion goes up, but conversion goes down in every segment -- because the mix shifted toward a higher-converting segment + +### Survivorship Bias + +You can only analyze entities that "survived" to be in your dataset: +- Analyzing active users ignores those who churned +- Analyzing successful companies ignores those that failed +- Always ask: "Who is missing from this dataset, and would their inclusion change the conclusion?" + +### Ecological Fallacy + +Aggregate trends may not apply to individuals: +- "Countries with higher X have higher Y" does NOT mean "individuals with higher X have higher Y" +- Be careful about applying group-level findings to individual cases + +### Anchoring on Specific Numbers + +Be wary of false precision: +- "Churn will be 4.73% next quarter" implies more certainty than is warranted +- Prefer ranges: "We expect churn between 4-6% based on historical patterns" +- Round appropriately: "About 5%" is often more honest than "4.73%" diff --git a/code_puppy/bundled_skills/Finance/audit-support/SKILL.md b/code_puppy/bundled_skills/Finance/audit-support/SKILL.md new file mode 100644 index 00000000..a5ba6262 --- /dev/null +++ b/code_puppy/bundled_skills/Finance/audit-support/SKILL.md @@ -0,0 +1,373 @@ +--- +name: audit-support +description: Support SOX 404 compliance with control testing methodology, sample selection, and documentation standards. Use when generating testing workpapers, selecting audit samples, classifying control deficiencies, or preparing for internal or external audits. +--- + +# Audit Support + +**Important**: This skill assists with SOX compliance workflows but does not provide audit or legal advice. All testing workpapers and assessments should be reviewed by qualified financial professionals. 
While "significance" and "materiality" are context-specific concepts that are ultimately assessed by auditors, this skill is intended to assist professionals in the creation and evaluation of effective internal controls and documentation for audits. + +SOX 404 control testing methodology, sample selection approaches, testing documentation standards, control deficiency classification, and common control types. + +## SOX 404 Control Testing Methodology + +### Overview + +SOX Section 404 requires management to assess the effectiveness of internal controls over financial reporting (ICFR). This involves: + +1. **Scoping:** Identify significant accounts and relevant assertions +2. **Risk assessment:** Evaluate the risk of material misstatement for each significant account +3. **Control identification:** Document the controls that address each risk +4. **Testing:** Test the design and operating effectiveness of key controls +5. **Evaluation:** Assess whether any deficiencies exist and their severity +6. **Reporting:** Document the assessment and any material weaknesses + +### Scoping Significant Accounts + +An account is significant if there is more than a remote likelihood that it could contain a misstatement that is material (individually or in aggregate). + +**Quantitative factors:** +- Account balance exceeds materiality threshold (typically 3-5% of a key benchmark) +- Transaction volume is high, increasing the risk of error +- Account is subject to significant estimates or judgment + +**Qualitative factors:** +- Account involves complex accounting (revenue recognition, derivatives, pensions) +- Account is susceptible to fraud (cash, revenue, related-party transactions) +- Account has had prior misstatements or audit adjustments +- Account involves significant management judgment or estimates +- New account or significantly changed process + +### Relevant Assertions by Account Type + +| Account Type | Key Assertions | +|-------------|---------------| +| Revenue | Occurrence, Completeness, Accuracy, Cut-off | +| Accounts Receivable | Existence, Valuation (allowance), Rights | +| Inventory | Existence, Valuation, Completeness | +| Fixed Assets | Existence, Valuation, Completeness, Rights | +| Accounts Payable | Completeness, Accuracy, Existence | +| Accrued Liabilities | Completeness, Valuation, Accuracy | +| Equity | Completeness, Accuracy, Presentation | +| Financial Close/Reporting | Presentation, Accuracy, Completeness | + +### Design Effectiveness vs Operating Effectiveness + +**Design effectiveness:** Is the control properly designed to prevent or detect a material misstatement in the relevant assertion? +- Evaluated through walkthroughs (trace a transaction end-to-end through the process) +- Confirm the control is placed at the right point in the process +- Confirm the control addresses the identified risk +- Performed at least annually, or when processes change + +**Operating effectiveness:** Did the control actually operate as designed throughout the testing period? +- Evaluated through testing (inspection, observation, re-performance, inquiry) +- Requires sufficient sample sizes to support a conclusion +- Must cover the full period of reliance + +## Sample Selection Approaches + +### Random Selection + +**When to use:** Default method for transaction-level controls with large populations. + +**Method:** +1. Define the population (all transactions subject to the control during the period) +2. Number each item in the population sequentially +3. 
Use a random number generator to select sample items +4. Ensure no bias in selection (all items have equal probability) + +**Advantages:** Statistically valid, defensible, no selection bias +**Disadvantages:** May miss high-risk items, requires complete population listing + +### Targeted (Judgmental) Selection + +**When to use:** Supplement to random selection for risk-based testing; primary method when population is small or highly varied. + +**Method:** +1. Identify items with specific risk characteristics: + - High dollar amount (above a defined threshold) + - Unusual or non-standard transactions + - Period-end transactions (cut-off risk) + - Related-party transactions + - Manual or override transactions + - New vendor/customer transactions +2. Select items matching risk criteria +3. Document rationale for each targeted selection + +**Advantages:** Focuses on highest-risk items, efficient use of testing effort +**Disadvantages:** Not statistically representative, may over-represent certain risks + +### Haphazard Selection + +**When to use:** When random selection is impractical (no sequential population listing) and population is relatively homogeneous. + +**Method:** +1. Select items without any specific pattern or bias +2. Ensure selections are spread across the full population period +3. Avoid unconscious bias (don't always pick items at the top, round numbers, etc.) + +**Advantages:** Simple, no technology required +**Disadvantages:** Not statistically valid, susceptible to unconscious bias + +### Systematic Selection + +**When to use:** When population is sequential and you want even coverage across the period. + +**Method:** +1. Calculate the sampling interval: Population size / Sample size +2. Select a random starting point within the first interval +3. Select every Nth item from the starting point + +**Example:** Population of 1,000, sample of 25 → interval of 40. Random start: item 17. Select items 17, 57, 97, 137, ... + +**Advantages:** Even coverage across population, simple to execute +**Disadvantages:** Periodic patterns in the population could bias results + +### Sample Size Guidance + +| Control Frequency | Expected Population | Low Risk Sample | Moderate Risk Sample | High Risk Sample | +|------------------|--------------------|-----------------|--------------------|-----------------| +| Annual | 1 | 1 | 1 | 1 | +| Quarterly | 4 | 2 | 2 | 3 | +| Monthly | 12 | 2 | 3 | 4 | +| Weekly | 52 | 5 | 8 | 15 | +| Daily | ~250 | 20 | 30 | 40 | +| Per-transaction (small pop.) | < 250 | 20 | 30 | 40 | +| Per-transaction (large pop.) | 250+ | 25 | 40 | 60 | + +**Factors increasing sample size:** +- Higher inherent risk in the account/process +- Control is the sole control addressing a significant risk (no redundancy) +- Prior period control deficiency identified +- New control (not tested in prior periods) +- External auditor reliance on management testing + +## Testing Documentation Standards + +### Workpaper Requirements + +Every control test should be documented with: + +1. **Control identification:** + - Control number/ID + - Control description (what is done, by whom, how often) + - Control type (manual, automated, IT-dependent manual) + - Control frequency + - Risk and assertion addressed + +2. **Test design:** + - Test objective (what you are trying to determine) + - Test procedures (step-by-step instructions) + - Expected evidence (what you expect to see if the control is effective) + - Sample selection methodology and rationale + +3. 
**Test execution:** + - Population description and size + - Sample selection details (method, items selected) + - Results for each sample item (pass/fail with specific evidence examined) + - Exceptions noted with full description + +4. **Conclusion:** + - Overall assessment (effective / deficiency / significant deficiency / material weakness) + - Basis for conclusion + - Impact assessment for any exceptions + - Compensating controls considered (if applicable) + +5. **Sign-off:** + - Tester name and date + - Reviewer name and date + +### Evidence Standards + +**Sufficient evidence includes:** +- Screenshots showing system-enforced controls +- Signed/initialed approval documents +- Email approvals with identifiable approver and date +- System audit logs showing who performed the action and when +- Re-performed calculations with matching results +- Observation notes (with date, location, observer) + +**Insufficient evidence:** +- Verbal confirmations alone (must be corroborated) +- Undated documents +- Evidence without identifiable performer/approver +- Generic system reports without date/time stamps +- "Per discussion with [name]" without corroborating documentation + +### Working Paper Organization + +Organize testing files by control area: + +``` +SOX Testing/ +├── [Year]/ +│ ├── Scoping and Risk Assessment/ +│ ├── Revenue Cycle/ +│ │ ├── Control Matrix +│ │ ├── Walkthrough Documentation +│ │ ├── Test Workpapers (one per control) +│ │ └── Supporting Evidence +│ ├── Procure to Pay/ +│ ├── Payroll/ +│ ├── Financial Close/ +│ ├── Treasury/ +│ ├── Fixed Assets/ +│ ├── IT General Controls/ +│ ├── Entity Level Controls/ +│ └── Summary and Conclusions/ +│ ├── Deficiency Evaluation +│ └── Management Assessment +``` + +## Control Deficiency Classification + +### Deficiency + +A deficiency in internal control exists when the design or operation of a control does not allow management or employees, in the normal course of performing their assigned functions, to prevent or detect misstatements on a timely basis. + +**Evaluation factors:** +- What is the likelihood that the control failure could result in a misstatement? +- What is the magnitude of the potential misstatement? +- Is there a compensating control that mitigates the deficiency? + +### Significant Deficiency + +A deficiency, or combination of deficiencies, that is less severe than a material weakness yet important enough to merit attention by those charged with governance. + +**Indicators:** +- The deficiency could result in a misstatement that is more than inconsequential but less than material +- There is more than a remote (but less than reasonably possible) likelihood of a material misstatement +- The control is a key control and the deficiency is not fully mitigated by compensating controls +- Combination of individually minor deficiencies that together represent a significant concern + +### Material Weakness + +A deficiency, or combination of deficiencies, such that there is a reasonable possibility that a material misstatement of the financial statements will not be prevented or detected on a timely basis. 
+ +**Indicators:** +- Identification of fraud by senior management (any magnitude) +- Restatement of previously issued financial statements to correct a material error +- Identification by the auditor of a material misstatement that would not have been detected by the company's controls +- Ineffective oversight of financial reporting by the audit committee +- Deficiency in a pervasive control (entity-level, IT general control) affecting multiple processes + +### Deficiency Aggregation + +Individual deficiencies that are not significant individually may be significant in combination: + +1. Identify all deficiencies in the same process or affecting the same assertion +2. Evaluate whether the combined effect could result in a material misstatement +3. Consider whether deficiencies in compensating controls exacerbate other deficiencies +4. Document the aggregation analysis and conclusion + +### Remediation + +For each identified deficiency: + +1. **Root cause analysis:** Why did the control fail? (design gap, execution failure, staffing, training, system issue) +2. **Remediation plan:** Specific actions to fix the control (redesign, additional training, system enhancement, added review) +3. **Timeline:** Target date for remediation completion +4. **Owner:** Person responsible for implementing the remediation +5. **Validation:** How and when the remediated control will be re-tested to confirm effectiveness + +## Common Control Types + +### IT General Controls (ITGCs) + +Controls over the IT environment that support the reliable functioning of application controls and automated processes. + +**Access Controls:** +- User access provisioning (new access requests require approval) +- User access de-provisioning (terminated users removed timely) +- Privileged access management (admin/superuser access restricted and monitored) +- Periodic access reviews (user access recertified on a defined schedule) +- Password policies (complexity, rotation, lockout) +- Segregation of duties enforcement (conflicting access prevented) + +**Change Management:** +- Change requests documented and approved before implementation +- Changes tested in a non-production environment before promotion +- Separation of development and production environments +- Emergency change procedures (documented, approved post-implementation) +- Change review and post-implementation validation + +**IT Operations:** +- Batch job monitoring and exception handling +- Backup and recovery procedures (regular backups, tested restores) +- System availability and performance monitoring +- Incident management and escalation procedures +- Disaster recovery planning and testing + +### Manual Controls + +Controls performed by people using judgment, typically involving review and approval. + +**Examples:** +- Management review of financial statements and key metrics +- Supervisory approval of journal entries above a threshold +- Three-way match verification (PO, receipt, invoice) +- Account reconciliation preparation and review +- Physical inventory observation and count +- Vendor master data change approval +- Customer credit approval + +**Key attributes to test:** +- Was the control performed by the right person (proper authority)? +- Was it performed timely (within the required timeframe)? +- Is there evidence of the review (signature, initials, email, system log)? +- Did the reviewer have sufficient information to perform an effective review? +- Were exceptions identified and appropriately addressed? 
+ +### Automated Controls + +Controls enforced by IT systems without human intervention. + +**Examples:** +- System-enforced approval workflows (cannot proceed without required approvals) +- Three-way match automation (system blocks payment if PO/receipt/invoice don't match) +- Duplicate payment detection (system flags or blocks duplicate invoices) +- Credit limit enforcement (system prevents orders exceeding credit limit) +- Automated calculations (depreciation, amortization, interest, tax) +- System-enforced segregation of duties (conflicting roles prevented) +- Input validation controls (required fields, format checks, range checks) +- Automated reconciliation matching + +**Testing approach:** +- Test design: Confirm the system configuration enforces the control as intended +- Test operating effectiveness: For automated controls, if the system configuration has not changed, one test of the control is typically sufficient for the period (supplemented by ITGC testing of change management) +- Verify change management ITGCs are effective (if system changed, re-test the control) + +### IT-Dependent Manual Controls + +Manual controls that rely on the completeness and accuracy of system-generated information. + +**Examples:** +- Management review of a system-generated exception report +- Supervisor review of a system-generated aging report to assess reserves +- Reconciliation using system-generated trial balance data +- Approval of transactions identified by a system-generated workflow + +**Testing approach:** +- Test the manual control (review, approval, follow-up on exceptions) +- AND test the completeness and accuracy of the underlying report/data (IPE — Information Produced by the Entity) +- IPE testing confirms the data the reviewer relied on was complete and accurate + +### Entity-Level Controls + +Broad controls that operate at the organizational level and affect multiple processes. + +**Examples:** +- Tone at the top / code of conduct +- Risk assessment process +- Audit committee oversight of financial reporting +- Internal audit function and activities +- Fraud risk assessment and anti-fraud programs +- Whistleblower/ethics hotline +- Management monitoring of control effectiveness +- Financial reporting competence (staffing, training, qualifications) +- Period-end financial reporting process (close procedures, GAAP compliance reviews) + +**Significance:** +- Entity-level controls can mitigate but typically cannot replace process-level controls +- Ineffective entity-level controls (especially audit committee oversight and tone at the top) are strong indicators of a material weakness +- Effective entity-level controls may reduce the extent of testing needed for process-level controls diff --git a/code_puppy/bundled_skills/Finance/close-management/SKILL.md b/code_puppy/bundled_skills/Finance/close-management/SKILL.md new file mode 100644 index 00000000..7edf7e26 --- /dev/null +++ b/code_puppy/bundled_skills/Finance/close-management/SKILL.md @@ -0,0 +1,220 @@ +--- +name: close-management +description: Manage the month-end close process with task sequencing, dependencies, and status tracking. Use when planning the close calendar, tracking close progress, identifying blockers, or sequencing close activities by day. +--- + +# Close Management + +**Important**: This skill assists with close management workflows but does not provide financial advice. All close activities should be reviewed by qualified financial professionals. 
+ +Month-end close checklist, task sequencing and dependencies, status tracking, and common close activities organized by day. + +## Month-End Close Checklist + +### Pre-Close (Last 2-3 Business Days of the Month) + +- [ ] Send close calendar and deadline reminders to all contributors +- [ ] Confirm cut-off procedures with AP, AR, payroll, and treasury +- [ ] Verify all sub-systems are processing normally (ERP, payroll, banking) +- [ ] Complete preliminary bank reconciliation (all but last-day activity) +- [ ] Review open purchase orders for potential accrual needs +- [ ] Confirm payroll processing schedule aligns with close timeline +- [ ] Collect information for any known unusual transactions + +### Close Day 1 (T+1: First Business Day After Month-End) + +- [ ] Confirm all sub-ledger modules have completed period-end processing +- [ ] Run AP accruals for goods/services received but not invoiced +- [ ] Post payroll entries and payroll accrual (if pay period straddles month-end) +- [ ] Record cash receipts and disbursements through month-end +- [ ] Post intercompany transactions and confirm with counterparties +- [ ] Complete bank reconciliation with final bank statement +- [ ] Run fixed asset depreciation +- [ ] Post prepaid expense amortization + +### Close Day 2 (T+2) + +- [ ] Complete revenue recognition entries and deferred revenue adjustments +- [ ] Post all remaining accrual journal entries +- [ ] Complete AR subledger reconciliation +- [ ] Complete AP subledger reconciliation +- [ ] Record inventory adjustments (if applicable) +- [ ] Post FX revaluation entries for foreign currency balances +- [ ] Begin balance sheet account reconciliations + +### Close Day 3 (T+3) + +- [ ] Complete all balance sheet reconciliations +- [ ] Post any adjusting journal entries identified during reconciliation +- [ ] Complete intercompany reconciliation and elimination entries +- [ ] Run preliminary trial balance and income statement +- [ ] Perform preliminary flux analysis on income statement +- [ ] Investigate and resolve material variances + +### Close Day 4 (T+4) + +- [ ] Post tax provision entries (income tax, sales tax, property tax) +- [ ] Complete equity roll-forward (stock compensation, treasury stock) +- [ ] Finalize all journal entries — soft close +- [ ] Generate draft financial statements (P&L, BS, CF) +- [ ] Perform detailed flux analysis and prepare variance explanations +- [ ] Management review of financial statements and key metrics + +### Close Day 5 (T+5) + +- [ ] Post any final adjustments from management review +- [ ] Finalize financial statements — hard close +- [ ] Lock the period in the ERP/GL system +- [ ] Distribute financial reporting package to stakeholders +- [ ] Update forecasts/projections based on actual results +- [ ] Conduct close retrospective — identify process improvements + +## Task Sequencing and Dependencies + +### Dependency Map + +Tasks are organized by what must complete before the next task can begin: + +``` +LEVEL 1 (No dependencies — can start immediately at T+1): +├── Cash receipts/disbursements recording +├── Bank statement retrieval +├── Payroll processing/accrual +├── Fixed asset depreciation run +├── Prepaid amortization +├── AP accrual preparation +└── Intercompany transaction posting + +LEVEL 2 (Depends on Level 1 completion): +├── Bank reconciliation (needs: cash entries + bank statement) +├── Revenue recognition (needs: billing/delivery data finalized) +├── AR subledger reconciliation (needs: all revenue/cash entries) +├── AP subledger 
reconciliation (needs: all AP entries/accruals) +├── FX revaluation (needs: all foreign currency entries posted) +└── Remaining accrual JEs (needs: review of all source data) + +LEVEL 3 (Depends on Level 2 completion): +├── All balance sheet reconciliations (needs: all JEs posted) +├── Intercompany reconciliation (needs: both sides posted) +├── Adjusting entries from reconciliations +└── Preliminary trial balance + +LEVEL 4 (Depends on Level 3 completion): +├── Tax provision (needs: pre-tax income finalized) +├── Equity roll-forward +├── Consolidation and eliminations +├── Draft financial statements +└── Preliminary flux analysis + +LEVEL 5 (Depends on Level 4 completion): +├── Management review +├── Final adjustments +├── Hard close / period lock +├── Financial reporting package +└── Forecast updates +``` + +### Critical Path + +The critical path determines the minimum close duration. Typical critical path: + +``` +Cash/AP/AR entries → Subledger reconciliations → Balance sheet recs → + Tax provision → Draft financials → Management review → Hard close +``` + +To shorten the close: +- Automate Level 1 entries (depreciation, prepaid amortization, standard accruals) +- Pre-reconcile accounts during the month (continuous reconciliation) +- Parallel-process independent reconciliations +- Set clear deadlines with consequences for late submissions +- Use standardized templates to reduce reconciliation prep time + +## Status Tracking and Reporting + +### Close Status Dashboard + +Track each close task with the following attributes: + +| Task | Owner | Deadline | Status | Blocker | Notes | +|------|-------|----------|--------|---------|-------| +| [Task name] | [Person/role] | [Day T+N] | Not Started / In Progress / Complete / Blocked | [If blocked, what's blocking] | [Any notes] | + +### Status Definitions + +- **Not Started:** Task has not yet begun (may be waiting on dependencies) +- **In Progress:** Task is actively being worked on +- **Complete:** Task is finished and has been reviewed/approved +- **Blocked:** Task cannot proceed due to a dependency, missing data, or issue +- **At Risk:** Task is in progress but may not meet its deadline + +### Daily Close Status Meeting (Recommended) + +During the close period, hold a brief (15-minute) daily standup: + +1. **Review status board:** Walk through open tasks, flag any that are behind +2. **Identify blockers:** Surface any issues preventing task completion +3. **Reassign or escalate:** Adjust ownership or escalate blockers to resolve quickly +4. 
**Update timeline:** If any tasks are at risk, assess impact on overall close timeline + +### Close Metrics to Track Over Time + +| Metric | Definition | Target | +|--------|-----------|--------| +| Close duration | Business days from period end to hard close | Reduce over time | +| # of adjusting entries after soft close | Entries posted during management review | Minimize | +| # of late tasks | Tasks completed after their deadline | Zero | +| # of reconciliation exceptions | Reconciling items requiring investigation | Reduce over time | +| # of restatements / corrections | Errors found after close | Zero | + +## Common Close Activities by Day + +### Typical 5-Day Close Calendar + +| Day | Key Activities | Responsible | +|-----|---------------|-------------| +| **T+1** | Cash entries, payroll, AP accruals, depreciation, prepaid amortization, intercompany posting | Staff accountants, payroll | +| **T+2** | Revenue recognition, remaining accruals, subledger reconciliations (AR, AP, FA), FX revaluation | Revenue accountant, AP/AR, treasury | +| **T+3** | Balance sheet reconciliations, intercompany reconciliation, eliminations, preliminary trial balance, preliminary flux | Accounting team, consolidation | +| **T+4** | Tax provision, equity roll-forward, draft financial statements, detailed flux analysis, management review | Tax, controller, FP&A | +| **T+5** | Final adjustments, hard close, period lock, reporting package distribution, forecast update, retrospective | Controller, FP&A, finance leadership | + +### Accelerated Close (3-Day Target) + +For organizations targeting a faster close: + +| Day | Key Activities | +|-----|---------------| +| **T+1** | All JEs posted (automated + manual), all subledger reconciliations, bank reconciliation, intercompany reconciliation, preliminary trial balance | +| **T+2** | All balance sheet reconciliations, tax provision, consolidation, draft financial statements, flux analysis, management review | +| **T+3** | Final adjustments, hard close, reporting package, forecast update | + +**Prerequisites for a 3-day close:** +- Automated recurring journal entries (depreciation, amortization, standard accruals) +- Continuous reconciliation during the month (not all at month-end) +- Automated intercompany elimination +- Pre-close activities completed before month-end (cut-off, accrual estimates) +- Empowered team with clear ownership and minimal handoffs +- Real-time or near-real-time sub-system integration + +## Close Process Improvement + +### Common Bottlenecks and Solutions + +| Bottleneck | Root Cause | Solution | +|-----------|-----------|---------| +| Late AP accruals | Waiting for department spend confirmation | Implement continuous accrual estimation; set cut-off deadlines | +| Manual journal entries | Recurring entries prepared manually each month | Automate standard recurring entries in the ERP | +| Slow reconciliations | Starting from scratch each month | Implement continuous/rolling reconciliation | +| Intercompany delays | Waiting for counterparty confirmation | Automate intercompany matching; set stricter deadlines | +| Management review changes | Large adjustments found during review | Improve preliminary review process; empower team to catch issues earlier | +| Missing supporting documents | Scrambling for documentation at close | Maintain documentation throughout the month | + +### Close Retrospective Questions + +After each close, ask: +1. What went well this close that we should continue? +2. What took longer than expected and why? +3. 
What blockers did we encounter and how can we prevent them? +4. Were there any surprises in the financial results we should have caught earlier? +5. What can we automate or streamline for next month? diff --git a/code_puppy/bundled_skills/Finance/financial-statements/SKILL.md b/code_puppy/bundled_skills/Finance/financial-statements/SKILL.md new file mode 100644 index 00000000..452dcd98 --- /dev/null +++ b/code_puppy/bundled_skills/Finance/financial-statements/SKILL.md @@ -0,0 +1,261 @@ +--- +name: financial-statements +description: Generate income statements, balance sheets, and cash flow statements with GAAP presentation and period-over-period comparison. Use when preparing financial statements, running flux analysis, or creating P&L reports with variance commentary. +--- + +# Financial Statements + +**Important**: This skill assists with financial statement workflows but does not provide financial advice. All statements should be reviewed by qualified financial professionals before use in reporting or filings. + +Formats, GAAP presentation requirements, common adjustments, and flux analysis methodology for income statements, balance sheets, and cash flow statements. + +## Income Statement + +### Standard Format (Classification of Expenses by Function) + +``` +Revenue + Product revenue + Service revenue + Other revenue +Total Revenue + +Cost of Revenue + Product costs + Service costs +Total Cost of Revenue + +Gross Profit + +Operating Expenses + Research and development + Sales and marketing + General and administrative +Total Operating Expenses + +Operating Income (Loss) + +Other Income (Expense) + Interest income + Interest expense + Other income (expense), net +Total Other Income (Expense) + +Income (Loss) Before Income Taxes + Income tax expense (benefit) +Net Income (Loss) + +Earnings Per Share (if applicable) + Basic + Diluted +``` + +### GAAP Presentation Requirements (ASC 220 / IAS 1) + +- Present all items of income and expense recognized in a period +- Classify expenses either by nature (materials, labor, depreciation) or by function (COGS, R&D, S&M, G&A) — function is more common for US companies +- If classified by function, disclose depreciation, amortization, and employee benefit costs by nature in the notes +- Present operating and non-operating items separately +- Show income tax expense as a separate line +- Extraordinary items are prohibited under both US GAAP and IFRS +- Discontinued operations presented separately, net of tax + +### Common Presentation Considerations + +- **Revenue disaggregation:** ASC 606 requires disaggregation of revenue into categories that depict how the nature, amount, timing, and uncertainty of revenue are affected by economic factors +- **Stock-based compensation:** Classify within the functional expense categories (R&D, S&M, G&A) with total SBC disclosed in notes +- **Restructuring charges:** Present separately if material, or include in operating expenses with note disclosure +- **Non-GAAP adjustments:** If presenting non-GAAP measures (common in earnings releases), clearly label and reconcile to GAAP + +## Balance Sheet + +### Standard Format (Classified Balance Sheet) + +``` +ASSETS +Current Assets + Cash and cash equivalents + Short-term investments + Accounts receivable, net + Inventory + Prepaid expenses and other current assets +Total Current Assets + +Non-Current Assets + Property and equipment, net + Operating lease right-of-use assets + Goodwill + Intangible assets, net + Long-term investments + Other non-current assets +Total 
Non-Current Assets + +TOTAL ASSETS + +LIABILITIES AND STOCKHOLDERS' EQUITY +Current Liabilities + Accounts payable + Accrued liabilities + Deferred revenue, current portion + Current portion of long-term debt + Operating lease liabilities, current portion + Other current liabilities +Total Current Liabilities + +Non-Current Liabilities + Long-term debt + Deferred revenue, non-current + Operating lease liabilities, non-current + Other non-current liabilities +Total Non-Current Liabilities + +Total Liabilities + +Stockholders' Equity + Common stock + Additional paid-in capital + Retained earnings (accumulated deficit) + Accumulated other comprehensive income (loss) + Treasury stock +Total Stockholders' Equity + +TOTAL LIABILITIES AND STOCKHOLDERS' EQUITY +``` + +### GAAP Presentation Requirements (ASC 210 / IAS 1) + +- Distinguish between current and non-current assets and liabilities +- Current: expected to be realized, consumed, or settled within 12 months (or the operating cycle if longer) +- Present assets in order of liquidity (most liquid first) — standard US practice +- Accounts receivable shown net of allowance for credit losses (ASC 326) +- Property and equipment shown net of accumulated depreciation +- Goodwill is not amortized — tested for impairment annually (ASC 350) +- Leases: recognize right-of-use assets and lease liabilities for operating and finance leases (ASC 842) + +## Cash Flow Statement + +### Standard Format (Indirect Method) + +``` +CASH FLOWS FROM OPERATING ACTIVITIES +Net income (loss) +Adjustments to reconcile net income to net cash from operations: + Depreciation and amortization + Stock-based compensation + Amortization of debt issuance costs + Deferred income taxes + Loss (gain) on disposal of assets + Impairment charges + Other non-cash items +Changes in operating assets and liabilities: + Accounts receivable + Inventory + Prepaid expenses and other assets + Accounts payable + Accrued liabilities + Deferred revenue + Other liabilities +Net Cash Provided by (Used in) Operating Activities + +CASH FLOWS FROM INVESTING ACTIVITIES + Purchases of property and equipment + Purchases of investments + Proceeds from sale/maturity of investments + Acquisitions, net of cash acquired + Other investing activities +Net Cash Provided by (Used in) Investing Activities + +CASH FLOWS FROM FINANCING ACTIVITIES + Proceeds from issuance of debt + Repayment of debt + Proceeds from issuance of common stock + Repurchases of common stock + Dividends paid + Payment of debt issuance costs + Other financing activities +Net Cash Provided by (Used in) Financing Activities + +Effect of exchange rate changes on cash + +Net Increase (Decrease) in Cash and Cash Equivalents +Cash and cash equivalents, beginning of period +Cash and cash equivalents, end of period +``` + +### GAAP Presentation Requirements (ASC 230 / IAS 7) + +- Indirect method is most common (start with net income, adjust for non-cash items) +- Direct method is permitted but rarely used (requires supplemental indirect reconciliation) +- Interest paid and income taxes paid must be disclosed (either on the face or in notes) +- Non-cash investing and financing activities disclosed separately (e.g., assets acquired under leases, stock issued for acquisitions) +- Cash equivalents: short-term, highly liquid investments with original maturities of 3 months or less + +## Common Adjustments and Reclassifications + +### Period-End Adjustments + +1. 
**Accruals:** Record expenses incurred but not yet paid (AP accruals, payroll accruals, interest accruals) +2. **Deferrals:** Adjust prepaid expenses, deferred revenue, and deferred costs for the period +3. **Depreciation and amortization:** Book periodic depreciation/amortization from fixed asset and intangible schedules +4. **Bad debt provision:** Adjust allowance for credit losses based on aging analysis and historical loss rates +5. **Inventory adjustments:** Record write-downs for obsolete, slow-moving, or impaired inventory +6. **FX revaluation:** Revalue foreign-currency-denominated monetary assets and liabilities at period-end rates +7. **Tax provision:** Record current and deferred income tax expense +8. **Fair value adjustments:** Mark-to-market investments, derivatives, and other fair-value items + +### Reclassifications + +1. **Current/non-current reclassification:** Reclassify long-term debt maturing within 12 months to current +2. **Contra account netting:** Net allowances against gross receivables, accumulated depreciation against gross assets +3. **Intercompany elimination:** Eliminate intercompany balances and transactions in consolidation +4. **Discontinued operations:** Reclassify results of discontinued operations to a separate line item +5. **Equity method adjustments:** Record share of investee income/loss for equity method investments +6. **Segment reclassifications:** Ensure transactions are properly classified by operating segment + +## Flux Analysis Methodology + +### Variance Calculation + +For each line item, calculate: +- **Dollar variance:** Current period - Prior period (or current period - budget) +- **Percentage variance:** (Current - Prior) / |Prior| x 100 +- **Basis point change:** For margins and ratios, express change in basis points (1 bp = 0.01%) + +### Materiality Thresholds + +Define what constitutes a "material" variance requiring investigation. Common approaches: + +- **Fixed dollar threshold:** Variances exceeding a set dollar amount (e.g., $50K, $100K) +- **Percentage threshold:** Variances exceeding a set percentage (e.g., 10%, 15%) +- **Combined:** Either the dollar OR percentage threshold is exceeded +- **Scaled:** Different thresholds for different line items based on their size and volatility + +*Example thresholds (adjust for your organization):* + +| Line Item Size | Dollar Threshold | Percentage Threshold | +|---------------|-----------------|---------------------| +| > $10M | $500K | 5% | +| $1M - $10M | $100K | 10% | +| < $1M | $50K | 15% | + +### Variance Decomposition + +Break down total variance into component drivers: + +- **Volume/quantity effect:** Change in volume at prior period rates +- **Rate/price effect:** Change in rate/price at current period volume +- **Mix effect:** Shift in composition between items with different rates/margins +- **New/discontinued items:** Items present in one period but not the other +- **One-time/non-recurring items:** Items that are not expected to repeat +- **Timing effect:** Items shifting between periods (not a true change in run rate) +- **Currency effect:** Impact of FX rate changes on translated results + +### Investigation and Narrative + +For each material variance: +1. Quantify the variance ($ and %) +2. Identify whether favorable or unfavorable +3. Decompose into drivers using the categories above +4. Provide a narrative explanation of the business reason +5. Assess whether the variance is temporary or represents a trend change +6. 
Note any actions required (further investigation, forecast update, process change) diff --git a/code_puppy/bundled_skills/Finance/journal-entry-prep/SKILL.md b/code_puppy/bundled_skills/Finance/journal-entry-prep/SKILL.md new file mode 100644 index 00000000..266c6a3f --- /dev/null +++ b/code_puppy/bundled_skills/Finance/journal-entry-prep/SKILL.md @@ -0,0 +1,185 @@ +--- +name: journal-entry-prep +description: Prepare journal entries with proper debits, credits, and supporting documentation for month-end close. Use when booking accruals, prepaid amortization, fixed asset depreciation, payroll entries, revenue recognition, or any manual journal entry. +--- + +# Journal Entry Preparation + +**Important**: This skill assists with journal entry workflows but does not provide financial advice. All entries should be reviewed by qualified financial professionals before posting. + +Best practices, standard entry types, documentation requirements, and review workflows for journal entry preparation. + +## Standard Accrual Types and Their Entries + +### Accounts Payable Accruals + +Accrue for goods or services received but not yet invoiced at period end. + +**Typical entry:** +- Debit: Expense account (or capitalize if asset-qualifying) +- Credit: Accrued liabilities + +**Sources for calculation:** +- Open purchase orders with confirmed receipts +- Contracts with services rendered but unbilled +- Recurring vendor arrangements (utilities, subscriptions, professional services) +- Employee expense reports submitted but not yet processed + +**Key considerations:** +- Reverse in the following period (auto-reversal recommended) +- Use consistent estimation methodology period over period +- Document basis for estimates (PO amount, contract terms, historical run-rate) +- Track actual vs accrual to refine future estimates + +### Fixed Asset Depreciation + +Book periodic depreciation expense for tangible and intangible assets. + +**Typical entry:** +- Debit: Depreciation/amortization expense (by department or cost center) +- Credit: Accumulated depreciation/amortization + +**Depreciation methods:** +- **Straight-line:** (Cost - Salvage) / Useful life — most common for financial reporting +- **Declining balance:** Accelerated method applying fixed rate to net book value +- **Units of production:** Based on actual usage or output vs total expected + +**Key considerations:** +- Run depreciation from the fixed asset register or schedule +- Verify new additions are set up with correct useful life and method +- Check for disposals or impairments requiring write-off +- Ensure consistency between book and tax depreciation tracking + +### Prepaid Expense Amortization + +Amortize prepaid expenses over their benefit period. + +**Typical entry:** +- Debit: Expense account (insurance, software, rent, etc.) +- Credit: Prepaid expense + +**Common prepaid categories:** +- Insurance premiums (typically 12-month policies) +- Software licenses and subscriptions +- Prepaid rent (if applicable under lease terms) +- Prepaid maintenance contracts +- Conference and event deposits + +**Key considerations:** +- Maintain an amortization schedule with start/end dates and monthly amounts +- Review for any prepaid items that should be fully expensed (immaterial amounts) +- Check for cancelled or terminated contracts requiring accelerated amortization +- Verify new prepaids are added to the schedule promptly + +### Payroll Accruals + +Accrue compensation and related costs for the period. 
+ +**Typical entries:** + +*Salary accrual (for pay periods not aligned with month-end):* +- Debit: Salary expense (by department) +- Credit: Accrued payroll + +*Bonus accrual:* +- Debit: Bonus expense (by department) +- Credit: Accrued bonus + +*Benefits accrual:* +- Debit: Benefits expense +- Credit: Accrued benefits + +*Payroll tax accrual:* +- Debit: Payroll tax expense +- Credit: Accrued payroll taxes + +**Key considerations:** +- Calculate salary accrual based on working days in the period vs pay period +- Bonus accruals should reflect plan terms (target amounts, performance metrics, payout timing) +- Include employer-side taxes and benefits (FICA, FUTA, health, 401k match) +- Track PTO/vacation accrual liability if required by policy or jurisdiction + +### Revenue Recognition + +Recognize revenue based on performance obligations and delivery. + +**Typical entries:** + +*Recognize previously deferred revenue:* +- Debit: Deferred revenue +- Credit: Revenue + +*Recognize revenue with new receivable:* +- Debit: Accounts receivable +- Credit: Revenue + +*Defer revenue received in advance:* +- Debit: Cash / Accounts receivable +- Credit: Deferred revenue + +**Key considerations:** +- Follow ASC 606 five-step framework for contracts with customers +- Identify distinct performance obligations in each contract +- Determine transaction price (including variable consideration) +- Allocate transaction price to performance obligations +- Recognize revenue as/when performance obligations are satisfied +- Maintain contract-level detail for audit support + +## Supporting Documentation Requirements + +Every journal entry should have: + +1. **Entry description/memo:** Clear, specific description of what the entry records and why +2. **Calculation support:** How amounts were derived (formula, schedule, source data reference) +3. **Source documents:** Reference to the underlying transactions or events (PO numbers, invoice numbers, contract references, payroll register) +4. **Period:** The accounting period the entry applies to +5. **Preparer identification:** Who prepared the entry and when +6. **Approval:** Evidence of review and approval per the authorization matrix +7. 
**Reversal indicator:** Whether the entry auto-reverses and the reversal date + +## Review and Approval Workflows + +### Typical Approval Matrix + +| Entry Type | Amount Threshold | Approver | +|-----------|-----------------|----------| +| Standard recurring | Any amount | Accounting manager | +| Non-recurring / manual | < $50K | Accounting manager | +| Non-recurring / manual | $50K - $250K | Controller | +| Non-recurring / manual | > $250K | CFO / VP Finance | +| Top-side / consolidation | Any amount | Controller or above | +| Out-of-period adjustments | Any amount | Controller or above | + +*Note: Thresholds should be set based on your organization's materiality and risk tolerance.* + +### Review Checklist + +Before approving a journal entry, the reviewer should verify: + +- [ ] Debits equal credits (entry is balanced) +- [ ] Correct accounting period (not posting to a closed period) +- [ ] Account codes exist and are appropriate for the transaction +- [ ] Amounts are mathematically accurate and supported by calculations +- [ ] Description is clear, specific, and sufficient for audit purposes +- [ ] Department/cost center/project coding is correct +- [ ] Treatment is consistent with prior periods and accounting policies +- [ ] Auto-reversal is set appropriately (accruals should reverse) +- [ ] Supporting documentation is complete and referenced +- [ ] Entry amount is within the preparer's authority level +- [ ] No duplicate of an existing entry +- [ ] Unusual or large amounts are explained and justified + +## Common Errors to Check For + +1. **Unbalanced entries:** Debits do not equal credits (system should prevent, but check manual entries) +2. **Wrong period:** Entry posted to an incorrect or already-closed period +3. **Wrong sign:** Debit entered as credit or vice versa +4. **Duplicate entries:** Same transaction recorded twice (check for duplicates before posting) +5. **Wrong account:** Entry posted to incorrect GL account (especially similar account codes) +6. **Missing reversal:** Accrual entry not set to auto-reverse, causing double-counting +7. **Stale accruals:** Recurring accruals not updated for changed circumstances +8. **Round-number estimates:** Suspiciously round amounts that may not reflect actual calculations +9. **Incorrect FX rates:** Foreign currency entries using wrong exchange rate or date +10. **Missing intercompany elimination:** Entries between entities without corresponding elimination +11. **Capitalization errors:** Expenses that should be capitalized, or capitalized items that should be expensed +12. **Cut-off errors:** Transactions recorded in the wrong period based on delivery or service date diff --git a/code_puppy/bundled_skills/Finance/reconciliation/SKILL.md b/code_puppy/bundled_skills/Finance/reconciliation/SKILL.md new file mode 100644 index 00000000..8e324db3 --- /dev/null +++ b/code_puppy/bundled_skills/Finance/reconciliation/SKILL.md @@ -0,0 +1,174 @@ +--- +name: reconciliation +description: Reconcile accounts by comparing GL balances to subledgers, bank statements, or third-party data. Use when performing bank reconciliations, GL-to-subledger recs, intercompany reconciliations, or identifying and categorizing reconciling items. +--- + +# Reconciliation + +**Important**: This skill assists with reconciliation workflows but does not provide financial advice. All reconciliations should be reviewed by qualified financial professionals before sign-off. 
+ +Methodology and best practices for account reconciliation, including GL-to-subledger, bank reconciliations, and intercompany. Covers reconciling item categorization, aging analysis, and escalation. + +## Reconciliation Types + +### GL to Subledger Reconciliation + +Compare the general ledger control account balance to the detailed subledger balance. + +**Common accounts:** +- Accounts receivable (GL control vs AR subledger aging) +- Accounts payable (GL control vs AP subledger aging) +- Fixed assets (GL control vs fixed asset register) +- Inventory (GL control vs inventory valuation report) +- Prepaid expenses (GL control vs prepaid amortization schedule) +- Accrued liabilities (GL control vs accrual detail schedules) + +**Process:** +1. Pull GL balance for the control account as of period end +2. Pull subledger trial balance or detail report as of the same date +3. Compare totals — they should match if posting is real-time +4. Investigate any differences (timing of posting, manual entries not reflected, interface errors) + +**Common causes of differences:** +- Manual journal entries posted to the control account but not reflected in the subledger +- Subledger transactions not yet interfaced to the GL +- Timing differences in batch posting +- Reclassification entries in the GL without subledger adjustment +- System interface errors or failed postings + +### Bank Reconciliation + +Compare the GL cash balance to the bank statement balance. + +**Process:** +1. Obtain the bank statement balance as of period end +2. Pull the GL cash account balance as of the same date +3. Identify outstanding checks (issued but not cleared at the bank) +4. Identify deposits in transit (recorded in GL but not yet credited by bank) +5. Identify bank charges, interest, or adjustments not yet recorded in GL +6. Reconcile both sides to an adjusted balance + +**Standard format:** + +``` +Balance per bank statement: $XX,XXX +Add: Deposits in transit $X,XXX +Less: Outstanding checks ($X,XXX) +Add/Less: Bank errors $X,XXX +Adjusted bank balance: $XX,XXX + +Balance per general ledger: $XX,XXX +Add: Interest/credits not recorded $X,XXX +Less: Bank fees not recorded ($X,XXX) +Add/Less: GL errors $X,XXX +Adjusted GL balance: $XX,XXX + +Difference: $0.00 +``` + +### Intercompany Reconciliation + +Reconcile balances between related entities to ensure they net to zero on consolidation. + +**Process:** +1. Pull intercompany receivable/payable balances for each entity pair +2. Compare Entity A's receivable from Entity B to Entity B's payable to Entity A +3. Identify and resolve differences +4. Confirm all intercompany transactions have been recorded on both sides +5. 
Verify elimination entries are correct for consolidation + +**Common causes of differences:** +- Transactions recorded by one entity but not the other (timing) +- Different FX rates used by each entity +- Misclassification (intercompany vs third-party) +- Disputed amounts or unapplied payments +- Different period-end cut-off practices across entities + +## Reconciling Item Categorization + +### Category 1: Timing Differences + +Items that exist because of normal processing timing and will clear without action: + +- **Outstanding checks:** Checks issued and recorded in GL, pending bank clearance +- **Deposits in transit:** Deposits made and recorded in GL, pending bank credit +- **In-transit transactions:** Items posted in one system but pending interface to the other +- **Pending approvals:** Transactions awaiting approval to post in one system + +**Expected resolution:** These items should clear within the normal processing cycle (typically 1-5 business days). No adjusting entry needed. + +### Category 2: Adjustments Required + +Items that require a journal entry to correct: + +- **Unrecorded bank charges:** Bank fees, wire charges, returned item fees +- **Unrecorded interest:** Interest income or expense from bank/lender +- **Recording errors:** Wrong amount, wrong account, duplicates +- **Missing entries:** Transactions in one system with no corresponding entry in the other +- **Classification errors:** Correctly recorded but in the wrong account + +**Action:** Prepare adjusting journal entry to correct the GL or subledger. + +### Category 3: Requires Investigation + +Items that cannot be immediately explained: + +- **Unidentified differences:** Variances with no obvious cause +- **Disputed items:** Amounts contested between parties +- **Aged outstanding items:** Items that have not cleared within expected timeframes +- **Recurring unexplained differences:** Same type of difference appearing each period + +**Action:** Investigate root cause, document findings, escalate if unresolved. 
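+
+A minimal sketch of how open reconciling items might be tracked against these three categories, with each item's age computed for the aging analysis that follows. The item names, amounts, and dates are illustrative only and not tied to any particular GL or reconciliation system:
+
+```python
+from dataclasses import dataclass
+from datetime import date
+
+CATEGORY_ACTIONS = {
+    "timing": "Monitor (should clear within the normal processing cycle)",
+    "adjustment": "Prepare adjusting journal entry",
+    "investigation": "Investigate root cause, document findings, escalate if unresolved",
+}
+
+@dataclass
+class ReconcilingItem:
+    description: str
+    amount: float
+    originated: date
+    category: str  # "timing", "adjustment", or "investigation"
+
+    def age_days(self, as_of: date) -> int:
+        return (as_of - self.originated).days
+
+    def recommended_action(self) -> str:
+        return CATEGORY_ACTIONS[self.category]
+
+# Illustrative open items on a period-end bank reconciliation
+as_of = date(2024, 1, 31)
+items = [
+    ReconcilingItem("Outstanding check #1042", -2_500.00, date(2024, 1, 28), "timing"),
+    ReconcilingItem("Unrecorded wire fee", -35.00, date(2024, 1, 15), "adjustment"),
+    ReconcilingItem("Unidentified variance", 1_200.00, date(2023, 11, 30), "investigation"),
+]
+for item in items:
+    print(f"{item.description}: {item.age_days(as_of)} days old -> {item.recommended_action()}")
+```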
+ +## Aging Analysis for Outstanding Items + +Track the age of reconciling items to identify stale items requiring escalation: + +| Age Bucket | Status | Action | +|-----------|--------|--------| +| 0-30 days | Current | Monitor — within normal processing cycle | +| 31-60 days | Aging | Investigate — follow up on why item has not cleared | +| 61-90 days | Overdue | Escalate — notify supervisor, document investigation | +| 90+ days | Stale | Escalate to management — potential write-off or adjustment needed | + +### Aging Report Format + +| Item # | Description | Amount | Date Originated | Age (Days) | Category | Status | Owner | +|--------|-------------|--------|-----------------|------------|----------|--------|-------| +| 1 | [Detail] | $X,XXX | [Date] | XX | [Type] | [Status] | [Name] | + +### Trending + +Track reconciling item totals over time to identify growing balances: + +- Compare total outstanding items to prior period +- Flag if total reconciling items exceed materiality threshold +- Flag if number of items is growing period over period +- Identify recurring items that appear every period (may indicate process issue) + +## Escalation Thresholds + +Define escalation triggers based on your organization's risk tolerance: + +| Trigger | Threshold (Example) | Escalation | +|---------|---------------------|------------| +| Individual item amount | > $10,000 | Supervisor review | +| Individual item amount | > $50,000 | Controller review | +| Total reconciling items | > $100,000 | Controller review | +| Item age | > 60 days | Supervisor follow-up | +| Item age | > 90 days | Controller / management review | +| Unreconciled difference | Any amount | Cannot close — must resolve or document | +| Growing trend | 3+ consecutive periods | Process improvement investigation | + +*Note: Set thresholds based on your organization's materiality level and risk appetite. The examples above are illustrative.* + +## Reconciliation Best Practices + +1. **Timeliness:** Complete reconciliations within the close calendar deadline (typically T+3 to T+5 business days after period end) +2. **Completeness:** Reconcile all balance sheet accounts on a defined frequency (monthly for material accounts, quarterly for immaterial) +3. **Documentation:** Every reconciliation should include preparer, reviewer, date, and clear explanation of all reconciling items +4. **Segregation:** The person who reconciles should not be the same person who processes transactions in that account +5. **Follow-through:** Track open items to resolution — do not just carry items forward indefinitely +6. **Root cause analysis:** For recurring reconciling items, investigate and fix the underlying process issue +7. **Standardization:** Use consistent templates and procedures across all accounts +8. **Retention:** Maintain reconciliations and supporting detail per your organization's document retention policy diff --git a/code_puppy/bundled_skills/Finance/variance-analysis/SKILL.md b/code_puppy/bundled_skills/Finance/variance-analysis/SKILL.md new file mode 100644 index 00000000..1bfccf66 --- /dev/null +++ b/code_puppy/bundled_skills/Finance/variance-analysis/SKILL.md @@ -0,0 +1,265 @@ +--- +name: variance-analysis +description: Decompose financial variances into drivers with narrative explanations and waterfall analysis. Use when analyzing budget vs. actual, period-over-period changes, revenue or expense variances, or preparing variance commentary for leadership. 
+--- + +# Variance Analysis + +**Important**: This skill assists with variance analysis workflows but does not provide financial advice. All analyses should be reviewed by qualified financial professionals before use in reporting. + +Techniques for decomposing variances, materiality thresholds, narrative generation, waterfall chart methodology, and budget vs actual vs forecast comparisons. + +## Variance Decomposition Techniques + +### Price / Volume Decomposition + +The most fundamental variance decomposition. Used for revenue, cost of goods, and any metric that can be expressed as Price x Volume. + +**Formula:** +``` +Total Variance = Actual - Budget (or Prior) + +Volume Effect = (Actual Volume - Budget Volume) x Budget Price +Price Effect = (Actual Price - Budget Price) x Actual Volume +Mix Effect = Residual (interaction term), or allocated proportionally + +Verification: Volume Effect + Price Effect = Total Variance + (when mix is embedded in the price/volume terms) +``` + +**Three-way decomposition (separating mix):** +``` +Volume Effect = (Actual Volume - Budget Volume) x Budget Price x Budget Mix +Price Effect = (Actual Price - Budget Price) x Budget Volume x Actual Mix +Mix Effect = Budget Price x Budget Volume x (Actual Mix - Budget Mix) +``` + +**Example — Revenue variance:** +- Budget: 10,000 units at $50 = $500,000 +- Actual: 11,000 units at $48 = $528,000 +- Total variance: +$28,000 favorable + - Volume effect: +1,000 units x $50 = +$50,000 (favorable — sold more units) + - Price effect: -$2 x 11,000 units = -$22,000 (unfavorable — lower ASP) + - Net: +$28,000 + +### Rate / Mix Decomposition + +Used when analyzing blended rates across segments with different unit economics. + +**Formula:** +``` +Rate Effect = Sum of (Actual Volume_i x (Actual Rate_i - Budget Rate_i)) +Mix Effect = Sum of (Budget Rate_i x (Actual Volume_i - Expected Volume_i at Budget Mix)) +``` + +**Example — Gross margin variance:** +- Product A: 60% margin, Product B: 40% margin +- Budget mix: 50% A, 50% B → Blended margin 50% +- Actual mix: 40% A, 60% B → Blended margin 48% +- Mix effect explains 2pp of margin compression + +### Headcount / Compensation Decomposition + +Used for analyzing payroll and people-cost variances. + +``` +Total Comp Variance = Actual Compensation - Budget Compensation + +Decompose into: +1. Headcount variance = (Actual HC - Budget HC) x Budget Avg Comp +2. Rate variance = (Actual Avg Comp - Budget Avg Comp) x Budget HC +3. Mix variance = Difference due to level/department mix shift +4. Timing variance = Hiring earlier/later than planned (partial-period effect) +5. Attrition impact = Savings from unplanned departures (partially offset by backfill costs) +``` + +### Spend Category Decomposition + +Used for operating expense analysis when price/volume is not applicable. + +``` +Total OpEx Variance = Actual OpEx - Budget OpEx + +Decompose by: +1. Headcount-driven costs (salaries, benefits, payroll taxes, recruiting) +2. Volume-driven costs (hosting, transaction fees, commissions, shipping) +3. Discretionary spend (travel, events, professional services, marketing programs) +4. Contractual/fixed costs (rent, insurance, software licenses, subscriptions) +5. One-time / non-recurring (severance, legal settlements, write-offs, project costs) +6. Timing / phasing (spend shifted between periods vs plan) +``` + +## Materiality Thresholds and Investigation Triggers + +### Setting Thresholds + +Materiality thresholds determine which variances require investigation and narrative explanation. 
Set thresholds based on: + +1. **Financial statement materiality:** Typically 1-5% of a key benchmark (revenue, total assets, net income) +2. **Line item size:** Larger line items warrant lower percentage thresholds +3. **Volatility:** More volatile line items may need higher thresholds to avoid noise +4. **Management attention:** What level of variance would change a decision? + +### Recommended Threshold Framework + +| Comparison Type | Dollar Threshold | Percentage Threshold | Trigger | +|----------------|-----------------|---------------------|---------| +| Actual vs Budget | Organization-specific | 10% | Either exceeded | +| Actual vs Prior Period | Organization-specific | 15% | Either exceeded | +| Actual vs Forecast | Organization-specific | 5% | Either exceeded | +| Sequential (MoM) | Organization-specific | 20% | Either exceeded | + +*Set dollar thresholds based on your organization's size. Common practice: 0.5%-1% of revenue for income statement items.* + +### Investigation Priority + +When multiple variances exceed thresholds, prioritize investigation by: + +1. **Largest absolute dollar variance** — biggest P&L impact +2. **Largest percentage variance** — may indicate process issue or error +3. **Unexpected direction** — variance opposite to trend or expectation +4. **New variance** — item that was on track and is now off +5. **Cumulative/trending variance** — growing each period + +## Narrative Generation for Variance Explanations + +### Structure for Each Variance Narrative + +``` +[Line Item]: [Favorable/Unfavorable] variance of $[amount] ([percentage]%) +vs [comparison basis] for [period] + +Driver: [Primary driver description] +[2-3 sentences explaining the business reason for the variance, with specific +quantification of contributing factors] + +Outlook: [One-time / Expected to continue / Improving / Deteriorating] +Action: [None required / Monitor / Investigate further / Update forecast] +``` + +### Narrative Quality Checklist + +Good variance narratives should be: + +- [ ] **Specific:** Names the actual driver, not just "higher than expected" +- [ ] **Quantified:** Includes dollar and percentage impact of each driver +- [ ] **Causal:** Explains WHY it happened, not just WHAT happened +- [ ] **Forward-looking:** States whether the variance is expected to continue +- [ ] **Actionable:** Identifies any required follow-up or decision +- [ ] **Concise:** 2-4 sentences, not a paragraph of filler + +### Common Narrative Anti-Patterns to Avoid + +- "Revenue was higher than budget due to higher revenue" (circular — no actual explanation) +- "Expenses were elevated this period" (vague — which expenses? why?) +- "Timing" without specifying what was early/late and when it will normalize +- "One-time" without explaining what the item was +- "Various small items" for a material variance (must decompose further) +- Focusing only on the largest driver and ignoring offsetting items + +## Waterfall Chart Methodology + +### Concept + +A waterfall (or bridge) chart shows how you get from one value to another through a series of positive and negative contributors. Used to visualize variance decomposition. 
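+
+A minimal sketch tying the price/volume decomposition above to the bridge verification described below (starting value plus the sum of all drivers must equal the ending value). The figures reuse the revenue example from the decomposition section; the function name is illustrative:
+
+```python
+def price_volume_bridge(budget_vol, budget_price, actual_vol, actual_price):
+    """Decompose a revenue variance into volume and price effects and return a bridge."""
+    start = budget_vol * budget_price
+    end = actual_vol * actual_price
+    drivers = {
+        "Volume effect": (actual_vol - budget_vol) * budget_price,
+        "Price effect": (actual_price - budget_price) * actual_vol,
+    }
+    # Verification: starting value + sum of all drivers = ending value
+    assert abs(start + sum(drivers.values()) - end) < 1e-9
+    return start, drivers, end
+
+# Budget: 10,000 units at $50; Actual: 11,000 units at $48
+start, drivers, end = price_volume_bridge(10_000, 50.0, 11_000, 48.0)
+print(f"Budget revenue   {start:>12,.0f}")
+for name, amount in drivers.items():
+    print(f"  {name:<14}{amount:>+12,.0f}")
+print(f"Actual revenue   {end:>12,.0f}  (variance {end - start:+,.0f})")
+```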
+ +### Data Structure + +``` +Starting value: [Base/Budget/Prior period amount] +Drivers: [List of contributing factors with signed amounts] +Ending value: [Actual/Current period amount] + +Verification: Starting value + Sum of all drivers = Ending value +``` + +### Text-Based Waterfall Format + +When a charting tool is not available, present as a text waterfall: + +``` +WATERFALL: Revenue — Q4 Actual vs Q4 Budget + +Q4 Budget Revenue $10,000K + | + |--[+] Volume growth (new customers) +$800K + |--[+] Expansion revenue (existing customers) +$400K + |--[-] Price reductions / discounting -$200K + |--[-] Churn / contraction -$350K + |--[+] FX tailwind +$50K + |--[-] Timing (deals slipped to Q1) -$150K + | +Q4 Actual Revenue $10,550K + +Net Variance: +$550K (+5.5% favorable) +``` + +### Bridge Reconciliation Table + +Complement the waterfall with a reconciliation table: + +| Driver | Amount | % of Variance | Cumulative | +|--------|--------|---------------|------------| +| Volume growth | +$800K | 145% | +$800K | +| Expansion revenue | +$400K | 73% | +$1,200K | +| Price reductions | -$200K | -36% | +$1,000K | +| Churn / contraction | -$350K | -64% | +$650K | +| FX tailwind | +$50K | 9% | +$700K | +| Timing (deal slippage) | -$150K | -27% | +$550K | +| **Total variance** | **+$550K** | **100%** | | + +*Note: Percentages can exceed 100% for individual drivers when there are offsetting items.* + +### Waterfall Best Practices + +1. Order drivers from largest positive to largest negative (or in logical business sequence) +2. Keep to 5-8 drivers maximum — aggregate smaller items into "Other" +3. Verify the waterfall reconciles (start + drivers = end) +4. Color-code: green for favorable, red for unfavorable (in visual charts) +5. Label each bar with both the amount and a brief description +6. Include a "Total Variance" summary bar + +## Budget vs Actual vs Forecast Comparisons + +### Three-Way Comparison Framework + +| Metric | Budget | Forecast | Actual | Bud Var ($) | Bud Var (%) | Fcast Var ($) | Fcast Var (%) | +|--------|--------|----------|--------|-------------|-------------|---------------|---------------| +| Revenue | $X | $X | $X | $X | X% | $X | X% | +| COGS | $X | $X | $X | $X | X% | $X | X% | +| Gross Profit | $X | $X | $X | $X | X% | $X | X% | + +### When to Use Each Comparison + +- **Actual vs Budget:** Annual performance measurement, compensation decisions, board reporting. Budget is set at the beginning of the year and typically not changed. +- **Actual vs Forecast:** Operational management, identifying emerging issues. Forecast is updated periodically (monthly or quarterly) to reflect current expectations. +- **Forecast vs Budget:** Understanding how expectations have changed since planning. Useful for identifying planning accuracy issues. +- **Actual vs Prior Period:** Trend analysis, sequential performance. Useful when budget is not meaningful (new business lines, post-acquisition). +- **Actual vs Prior Year:** Year-over-year growth analysis, seasonality-adjusted comparison. + +### Forecast Accuracy Analysis + +Track how accurate forecasts are over time to improve planning: + +``` +Forecast Accuracy = 1 - |Actual - Forecast| / |Actual| + +MAPE (Mean Absolute Percentage Error) = Average of |Actual - Forecast| / |Actual| across periods +``` + +| Period | Forecast | Actual | Variance | Accuracy | +|--------|----------|--------|----------|----------| +| Jan | $X | $X | $X (X%) | XX% | +| Feb | $X | $X | $X (X%) | XX% | +| ... | ... | ... | ... | ... 
| +| **Avg**| | | **MAPE** | **XX%** | + +### Variance Trending + +Track how variances evolve over the year to identify systematic bias: + +- **Consistently favorable:** Budget may be too conservative (sandbagging) +- **Consistently unfavorable:** Budget may be too aggressive or execution issues +- **Growing unfavorable:** Deteriorating performance or unrealistic targets +- **Shrinking variance:** Forecast accuracy improving through the year (normal pattern) +- **Volatile:** Unpredictable business or poor forecasting methodology diff --git a/code_puppy/bundled_skills/Legal/canned-responses/SKILL.md b/code_puppy/bundled_skills/Legal/canned-responses/SKILL.md new file mode 100644 index 00000000..739f6866 --- /dev/null +++ b/code_puppy/bundled_skills/Legal/canned-responses/SKILL.md @@ -0,0 +1,337 @@ +--- +name: canned-responses +description: Generate templated responses for common legal inquiries and identify when situations require individualized attention. Use when responding to routine legal questions — data subject requests, vendor inquiries, NDA requests, discovery holds — or when managing response templates. +--- + +# Canned Responses Skill + +You are a response template assistant for an in-house legal team. You help manage, customize, and generate templated responses for common legal inquiries, and you identify when a situation should NOT use a templated response and instead requires individualized attention. + +**Important**: You assist with legal workflows but do not provide legal advice. Templated responses should be reviewed before sending, especially for regulated communications. + +## Template Management Methodology + +### Template Organization + +Templates should be organized by category and maintained in the team's local settings. Each template should include: + +1. **Category**: The type of inquiry the template addresses +2. **Template name**: A descriptive identifier +3. **Use case**: When this template is appropriate +4. **Escalation triggers**: When this template should NOT be used +5. **Required variables**: Information that must be customized for each use +6. **Template body**: The response text with variable placeholders +7. **Follow-up actions**: Standard steps after sending the response +8. **Last reviewed date**: When the template was last verified for accuracy + +### Template Lifecycle + +1. **Creation**: Draft template based on best practices and team input +2. **Review**: Legal team review and approval of template content +3. **Publication**: Add to template library with metadata +4. **Use**: Generate responses using the template +5. **Feedback**: Track when templates are modified during use to identify improvement opportunities +6. **Update**: Revise templates when laws, policies, or best practices change +7. **Retirement**: Archive templates that are no longer applicable + +## Response Categories + +### 1. Data Subject Requests (DSRs) + +**Sub-categories**: +- Acknowledgment of receipt +- Identity verification request +- Fulfillment response (access, deletion, correction) +- Partial denial with explanation +- Full denial with explanation +- Extension notification + +**Key template elements**: +- Reference to applicable regulation (GDPR, CCPA, etc.) 
+- Specific timeline for response +- Identity verification requirements +- Rights of the data subject (including right to complain to supervisory authority) +- Contact information for follow-up + +**Example template structure**: +``` +Subject: Your Data [Access/Deletion/Correction] Request - Reference {{request_id}} + +Dear {{requester_name}}, + +We have received your request dated {{request_date}} to [access/delete/correct] your personal data under [applicable regulation]. + +[Acknowledgment / verification request / fulfillment details / denial basis] + +We will respond substantively by {{response_deadline}}. + +[Contact information] +[Rights information] +``` + +### 2. Discovery Holds (Litigation Holds) + +**Sub-categories**: +- Initial hold notice to custodians +- Hold reminder / periodic reaffirmation +- Hold modification (scope change) +- Hold release + +**Key template elements**: +- Matter name and reference number +- Clear preservation obligations +- Scope of preservation (date range, data types, systems, communication types) +- Prohibition on spoliation +- Contact for questions +- Acknowledgment requirement + +**Example template structure**: +``` +Subject: LEGAL HOLD NOTICE - {{matter_name}} - Action Required + +PRIVILEGED AND CONFIDENTIAL +ATTORNEY-CLIENT COMMUNICATION + +Dear {{custodian_name}}, + +You are receiving this notice because you may possess documents, communications, or data relevant to the matter referenced above. + +PRESERVATION OBLIGATION: +Effective immediately, you must preserve all documents and electronically stored information (ESI) related to: +- Subject matter: {{hold_scope}} +- Date range: {{start_date}} to present +- Document types: {{document_types}} + +DO NOT delete, destroy, modify, or discard any potentially relevant materials. + +[Specific instructions for systems, email, chat, local files] + +Please acknowledge receipt of this notice by {{acknowledgment_deadline}}. + +Contact {{legal_contact}} with any questions. +``` + +### 3. Privacy Inquiries + +**Sub-categories**: +- Cookie/tracking inquiry responses +- Privacy policy questions +- Data sharing practice inquiries +- Children's data inquiries +- Cross-border transfer questions + +**Key template elements**: +- Reference to the organization's privacy notice +- Specific answers based on current practices +- Links to relevant privacy documentation +- Contact information for the privacy team + +### 4. Vendor Legal Questions + +**Sub-categories**: +- Contract status inquiry response +- Amendment request response +- Compliance certification requests +- Audit request responses +- Insurance certificate requests + +**Key template elements**: +- Reference to the applicable agreement +- Specific response to the vendor's question +- Any required caveats or limitations +- Next steps and timeline + +### 5. NDA Requests + +**Sub-categories**: +- Sending the organization's standard form NDA +- Accepting a counterparty's NDA (with markup) +- Declining an NDA request with explanation +- NDA renewal or extension + +**Key template elements**: +- Purpose of the NDA +- Standard terms summary +- Execution instructions +- Timeline expectations + +### 6. 
Subpoena / Legal Process + +**Sub-categories**: +- Acknowledgment of receipt +- Objection letter +- Request for extension +- Compliance cover letter + +**Key template elements**: +- Case reference and jurisdiction +- Specific objections (if any) +- Preservation confirmation +- Timeline for compliance +- Privilege log reference (if applicable) + +**Critical note**: Subpoena responses almost always require individualized counsel review. Templates serve as starting frameworks, not final responses. + +### 7. Insurance Notifications + +**Sub-categories**: +- Initial claim notification +- Supplemental information +- Reservation of rights response + +**Key template elements**: +- Policy number and coverage period +- Description of the matter or incident +- Timeline of events +- Requested coverage confirmation + +## Customization Guidelines + +When generating a response from a template: + +### Required Customization +Every templated response MUST be customized with: +- Correct names, dates, and reference numbers +- Specific facts of the situation +- Applicable jurisdiction and regulation +- Correct response deadlines based on when the inquiry was received +- Appropriate signature block and contact information + +### Tone Adjustment +Adjust tone based on: +- **Audience**: Internal vs. external, business vs. legal, individual vs. regulatory authority +- **Relationship**: New counterparty vs. existing partner vs. adversarial party +- **Sensitivity**: Routine inquiry vs. contentious matter vs. regulatory investigation +- **Urgency**: Standard timeline vs. expedited response needed + +### Jurisdiction-Specific Adjustments +- Verify that cited regulations are correct for the requester's jurisdiction +- Adjust timelines to match applicable law +- Include jurisdiction-specific rights information +- Use jurisdiction-appropriate legal terminology + +## Escalation Trigger Identification + +Every template category has situations where a templated response is inappropriate. 
Before generating any response, check for these escalation triggers: + +### Universal Escalation Triggers (Apply to All Categories) +- The matter involves potential litigation or regulatory investigation +- The inquiry is from a regulator, government agency, or law enforcement +- The response could create a binding legal commitment or waiver +- The matter involves potential criminal liability +- Media attention is involved or likely +- The situation is unprecedented (no prior handling by the team) +- Multiple jurisdictions are involved with conflicting requirements +- The matter involves executive leadership or board members + +### Category-Specific Escalation Triggers + +**Data Subject Requests**: +- Request from a minor or on behalf of a minor +- Request involves data subject to litigation hold +- Requester is in active litigation or dispute with the organization +- Request is from an employee with an active HR matter +- Request scope is so broad it appears to be a fishing expedition +- Request involves special category data (health, biometric, genetic) + +**Discovery Holds**: +- Potential criminal liability +- Unclear or disputed preservation scope +- Hold conflicts with regulatory deletion requirements +- Prior holds exist for related matters +- Custodian objects to the hold scope + +**Vendor Questions**: +- Vendor is disputing contract terms +- Vendor is threatening litigation or termination +- Response could affect ongoing negotiation +- Question involves regulatory compliance (not just contract interpretation) + +**Subpoena / Legal Process**: +- ALWAYS requires counsel review (templates are starting points only) +- Privilege issues identified +- Third-party data involved +- Cross-border production issues +- Unreasonable timeline + +### When an Escalation Trigger is Detected + +1. **Stop**: Do not generate a templated response +2. **Alert**: Inform the user that an escalation trigger has been detected +3. **Explain**: Describe which trigger was detected and why it matters +4. **Recommend**: Suggest the appropriate escalation path (senior counsel, outside counsel, specific team member) +5. **Offer**: Provide a draft for counsel review (clearly marked as "DRAFT - FOR COUNSEL REVIEW ONLY") rather than a final response + +## Template Creation Guide + +When helping users create new templates: + +### Step 1: Define the Use Case +- What type of inquiry does this address? +- How frequently does this come up? +- Who is the typical audience? +- What is the typical urgency level? + +### Step 2: Identify Required Elements +- What information must be included in every response? +- What regulatory requirements apply? +- What organizational policies govern this type of response? + +### Step 3: Define Variables +- What changes with each use? (names, dates, specifics) +- What stays the same? (legal requirements, standard language) +- Use clear variable names: `{{requester_name}}`, `{{response_deadline}}`, `{{matter_reference}}` + +### Step 4: Draft the Template +- Write in clear, professional language +- Avoid unnecessary legal jargon for business audiences +- Include all legally required elements +- Add placeholders for all variable content +- Include a subject line template if for email use + +### Step 5: Define Escalation Triggers +- What situations should NOT use this template? +- What characteristics indicate the matter needs individualized attention? 
+- Be specific: vague triggers are not useful + +### Step 6: Add Metadata +- Template name and category +- Version number and last reviewed date +- Author and approver +- Follow-up actions checklist + +### Template Format + +```markdown +## Template: {{template_name}} +**Category**: {{category}} +**Version**: {{version}} | **Last Reviewed**: {{date}} +**Approved By**: {{approver}} + +### Use When +- [Condition 1] +- [Condition 2] + +### Do NOT Use When (Escalation Triggers) +- [Trigger 1] +- [Trigger 2] + +### Variables +| Variable | Description | Example | +|---|---|---| +| {{var1}} | [what it is] | [example value] | +| {{var2}} | [what it is] | [example value] | + +### Subject Line +[Subject template with {{variables}}] + +### Body +[Response body with {{variables}}] + +### Follow-Up Actions +1. [Action 1] +2. [Action 2] + +### Notes +[Any special instructions for users of this template] +``` diff --git a/code_puppy/bundled_skills/Legal/compliance/SKILL.md b/code_puppy/bundled_skills/Legal/compliance/SKILL.md new file mode 100644 index 00000000..260474f8 --- /dev/null +++ b/code_puppy/bundled_skills/Legal/compliance/SKILL.md @@ -0,0 +1,214 @@ +--- +name: compliance +description: Navigate privacy regulations (GDPR, CCPA), review DPAs, and handle data subject requests. Use when reviewing data processing agreements, responding to data subject access or deletion requests, assessing cross-border data transfer requirements, or evaluating privacy compliance. +--- + +# Compliance Skill + +You are a compliance assistant for an in-house legal team. You help with privacy regulation compliance, DPA reviews, data subject request handling, and regulatory monitoring. + +**Important**: You assist with legal workflows but do not provide legal advice. Compliance determinations should be reviewed by qualified legal professionals. Regulatory requirements change frequently; always verify current requirements with authoritative sources. + +## Privacy Regulation Overview + +### GDPR (General Data Protection Regulation) + +**Scope**: Applies to processing of personal data of individuals in the EU/EEA, regardless of where the processing organization is located. 
+ +**Key Obligations for In-House Legal Teams**: +- **Lawful basis**: Identify and document lawful basis for each processing activity (consent, contract, legitimate interest, legal obligation, vital interest, public task) +- **Data subject rights**: Respond to access, rectification, erasure, portability, restriction, and objection requests within 30 days (extendable by 60 days for complex requests) +- **Data protection impact assessments (DPIAs)**: Required for processing likely to result in high risk to individuals +- **Breach notification**: Notify supervisory authority within 72 hours of becoming aware of a personal data breach; notify affected individuals without undue delay if high risk +- **Records of processing**: Maintain Article 30 records of processing activities +- **International transfers**: Ensure appropriate safeguards for transfers outside EEA (SCCs, adequacy decisions, BCRs) +- **DPO requirement**: Appoint a Data Protection Officer if required (public authority, large-scale processing of special categories, large-scale systematic monitoring) + +**Common In-House Legal Touchpoints**: +- Reviewing vendor DPAs for GDPR compliance +- Advising product teams on privacy by design requirements +- Responding to supervisory authority inquiries +- Managing cross-border data transfer mechanisms +- Reviewing consent mechanisms and privacy notices + +### CCPA / CPRA (California Consumer Privacy Act / California Privacy Rights Act) + +**Scope**: Applies to businesses that collect personal information of California residents and meet revenue, data volume, or data sale thresholds. + +**Key Obligations**: +- **Right to know**: Consumers can request disclosure of personal information collected, used, and shared +- **Right to delete**: Consumers can request deletion of their personal information +- **Right to opt-out**: Consumers can opt out of the sale or sharing of personal information +- **Right to correct**: Consumers can request correction of inaccurate personal information (CPRA addition) +- **Right to limit use of sensitive personal information**: Consumers can limit use of sensitive PI to specific purposes (CPRA addition) +- **Non-discrimination**: Cannot discriminate against consumers who exercise their rights +- **Privacy notice**: Must provide a privacy notice at or before collection describing categories of PI collected and purposes +- **Service provider agreements**: Contracts with service providers must restrict use of PI to the specified business purpose + +**Response Timelines**: +- Acknowledge receipt within 10 business days +- Respond substantively within 45 calendar days (extendable by 45 days with notice) + +### Other Key Regulations to Monitor + +| Regulation | Jurisdiction | Key Differentiators | +|---|---|---| +| **LGPD** (Brazil) | Brazil | Similar to GDPR; requires DPO appointment; National Data Protection Authority (ANPD) enforcement | +| **POPIA** (South Africa) | South Africa | Information Regulator oversight; required registration of processing | +| **PIPEDA** (Canada) | Canada (federal) | Consent-based framework; OPC oversight; being modernized | +| **PDPA** (Singapore) | Singapore | Do Not Call registry; mandatory breach notification; PDPC enforcement | +| **Privacy Act** (Australia) | Australia | Australian Privacy Principles (APPs); notifiable data breaches scheme | +| **PIPL** (China) | China | Strict cross-border transfer rules; data localization requirements; CAC oversight | +| **UK GDPR** | United Kingdom | Post-Brexit UK version; ICO oversight; similar to 
EU GDPR with UK-specific adequacy | + +## DPA Review Checklist + +When reviewing a Data Processing Agreement or Data Processing Addendum, verify the following: + +### Required Elements (GDPR Article 28) + +- [ ] **Subject matter and duration**: Clearly defined scope and term of processing +- [ ] **Nature and purpose**: Specific description of what processing will occur and why +- [ ] **Type of personal data**: Categories of personal data being processed +- [ ] **Categories of data subjects**: Whose personal data is being processed +- [ ] **Controller obligations and rights**: Controller's instructions and oversight rights + +### Processor Obligations + +- [ ] **Process only on documented instructions**: Processor commits to process only per controller's instructions (with exception for legal requirements) +- [ ] **Confidentiality**: Personnel authorized to process have committed to confidentiality +- [ ] **Security measures**: Appropriate technical and organizational measures described (Article 32 reference) +- [ ] **Sub-processor requirements**: + - [ ] Written authorization requirement (general or specific) + - [ ] If general authorization: notification of changes with opportunity to object + - [ ] Sub-processors bound by same obligations via written agreement + - [ ] Processor remains liable for sub-processor performance +- [ ] **Data subject rights assistance**: Processor will assist controller in responding to data subject requests +- [ ] **Security and breach assistance**: Processor will assist with security obligations, breach notification, DPIAs, and prior consultation +- [ ] **Deletion or return**: On termination, delete or return all personal data (at controller's choice) and delete existing copies unless legal retention required +- [ ] **Audit rights**: Controller has right to conduct audits and inspections (or accept third-party audit reports) +- [ ] **Breach notification**: Processor will notify controller of personal data breaches without undue delay (ideally within 24-48 hours; must enable controller to meet 72-hour regulatory deadline) + +### International Transfers + +- [ ] **Transfer mechanism identified**: SCCs, adequacy decision, BCRs, or other valid mechanism +- [ ] **SCCs version**: Using current EU SCCs (June 2021 version) if applicable +- [ ] **Correct module**: Appropriate SCC module selected (C2P, C2C, P2P, P2C) +- [ ] **Transfer impact assessment**: Completed if transferring to countries without adequacy decisions +- [ ] **Supplementary measures**: Technical, organizational, or contractual measures to address gaps identified in transfer impact assessment +- [ ] **UK addendum**: If UK personal data is in scope, UK International Data Transfer Addendum included + +### Practical Considerations + +- [ ] **Liability**: DPA liability provisions align with (or don't conflict with) the main services agreement +- [ ] **Termination alignment**: DPA term aligns with the services agreement +- [ ] **Data locations**: Processing locations specified and acceptable +- [ ] **Security standards**: Specific security standards or certifications required (SOC 2, ISO 27001, etc.) 
+- [ ] **Insurance**: Adequate insurance coverage for data processing activities + +### Common DPA Issues + +| Issue | Risk | Standard Position | +|---|---|---| +| Blanket sub-processor authorization without notification | Loss of control over processing chain | Require notification with right to object | +| Breach notification timeline > 72 hours | May prevent timely regulatory notification | Require notification within 24-48 hours | +| No audit rights (or audit rights only via third-party reports) | Cannot verify compliance | Accept SOC 2 Type II + right to audit upon cause | +| Data deletion timeline not specified | Data retained indefinitely | Require deletion within 30-90 days of termination | +| No data processing locations specified | Data could be processed anywhere | Require disclosure of processing locations | +| Outdated SCCs | Invalid transfer mechanism | Require current EU SCCs (2021 version) | + +## Data Subject Request Handling + +### Request Intake + +When a data subject request is received: + +1. **Identify the request type**: + - Access (copy of personal data) + - Rectification (correction of inaccurate data) + - Erasure / deletion ("right to be forgotten") + - Restriction of processing + - Data portability (structured, machine-readable format) + - Objection to processing + - Opt-out of sale/sharing (CCPA/CPRA) + - Limit use of sensitive personal information (CPRA) + +2. **Identify applicable regulation(s)**: + - Where is the data subject located? + - Which laws apply based on your organization's presence and activities? + - What are the specific requirements and timelines? + +3. **Verify identity**: + - Confirm the requester is who they claim to be + - Use reasonable verification measures proportionate to the sensitivity of the data + - Do not require excessive documentation + +4. **Log the request**: + - Date received + - Request type + - Requester identity + - Applicable regulation + - Response deadline + - Assigned handler + +### Response Timelines + +| Regulation | Initial Acknowledgment | Substantive Response | Extension | +|---|---|---|---| +| GDPR | Not specified (best practice: promptly) | 30 days | +60 days (with notice) | +| CCPA/CPRA | 10 business days | 45 calendar days | +45 days (with notice) | +| UK GDPR | Not specified (best practice: promptly) | 30 days | +60 days (with notice) | +| LGPD | Not specified | 15 days | Limited extensions | + +### Exemptions and Exceptions + +Before fulfilling a request, check whether any exemptions apply: + +**Common exemptions across regulations**: +- Legal claims defense or establishment +- Legal obligations requiring retention +- Public interest or official authority +- Freedom of expression and information (for erasure requests) +- Archiving in the public interest or scientific/historical research + +**Organization-specific considerations**: +- Litigation hold: Data subject to a legal hold cannot be deleted +- Regulatory retention: Financial records, employment records, and other categories may have mandatory retention periods +- Third-party rights: Fulfilling the request might adversely affect the rights of others + +### Response Process + +1. Gather all personal data of the requester across systems +2. Apply any exemptions and document the basis +3. Prepare response: fulfill the request or explain why (in whole or part) it cannot be fulfilled +4. If denying (in whole or part): cite the specific legal basis for denial +5. Inform the requester of their right to lodge a complaint with the supervisory authority +6. 
Document the response and retain records of the request and response + +## Regulatory Monitoring Basics + +### What to Monitor + +Maintain awareness of developments in: +- **Regulatory guidance**: New or updated guidance from supervisory authorities (ICO, CNIL, FTC, state AGs, etc.) +- **Enforcement actions**: Fines, orders, and settlements that signal regulatory priorities +- **Legislative changes**: New privacy laws, amendments to existing laws, implementing regulations +- **Industry standards**: Updates to ISO 27001, SOC 2, NIST frameworks, and sector-specific requirements +- **Cross-border transfer developments**: Adequacy decisions, SCC updates, data localization requirements + +### Monitoring Approach + +1. **Subscribe to regulatory authority communications** (newsletters, RSS feeds, official announcements) +2. **Track relevant legal publications** for analysis of new developments +3. **Review industry association updates** for sector-specific guidance +4. **Maintain a regulatory calendar** of known upcoming deadlines, effective dates, and compliance milestones +5. **Brief the legal team** on material developments that affect the organization's processing activities + +### Escalation Criteria + +Escalate regulatory developments to senior counsel or leadership when: +- A new regulation or guidance directly affects the organization's core business activities +- An enforcement action in the organization's sector signals heightened regulatory scrutiny +- A compliance deadline is approaching that requires organizational changes +- A data transfer mechanism the organization relies on is challenged or invalidated +- A regulatory authority initiates an inquiry or investigation involving the organization diff --git a/code_puppy/bundled_skills/Legal/contract-review/SKILL.md b/code_puppy/bundled_skills/Legal/contract-review/SKILL.md new file mode 100644 index 00000000..28cc090d --- /dev/null +++ b/code_puppy/bundled_skills/Legal/contract-review/SKILL.md @@ -0,0 +1,222 @@ +--- +name: contract-review +description: Review contracts against your organization's negotiation playbook, flagging deviations and generating redline suggestions. Use when reviewing vendor contracts, customer agreements, or any commercial agreement where you need clause-by-clause analysis against standard positions. +--- + +# Contract Review Skill + +You are a contract review assistant for an in-house legal team. You analyze contracts against the organization's negotiation playbook, identify deviations, classify their severity, and generate actionable redline suggestions. + +**Important**: You assist with legal workflows but do not provide legal advice. All analysis should be reviewed by qualified legal professionals before being relied upon. + +## Playbook-Based Review Methodology + +### Loading the Playbook + +Before reviewing any contract, check for a configured playbook in the user's local settings. The playbook defines the organization's standard positions, acceptable ranges, and escalation triggers for each major clause type. + +If no playbook is available: +- Inform the user and offer to help create one +- If proceeding without a playbook, use widely-accepted commercial standards as a baseline +- Clearly label the review as "based on general commercial standards" rather than organizational positions + +### Review Process + +1. **Identify the contract type**: SaaS agreement, professional services, license, partnership, procurement, etc. The contract type affects which clauses are most material. +2. 
**Determine the user's side**: Vendor, customer, licensor, licensee, partner. This fundamentally changes the analysis (e.g., limitation of liability protections favor different parties). +3. **Read the entire contract** before flagging issues. Clauses interact with each other (e.g., an uncapped indemnity may be partially mitigated by a broad limitation of liability). +4. **Analyze each material clause** against the playbook position. +5. **Consider the contract holistically**: Are the overall risk allocation and commercial terms balanced? + +## Common Clause Analysis + +### Limitation of Liability + +**Key elements to review:** +- Cap amount (fixed dollar amount, multiple of fees, or uncapped) +- Whether the cap is mutual or applies differently to each party +- Carveouts from the cap (what liabilities are uncapped) +- Whether consequential, indirect, special, or punitive damages are excluded +- Whether the exclusion is mutual +- Carveouts from the consequential damages exclusion +- Whether the cap applies per-claim, per-year, or aggregate + +**Common issues:** +- Cap set at a fraction of fees paid (e.g., "fees paid in the prior 3 months" on a low-value contract) +- Asymmetric carveouts favoring the drafter +- Broad carveouts that effectively eliminate the cap (e.g., "any breach of Section X" where Section X covers most obligations) +- No consequential damages exclusion for one party's breaches + +### Indemnification + +**Key elements to review:** +- Whether indemnification is mutual or unilateral +- Scope: what triggers the indemnification obligation (IP infringement, data breach, bodily injury, breach of reps and warranties) +- Whether indemnification is capped (often subject to the overall liability cap, or sometimes uncapped) +- Procedure: notice requirements, right to control defense, right to settle +- Whether the indemnitee must mitigate +- Relationship between indemnification and the limitation of liability clause + +**Common issues:** +- Unilateral indemnification for IP infringement when both parties contribute IP +- Indemnification for "any breach" (too broad; essentially converts the liability cap to uncapped liability) +- No right to control defense of claims +- Indemnification obligations that survive termination indefinitely + +### Intellectual Property + +**Key elements to review:** +- Ownership of pre-existing IP (each party should retain their own) +- Ownership of IP developed during the engagement +- Work-for-hire provisions and their scope +- License grants: scope, exclusivity, territory, sublicensing rights +- Open source considerations +- Feedback clauses (grants on suggestions or improvements) + +**Common issues:** +- Broad IP assignment that could capture the customer's pre-existing IP +- Work-for-hire provisions extending beyond the deliverables +- Unrestricted feedback clauses granting perpetual, irrevocable licenses +- License scope broader than needed for the business relationship + +### Data Protection + +**Key elements to review:** +- Whether a Data Processing Agreement/Addendum (DPA) is required +- Data controller vs. 
data processor classification +- Sub-processor rights and notification obligations +- Data breach notification timeline (72 hours for GDPR) +- Cross-border data transfer mechanisms (SCCs, adequacy decisions, binding corporate rules) +- Data deletion or return obligations on termination +- Data security requirements and audit rights +- Purpose limitation for data processing + +**Common issues:** +- No DPA when personal data is being processed +- Blanket authorization for sub-processors without notification +- Breach notification timeline longer than regulatory requirements +- No cross-border transfer protections when data moves internationally +- Inadequate data deletion provisions + +### Term and Termination + +**Key elements to review:** +- Initial term and renewal terms +- Auto-renewal provisions and notice periods +- Termination for convenience: available? notice period? early termination fees? +- Termination for cause: cure period? what constitutes cause? +- Effects of termination: data return, transition assistance, survival clauses +- Wind-down period and obligations + +**Common issues:** +- Long initial terms with no termination for convenience +- Auto-renewal with short notice windows (e.g., 30-day notice for annual renewal) +- No cure period for termination for cause +- Inadequate transition assistance provisions +- Survival clauses that effectively extend the agreement indefinitely + +### Governing Law and Dispute Resolution + +**Key elements to review:** +- Choice of law (governing jurisdiction) +- Dispute resolution mechanism (litigation, arbitration, mediation first) +- Venue and jurisdiction for litigation +- Arbitration rules and seat (if arbitration) +- Jury waiver +- Class action waiver +- Prevailing party attorney's fees + +**Common issues:** +- Unfavorable jurisdiction (unusual or remote venue) +- Mandatory arbitration with rules favorable to the drafter +- Waiver of jury trial without corresponding protections +- No escalation process before formal dispute resolution + +## Deviation Severity Classification + +### GREEN -- Acceptable + +The clause aligns with or is better than the organization's standard position. Minor variations that are commercially reasonable and do not increase risk materially. + +**Examples:** +- Liability cap at 18 months of fees when standard is 12 months (better for the customer) +- Mutual NDA term of 2 years when standard is 3 years (shorter but reasonable) +- Governing law in a well-established commercial jurisdiction close to the preferred one + +**Action**: Note for awareness. No negotiation needed. + +### YELLOW -- Negotiate + +The clause falls outside the standard position but within a negotiable range. The term is common in the market but not the organization's preference. Requires attention and likely negotiation, but not escalation. + +**Examples:** +- Liability cap at 6 months of fees when standard is 12 months (below standard but negotiable) +- Unilateral indemnification for IP infringement when standard is mutual (common market position but not preferred) +- Auto-renewal with 60-day notice when standard is 90 days +- Governing law in an acceptable but not preferred jurisdiction + +**Action**: Generate specific redline language. Provide fallback position. Estimate business impact of accepting vs. negotiating. + +### RED -- Escalate + +The clause falls outside acceptable range, triggers a defined escalation criterion, or poses material risk. 
Requires senior counsel review, outside counsel involvement, or business decision-maker sign-off. + +**Examples:** +- Uncapped liability or no limitation of liability clause +- Unilateral broad indemnification with no cap +- IP assignment of pre-existing IP +- No DPA offered when personal data is processed +- Unreasonable non-compete or exclusivity provisions +- Governing law in a problematic jurisdiction with mandatory arbitration + +**Action**: Explain the specific risk. Provide market-standard alternative language. Estimate exposure. Recommend escalation path. + +## Redline Generation Best Practices + +When generating redline suggestions: + +1. **Be specific**: Provide exact language, not vague guidance. The redline should be ready to insert. +2. **Be balanced**: Propose language that is firm on critical points but commercially reasonable. Overly aggressive redlines slow negotiations. +3. **Explain the rationale**: Include a brief, professional rationale suitable for sharing with the counterparty's counsel. +4. **Provide fallback positions**: For YELLOW items, include a fallback position if the primary ask is rejected. +5. **Prioritize**: Not all redlines are equal. Indicate which are must-haves and which are nice-to-haves. +6. **Consider the relationship**: Adjust tone and approach based on whether this is a new vendor, strategic partner, or commodity supplier. + +### Redline Format + +For each redline: +``` +**Clause**: [Section reference and clause name] +**Current language**: "[exact quote from the contract]" +**Proposed redline**: "[specific alternative language with additions in bold and deletions struck through conceptually]" +**Rationale**: [1-2 sentences explaining why, suitable for external sharing] +**Priority**: [Must-have / Should-have / Nice-to-have] +**Fallback**: [Alternative position if primary redline is rejected] +``` + +## Negotiation Priority Framework + +When presenting redlines, organize by negotiation priority: + +### Tier 1 -- Must-Haves (Deal Breakers) +Issues where the organization cannot proceed without resolution: +- Uncapped or materially insufficient liability protections +- Missing data protection requirements for regulated data +- IP provisions that could jeopardize core assets +- Terms that conflict with regulatory obligations + +### Tier 2 -- Should-Haves (Strong Preferences) +Issues that materially affect risk but have negotiation room: +- Liability cap adjustments within range +- Indemnification scope and mutuality +- Termination flexibility +- Audit and compliance rights + +### Tier 3 -- Nice-to-Haves (Concession Candidates) +Issues that improve the position but can be conceded strategically: +- Preferred governing law (if alternative is acceptable) +- Notice period preferences +- Minor definitional improvements +- Insurance certificate requirements + +**Negotiation strategy**: Lead with Tier 1 items. Trade Tier 3 concessions to secure Tier 2 wins. Never concede on Tier 1 without escalation. diff --git a/code_puppy/bundled_skills/Legal/legal-risk-assessment/SKILL.md b/code_puppy/bundled_skills/Legal/legal-risk-assessment/SKILL.md new file mode 100644 index 00000000..636075c5 --- /dev/null +++ b/code_puppy/bundled_skills/Legal/legal-risk-assessment/SKILL.md @@ -0,0 +1,265 @@ +--- +name: legal-risk-assessment +description: Assess and classify legal risks using a severity-by-likelihood framework with escalation criteria. 
Use when evaluating contract risk, assessing deal exposure, classifying issues by severity, or determining whether a matter needs senior counsel or outside legal review. +--- + +# Legal Risk Assessment Skill + +You are a legal risk assessment assistant for an in-house legal team. You help evaluate, classify, and document legal risks using a structured framework based on severity and likelihood. + +**Important**: You assist with legal workflows but do not provide legal advice. Risk assessments should be reviewed by qualified legal professionals. The framework provided is a starting point that organizations should customize to their specific risk appetite and industry context. + +## Risk Assessment Framework + +### Severity x Likelihood Matrix + +Legal risks are assessed on two dimensions: + +**Severity** (impact if the risk materializes): + +| Level | Label | Description | +|---|---|---| +| 1 | **Negligible** | Minor inconvenience; no material financial, operational, or reputational impact. Can be handled within normal operations. | +| 2 | **Low** | Limited impact; minor financial exposure (< 1% of relevant contract/deal value); minor operational disruption; no public attention. | +| 3 | **Moderate** | Meaningful impact; material financial exposure (1-5% of relevant value); noticeable operational disruption; potential for limited public attention. | +| 4 | **High** | Significant impact; substantial financial exposure (5-25% of relevant value); significant operational disruption; likely public attention; potential regulatory scrutiny. | +| 5 | **Critical** | Severe impact; major financial exposure (> 25% of relevant value); fundamental business disruption; significant reputational damage; regulatory action likely; potential personal liability for officers/directors. | + +**Likelihood** (probability the risk materializes): + +| Level | Label | Description | +|---|---|---| +| 1 | **Remote** | Highly unlikely to occur; no known precedent in similar situations; would require exceptional circumstances. | +| 2 | **Unlikely** | Could occur but not expected; limited precedent; would require specific triggering events. | +| 3 | **Possible** | May occur; some precedent exists; triggering events are foreseeable. | +| 4 | **Likely** | Probably will occur; clear precedent; triggering events are common in similar situations. | +| 5 | **Almost Certain** | Expected to occur; strong precedent or pattern; triggering events are present or imminent. 
| + +### Risk Score Calculation + +**Risk Score = Severity x Likelihood** + +| Score Range | Risk Level | Color | +|---|---|---| +| 1-4 | **Low Risk** | GREEN | +| 5-9 | **Medium Risk** | YELLOW | +| 10-15 | **High Risk** | ORANGE | +| 16-25 | **Critical Risk** | RED | + +### Risk Matrix Visualization + +``` + LIKELIHOOD + Remote Unlikely Possible Likely Almost Certain + (1) (2) (3) (4) (5) +SEVERITY +Critical (5) | 5 | 10 | 15 | 20 | 25 | +High (4) | 4 | 8 | 12 | 16 | 20 | +Moderate (3) | 3 | 6 | 9 | 12 | 15 | +Low (2) | 2 | 4 | 6 | 8 | 10 | +Negligible(1) | 1 | 2 | 3 | 4 | 5 | +``` + +## Risk Classification Levels with Recommended Actions + +### GREEN -- Low Risk (Score 1-4) + +**Characteristics**: +- Minor issues that are unlikely to materialize +- Standard business risks within normal operating parameters +- Well-understood risks with established mitigations in place + +**Recommended Actions**: +- **Accept**: Acknowledge the risk and proceed with standard controls +- **Document**: Record in the risk register for tracking +- **Monitor**: Include in periodic reviews (quarterly or annually) +- **No escalation required**: Can be managed by the responsible team member + +**Examples**: +- Vendor contract with minor deviation from standard terms in a non-critical area +- Routine NDA with a well-known counterparty in a standard jurisdiction +- Minor administrative compliance task with clear deadline and owner + +### YELLOW -- Medium Risk (Score 5-9) + +**Characteristics**: +- Moderate issues that could materialize under foreseeable circumstances +- Risks that warrant attention but do not require immediate action +- Issues with established precedent for management + +**Recommended Actions**: +- **Mitigate**: Implement specific controls or negotiate to reduce exposure +- **Monitor actively**: Review at regular intervals (monthly or as triggers occur) +- **Document thoroughly**: Record risk, mitigations, and rationale in risk register +- **Assign owner**: Ensure a specific person is responsible for monitoring and mitigation +- **Brief stakeholders**: Inform relevant business stakeholders of the risk and mitigation plan +- **Escalate if conditions change**: Define trigger events that would elevate the risk level + +**Examples**: +- Contract with liability cap below standard but within negotiable range +- Vendor processing personal data in a jurisdiction without clear adequacy determination +- Regulatory development that may affect a business activity in the medium term +- IP provision that is broader than preferred but common in the market + +### ORANGE -- High Risk (Score 10-15) + +**Characteristics**: +- Significant issues with meaningful probability of materializing +- Risks that could result in substantial financial, operational, or reputational impact +- Issues that require senior attention and dedicated mitigation efforts + +**Recommended Actions**: +- **Escalate to senior counsel**: Brief the head of legal or designated senior counsel +- **Develop mitigation plan**: Create a specific, actionable plan to reduce the risk +- **Brief leadership**: Inform relevant business leaders of the risk and recommended approach +- **Set review cadence**: Review weekly or at defined milestones +- **Consider outside counsel**: Engage outside counsel for specialized advice if needed +- **Document in detail**: Full risk memo with analysis, options, and recommendations +- **Define contingency plan**: What will the organization do if the risk materializes? 
+ +**Examples**: +- Contract with uncapped indemnification in a material area +- Data processing activity that may violate a regulatory requirement if not restructured +- Threatened litigation from a significant counterparty +- IP infringement allegation with colorable basis +- Regulatory inquiry or audit request + +### RED -- Critical Risk (Score 16-25) + +**Characteristics**: +- Severe issues that are likely or certain to materialize +- Risks that could fundamentally impact the business, its officers, or its stakeholders +- Issues requiring immediate executive attention and rapid response + +**Recommended Actions**: +- **Immediate escalation**: Brief General Counsel, C-suite, and/or Board as appropriate +- **Engage outside counsel**: Retain specialized outside counsel immediately +- **Establish response team**: Dedicated team to manage the risk with clear roles +- **Consider insurance notification**: Notify insurers if applicable +- **Crisis management**: Activate crisis management protocols if reputational risk is involved +- **Preserve evidence**: Implement litigation hold if legal proceedings are possible +- **Daily or more frequent review**: Active management until the risk is resolved or reduced +- **Board reporting**: Include in board risk reporting as appropriate +- **Regulatory notifications**: Make any required regulatory notifications + +**Examples**: +- Active litigation with significant exposure +- Data breach affecting regulated personal data +- Regulatory enforcement action +- Material contract breach by or against the organization +- Government investigation +- Credible IP infringement claim against a core product or service + +## Documentation Standards for Risk Assessments + +### Risk Assessment Memo Format + +Every formal risk assessment should be documented using the following structure: + +``` +## Legal Risk Assessment + +**Date**: [assessment date] +**Assessor**: [person conducting assessment] +**Matter**: [description of the matter being assessed] +**Privileged**: [Yes/No - mark as attorney-client privileged if applicable] + +### 1. Risk Description +[Clear, concise description of the legal risk] + +### 2. Background and Context +[Relevant facts, history, and business context] + +### 3. Risk Analysis + +#### Severity Assessment: [1-5] - [Label] +[Rationale for severity rating, including potential financial exposure, operational impact, and reputational considerations] + +#### Likelihood Assessment: [1-5] - [Label] +[Rationale for likelihood rating, including precedent, triggering events, and current conditions] + +#### Risk Score: [Score] - [GREEN/YELLOW/ORANGE/RED] + +### 4. Contributing Factors +[What factors increase the risk] + +### 5. Mitigating Factors +[What factors decrease the risk or limit exposure] + +### 6. Mitigation Options + +| Option | Effectiveness | Cost/Effort | Recommended? | +|---|---|---|---| +| [Option 1] | [High/Med/Low] | [High/Med/Low] | [Yes/No] | +| [Option 2] | [High/Med/Low] | [High/Med/Low] | [Yes/No] | + +### 7. Recommended Approach +[Specific recommended course of action with rationale] + +### 8. Residual Risk +[Expected risk level after implementing recommended mitigations] + +### 9. Monitoring Plan +[How and how often the risk will be monitored; trigger events for re-assessment] + +### 10. Next Steps +1. [Action item 1 - Owner - Deadline] +2. 
[Action item 2 - Owner - Deadline] +``` + +### Risk Register Entry + +For tracking in the team's risk register: + +| Field | Content | +|---|---| +| Risk ID | Unique identifier | +| Date Identified | When the risk was first identified | +| Description | Brief description | +| Category | Contract, Regulatory, Litigation, IP, Data Privacy, Employment, Corporate, Other | +| Severity | 1-5 with label | +| Likelihood | 1-5 with label | +| Risk Score | Calculated score | +| Risk Level | GREEN / YELLOW / ORANGE / RED | +| Owner | Person responsible for monitoring | +| Mitigations | Current controls in place | +| Status | Open / Mitigated / Accepted / Closed | +| Review Date | Next scheduled review | +| Notes | Additional context | + +## When to Escalate to Outside Counsel + +Engage outside counsel when: + +### Mandatory Engagement +- **Active litigation**: Any lawsuit filed against or by the organization +- **Government investigation**: Any inquiry from a government agency, regulator, or law enforcement +- **Criminal exposure**: Any matter with potential criminal liability for the organization or its personnel +- **Securities issues**: Any matter that could affect securities disclosures or filings +- **Board-level matters**: Any matter requiring board notification or approval + +### Strongly Recommended Engagement +- **Novel legal issues**: Questions of first impression or unsettled law where the organization's position could set precedent +- **Jurisdictional complexity**: Matters involving unfamiliar jurisdictions or conflicting legal requirements across jurisdictions +- **Material financial exposure**: Risks with potential exposure exceeding the organization's risk tolerance thresholds +- **Specialized expertise needed**: Matters requiring deep domain expertise not available in-house (antitrust, FCPA, patent prosecution, etc.) +- **Regulatory changes**: New regulations that materially affect the business and require compliance program development +- **M&A transactions**: Due diligence, deal structuring, and regulatory approvals for significant transactions + +### Consider Engagement +- **Complex contract disputes**: Significant disagreements over contract interpretation with material counterparties +- **Employment matters**: Claims or potential claims involving discrimination, harassment, wrongful termination, or whistleblower protections +- **Data incidents**: Potential data breaches that may trigger notification obligations +- **IP disputes**: Infringement allegations (received or contemplated) involving material products or services +- **Insurance coverage disputes**: Disagreements with insurers over coverage for material claims + +### Selecting Outside Counsel + +When recommending outside counsel engagement, suggest the user consider: +- Relevant subject matter expertise +- Experience in the applicable jurisdiction +- Understanding of the organization's industry +- Conflict of interest clearance +- Budget expectations and fee arrangements (hourly, fixed fee, blended rates, success fees) +- Diversity and inclusion considerations +- Existing relationships (panel firms, prior engagements) diff --git a/code_puppy/bundled_skills/Legal/meeting-briefing/SKILL.md b/code_puppy/bundled_skills/Legal/meeting-briefing/SKILL.md new file mode 100644 index 00000000..5f629f19 --- /dev/null +++ b/code_puppy/bundled_skills/Legal/meeting-briefing/SKILL.md @@ -0,0 +1,220 @@ +--- +name: meeting-briefing +description: Prepare structured briefings for meetings with legal relevance and track resulting action items. 
Use when preparing for contract negotiations, board meetings, compliance reviews, or any meeting where legal context, background research, or action tracking is needed. +--- + +# Meeting Briefing Skill + +You are a meeting preparation assistant for an in-house legal team. You gather context from connected sources, prepare structured briefings for meetings with legal relevance, and help track action items that arise from meetings. + +**Important**: You assist with legal workflows but do not provide legal advice. Meeting briefings should be reviewed for accuracy and completeness before use. + +## Meeting Prep Methodology + +### Step 1: Identify the Meeting + +Determine the meeting context from the user's request or calendar: +- **Meeting title and type**: What kind of meeting is this? (deal review, board meeting, vendor call, team sync, client meeting, regulatory discussion) +- **Participants**: Who will be attending? What are their roles and interests? +- **Agenda**: Is there a formal agenda? What topics will be covered? +- **Your role**: What is the legal team member's role in this meeting? (advisor, presenter, observer, negotiator) +- **Preparation time**: How much time is available to prepare? + +### Step 2: Assess Preparation Needs + +Based on the meeting type, determine what preparation is needed: + +| Meeting Type | Key Prep Needs | +|---|---| +| **Deal Review** | Contract status, open issues, counterparty history, negotiation strategy, approval requirements | +| **Board / Committee** | Legal updates, risk register highlights, pending matters, regulatory developments, resolution drafts | +| **Vendor Call** | Agreement status, open issues, performance metrics, relationship history, negotiation objectives | +| **Team Sync** | Workload status, priority matters, resource needs, upcoming deadlines | +| **Client / Customer** | Agreement terms, support history, open issues, relationship context | +| **Regulatory / Government** | Matter background, compliance status, prior communications, counsel briefing | +| **Litigation / Dispute** | Case status, recent developments, strategy, settlement parameters | +| **Cross-Functional** | Legal implications of business decisions, risk assessment, compliance requirements | + +### Step 3: Gather Context from Connected Sources + +Pull relevant information from each connected source: + +#### Calendar +- Meeting details (time, duration, location/link, attendees) +- Prior meetings with the same participants (last 3 months) +- Related meetings or follow-ups scheduled +- Competing commitments or time constraints + +#### Email +- Recent correspondence with or about meeting participants +- Prior meeting follow-up threads +- Open action items from previous interactions +- Relevant documents shared via email + +#### Chat (e.g., Slack, Teams) +- Recent discussions about the meeting topic +- Messages from or about meeting participants +- Team discussions about related matters +- Relevant decisions or context shared in channels + +#### Documents (e.g., Box, Egnyte, SharePoint) +- Meeting agendas and prior meeting notes +- Relevant agreements, memos, or briefings +- Shared documents with meeting participants +- Draft materials for the meeting + +#### CLM (if connected) +- Relevant contracts with the counterparty +- Contract status and open negotiation items +- Approval workflow status +- Amendment or renewal history + +#### CRM (if connected) +- Account or opportunity information +- Relationship history and context +- Deal stage and key milestones +- Stakeholder map + 
+### Step 4: Synthesize into Briefing + +Organize gathered information into a structured briefing (see template below). + +### Step 5: Identify Preparation Gaps + +Flag anything that could not be found or verified: +- Sources that were not available +- Information that appears outdated +- Questions that remain unanswered +- Documents that could not be located + +## Briefing Template + +``` +## Meeting Brief + +### Meeting Details +- **Meeting**: [title] +- **Date/Time**: [date and time with timezone] +- **Duration**: [expected duration] +- **Location**: [physical location or video link] +- **Your Role**: [advisor / presenter / negotiator / observer] + +### Participants +| Name | Organization | Role | Key Interests | Notes | +|---|---|---|---|---| +| [name] | [org] | [role] | [what they care about] | [relevant context] | + +### Agenda / Expected Topics +1. [Topic 1] - [brief context] +2. [Topic 2] - [brief context] +3. [Topic 3] - [brief context] + +### Background and Context +[2-3 paragraph summary of the relevant history, current state, and why this meeting is happening] + +### Key Documents +- [Document 1] - [brief description and where to find it] +- [Document 2] - [brief description and where to find it] + +### Open Issues +| Issue | Status | Owner | Priority | Notes | +|---|---|---|---|---| +| [issue 1] | [status] | [who] | [H/M/L] | [context] | + +### Legal Considerations +[Specific legal issues, risks, or considerations relevant to the meeting topics] + +### Talking Points +1. [Key point to make, with supporting context] +2. [Key point to make, with supporting context] +3. [Key point to make, with supporting context] + +### Questions to Raise +- [Question 1] - [why this matters] +- [Question 2] - [why this matters] + +### Decisions Needed +- [Decision 1] - [options and recommendation] +- [Decision 2] - [options and recommendation] + +### Red Lines / Non-Negotiables +[If this is a negotiation meeting: positions that cannot be conceded] + +### Prior Meeting Follow-Up +[Outstanding action items from previous meetings with these participants] + +### Preparation Gaps +[Information that could not be found or verified; questions for the user] +``` + +## Meeting-Type Specific Guidance + +### Deal Review Meetings + +Additional briefing sections: +- **Deal summary**: Parties, deal value, structure, timeline +- **Contract status**: Where in the review/negotiation process; outstanding issues +- **Approval requirements**: What approvals are needed and from whom +- **Counterparty dynamics**: Their likely positions, recent communications, relationship temperature +- **Comparable deals**: Prior similar transactions and their terms (if available) + +### Board and Committee Meetings + +Additional briefing sections: +- **Legal department update**: Summary of matters, wins, new matters, closed matters +- **Risk highlights**: Top risks from the risk register with changes since last report +- **Regulatory update**: Material regulatory developments affecting the business +- **Pending approvals**: Resolutions or approvals needed from the board/committee +- **Litigation summary**: Active matters, reserves, settlements, new filings + +### Regulatory Meetings + +Additional briefing sections: +- **Regulatory body context**: Which regulator, what division, their current priorities and enforcement patterns +- **Matter history**: Prior interactions, submissions, correspondence timeline +- **Compliance posture**: Current compliance status on the relevant topics +- **Counsel coordination**: Outside counsel 
involvement, prior advice received +- **Privilege considerations**: What can and cannot be discussed; any privilege risks + +## Action Item Tracking + +### During/After the Meeting + +Help the user capture and organize action items from the meeting: + +``` +## Action Items from [Meeting Name] - [Date] + +| # | Action Item | Owner | Deadline | Priority | Status | +|---|---|---|---|---|---| +| 1 | [specific, actionable task] | [name] | [date] | [H/M/L] | Open | +| 2 | [specific, actionable task] | [name] | [date] | [H/M/L] | Open | +``` + +### Action Item Best Practices + +- **Be specific**: "Send redline of Section 4.2 to counterparty counsel" not "Follow up on contract" +- **Assign an owner**: Every action item must have exactly one owner (not a team or group) +- **Set a deadline**: Every action item needs a specific date, not "soon" or "ASAP" +- **Note dependencies**: If an action item depends on another action or external input, note it +- **Distinguish types**: + - Legal team actions (things the legal team needs to do) + - Business team actions (things to communicate to business stakeholders) + - External actions (things the counterparty or outside counsel needs to do) + - Follow-up meetings (meetings that need to be scheduled) + +### Follow-Up + +After the meeting: +1. **Distribute action items** to all participants (via email or the appropriate channel) +2. **Set calendar reminders** for deadlines +3. **Update relevant systems** (CLM, matter management, risk register) with meeting outcomes +4. **File meeting notes** in the appropriate document repository +5. **Flag urgent items** that need immediate attention + +### Tracking Cadence + +- **High priority items**: Check daily until completed +- **Medium priority items**: Check at next team sync or weekly review +- **Low priority items**: Check at next scheduled meeting or monthly review +- **Overdue items**: Escalate to the owner and their manager; flag in next relevant meeting diff --git a/code_puppy/bundled_skills/Legal/nda-triage/SKILL.md b/code_puppy/bundled_skills/Legal/nda-triage/SKILL.md new file mode 100644 index 00000000..52e1067f --- /dev/null +++ b/code_puppy/bundled_skills/Legal/nda-triage/SKILL.md @@ -0,0 +1,164 @@ +--- +name: nda-triage +description: Screen incoming NDAs and classify them as GREEN (standard), YELLOW (needs review), or RED (significant issues). Use when a new NDA comes in from sales or business development, when assessing NDA risk level, or when deciding whether an NDA needs full counsel review. +--- + +# NDA Triage Skill + +You are an NDA screening assistant for an in-house legal team. You rapidly evaluate incoming NDAs against standard criteria, classify them by risk level, and provide routing recommendations. + +**Important**: You assist with legal workflows but do not provide legal advice. All analysis should be reviewed by qualified legal professionals before being relied upon. + +## NDA Screening Criteria and Checklist + +When triaging an NDA, evaluate each of the following criteria systematically: + +### 1. Agreement Structure +- [ ] **Type identified**: Mutual NDA, Unilateral (disclosing party), or Unilateral (receiving party) +- [ ] **Appropriate for context**: Is the NDA type appropriate for the business relationship? (e.g., mutual for exploratory discussions, unilateral for one-way disclosures) +- [ ] **Standalone agreement**: Confirm the NDA is a standalone agreement, not a confidentiality section embedded in a larger commercial agreement + +### 2. 
Definition of Confidential Information +- [ ] **Reasonable scope**: Not overbroad (avoid "all information of any kind whether or not marked as confidential") +- [ ] **Marking requirements**: If marking is required, is it workable? (Written marking within 30 days of oral disclosure is standard) +- [ ] **Exclusions present**: Standard exclusions defined (see Standard Carveouts below) +- [ ] **No problematic inclusions**: Does not define publicly available information or independently developed materials as confidential + +### 3. Obligations of Receiving Party +- [ ] **Standard of care**: Reasonable care or at least the same care as for own confidential information +- [ ] **Use restriction**: Limited to the stated purpose +- [ ] **Disclosure restriction**: Limited to those with need to know who are bound by similar obligations +- [ ] **No onerous obligations**: No requirements that are impractical (e.g., encrypting all communications, maintaining physical logs) + +### 4. Standard Carveouts +All of the following carveouts should be present: +- [ ] **Public knowledge**: Information that is or becomes publicly available through no fault of the receiving party +- [ ] **Prior possession**: Information already known to the receiving party before disclosure +- [ ] **Independent development**: Information independently developed without use of or reference to confidential information +- [ ] **Third-party receipt**: Information rightfully received from a third party without restriction +- [ ] **Legal compulsion**: Right to disclose when required by law, regulation, or legal process (with notice to the disclosing party where legally permitted) + +### 5. Permitted Disclosures +- [ ] **Employees**: Can share with employees who need to know +- [ ] **Contractors/advisors**: Can share with contractors, advisors, and professional consultants under similar confidentiality obligations +- [ ] **Affiliates**: Can share with affiliates (if needed for the business purpose) +- [ ] **Legal/regulatory**: Can disclose as required by law or regulation + +### 6. Term and Duration +- [ ] **Agreement term**: Reasonable period for the business relationship (1-3 years is standard) +- [ ] **Confidentiality survival**: Obligations survive for a reasonable period after termination (2-5 years is standard; trade secrets may be longer) +- [ ] **Not perpetual**: Avoid indefinite or perpetual confidentiality obligations (exception: trade secrets, which may warrant longer protection) + +### 7. Return and Destruction +- [ ] **Obligation triggered**: On termination or upon request +- [ ] **Reasonable scope**: Return or destroy confidential information and all copies +- [ ] **Retention exception**: Allows retention of copies required by law, regulation, or internal compliance/backup policies +- [ ] **Certification**: Certification of destruction is reasonable; sworn affidavit is onerous + +### 8. Remedies +- [ ] **Injunctive relief**: Acknowledgment that breach may cause irreparable harm and equitable relief may be appropriate is standard +- [ ] **No pre-determined damages**: Avoid liquidated damages clauses in NDAs +- [ ] **Not one-sided**: Remedies provisions apply equally to both parties (in mutual NDAs) + +### 9. 
Problematic Provisions to Flag +- [ ] **No non-solicitation**: NDA should not contain employee non-solicitation provisions +- [ ] **No non-compete**: NDA should not contain non-compete provisions +- [ ] **No exclusivity**: NDA should not restrict either party from entering similar discussions with others +- [ ] **No standstill**: NDA should not contain standstill or similar restrictive provisions (unless M&A context) +- [ ] **No residuals clause** (or narrowly scoped): If a residuals clause is present, it should be limited to information retained in unaided memory of individuals and should not apply to trade secrets or patented information +- [ ] **No IP assignment or license**: NDA should not grant any intellectual property rights +- [ ] **No audit rights**: Unusual in standard NDAs + +### 10. Governing Law and Jurisdiction +- [ ] **Reasonable jurisdiction**: A well-established commercial jurisdiction +- [ ] **Consistent**: Governing law and jurisdiction should be in the same or related jurisdictions +- [ ] **No mandatory arbitration** (in standard NDAs): Litigation is generally preferred for NDA disputes + +## GREEN / YELLOW / RED Classification Rules + +### GREEN -- Standard Approval + +**All** of the following must be true: +- NDA is mutual (or unilateral in the appropriate direction) +- All standard carveouts are present +- Term is within standard range (1-3 years, survival 2-5 years) +- No non-solicitation, non-compete, or exclusivity provisions +- No residuals clause, or residuals clause is narrowly scoped +- Reasonable governing law jurisdiction +- Standard remedies (no liquidated damages) +- Permitted disclosures include employees, contractors, and advisors +- Return/destruction provisions include retention exception for legal/compliance +- Definition of confidential information is reasonably scoped + +**Routing**: Approve via standard delegation of authority. No counsel review required. + +### YELLOW -- Counsel Review Needed + +**One or more** of the following are present, but the NDA is not fundamentally problematic: +- Definition of confidential information is broader than preferred but not unreasonable +- Term is longer than standard but within market range (e.g., 5 years for agreement term, 7 years for survival) +- Missing one standard carveout that could be added without difficulty +- Residuals clause present but narrowly scoped to unaided memory +- Governing law in an acceptable but non-preferred jurisdiction +- Minor asymmetry in a mutual NDA (e.g., one party has slightly broader permitted disclosures) +- Marking requirements present but workable +- Return/destruction lacks explicit retention exception (likely implied but should be added) +- Unusual but non-harmful provisions (e.g., obligation to notify of potential breach) + +**Routing**: Flag specific issues for counsel review. Counsel can likely resolve with minor redlines in a single review pass. 
+ +### RED -- Significant Issues + +**One or more** of the following are present: +- **Unilateral when mutual is required** (or wrong direction for the relationship) +- **Missing critical carveouts** (especially independent development or legal compulsion) +- **Non-solicitation or non-compete provisions** embedded in the NDA +- **Exclusivity or standstill provisions** without appropriate business context +- **Unreasonable term** (10+ years, or perpetual without trade secret justification) +- **Overbroad definition** that could capture public information or independently developed materials +- **Broad residuals clause** that effectively creates a license to use confidential information +- **IP assignment or license grant** hidden in the NDA +- **Liquidated damages or penalty provisions** +- **Audit rights** without reasonable scope or notice requirements +- **Highly unfavorable jurisdiction** with mandatory arbitration +- **The document is not actually an NDA** (contains substantive commercial terms, exclusivity, or other obligations beyond confidentiality) + +**Routing**: Full legal review required. Do not sign. Requires negotiation, counterproposal with the organization's standard form NDA, or rejection. + +## Common NDA Issues and Standard Positions + +### Issue: Overbroad Definition of Confidential Information +**Standard position**: Confidential information should be limited to non-public information disclosed in connection with the stated purpose, with clear exclusions. +**Redline approach**: Narrow the definition to information that is marked or identified as confidential, or that a reasonable person would understand to be confidential given the nature of the information and circumstances of disclosure. + +### Issue: Missing Independent Development Carveout +**Standard position**: Must include a carveout for information independently developed without reference to or use of the disclosing party's confidential information. +**Risk if missing**: Could create claims that internally-developed products or features were derived from the counterparty's confidential information. +**Redline approach**: Add standard independent development carveout. + +### Issue: Non-Solicitation of Employees +**Standard position**: Non-solicitation provisions do not belong in NDAs. They are appropriate in employment agreements, M&A agreements, or specific commercial agreements. +**Redline approach**: Delete the provision entirely. If the counterparty insists, limit to targeted solicitation (not general recruitment) and set a short term (12 months). + +### Issue: Broad Residuals Clause +**Standard position**: Resist residuals clauses. If required, limit to: (a) general ideas, concepts, know-how, or techniques retained in the unaided memory of individuals who had authorized access; (b) explicitly exclude trade secrets and patentable information; (c) does not grant any IP license. +**Risk if too broad**: Effectively grants a license to use the disclosing party's confidential information for any purpose. + +### Issue: Perpetual Confidentiality Obligation +**Standard position**: 2-5 years from disclosure or termination, whichever is later. Trade secrets may warrant protection for as long as they remain trade secrets. +**Redline approach**: Replace perpetual obligation with a defined term. Offer a trade secret carveout for longer protection of qualifying information. 
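+
+## Classification Logic (Illustrative Sketch)
+
+The GREEN / YELLOW / RED rules above reduce to a simple ordering: any RED issue forces RED, otherwise any YELLOW issue forces YELLOW, otherwise GREEN. The sketch below is illustrative only — the flag names are hypothetical and do not correspond to any existing tooling; real triage still requires working through the full checklist and applying judgment.
+
+```python
+# Minimal sketch of the GREEN/YELLOW/RED ordering described above.
+# Flag names are hypothetical; they would be derived from the screening checklist.
+def classify_nda(flags: dict) -> str:
+    red_triggers = (
+        "wrong_nda_direction", "missing_critical_carveout", "non_solicit_or_non_compete",
+        "exclusivity_or_standstill", "unreasonable_term", "overbroad_definition",
+        "broad_residuals_clause", "ip_assignment_or_license", "liquidated_damages",
+        "not_actually_an_nda",
+    )
+    yellow_triggers = (
+        "broader_than_preferred_definition", "longer_but_market_term", "one_missing_carveout",
+        "narrow_residuals_clause", "non_preferred_jurisdiction", "minor_asymmetry",
+        "marking_requirements", "missing_retention_exception",
+    )
+    if any(flags.get(t) for t in red_triggers):
+        return "RED"      # full legal review; do not sign as-is
+    if any(flags.get(t) for t in yellow_triggers):
+        return "YELLOW"   # counsel review with the specific issues flagged
+    return "GREEN"        # approve per standard delegation of authority
+```
+
+The ordering matters: a single RED trigger overrides any number of otherwise-acceptable terms.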
+ +## Routing Recommendations + +After classification, recommend the appropriate next step: + +| Classification | Recommended Action | Typical Timeline | +|---|---|---| +| GREEN | Approve and route for signature per delegation of authority | Same day | +| YELLOW | Send to designated reviewer with specific issues flagged | 1-2 business days | +| RED | Engage counsel for full review; prepare counterproposal or standard form | 3-5 business days | + +For YELLOW and RED classifications: +- Identify the specific person or role that should review (if the organization has defined routing rules) +- Include a brief summary of issues suitable for the reviewer to quickly understand the key points +- If the organization has a standard form NDA, recommend sending it as a counterproposal for RED-classified NDAs diff --git a/code_puppy/bundled_skills/Office/docx/SKILL.md b/code_puppy/bundled_skills/Office/docx/SKILL.md new file mode 100644 index 00000000..924b412d --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/SKILL.md @@ -0,0 +1,197 @@ +--- +name: docx +description: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Ticca needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks" +license: Proprietary. LICENSE.txt has complete terms +--- + +# DOCX creation, editing, and analysis + +## Overview + +A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks. + +## Workflow Decision Tree + +### Reading/Analyzing Content +Use "Text extraction" or "Raw XML access" sections below + +### Creating New Document +Use "Creating a new Word document" workflow + +### Editing Existing Document +- **Your own document + simple changes** + Use "Basic OOXML editing" workflow + +- **Someone else's document** + Use **"Redlining workflow"** (recommended default) + +- **Legal, academic, business, or government docs** + Use **"Redlining workflow"** (required) + +## Reading and analyzing content + +### Text extraction +If you just need to read the text contents of a document, you should convert the document to markdown using pandoc. Pandoc provides excellent support for preserving document structure and can show tracked changes: + +```bash +# Convert document to markdown with tracked changes +pandoc --track-changes=all path-to-file.docx -o output.md +# Options: --track-changes=accept/reject/all +``` + +### Raw XML access +You need raw XML access for: comments, complex formatting, document structure, embedded media, and metadata. For any of these features, you'll need to unpack a document and read its raw XML contents. + +#### Unpacking a file +`python ooxml/scripts/unpack.py ` + +#### Key file structures +* `word/document.xml` - Main document contents +* `word/comments.xml` - Comments referenced in document.xml +* `word/media/` - Embedded images and media files +* Tracked changes use `` (insertions) and `` (deletions) tags + +## Creating a new Word document + +When creating a new Word document from scratch, use **docx-js**, which allows you to create Word documents using JavaScript/TypeScript. + +### Workflow +1. 
**MANDATORY - READ ENTIRE FILE**: Read [`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with document creation. +2. Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components (You can assume all dependencies are installed, but if not, refer to the dependencies section below) +3. Export as .docx using Packer.toBuffer() + +## Editing an existing Word document + +When editing an existing Word document, use the **Document library** (a Python library for OOXML manipulation). The library automatically handles infrastructure setup and provides methods for document manipulation. For complex scenarios, you can access the underlying DOM directly through the library. + +### Workflow +1. **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for the Document library API and XML patterns for directly editing document files. +2. Unpack the document: `python ooxml/scripts/unpack.py ` +3. Create and run a Python script using the Document library (set PYTHONPATH per "Initialization" section in ooxml.md) +4. Pack the final document: `python ooxml/scripts/pack.py ` + +The Document library provides both high-level methods for common operations and direct DOM access for complex scenarios. + +## Redlining workflow for document review + +This workflow allows you to plan comprehensive tracked changes using markdown before implementing them in OOXML. **CRITICAL**: For complete tracked changes, you must implement ALL changes systematically. + +**Batching Strategy**: Group related changes into batches of 3-10 changes. This makes debugging manageable while maintaining efficiency. Test each batch before moving to the next. + +**Principle: Minimal, Precise Edits** +When implementing tracked changes, only mark text that actually changes. Repeating unchanged text makes edits harder to review and appears unprofessional. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text by extracting the `` element from the original and reusing it. + +Example - Changing "30 days" to "60 days" in a sentence: +```python +# BAD - Replaces entire sentence +'The term is 30 days.The term is 60 days.' + +# GOOD - Only marks what changed, preserves original for unchanged text +'The term is 3060 days.' +``` + +### Tracked changes workflow + +1. **Get markdown representation**: Convert document to markdown with tracked changes preserved: + ```bash + pandoc --track-changes=all path-to-file.docx -o current.md + ``` + +2. 
**Identify and group changes**: Review the document and identify ALL changes needed, organizing them into logical batches: + + **Location methods** (for finding changes in XML): + - Section/heading numbers (e.g., "Section 3.2", "Article IV") + - Paragraph identifiers if numbered + - Grep patterns with unique surrounding text + - Document structure (e.g., "first paragraph", "signature block") + - **DO NOT use markdown line numbers** - they don't map to XML structure + + **Batch organization** (group 3-10 related changes per batch): + - By section: "Batch 1: Section 2 amendments", "Batch 2: Section 5 updates" + - By type: "Batch 1: Date corrections", "Batch 2: Party name changes" + - By complexity: Start with simple text replacements, then tackle complex structural changes + - Sequential: "Batch 1: Pages 1-3", "Batch 2: Pages 4-6" + +3. **Read documentation and unpack**: + - **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for the Document library API and XML patterns for directly editing document files. + - **Unpack the document**: `python ooxml/scripts/unpack.py ` + - **Note the suggested RSID**: The unpack script will suggest an RSID to use for your tracked changes. Copy this RSID for use in step 4b. + +4. **Implement changes in batches**: Group changes logically (by section, by type, or by proximity) and implement them together in a single script. This approach: + - Makes debugging easier (smaller batch = easier to isolate errors) + - Allows incremental progress + - Maintains efficiency (batch size of 3-10 changes works well) + + **Suggested batch groupings:** + - By document section (e.g., "Section 3 changes", "Definitions", "Termination clause") + - By change type (e.g., "Date changes", "Party name updates", "Legal term replacements") + - By proximity (e.g., "Changes on pages 1-3", "Changes in first half of document") + + For each batch of related changes: + + **a. Map text to XML**: Grep for text in `word/document.xml` to verify how text is split across `` elements. + + **b. Create and run script**: Set PYTHONPATH and import Document library (see "Initialization" in ooxml.md), then use `get_node` to find nodes, implement changes, and `doc.save()`. See **"Document Library"** section in ooxml.md for patterns. + + **Note**: Always grep `word/document.xml` immediately before writing a script to get current line numbers and verify text content. Line numbers change after each script run. + +5. **Pack the document**: After all batches are complete, convert the unpacked directory back to .docx: + ```bash + python ooxml/scripts/pack.py unpacked reviewed-document.docx + ``` + +6. **Final verification**: Do a comprehensive check of the complete document: + - Convert final document to markdown: + ```bash + pandoc --track-changes=all reviewed-document.docx -o verification.md + ``` + - Verify ALL changes were applied correctly: + ```bash + grep "original phrase" verification.md # Should NOT find it + grep "replacement phrase" verification.md # Should find it + ``` + - Check that no unintended changes were introduced + + +## Converting Documents to Images + +To visually analyze Word documents, convert them to images using a two-step process: + +1. **Convert DOCX to PDF**: + ```bash + soffice --headless --convert-to pdf document.docx + ``` + +2. 
**Convert PDF pages to JPEG images**: + ```bash + pdftoppm -jpeg -r 150 document.pdf page + ``` + This creates files like `page-1.jpg`, `page-2.jpg`, etc. + +Options: +- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance) +- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred) +- `-f N`: First page to convert (e.g., `-f 2` starts from page 2) +- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5) +- `page`: Prefix for output files + +Example for specific range: +```bash +pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page # Converts only pages 2-5 +``` + +## Code Style Guidelines +**IMPORTANT**: When generating code for DOCX operations: +- Write concise code +- Avoid verbose variable names and redundant operations +- Avoid unnecessary print statements + +## Dependencies + +Required dependencies (install if not available): + +- **pandoc**: `sudo apt-get install pandoc` (for text extraction) +- **docx**: `npm install -g docx` (for creating new documents) +- **LibreOffice**: `sudo apt-get install libreoffice` (for PDF conversion) +- **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images) +- **defusedxml**: `pip install defusedxml` (for secure XML parsing) \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/docx/docx-js.md b/code_puppy/bundled_skills/Office/docx/docx-js.md new file mode 100644 index 00000000..c6d7b2dd --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/docx-js.md @@ -0,0 +1,350 @@ +# DOCX Library Tutorial + +Generate .docx files with JavaScript/TypeScript. + +**Important: Read this entire document before starting.** Critical formatting rules and common pitfalls are covered throughout - skipping sections may result in corrupted files or rendering issues. 
+ +## Setup +Assumes docx is already installed globally +If not installed: `npm install -g docx` + +```javascript +const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun, Media, + Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink, + InternalHyperlink, TableOfContents, HeadingLevel, BorderStyle, WidthType, TabStopType, + TabStopPosition, UnderlineType, ShadingType, VerticalAlign, SymbolRun, PageNumber, + FootnoteReferenceRun, Footnote, PageBreak } = require('docx'); + +// Create & Save +const doc = new Document({ sections: [{ children: [/* content */] }] }); +Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer)); // Node.js +Packer.toBlob(doc).then(blob => { /* download logic */ }); // Browser +``` + +## Text & Formatting +```javascript +// IMPORTANT: Never use \n for line breaks - always use separate Paragraph elements +// ❌ WRONG: new TextRun("Line 1\nLine 2") +// ✅ CORRECT: new Paragraph({ children: [new TextRun("Line 1")] }), new Paragraph({ children: [new TextRun("Line 2")] }) + +// Basic text with all formatting options +new Paragraph({ + alignment: AlignmentType.CENTER, + spacing: { before: 200, after: 200 }, + indent: { left: 720, right: 720 }, + children: [ + new TextRun({ text: "Bold", bold: true }), + new TextRun({ text: "Italic", italics: true }), + new TextRun({ text: "Underlined", underline: { type: UnderlineType.DOUBLE, color: "FF0000" } }), + new TextRun({ text: "Colored", color: "FF0000", size: 28, font: "Arial" }), // Arial default + new TextRun({ text: "Highlighted", highlight: "yellow" }), + new TextRun({ text: "Strikethrough", strike: true }), + new TextRun({ text: "x2", superScript: true }), + new TextRun({ text: "H2O", subScript: true }), + new TextRun({ text: "SMALL CAPS", smallCaps: true }), + new SymbolRun({ char: "2022", font: "Symbol" }), // Bullet • + new SymbolRun({ char: "00A9", font: "Arial" }) // Copyright © - Arial for symbols + ] +}) +``` + +## Styles & Professional Formatting + +```javascript +const doc = new Document({ + styles: { + default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default + paragraphStyles: [ + // Document title style - override built-in Title style + { id: "Title", name: "Title", basedOn: "Normal", + run: { size: 56, bold: true, color: "000000", font: "Arial" }, + paragraph: { spacing: { before: 240, after: 120 }, alignment: AlignmentType.CENTER } }, + // IMPORTANT: Override built-in heading styles by using their exact IDs + { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true, + run: { size: 32, bold: true, color: "000000", font: "Arial" }, // 16pt + paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // Required for TOC + { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true, + run: { size: 28, bold: true, color: "000000", font: "Arial" }, // 14pt + paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } }, + // Custom styles use your own IDs + { id: "myStyle", name: "My Style", basedOn: "Normal", + run: { size: 28, bold: true, color: "000000" }, + paragraph: { spacing: { after: 120 }, alignment: AlignmentType.CENTER } } + ], + characterStyles: [{ id: "myCharStyle", name: "My Char Style", + run: { color: "FF0000", bold: true, underline: { type: UnderlineType.SINGLE } } }] + }, + sections: [{ + properties: { page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } }, + children: [ + new Paragraph({ heading: 
HeadingLevel.TITLE, children: [new TextRun("Document Title")] }), // Uses overridden Title style + new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Heading 1")] }), // Uses overridden Heading1 style + new Paragraph({ style: "myStyle", children: [new TextRun("Custom paragraph style")] }), + new Paragraph({ children: [ + new TextRun("Normal with "), + new TextRun({ text: "custom char style", style: "myCharStyle" }) + ]}) + ] + }] +}); +``` + +**Professional Font Combinations:** +- **Arial (Headers) + Arial (Body)** - Most universally supported, clean and professional +- **Times New Roman (Headers) + Arial (Body)** - Classic serif headers with modern sans-serif body +- **Georgia (Headers) + Verdana (Body)** - Optimized for screen reading, elegant contrast + +**Key Styling Principles:** +- **Override built-in styles**: Use exact IDs like "Heading1", "Heading2", "Heading3" to override Word's built-in heading styles +- **HeadingLevel constants**: `HeadingLevel.HEADING_1` uses "Heading1" style, `HeadingLevel.HEADING_2` uses "Heading2" style, etc. +- **Include outlineLevel**: Set `outlineLevel: 0` for H1, `outlineLevel: 1` for H2, etc. to ensure TOC works correctly +- **Use custom styles** instead of inline formatting for consistency +- **Set a default font** using `styles.default.document.run.font` - Arial is universally supported +- **Establish visual hierarchy** with different font sizes (titles > headers > body) +- **Add proper spacing** with `before` and `after` paragraph spacing +- **Use colors sparingly**: Default to black (000000) and shades of gray for titles and headings (heading 1, heading 2, etc.) +- **Set consistent margins** (1440 = 1 inch is standard) + + +## Lists (ALWAYS USE PROPER LISTS - NEVER USE UNICODE BULLETS) +```javascript +// Bullets - ALWAYS use the numbering config, NOT unicode symbols +// CRITICAL: Use LevelFormat.BULLET constant, NOT the string "bullet" +const doc = new Document({ + numbering: { + config: [ + { reference: "bullet-list", + levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT, + style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }, + { reference: "first-numbered-list", + levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT, + style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }, + { reference: "second-numbered-list", // Different reference = restarts at 1 + levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT, + style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] } + ] + }, + sections: [{ + children: [ + // Bullet list items + new Paragraph({ numbering: { reference: "bullet-list", level: 0 }, + children: [new TextRun("First bullet point")] }), + new Paragraph({ numbering: { reference: "bullet-list", level: 0 }, + children: [new TextRun("Second bullet point")] }), + // Numbered list items + new Paragraph({ numbering: { reference: "first-numbered-list", level: 0 }, + children: [new TextRun("First numbered item")] }), + new Paragraph({ numbering: { reference: "first-numbered-list", level: 0 }, + children: [new TextRun("Second numbered item")] }), + // ⚠️ CRITICAL: Different reference = INDEPENDENT list that restarts at 1 + // Same reference = CONTINUES previous numbering + new Paragraph({ numbering: { reference: "second-numbered-list", level: 0 }, + children: [new TextRun("Starts at 1 again (because different reference)")] }) + ] + }] +}); + +// ⚠️ CRITICAL NUMBERING RULE: Each 
reference creates an INDEPENDENT numbered list +// - Same reference = continues numbering (1, 2, 3... then 4, 5, 6...) +// - Different reference = restarts at 1 (1, 2, 3... then 1, 2, 3...) +// Use unique reference names for each separate numbered section! + +// ⚠️ CRITICAL: NEVER use unicode bullets - they create fake lists that don't work properly +// new TextRun("• Item") // WRONG +// new SymbolRun({ char: "2022" }) // WRONG +// ✅ ALWAYS use numbering config with LevelFormat.BULLET for real Word lists +``` + +## Tables +```javascript +// Complete table with margins, borders, headers, and bullet points +const tableBorder = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" }; +const cellBorders = { top: tableBorder, bottom: tableBorder, left: tableBorder, right: tableBorder }; + +new Table({ + columnWidths: [4680, 4680], // ⚠️ CRITICAL: Set column widths at table level - values in DXA (twentieths of a point) + margins: { top: 100, bottom: 100, left: 180, right: 180 }, // Set once for all cells + rows: [ + new TableRow({ + tableHeader: true, + children: [ + new TableCell({ + borders: cellBorders, + width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell + // ⚠️ CRITICAL: Always use ShadingType.CLEAR to prevent black backgrounds in Word. + shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, + verticalAlign: VerticalAlign.CENTER, + children: [new Paragraph({ + alignment: AlignmentType.CENTER, + children: [new TextRun({ text: "Header", bold: true, size: 22 })] + })] + }), + new TableCell({ + borders: cellBorders, + width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell + shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, + children: [new Paragraph({ + alignment: AlignmentType.CENTER, + children: [new TextRun({ text: "Bullet Points", bold: true, size: 22 })] + })] + }) + ] + }), + new TableRow({ + children: [ + new TableCell({ + borders: cellBorders, + width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell + children: [new Paragraph({ children: [new TextRun("Regular data")] })] + }), + new TableCell({ + borders: cellBorders, + width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell + children: [ + new Paragraph({ + numbering: { reference: "bullet-list", level: 0 }, + children: [new TextRun("First bullet point")] + }), + new Paragraph({ + numbering: { reference: "bullet-list", level: 0 }, + children: [new TextRun("Second bullet point")] + }) + ] + }) + ] + }) + ] +}) +``` + +**IMPORTANT: Table Width & Borders** +- Use BOTH `columnWidths: [width1, width2, ...]` array AND `width: { size: X, type: WidthType.DXA }` on each cell +- Values in DXA (twentieths of a point): 1440 = 1 inch, Letter usable width = 9360 DXA (with 1" margins) +- Apply borders to individual `TableCell` elements, NOT the `Table` itself + +**Precomputed Column Widths (Letter size with 1" margins = 9360 DXA total):** +- **2 columns:** `columnWidths: [4680, 4680]` (equal width) +- **3 columns:** `columnWidths: [3120, 3120, 3120]` (equal width) + +## Links & Navigation +```javascript +// TOC (requires headings) - CRITICAL: Use HeadingLevel only, NOT custom styles +// ❌ WRONG: new Paragraph({ heading: HeadingLevel.HEADING_1, style: "customHeader", children: [new TextRun("Title")] }) +// ✅ CORRECT: new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }) +new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" }), + +// External link +new Paragraph({ + children: [new ExternalHyperlink({ + 
children: [new TextRun({ text: "Google", style: "Hyperlink" })], + link: "https://www.google.com" + })] +}), + +// Internal link & bookmark +new Paragraph({ + children: [new InternalHyperlink({ + children: [new TextRun({ text: "Go to Section", style: "Hyperlink" })], + anchor: "section1" + })] +}), +new Paragraph({ + children: [new TextRun("Section Content")], + bookmark: { id: "section1", name: "section1" } +}), +``` + +## Images & Media +```javascript +// Basic image with sizing & positioning +// CRITICAL: Always specify 'type' parameter - it's REQUIRED for ImageRun +new Paragraph({ + alignment: AlignmentType.CENTER, + children: [new ImageRun({ + type: "png", // NEW REQUIREMENT: Must specify image type (png, jpg, jpeg, gif, bmp, svg) + data: fs.readFileSync("image.png"), + transformation: { width: 200, height: 150, rotation: 0 }, // rotation in degrees + altText: { title: "Logo", description: "Company logo", name: "Name" } // IMPORTANT: All three fields are required + })] +}) +``` + +## Page Breaks +```javascript +// Manual page break +new Paragraph({ children: [new PageBreak()] }), + +// Page break before paragraph +new Paragraph({ + pageBreakBefore: true, + children: [new TextRun("This starts on a new page")] +}) + +// ⚠️ CRITICAL: NEVER use PageBreak standalone - it will create invalid XML that Word cannot open +// ❌ WRONG: new PageBreak() +// ✅ CORRECT: new Paragraph({ children: [new PageBreak()] }) +``` + +## Headers/Footers & Page Setup +```javascript +const doc = new Document({ + sections: [{ + properties: { + page: { + margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 }, // 1440 = 1 inch + size: { orientation: PageOrientation.LANDSCAPE }, + pageNumbers: { start: 1, formatType: "decimal" } // "upperRoman", "lowerRoman", "upperLetter", "lowerLetter" + } + }, + headers: { + default: new Header({ children: [new Paragraph({ + alignment: AlignmentType.RIGHT, + children: [new TextRun("Header Text")] + })] }) + }, + footers: { + default: new Footer({ children: [new Paragraph({ + alignment: AlignmentType.CENTER, + children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] }), new TextRun(" of "), new TextRun({ children: [PageNumber.TOTAL_PAGES] })] + })] }) + }, + children: [/* content */] + }] +}); +``` + +## Tabs +```javascript +new Paragraph({ + tabStops: [ + { type: TabStopType.LEFT, position: TabStopPosition.MAX / 4 }, + { type: TabStopType.CENTER, position: TabStopPosition.MAX / 2 }, + { type: TabStopType.RIGHT, position: TabStopPosition.MAX * 3 / 4 } + ], + children: [new TextRun("Left\tCenter\tRight")] +}) +``` + +## Constants & Quick Reference +- **Underlines:** `SINGLE`, `DOUBLE`, `WAVY`, `DASH` +- **Borders:** `SINGLE`, `DOUBLE`, `DASHED`, `DOTTED` +- **Numbering:** `DECIMAL` (1,2,3), `UPPER_ROMAN` (I,II,III), `LOWER_LETTER` (a,b,c) +- **Tabs:** `LEFT`, `CENTER`, `RIGHT`, `DECIMAL` +- **Symbols:** `"2022"` (•), `"00A9"` (©), `"00AE"` (®), `"2122"` (™), `"00B0"` (°), `"F070"` (✓), `"F0FC"` (✗) + +## Critical Issues & Common Mistakes +- **CRITICAL: PageBreak must ALWAYS be inside a Paragraph** - standalone PageBreak creates invalid XML that Word cannot open +- **ALWAYS use ShadingType.CLEAR for table cell shading** - Never use ShadingType.SOLID (causes black background). 
+- Measurements in DXA (1440 = 1 inch) | Each table cell needs ≥1 Paragraph | TOC requires HeadingLevel styles only +- **ALWAYS use custom styles** with Arial font for professional appearance and proper visual hierarchy +- **ALWAYS set a default font** using `styles.default.document.run.font` - Arial recommended +- **ALWAYS use columnWidths array for tables** + individual cell widths for compatibility +- **NEVER use unicode symbols for bullets** - always use proper numbering configuration with `LevelFormat.BULLET` constant (NOT the string "bullet") +- **NEVER use \n for line breaks anywhere** - always use separate Paragraph elements for each line +- **ALWAYS use TextRun objects within Paragraph children** - never use text property directly on Paragraph +- **CRITICAL for images**: ImageRun REQUIRES `type` parameter - always specify "png", "jpg", "jpeg", "gif", "bmp", or "svg" +- **CRITICAL for bullets**: Must use `LevelFormat.BULLET` constant, not string "bullet", and include `text: "•"` for the bullet character +- **CRITICAL for numbering**: Each numbering reference creates an INDEPENDENT list. Same reference = continues numbering (1,2,3 then 4,5,6). Different reference = restarts at 1 (1,2,3 then 1,2,3). Use unique reference names for each separate numbered section! +- **CRITICAL for TOC**: When using TableOfContents, headings must use HeadingLevel ONLY - do NOT add custom styles to heading paragraphs or TOC will break +- **Tables**: Set `columnWidths` array + individual cell widths, apply borders to cells not table +- **Set table margins at TABLE level** for consistent cell padding (avoids repetition per cell) \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/docx/ooxml.md b/code_puppy/bundled_skills/Office/docx/ooxml.md new file mode 100644 index 00000000..9fbbf83f --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml.md @@ -0,0 +1,632 @@ +# Office Open XML Technical Reference + +**Important: Read this entire document before starting.** This document covers: +- [Technical Guidelines](#technical-guidelines) - Schema compliance rules and validation requirements +- [Document Content Patterns](#document-content-patterns) - XML patterns for headings, lists, tables, formatting, etc. +- [Document Library (Python)](#document-library-python) - Recommended approach for OOXML manipulation with automatic infrastructure setup +- [Tracked Changes (Redlining)](#tracked-changes-redlining) - XML patterns for implementing tracked changes + +## Technical Guidelines + +### Schema Compliance +- **Element ordering in ``**: ``, ``, ``, ``, ``, ``, ``, then `` last +- **Element ordering in ``**: All regular properties (``, ``, ``, ``) must come before `` or ``, which must be last. 
No properties can follow tracked change elements +- **Whitespace**: Add `xml:space='preserve'` to `` elements with leading/trailing spaces +- **Unicode**: Escape characters in ASCII content: `"` becomes `“` + - **Character encoding reference**: Curly quotes `""` become `“”`, apostrophe `'` becomes `’`, em-dash `—` becomes `—` +- **Tracked changes**: Use `` and `` tags with `w:author="Ticca"` outside `` elements + - **Critical**: `` closes with ``, `` closes with `` - never mix + - **RSIDs must be 8-digit hex**: Use values like `00AB1234` (only 0-9, A-F characters) + - **trackRevisions placement**: Add `` after `` in settings.xml +- **Images**: Add to `word/media/`, reference in `document.xml`, set dimensions to prevent overflow + +## Document Content Patterns + +### Basic Structure +```xml + + Text content + +``` + +### Headings and Styles +```xml + + + + + + Document Title + + + + + Section Heading + +``` + +### Text Formatting +```xml + +Bold + +Italic + +Underlined + +Highlighted +``` + +### Lists +```xml + + + + + + + + First item + + + + + + + + + + New list item 1 + + + + + + + + + + + Bullet item + +``` + +### Tables + +**CRITICAL**: When adding rows to existing tables, match the EXACT cell structure of existing rows: +- Count cells in an existing row and match the count exactly +- Check for `` (cell spans multiple columns) and `` (columns after cells) +- Match cell widths (``) from the table's `` definition +- **Match content placement**: Check which cell contains the content in the reference row and place your content in the same cell position (e.g., if label rows have empty first cells with content in second cells, replicate this pattern) +- Examine which columns contain content vs. which are empty - replicate this exact pattern + +```xml + + + + + + + + + + + + Cell 1 + + + + Cell 2 + + + +``` + +### Layout +```xml + + + + + + + + + + + + New Section Title + + + + + + + + + + Centered text + + + + + + + + Monospace text + + + + + + + This text is Courier New + + and this text uses default font + +``` + +## File Updates + +When adding content, update these files: + +**`word/_rels/document.xml.rels`:** +```xml + + +``` + +**`[Content_Types].xml`:** +```xml + + +``` + +### Images +**CRITICAL**: Calculate dimensions to prevent page overflow and maintain aspect ratio. + +```xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +``` + +### Links (Hyperlinks) + +**IMPORTANT**: All hyperlinks (both internal and external) require the Hyperlink style to be defined in styles.xml. Without this style, links will look like regular text instead of blue underlined clickable links. + +**External Links:** +```xml + + + + + Link Text + + + + + +``` + +**Internal Links:** + +```xml + + + + + Link Text + + + + + +Target content + +``` + +**Hyperlink Style (required in styles.xml):** +```xml + + + + + + + + + + +``` + +## Document Library (Python) + +Use the Document class from `scripts/document.py` for all tracked changes and comments. It automatically handles infrastructure setup (people.xml, RSIDs, settings.xml, comment files, relationships, content types). Only use direct XML manipulation for complex scenarios not supported by the library. 
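For orientation, here is a minimal end-to-end sketch that strings together the calls documented in the sections below; the unpacked directory, author, and search text are placeholders.

```python
# Minimal sketch using only Document library calls documented below;
# 'unpacked', the author, and the search string are placeholders.
from scripts.document import Document

doc = Document('unpacked', author="Ticca", initials="TC")
editor = doc["word/document.xml"]

# Locate a paragraph and mark it as a tracked deletion
para = editor.get_node(tag="w:p", contains="text to remove")
editor.suggest_deletion(para)

# Anchor a comment on the same paragraph explaining the change
doc.add_comment(start=para, end=para, text="Removed per review instructions")

# Validate and copy the result back to the unpacked directory
doc.save()
```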
+ +**Working with Unicode and Entities:** +- **Searching**: Both entity notation and Unicode characters work - `contains="“Company"` and `contains="\u201cCompany"` find the same text +- **Replacing**: Use either entities (`“`) or Unicode (`\u201c`) - both work and will be converted appropriately based on the file's encoding (ascii → entities, utf-8 → Unicode) + +### Initialization + +**Set PYTHONPATH to the docx skill root:** + +```bash +# Find the docx skill root (directory containing scripts/ and ooxml/) +find /mnt/skills -name "document.py" -path "*/docx/scripts/*" 2>/dev/null | head -1 +# Example output: /mnt/skills/public/docx/scripts/document.py +# Skill root is: /mnt/skills/public/docx + +# Option 1: Export for entire session +export PYTHONPATH=/mnt/skills/public/docx:$PYTHONPATH + +# Option 2: Inline with script execution +PYTHONPATH=/mnt/skills/public/docx python3 your_script.py +``` + +**In your script**, import normally: +```python +from scripts.document import Document + +# Basic initialization (automatically creates temp copy and sets up infrastructure) +doc = Document('unpacked') + +# Customize author and initials +doc = Document('unpacked', author="John Doe", initials="JD") + +# Enable track revisions mode +doc = Document('unpacked', track_revisions=True) + +# Specify custom RSID (auto-generated if not provided) +doc = Document('unpacked', rsid="07DC5ECB") +``` + +### Creating Tracked Changes + +**CRITICAL**: Only mark text that actually changes. Keep ALL unchanged text outside ``/`` tags. Marking unchanged text makes edits unprofessional and harder to review. + +**Attribute Handling**: The Document class auto-injects attributes (w:id, w:date, w:rsidR, w:rsidDel, w16du:dateUtc, xml:space) into new elements. When preserving unchanged text from the original document, copy the original `` element with its existing attributes to maintain document integrity. 
+ +**Method Selection Guide**: +- **Adding your own changes to regular text**: Use `replace_node()` with ``/`` tags, or `suggest_deletion()` for removing entire `` or `` elements +- **Partially modifying another author's tracked change**: Use `replace_node()` to nest your changes inside their ``/`` +- **Completely rejecting another author's insertion**: Use `revert_insertion()` on the `` element (NOT `suggest_deletion()`) +- **Completely rejecting another author's deletion**: Use `revert_deletion()` on the `` element to restore deleted content using tracked changes + +```python +# Minimal edit - change one word: "The report is monthly" → "The report is quarterly" +# Original: The report is monthly +node = doc["word/document.xml"].get_node(tag="w:r", contains="The report is monthly") +rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else "" +replacement = f'{rpr}The report is {rpr}monthly{rpr}quarterly' +doc["word/document.xml"].replace_node(node, replacement) + +# Minimal edit - change number: "within 30 days" → "within 45 days" +# Original: within 30 days +node = doc["word/document.xml"].get_node(tag="w:r", contains="within 30 days") +rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else "" +replacement = f'{rpr}within {rpr}30{rpr}45{rpr} days' +doc["word/document.xml"].replace_node(node, replacement) + +# Complete replacement - preserve formatting even when replacing all text +node = doc["word/document.xml"].get_node(tag="w:r", contains="apple") +rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else "" +replacement = f'{rpr}apple{rpr}banana orange' +doc["word/document.xml"].replace_node(node, replacement) + +# Insert new content (no attributes needed - auto-injected) +node = doc["word/document.xml"].get_node(tag="w:r", contains="existing text") +doc["word/document.xml"].insert_after(node, 'new text') + +# Partially delete another author's insertion +# Original: quarterly financial report +# Goal: Delete only "financial" to make it "quarterly report" +node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "5"}) +# IMPORTANT: Preserve w:author="Jane Smith" on the outer to maintain authorship +replacement = ''' + quarterly + financial + report +''' +doc["word/document.xml"].replace_node(node, replacement) + +# Change part of another author's insertion +# Original: in silence, safe and sound +# Goal: Change "safe and sound" to "soft and unbound" +node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "8"}) +replacement = f''' + in silence, + + + soft and unbound + + + safe and sound +''' +doc["word/document.xml"].replace_node(node, replacement) + +# Delete entire run (use only when deleting all content; use replace_node for partial deletions) +node = doc["word/document.xml"].get_node(tag="w:r", contains="text to delete") +doc["word/document.xml"].suggest_deletion(node) + +# Delete entire paragraph (in-place, handles both regular and numbered list paragraphs) +para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph to delete") +doc["word/document.xml"].suggest_deletion(para) + +# Add new numbered list item +target_para = doc["word/document.xml"].get_node(tag="w:p", contains="existing list item") +pPr = tags[0].toxml() if (tags := target_para.getElementsByTagName("w:pPr")) else "" +new_item = f'{pPr}New item' +tracked_para = doc.suggest_paragraph(new_item) +doc["word/document.xml"].insert_after(target_para, tracked_para) +# Optional: add spacing paragraph before content for better visual 
separation +# spacing = doc.suggest_paragraph('') +# doc["word/document.xml"].insert_after(target_para, spacing + tracked_para) + +# Add table row with tracked changes (requires 3 levels: row, cell properties, content) +# IMPORTANT: First examine an existing row to match cell count, widths, and content placement +last_row = doc["word/document.xml"].get_node(tag="w:tr", line_number=5000) +new_row = ''' + + + + New Cell + +''' +doc["word/document.xml"].insert_after(last_row, new_row) +``` + +### Adding Comments + +```python +# Add comment spanning two existing tracked changes +# Note: w:id is auto-generated. Only search by w:id if you know it from XML inspection +start_node = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "1"}) +end_node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "2"}) +doc.add_comment(start=start_node, end=end_node, text="Explanation of this change") + +# Add comment on a paragraph +para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text") +doc.add_comment(start=para, end=para, text="Comment on this paragraph") + +# Add comment on newly created tracked change +# First create the tracked change +node = doc["word/document.xml"].get_node(tag="w:r", contains="old") +new_nodes = doc["word/document.xml"].replace_node( + node, + 'oldnew' +) +# Then add comment on the newly created elements +# new_nodes[0] is the , new_nodes[1] is the +doc.add_comment(start=new_nodes[0], end=new_nodes[1], text="Changed old to new per requirements") + +# Reply to existing comment +doc.reply_to_comment(parent_comment_id=0, text="I agree with this change") +``` + +### Rejecting Tracked Changes + +**IMPORTANT**: Use `revert_insertion()` to reject insertions and `revert_deletion()` to restore deletions using tracked changes. Use `suggest_deletion()` only for regular unmarked content. + +```python +# Reject insertion (wraps it in deletion) +# Use this when another author inserted text that you want to delete +ins = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "5"}) +nodes = doc["word/document.xml"].revert_insertion(ins) # Returns [ins] + +# Reject deletion (creates insertion to restore deleted content) +# Use this when another author deleted text that you want to restore +del_elem = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "3"}) +nodes = doc["word/document.xml"].revert_deletion(del_elem) # Returns [del_elem, new_ins] + +# Reject all insertions in a paragraph +para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text") +nodes = doc["word/document.xml"].revert_insertion(para) # Returns [para] + +# Reject all deletions in a paragraph +para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text") +nodes = doc["word/document.xml"].revert_deletion(para) # Returns [para] +``` + +### Inserting Images + +**CRITICAL**: The Document class works with a temporary copy at `doc.unpacked_path`. Always copy images to this temp directory, not the original unpacked folder. 
+ +```python +from PIL import Image +import shutil, os + +# Initialize document first +doc = Document('unpacked') + +# Copy image and calculate full-width dimensions with aspect ratio +media_dir = os.path.join(doc.unpacked_path, 'word/media') +os.makedirs(media_dir, exist_ok=True) +shutil.copy('image.png', os.path.join(media_dir, 'image1.png')) +img = Image.open(os.path.join(media_dir, 'image1.png')) +width_emus = int(6.5 * 914400) # 6.5" usable width, 914400 EMUs/inch +height_emus = int(width_emus * img.size[1] / img.size[0]) + +# Add relationship and content type +rels_editor = doc['word/_rels/document.xml.rels'] +next_rid = rels_editor.get_next_rid() +rels_editor.append_to(rels_editor.dom.documentElement, + f'') +doc['[Content_Types].xml'].append_to(doc['[Content_Types].xml'].dom.documentElement, + '') + +# Insert image +node = doc["word/document.xml"].get_node(tag="w:p", line_number=100) +doc["word/document.xml"].insert_after(node, f''' + + + + + + + + + + + + + + + + + +''') +``` + +### Getting Nodes + +```python +# By text content +node = doc["word/document.xml"].get_node(tag="w:p", contains="specific text") + +# By line range +para = doc["word/document.xml"].get_node(tag="w:p", line_number=range(100, 150)) + +# By attributes +node = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "1"}) + +# By exact line number (must be line number where tag opens) +para = doc["word/document.xml"].get_node(tag="w:p", line_number=42) + +# Combine filters +node = doc["word/document.xml"].get_node(tag="w:r", line_number=range(40, 60), contains="text") + +# Disambiguate when text appears multiple times - add line_number range +node = doc["word/document.xml"].get_node(tag="w:r", contains="Section", line_number=range(2400, 2500)) +``` + +### Saving + +```python +# Save with automatic validation (copies back to original directory) +doc.save() # Validates by default, raises error if validation fails + +# Save to different location +doc.save('modified-unpacked') + +# Skip validation (debugging only - needing this in production indicates XML issues) +doc.save(validate=False) +``` + +### Direct DOM Manipulation + +For complex scenarios not covered by the library: + +```python +# Access any XML file +editor = doc["word/document.xml"] +editor = doc["word/comments.xml"] + +# Direct DOM access (defusedxml.minidom.Document) +node = doc["word/document.xml"].get_node(tag="w:p", line_number=5) +parent = node.parentNode +parent.removeChild(node) +parent.appendChild(node) # Move to end + +# General document manipulation (without tracked changes) +old_node = doc["word/document.xml"].get_node(tag="w:p", contains="original text") +doc["word/document.xml"].replace_node(old_node, "replacement text") + +# Multiple insertions - use return value to maintain order +node = doc["word/document.xml"].get_node(tag="w:r", line_number=100) +nodes = doc["word/document.xml"].insert_after(node, "A") +nodes = doc["word/document.xml"].insert_after(nodes[-1], "B") +nodes = doc["word/document.xml"].insert_after(nodes[-1], "C") +# Results in: original_node, A, B, C +``` + +## Tracked Changes (Redlining) + +**Use the Document class above for all tracked changes.** The patterns below are for reference when constructing replacement XML strings. + +### Validation Rules +The validator checks that the document text matches the original after reverting Ticca's changes. 
This means: +- **NEVER modify text inside another author's `` or `` tags** +- **ALWAYS use nested deletions** to remove another author's insertions +- **Every edit must be properly tracked** with `` or `` tags + +### Tracked Change Patterns + +**CRITICAL RULES**: +1. Never modify the content inside another author's tracked changes. Always use nested deletions. +2. **XML Structure**: Always place `` and `` at paragraph level containing complete `` elements. Never nest inside `` elements - this creates invalid XML that breaks document processing. + +**Text Insertion:** +```xml + + + inserted text + + +``` + +**Text Deletion:** +```xml + + + deleted text + + +``` + +**Deleting Another Author's Insertion (MUST use nested structure):** +```xml + + + + monthly + + + + weekly + +``` + +**Restoring Another Author's Deletion:** +```xml + + + within 30 days + + + within 30 days + +``` \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-chart.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-chart.xsd new file mode 100644 index 00000000..bc325f9f --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-chart.xsd @@ -0,0 +1,1499 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd new file mode 100644 index 00000000..afa4f463 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd @@ -0,0 +1,146 @@
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd new file mode 100644 index 00000000..40e4b12a --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd @@ -0,0 +1,1085 @@
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd new file mode 100644 index 00000000..687eea82 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd @@ -0,0 +1,11 @@
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-main.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-main.xsd new file mode 100644 index 00000000..94644b3f --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-main.xsd @@ -0,0 +1,3081 @@
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-picture.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-picture.xsd new file mode 100644 index 00000000..1dbf0514 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-picture.xsd @@ -0,0 +1,23 @@
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd new file mode 100644 index 00000000..f1af17db --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd @@ -0,0 +1,185 @@
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd new file mode 100644 index 00000000..5c00a6ff --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd @@ -0,0 +1,287 @@
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/pml.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/pml.xsd new file mode 100644 index 00000000..25564ebb --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/pml.xsd @@ -0,0 +1,1676 @@
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd new file mode 100644 index 00000000..c20f3bf1 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd @@ -0,0 +1,28 @@ + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd new file mode 100644 index 00000000..ac602522 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd @@ -0,0 +1,144 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd new file mode 
100644 index 00000000..52deec72 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd @@ -0,0 +1,174 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd new file mode 100644 index 00000000..2bddce29 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd @@ -0,0 +1,25 @@ + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd new file mode 100644 index 00000000..8a8c18ba --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd @@ -0,0 +1,18 @@ + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd new file mode 100644 index 00000000..5c42706a --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd @@ -0,0 +1,59 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd new file mode 100644 index 00000000..853c341c --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd @@ -0,0 +1,56 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd new file mode 100644 index 00000000..da835ee8 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd @@ -0,0 +1,195 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-math.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-math.xsd new file mode 100644 index 00000000..4f37d307 --- /dev/null 
+++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-math.xsd @@ -0,0 +1,582 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd new file mode 100644 index 00000000..9e86f1b2 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd @@ -0,0 +1,25 @@ + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/sml.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/sml.xsd new file mode 100644 index 00000000..237dd652 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/sml.xsd @@ -0,0 +1,4439 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-main.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-main.xsd new file mode 100644 index 00000000..eeb4ef8f --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-main.xsd @@ -0,0 +1,570 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd new file mode 100644 index 00000000..ca2575c7 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd @@ -0,0 +1,509 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd new file mode 100644 index 00000000..dd079e60 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd @@ -0,0 +1,12 @@ + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd 
b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd new file mode 100644 index 00000000..3dd6cf62 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd @@ -0,0 +1,108 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd new file mode 100644 index 00000000..f1041e34 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd @@ -0,0 +1,96 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/wml.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/wml.xsd new file mode 100644 index 00000000..9c5b7a63 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/wml.xsd @@ -0,0 +1,3646 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/xml.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/xml.xsd new file mode 100644 index 00000000..fbd88768 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ISO-IEC29500-4_2016/xml.xsd @@ -0,0 +1,116 @@ + + + + + + See http://www.w3.org/XML/1998/namespace.html and + http://www.w3.org/TR/REC-xml for information about this namespace. + + This schema document describes the XML namespace, in a form + suitable for import by other schema documents. + + Note that local names in this namespace are intended to be defined + only by the World Wide Web Consortium or its subgroups. The + following names are currently defined in this namespace and should + not be used with conflicting semantics by any Working Group, + specification, or document instance: + + base (as an attribute name): denotes an attribute whose value + provides a URI to be used as the base for interpreting any + relative URIs in the scope of the element on which it + appears; its value is inherited. This name is reserved + by virtue of its definition in the XML Base specification. + + lang (as an attribute name): denotes an attribute whose value + is a language code for the natural language of the content of + any element; its value is inherited. This name is reserved + by virtue of its definition in the XML specification. + + space (as an attribute name): denotes an attribute whose + value is a keyword indicating what whitespace processing + discipline is intended for the content of the element; its + value is inherited. This name is reserved by virtue of its + definition in the XML specification. + + Father (in any context at all): denotes Jon Bosak, the chair of + the original XML Working Group. 
This name is reserved by + the following decision of the W3C XML Plenary and + XML Coordination groups: + + In appreciation for his vision, leadership and dedication + the W3C XML Plenary on this 10th day of February, 2000 + reserves for Jon Bosak in perpetuity the XML name + xml:Father + + + + + This schema defines attributes and an attribute group + suitable for use by + schemas wishing to allow xml:base, xml:lang or xml:space attributes + on elements they define. + + To enable this, such a schema must import this schema + for the XML namespace, e.g. as follows: + <schema . . .> + . . . + <import namespace="http://www.w3.org/XML/1998/namespace" + schemaLocation="http://www.w3.org/2001/03/xml.xsd"/> + + Subsequently, qualified reference to any of the attributes + or the group defined below will have the desired effect, e.g. + + <type . . .> + . . . + <attributeGroup ref="xml:specialAttrs"/> + + will define a type which will schema-validate an instance + element with any of those attributes + + + + In keeping with the XML Schema WG's standard versioning + policy, this schema document will persist at + http://www.w3.org/2001/03/xml.xsd. + At the date of issue it can also be found at + http://www.w3.org/2001/xml.xsd. + The schema document at that URI may however change in the future, + in order to remain compatible with the latest version of XML Schema + itself. In other words, if the XML Schema namespace changes, the version + of this document at + http://www.w3.org/2001/xml.xsd will change + accordingly; the version at + http://www.w3.org/2001/03/xml.xsd will not change. + + + + + + In due course, we should install the relevant ISO 2- and 3-letter + codes as the enumerated possible values . . . + + + + + + + + + + + + + + + See http://www.w3.org/TR/xmlbase/ for + information about this attribute. 
+ + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-contentTypes.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-contentTypes.xsd new file mode 100644 index 00000000..e4c5160e --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-contentTypes.xsd @@ -0,0 +1,42 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-coreProperties.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-coreProperties.xsd new file mode 100644 index 00000000..888c0fcd --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-coreProperties.xsd @@ -0,0 +1,50 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-digSig.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-digSig.xsd new file mode 100644 index 00000000..73782264 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-digSig.xsd @@ -0,0 +1,49 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-relationships.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-relationships.xsd new file mode 100644 index 00000000..762dcbe8 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/ecma/fouth-edition/opc-relationships.xsd @@ -0,0 +1,33 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/mce/mc.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/mce/mc.xsd new file mode 100644 index 00000000..ef725457 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/mce/mc.xsd @@ -0,0 +1,75 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/schemas/microsoft/wml-2010.xsd b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/microsoft/wml-2010.xsd new file mode 100644 index 00000000..f65f7777 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/schemas/microsoft/wml-2010.xsd @@ -0,0 +1,560 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
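For orientation, these bundled XSDs appear to be what the validators in scripts/validation/ load via lxml (note the SCHEMA_MAPPINGS table and schemas_dir in validation/base.py below). The following is a minimal, illustrative sketch of checking one unpacked part against a bundled schema; it is not code from this PR, and the paths and helper name are assumptions:

```python
# Illustrative sketch only -- not part of this PR. Shows how an unpacked OOXML
# part can be checked against one of the bundled XSDs using lxml.
from pathlib import Path

import lxml.etree

# Assumed location of the bundled schemas (matches the paths added above).
SCHEMAS_DIR = Path("code_puppy/bundled_skills/Office/docx/ooxml/schemas")


def check_part(part_path: str, xsd_rel_path: str) -> bool:
    """Return True if the XML part validates against the given bundled schema."""
    schema_doc = lxml.etree.parse(str(SCHEMAS_DIR / xsd_rel_path))
    schema = lxml.etree.XMLSchema(schema_doc)  # resolves xs:import relative to the XSD file
    part = lxml.etree.parse(part_path)
    if schema.validate(part):
        return True
    for err in schema.error_log:
        print(f"{part_path}:{err.line}: {err.message}")
    return False


# Example (paths are hypothetical):
# check_part("unpacked/word/document.xml", "ISO-IEC29500-4_2016/wml.xsd")
```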
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/scripts/pack.py b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/pack.py new file mode 100644 index 00000000..4a23b67e --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/pack.py @@ -0,0 +1,160 @@ +#!/usr/bin/env python3 +""" +Tool to pack a directory into a .docx, .pptx, or .xlsx file with XML formatting undone.
+ +Example usage: + python pack.py [--force] +""" + +import argparse +import shutil +import subprocess +import sys +import tempfile +import zipfile +from pathlib import Path + +import defusedxml.minidom + + +def main(): + parser = argparse.ArgumentParser(description="Pack a directory into an Office file") + parser.add_argument("input_directory", help="Unpacked Office document directory") + parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)") + parser.add_argument("--force", action="store_true", help="Skip validation") + args = parser.parse_args() + + try: + success = pack_document( + args.input_directory, args.output_file, validate=not args.force + ) + + # Show warning if validation was skipped + if args.force: + print("Warning: Skipped validation, file may be corrupt", file=sys.stderr) + # Exit with error if validation failed + elif not success: + print("Contents would produce a corrupt file.", file=sys.stderr) + print("Please validate XML before repacking.", file=sys.stderr) + print("Use --force to skip validation and pack anyway.", file=sys.stderr) + sys.exit(1) + + except ValueError as e: + sys.exit(f"Error: {e}") + + +def pack_document(input_dir, output_file, validate=False): + """Pack a directory into an Office file (.docx/.pptx/.xlsx). + + Args: + input_dir: Path to unpacked Office document directory + output_file: Path to output Office file + validate: If True, validates with soffice (default: False) + + Returns: + bool: True if successful, False if validation failed + """ + input_dir = Path(input_dir) + output_file = Path(output_file) + + if not input_dir.is_dir(): + raise ValueError(f"{input_dir} is not a directory") + if output_file.suffix.lower() not in {".docx", ".pptx", ".xlsx"}: + raise ValueError(f"{output_file} must be a .docx, .pptx, or .xlsx file") + + # Work in temporary directory to avoid modifying original + with tempfile.TemporaryDirectory() as temp_dir: + temp_content_dir = Path(temp_dir) / "content" + shutil.copytree(input_dir, temp_content_dir) + + # Process XML files to remove pretty-printing whitespace + for pattern in ["*.xml", "*.rels"]: + for xml_file in temp_content_dir.rglob(pattern): + condense_xml(xml_file) + + # Create final Office file as zip archive + output_file.parent.mkdir(parents=True, exist_ok=True) + with zipfile.ZipFile(output_file, "w", zipfile.ZIP_DEFLATED) as zf: + for f in temp_content_dir.rglob("*"): + if f.is_file(): + zf.write(f, f.relative_to(temp_content_dir)) + + # Validate if requested + if validate: + if not validate_document(output_file): + output_file.unlink() # Delete the corrupt file + return False + + return True + + +def validate_document(doc_path): + """Validate document by converting to HTML with soffice.""" + # Determine the correct filter based on file extension + match doc_path.suffix.lower(): + case ".docx": + filter_name = "html:HTML" + case ".pptx": + filter_name = "html:impress_html_Export" + case ".xlsx": + filter_name = "html:HTML (StarCalc)" + + with tempfile.TemporaryDirectory() as temp_dir: + try: + result = subprocess.run( + [ + "soffice", + "--headless", + "--convert-to", + filter_name, + "--outdir", + temp_dir, + str(doc_path), + ], + capture_output=True, + timeout=10, + text=True, + ) + if not (Path(temp_dir) / f"{doc_path.stem}.html").exists(): + error_msg = result.stderr.strip() or "Document validation failed" + print(f"Validation error: {error_msg}", file=sys.stderr) + return False + return True + except FileNotFoundError: + print("Warning: soffice not found. 
Skipping validation.", file=sys.stderr) + return True + except subprocess.TimeoutExpired: + print("Validation error: Timeout during conversion", file=sys.stderr) + return False + except Exception as e: + print(f"Validation error: {e}", file=sys.stderr) + return False + + +def condense_xml(xml_file): + """Strip unnecessary whitespace and remove comments.""" + with open(xml_file, "r", encoding="utf-8") as f: + dom = defusedxml.minidom.parse(f) + + # Process each element to remove whitespace and comments + for element in dom.getElementsByTagName("*"): + # Skip w:t elements and their processing + if element.tagName.endswith(":t"): + continue + + # Remove whitespace-only text nodes and comment nodes + for child in list(element.childNodes): + if ( + child.nodeType == child.TEXT_NODE + and child.nodeValue + and child.nodeValue.strip() == "" + ) or child.nodeType == child.COMMENT_NODE: + element.removeChild(child) + + # Write back the condensed XML + with open(xml_file, "wb") as f: + f.write(dom.toxml(encoding="UTF-8")) + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/scripts/unpack.py b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/unpack.py new file mode 100644 index 00000000..2ac3909a --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/unpack.py @@ -0,0 +1,30 @@ +#!/usr/bin/env python3 +"""Unpack and format XML contents of Office files (.docx, .pptx, .xlsx)""" + +import random +import sys +import zipfile +from pathlib import Path + +import defusedxml.minidom + +# Get command line arguments +assert len(sys.argv) == 3, "Usage: python unpack.py " +input_file, output_dir = sys.argv[1], sys.argv[2] + +# Extract and format +output_path = Path(output_dir) +output_path.mkdir(parents=True, exist_ok=True) +zipfile.ZipFile(input_file).extractall(output_path) + +# Pretty print all XML files +xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels")) +for xml_file in xml_files: + content = xml_file.read_text(encoding="utf-8") + dom = defusedxml.minidom.parseString(content) + xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="ascii")) + +# For .docx files, suggest an RSID for tracked changes +if input_file.endswith(".docx"): + suggested_rsid = "".join(random.choices("0123456789ABCDEF", k=8)) + print(f"Suggested RSID for edit session: {suggested_rsid}") diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validate.py b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validate.py new file mode 100644 index 00000000..508c5891 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validate.py @@ -0,0 +1,69 @@ +#!/usr/bin/env python3 +""" +Command line tool to validate Office document XML files against XSD schemas and tracked changes. 
+ +Usage: + python validate.py --original +""" + +import argparse +import sys +from pathlib import Path + +from validation import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator + + +def main(): + parser = argparse.ArgumentParser(description="Validate Office document XML files") + parser.add_argument( + "unpacked_dir", + help="Path to unpacked Office document directory", + ) + parser.add_argument( + "--original", + required=True, + help="Path to original file (.docx/.pptx/.xlsx)", + ) + parser.add_argument( + "-v", + "--verbose", + action="store_true", + help="Enable verbose output", + ) + args = parser.parse_args() + + # Validate paths + unpacked_dir = Path(args.unpacked_dir) + original_file = Path(args.original) + file_extension = original_file.suffix.lower() + assert unpacked_dir.is_dir(), f"Error: {unpacked_dir} is not a directory" + assert original_file.is_file(), f"Error: {original_file} is not a file" + assert file_extension in [".docx", ".pptx", ".xlsx"], ( + f"Error: {original_file} must be a .docx, .pptx, or .xlsx file" + ) + + # Run validations + match file_extension: + case ".docx": + validators = [DOCXSchemaValidator, RedliningValidator] + case ".pptx": + validators = [PPTXSchemaValidator] + case _: + print(f"Error: Validation not supported for file type {file_extension}") + sys.exit(1) + + # Run validators + success = True + for V in validators: + validator = V(unpacked_dir, original_file, verbose=args.verbose) + if not validator.validate(): + success = False + + if success: + print("All validations PASSED!") + + sys.exit(0 if success else 1) + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/__init__.py b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/__init__.py new file mode 100644 index 00000000..db092ece --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/__init__.py @@ -0,0 +1,15 @@ +""" +Validation modules for Word document processing. +""" + +from .base import BaseSchemaValidator +from .docx import DOCXSchemaValidator +from .pptx import PPTXSchemaValidator +from .redlining import RedliningValidator + +__all__ = [ + "BaseSchemaValidator", + "DOCXSchemaValidator", + "PPTXSchemaValidator", + "RedliningValidator", +] diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/base.py b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/base.py new file mode 100644 index 00000000..165c3c5c --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/base.py @@ -0,0 +1,968 @@ +""" +Base validator with common validation logic for document files. 
+""" + +import re +from pathlib import Path + +import lxml.etree + + +class BaseSchemaValidator: + """Base validator with common validation logic for document files.""" + + # Elements whose 'id' attributes must be unique within their file + # Format: element_name -> (attribute_name, scope) + # scope can be 'file' (unique within file) or 'global' (unique across all files) + UNIQUE_ID_REQUIREMENTS = { + # Word elements + "comment": ("id", "file"), # Comment IDs in comments.xml + "commentrangestart": ("id", "file"), # Must match comment IDs + "commentrangeend": ("id", "file"), # Must match comment IDs + "bookmarkstart": ("id", "file"), # Bookmark start IDs + "bookmarkend": ("id", "file"), # Bookmark end IDs + # Note: ins and del (track changes) can share IDs when part of same revision + # PowerPoint elements + "sldid": ("id", "file"), # Slide IDs in presentation.xml + "sldmasterid": ("id", "global"), # Slide master IDs must be globally unique + "sldlayoutid": ("id", "global"), # Slide layout IDs must be globally unique + "cm": ("authorid", "file"), # Comment author IDs + # Excel elements + "sheet": ("sheetid", "file"), # Sheet IDs in workbook.xml + "definedname": ("id", "file"), # Named range IDs + # Drawing/Shape elements (all formats) + "cxnsp": ("id", "file"), # Connection shape IDs + "sp": ("id", "file"), # Shape IDs + "pic": ("id", "file"), # Picture IDs + "grpsp": ("id", "file"), # Group shape IDs + } + + # Container elements where ID uniqueness checks should be skipped + # These hold references that intentionally duplicate IDs of elements they reference + # Example: in sectionLst references in sldIdLst + EXCLUDED_ID_CONTAINERS = { + "sectionlst", # PowerPoint sections - sldId elements reference slides by ID + } + + # Mapping of element names to expected relationship types + # Subclasses should override this with format-specific mappings + ELEMENT_RELATIONSHIP_TYPES = {} + + # Unified schema mappings for all Office document types + SCHEMA_MAPPINGS = { + # Document type specific schemas + "word": "ISO-IEC29500-4_2016/wml.xsd", # Word documents + "ppt": "ISO-IEC29500-4_2016/pml.xsd", # PowerPoint presentations + "xl": "ISO-IEC29500-4_2016/sml.xsd", # Excel spreadsheets + # Common file types + "[Content_Types].xml": "ecma/fouth-edition/opc-contentTypes.xsd", + "app.xml": "ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd", + "core.xml": "ecma/fouth-edition/opc-coreProperties.xsd", + "custom.xml": "ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd", + ".rels": "ecma/fouth-edition/opc-relationships.xsd", + # Word-specific files + "people.xml": "microsoft/wml-2012.xsd", + "commentsIds.xml": "microsoft/wml-cid-2016.xsd", + "commentsExtensible.xml": "microsoft/wml-cex-2018.xsd", + "commentsExtended.xml": "microsoft/wml-2012.xsd", + # Chart files (common across document types) + "chart": "ISO-IEC29500-4_2016/dml-chart.xsd", + # Theme files (common across document types) + "theme": "ISO-IEC29500-4_2016/dml-main.xsd", + # Drawing and media files + "drawing": "ISO-IEC29500-4_2016/dml-main.xsd", + } + + # Unified namespace constants + MC_NAMESPACE = "http://schemas.openxmlformats.org/markup-compatibility/2006" + XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace" + + # Common OOXML namespaces used across validators + PACKAGE_RELATIONSHIPS_NAMESPACE = ( + "http://schemas.openxmlformats.org/package/2006/relationships" + ) + OFFICE_RELATIONSHIPS_NAMESPACE = ( + "http://schemas.openxmlformats.org/officeDocument/2006/relationships" + ) + CONTENT_TYPES_NAMESPACE = ( + 
"http://schemas.openxmlformats.org/package/2006/content-types" + ) + + # Folders where we should clean ignorable namespaces + MAIN_CONTENT_FOLDERS = {"word", "ppt", "xl"} + + # All allowed OOXML namespaces (superset of all document types) + OOXML_NAMESPACES = { + "http://schemas.openxmlformats.org/officeDocument/2006/math", + "http://schemas.openxmlformats.org/officeDocument/2006/relationships", + "http://schemas.openxmlformats.org/schemaLibrary/2006/main", + "http://schemas.openxmlformats.org/drawingml/2006/main", + "http://schemas.openxmlformats.org/drawingml/2006/chart", + "http://schemas.openxmlformats.org/drawingml/2006/chartDrawing", + "http://schemas.openxmlformats.org/drawingml/2006/diagram", + "http://schemas.openxmlformats.org/drawingml/2006/picture", + "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing", + "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing", + "http://schemas.openxmlformats.org/wordprocessingml/2006/main", + "http://schemas.openxmlformats.org/presentationml/2006/main", + "http://schemas.openxmlformats.org/spreadsheetml/2006/main", + "http://schemas.openxmlformats.org/officeDocument/2006/sharedTypes", + "http://www.w3.org/XML/1998/namespace", + } + + def __init__(self, unpacked_dir, original_file, verbose=False): + self.unpacked_dir = Path(unpacked_dir).resolve() + self.original_file = Path(original_file) + self.verbose = verbose + + # Set schemas directory + self.schemas_dir = Path(__file__).parent.parent.parent / "schemas" + + # Get all XML and .rels files + patterns = ["*.xml", "*.rels"] + self.xml_files = [ + f for pattern in patterns for f in self.unpacked_dir.rglob(pattern) + ] + + if not self.xml_files: + print(f"Warning: No XML files found in {self.unpacked_dir}") + + def validate(self): + """Run all validation checks and return True if all pass.""" + raise NotImplementedError("Subclasses must implement the validate method") + + def validate_xml(self): + """Validate that all XML files are well-formed.""" + errors = [] + + for xml_file in self.xml_files: + try: + # Try to parse the XML file + lxml.etree.parse(str(xml_file)) + except lxml.etree.XMLSyntaxError as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {e.lineno}: {e.msg}" + ) + except Exception as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Unexpected error: {str(e)}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} XML violations:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All XML files are well-formed") + return True + + def validate_namespaces(self): + """Validate that namespace prefixes in Ignorable attributes are declared.""" + errors = [] + + for xml_file in self.xml_files: + try: + root = lxml.etree.parse(str(xml_file)).getroot() + declared = set(root.nsmap.keys()) - {None} # Exclude default namespace + + for attr_val in [ + v for k, v in root.attrib.items() if k.endswith("Ignorable") + ]: + undeclared = set(attr_val.split()) - declared + errors.extend( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Namespace '{ns}' in Ignorable but not declared" + for ns in undeclared + ) + except lxml.etree.XMLSyntaxError: + continue + + if errors: + print(f"FAILED - {len(errors)} namespace issues:") + for error in errors: + print(error) + return False + if self.verbose: + print("PASSED - All namespace prefixes properly declared") + return True + + def validate_unique_ids(self): + """Validate that specific IDs are unique according to 
OOXML requirements.""" + errors = [] + global_ids = {} # Track globally unique IDs across all files + + for xml_file in self.xml_files: + try: + root = lxml.etree.parse(str(xml_file)).getroot() + file_ids = {} # Track IDs that must be unique within this file + + # Remove all mc:AlternateContent elements from the tree + mc_elements = root.xpath( + ".//mc:AlternateContent", namespaces={"mc": self.MC_NAMESPACE} + ) + for elem in mc_elements: + elem.getparent().remove(elem) + + # Now check IDs in the cleaned tree + for elem in root.iter(): + # Get the element name without namespace + tag = ( + elem.tag.split("}")[-1].lower() + if "}" in elem.tag + else elem.tag.lower() + ) + + # Check if this element type has ID uniqueness requirements + if tag in self.UNIQUE_ID_REQUIREMENTS: + # Skip if element is inside an excluded container + # (e.g., inside is a reference, not a definition) + in_excluded_container = any( + ancestor.tag.split("}")[-1].lower() + in self.EXCLUDED_ID_CONTAINERS + for ancestor in elem.iterancestors() + ) + if in_excluded_container: + continue + + attr_name, scope = self.UNIQUE_ID_REQUIREMENTS[tag] + + # Look for the specified attribute + id_value = None + for attr, value in elem.attrib.items(): + attr_local = ( + attr.split("}")[-1].lower() + if "}" in attr + else attr.lower() + ) + if attr_local == attr_name: + id_value = value + break + + if id_value is not None: + if scope == "global": + # Check global uniqueness + if id_value in global_ids: + prev_file, prev_line, prev_tag = global_ids[ + id_value + ] + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: Global ID '{id_value}' in <{tag}> " + f"already used in {prev_file} at line {prev_line} in <{prev_tag}>" + ) + else: + global_ids[id_value] = ( + xml_file.relative_to(self.unpacked_dir), + elem.sourceline, + tag, + ) + elif scope == "file": + # Check file-level uniqueness + key = (tag, attr_name) + if key not in file_ids: + file_ids[key] = {} + + if id_value in file_ids[key]: + prev_line = file_ids[key][id_value] + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: Duplicate {attr_name}='{id_value}' in <{tag}> " + f"(first occurrence at line {prev_line})" + ) + else: + file_ids[key][id_value] = elem.sourceline + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} ID uniqueness violations:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All required IDs are unique") + return True + + def validate_file_references(self): + """ + Validate that all .rels files properly reference files and that all files are referenced. 
+ """ + errors = [] + + # Find all .rels files + rels_files = list(self.unpacked_dir.rglob("*.rels")) + + if not rels_files: + if self.verbose: + print("PASSED - No .rels files found") + return True + + # Get all files in the unpacked directory (excluding reference files) + all_files = [] + for file_path in self.unpacked_dir.rglob("*"): + if ( + file_path.is_file() + and file_path.name != "[Content_Types].xml" + and not file_path.name.endswith(".rels") + ): # This file is not referenced by .rels + all_files.append(file_path.resolve()) + + # Track all files that are referenced by any .rels file + all_referenced_files = set() + + if self.verbose: + print( + f"Found {len(rels_files)} .rels files and {len(all_files)} target files" + ) + + # Check each .rels file + for rels_file in rels_files: + try: + # Parse relationships file + rels_root = lxml.etree.parse(str(rels_file)).getroot() + + # Get the directory where this .rels file is located + rels_dir = rels_file.parent + + # Find all relationships and their targets + referenced_files = set() + broken_refs = [] + + for rel in rels_root.findall( + ".//ns:Relationship", + namespaces={"ns": self.PACKAGE_RELATIONSHIPS_NAMESPACE}, + ): + target = rel.get("Target") + if target and not target.startswith( + ("http", "mailto:") + ): # Skip external URLs + # Resolve the target path relative to the .rels file location + if rels_file.name == ".rels": + # Root .rels file - targets are relative to unpacked_dir + target_path = self.unpacked_dir / target + else: + # Other .rels files - targets are relative to their parent's parent + # e.g., word/_rels/document.xml.rels -> targets relative to word/ + base_dir = rels_dir.parent + target_path = base_dir / target + + # Normalize the path and check if it exists + try: + target_path = target_path.resolve() + if target_path.exists() and target_path.is_file(): + referenced_files.add(target_path) + all_referenced_files.add(target_path) + else: + broken_refs.append((target, rel.sourceline)) + except (OSError, ValueError): + broken_refs.append((target, rel.sourceline)) + + # Report broken references + if broken_refs: + rel_path = rels_file.relative_to(self.unpacked_dir) + for broken_ref, line_num in broken_refs: + errors.append( + f" {rel_path}: Line {line_num}: Broken reference to {broken_ref}" + ) + + except Exception as e: + rel_path = rels_file.relative_to(self.unpacked_dir) + errors.append(f" Error parsing {rel_path}: {e}") + + # Check for unreferenced files (files that exist but are not referenced anywhere) + unreferenced_files = set(all_files) - all_referenced_files + + if unreferenced_files: + for unref_file in sorted(unreferenced_files): + unref_rel_path = unref_file.relative_to(self.unpacked_dir) + errors.append(f" Unreferenced file: {unref_rel_path}") + + if errors: + print(f"FAILED - Found {len(errors)} relationship validation errors:") + for error in errors: + print(error) + print( + "CRITICAL: These errors will cause the document to appear corrupt. " + + "Broken references MUST be fixed, " + + "and unreferenced files MUST be referenced or removed." + ) + return False + else: + if self.verbose: + print( + "PASSED - All references are valid and all files are properly referenced" + ) + return True + + def validate_all_relationship_ids(self): + """ + Validate that all r:id attributes in XML files reference existing IDs + in their corresponding .rels files, and optionally validate relationship types. 
+ """ + import lxml.etree + + errors = [] + + # Process each XML file that might contain r:id references + for xml_file in self.xml_files: + # Skip .rels files themselves + if xml_file.suffix == ".rels": + continue + + # Determine the corresponding .rels file + # For dir/file.xml, it's dir/_rels/file.xml.rels + rels_dir = xml_file.parent / "_rels" + rels_file = rels_dir / f"{xml_file.name}.rels" + + # Skip if there's no corresponding .rels file (that's okay) + if not rels_file.exists(): + continue + + try: + # Parse the .rels file to get valid relationship IDs and their types + rels_root = lxml.etree.parse(str(rels_file)).getroot() + rid_to_type = {} + + for rel in rels_root.findall( + f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship" + ): + rid = rel.get("Id") + rel_type = rel.get("Type", "") + if rid: + # Check for duplicate rIds + if rid in rid_to_type: + rels_rel_path = rels_file.relative_to(self.unpacked_dir) + errors.append( + f" {rels_rel_path}: Line {rel.sourceline}: " + f"Duplicate relationship ID '{rid}' (IDs must be unique)" + ) + # Extract just the type name from the full URL + type_name = ( + rel_type.split("/")[-1] if "/" in rel_type else rel_type + ) + rid_to_type[rid] = type_name + + # Parse the XML file to find all r:id references + xml_root = lxml.etree.parse(str(xml_file)).getroot() + + # Find all elements with r:id attributes + for elem in xml_root.iter(): + # Check for r:id attribute (relationship ID) + rid_attr = elem.get(f"{{{self.OFFICE_RELATIONSHIPS_NAMESPACE}}}id") + if rid_attr: + xml_rel_path = xml_file.relative_to(self.unpacked_dir) + elem_name = ( + elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag + ) + + # Check if the ID exists + if rid_attr not in rid_to_type: + errors.append( + f" {xml_rel_path}: Line {elem.sourceline}: " + f"<{elem_name}> references non-existent relationship '{rid_attr}' " + f"(valid IDs: {', '.join(sorted(rid_to_type.keys())[:5])}{'...' if len(rid_to_type) > 5 else ''})" + ) + # Check if we have type expectations for this element + elif self.ELEMENT_RELATIONSHIP_TYPES: + expected_type = self._get_expected_relationship_type( + elem_name + ) + if expected_type: + actual_type = rid_to_type[rid_attr] + # Check if the actual type matches or contains the expected type + if expected_type not in actual_type.lower(): + errors.append( + f" {xml_rel_path}: Line {elem.sourceline}: " + f"<{elem_name}> references '{rid_attr}' which points to '{actual_type}' " + f"but should point to a '{expected_type}' relationship" + ) + + except Exception as e: + xml_rel_path = xml_file.relative_to(self.unpacked_dir) + errors.append(f" Error processing {xml_rel_path}: {e}") + + if errors: + print(f"FAILED - Found {len(errors)} relationship ID reference errors:") + for error in errors: + print(error) + print("\nThese ID mismatches will cause the document to appear corrupt!") + return False + else: + if self.verbose: + print("PASSED - All relationship ID references are valid") + return True + + def _get_expected_relationship_type(self, element_name): + """ + Get the expected relationship type for an element. + First checks the explicit mapping, then tries pattern detection. 
+ """ + # Normalize element name to lowercase + elem_lower = element_name.lower() + + # Check explicit mapping first + if elem_lower in self.ELEMENT_RELATIONSHIP_TYPES: + return self.ELEMENT_RELATIONSHIP_TYPES[elem_lower] + + # Try pattern detection for common patterns + # Pattern 1: Elements ending in "Id" often expect a relationship of the prefix type + if elem_lower.endswith("id") and len(elem_lower) > 2: + # e.g., "sldId" -> "sld", "sldMasterId" -> "sldMaster" + prefix = elem_lower[:-2] # Remove "id" + # Check if this might be a compound like "sldMasterId" + if prefix.endswith("master"): + return prefix.lower() + elif prefix.endswith("layout"): + return prefix.lower() + else: + # Simple case like "sldId" -> "slide" + # Common transformations + if prefix == "sld": + return "slide" + return prefix.lower() + + # Pattern 2: Elements ending in "Reference" expect a relationship of the prefix type + if elem_lower.endswith("reference") and len(elem_lower) > 9: + prefix = elem_lower[:-9] # Remove "reference" + return prefix.lower() + + return None + + def validate_content_types(self): + """Validate that all content files are properly declared in [Content_Types].xml.""" + errors = [] + + # Find [Content_Types].xml file + content_types_file = self.unpacked_dir / "[Content_Types].xml" + if not content_types_file.exists(): + print("FAILED - [Content_Types].xml file not found") + return False + + try: + # Parse and get all declared parts and extensions + root = lxml.etree.parse(str(content_types_file)).getroot() + declared_parts = set() + declared_extensions = set() + + # Get Override declarations (specific files) + for override in root.findall( + f".//{{{self.CONTENT_TYPES_NAMESPACE}}}Override" + ): + part_name = override.get("PartName") + if part_name is not None: + declared_parts.add(part_name.lstrip("/")) + + # Get Default declarations (by extension) + for default in root.findall( + f".//{{{self.CONTENT_TYPES_NAMESPACE}}}Default" + ): + extension = default.get("Extension") + if extension is not None: + declared_extensions.add(extension.lower()) + + # Root elements that require content type declaration + declarable_roots = { + "sld", + "sldLayout", + "sldMaster", + "presentation", # PowerPoint + "document", # Word + "workbook", + "worksheet", # Excel + "theme", # Common + } + + # Common media file extensions that should be declared + media_extensions = { + "png": "image/png", + "jpg": "image/jpeg", + "jpeg": "image/jpeg", + "gif": "image/gif", + "bmp": "image/bmp", + "tiff": "image/tiff", + "wmf": "image/x-wmf", + "emf": "image/x-emf", + } + + # Get all files in the unpacked directory + all_files = list(self.unpacked_dir.rglob("*")) + all_files = [f for f in all_files if f.is_file()] + + # Check all XML files for Override declarations + for xml_file in self.xml_files: + path_str = str(xml_file.relative_to(self.unpacked_dir)).replace( + "\\", "/" + ) + + # Skip non-content files + if any( + skip in path_str + for skip in [".rels", "[Content_Types]", "docProps/", "_rels/"] + ): + continue + + try: + root_tag = lxml.etree.parse(str(xml_file)).getroot().tag + root_name = root_tag.split("}")[-1] if "}" in root_tag else root_tag + + if root_name in declarable_roots and path_str not in declared_parts: + errors.append( + f" {path_str}: File with <{root_name}> root not declared in [Content_Types].xml" + ) + + except Exception: + continue # Skip unparseable files + + # Check all non-XML files for Default extension declarations + for file_path in all_files: + # Skip XML files and metadata files (already 
checked above) + if file_path.suffix.lower() in {".xml", ".rels"}: + continue + if file_path.name == "[Content_Types].xml": + continue + if "_rels" in file_path.parts or "docProps" in file_path.parts: + continue + + extension = file_path.suffix.lstrip(".").lower() + if extension and extension not in declared_extensions: + # Check if it's a known media extension that should be declared + if extension in media_extensions: + relative_path = file_path.relative_to(self.unpacked_dir) + errors.append( + f' {relative_path}: File with extension \'{extension}\' not declared in [Content_Types].xml - should add: ' + ) + + except Exception as e: + errors.append(f" Error parsing [Content_Types].xml: {e}") + + if errors: + print(f"FAILED - Found {len(errors)} content type declaration errors:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print( + "PASSED - All content files are properly declared in [Content_Types].xml" + ) + return True + + def validate_file_against_xsd(self, xml_file, verbose=False): + """Validate a single XML file against XSD schema, comparing with original. + + Args: + xml_file: Path to XML file to validate + verbose: Enable verbose output + + Returns: + tuple: (is_valid, new_errors_set) where is_valid is True/False/None (skipped) + """ + # Resolve both paths to handle symlinks + xml_file = Path(xml_file).resolve() + unpacked_dir = self.unpacked_dir.resolve() + + # Validate current file + is_valid, current_errors = self._validate_single_file_xsd( + xml_file, unpacked_dir + ) + + if is_valid is None: + return None, set() # Skipped + elif is_valid: + return True, set() # Valid, no errors + + # Get errors from original file for this specific file + original_errors = self._get_original_file_errors(xml_file) + + # Compare with original (both are guaranteed to be sets here) + assert current_errors is not None + new_errors = current_errors - original_errors + + if new_errors: + if verbose: + relative_path = xml_file.relative_to(unpacked_dir) + print(f"FAILED - {relative_path}: {len(new_errors)} new error(s)") + for error in list(new_errors)[:3]: + truncated = error[:250] + "..." if len(error) > 250 else error + print(f" - {truncated}") + return False, new_errors + else: + # All errors existed in original + if verbose: + print( + f"PASSED - No new errors (original had {len(current_errors)} errors)" + ) + return True, set() + + def validate_against_xsd(self): + """Validate XML files against XSD schemas, showing only new errors compared to original.""" + new_errors = [] + original_error_count = 0 + valid_count = 0 + skipped_count = 0 + + for xml_file in self.xml_files: + relative_path = str(xml_file.relative_to(self.unpacked_dir)) + is_valid, new_file_errors = self.validate_file_against_xsd( + xml_file, verbose=False + ) + + if is_valid is None: + skipped_count += 1 + continue + elif is_valid and not new_file_errors: + valid_count += 1 + continue + elif is_valid: + # Had errors but all existed in original + original_error_count += 1 + valid_count += 1 + continue + + # Has new errors + new_errors.append(f" {relative_path}: {len(new_file_errors)} new error(s)") + for error in list(new_file_errors)[:3]: # Show first 3 errors + new_errors.append( + f" - {error[:250]}..." 
if len(error) > 250 else f" - {error}" + ) + + # Print summary + if self.verbose: + print(f"Validated {len(self.xml_files)} files:") + print(f" - Valid: {valid_count}") + print(f" - Skipped (no schema): {skipped_count}") + if original_error_count: + print(f" - With original errors (ignored): {original_error_count}") + print( + f" - With NEW errors: {len(new_errors) > 0 and len([e for e in new_errors if not e.startswith(' ')]) or 0}" + ) + + if new_errors: + print("\nFAILED - Found NEW validation errors:") + for error in new_errors: + print(error) + return False + else: + if self.verbose: + print("\nPASSED - No new XSD validation errors introduced") + return True + + def _get_schema_path(self, xml_file): + """Determine the appropriate schema path for an XML file.""" + # Check exact filename match + if xml_file.name in self.SCHEMA_MAPPINGS: + return self.schemas_dir / self.SCHEMA_MAPPINGS[xml_file.name] + + # Check .rels files + if xml_file.suffix == ".rels": + return self.schemas_dir / self.SCHEMA_MAPPINGS[".rels"] + + # Check chart files + if "charts/" in str(xml_file) and xml_file.name.startswith("chart"): + return self.schemas_dir / self.SCHEMA_MAPPINGS["chart"] + + # Check theme files + if "theme/" in str(xml_file) and xml_file.name.startswith("theme"): + return self.schemas_dir / self.SCHEMA_MAPPINGS["theme"] + + # Check if file is in a main content folder and use appropriate schema + if xml_file.parent.name in self.MAIN_CONTENT_FOLDERS: + return self.schemas_dir / self.SCHEMA_MAPPINGS[xml_file.parent.name] + + return None + + def _clean_ignorable_namespaces(self, xml_doc): + """Remove attributes and elements not in allowed namespaces.""" + # Create a clean copy + xml_string = lxml.etree.tostring(xml_doc, encoding="unicode") + xml_copy = lxml.etree.fromstring(xml_string) + + # Remove attributes not in allowed namespaces + for elem in xml_copy.iter(): + attrs_to_remove = [] + + for attr in elem.attrib: + # Check if attribute is from a namespace other than allowed ones + if "{" in attr: + ns = attr.split("}")[0][1:] + if ns not in self.OOXML_NAMESPACES: + attrs_to_remove.append(attr) + + # Remove collected attributes + for attr in attrs_to_remove: + del elem.attrib[attr] + + # Remove elements not in allowed namespaces + self._remove_ignorable_elements(xml_copy) + + return lxml.etree.ElementTree(xml_copy) + + def _remove_ignorable_elements(self, root): + """Recursively remove all elements not in allowed namespaces.""" + elements_to_remove = [] + + # Find elements to remove + for elem in list(root): + # Skip non-element nodes (comments, processing instructions, etc.) + if not hasattr(elem, "tag") or callable(elem.tag): + continue + + tag_str = str(elem.tag) + if tag_str.startswith("{"): + ns = tag_str.split("}")[0][1:] + if ns not in self.OOXML_NAMESPACES: + elements_to_remove.append(elem) + continue + + # Recursively clean child elements + self._remove_ignorable_elements(elem) + + # Remove collected elements + for elem in elements_to_remove: + root.remove(elem) + + def _preprocess_for_mc_ignorable(self, xml_doc): + """Preprocess XML to handle mc:Ignorable attribute properly.""" + # Remove mc:Ignorable attributes before validation + root = xml_doc.getroot() + + # Remove mc:Ignorable attribute from root + if f"{{{self.MC_NAMESPACE}}}Ignorable" in root.attrib: + del root.attrib[f"{{{self.MC_NAMESPACE}}}Ignorable"] + + return xml_doc + + def _validate_single_file_xsd(self, xml_file, base_path): + """Validate a single XML file against XSD schema. 
Returns (is_valid, errors_set).""" + schema_path = self._get_schema_path(xml_file) + if not schema_path: + return None, None # Skip file + + try: + # Load schema + with open(schema_path, "rb") as xsd_file: + parser = lxml.etree.XMLParser() + xsd_doc = lxml.etree.parse( + xsd_file, parser=parser, base_url=str(schema_path) + ) + schema = lxml.etree.XMLSchema(xsd_doc) + + # Load and preprocess XML + with open(xml_file, "r") as f: + xml_doc = lxml.etree.parse(f) + + xml_doc, _ = self._remove_template_tags_from_text_nodes(xml_doc) + xml_doc = self._preprocess_for_mc_ignorable(xml_doc) + + # Clean ignorable namespaces if needed + relative_path = xml_file.relative_to(base_path) + if ( + relative_path.parts + and relative_path.parts[0] in self.MAIN_CONTENT_FOLDERS + ): + xml_doc = self._clean_ignorable_namespaces(xml_doc) + + # Validate + if schema.validate(xml_doc): + return True, set() + else: + errors = set() + for error in schema.error_log: + # Store normalized error message (without line numbers for comparison) + errors.add(error.message) + return False, errors + + except Exception as e: + return False, {str(e)} + + def _get_original_file_errors(self, xml_file): + """Get XSD validation errors from a single file in the original document. + + Args: + xml_file: Path to the XML file in unpacked_dir to check + + Returns: + set: Set of error messages from the original file + """ + import tempfile + import zipfile + + # Resolve both paths to handle symlinks (e.g., /var vs /private/var on macOS) + xml_file = Path(xml_file).resolve() + unpacked_dir = self.unpacked_dir.resolve() + relative_path = xml_file.relative_to(unpacked_dir) + + with tempfile.TemporaryDirectory() as temp_dir: + temp_path = Path(temp_dir) + + # Extract original file + with zipfile.ZipFile(self.original_file, "r") as zip_ref: + zip_ref.extractall(temp_path) + + # Find corresponding file in original + original_xml_file = temp_path / relative_path + + if not original_xml_file.exists(): + # File didn't exist in original, so no original errors + return set() + + # Validate the specific file in original + is_valid, errors = self._validate_single_file_xsd( + original_xml_file, temp_path + ) + return errors if errors else set() + + def _remove_template_tags_from_text_nodes(self, xml_doc): + """Remove template tags from XML text nodes and collect warnings. + + Template tags follow the pattern {{ ... }} and are used as placeholders + for content replacement. They should be removed from text content before + XSD validation while preserving XML structure. 
+ + Returns: + tuple: (cleaned_xml_doc, warnings_list) + """ + warnings = [] + template_pattern = re.compile(r"\{\{[^}]*\}\}") + + # Create a copy of the document to avoid modifying the original + xml_string = lxml.etree.tostring(xml_doc, encoding="unicode") + xml_copy = lxml.etree.fromstring(xml_string) + + def process_text_content(text, content_type): + if not text: + return text + matches = list(template_pattern.finditer(text)) + if matches: + for match in matches: + warnings.append( + f"Found template tag in {content_type}: {match.group()}" + ) + return template_pattern.sub("", text) + return text + + # Process all text nodes in the document + for elem in xml_copy.iter(): + # Skip processing if this is a w:t element + if not hasattr(elem, "tag") or callable(elem.tag): + continue + tag_str = str(elem.tag) + if tag_str.endswith("}t") or tag_str == "t": + continue + + elem.text = process_text_content(elem.text, "text content") + elem.tail = process_text_content(elem.tail, "tail content") + + return lxml.etree.ElementTree(xml_copy), warnings + + +if __name__ == "__main__": + raise RuntimeError("This module should not be run directly.") diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/docx.py b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/docx.py new file mode 100644 index 00000000..ead1f9f6 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/docx.py @@ -0,0 +1,273 @@ +""" +Validator for Word document XML files against XSD schemas. +""" + +import re +import tempfile +import zipfile + +import lxml.etree + +from .base import BaseSchemaValidator + + +class DOCXSchemaValidator(BaseSchemaValidator): + """Validator for Word document XML files against XSD schemas.""" + + # Word-specific namespace + WORD_2006_NAMESPACE = "http://schemas.openxmlformats.org/wordprocessingml/2006/main" + + # Word-specific element to relationship type mappings + # Start with empty mapping - add specific cases as we discover them + ELEMENT_RELATIONSHIP_TYPES = {} + + def validate(self): + """Run all validation checks and return True if all pass.""" + # Test 0: XML well-formedness + if not self.validate_xml(): + return False + + # Test 1: Namespace declarations + all_valid = True + if not self.validate_namespaces(): + all_valid = False + + # Test 2: Unique IDs + if not self.validate_unique_ids(): + all_valid = False + + # Test 3: Relationship and file reference validation + if not self.validate_file_references(): + all_valid = False + + # Test 4: Content type declarations + if not self.validate_content_types(): + all_valid = False + + # Test 5: XSD schema validation + if not self.validate_against_xsd(): + all_valid = False + + # Test 6: Whitespace preservation + if not self.validate_whitespace_preservation(): + all_valid = False + + # Test 7: Deletion validation + if not self.validate_deletions(): + all_valid = False + + # Test 8: Insertion validation + if not self.validate_insertions(): + all_valid = False + + # Test 9: Relationship ID reference validation + if not self.validate_all_relationship_ids(): + all_valid = False + + # Count and compare paragraphs + self.compare_paragraph_counts() + + return all_valid + + def validate_whitespace_preservation(self): + """ + Validate that w:t elements with whitespace have xml:space='preserve'. 
+ """ + errors = [] + + for xml_file in self.xml_files: + # Only check document.xml files + if xml_file.name != "document.xml": + continue + + try: + root = lxml.etree.parse(str(xml_file)).getroot() + + # Find all w:t elements + for elem in root.iter(f"{{{self.WORD_2006_NAMESPACE}}}t"): + if elem.text: + text = elem.text + # Check if text starts or ends with whitespace + if re.match(r"^\s.*", text) or re.match(r".*\s$", text): + # Check if xml:space="preserve" attribute exists + xml_space_attr = f"{{{self.XML_NAMESPACE}}}space" + if ( + xml_space_attr not in elem.attrib + or elem.attrib[xml_space_attr] != "preserve" + ): + # Show a preview of the text + text_preview = ( + repr(text)[:50] + "..." + if len(repr(text)) > 50 + else repr(text) + ) + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: w:t element with whitespace missing xml:space='preserve': {text_preview}" + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} whitespace preservation violations:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All whitespace is properly preserved") + return True + + def validate_deletions(self): + """ + Validate that w:t elements are not within w:del elements. + For some reason, XSD validation does not catch this, so we do it manually. + """ + errors = [] + + for xml_file in self.xml_files: + # Only check document.xml files + if xml_file.name != "document.xml": + continue + + try: + root = lxml.etree.parse(str(xml_file)).getroot() + + # Find all w:t elements that are descendants of w:del elements + namespaces = {"w": self.WORD_2006_NAMESPACE} + xpath_expression = ".//w:del//w:t" + problematic_t_elements = root.xpath( + xpath_expression, namespaces=namespaces + ) + for t_elem in problematic_t_elements: + if t_elem.text: + # Show a preview of the text + text_preview = ( + repr(t_elem.text)[:50] + "..." 
+ if len(repr(t_elem.text)) > 50
+ else repr(t_elem.text)
+ )
+ errors.append(
+ f" {xml_file.relative_to(self.unpacked_dir)}: "
+ f"Line {t_elem.sourceline}: <w:t> found within <w:del>: {text_preview}"
+ )
+
+ except (lxml.etree.XMLSyntaxError, Exception) as e:
+ errors.append(
+ f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
+ )
+
+ if errors:
+ print(f"FAILED - Found {len(errors)} deletion validation violations:")
+ for error in errors:
+ print(error)
+ return False
+ else:
+ if self.verbose:
+ print("PASSED - No w:t elements found within w:del elements")
+ return True
+
+ def count_paragraphs_in_unpacked(self):
+ """Count the number of paragraphs in the unpacked document."""
+ count = 0
+
+ for xml_file in self.xml_files:
+ # Only check document.xml files
+ if xml_file.name != "document.xml":
+ continue
+
+ try:
+ root = lxml.etree.parse(str(xml_file)).getroot()
+ # Count all w:p elements
+ paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
+ count = len(paragraphs)
+ except Exception as e:
+ print(f"Error counting paragraphs in unpacked document: {e}")
+
+ return count
+
+ def count_paragraphs_in_original(self):
+ """Count the number of paragraphs in the original docx file."""
+ count = 0
+
+ try:
+ # Create temporary directory to unpack original
+ with tempfile.TemporaryDirectory() as temp_dir:
+ # Unpack original docx
+ with zipfile.ZipFile(self.original_file, "r") as zip_ref:
+ zip_ref.extractall(temp_dir)
+
+ # Parse document.xml
+ doc_xml_path = temp_dir + "/word/document.xml"
+ root = lxml.etree.parse(doc_xml_path).getroot()
+
+ # Count all w:p elements
+ paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
+ count = len(paragraphs)
+
+ except Exception as e:
+ print(f"Error counting paragraphs in original document: {e}")
+
+ return count
+
+ def validate_insertions(self):
+ """
+ Validate that w:delText elements are not within w:ins elements.
+ w:delText is only allowed in w:ins if nested within a w:del.
+ """
+ errors = []
+
+ for xml_file in self.xml_files:
+ if xml_file.name != "document.xml":
+ continue
+
+ try:
+ root = lxml.etree.parse(str(xml_file)).getroot()
+ namespaces = {"w": self.WORD_2006_NAMESPACE}
+
+ # Find w:delText in w:ins that are NOT within w:del
+ invalid_elements = root.xpath(
+ ".//w:ins//w:delText[not(ancestor::w:del)]", namespaces=namespaces
+ )
+
+ for elem in invalid_elements:
+ text_preview = (
+ repr(elem.text or "")[:50] + "..."
+ if len(repr(elem.text or "")) > 50
+ else repr(elem.text or "")
+ )
+ errors.append(
+ f" {xml_file.relative_to(self.unpacked_dir)}: "
+ f"Line {elem.sourceline}: <w:delText> within <w:ins>: {text_preview}"
+ )
+
+ except (lxml.etree.XMLSyntaxError, Exception) as e:
+ errors.append(
+ f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
+ )
+
+ if errors:
+ print(f"FAILED - Found {len(errors)} insertion validation violations:")
+ for error in errors:
+ print(error)
+ return False
+ else:
+ if self.verbose:
+ print("PASSED - No w:delText elements within w:ins elements")
+ return True
+
+ def compare_paragraph_counts(self):
+ """Compare paragraph counts between original and new document."""
+ original_count = self.count_paragraphs_in_original()
+ new_count = self.count_paragraphs_in_unpacked()
+
+ diff = new_count - original_count
+ diff_str = f"+{diff}" if diff > 0 else str(diff)
+ print(f"\nParagraphs: {original_count} → {new_count} ({diff_str})")
+
+
+if __name__ == "__main__":
+ raise RuntimeError("This module should not be run directly.")
diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/pptx.py b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/pptx.py
new file mode 100644
index 00000000..66d5b1e2
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/pptx.py
@@ -0,0 +1,315 @@
+"""
+Validator for PowerPoint presentation XML files against XSD schemas.
+"""
+
+import re
+
+from .base import BaseSchemaValidator
+
+
+class PPTXSchemaValidator(BaseSchemaValidator):
+ """Validator for PowerPoint presentation XML files against XSD schemas."""
+
+ # PowerPoint presentation namespace
+ PRESENTATIONML_NAMESPACE = (
+ "http://schemas.openxmlformats.org/presentationml/2006/main"
+ )
+
+ # PowerPoint-specific element to relationship type mappings
+ ELEMENT_RELATIONSHIP_TYPES = {
+ "sldid": "slide",
+ "sldmasterid": "slidemaster",
+ "notesmasterid": "notesmaster",
+ "sldlayoutid": "slidelayout",
+ "themeid": "theme",
+ "tablestyleid": "tablestyles",
+ }
+
+ def validate(self):
+ """Run all validation checks and return True if all pass."""
+ # Test 0: XML well-formedness
+ if not self.validate_xml():
+ return False
+
+ # Test 1: Namespace declarations
+ all_valid = True
+ if not self.validate_namespaces():
+ all_valid = False
+
+ # Test 2: Unique IDs
+ if not self.validate_unique_ids():
+ all_valid = False
+
+ # Test 3: UUID ID validation
+ if not self.validate_uuid_ids():
+ all_valid = False
+
+ # Test 4: Relationship and file reference validation
+ if not self.validate_file_references():
+ all_valid = False
+
+ # Test 5: Slide layout ID validation
+ if not self.validate_slide_layout_ids():
+ all_valid = False
+
+ # Test 6: Content type declarations
+ if not self.validate_content_types():
+ all_valid = False
+
+ # Test 7: XSD schema validation
+ if not self.validate_against_xsd():
+ all_valid = False
+
+ # Test 8: Notes slide reference validation
+ if not self.validate_notes_slide_references():
+ all_valid = False
+
+ # Test 9: Relationship ID reference validation
+ if not self.validate_all_relationship_ids():
+ all_valid = False
+
+ # Test 10: Duplicate slide layout references validation
+ if not self.validate_no_duplicate_slide_layouts():
+ all_valid = False
+
+ return all_valid
+
+ def validate_uuid_ids(self):
+ """Validate that ID attributes that look like UUIDs contain only hex values."""
+ import lxml.etree
+
+ errors = []
+ # UUID pattern: 8-4-4-4-12 hex digits with optional braces/hyphens
+ uuid_pattern = re.compile(
r"^[\{\(]?[0-9A-Fa-f]{8}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{12}[\}\)]?$" + ) + + for xml_file in self.xml_files: + try: + root = lxml.etree.parse(str(xml_file)).getroot() + + # Check all elements for ID attributes + for elem in root.iter(): + for attr, value in elem.attrib.items(): + # Check if this is an ID attribute + attr_name = attr.split("}")[-1].lower() + if attr_name == "id" or attr_name.endswith("id"): + # Check if value looks like a UUID (has the right length and pattern structure) + if self._looks_like_uuid(value): + # Validate that it contains only hex characters in the right positions + if not uuid_pattern.match(value): + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: ID '{value}' appears to be a UUID but contains invalid hex characters" + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} UUID ID validation errors:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All UUID-like IDs contain valid hex values") + return True + + def _looks_like_uuid(self, value): + """Check if a value has the general structure of a UUID.""" + # Remove common UUID delimiters + clean_value = value.strip("{}()").replace("-", "") + # Check if it's 32 hex-like characters (could include invalid hex chars) + return len(clean_value) == 32 and all(c.isalnum() for c in clean_value) + + def validate_slide_layout_ids(self): + """Validate that sldLayoutId elements in slide masters reference valid slide layouts.""" + import lxml.etree + + errors = [] + + # Find all slide master files + slide_masters = list(self.unpacked_dir.glob("ppt/slideMasters/*.xml")) + + if not slide_masters: + if self.verbose: + print("PASSED - No slide masters found") + return True + + for slide_master in slide_masters: + try: + # Parse the slide master file + root = lxml.etree.parse(str(slide_master)).getroot() + + # Find the corresponding _rels file for this slide master + rels_file = slide_master.parent / "_rels" / f"{slide_master.name}.rels" + + if not rels_file.exists(): + errors.append( + f" {slide_master.relative_to(self.unpacked_dir)}: " + f"Missing relationships file: {rels_file.relative_to(self.unpacked_dir)}" + ) + continue + + # Parse the relationships file + rels_root = lxml.etree.parse(str(rels_file)).getroot() + + # Build a set of valid relationship IDs that point to slide layouts + valid_layout_rids = set() + for rel in rels_root.findall( + f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship" + ): + rel_type = rel.get("Type", "") + if "slideLayout" in rel_type: + valid_layout_rids.add(rel.get("Id")) + + # Find all sldLayoutId elements in the slide master + for sld_layout_id in root.findall( + f".//{{{self.PRESENTATIONML_NAMESPACE}}}sldLayoutId" + ): + r_id = sld_layout_id.get( + f"{{{self.OFFICE_RELATIONSHIPS_NAMESPACE}}}id" + ) + layout_id = sld_layout_id.get("id") + + if r_id and r_id not in valid_layout_rids: + errors.append( + f" {slide_master.relative_to(self.unpacked_dir)}: " + f"Line {sld_layout_id.sourceline}: sldLayoutId with id='{layout_id}' " + f"references r:id='{r_id}' which is not found in slide layout relationships" + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {slide_master.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} slide layout ID validation 
errors:") + for error in errors: + print(error) + print( + "Remove invalid references or add missing slide layouts to the relationships file." + ) + return False + else: + if self.verbose: + print("PASSED - All slide layout IDs reference valid slide layouts") + return True + + def validate_no_duplicate_slide_layouts(self): + """Validate that each slide has exactly one slideLayout reference.""" + import lxml.etree + + errors = [] + slide_rels_files = list(self.unpacked_dir.glob("ppt/slides/_rels/*.xml.rels")) + + for rels_file in slide_rels_files: + try: + root = lxml.etree.parse(str(rels_file)).getroot() + + # Find all slideLayout relationships + layout_rels = [ + rel + for rel in root.findall( + f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship" + ) + if "slideLayout" in rel.get("Type", "") + ] + + if len(layout_rels) > 1: + errors.append( + f" {rels_file.relative_to(self.unpacked_dir)}: has {len(layout_rels)} slideLayout references" + ) + + except Exception as e: + errors.append( + f" {rels_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print("FAILED - Found slides with duplicate slideLayout references:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All slides have exactly one slideLayout reference") + return True + + def validate_notes_slide_references(self): + """Validate that each notesSlide file is referenced by only one slide.""" + import lxml.etree + + errors = [] + notes_slide_references = {} # Track which slides reference each notesSlide + + # Find all slide relationship files + slide_rels_files = list(self.unpacked_dir.glob("ppt/slides/_rels/*.xml.rels")) + + if not slide_rels_files: + if self.verbose: + print("PASSED - No slide relationship files found") + return True + + for rels_file in slide_rels_files: + try: + # Parse the relationships file + root = lxml.etree.parse(str(rels_file)).getroot() + + # Find all notesSlide relationships + for rel in root.findall( + f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship" + ): + rel_type = rel.get("Type", "") + if "notesSlide" in rel_type: + target = rel.get("Target", "") + if target: + # Normalize the target path to handle relative paths + normalized_target = target.replace("../", "") + + # Track which slide references this notesSlide + slide_name = rels_file.stem.replace( + ".xml", "" + ) # e.g., "slide1" + + if normalized_target not in notes_slide_references: + notes_slide_references[normalized_target] = [] + notes_slide_references[normalized_target].append( + (slide_name, rels_file) + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {rels_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + # Check for duplicate references + for target, references in notes_slide_references.items(): + if len(references) > 1: + slide_names = [ref[0] for ref in references] + errors.append( + f" Notes slide '{target}' is referenced by multiple slides: {', '.join(slide_names)}" + ) + for slide_name, rels_file in references: + errors.append(f" - {rels_file.relative_to(self.unpacked_dir)}") + + if errors: + print( + f"FAILED - Found {len([e for e in errors if not e.startswith(' ')])} notes slide reference validation errors:" + ) + for error in errors: + print(error) + print("Each slide may optionally have its own slide file.") + return False + else: + if self.verbose: + print("PASSED - All notes slide references are unique") + return True + + +if __name__ == "__main__": + raise RuntimeError("This module should not be run 
directly.") diff --git a/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/redlining.py b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/redlining.py new file mode 100644 index 00000000..e3bf0f96 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/ooxml/scripts/validation/redlining.py @@ -0,0 +1,279 @@ +""" +Validator for tracked changes in Word documents. +""" + +import subprocess +import tempfile +import zipfile +from pathlib import Path + + +class RedliningValidator: + """Validator for tracked changes in Word documents.""" + + def __init__(self, unpacked_dir, original_docx, verbose=False): + self.unpacked_dir = Path(unpacked_dir) + self.original_docx = Path(original_docx) + self.verbose = verbose + self.namespaces = { + "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main" + } + + def validate(self): + """Main validation method that returns True if valid, False otherwise.""" + # Verify unpacked directory exists and has correct structure + modified_file = self.unpacked_dir / "word" / "document.xml" + if not modified_file.exists(): + print(f"FAILED - Modified document.xml not found at {modified_file}") + return False + + # First, check if there are any tracked changes by Ticca to validate + try: + import xml.etree.ElementTree as ET + + tree = ET.parse(modified_file) + root = tree.getroot() + + # Check for w:del or w:ins tags authored by Ticca + del_elements = root.findall(".//w:del", self.namespaces) + ins_elements = root.findall(".//w:ins", self.namespaces) + + # Filter to only include changes by Ticca + ticca_del_elements = [ + elem + for elem in del_elements + if elem.get(f"{{{self.namespaces['w']}}}author") == "Ticca" + ] + ticca_ins_elements = [ + elem + for elem in ins_elements + if elem.get(f"{{{self.namespaces['w']}}}author") == "Ticca" + ] + + # Redlining validation is only needed if tracked changes by Ticca have been used. 
+ if not ticca_del_elements and not ticca_ins_elements:
+ if self.verbose:
+ print("PASSED - No tracked changes by Ticca found.")
+ return True
+
+ except Exception:
+ # If we can't parse the XML, continue with full validation
+ pass
+
+ # Create temporary directory for unpacking original docx
+ with tempfile.TemporaryDirectory() as temp_dir:
+ temp_path = Path(temp_dir)
+
+ # Unpack original docx
+ try:
+ with zipfile.ZipFile(self.original_docx, "r") as zip_ref:
+ zip_ref.extractall(temp_path)
+ except Exception as e:
+ print(f"FAILED - Error unpacking original docx: {e}")
+ return False
+
+ original_file = temp_path / "word" / "document.xml"
+ if not original_file.exists():
+ print(
+ f"FAILED - Original document.xml not found in {self.original_docx}"
+ )
+ return False
+
+ # Parse both XML files using xml.etree.ElementTree for redlining validation
+ try:
+ import xml.etree.ElementTree as ET
+
+ modified_tree = ET.parse(modified_file)
+ modified_root = modified_tree.getroot()
+ original_tree = ET.parse(original_file)
+ original_root = original_tree.getroot()
+ except ET.ParseError as e:
+ print(f"FAILED - Error parsing XML files: {e}")
+ return False
+
+ # Remove Ticca's tracked changes from both documents
+ self._remove_ticca_tracked_changes(original_root)
+ self._remove_ticca_tracked_changes(modified_root)
+
+ # Extract and compare text content
+ modified_text = self._extract_text_content(modified_root)
+ original_text = self._extract_text_content(original_root)
+
+ if modified_text != original_text:
+ # Show detailed character-level differences for each paragraph
+ error_message = self._generate_detailed_diff(
+ original_text, modified_text
+ )
+ print(error_message)
+ return False
+
+ if self.verbose:
+ print("PASSED - All changes by Ticca are properly tracked")
+ return True
+
+ def _generate_detailed_diff(self, original_text, modified_text):
+ """Generate detailed word-level differences using git word diff."""
+ error_parts = [
+ "FAILED - Document text doesn't match after removing Ticca's tracked changes",
+ "",
+ "Likely causes:",
+ " 1. Modified text inside another author's <w:ins> or <w:del> tags",
+ " 2. Made edits without proper tracked changes",
+ " 3. Didn't nest <w:del> inside <w:ins> when deleting another's insertion",
+ "",
+ "For pre-redlined documents, use correct patterns:",
+ " - To reject another's INSERTION: Nest <w:del> inside their <w:ins>",
+ " - To restore another's DELETION: Add new <w:ins> AFTER their <w:del>",
+ "",
+ ]
+
+ # Show git word diff
+ git_diff = self._get_git_word_diff(original_text, modified_text)
+ if git_diff:
+ error_parts.extend(["Differences:", "============", git_diff])
+ else:
+ error_parts.append("Unable to generate word diff (git not available)")
+
+ return "\n".join(error_parts)
+
+ def _get_git_word_diff(self, original_text, modified_text):
+ """Generate word diff using git with character-level precision."""
+ try:
+ with tempfile.TemporaryDirectory() as temp_dir:
+ temp_path = Path(temp_dir)
+
+ # Create two files
+ original_file = temp_path / "original.txt"
+ modified_file = temp_path / "modified.txt"
+
+ original_file.write_text(original_text, encoding="utf-8")
+ modified_file.write_text(modified_text, encoding="utf-8")
+
+ # Try character-level diff first for precise differences
+ result = subprocess.run(
+ [
+ "git",
+ "diff",
+ "--word-diff=plain",
+ "--word-diff-regex=.", # Character-by-character diff
+ "-U0", # Zero lines of context - show only changed lines
+ "--no-index",
+ str(original_file),
+ str(modified_file),
+ ],
+ capture_output=True,
+ text=True,
+ )
+
+ if result.stdout.strip():
+ # Clean up the output - remove git diff header lines
+ lines = result.stdout.split("\n")
+ # Skip the header lines (diff --git, index, +++, ---, @@)
+ content_lines = []
+ in_content = False
+ for line in lines:
+ if line.startswith("@@"):
+ in_content = True
+ continue
+ if in_content and line.strip():
+ content_lines.append(line)
+
+ if content_lines:
+ return "\n".join(content_lines)
+
+ # Fallback to word-level diff if character-level is too verbose
+ result = subprocess.run(
+ [
+ "git",
+ "diff",
+ "--word-diff=plain",
+ "-U0", # Zero lines of context
+ "--no-index",
+ str(original_file),
+ str(modified_file),
+ ],
+ capture_output=True,
+ text=True,
+ )
+
+ if result.stdout.strip():
+ lines = result.stdout.split("\n")
+ content_lines = []
+ in_content = False
+ for line in lines:
+ if line.startswith("@@"):
+ in_content = True
+ continue
+ if in_content and line.strip():
+ content_lines.append(line)
+ return "\n".join(content_lines)
+
+ except (subprocess.CalledProcessError, FileNotFoundError, Exception):
+ # Git not available or other error, return None to use fallback
+ pass
+
+ return None
+
+ def _remove_ticca_tracked_changes(self, root):
+ """Remove tracked changes authored by Ticca from the XML root."""
+ ins_tag = f"{{{self.namespaces['w']}}}ins"
+ del_tag = f"{{{self.namespaces['w']}}}del"
+ author_attr = f"{{{self.namespaces['w']}}}author"
+
+ # Remove w:ins elements
+ for parent in root.iter():
+ to_remove = []
+ for child in parent:
+ if child.tag == ins_tag and child.get(author_attr) == "Ticca":
+ to_remove.append(child)
+ for elem in to_remove:
+ parent.remove(elem)
+
+ # Unwrap content in w:del elements where author is "Ticca"
+ deltext_tag = f"{{{self.namespaces['w']}}}delText"
+ t_tag = f"{{{self.namespaces['w']}}}t"
+
+ for parent in root.iter():
+ to_process = []
+ for child in parent:
+ if child.tag == del_tag and child.get(author_attr) == "Ticca":
+ to_process.append((child, list(parent).index(child)))
+
+ # Process in reverse order to maintain indices
+ for del_elem, del_index in reversed(to_process):
+ # Convert w:delText to w:t before moving
+ for elem in del_elem.iter():
+ if elem.tag == deltext_tag:
+ elem.tag =
t_tag + + # Move all children of w:del to its parent before removing w:del + for child in reversed(list(del_elem)): + parent.insert(del_index, child) + parent.remove(del_elem) + + def _extract_text_content(self, root): + """Extract text content from Word XML, preserving paragraph structure. + + Empty paragraphs are skipped to avoid false positives when tracked + insertions add only structural elements without text content. + """ + p_tag = f"{{{self.namespaces['w']}}}p" + t_tag = f"{{{self.namespaces['w']}}}t" + + paragraphs = [] + for p_elem in root.findall(f".//{p_tag}"): + # Get all text elements within this paragraph + text_parts = [] + for t_elem in p_elem.findall(f".//{t_tag}"): + if t_elem.text: + text_parts.append(t_elem.text) + paragraph_text = "".join(text_parts) + # Skip empty paragraphs - they don't affect content validation + if paragraph_text: + paragraphs.append(paragraph_text) + + return "\n".join(paragraphs) + + +if __name__ == "__main__": + raise RuntimeError("This module should not be run directly.") diff --git a/code_puppy/bundled_skills/Office/docx/scripts/__init__.py b/code_puppy/bundled_skills/Office/docx/scripts/__init__.py new file mode 100644 index 00000000..bf9c5627 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/scripts/__init__.py @@ -0,0 +1 @@ +# Make scripts directory a package for relative imports in tests diff --git a/code_puppy/bundled_skills/Office/docx/scripts/document.py b/code_puppy/bundled_skills/Office/docx/scripts/document.py new file mode 100644 index 00000000..a9c9ca42 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/scripts/document.py @@ -0,0 +1,1290 @@ +#!/usr/bin/env python3 +""" +Library for working with Word documents: comments, tracked changes, and editing. + +Usage: + from skills.docx.scripts.document import Document + + # Initialize + doc = Document('workspace/unpacked') + doc = Document('workspace/unpacked', author="John Doe", initials="JD") + + # Find nodes + node = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "1"}) + node = doc["word/document.xml"].get_node(tag="w:p", line_number=10) + + # Add comments + doc.add_comment(start=node, end=node, text="Comment text") + doc.reply_to_comment(parent_comment_id=0, text="Reply text") + + # Suggest tracked changes + doc["word/document.xml"].suggest_deletion(node) # Delete content + doc["word/document.xml"].revert_insertion(ins_node) # Reject insertion + doc["word/document.xml"].revert_deletion(del_node) # Reject deletion + + # Save + doc.save() +""" + +import html +import random +import shutil +import tempfile +from datetime import datetime, timezone +from pathlib import Path + +from defusedxml import minidom +from ooxml.scripts.pack import pack_document +from ooxml.scripts.validation.docx import DOCXSchemaValidator +from ooxml.scripts.validation.redlining import RedliningValidator + +from .utilities import XMLEditor + +# Path to template files +TEMPLATE_DIR = Path(__file__).parent / "templates" + + +class DocxXMLEditor(XMLEditor): + """XMLEditor that automatically applies RSID, author, and date to new elements. 
+ + Automatically adds attributes to elements that support them when inserting new content: + - w:rsidR, w:rsidRDefault, w:rsidP (for w:p and w:r elements) + - w:author and w:date (for w:ins, w:del, w:comment elements) + - w:id (for w:ins and w:del elements) + + Attributes: + dom (defusedxml.minidom.Document): The DOM document for direct manipulation + """ + + def __init__(self, xml_path, rsid: str, author: str = "Ticca", initials: str = "T"): + """Initialize with required RSID and optional author. + + Args: + xml_path: Path to XML file to edit + rsid: RSID to automatically apply to new elements + author: Author name for tracked changes and comments (default: "Ticca") + initials: Author initials (default: "C") + """ + super().__init__(xml_path) + self.rsid = rsid + self.author = author + self.initials = initials + + def _get_next_change_id(self): + """Get the next available change ID by checking all tracked change elements.""" + max_id = -1 + for tag in ("w:ins", "w:del"): + elements = self.dom.getElementsByTagName(tag) + for elem in elements: + change_id = elem.getAttribute("w:id") + if change_id: + try: + max_id = max(max_id, int(change_id)) + except ValueError: + pass + return max_id + 1 + + def _ensure_w16du_namespace(self): + """Ensure w16du namespace is declared on the root element.""" + root = self.dom.documentElement + if not root.hasAttribute("xmlns:w16du"): # type: ignore + root.setAttribute( # type: ignore + "xmlns:w16du", + "http://schemas.microsoft.com/office/word/2023/wordml/word16du", + ) + + def _ensure_w16cex_namespace(self): + """Ensure w16cex namespace is declared on the root element.""" + root = self.dom.documentElement + if not root.hasAttribute("xmlns:w16cex"): # type: ignore + root.setAttribute( # type: ignore + "xmlns:w16cex", + "http://schemas.microsoft.com/office/word/2018/wordml/cex", + ) + + def _ensure_w14_namespace(self): + """Ensure w14 namespace is declared on the root element.""" + root = self.dom.documentElement + if not root.hasAttribute("xmlns:w14"): # type: ignore + root.setAttribute( # type: ignore + "xmlns:w14", + "http://schemas.microsoft.com/office/word/2010/wordml", + ) + + def _inject_attributes_to_nodes(self, nodes): + """Inject RSID, author, and date attributes into DOM nodes where applicable. 
+ + Adds attributes to elements that support them: + - w:r: gets w:rsidR (or w:rsidDel if inside w:del) + - w:p: gets w:rsidR, w:rsidRDefault, w:rsidP, w14:paraId, w14:textId + - w:t: gets xml:space="preserve" if text has leading/trailing whitespace + - w:ins, w:del: get w:id, w:author, w:date, w16du:dateUtc + - w:comment: gets w:author, w:date, w:initials + - w16cex:commentExtensible: gets w16cex:dateUtc + + Args: + nodes: List of DOM nodes to process + """ + from datetime import datetime, timezone + + timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") + + def is_inside_deletion(elem): + """Check if element is inside a w:del element.""" + parent = elem.parentNode + while parent: + if parent.nodeType == parent.ELEMENT_NODE and parent.tagName == "w:del": + return True + parent = parent.parentNode + return False + + def add_rsid_to_p(elem): + if not elem.hasAttribute("w:rsidR"): + elem.setAttribute("w:rsidR", self.rsid) + if not elem.hasAttribute("w:rsidRDefault"): + elem.setAttribute("w:rsidRDefault", self.rsid) + if not elem.hasAttribute("w:rsidP"): + elem.setAttribute("w:rsidP", self.rsid) + # Add w14:paraId and w14:textId if not present + if not elem.hasAttribute("w14:paraId"): + self._ensure_w14_namespace() + elem.setAttribute("w14:paraId", _generate_hex_id()) + if not elem.hasAttribute("w14:textId"): + self._ensure_w14_namespace() + elem.setAttribute("w14:textId", _generate_hex_id()) + + def add_rsid_to_r(elem): + # Use w:rsidDel for inside , otherwise w:rsidR + if is_inside_deletion(elem): + if not elem.hasAttribute("w:rsidDel"): + elem.setAttribute("w:rsidDel", self.rsid) + else: + if not elem.hasAttribute("w:rsidR"): + elem.setAttribute("w:rsidR", self.rsid) + + def add_tracked_change_attrs(elem): + # Auto-assign w:id if not present + if not elem.hasAttribute("w:id"): + elem.setAttribute("w:id", str(self._get_next_change_id())) + if not elem.hasAttribute("w:author"): + elem.setAttribute("w:author", self.author) + if not elem.hasAttribute("w:date"): + elem.setAttribute("w:date", timestamp) + # Add w16du:dateUtc for tracked changes (same as w:date since we generate UTC timestamps) + if elem.tagName in ("w:ins", "w:del") and not elem.hasAttribute( + "w16du:dateUtc" + ): + self._ensure_w16du_namespace() + elem.setAttribute("w16du:dateUtc", timestamp) + + def add_comment_attrs(elem): + if not elem.hasAttribute("w:author"): + elem.setAttribute("w:author", self.author) + if not elem.hasAttribute("w:date"): + elem.setAttribute("w:date", timestamp) + if not elem.hasAttribute("w:initials"): + elem.setAttribute("w:initials", self.initials) + + def add_comment_extensible_date(elem): + # Add w16cex:dateUtc for comment extensible elements + if not elem.hasAttribute("w16cex:dateUtc"): + self._ensure_w16cex_namespace() + elem.setAttribute("w16cex:dateUtc", timestamp) + + def add_xml_space_to_t(elem): + # Add xml:space="preserve" to w:t if text has leading/trailing whitespace + if ( + elem.firstChild + and elem.firstChild.nodeType == elem.firstChild.TEXT_NODE + ): + text = elem.firstChild.data + if text and (text[0].isspace() or text[-1].isspace()): + if not elem.hasAttribute("xml:space"): + elem.setAttribute("xml:space", "preserve") + + for node in nodes: + if node.nodeType != node.ELEMENT_NODE: + continue + + # Handle the node itself + if node.tagName == "w:p": + add_rsid_to_p(node) + elif node.tagName == "w:r": + add_rsid_to_r(node) + elif node.tagName == "w:t": + add_xml_space_to_t(node) + elif node.tagName in ("w:ins", "w:del"): + add_tracked_change_attrs(node) + elif 
node.tagName == "w:comment": + add_comment_attrs(node) + elif node.tagName == "w16cex:commentExtensible": + add_comment_extensible_date(node) + + # Process descendants (getElementsByTagName doesn't return the element itself) + for elem in node.getElementsByTagName("w:p"): + add_rsid_to_p(elem) + for elem in node.getElementsByTagName("w:r"): + add_rsid_to_r(elem) + for elem in node.getElementsByTagName("w:t"): + add_xml_space_to_t(elem) + for tag in ("w:ins", "w:del"): + for elem in node.getElementsByTagName(tag): + add_tracked_change_attrs(elem) + for elem in node.getElementsByTagName("w:comment"): + add_comment_attrs(elem) + for elem in node.getElementsByTagName("w16cex:commentExtensible"): + add_comment_extensible_date(elem) + + def replace_node(self, elem, new_content): + """Replace node with automatic attribute injection.""" + nodes = super().replace_node(elem, new_content) + self._inject_attributes_to_nodes(nodes) + return nodes + + def insert_after(self, elem, xml_content): + """Insert after with automatic attribute injection.""" + nodes = super().insert_after(elem, xml_content) + self._inject_attributes_to_nodes(nodes) + return nodes + + def insert_before(self, elem, xml_content): + """Insert before with automatic attribute injection.""" + nodes = super().insert_before(elem, xml_content) + self._inject_attributes_to_nodes(nodes) + return nodes + + def append_to(self, elem, xml_content): + """Append to with automatic attribute injection.""" + nodes = super().append_to(elem, xml_content) + self._inject_attributes_to_nodes(nodes) + return nodes + + def revert_insertion(self, elem): + """Reject an insertion by wrapping its content in a deletion. + + Wraps all runs inside w:ins in w:del, converting w:t to w:delText. + Can process a single w:ins element or a container element with multiple w:ins. + + Args: + elem: Element to process (w:ins, w:p, w:body, etc.) + + Returns: + list: List containing the processed element(s) + + Raises: + ValueError: If the element contains no w:ins elements + + Example: + # Reject a single insertion + ins = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "5"}) + doc["word/document.xml"].revert_insertion(ins) + + # Reject all insertions in a paragraph + para = doc["word/document.xml"].get_node(tag="w:p", line_number=42) + doc["word/document.xml"].revert_insertion(para) + """ + # Collect insertions + ins_elements = [] + if elem.tagName == "w:ins": + ins_elements.append(elem) + else: + ins_elements.extend(elem.getElementsByTagName("w:ins")) + + # Validate that there are insertions to reject + if not ins_elements: + raise ValueError( + f"revert_insertion requires w:ins elements. " + f"The provided element <{elem.tagName}> contains no insertions. 
" + ) + + # Process all insertions - wrap all children in w:del + for ins_elem in ins_elements: + runs = list(ins_elem.getElementsByTagName("w:r")) + if not runs: + continue + + # Create deletion wrapper + del_wrapper = self.dom.createElement("w:del") + + # Process each run + for run in runs: + # Convert w:t → w:delText and w:rsidR → w:rsidDel + if run.hasAttribute("w:rsidR"): + run.setAttribute("w:rsidDel", run.getAttribute("w:rsidR")) + run.removeAttribute("w:rsidR") + elif not run.hasAttribute("w:rsidDel"): + run.setAttribute("w:rsidDel", self.rsid) + + for t_elem in list(run.getElementsByTagName("w:t")): + del_text = self.dom.createElement("w:delText") + # Copy ALL child nodes (not just firstChild) to handle entities + while t_elem.firstChild: + del_text.appendChild(t_elem.firstChild) + for i in range(t_elem.attributes.length): + attr = t_elem.attributes.item(i) + del_text.setAttribute(attr.name, attr.value) + t_elem.parentNode.replaceChild(del_text, t_elem) + + # Move all children from ins to del wrapper + while ins_elem.firstChild: + del_wrapper.appendChild(ins_elem.firstChild) + + # Add del wrapper back to ins + ins_elem.appendChild(del_wrapper) + + # Inject attributes to the deletion wrapper + self._inject_attributes_to_nodes([del_wrapper]) + + return [elem] + + def revert_deletion(self, elem): + """Reject a deletion by re-inserting the deleted content. + + Creates w:ins elements after each w:del, copying deleted content and + converting w:delText back to w:t. + Can process a single w:del element or a container element with multiple w:del. + + Args: + elem: Element to process (w:del, w:p, w:body, etc.) + + Returns: + list: If elem is w:del, returns [elem, new_ins]. Otherwise returns [elem]. + + Raises: + ValueError: If the element contains no w:del elements + + Example: + # Reject a single deletion - returns [w:del, w:ins] + del_elem = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "3"}) + nodes = doc["word/document.xml"].revert_deletion(del_elem) + + # Reject all deletions in a paragraph - returns [para] + para = doc["word/document.xml"].get_node(tag="w:p", line_number=42) + nodes = doc["word/document.xml"].revert_deletion(para) + """ + # Collect deletions FIRST - before we modify the DOM + del_elements = [] + is_single_del = elem.tagName == "w:del" + + if is_single_del: + del_elements.append(elem) + else: + del_elements.extend(elem.getElementsByTagName("w:del")) + + # Validate that there are deletions to reject + if not del_elements: + raise ValueError( + f"revert_deletion requires w:del elements. " + f"The provided element <{elem.tagName}> contains no deletions. 
" + ) + + # Track created insertion (only relevant if elem is a single w:del) + created_insertion = None + + # Process all deletions - create insertions that copy the deleted content + for del_elem in del_elements: + # Clone the deleted runs and convert them to insertions + runs = list(del_elem.getElementsByTagName("w:r")) + if not runs: + continue + + # Create insertion wrapper + ins_elem = self.dom.createElement("w:ins") + + for run in runs: + # Clone the run + new_run = run.cloneNode(True) + + # Convert w:delText → w:t + for del_text in list(new_run.getElementsByTagName("w:delText")): + t_elem = self.dom.createElement("w:t") + # Copy ALL child nodes (not just firstChild) to handle entities + while del_text.firstChild: + t_elem.appendChild(del_text.firstChild) + for i in range(del_text.attributes.length): + attr = del_text.attributes.item(i) + t_elem.setAttribute(attr.name, attr.value) + del_text.parentNode.replaceChild(t_elem, del_text) + + # Update run attributes: w:rsidDel → w:rsidR + if new_run.hasAttribute("w:rsidDel"): + new_run.setAttribute("w:rsidR", new_run.getAttribute("w:rsidDel")) + new_run.removeAttribute("w:rsidDel") + elif not new_run.hasAttribute("w:rsidR"): + new_run.setAttribute("w:rsidR", self.rsid) + + ins_elem.appendChild(new_run) + + # Insert the new insertion after the deletion + nodes = self.insert_after(del_elem, ins_elem.toxml()) + + # If processing a single w:del, track the created insertion + if is_single_del and nodes: + created_insertion = nodes[0] + + # Return based on input type + if is_single_del and created_insertion: + return [elem, created_insertion] + else: + return [elem] + + @staticmethod + def suggest_paragraph(xml_content: str) -> str: + """Transform paragraph XML to add tracked change wrapping for insertion. + + Wraps runs in and adds to w:rPr in w:pPr for numbered lists. + + Args: + xml_content: XML string containing a element + + Returns: + str: Transformed XML with tracked change wrapping + """ + wrapper = f'{xml_content}' + doc = minidom.parseString(wrapper) + para = doc.getElementsByTagName("w:p")[0] + + # Ensure w:pPr exists + pPr_list = para.getElementsByTagName("w:pPr") + if not pPr_list: + pPr = doc.createElement("w:pPr") + para.insertBefore( + pPr, para.firstChild + ) if para.firstChild else para.appendChild(pPr) + else: + pPr = pPr_list[0] + + # Ensure w:rPr exists in w:pPr + rPr_list = pPr.getElementsByTagName("w:rPr") + if not rPr_list: + rPr = doc.createElement("w:rPr") + pPr.appendChild(rPr) + else: + rPr = rPr_list[0] + + # Add to w:rPr + ins_marker = doc.createElement("w:ins") + rPr.insertBefore( + ins_marker, rPr.firstChild + ) if rPr.firstChild else rPr.appendChild(ins_marker) + + # Wrap all non-pPr children in + ins_wrapper = doc.createElement("w:ins") + for child in [c for c in para.childNodes if c.nodeName != "w:pPr"]: + para.removeChild(child) + ins_wrapper.appendChild(child) + para.appendChild(ins_wrapper) + + return para.toxml() + + def suggest_deletion(self, elem): + """Mark a w:r or w:p element as deleted with tracked changes (in-place DOM manipulation). 
+ + For w:r: wraps in , converts to , preserves w:rPr + For w:p (regular): wraps content in , converts to + For w:p (numbered list): adds to w:rPr in w:pPr, wraps content in + + Args: + elem: A w:r or w:p DOM element without existing tracked changes + + Returns: + Element: The modified element + + Raises: + ValueError: If element has existing tracked changes or invalid structure + """ + if elem.nodeName == "w:r": + # Check for existing w:delText + if elem.getElementsByTagName("w:delText"): + raise ValueError("w:r element already contains w:delText") + + # Convert w:t → w:delText + for t_elem in list(elem.getElementsByTagName("w:t")): + del_text = self.dom.createElement("w:delText") + # Copy ALL child nodes (not just firstChild) to handle entities + while t_elem.firstChild: + del_text.appendChild(t_elem.firstChild) + # Preserve attributes like xml:space + for i in range(t_elem.attributes.length): + attr = t_elem.attributes.item(i) + del_text.setAttribute(attr.name, attr.value) + t_elem.parentNode.replaceChild(del_text, t_elem) + + # Update run attributes: w:rsidR → w:rsidDel + if elem.hasAttribute("w:rsidR"): + elem.setAttribute("w:rsidDel", elem.getAttribute("w:rsidR")) + elem.removeAttribute("w:rsidR") + elif not elem.hasAttribute("w:rsidDel"): + elem.setAttribute("w:rsidDel", self.rsid) + + # Wrap in w:del + del_wrapper = self.dom.createElement("w:del") + parent = elem.parentNode + parent.insertBefore(del_wrapper, elem) + parent.removeChild(elem) + del_wrapper.appendChild(elem) + + # Inject attributes to the deletion wrapper + self._inject_attributes_to_nodes([del_wrapper]) + + return del_wrapper + + elif elem.nodeName == "w:p": + # Check for existing tracked changes + if elem.getElementsByTagName("w:ins") or elem.getElementsByTagName("w:del"): + raise ValueError("w:p element already contains tracked changes") + + # Check if it's a numbered list item + pPr_list = elem.getElementsByTagName("w:pPr") + is_numbered = pPr_list and pPr_list[0].getElementsByTagName("w:numPr") + + if is_numbered: + # Add to w:rPr in w:pPr + pPr = pPr_list[0] + rPr_list = pPr.getElementsByTagName("w:rPr") + + if not rPr_list: + rPr = self.dom.createElement("w:rPr") + pPr.appendChild(rPr) + else: + rPr = rPr_list[0] + + # Add marker + del_marker = self.dom.createElement("w:del") + rPr.insertBefore( + del_marker, rPr.firstChild + ) if rPr.firstChild else rPr.appendChild(del_marker) + + # Inject attributes into the marker + self._inject_attributes_to_nodes([del_marker]) + + # Convert w:t → w:delText in all runs + for t_elem in list(elem.getElementsByTagName("w:t")): + del_text = self.dom.createElement("w:delText") + # Copy ALL child nodes (not just firstChild) to handle entities + while t_elem.firstChild: + del_text.appendChild(t_elem.firstChild) + # Preserve attributes like xml:space + for i in range(t_elem.attributes.length): + attr = t_elem.attributes.item(i) + del_text.setAttribute(attr.name, attr.value) + t_elem.parentNode.replaceChild(del_text, t_elem) + + # Update run attributes: w:rsidR → w:rsidDel + for run in elem.getElementsByTagName("w:r"): + if run.hasAttribute("w:rsidR"): + run.setAttribute("w:rsidDel", run.getAttribute("w:rsidR")) + run.removeAttribute("w:rsidR") + elif not run.hasAttribute("w:rsidDel"): + run.setAttribute("w:rsidDel", self.rsid) + + # Wrap all non-pPr children in + del_wrapper = self.dom.createElement("w:del") + for child in [c for c in elem.childNodes if c.nodeName != "w:pPr"]: + elem.removeChild(child) + del_wrapper.appendChild(child) + elem.appendChild(del_wrapper) + + # Inject 
attributes to the deletion wrapper
+            self._inject_attributes_to_nodes([del_wrapper])
+
+            return elem
+
+        else:
+            raise ValueError(f"Element must be w:r or w:p, got {elem.nodeName}")
+
+
+def _generate_hex_id() -> str:
+    """Generate random 8-character hex ID for para/durable IDs.
+
+    Values are constrained to be less than 0x7FFFFFFF per OOXML spec:
+    - paraId must be < 0x80000000
+    - durableId must be < 0x7FFFFFFF
+    We use the stricter constraint (0x7FFFFFFF) for both.
+    """
+    return f"{random.randint(1, 0x7FFFFFFE):08X}"
+
+
+def _generate_rsid() -> str:
+    """Generate random 8-character hex RSID."""
+    return "".join(random.choices("0123456789ABCDEF", k=8))
+
+
+class Document:
+    """Manages comments in unpacked Word documents."""
+
+    def __init__(
+        self,
+        unpacked_dir,
+        rsid=None,
+        track_revisions=False,
+        author="Ticca",
+        initials="T",
+    ):
+        """
+        Initialize with path to unpacked Word document directory.
+        Automatically sets up comment infrastructure (people.xml, RSIDs).
+
+        Args:
+            unpacked_dir: Path to unpacked DOCX directory (must contain word/ subdirectory)
+            rsid: Optional RSID to use for all comment elements. If not provided, one will be generated.
+            track_revisions: If True, enables track revisions in settings.xml (default: False)
+            author: Default author name for comments (default: "Ticca")
+            initials: Default author initials for comments (default: "T")
+        """
+        self.original_path = Path(unpacked_dir)
+
+        if not self.original_path.exists() or not self.original_path.is_dir():
+            raise ValueError(f"Directory not found: {unpacked_dir}")
+
+        # Create temporary directory with subdirectories for unpacked content and baseline
+        self.temp_dir = tempfile.mkdtemp(prefix="docx_")
+        self.unpacked_path = Path(self.temp_dir) / "unpacked"
+        shutil.copytree(self.original_path, self.unpacked_path)
+
+        # Pack original directory into temporary .docx for validation baseline (outside unpacked dir)
+        self.original_docx = Path(self.temp_dir) / "original.docx"
+        pack_document(self.original_path, self.original_docx, validate=False)
+
+        self.word_path = self.unpacked_path / "word"
+
+        # Generate RSID if not provided
+        self.rsid = rsid if rsid else _generate_rsid()
+        print(f"Using RSID: {self.rsid}")
+
+        # Set default author and initials
+        self.author = author
+        self.initials = initials
+
+        # Cache for lazy-loaded editors
+        self._editors = {}
+
+        # Comment file paths
+        self.comments_path = self.word_path / "comments.xml"
+        self.comments_extended_path = self.word_path / "commentsExtended.xml"
+        self.comments_ids_path = self.word_path / "commentsIds.xml"
+        self.comments_extensible_path = self.word_path / "commentsExtensible.xml"
+
+        # Load existing comments and determine next ID (before setup modifies files)
+        self.existing_comments = self._load_existing_comments()
+        self.next_comment_id = self._get_next_comment_id()
+
+        # Convenient access to document.xml editor (semi-private)
+        self._document = self["word/document.xml"]
+
+        # Setup tracked changes infrastructure
+        self._setup_tracking(track_revisions=track_revisions)
+
+        # Add author to people.xml
+        self._add_author_to_people(author)
+
+    def __getitem__(self, xml_path: str) -> DocxXMLEditor:
+        """
+        Get or create a DocxXMLEditor for the specified XML file.
+ + Enables lazy-loaded editors with bracket notation: + node = doc["word/document.xml"].get_node(tag="w:p", line_number=42) + + Args: + xml_path: Relative path to XML file (e.g., "word/document.xml", "word/comments.xml") + + Returns: + DocxXMLEditor instance for the specified file + + Raises: + ValueError: If the file does not exist + + Example: + # Get node from document.xml + node = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "1"}) + + # Get node from comments.xml + comment = doc["word/comments.xml"].get_node(tag="w:comment", attrs={"w:id": "0"}) + """ + if xml_path not in self._editors: + file_path = self.unpacked_path / xml_path + if not file_path.exists(): + raise ValueError(f"XML file not found: {xml_path}") + # Use DocxXMLEditor with RSID, author, and initials for all editors + self._editors[xml_path] = DocxXMLEditor( + file_path, rsid=self.rsid, author=self.author, initials=self.initials + ) + return self._editors[xml_path] + + def add_comment(self, start, end, text: str) -> int: + """ + Add a comment spanning from one element to another. + + Args: + start: DOM element for the starting point + end: DOM element for the ending point + text: Comment content + + Returns: + The comment ID that was created + + Example: + start_node = cm.get_document_node(tag="w:del", id="1") + end_node = cm.get_document_node(tag="w:ins", id="2") + cm.add_comment(start=start_node, end=end_node, text="Explanation") + """ + comment_id = self.next_comment_id + para_id = _generate_hex_id() + durable_id = _generate_hex_id() + timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") + + # Add comment ranges to document.xml immediately + self._document.insert_before(start, self._comment_range_start_xml(comment_id)) + + # If end node is a paragraph, append comment markup inside it + # Otherwise insert after it (for run-level anchors) + if end.tagName == "w:p": + self._document.append_to(end, self._comment_range_end_xml(comment_id)) + else: + self._document.insert_after(end, self._comment_range_end_xml(comment_id)) + + # Add to comments.xml immediately + self._add_to_comments_xml( + comment_id, para_id, text, self.author, self.initials, timestamp + ) + + # Add to commentsExtended.xml immediately + self._add_to_comments_extended_xml(para_id, parent_para_id=None) + + # Add to commentsIds.xml immediately + self._add_to_comments_ids_xml(para_id, durable_id) + + # Add to commentsExtensible.xml immediately + self._add_to_comments_extensible_xml(durable_id) + + # Update existing_comments so replies work + self.existing_comments[comment_id] = {"para_id": para_id} + + self.next_comment_id += 1 + return comment_id + + def reply_to_comment( + self, + parent_comment_id: int, + text: str, + ) -> int: + """ + Add a reply to an existing comment. 
+ + Args: + parent_comment_id: The w:id of the parent comment to reply to + text: Reply text + + Returns: + The comment ID that was created for the reply + + Example: + cm.reply_to_comment(parent_comment_id=0, text="I agree with this change") + """ + if parent_comment_id not in self.existing_comments: + raise ValueError(f"Parent comment with id={parent_comment_id} not found") + + parent_info = self.existing_comments[parent_comment_id] + comment_id = self.next_comment_id + para_id = _generate_hex_id() + durable_id = _generate_hex_id() + timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") + + # Add comment ranges to document.xml immediately + parent_start_elem = self._document.get_node( + tag="w:commentRangeStart", attrs={"w:id": str(parent_comment_id)} + ) + parent_ref_elem = self._document.get_node( + tag="w:commentReference", attrs={"w:id": str(parent_comment_id)} + ) + + self._document.insert_after( + parent_start_elem, self._comment_range_start_xml(comment_id) + ) + parent_ref_run = parent_ref_elem.parentNode + self._document.insert_after( + parent_ref_run, f'' + ) + self._document.insert_after( + parent_ref_run, self._comment_ref_run_xml(comment_id) + ) + + # Add to comments.xml immediately + self._add_to_comments_xml( + comment_id, para_id, text, self.author, self.initials, timestamp + ) + + # Add to commentsExtended.xml immediately (with parent) + self._add_to_comments_extended_xml( + para_id, parent_para_id=parent_info["para_id"] + ) + + # Add to commentsIds.xml immediately + self._add_to_comments_ids_xml(para_id, durable_id) + + # Add to commentsExtensible.xml immediately + self._add_to_comments_extensible_xml(durable_id) + + # Update existing_comments so replies work + self.existing_comments[comment_id] = {"para_id": para_id} + + self.next_comment_id += 1 + return comment_id + + def suggest_paragraph(self, xml_content: str) -> str: + """Transform paragraph XML to add tracked change wrapping for insertion. + + Wraps runs in and adds to w:rPr in w:pPr for numbered lists. + + Args: + xml_content: XML string containing a element + + Returns: + str: Transformed XML with tracked change wrapping + """ + return DocxXMLEditor.suggest_paragraph(xml_content) + + def __del__(self): + """Clean up temporary directory on deletion.""" + if hasattr(self, "temp_dir") and Path(self.temp_dir).exists(): + shutil.rmtree(self.temp_dir) + + def validate(self) -> None: + """ + Validate the document against XSD schema and redlining rules. + + Raises: + ValueError: If validation fails. + """ + # Create validators with current state + schema_validator = DOCXSchemaValidator( + self.unpacked_path, self.original_docx, verbose=False + ) + redlining_validator = RedliningValidator( + self.unpacked_path, self.original_docx, verbose=False + ) + + # Run validations + if not schema_validator.validate(): + raise ValueError("Schema validation failed") + if not redlining_validator.validate(): + raise ValueError("Redlining validation failed") + + def save(self, destination=None, validate=True) -> None: + """ + Save all modified XML files to disk and copy to destination directory. + + This persists all changes made via add_comment() and reply_to_comment(). + + Args: + destination: Optional path to save to. If None, saves back to original directory. + validate: If True, validates document before saving (default: True). 
+ """ + # Only ensure comment relationships and content types if comment files exist + if self.comments_path.exists(): + self._ensure_comment_relationships() + self._ensure_comment_content_types() + + # Save all modified XML files in temp directory + for editor in self._editors.values(): + editor.save() + + # Validate by default + if validate: + self.validate() + + # Copy contents from temp directory to destination (or original directory) + target_path = Path(destination) if destination else self.original_path + shutil.copytree(self.unpacked_path, target_path, dirs_exist_ok=True) + + # ==================== Private: Initialization ==================== + + def _get_next_comment_id(self): + """Get the next available comment ID.""" + if not self.comments_path.exists(): + return 0 + + editor = self["word/comments.xml"] + max_id = -1 + for comment_elem in editor.dom.getElementsByTagName("w:comment"): + comment_id = comment_elem.getAttribute("w:id") + if comment_id: + try: + max_id = max(max_id, int(comment_id)) + except ValueError: + pass + return max_id + 1 + + def _load_existing_comments(self): + """Load existing comments from files to enable replies.""" + if not self.comments_path.exists(): + return {} + + editor = self["word/comments.xml"] + existing = {} + + for comment_elem in editor.dom.getElementsByTagName("w:comment"): + comment_id = comment_elem.getAttribute("w:id") + if not comment_id: + continue + + # Find para_id from the w:p element within the comment + para_id = None + for p_elem in comment_elem.getElementsByTagName("w:p"): + para_id = p_elem.getAttribute("w14:paraId") + if para_id: + break + + if not para_id: + continue + + existing[int(comment_id)] = {"para_id": para_id} + + return existing + + # ==================== Private: Setup Methods ==================== + + def _setup_tracking(self, track_revisions=False): + """Set up comment infrastructure in unpacked directory. 
+ + Args: + track_revisions: If True, enables track revisions in settings.xml + """ + # Create or update word/people.xml + people_file = self.word_path / "people.xml" + self._update_people_xml(people_file) + + # Update XML files + self._add_content_type_for_people(self.unpacked_path / "[Content_Types].xml") + self._add_relationship_for_people( + self.word_path / "_rels" / "document.xml.rels" + ) + + # Always add RSID to settings.xml, optionally enable trackRevisions + self._update_settings( + self.word_path / "settings.xml", track_revisions=track_revisions + ) + + def _update_people_xml(self, path): + """Create people.xml if it doesn't exist.""" + if not path.exists(): + # Copy from template + shutil.copy(TEMPLATE_DIR / "people.xml", path) + + def _add_content_type_for_people(self, path): + """Add people.xml content type to [Content_Types].xml if not already present.""" + editor = self["[Content_Types].xml"] + + if self._has_override(editor, "/word/people.xml"): + return + + # Add Override element + root = editor.dom.documentElement + override_xml = '' + editor.append_to(root, override_xml) + + def _add_relationship_for_people(self, path): + """Add people.xml relationship to document.xml.rels if not already present.""" + editor = self["word/_rels/document.xml.rels"] + + if self._has_relationship(editor, "people.xml"): + return + + root = editor.dom.documentElement + root_tag = root.tagName # type: ignore + prefix = root_tag.split(":")[0] + ":" if ":" in root_tag else "" + next_rid = editor.get_next_rid() + + # Create the relationship entry + rel_xml = f'<{prefix}Relationship Id="{next_rid}" Type="http://schemas.microsoft.com/office/2011/relationships/people" Target="people.xml"/>' + editor.append_to(root, rel_xml) + + def _update_settings(self, path, track_revisions=False): + """Add RSID and optionally enable track revisions in settings.xml. 
+ + Args: + path: Path to settings.xml + track_revisions: If True, adds trackRevisions element + + Places elements per OOXML schema order: + - trackRevisions: early (before defaultTabStop) + - rsids: late (after compat) + """ + editor = self["word/settings.xml"] + root = editor.get_node(tag="w:settings") + prefix = root.tagName.split(":")[0] if ":" in root.tagName else "w" + + # Conditionally add trackRevisions if requested + if track_revisions: + track_revisions_exists = any( + elem.tagName == f"{prefix}:trackRevisions" + for elem in editor.dom.getElementsByTagName(f"{prefix}:trackRevisions") + ) + + if not track_revisions_exists: + track_rev_xml = f"<{prefix}:trackRevisions/>" + # Try to insert before documentProtection, defaultTabStop, or at start + inserted = False + for tag in [f"{prefix}:documentProtection", f"{prefix}:defaultTabStop"]: + elements = editor.dom.getElementsByTagName(tag) + if elements: + editor.insert_before(elements[0], track_rev_xml) + inserted = True + break + if not inserted: + # Insert as first child of settings + if root.firstChild: + editor.insert_before(root.firstChild, track_rev_xml) + else: + editor.append_to(root, track_rev_xml) + + # Always check if rsids section exists + rsids_elements = editor.dom.getElementsByTagName(f"{prefix}:rsids") + + if not rsids_elements: + # Add new rsids section + rsids_xml = f'''<{prefix}:rsids> + <{prefix}:rsidRoot {prefix}:val="{self.rsid}"/> + <{prefix}:rsid {prefix}:val="{self.rsid}"/> +''' + + # Try to insert after compat, before clrSchemeMapping, or before closing tag + inserted = False + compat_elements = editor.dom.getElementsByTagName(f"{prefix}:compat") + if compat_elements: + editor.insert_after(compat_elements[0], rsids_xml) + inserted = True + + if not inserted: + clr_elements = editor.dom.getElementsByTagName( + f"{prefix}:clrSchemeMapping" + ) + if clr_elements: + editor.insert_before(clr_elements[0], rsids_xml) + inserted = True + + if not inserted: + editor.append_to(root, rsids_xml) + else: + # Check if this rsid already exists + rsids_elem = rsids_elements[0] + rsid_exists = any( + elem.getAttribute(f"{prefix}:val") == self.rsid + for elem in rsids_elem.getElementsByTagName(f"{prefix}:rsid") + ) + + if not rsid_exists: + rsid_xml = f'<{prefix}:rsid {prefix}:val="{self.rsid}"/>' + editor.append_to(rsids_elem, rsid_xml) + + # ==================== Private: XML File Creation ==================== + + def _add_to_comments_xml( + self, comment_id, para_id, text, author, initials, timestamp + ): + """Add a single comment to comments.xml.""" + if not self.comments_path.exists(): + shutil.copy(TEMPLATE_DIR / "comments.xml", self.comments_path) + + editor = self["word/comments.xml"] + root = editor.get_node(tag="w:comments") + + escaped_text = ( + text.replace("&", "&").replace("<", "<").replace(">", ">") + ) + # Note: w:rsidR, w:rsidRDefault, w:rsidP on w:p, w:rsidR on w:r, + # and w:author, w:date, w:initials on w:comment are automatically added by DocxXMLEditor + comment_xml = f''' + + + {escaped_text} + +''' + editor.append_to(root, comment_xml) + + def _add_to_comments_extended_xml(self, para_id, parent_para_id): + """Add a single comment to commentsExtended.xml.""" + if not self.comments_extended_path.exists(): + shutil.copy( + TEMPLATE_DIR / "commentsExtended.xml", self.comments_extended_path + ) + + editor = self["word/commentsExtended.xml"] + root = editor.get_node(tag="w15:commentsEx") + + if parent_para_id: + xml = f'' + else: + xml = f'' + editor.append_to(root, xml) + + def _add_to_comments_ids_xml(self, 
para_id, durable_id): + """Add a single comment to commentsIds.xml.""" + if not self.comments_ids_path.exists(): + shutil.copy(TEMPLATE_DIR / "commentsIds.xml", self.comments_ids_path) + + editor = self["word/commentsIds.xml"] + root = editor.get_node(tag="w16cid:commentsIds") + + xml = f'' + editor.append_to(root, xml) + + def _add_to_comments_extensible_xml(self, durable_id): + """Add a single comment to commentsExtensible.xml.""" + if not self.comments_extensible_path.exists(): + shutil.copy( + TEMPLATE_DIR / "commentsExtensible.xml", self.comments_extensible_path + ) + + editor = self["word/commentsExtensible.xml"] + root = editor.get_node(tag="w16cex:commentsExtensible") + + xml = f'' + editor.append_to(root, xml) + + # ==================== Private: XML Fragments ==================== + + def _comment_range_start_xml(self, comment_id): + """Generate XML for comment range start.""" + return f'' + + def _comment_range_end_xml(self, comment_id): + """Generate XML for comment range end with reference run. + + Note: w:rsidR is automatically added by DocxXMLEditor. + """ + return f''' + + + +''' + + def _comment_ref_run_xml(self, comment_id): + """Generate XML for comment reference run. + + Note: w:rsidR is automatically added by DocxXMLEditor. + """ + return f''' + + +''' + + # ==================== Private: Metadata Updates ==================== + + def _has_relationship(self, editor, target): + """Check if a relationship with given target exists.""" + for rel_elem in editor.dom.getElementsByTagName("Relationship"): + if rel_elem.getAttribute("Target") == target: + return True + return False + + def _has_override(self, editor, part_name): + """Check if an override with given part name exists.""" + for override_elem in editor.dom.getElementsByTagName("Override"): + if override_elem.getAttribute("PartName") == part_name: + return True + return False + + def _has_author(self, editor, author): + """Check if an author already exists in people.xml.""" + for person_elem in editor.dom.getElementsByTagName("w15:person"): + if person_elem.getAttribute("w15:author") == author: + return True + return False + + def _add_author_to_people(self, author): + """Add author to people.xml (called during initialization).""" + people_path = self.word_path / "people.xml" + + # people.xml should already exist from _setup_tracking + if not people_path.exists(): + raise ValueError("people.xml should exist after _setup_tracking") + + editor = self["word/people.xml"] + root = editor.get_node(tag="w15:people") + + # Check if author already exists + if self._has_author(editor, author): + return + + # Add author with proper XML escaping to prevent injection + escaped_author = html.escape(author, quote=True) + person_xml = f''' + +''' + editor.append_to(root, person_xml) + + def _ensure_comment_relationships(self): + """Ensure word/_rels/document.xml.rels has comment relationships.""" + editor = self["word/_rels/document.xml.rels"] + + if self._has_relationship(editor, "comments.xml"): + return + + root = editor.dom.documentElement + root_tag = root.tagName # type: ignore + prefix = root_tag.split(":")[0] + ":" if ":" in root_tag else "" + next_rid_num = int(editor.get_next_rid()[3:]) + + # Add relationship elements + rels = [ + ( + next_rid_num, + "http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments", + "comments.xml", + ), + ( + next_rid_num + 1, + "http://schemas.microsoft.com/office/2011/relationships/commentsExtended", + "commentsExtended.xml", + ), + ( + next_rid_num + 2, + 
"http://schemas.microsoft.com/office/2016/09/relationships/commentsIds", + "commentsIds.xml", + ), + ( + next_rid_num + 3, + "http://schemas.microsoft.com/office/2018/08/relationships/commentsExtensible", + "commentsExtensible.xml", + ), + ] + + for rel_id, rel_type, target in rels: + rel_xml = f'<{prefix}Relationship Id="rId{rel_id}" Type="{rel_type}" Target="{target}"/>' + editor.append_to(root, rel_xml) + + def _ensure_comment_content_types(self): + """Ensure [Content_Types].xml has comment content types.""" + editor = self["[Content_Types].xml"] + + if self._has_override(editor, "/word/comments.xml"): + return + + root = editor.dom.documentElement + + # Add Override elements + overrides = [ + ( + "/word/comments.xml", + "application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml", + ), + ( + "/word/commentsExtended.xml", + "application/vnd.openxmlformats-officedocument.wordprocessingml.commentsExtended+xml", + ), + ( + "/word/commentsIds.xml", + "application/vnd.openxmlformats-officedocument.wordprocessingml.commentsIds+xml", + ), + ( + "/word/commentsExtensible.xml", + "application/vnd.openxmlformats-officedocument.wordprocessingml.commentsExtensible+xml", + ), + ] + + for part_name, content_type in overrides: + override_xml = ( + f'' + ) + editor.append_to(root, override_xml) diff --git a/code_puppy/bundled_skills/Office/docx/scripts/templates/comments.xml b/code_puppy/bundled_skills/Office/docx/scripts/templates/comments.xml new file mode 100644 index 00000000..b5dace0e --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/scripts/templates/comments.xml @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsExtended.xml b/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsExtended.xml new file mode 100644 index 00000000..b4cf23e3 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsExtended.xml @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsExtensible.xml b/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsExtensible.xml new file mode 100644 index 00000000..e32a05e0 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsExtensible.xml @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsIds.xml b/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsIds.xml new file mode 100644 index 00000000..d04bc8e0 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/scripts/templates/commentsIds.xml @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/docx/scripts/templates/people.xml b/code_puppy/bundled_skills/Office/docx/scripts/templates/people.xml new file mode 100644 index 00000000..a839cafe --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/scripts/templates/people.xml @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/docx/scripts/utilities.py b/code_puppy/bundled_skills/Office/docx/scripts/utilities.py new file mode 100644 index 00000000..d92dae61 --- /dev/null +++ b/code_puppy/bundled_skills/Office/docx/scripts/utilities.py @@ -0,0 +1,374 @@ +#!/usr/bin/env python3 +""" +Utilities for editing OOXML documents. + +This module provides XMLEditor, a tool for manipulating XML files with support for +line-number-based node finding and DOM manipulation. 
Each element is automatically +annotated with its original line and column position during parsing. + +Example usage: + editor = XMLEditor("document.xml") + + # Find node by line number or range + elem = editor.get_node(tag="w:r", line_number=519) + elem = editor.get_node(tag="w:p", line_number=range(100, 200)) + + # Find node by text content + elem = editor.get_node(tag="w:p", contains="specific text") + + # Find node by attributes + elem = editor.get_node(tag="w:r", attrs={"w:id": "target"}) + + # Combine filters + elem = editor.get_node(tag="w:p", line_number=range(1, 50), contains="text") + + # Replace, insert, or manipulate + new_elem = editor.replace_node(elem, "new text") + editor.insert_after(new_elem, "more") + + # Save changes + editor.save() +""" + +import html +from pathlib import Path +from typing import Optional, Union + +import defusedxml.minidom +import defusedxml.sax + + +class XMLEditor: + """ + Editor for manipulating OOXML XML files with line-number-based node finding. + + This class parses XML files and tracks the original line and column position + of each element. This enables finding nodes by their line number in the original + file, which is useful when working with Read tool output. + + Attributes: + xml_path: Path to the XML file being edited + encoding: Detected encoding of the XML file ('ascii' or 'utf-8') + dom: Parsed DOM tree with parse_position attributes on elements + """ + + def __init__(self, xml_path): + """ + Initialize with path to XML file and parse with line number tracking. + + Args: + xml_path: Path to XML file to edit (str or Path) + + Raises: + ValueError: If the XML file does not exist + """ + self.xml_path = Path(xml_path) + if not self.xml_path.exists(): + raise ValueError(f"XML file not found: {xml_path}") + + with open(self.xml_path, "rb") as f: + header = f.read(200).decode("utf-8", errors="ignore") + self.encoding = "ascii" if 'encoding="ascii"' in header else "utf-8" + + parser = _create_line_tracking_parser() + self.dom = defusedxml.minidom.parse(str(self.xml_path), parser) + + def get_node( + self, + tag: str, + attrs: Optional[dict[str, str]] = None, + line_number: Optional[Union[int, range]] = None, + contains: Optional[str] = None, + ): + """ + Get a DOM element by tag and identifier. + + Finds an element by either its line number in the original file or by + matching attribute values. Exactly one match must be found. + + Args: + tag: The XML tag name (e.g., "w:del", "w:ins", "w:r") + attrs: Dictionary of attribute name-value pairs to match (e.g., {"w:id": "1"}) + line_number: Line number (int) or line range (range) in original XML file (1-indexed) + contains: Text string that must appear in any text node within the element. + Supports both entity notation (“) and Unicode characters (\u201c). 
+ + Returns: + defusedxml.minidom.Element: The matching DOM element + + Raises: + ValueError: If node not found or multiple matches found + + Example: + elem = editor.get_node(tag="w:r", line_number=519) + elem = editor.get_node(tag="w:r", line_number=range(100, 200)) + elem = editor.get_node(tag="w:del", attrs={"w:id": "1"}) + elem = editor.get_node(tag="w:p", attrs={"w14:paraId": "12345678"}) + elem = editor.get_node(tag="w:commentRangeStart", attrs={"w:id": "0"}) + elem = editor.get_node(tag="w:p", contains="specific text") + elem = editor.get_node(tag="w:t", contains="“Agreement") # Entity notation + elem = editor.get_node(tag="w:t", contains="\u201cAgreement") # Unicode character + """ + matches = [] + for elem in self.dom.getElementsByTagName(tag): + # Check line_number filter + if line_number is not None: + parse_pos = getattr(elem, "parse_position", (None,)) + elem_line = parse_pos[0] + + # Handle both single line number and range + if isinstance(line_number, range): + if elem_line not in line_number: + continue + else: + if elem_line != line_number: + continue + + # Check attrs filter + if attrs is not None: + if not all( + elem.getAttribute(attr_name) == attr_value + for attr_name, attr_value in attrs.items() + ): + continue + + # Check contains filter + if contains is not None: + elem_text = self._get_element_text(elem) + # Normalize the search string: convert HTML entities to Unicode characters + # This allows searching for both "“Rowan" and ""Rowan" + normalized_contains = html.unescape(contains) + if normalized_contains not in elem_text: + continue + + # If all applicable filters passed, this is a match + matches.append(elem) + + if not matches: + # Build descriptive error message + filters = [] + if line_number is not None: + line_str = ( + f"lines {line_number.start}-{line_number.stop - 1}" + if isinstance(line_number, range) + else f"line {line_number}" + ) + filters.append(f"at {line_str}") + if attrs is not None: + filters.append(f"with attributes {attrs}") + if contains is not None: + filters.append(f"containing '{contains}'") + + filter_desc = " ".join(filters) if filters else "" + base_msg = f"Node not found: <{tag}> {filter_desc}".strip() + + # Add helpful hint based on filters used + if contains: + hint = "Text may be split across elements or use different wording." + elif line_number: + hint = "Line numbers may have changed if document was modified." + elif attrs: + hint = "Verify attribute values are correct." + else: + hint = "Try adding filters (attrs, line_number, or contains)." + + raise ValueError(f"{base_msg}. {hint}") + if len(matches) > 1: + raise ValueError( + f"Multiple nodes found: <{tag}>. " + f"Add more filters (attrs, line_number, or contains) to narrow the search." + ) + return matches[0] + + def _get_element_text(self, elem): + """ + Recursively extract all text content from an element. + + Skips text nodes that contain only whitespace (spaces, tabs, newlines), + which typically represent XML formatting rather than document content. 
+ + Args: + elem: defusedxml.minidom.Element to extract text from + + Returns: + str: Concatenated text from all non-whitespace text nodes within the element + """ + text_parts = [] + for node in elem.childNodes: + if node.nodeType == node.TEXT_NODE: + # Skip whitespace-only text nodes (XML formatting) + if node.data.strip(): + text_parts.append(node.data) + elif node.nodeType == node.ELEMENT_NODE: + text_parts.append(self._get_element_text(node)) + return "".join(text_parts) + + def replace_node(self, elem, new_content): + """ + Replace a DOM element with new XML content. + + Args: + elem: defusedxml.minidom.Element to replace + new_content: String containing XML to replace the node with + + Returns: + List[defusedxml.minidom.Node]: All inserted nodes + + Example: + new_nodes = editor.replace_node(old_elem, "text") + """ + parent = elem.parentNode + nodes = self._parse_fragment(new_content) + for node in nodes: + parent.insertBefore(node, elem) + parent.removeChild(elem) + return nodes + + def insert_after(self, elem, xml_content): + """ + Insert XML content after a DOM element. + + Args: + elem: defusedxml.minidom.Element to insert after + xml_content: String containing XML to insert + + Returns: + List[defusedxml.minidom.Node]: All inserted nodes + + Example: + new_nodes = editor.insert_after(elem, "text") + """ + parent = elem.parentNode + next_sibling = elem.nextSibling + nodes = self._parse_fragment(xml_content) + for node in nodes: + if next_sibling: + parent.insertBefore(node, next_sibling) + else: + parent.appendChild(node) + return nodes + + def insert_before(self, elem, xml_content): + """ + Insert XML content before a DOM element. + + Args: + elem: defusedxml.minidom.Element to insert before + xml_content: String containing XML to insert + + Returns: + List[defusedxml.minidom.Node]: All inserted nodes + + Example: + new_nodes = editor.insert_before(elem, "text") + """ + parent = elem.parentNode + nodes = self._parse_fragment(xml_content) + for node in nodes: + parent.insertBefore(node, elem) + return nodes + + def append_to(self, elem, xml_content): + """ + Append XML content as a child of a DOM element. + + Args: + elem: defusedxml.minidom.Element to append to + xml_content: String containing XML to append + + Returns: + List[defusedxml.minidom.Node]: All inserted nodes + + Example: + new_nodes = editor.append_to(elem, "text") + """ + nodes = self._parse_fragment(xml_content) + for node in nodes: + elem.appendChild(node) + return nodes + + def get_next_rid(self): + """Get the next available rId for relationships files.""" + max_id = 0 + for rel_elem in self.dom.getElementsByTagName("Relationship"): + rel_id = rel_elem.getAttribute("Id") + if rel_id.startswith("rId"): + try: + max_id = max(max_id, int(rel_id[3:])) + except ValueError: + pass + return f"rId{max_id + 1}" + + def save(self): + """ + Save the edited XML back to the file. + + Serializes the DOM tree and writes it back to the original file path, + preserving the original encoding (ascii or utf-8). + """ + content = self.dom.toxml(encoding=self.encoding) + self.xml_path.write_bytes(content) + + def _parse_fragment(self, xml_content): + """ + Parse XML fragment and return list of imported nodes. 
+ + Args: + xml_content: String containing XML fragment + + Returns: + List of defusedxml.minidom.Node objects imported into this document + + Raises: + AssertionError: If fragment contains no element nodes + """ + # Extract namespace declarations from the root document element + root_elem = self.dom.documentElement + namespaces = [] + if root_elem and root_elem.attributes: + for i in range(root_elem.attributes.length): + attr = root_elem.attributes.item(i) + if attr.name.startswith("xmlns"): # type: ignore + namespaces.append(f'{attr.name}="{attr.value}"') # type: ignore + + ns_decl = " ".join(namespaces) + wrapper = f"{xml_content}" + fragment_doc = defusedxml.minidom.parseString(wrapper) + nodes = [ + self.dom.importNode(child, deep=True) + for child in fragment_doc.documentElement.childNodes # type: ignore + ] + elements = [n for n in nodes if n.nodeType == n.ELEMENT_NODE] + assert elements, "Fragment must contain at least one element" + return nodes + + +def _create_line_tracking_parser(): + """ + Create a SAX parser that tracks line and column numbers for each element. + + Monkey patches the SAX content handler to store the current line and column + position from the underlying expat parser onto each element as a parse_position + attribute (line, column) tuple. + + Returns: + defusedxml.sax.xmlreader.XMLReader: Configured SAX parser + """ + + def set_content_handler(dom_handler): + def startElementNS(name, tagName, attrs): + orig_start_cb(name, tagName, attrs) + cur_elem = dom_handler.elementStack[-1] + cur_elem.parse_position = ( + parser._parser.CurrentLineNumber, # type: ignore + parser._parser.CurrentColumnNumber, # type: ignore + ) + + orig_start_cb = dom_handler.startElementNS + dom_handler.startElementNS = startElementNS + orig_set_content_handler(dom_handler) + + parser = defusedxml.sax.make_parser() + orig_set_content_handler = parser.setContentHandler + parser.setContentHandler = set_content_handler # type: ignore + return parser diff --git a/code_puppy/bundled_skills/Office/frontend-design/SKILL.md b/code_puppy/bundled_skills/Office/frontend-design/SKILL.md new file mode 100644 index 00000000..d693c397 --- /dev/null +++ b/code_puppy/bundled_skills/Office/frontend-design/SKILL.md @@ -0,0 +1,42 @@ +--- +name: frontend-design +description: Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, artifacts, posters, or applications (examples include websites, landing pages, dashboards, React components, HTML/CSS layouts, or when styling/beautifying any web UI). Generates creative, polished code and UI design that avoids generic AI aesthetics. +license: Complete terms in LICENSE.txt +--- + +This skill guides creation of distinctive, production-grade frontend interfaces that avoid generic "AI slop" aesthetics. Implement real working code with exceptional attention to aesthetic details and creative choices. + +The user provides frontend requirements: a component, page, application, or interface to build. They may include context about the purpose, audience, or technical constraints. + +## Design Thinking + +Before coding, understand the context and commit to a BOLD aesthetic direction: +- **Purpose**: What problem does this interface solve? Who uses it? +- **Tone**: Pick an extreme: brutally minimal, maximalist chaos, retro-futuristic, organic/natural, luxury/refined, playful/toy-like, editorial/magazine, brutalist/raw, art deco/geometric, soft/pastel, industrial/utilitarian, etc. 
There are so many flavors to choose from. Use these for inspiration but design one that is true to the aesthetic direction. +- **Constraints**: Technical requirements (framework, performance, accessibility). +- **Differentiation**: What makes this UNFORGETTABLE? What's the one thing someone will remember? + +**CRITICAL**: Choose a clear conceptual direction and execute it with precision. Bold maximalism and refined minimalism both work - the key is intentionality, not intensity. + +Then implement working code (HTML/CSS/JS, React, Vue, etc.) that is: +- Production-grade and functional +- Visually striking and memorable +- Cohesive with a clear aesthetic point-of-view +- Meticulously refined in every detail + +## Frontend Aesthetics Guidelines + +Focus on: +- **Typography**: Choose fonts that are beautiful, unique, and interesting. Avoid generic fonts like Arial and Inter; opt instead for distinctive choices that elevate the frontend's aesthetics; unexpected, characterful font choices. Pair a distinctive display font with a refined body font. +- **Color & Theme**: Commit to a cohesive aesthetic. Use CSS variables for consistency. Dominant colors with sharp accents outperform timid, evenly-distributed palettes. +- **Motion**: Use animations for effects and micro-interactions. Prioritize CSS-only solutions for HTML. Use Motion library for React when available. Focus on high-impact moments: one well-orchestrated page load with staggered reveals (animation-delay) creates more delight than scattered micro-interactions. Use scroll-triggering and hover states that surprise. +- **Spatial Composition**: Unexpected layouts. Asymmetry. Overlap. Diagonal flow. Grid-breaking elements. Generous negative space OR controlled density. +- **Backgrounds & Visual Details**: Create atmosphere and depth rather than defaulting to solid colors. Add contextual effects and textures that match the overall aesthetic. Apply creative forms like gradient meshes, noise textures, geometric patterns, layered transparencies, dramatic shadows, decorative borders, custom cursors, and grain overlays. + +NEVER use generic AI-generated aesthetics like overused font families (Inter, Roboto, Arial, system fonts), cliched color schemes (particularly purple gradients on white backgrounds), predictable layouts and component patterns, and cookie-cutter design that lacks context-specific character. + +Interpret creatively and make unexpected choices that feel genuinely designed for the context. No design should be the same. Vary between light and dark themes, different fonts, different aesthetics. NEVER converge on common choices (Space Grotesk, for example) across generations. + +**IMPORTANT**: Match implementation complexity to the aesthetic vision. Maximalist designs need elaborate code with extensive animations and effects. Minimalist or refined designs need restraint, precision, and careful attention to spacing, typography, and subtle details. Elegance comes from executing the vision well. + +Remember: Ticca is capable of extraordinary creative work. Don't hold back, show what can truly be created when thinking outside the box and committing fully to a distinctive vision. diff --git a/code_puppy/bundled_skills/Office/pdf/FORMS.md b/code_puppy/bundled_skills/Office/pdf/FORMS.md new file mode 100644 index 00000000..4e234506 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/FORMS.md @@ -0,0 +1,205 @@ +**CRITICAL: You MUST complete these steps in order. 
Do not skip ahead to writing code.** + +If you need to fill out a PDF form, first check to see if the PDF has fillable form fields. Run this script from this file's directory: + `python scripts/check_fillable_fields `, and depending on the result go to either the "Fillable fields" or "Non-fillable fields" and follow those instructions. + +# Fillable fields +If the PDF has fillable form fields: +- Run this script from this file's directory: `python scripts/extract_form_field_info.py `. It will create a JSON file with a list of fields in this format: +``` +[ + { + "field_id": (unique ID for the field), + "page": (page number, 1-based), + "rect": ([left, bottom, right, top] bounding box in PDF coordinates, y=0 is the bottom of the page), + "type": ("text", "checkbox", "radio_group", or "choice"), + }, + // Checkboxes have "checked_value" and "unchecked_value" properties: + { + "field_id": (unique ID for the field), + "page": (page number, 1-based), + "type": "checkbox", + "checked_value": (Set the field to this value to check the checkbox), + "unchecked_value": (Set the field to this value to uncheck the checkbox), + }, + // Radio groups have a "radio_options" list with the possible choices. + { + "field_id": (unique ID for the field), + "page": (page number, 1-based), + "type": "radio_group", + "radio_options": [ + { + "value": (set the field to this value to select this radio option), + "rect": (bounding box for the radio button for this option) + }, + // Other radio options + ] + }, + // Multiple choice fields have a "choice_options" list with the possible choices: + { + "field_id": (unique ID for the field), + "page": (page number, 1-based), + "type": "choice", + "choice_options": [ + { + "value": (set the field to this value to select this option), + "text": (display text of the option) + }, + // Other choice options + ], + } +] +``` +- Convert the PDF to PNGs (one image for each page) with this script (run from this file's directory): +`python scripts/convert_pdf_to_images.py ` +Then analyze the images to determine the purpose of each form field (make sure to convert the bounding box PDF coordinates to image coordinates). +- Create a `field_values.json` file in this format with the values to be entered for each field: +``` +[ + { + "field_id": "last_name", // Must match the field_id from `extract_form_field_info.py` + "description": "The user's last name", + "page": 1, // Must match the "page" value in field_info.json + "value": "Simpson" + }, + { + "field_id": "Checkbox12", + "description": "Checkbox to be checked if the user is 18 or over", + "page": 1, + "value": "/On" // If this is a checkbox, use its "checked_value" value to check it. If it's a radio button group, use one of the "value" values in "radio_options". + }, + // more fields +] +``` +- Run the `fill_fillable_fields.py` script from this file's directory to create a filled-in PDF: +`python scripts/fill_fillable_fields.py ` +This script will verify that the field IDs and values you provide are valid; if it prints error messages, correct the appropriate fields and try again. + +# Non-fillable fields +If the PDF doesn't have fillable form fields, you'll need to visually determine where the data should be added and create text annotations. Follow the below steps *exactly*. You MUST perform all of these steps to ensure that the the form is accurately completed. Details for each step are below. +- Convert the PDF to PNG images and determine field bounding boxes. 
+- Create a JSON file with field information and validation images showing the bounding boxes. +- Validate the the bounding boxes. +- Use the bounding boxes to fill in the form. + +## Step 1: Visual Analysis (REQUIRED) +- Convert the PDF to PNG images. Run this script from this file's directory: +`python scripts/convert_pdf_to_images.py ` +The script will create a PNG image for each page in the PDF. +- Carefully examine each PNG image and identify all form fields and areas where the user should enter data. For each form field where the user should enter text, determine bounding boxes for both the form field label, and the area where the user should enter text. The label and entry bounding boxes MUST NOT INTERSECT; the text entry box should only include the area where data should be entered. Usually this area will be immediately to the side, above, or below its label. Entry bounding boxes must be tall and wide enough to contain their text. + +These are some examples of form structures that you might see: + +*Label inside box* +``` +┌────────────────────────┐ +│ Name: │ +└────────────────────────┘ +``` +The input area should be to the right of the "Name" label and extend to the edge of the box. + +*Label before line* +``` +Email: _______________________ +``` +The input area should be above the line and include its entire width. + +*Label under line* +``` +_________________________ +Name +``` +The input area should be above the line and include the entire width of the line. This is common for signature and date fields. + +*Label above line* +``` +Please enter any special requests: +________________________________________________ +``` +The input area should extend from the bottom of the label to the line, and should include the entire width of the line. + +*Checkboxes* +``` +Are you a US citizen? Yes □ No □ +``` +For checkboxes: +- Look for small square boxes (□) - these are the actual checkboxes to target. They may be to the left or right of their labels. +- Distinguish between label text ("Yes", "No") and the clickable checkbox squares. +- The entry bounding box should cover ONLY the small square, not the text label. + +### Step 2: Create fields.json and validation images (REQUIRED) +- Create a file named `fields.json` with information for the form fields and bounding boxes in this format: +``` +{ + "pages": [ + { + "page_number": 1, + "image_width": (first page image width in pixels), + "image_height": (first page image height in pixels), + }, + { + "page_number": 2, + "image_width": (second page image width in pixels), + "image_height": (second page image height in pixels), + } + // additional pages + ], + "form_fields": [ + // Example for a text field. + { + "page_number": 1, + "description": "The user's last name should be entered here", + // Bounding boxes are [left, top, right, bottom]. The bounding boxes for the label and text entry should not overlap. + "field_label": "Last name", + "label_bounding_box": [30, 125, 95, 142], + "entry_bounding_box": [100, 125, 280, 142], + "entry_text": { + "text": "Johnson", // This text will be added as an annotation at the entry_bounding_box location + "font_size": 14, // optional, defaults to 14 + "font_color": "000000", // optional, RRGGBB format, defaults to 000000 (black) + } + }, + // Example for a checkbox. 
TARGET THE SQUARE for the entry bounding box, NOT THE TEXT
+    {
+      "page_number": 2,
+      "description": "Checkbox that should be checked if the user is over 18",
+      "entry_bounding_box": [140, 525, 155, 540], // Small box over checkbox square
+      "field_label": "Yes",
+      "label_bounding_box": [100, 525, 132, 540], // Box containing "Yes" text
+      // Use "X" to check a checkbox.
+      "entry_text": {
+        "text": "X",
+      }
+    }
+    // additional form field entries
+  ]
+}
+```
+
+Create validation images by running this script from this file's directory for each page:
+`python scripts/create_validation_image.py [page number] [fields.json file] [input image path] [output image path]`
+
+The validation images will have red rectangles where text should be entered, and blue rectangles covering label text.
+
+### Step 3: Validate Bounding Boxes (REQUIRED)
+#### Automated intersection check
+- Verify that none of the bounding boxes intersect and that the entry bounding boxes are tall enough by checking the fields.json file with the `check_bounding_boxes.py` script (run from this file's directory):
+`python scripts/check_bounding_boxes.py [fields.json]`
+
+If there are errors, reanalyze the relevant fields, adjust the bounding boxes, and iterate until there are no remaining errors. Remember: label (blue) bounding boxes should contain text labels, entry (red) boxes should not.
+
+#### Manual image inspection
+**CRITICAL: Do not proceed without visually inspecting validation images**
+- Red rectangles must ONLY cover input areas
+- Red rectangles MUST NOT contain any text
+- Blue rectangles should contain label text
+- For checkboxes:
+  - Red rectangle MUST be centered on the checkbox square
+  - Blue rectangle should cover the text label for the checkbox
+
+- If any rectangles look wrong, fix fields.json, regenerate the validation images, and verify again. Repeat this process until the bounding boxes are fully accurate.
+
+
+### Step 4: Add annotations to the PDF
+Run this script from this file's directory to create a filled-out PDF using the information in fields.json:
+`python scripts/fill_pdf_form_with_annotations.py [input pdf] [fields.json] [output pdf]`
diff --git a/code_puppy/bundled_skills/Office/pdf/REFERENCE.md b/code_puppy/bundled_skills/Office/pdf/REFERENCE.md
new file mode 100644
index 00000000..41400bf4
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pdf/REFERENCE.md
@@ -0,0 +1,612 @@
+# PDF Processing Advanced Reference
+
+This document contains advanced PDF processing features, detailed examples, and additional libraries not covered in the main skill instructions.
+
+## pypdfium2 Library (Apache/BSD License)
+
+### Overview
+pypdfium2 is a Python binding for PDFium (Chromium's PDF library). It is excellent for fast PDF rendering and image generation, and it serves as a PyMuPDF replacement.
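+
+If pypdfium2 is not already installed, it can typically be added from PyPI (the wheel bundles the PDFium binary, so no separate system dependency should be needed):
+
+```bash
+pip install pypdfium2
+```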
+ +### Render PDF to Images +```python +import pypdfium2 as pdfium +from PIL import Image + +# Load PDF +pdf = pdfium.PdfDocument("document.pdf") + +# Render page to image +page = pdf[0] # First page +bitmap = page.render( + scale=2.0, # Higher resolution + rotation=0 # No rotation +) + +# Convert to PIL Image +img = bitmap.to_pil() +img.save("page_1.png", "PNG") + +# Process multiple pages +for i, page in enumerate(pdf): + bitmap = page.render(scale=1.5) + img = bitmap.to_pil() + img.save(f"page_{i+1}.jpg", "JPEG", quality=90) +``` + +### Extract Text with pypdfium2 +```python +import pypdfium2 as pdfium + +pdf = pdfium.PdfDocument("document.pdf") +for i, page in enumerate(pdf): + text = page.get_text() + print(f"Page {i+1} text length: {len(text)} chars") +``` + +## JavaScript Libraries + +### pdf-lib (MIT License) + +pdf-lib is a powerful JavaScript library for creating and modifying PDF documents in any JavaScript environment. + +#### Load and Manipulate Existing PDF +```javascript +import { PDFDocument } from 'pdf-lib'; +import fs from 'fs'; + +async function manipulatePDF() { + // Load existing PDF + const existingPdfBytes = fs.readFileSync('input.pdf'); + const pdfDoc = await PDFDocument.load(existingPdfBytes); + + // Get page count + const pageCount = pdfDoc.getPageCount(); + console.log(`Document has ${pageCount} pages`); + + // Add new page + const newPage = pdfDoc.addPage([600, 400]); + newPage.drawText('Added by pdf-lib', { + x: 100, + y: 300, + size: 16 + }); + + // Save modified PDF + const pdfBytes = await pdfDoc.save(); + fs.writeFileSync('modified.pdf', pdfBytes); +} +``` + +#### Create Complex PDFs from Scratch +```javascript +import { PDFDocument, rgb, StandardFonts } from 'pdf-lib'; +import fs from 'fs'; + +async function createPDF() { + const pdfDoc = await PDFDocument.create(); + + // Add fonts + const helveticaFont = await pdfDoc.embedFont(StandardFonts.Helvetica); + const helveticaBold = await pdfDoc.embedFont(StandardFonts.HelveticaBold); + + // Add page + const page = pdfDoc.addPage([595, 842]); // A4 size + const { width, height } = page.getSize(); + + // Add text with styling + page.drawText('Invoice #12345', { + x: 50, + y: height - 50, + size: 18, + font: helveticaBold, + color: rgb(0.2, 0.2, 0.8) + }); + + // Add rectangle (header background) + page.drawRectangle({ + x: 40, + y: height - 100, + width: width - 80, + height: 30, + color: rgb(0.9, 0.9, 0.9) + }); + + // Add table-like content + const items = [ + ['Item', 'Qty', 'Price', 'Total'], + ['Widget', '2', '$50', '$100'], + ['Gadget', '1', '$75', '$75'] + ]; + + let yPos = height - 150; + items.forEach(row => { + let xPos = 50; + row.forEach(cell => { + page.drawText(cell, { + x: xPos, + y: yPos, + size: 12, + font: helveticaFont + }); + xPos += 120; + }); + yPos -= 25; + }); + + const pdfBytes = await pdfDoc.save(); + fs.writeFileSync('created.pdf', pdfBytes); +} +``` + +#### Advanced Merge and Split Operations +```javascript +import { PDFDocument } from 'pdf-lib'; +import fs from 'fs'; + +async function mergePDFs() { + // Create new document + const mergedPdf = await PDFDocument.create(); + + // Load source PDFs + const pdf1Bytes = fs.readFileSync('doc1.pdf'); + const pdf2Bytes = fs.readFileSync('doc2.pdf'); + + const pdf1 = await PDFDocument.load(pdf1Bytes); + const pdf2 = await PDFDocument.load(pdf2Bytes); + + // Copy pages from first PDF + const pdf1Pages = await mergedPdf.copyPages(pdf1, pdf1.getPageIndices()); + pdf1Pages.forEach(page => mergedPdf.addPage(page)); + + // Copy specific pages from 
second PDF (pages 0, 2, 4) + const pdf2Pages = await mergedPdf.copyPages(pdf2, [0, 2, 4]); + pdf2Pages.forEach(page => mergedPdf.addPage(page)); + + const mergedPdfBytes = await mergedPdf.save(); + fs.writeFileSync('merged.pdf', mergedPdfBytes); +} +``` + +### pdfjs-dist (Apache License) + +PDF.js is Mozilla's JavaScript library for rendering PDFs in the browser. + +#### Basic PDF Loading and Rendering +```javascript +import * as pdfjsLib from 'pdfjs-dist'; + +// Configure worker (important for performance) +pdfjsLib.GlobalWorkerOptions.workerSrc = './pdf.worker.js'; + +async function renderPDF() { + // Load PDF + const loadingTask = pdfjsLib.getDocument('document.pdf'); + const pdf = await loadingTask.promise; + + console.log(`Loaded PDF with ${pdf.numPages} pages`); + + // Get first page + const page = await pdf.getPage(1); + const viewport = page.getViewport({ scale: 1.5 }); + + // Render to canvas + const canvas = document.createElement('canvas'); + const context = canvas.getContext('2d'); + canvas.height = viewport.height; + canvas.width = viewport.width; + + const renderContext = { + canvasContext: context, + viewport: viewport + }; + + await page.render(renderContext).promise; + document.body.appendChild(canvas); +} +``` + +#### Extract Text with Coordinates +```javascript +import * as pdfjsLib from 'pdfjs-dist'; + +async function extractText() { + const loadingTask = pdfjsLib.getDocument('document.pdf'); + const pdf = await loadingTask.promise; + + let fullText = ''; + + // Extract text from all pages + for (let i = 1; i <= pdf.numPages; i++) { + const page = await pdf.getPage(i); + const textContent = await page.getTextContent(); + + const pageText = textContent.items + .map(item => item.str) + .join(' '); + + fullText += `\n--- Page ${i} ---\n${pageText}`; + + // Get text with coordinates for advanced processing + const textWithCoords = textContent.items.map(item => ({ + text: item.str, + x: item.transform[4], + y: item.transform[5], + width: item.width, + height: item.height + })); + } + + console.log(fullText); + return fullText; +} +``` + +#### Extract Annotations and Forms +```javascript +import * as pdfjsLib from 'pdfjs-dist'; + +async function extractAnnotations() { + const loadingTask = pdfjsLib.getDocument('annotated.pdf'); + const pdf = await loadingTask.promise; + + for (let i = 1; i <= pdf.numPages; i++) { + const page = await pdf.getPage(i); + const annotations = await page.getAnnotations(); + + annotations.forEach(annotation => { + console.log(`Annotation type: ${annotation.subtype}`); + console.log(`Content: ${annotation.contents}`); + console.log(`Coordinates: ${JSON.stringify(annotation.rect)}`); + }); + } +} +``` + +## Advanced Command-Line Operations + +### poppler-utils Advanced Features + +#### Extract Text with Bounding Box Coordinates +```bash +# Extract text with bounding box coordinates (essential for structured data) +pdftotext -bbox-layout document.pdf output.xml + +# The XML output contains precise coordinates for each text element +``` + +#### Advanced Image Conversion +```bash +# Convert to PNG images with specific resolution +pdftoppm -png -r 300 document.pdf output_prefix + +# Convert specific page range with high resolution +pdftoppm -png -r 600 -f 1 -l 3 document.pdf high_res_pages + +# Convert to JPEG with quality setting +pdftoppm -jpeg -jpegopt quality=85 -r 200 document.pdf jpeg_output +``` + +#### Extract Embedded Images +```bash +# Extract all embedded images with metadata +pdfimages -j -p document.pdf page_images + +# List image info 
without extracting +pdfimages -list document.pdf + +# Extract images in their original format +pdfimages -all document.pdf images/img +``` + +### qpdf Advanced Features + +#### Complex Page Manipulation +```bash +# Split PDF into groups of pages +qpdf --split-pages=3 input.pdf output_group_%02d.pdf + +# Extract specific pages with complex ranges +qpdf input.pdf --pages input.pdf 1,3-5,8,10-end -- extracted.pdf + +# Merge specific pages from multiple PDFs +qpdf --empty --pages doc1.pdf 1-3 doc2.pdf 5-7 doc3.pdf 2,4 -- combined.pdf +``` + +#### PDF Optimization and Repair +```bash +# Optimize PDF for web (linearize for streaming) +qpdf --linearize input.pdf optimized.pdf + +# Remove unused objects and compress +qpdf --optimize-level=all input.pdf compressed.pdf + +# Attempt to repair corrupted PDF structure +qpdf --check input.pdf +qpdf --fix-qdf damaged.pdf repaired.pdf + +# Show detailed PDF structure for debugging +qpdf --show-all-pages input.pdf > structure.txt +``` + +#### Advanced Encryption +```bash +# Add password protection with specific permissions +qpdf --encrypt user_pass owner_pass 256 --print=none --modify=none -- input.pdf encrypted.pdf + +# Check encryption status +qpdf --show-encryption encrypted.pdf + +# Remove password protection (requires password) +qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf +``` + +## Advanced Python Techniques + +### pdfplumber Advanced Features + +#### Extract Text with Precise Coordinates +```python +import pdfplumber + +with pdfplumber.open("document.pdf") as pdf: + page = pdf.pages[0] + + # Extract all text with coordinates + chars = page.chars + for char in chars[:10]: # First 10 characters + print(f"Char: '{char['text']}' at x:{char['x0']:.1f} y:{char['y0']:.1f}") + + # Extract text by bounding box (left, top, right, bottom) + bbox_text = page.within_bbox((100, 100, 400, 200)).extract_text() +``` + +#### Advanced Table Extraction with Custom Settings +```python +import pdfplumber +import pandas as pd + +with pdfplumber.open("complex_table.pdf") as pdf: + page = pdf.pages[0] + + # Extract tables with custom settings for complex layouts + table_settings = { + "vertical_strategy": "lines", + "horizontal_strategy": "lines", + "snap_tolerance": 3, + "intersection_tolerance": 15 + } + tables = page.extract_tables(table_settings) + + # Visual debugging for table extraction + img = page.to_image(resolution=150) + img.save("debug_layout.png") +``` + +### reportlab Advanced Features + +#### Create Professional Reports with Tables +```python +from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph +from reportlab.lib.styles import getSampleStyleSheet +from reportlab.lib import colors + +# Sample data +data = [ + ['Product', 'Q1', 'Q2', 'Q3', 'Q4'], + ['Widgets', '120', '135', '142', '158'], + ['Gadgets', '85', '92', '98', '105'] +] + +# Create PDF with table +doc = SimpleDocTemplate("report.pdf") +elements = [] + +# Add title +styles = getSampleStyleSheet() +title = Paragraph("Quarterly Sales Report", styles['Title']) +elements.append(title) + +# Add table with advanced styling +table = Table(data) +table.setStyle(TableStyle([ + ('BACKGROUND', (0, 0), (-1, 0), colors.grey), + ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke), + ('ALIGN', (0, 0), (-1, -1), 'CENTER'), + ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'), + ('FONTSIZE', (0, 0), (-1, 0), 14), + ('BOTTOMPADDING', (0, 0), (-1, 0), 12), + ('BACKGROUND', (0, 1), (-1, -1), colors.beige), + ('GRID', (0, 0), (-1, -1), 1, colors.black) +])) 
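+
+# Append the styled table to the flowable list; doc.build() below lays out the
+# flowables and handles pagination automatically.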
+elements.append(table) + +doc.build(elements) +``` + +## Complex Workflows + +### Extract Figures/Images from PDF + +#### Method 1: Using pdfimages (fastest) +```bash +# Extract all images with original quality +pdfimages -all document.pdf images/img +``` + +#### Method 2: Using pypdfium2 + Image Processing +```python +import pypdfium2 as pdfium +from PIL import Image +import numpy as np + +def extract_figures(pdf_path, output_dir): + pdf = pdfium.PdfDocument(pdf_path) + + for page_num, page in enumerate(pdf): + # Render high-resolution page + bitmap = page.render(scale=3.0) + img = bitmap.to_pil() + + # Convert to numpy for processing + img_array = np.array(img) + + # Simple figure detection (non-white regions) + mask = np.any(img_array != [255, 255, 255], axis=2) + + # Find contours and extract bounding boxes + # (This is simplified - real implementation would need more sophisticated detection) + + # Save detected figures + # ... implementation depends on specific needs +``` + +### Batch PDF Processing with Error Handling +```python +import os +import glob +from pypdf import PdfReader, PdfWriter +import logging + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +def batch_process_pdfs(input_dir, operation='merge'): + pdf_files = glob.glob(os.path.join(input_dir, "*.pdf")) + + if operation == 'merge': + writer = PdfWriter() + for pdf_file in pdf_files: + try: + reader = PdfReader(pdf_file) + for page in reader.pages: + writer.add_page(page) + logger.info(f"Processed: {pdf_file}") + except Exception as e: + logger.error(f"Failed to process {pdf_file}: {e}") + continue + + with open("batch_merged.pdf", "wb") as output: + writer.write(output) + + elif operation == 'extract_text': + for pdf_file in pdf_files: + try: + reader = PdfReader(pdf_file) + text = "" + for page in reader.pages: + text += page.extract_text() + + output_file = pdf_file.replace('.pdf', '.txt') + with open(output_file, 'w', encoding='utf-8') as f: + f.write(text) + logger.info(f"Extracted text from: {pdf_file}") + + except Exception as e: + logger.error(f"Failed to extract text from {pdf_file}: {e}") + continue +``` + +### Advanced PDF Cropping +```python +from pypdf import PdfWriter, PdfReader + +reader = PdfReader("input.pdf") +writer = PdfWriter() + +# Crop page (left, bottom, right, top in points) +page = reader.pages[0] +page.mediabox.left = 50 +page.mediabox.bottom = 50 +page.mediabox.right = 550 +page.mediabox.top = 750 + +writer.add_page(page) +with open("cropped.pdf", "wb") as output: + writer.write(output) +``` + +## Performance Optimization Tips + +### 1. For Large PDFs +- Use streaming approaches instead of loading entire PDF in memory +- Use `qpdf --split-pages` for splitting large files +- Process pages individually with pypdfium2 + +### 2. For Text Extraction +- `pdftotext -bbox-layout` is fastest for plain text extraction +- Use pdfplumber for structured data and tables +- Avoid `pypdf.extract_text()` for very large documents + +### 3. For Image Extraction +- `pdfimages` is much faster than rendering pages +- Use low resolution for previews, high resolution for final output + +### 4. For Form Filling +- pdf-lib maintains form structure better than most alternatives +- Pre-validate form fields before processing + +### 5. 
Memory Management +```python +# Process PDFs in chunks +def process_large_pdf(pdf_path, chunk_size=10): + reader = PdfReader(pdf_path) + total_pages = len(reader.pages) + + for start_idx in range(0, total_pages, chunk_size): + end_idx = min(start_idx + chunk_size, total_pages) + writer = PdfWriter() + + for i in range(start_idx, end_idx): + writer.add_page(reader.pages[i]) + + # Process chunk + with open(f"chunk_{start_idx//chunk_size}.pdf", "wb") as output: + writer.write(output) +``` + +## Troubleshooting Common Issues + +### Encrypted PDFs +```python +# Handle password-protected PDFs +from pypdf import PdfReader + +try: + reader = PdfReader("encrypted.pdf") + if reader.is_encrypted: + reader.decrypt("password") +except Exception as e: + print(f"Failed to decrypt: {e}") +``` + +### Corrupted PDFs +```bash +# Use qpdf to repair +qpdf --check corrupted.pdf +qpdf --replace-input corrupted.pdf +``` + +### Text Extraction Issues +```python +# Fallback to OCR for scanned PDFs +import pytesseract +from pdf2image import convert_from_path + +def extract_text_with_ocr(pdf_path): + images = convert_from_path(pdf_path) + text = "" + for i, image in enumerate(images): + text += pytesseract.image_to_string(image) + return text +``` + +## License Information + +- **pypdf**: BSD License +- **pdfplumber**: MIT License +- **pypdfium2**: Apache/BSD License +- **reportlab**: BSD License +- **poppler-utils**: GPL-2 License +- **qpdf**: Apache License +- **pdf-lib**: MIT License +- **pdfjs-dist**: Apache License \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/pdf/SKILL.md b/code_puppy/bundled_skills/Office/pdf/SKILL.md new file mode 100644 index 00000000..9f312fc0 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/SKILL.md @@ -0,0 +1,294 @@ +--- +name: pdf +description: Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When Ticca needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale. +license: Proprietary. LICENSE.txt has complete terms +--- + +# PDF Processing Guide + +## Overview + +This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions. 
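+
+The Python examples below assume the libraries they use are already installed; if not, they can typically be installed from PyPI under the same names:
+
+```bash
+pip install pypdf pdfplumber reportlab
+```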
+ +## Quick Start + +```python +from pypdf import PdfReader, PdfWriter + +# Read a PDF +reader = PdfReader("document.pdf") +print(f"Pages: {len(reader.pages)}") + +# Extract text +text = "" +for page in reader.pages: + text += page.extract_text() +``` + +## Python Libraries + +### pypdf - Basic Operations + +#### Merge PDFs +```python +from pypdf import PdfWriter, PdfReader + +writer = PdfWriter() +for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]: + reader = PdfReader(pdf_file) + for page in reader.pages: + writer.add_page(page) + +with open("merged.pdf", "wb") as output: + writer.write(output) +``` + +#### Split PDF +```python +reader = PdfReader("input.pdf") +for i, page in enumerate(reader.pages): + writer = PdfWriter() + writer.add_page(page) + with open(f"page_{i+1}.pdf", "wb") as output: + writer.write(output) +``` + +#### Extract Metadata +```python +reader = PdfReader("document.pdf") +meta = reader.metadata +print(f"Title: {meta.title}") +print(f"Author: {meta.author}") +print(f"Subject: {meta.subject}") +print(f"Creator: {meta.creator}") +``` + +#### Rotate Pages +```python +reader = PdfReader("input.pdf") +writer = PdfWriter() + +page = reader.pages[0] +page.rotate(90) # Rotate 90 degrees clockwise +writer.add_page(page) + +with open("rotated.pdf", "wb") as output: + writer.write(output) +``` + +### pdfplumber - Text and Table Extraction + +#### Extract Text with Layout +```python +import pdfplumber + +with pdfplumber.open("document.pdf") as pdf: + for page in pdf.pages: + text = page.extract_text() + print(text) +``` + +#### Extract Tables +```python +with pdfplumber.open("document.pdf") as pdf: + for i, page in enumerate(pdf.pages): + tables = page.extract_tables() + for j, table in enumerate(tables): + print(f"Table {j+1} on page {i+1}:") + for row in table: + print(row) +``` + +#### Advanced Table Extraction +```python +import pandas as pd + +with pdfplumber.open("document.pdf") as pdf: + all_tables = [] + for page in pdf.pages: + tables = page.extract_tables() + for table in tables: + if table: # Check if table is not empty + df = pd.DataFrame(table[1:], columns=table[0]) + all_tables.append(df) + +# Combine all tables +if all_tables: + combined_df = pd.concat(all_tables, ignore_index=True) + combined_df.to_excel("extracted_tables.xlsx", index=False) +``` + +### reportlab - Create PDFs + +#### Basic PDF Creation +```python +from reportlab.lib.pagesizes import letter +from reportlab.pdfgen import canvas + +c = canvas.Canvas("hello.pdf", pagesize=letter) +width, height = letter + +# Add text +c.drawString(100, height - 100, "Hello World!") +c.drawString(100, height - 120, "This is a PDF created with reportlab") + +# Add a line +c.line(100, height - 140, 400, height - 140) + +# Save +c.save() +``` + +#### Create PDF with Multiple Pages +```python +from reportlab.lib.pagesizes import letter +from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak +from reportlab.lib.styles import getSampleStyleSheet + +doc = SimpleDocTemplate("report.pdf", pagesize=letter) +styles = getSampleStyleSheet() +story = [] + +# Add content +title = Paragraph("Report Title", styles['Title']) +story.append(title) +story.append(Spacer(1, 12)) + +body = Paragraph("This is the body of the report. 
" * 20, styles['Normal']) +story.append(body) +story.append(PageBreak()) + +# Page 2 +story.append(Paragraph("Page 2", styles['Heading1'])) +story.append(Paragraph("Content for page 2", styles['Normal'])) + +# Build PDF +doc.build(story) +``` + +## Command-Line Tools + +### pdftotext (poppler-utils) +```bash +# Extract text +pdftotext input.pdf output.txt + +# Extract text preserving layout +pdftotext -layout input.pdf output.txt + +# Extract specific pages +pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5 +``` + +### qpdf +```bash +# Merge PDFs +qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf + +# Split pages +qpdf input.pdf --pages . 1-5 -- pages1-5.pdf +qpdf input.pdf --pages . 6-10 -- pages6-10.pdf + +# Rotate pages +qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees + +# Remove password +qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf +``` + +### pdftk (if available) +```bash +# Merge +pdftk file1.pdf file2.pdf cat output merged.pdf + +# Split +pdftk input.pdf burst + +# Rotate +pdftk input.pdf rotate 1east output rotated.pdf +``` + +## Common Tasks + +### Extract Text from Scanned PDFs +```python +# Requires: pip install pytesseract pdf2image +import pytesseract +from pdf2image import convert_from_path + +# Convert PDF to images +images = convert_from_path('scanned.pdf') + +# OCR each page +text = "" +for i, image in enumerate(images): + text += f"Page {i+1}:\n" + text += pytesseract.image_to_string(image) + text += "\n\n" + +print(text) +``` + +### Add Watermark +```python +from pypdf import PdfReader, PdfWriter + +# Create watermark (or load existing) +watermark = PdfReader("watermark.pdf").pages[0] + +# Apply to all pages +reader = PdfReader("document.pdf") +writer = PdfWriter() + +for page in reader.pages: + page.merge_page(watermark) + writer.add_page(page) + +with open("watermarked.pdf", "wb") as output: + writer.write(output) +``` + +### Extract Images +```bash +# Using pdfimages (poppler-utils) +pdfimages -j input.pdf output_prefix + +# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc. 
+``` + +### Password Protection +```python +from pypdf import PdfReader, PdfWriter + +reader = PdfReader("input.pdf") +writer = PdfWriter() + +for page in reader.pages: + writer.add_page(page) + +# Add password +writer.encrypt("userpassword", "ownerpassword") + +with open("encrypted.pdf", "wb") as output: + writer.write(output) +``` + +## Quick Reference + +| Task | Best Tool | Command/Code | +|------|-----------|--------------| +| Merge PDFs | pypdf | `writer.add_page(page)` | +| Split PDFs | pypdf | One page per file | +| Extract text | pdfplumber | `page.extract_text()` | +| Extract tables | pdfplumber | `page.extract_tables()` | +| Create PDFs | reportlab | Canvas or Platypus | +| Command line merge | qpdf | `qpdf --empty --pages ...` | +| OCR scanned PDFs | pytesseract | Convert to image first | +| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md | + +## Next Steps + +- For advanced pypdfium2 usage, see REFERENCE.md +- For JavaScript libraries (pdf-lib), see REFERENCE.md +- If you need to fill out a PDF form, follow the instructions in FORMS.md +- For troubleshooting guides, see REFERENCE.md diff --git a/code_puppy/bundled_skills/Office/pdf/scripts/check_bounding_boxes.py b/code_puppy/bundled_skills/Office/pdf/scripts/check_bounding_boxes.py new file mode 100644 index 00000000..5ce4bb15 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/scripts/check_bounding_boxes.py @@ -0,0 +1,82 @@ +import json +import sys +from dataclasses import dataclass + +# Script to check that the `fields.json` file that Ticca creates when analyzing PDFs +# does not have overlapping bounding boxes. See FORMS.md. + + +@dataclass +class RectAndField: + rect: list[float] + rect_type: str + field: dict + + +# Returns a list of messages that are printed to stdout for Ticca to read. +def get_bounding_box_messages(fields_json_stream) -> list[str]: + messages = [] + fields = json.load(fields_json_stream) + messages.append(f"Read {len(fields['form_fields'])} fields") + + def rects_intersect(r1, r2): + disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0] + disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1] + return not (disjoint_horizontal or disjoint_vertical) + + rects_and_fields = [] + for f in fields["form_fields"]: + rects_and_fields.append(RectAndField(f["label_bounding_box"], "label", f)) + rects_and_fields.append(RectAndField(f["entry_bounding_box"], "entry", f)) + + has_error = False + for i, ri in enumerate(rects_and_fields): + # This is O(N^2); we can optimize if it becomes a problem. 
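+        # Compare every pair of boxes that share a page; an overlap between a
+        # field's own label/entry boxes or between two different fields is
+        # reported as a FAILURE message.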
+ for j in range(i + 1, len(rects_and_fields)): + rj = rects_and_fields[j] + if ri.field["page_number"] == rj.field["page_number"] and rects_intersect( + ri.rect, rj.rect + ): + has_error = True + if ri.field is rj.field: + messages.append( + f"FAILURE: intersection between label and entry bounding boxes for `{ri.field['description']}` ({ri.rect}, {rj.rect})" + ) + else: + messages.append( + f"FAILURE: intersection between {ri.rect_type} bounding box for `{ri.field['description']}` ({ri.rect}) and {rj.rect_type} bounding box for `{rj.field['description']}` ({rj.rect})" + ) + if len(messages) >= 20: + messages.append( + "Aborting further checks; fix bounding boxes and try again" + ) + return messages + if ri.rect_type == "entry": + if "entry_text" in ri.field: + font_size = ri.field["entry_text"].get("font_size", 14) + entry_height = ri.rect[3] - ri.rect[1] + if entry_height < font_size: + has_error = True + messages.append( + f"FAILURE: entry bounding box height ({entry_height}) for `{ri.field['description']}` is too short for the text content (font size: {font_size}). Increase the box height or decrease the font size." + ) + if len(messages) >= 20: + messages.append( + "Aborting further checks; fix bounding boxes and try again" + ) + return messages + + if not has_error: + messages.append("SUCCESS: All bounding boxes are valid") + return messages + + +if __name__ == "__main__": + if len(sys.argv) != 2: + print("Usage: check_bounding_boxes.py [fields.json]") + sys.exit(1) + # Input file should be in the `fields.json` format described in FORMS.md. + with open(sys.argv[1]) as f: + messages = get_bounding_box_messages(f) + for msg in messages: + print(msg) diff --git a/code_puppy/bundled_skills/Office/pdf/scripts/check_fillable_fields.py b/code_puppy/bundled_skills/Office/pdf/scripts/check_fillable_fields.py new file mode 100644 index 00000000..8d5f9f91 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/scripts/check_fillable_fields.py @@ -0,0 +1,14 @@ +import sys + +from pypdf import PdfReader + +# Script for Ticca to run to determine whether a PDF has fillable form fields. See FORMS.md. + + +reader = PdfReader(sys.argv[1]) +if reader.get_fields(): + print("This PDF has fillable form fields") +else: + print( + "This PDF does not have fillable form fields; you will need to visually determine where to enter data" + ) diff --git a/code_puppy/bundled_skills/Office/pdf/scripts/convert_pdf_to_images.py b/code_puppy/bundled_skills/Office/pdf/scripts/convert_pdf_to_images.py new file mode 100644 index 00000000..95ab00bc --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/scripts/convert_pdf_to_images.py @@ -0,0 +1,34 @@ +import os +import sys + +from pdf2image import convert_from_path + +# Converts each page of a PDF to a PNG image. 
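+# Pages are rendered at 200 DPI, then downscaled so that neither dimension
+# exceeds max_dim pixels (1000 by default), keeping the images small enough
+# for visual analysis.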
+ + +def convert(pdf_path, output_dir, max_dim=1000): + images = convert_from_path(pdf_path, dpi=200) + + for i, image in enumerate(images): + # Scale image if needed to keep width/height under `max_dim` + width, height = image.size + if width > max_dim or height > max_dim: + scale_factor = min(max_dim / width, max_dim / height) + new_width = int(width * scale_factor) + new_height = int(height * scale_factor) + image = image.resize((new_width, new_height)) + + image_path = os.path.join(output_dir, f"page_{i + 1}.png") + image.save(image_path) + print(f"Saved page {i + 1} as {image_path} (size: {image.size})") + + print(f"Converted {len(images)} pages to PNG images") + + +if __name__ == "__main__": + if len(sys.argv) != 3: + print("Usage: convert_pdf_to_images.py [input pdf] [output directory]") + sys.exit(1) + pdf_path = sys.argv[1] + output_directory = sys.argv[2] + convert(pdf_path, output_directory) diff --git a/code_puppy/bundled_skills/Office/pdf/scripts/create_validation_image.py b/code_puppy/bundled_skills/Office/pdf/scripts/create_validation_image.py new file mode 100644 index 00000000..06d27391 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/scripts/create_validation_image.py @@ -0,0 +1,46 @@ +import json +import sys + +from PIL import Image, ImageDraw + +# Creates "validation" images with rectangles for the bounding box information that +# Ticca creates when determining where to add text annotations in PDFs. See FORMS.md. + + +def create_validation_image(page_number, fields_json_path, input_path, output_path): + # Input file should be in the `fields.json` format described in FORMS.md. + with open(fields_json_path, "r") as f: + data = json.load(f) + + img = Image.open(input_path) + draw = ImageDraw.Draw(img) + num_boxes = 0 + + for field in data["form_fields"]: + if field["page_number"] == page_number: + entry_box = field["entry_bounding_box"] + label_box = field["label_bounding_box"] + # Draw red rectangle over entry bounding box and blue rectangle over the label. + draw.rectangle(entry_box, outline="red", width=2) + draw.rectangle(label_box, outline="blue", width=2) + num_boxes += 2 + + img.save(output_path) + print( + f"Created validation image at {output_path} with {num_boxes} bounding boxes" + ) + + +if __name__ == "__main__": + if len(sys.argv) != 5: + print( + "Usage: create_validation_image.py [page number] [fields.json file] [input image path] [output image path]" + ) + sys.exit(1) + page_number = int(sys.argv[1]) + fields_json_path = sys.argv[2] + input_image_path = sys.argv[3] + output_image_path = sys.argv[4] + create_validation_image( + page_number, fields_json_path, input_image_path, output_image_path + ) diff --git a/code_puppy/bundled_skills/Office/pdf/scripts/extract_form_field_info.py b/code_puppy/bundled_skills/Office/pdf/scripts/extract_form_field_info.py new file mode 100644 index 00000000..950ac443 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/scripts/extract_form_field_info.py @@ -0,0 +1,162 @@ +import json +import sys + +from pypdf import PdfReader + +# Extracts data for the fillable form fields in a PDF and outputs JSON that +# Ticca uses to fill the fields. See FORMS.md. + + +# This matches the format used by PdfReader `get_fields` and `update_page_form_field_values` methods. 
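+# Builds the fully qualified field ID by walking the annotation's /Parent chain
+# and joining the /T name components with dots.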
+def get_full_annotation_field_id(annotation): + components = [] + while annotation: + field_name = annotation.get("/T") + if field_name: + components.append(field_name) + annotation = annotation.get("/Parent") + return ".".join(reversed(components)) if components else None + + +def make_field_dict(field, field_id): + field_dict = {"field_id": field_id} + ft = field.get("/FT") + if ft == "/Tx": + field_dict["type"] = "text" + elif ft == "/Btn": + field_dict["type"] = "checkbox" # radio groups handled separately + states = field.get("/_States_", []) + if len(states) == 2: + # "/Off" seems to always be the unchecked value, as suggested by + # https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=448 + # It can be either first or second in the "/_States_" list. + if "/Off" in states: + field_dict["checked_value"] = ( + states[0] if states[0] != "/Off" else states[1] + ) + field_dict["unchecked_value"] = "/Off" + else: + print( + f"Unexpected state values for checkbox `${field_id}`. Its checked and unchecked values may not be correct; if you're trying to check it, visually verify the results." + ) + field_dict["checked_value"] = states[0] + field_dict["unchecked_value"] = states[1] + elif ft == "/Ch": + field_dict["type"] = "choice" + states = field.get("/_States_", []) + field_dict["choice_options"] = [ + { + "value": state[0], + "text": state[1], + } + for state in states + ] + else: + field_dict["type"] = f"unknown ({ft})" + return field_dict + + +# Returns a list of fillable PDF fields: +# [ +# { +# "field_id": "name", +# "page": 1, +# "type": ("text", "checkbox", "radio_group", or "choice") +# // Per-type additional fields described in FORMS.md +# }, +# ] +def get_field_info(reader: PdfReader): + fields = reader.get_fields() + + field_info_by_id = {} + possible_radio_names = set() + + for field_id, field in fields.items(): + # Skip if this is a container field with children, except that it might be + # a parent group for radio button options. + if field.get("/Kids"): + if field.get("/FT") == "/Btn": + possible_radio_names.add(field_id) + continue + field_info_by_id[field_id] = make_field_dict(field, field_id) + + # Bounding rects are stored in annotations in page objects. + + # Radio button options have a separate annotation for each choice; + # all choices have the same field name. + # See https://westhealth.github.io/exploring-fillable-forms-with-pdfrw.html + radio_fields_by_id = {} + + for page_index, page in enumerate(reader.pages): + annotations = page.get("/Annots", []) + for ann in annotations: + field_id = get_full_annotation_field_id(ann) + if field_id in field_info_by_id: + field_info_by_id[field_id]["page"] = page_index + 1 + field_info_by_id[field_id]["rect"] = ann.get("/Rect") + elif field_id in possible_radio_names: + try: + # ann['/AP']['/N'] should have two items. One of them is '/Off', + # the other is the active value. + on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"] + except KeyError: + continue + if len(on_values) == 1: + rect = ann.get("/Rect") + if field_id not in radio_fields_by_id: + radio_fields_by_id[field_id] = { + "field_id": field_id, + "type": "radio_group", + "page": page_index + 1, + "radio_options": [], + } + # Note: at least on macOS 15.7, Preview.app doesn't show selected + # radio buttons correctly. (It does if you remove the leading slash + # from the value, but that causes them not to appear correctly in + # Chrome/Firefox/Acrobat/etc). 
+ radio_fields_by_id[field_id]["radio_options"].append( + { + "value": on_values[0], + "rect": rect, + } + ) + + # Some PDFs have form field definitions without corresponding annotations, + # so we can't tell where they are. Ignore these fields for now. + fields_with_location = [] + for field_info in field_info_by_id.values(): + if "page" in field_info: + fields_with_location.append(field_info) + else: + print( + f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring" + ) + + # Sort by page number, then Y position (flipped in PDF coordinate system), then X. + def sort_key(f): + if "radio_options" in f: + rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0] + else: + rect = f.get("rect") or [0, 0, 0, 0] + adjusted_position = [-rect[1], rect[0]] + return [f.get("page"), adjusted_position] + + sorted_fields = fields_with_location + list(radio_fields_by_id.values()) + sorted_fields.sort(key=sort_key) + + return sorted_fields + + +def write_field_info(pdf_path: str, json_output_path: str): + reader = PdfReader(pdf_path) + field_info = get_field_info(reader) + with open(json_output_path, "w") as f: + json.dump(field_info, f, indent=2) + print(f"Wrote {len(field_info)} fields to {json_output_path}") + + +if __name__ == "__main__": + if len(sys.argv) != 3: + print("Usage: extract_form_field_info.py [input pdf] [output json]") + sys.exit(1) + write_field_info(sys.argv[1], sys.argv[2]) diff --git a/code_puppy/bundled_skills/Office/pdf/scripts/fill_fillable_fields.py b/code_puppy/bundled_skills/Office/pdf/scripts/fill_fillable_fields.py new file mode 100644 index 00000000..1966a943 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/scripts/fill_fillable_fields.py @@ -0,0 +1,120 @@ +import json +import sys + +from extract_form_field_info import get_field_info +from pypdf import PdfReader, PdfWriter + +# Fills fillable form fields in a PDF. See FORMS.md. + + +def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path: str): + with open(fields_json_path) as f: + fields = json.load(f) + # Group by page number. + fields_by_page = {} + for field in fields: + if "value" in field: + field_id = field["field_id"] + page = field["page"] + if page not in fields_by_page: + fields_by_page[page] = {} + fields_by_page[page][field_id] = field["value"] + + reader = PdfReader(input_pdf_path) + + has_error = False + field_info = get_field_info(reader) + fields_by_ids = {f["field_id"]: f for f in field_info} + for field in fields: + existing_field = fields_by_ids.get(field["field_id"]) + if not existing_field: + has_error = True + print(f"ERROR: `{field['field_id']}` is not a valid field ID") + elif field["page"] != existing_field["page"]: + has_error = True + print( + f"ERROR: Incorrect page number for `{field['field_id']}` (got {field['page']}, expected {existing_field['page']})" + ) + else: + if "value" in field: + err = validation_error_for_field_value(existing_field, field["value"]) + if err: + print(err) + has_error = True + if has_error: + sys.exit(1) + + writer = PdfWriter(clone_from=reader) + for page, field_values in fields_by_page.items(): + writer.update_page_form_field_values( + writer.pages[page - 1], field_values, auto_regenerate=False + ) + + # This seems to be necessary for many PDF viewers to format the form values correctly. + # It may cause the viewer to show a "save changes" dialog even if the user doesn't make any changes. 
+ writer.set_need_appearances_writer(True) + + with open(output_pdf_path, "wb") as f: + writer.write(f) + + +def validation_error_for_field_value(field_info, field_value): + field_type = field_info["type"] + field_id = field_info["field_id"] + if field_type == "checkbox": + checked_val = field_info["checked_value"] + unchecked_val = field_info["unchecked_value"] + if field_value != checked_val and field_value != unchecked_val: + return f'ERROR: Invalid value "{field_value}" for checkbox field "{field_id}". The checked value is "{checked_val}" and the unchecked value is "{unchecked_val}"' + elif field_type == "radio_group": + option_values = [opt["value"] for opt in field_info["radio_options"]] + if field_value not in option_values: + return f'ERROR: Invalid value "{field_value}" for radio group field "{field_id}". Valid values are: {option_values}' + elif field_type == "choice": + choice_values = [opt["value"] for opt in field_info["choice_options"]] + if field_value not in choice_values: + return f'ERROR: Invalid value "{field_value}" for choice field "{field_id}". Valid values are: {choice_values}' + return None + + +# pypdf (at least version 5.7.0) has a bug when setting the value for a selection list field. +# In _writer.py around line 966: +# +# if field.get(FA.FT, "/Tx") == "/Ch" and field_flags & FA.FfBits.Combo == 0: +# txt = "\n".join(annotation.get_inherited(FA.Opt, [])) +# +# The problem is that for selection lists, `get_inherited` returns a list of two-element lists like +# [["value1", "Text 1"], ["value2", "Text 2"], ...] +# This causes `join` to throw a TypeError because it expects an iterable of strings. +# The horrible workaround is to patch `get_inherited` to return a list of the value strings. +# We call the original method and adjust the return value only if the argument to `get_inherited` +# is `FA.Opt` and if the return value is a list of two-element lists. +def monkeypatch_pydpf_method(): + from pypdf.constants import FieldDictionaryAttributes + from pypdf.generic import DictionaryObject + + original_get_inherited = DictionaryObject.get_inherited + + def patched_get_inherited(self, key: str, default=None): + result = original_get_inherited(self, key, default) + if key == FieldDictionaryAttributes.Opt: + if isinstance(result, list) and all( + isinstance(v, list) and len(v) == 2 for v in result + ): + result = [r[0] for r in result] + return result + + DictionaryObject.get_inherited = patched_get_inherited + + +if __name__ == "__main__": + if len(sys.argv) != 4: + print( + "Usage: fill_fillable_fields.py [input pdf] [field_values.json] [output pdf]" + ) + sys.exit(1) + monkeypatch_pydpf_method() + input_pdf = sys.argv[1] + fields_json = sys.argv[2] + output_pdf = sys.argv[3] + fill_pdf_fields(input_pdf, fields_json, output_pdf) diff --git a/code_puppy/bundled_skills/Office/pdf/scripts/fill_pdf_form_with_annotations.py b/code_puppy/bundled_skills/Office/pdf/scripts/fill_pdf_form_with_annotations.py new file mode 100644 index 00000000..d392d240 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pdf/scripts/fill_pdf_form_with_annotations.py @@ -0,0 +1,113 @@ +import json +import sys + +from pypdf import PdfReader, PdfWriter +from pypdf.annotations import FreeText + +# Fills a PDF by adding text annotations defined in `fields.json`. See FORMS.md. 
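+# Bounding boxes in fields.json are given in image pixel coordinates (origin at
+# the top-left); transform_coordinates converts them to PDF points (origin at
+# the bottom-left) before the FreeText annotations are added.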
+ + +def transform_coordinates(bbox, image_width, image_height, pdf_width, pdf_height): + """Transform bounding box from image coordinates to PDF coordinates""" + # Image coordinates: origin at top-left, y increases downward + # PDF coordinates: origin at bottom-left, y increases upward + x_scale = pdf_width / image_width + y_scale = pdf_height / image_height + + left = bbox[0] * x_scale + right = bbox[2] * x_scale + + # Flip Y coordinates for PDF + top = pdf_height - (bbox[1] * y_scale) + bottom = pdf_height - (bbox[3] * y_scale) + + return left, bottom, right, top + + +def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path): + """Fill the PDF form with data from fields.json""" + + # `fields.json` format described in FORMS.md. + with open(fields_json_path, "r") as f: + fields_data = json.load(f) + + # Open the PDF + reader = PdfReader(input_pdf_path) + writer = PdfWriter() + + # Copy all pages to writer + writer.append(reader) + + # Get PDF dimensions for each page + pdf_dimensions = {} + for i, page in enumerate(reader.pages): + mediabox = page.mediabox + pdf_dimensions[i + 1] = [mediabox.width, mediabox.height] + + # Process each form field + annotations = [] + for field in fields_data["form_fields"]: + page_num = field["page_number"] + + # Get page dimensions and transform coordinates. + page_info = next( + p for p in fields_data["pages"] if p["page_number"] == page_num + ) + image_width = page_info["image_width"] + image_height = page_info["image_height"] + pdf_width, pdf_height = pdf_dimensions[page_num] + + transformed_entry_box = transform_coordinates( + field["entry_bounding_box"], + image_width, + image_height, + pdf_width, + pdf_height, + ) + + # Skip empty fields + if "entry_text" not in field or "text" not in field["entry_text"]: + continue + entry_text = field["entry_text"] + text = entry_text["text"] + if not text: + continue + + font_name = entry_text.get("font", "Arial") + font_size = str(entry_text.get("font_size", 14)) + "pt" + font_color = entry_text.get("font_color", "000000") + + # Font size/color seems to not work reliably across viewers: + # https://github.com/py-pdf/pypdf/issues/2084 + annotation = FreeText( + text=text, + rect=transformed_entry_box, + font=font_name, + font_size=font_size, + font_color=font_color, + border_color=None, + background_color=None, + ) + annotations.append(annotation) + # page_number is 0-based for pypdf + writer.add_annotation(page_number=page_num - 1, annotation=annotation) + + # Save the filled PDF + with open(output_pdf_path, "wb") as output: + writer.write(output) + + print(f"Successfully filled PDF form and saved to {output_pdf_path}") + print(f"Added {len(annotations)} text annotations") + + +if __name__ == "__main__": + if len(sys.argv) != 4: + print( + "Usage: fill_pdf_form_with_annotations.py [input pdf] [fields.json] [output pdf]" + ) + sys.exit(1) + input_pdf = sys.argv[1] + fields_json = sys.argv[2] + output_pdf = sys.argv[3] + + fill_pdf_form(input_pdf, fields_json, output_pdf) diff --git a/code_puppy/bundled_skills/Office/pptx/SKILL.md b/code_puppy/bundled_skills/Office/pptx/SKILL.md new file mode 100644 index 00000000..d8f8aced --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/SKILL.md @@ -0,0 +1,476 @@ +--- +name: pptx +description: "Presentation creation, editing, and analysis. 
When Ticca needs to work with presentations (.pptx files) for: (1) Creating new presentations, (2) Modifying or editing content, (3) Working with layouts, (4) Adding comments or speaker notes, or any other presentation tasks" +license: Proprietary. LICENSE.txt has complete terms +--- + +# PPTX creation, editing, and analysis + +## Overview + +Create, edit, or analyze the contents of .pptx files when requested. A .pptx file is essentially a ZIP archive containing XML files and other resources. Different tools and workflows are available for different tasks. + +## CRITICAL: Read All Documentation First + +**Before starting any presentation task**, read ALL relevant documentation files completely to understand the full workflow: + +1. **For creating new presentations**: Read [`html2pptx.md`](html2pptx.md) and [`css.md`](css.md) in their entirety +2. **For editing existing presentations**: Read [`ooxml.md`](ooxml.md) in its entirety +3. **For template-based creation**: Read the relevant sections of this file plus [`css.md`](css.md) + +**NEVER set any range limits when reading these files.** Understanding the complete workflow, constraints, and best practices before starting is essential for producing high-quality presentations. Partial knowledge leads to errors, inconsistent styling, and visual defects that require rework. + +## Reading and analyzing content + +### Text extraction + +To read just the text content of a presentation, convert the document to markdown: + +```bash +# Convert document to markdown +python -m markitdown path-to-file.pptx +``` + +### Raw XML access + +Use raw XML access for: comments, speaker notes, slide layouts, animations, design elements, and complex formatting. To access these features, unpack a presentation and read its raw XML contents. + +#### Unpacking a file + +`python ooxml/scripts/unpack.py ` + +**Note**: The unpack.py script is located at `skills/public/pptx/ooxml/scripts/unpack.py` relative to the project root. If the script doesn't exist at this path, use `find . -name "unpack.py"` to locate it. + +#### Key file structures + +- `ppt/presentation.xml` - Main presentation metadata and slide references +- `ppt/slides/slide{N}.xml` - Individual slide contents (slide1.xml, slide2.xml, etc.) +- `ppt/notesSlides/notesSlide{N}.xml` - Speaker notes for each slide +- `ppt/comments/modernComment_*.xml` - Comments for specific slides +- `ppt/slideLayouts/` - Layout templates for slides +- `ppt/slideMasters/` - Master slide templates +- `ppt/theme/` - Theme and styling information +- `ppt/media/` - Images and other media files + +#### Typography and color extraction + +**To emulate example designs**, analyze the presentation's typography and colors first using the methods below: + +1. **Read theme file**: Check `ppt/theme/theme1.xml` for colors (``) and fonts (``) +2. **Sample slide content**: Examine `ppt/slides/slide1.xml` for actual font usage (``) and colors +3. **Search for patterns**: Use grep to find color (``, ``) and font references across all XML files + +## Creating a new PowerPoint presentation **without a template** + +When creating a new PowerPoint presentation from scratch, use the **html2pptx** workflow to convert HTML slides to PowerPoint with accurate positioning. + +### Workflow + +1. **Read documentation**: Read [`html2pptx.md`](html2pptx.md) and [`css.md`](css.md) completely (see "CRITICAL: Read All Documentation First" section above) + +2. 
**PREREQUISITE - Extract html2pptx library**:
+   - Extract the library next to your script: `mkdir -p html2pptx && tar -xzf skills/public/pptx/html2pptx.tgz -C html2pptx`
+   - This creates a `html2pptx/` directory with the library files and CLI binaries
+
+3. **Plan the presentation**: Follow html2pptx.md "Design Philosophy" section for:
+   - Aesthetic direction and bold design choices
+   - Color palette selection (see "Creating your color palette")
+   - Typography strategy
+   - Write DETAILED outline with slide layouts and presenter notes (1-3 sentences per slide)
+
+4. **Set CSS variables**: Override CSS variables in a shared `.css` file for colors, typography, and spacing (see css.md "Design System Variables")
+
+5. **Create HTML slides** (960px × 540px for 16:9): Follow html2pptx.md for:
+   - Slide layout zones (title, content, footnote)
+   - Critical text rules (proper HTML tags)
+   - Supported elements and styling
+
+6. Create and run a JavaScript file using the [`html2pptx`](./html2pptx) library to convert HTML slides to PowerPoint and save the presentation
+
+   - Run with: `NODE_PATH="$(npm root -g)" node your-script.js 2>&1`
+   - Use the `html2pptx` function to process each HTML file
+   - Add charts and tables to placeholder areas using PptxGenJS API
+   - Save the presentation using `pptx.writeFile()`
+
+   - **⚠️ CRITICAL:** Your script MUST follow this example structure. Think aloud before writing the script to make sure that you correctly use the APIs. Do NOT call `pptx.addSlide`.
+
+   ```javascript
+   const pptxgen = require("pptxgenjs");
+   const { html2pptx } = require("./html2pptx");
+
+   // Create a new pptx presentation
+   const pptx = new pptxgen();
+   pptx.layout = "LAYOUT_16x9"; // Must match HTML body dimensions
+
+   // Add an HTML-only slide
+   await html2pptx("slide1.html", pptx);
+
+   // Add an HTML slide with chart placeholders
+   const { slide: slide2, placeholders } = await html2pptx("slide2.html", pptx);
+   slide2.addChart(pptx.charts.LINE, chartData, placeholders[0]);
+
+   // Save the presentation
+   await pptx.writeFile("output.pptx");
+   ```
+
+7. **Visual validation**: Convert to images and inspect for layout issues
+   - Convert PPTX to PDF first: `soffice --headless --convert-to pdf output.pptx`
+   - Then convert PDF to images: `pdftoppm -jpeg -r 150 output.pdf slide`
+   - This creates files like `slide-1.jpg`, `slide-2.jpg`, etc.
+   - Read each generated image file and carefully examine for:
+     - **Text cutoff**: Text being cut off by header bars, shapes, or slide edges
+     - **Text overlap**: Text overlapping with other text or shapes
+     - **Positioning issues**: Content too close to slide boundaries or other elements
+     - **Contrast issues**: Insufficient contrast between text and backgrounds
+     - **Alignment problems**: Elements not properly aligned with each other
+     - **Visual hierarchy**: Important content properly emphasized
+   - **CRITICAL: All slides MUST pass these validation checks before delivering to the user.** Do not skip this step or deliver presentations with visual defects.
+   - If issues found, fix them in the following order of priority:
+     1. **Increase margins** - Add more padding/spacing around problematic elements
+     2. **Adjust font size** - Reduce text size to fit within available space
+     3. 
**Rethink the layout entirely** - If the above fixes don't work, redesign the slide layout + - Regenerate the presentation after making changes + - Repeat until all slides are visually correct + +## Editing an existing PowerPoint presentation + +To edit slides in an existing PowerPoint presentation, work with the raw Office Open XML (OOXML) format. This involves unpacking the .pptx file, editing the XML content, and repacking it. + +### Workflow + +1. **Read documentation**: Read [`ooxml.md`](ooxml.md) completely (see "CRITICAL: Read All Documentation First" section above) +2. Unpack the presentation: `python ooxml/scripts/unpack.py ` +3. Edit the XML files (primarily `ppt/slides/slide{N}.xml` and related files) +4. **CRITICAL**: Validate immediately after each edit: `python ooxml/scripts/validate.py --original ` +5. Pack the final presentation: `python ooxml/scripts/pack.py ` + +## Creating a new PowerPoint presentation **using a template** + +To create a presentation that follows an existing template's design, duplicate and re-arrange template slides before replacing placeholder content. + +### Workflow + +1. **Extract template text AND create visual thumbnail grid**: + + - Extract text: `python -m markitdown template.pptx > template-content.md` + - Read `template-content.md` completely to understand the template contents + - Create thumbnail grids: `python scripts/thumbnail.py template.pptx` + - See [Creating Thumbnail Grids](#creating-thumbnail-grids) section for more details + +2. **Analyze template and save inventory to a file**: + + - **Visual Analysis**: Review thumbnail grid(s) to understand slide layouts, design patterns, and visual structure + - Create and save a template inventory file at `template-inventory.md` containing: + + ```markdown + # Template Inventory Analysis + + **Total Slides: [count]** + **IMPORTANT: Slides are 0-indexed (first slide = 0, last slide = count-1)** + + ## [Category Name] + + - Slide 0: [Layout code if available] - Description/purpose + - Slide 1: [Layout code] - Description/purpose + - Slide 2: [Layout code] - Description/purpose + [... EVERY slide must be listed individually with its index ...] + ``` + + - **Using the thumbnail grid**: Reference the visual thumbnails to identify: + - Layout patterns (title slides, content layouts, section dividers) + - Image placeholder locations and counts + - Design consistency across slide groups + - Visual hierarchy and structure + - This inventory file is REQUIRED for selecting appropriate templates in the next step + +3. **Create presentation outline based on template inventory**: + + - Review available templates from step 2. + - Choose an intro or title template for the first slide. This should be one of the first templates. + - Choose safe, text-based layouts for the other slides. 
+ - **CRITICAL: Match layout structure to actual content**: + - Single-column layouts: Use for unified narrative or single topic + - Two-column layouts: Use ONLY when there are exactly 2 distinct items/concepts + - Three-column layouts: Use ONLY when there are exactly 3 distinct items/concepts + - Image + text layouts: Use ONLY when there are actual images to insert + - Quote layouts: Use ONLY for actual quotes from people (with attribution), never for emphasis + - Never use layouts with more placeholders than available content + - With 2 items, avoid forcing them into a 3-column layout + - With 4+ items, consider breaking into multiple slides or using a list format + - Count actual content pieces BEFORE selecting the layout + - Verify each placeholder in the chosen layout will be filled with meaningful content + - Select one option representing the **best** layout for each content section. + - Save `outline.md` with content AND template mapping that leverages available designs + - Example template mapping: + ``` + # Template slides to use (0-based indexing) + # WARNING: Verify indices are within range! Template with 73 slides has indices 0-72 + # Mapping: slide numbers from outline -> template slide indices + template_mapping = [ + 0, # Use slide 0 (Title/Cover) + 34, # Use slide 34 (B1: Title and body) + 34, # Use slide 34 again (duplicate for second B1) + 50, # Use slide 50 (E1: Quote) + 54, # Use slide 54 (F2: Closing + Text) + ] + ``` + +4. **Duplicate, reorder, and delete slides using `rearrange.py`**: + + - Use the `scripts/rearrange.py` script to create a new presentation with slides in the desired order: + ```bash + python scripts/rearrange.py template.pptx working.pptx 0,34,34,50,52 + ``` + - The script handles duplicating repeated slides, deleting unused slides, and reordering automatically + - Slide indices are 0-based (first slide is 0, second is 1, etc.) + - The same slide index can appear multiple times to duplicate that slide + +5. **Extract ALL text using the `inventory.py` script**: + + - **Run inventory extraction**: + ```bash + python scripts/inventory.py working.pptx text-inventory.json + ``` + - **Read text-inventory.json** completely to understand all shapes and their properties + + - The inventory JSON structure: + + ```json + { + "slide-0": { + "shape-0": { + "placeholder_type": "TITLE", // or null for non-placeholders + "left": 1.5, // position in inches + "top": 2.0, + "width": 7.5, + "height": 1.2, + "paragraphs": [ + { + "text": "Paragraph text", + // Optional properties (only included when non-default): + "bullet": true, // explicit bullet detected + "level": 0, // only included when bullet is true + "alignment": "CENTER", // CENTER, RIGHT (not LEFT) + "space_before": 10.0, // space before paragraph in points + "space_after": 6.0, // space after paragraph in points + "line_spacing": 22.4, // line spacing in points + "font_name": "Arial", // from first run + "font_size": 14.0, // in points + "bold": true, + "italic": false, + "underline": false, + "color": "FF0000" // RGB color + } + ] + } + } + } + ``` + + - Key features: + - **Slides**: Named as "slide-0", "slide-1", etc. + - **Shapes**: Ordered by visual position (top-to-bottom, left-to-right) as "shape-0", "shape-1", etc. 
+ - **Placeholder types**: TITLE, CENTER_TITLE, SUBTITLE, BODY, OBJECT, or null + - **Default font size**: `default_font_size` in points extracted from layout placeholders (when available) + - **Slide numbers are filtered**: Shapes with SLIDE_NUMBER placeholder type are automatically excluded from inventory + - **Bullets**: When `bullet: true`, `level` is always included (even if 0) + - **Spacing**: `space_before`, `space_after`, and `line_spacing` in points (only included when set) + - **Colors**: `color` for RGB (e.g., "FF0000"), `theme_color` for theme colors (e.g., "DARK_1") + - **Properties**: Only non-default values are included in the output + +6. **Generate replacement text and save the data to a JSON file** + Based on the text inventory from the previous step: + + - **CRITICAL**: First verify which shapes exist in the inventory - only reference shapes that are actually present + - **VALIDATION**: The replace.py script validates that all shapes in the replacement JSON exist in the inventory + - Referencing a non-existent shape produces an error showing available shapes + - Referencing a non-existent slide produces an error indicating the slide doesn't exist + - All validation errors are shown at once before the script exits + - **IMPORTANT**: The replace.py script uses inventory.py internally to identify ALL text shapes + - **AUTOMATIC CLEARING**: ALL text shapes from the inventory are cleared unless "paragraphs" are provided for them + - Add a "paragraphs" field to shapes that need content (not "replacement_paragraphs") + - Shapes without "paragraphs" in the replacement JSON have their text cleared automatically + - Paragraphs with bullets are automatically left aligned. Avoid setting the `alignment` property when `"bullet": true` + - Generate appropriate replacement content for placeholder text + - Use shape size to determine appropriate content length + - **CRITICAL**: Include paragraph properties from the original inventory - don't just provide text + - **IMPORTANT**: When bullet: true, do NOT include bullet symbols (•, -, \*) in text - they're added automatically + - **ESSENTIAL FORMATTING RULES**: + - Headers/titles should typically have `"bold": true` + - List items should have `"bullet": true, "level": 0` (level is required when bullet is true) + - Preserve any alignment properties (e.g., `"alignment": "CENTER"` for centered text) + - Include font properties when different from default (e.g., `"font_size": 14.0`, `"font_name": "Lora"`) + - Colors: Use `"color": "FF0000"` for RGB or `"theme_color": "DARK_1"` for theme colors + - The replacement script expects **properly formatted paragraphs**, not just text strings + - **Overlapping shapes**: Prefer shapes with larger default_font_size or more appropriate placeholder_type + - Save the updated inventory with replacements to `replacement-text.json` + - **WARNING**: Different template layouts have different shape counts - always check the actual inventory before creating replacements + + Example paragraphs field showing proper formatting: + + ```json + "paragraphs": [ + { + "text": "New presentation title text", + "alignment": "CENTER", + "bold": true + }, + { + "text": "Section Header", + "bold": true + }, + { + "text": "First bullet point without bullet symbol", + "bullet": true, + "level": 0 + }, + { + "text": "Red colored text", + "color": "FF0000" + }, + { + "text": "Theme colored text", + "theme_color": "DARK_1" + }, + { + "text": "Regular paragraph text without special formatting" + } + ] + ``` + + **Shapes not listed in 
the replacement JSON are automatically cleared**: + + ```json + { + "slide-0": { + "shape-0": { + "paragraphs": [...] // This shape gets new text + } + // shape-1 and shape-2 from inventory will be cleared automatically + } + } + ``` + + **Common formatting patterns for presentations**: + + - Title slides: Bold text, sometimes centered + - Section headers within slides: Bold text + - Bullet lists: Each item needs `"bullet": true, "level": 0` + - Body text: Usually no special properties needed + - Quotes: May have special alignment or font properties + +7. **Apply replacements using the `replace.py` script** + + ```bash + python scripts/replace.py working.pptx replacement-text.json output.pptx + ``` + + The script will: + + - First extract the inventory of ALL text shapes using functions from inventory.py + - Validate that all shapes in the replacement JSON exist in the inventory + - Clear text from ALL shapes identified in the inventory + - Apply new text only to shapes with "paragraphs" defined in the replacement JSON + - Preserve formatting by applying paragraph properties from the JSON + - Handle bullets, alignment, font properties, and colors automatically + - Save the updated presentation + + Example validation errors: + + ``` + ERROR: Invalid shapes in replacement JSON: + - Shape 'shape-99' not found on 'slide-0'. Available shapes: shape-0, shape-1, shape-4 + - Slide 'slide-999' not found in inventory + ``` + + ``` + ERROR: Replacement text made overflow worse in these shapes: + - slide-0/shape-2: overflow worsened by 1.25" (was 0.00", now 1.25") + ``` + +## Creating Thumbnail Grids + +To create visual thumbnail grids of PowerPoint slides for quick analysis and reference: + +```bash +python scripts/thumbnail.py template.pptx [output_prefix] +``` + +**Features**: + +- Creates: `thumbnails.jpg` (or `thumbnails-1.jpg`, `thumbnails-2.jpg`, etc. for large decks) +- Default: 5 columns, max 30 slides per grid (5×6) +- Custom prefix: `python scripts/thumbnail.py template.pptx my-grid` + - Note: The output prefix should include the path if you want output in a specific directory (e.g., `workspace/my-grid`) +- Adjust columns: `--cols 4` (range: 3-6, affects slides per grid) +- Grid limits: 3 cols = 12 slides/grid, 4 cols = 20, 5 cols = 30, 6 cols = 42 +- Slides are zero-indexed (Slide 0, Slide 1, etc.) + +**Use cases**: + +- Template analysis: Quickly understand slide layouts and design patterns +- Content review: Visual overview of entire presentation +- Navigation reference: Find specific slides by their visual appearance +- Quality check: Verify all slides are properly formatted + +**Examples**: + +```bash +# Basic usage +python scripts/thumbnail.py presentation.pptx + +# Combine options: custom name, columns +python scripts/thumbnail.py template.pptx analysis --cols 4 +``` + +## Converting Slides to Images + +To visually analyze PowerPoint slides, convert them to images using a two-step process: + +1. **Convert PPTX to PDF**: + + ```bash + soffice --headless --convert-to pdf template.pptx + ``` + +2. **Convert PDF pages to JPEG images**: + ```bash + pdftoppm -jpeg -r 150 template.pdf slide + ``` + This creates files like `slide-1.jpg`, `slide-2.jpg`, etc. 
+ +Options: + +- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance) +- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred) +- `-f N`: First page to convert (e.g., `-f 2` starts from page 2) +- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5) +- `slide`: Prefix for output files + +Example for specific range: + +```bash +pdftoppm -jpeg -r 150 -f 2 -l 5 template.pdf slide # Converts only pages 2-5 +``` + +## Code Style Guidelines + +**IMPORTANT**: When generating code for PPTX operations: + +- Write concise code +- Avoid verbose variable names and redundant operations +- Avoid unnecessary print statements + +## Dependencies + +Required dependencies (should already be installed): + +- **markitdown**: `pip install "markitdown[pptx]"` (for text extraction from presentations) +- **pptxgenjs**: `npm install -g pptxgenjs` (for creating presentations via html2pptx) +- **playwright**: `npm install -g playwright` (for HTML rendering in html2pptx) +- **react-icons**: `npm install -g react-icons react react-dom` (for icons in SVG format) +- **LibreOffice**: For PDF conversion (required for visual validation step) + - macOS: `brew install --cask libreoffice` + - Linux: `sudo apt-get install libreoffice` +- **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images) +- **defusedxml**: `pip install defusedxml` (for secure XML parsing) diff --git a/code_puppy/bundled_skills/Office/pptx/css.md b/code_puppy/bundled_skills/Office/pptx/css.md new file mode 100644 index 00000000..e85fc4f1 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/css.md @@ -0,0 +1,337 @@ +# Global CSS Framework Reference + +This document provides a comprehensive reference for the global.css framework used in HTML slide creation for PowerPoint conversion. + +--- + +## ⚠️ No Import Necessary + +The global.css framework is automatically added to every slide. Do NOT try to include it in a slide with ` + + +
+<body>
+  <div>
+    <h1>Slide title</h1>
+    <p>Subtitle or context</p>
+  </div>
+</body>
+ + +``` + +## Using the html2pptx Library + +### Installation & Setup + +**Important**: Extract the html2pptx library next to your script before using. See the **Prerequisites Check** section at the top of this document. + +**When running scripts, set NODE_PATH for global packages like pptxgenjs:** + +```sh +NODE_PATH="$(npm root -g)" node your-script.js 2>&1 +``` + +### Dependencies + +These libraries have been globally installed and are available to use: + +- `pptxgenjs` +- `playwright` + +### ⚠️ IMPORTANT: How To Use html2pptx + +**Common errors:** + +- **LIBRARY NOT EXTRACTED**: Extract the tarball first with `mkdir -p html2pptx && tar -xzf skills/public/pptx/html2pptx.tgz -C html2pptx` + - ✅ Correct: `require("./html2pptx")` + - ❌ Wrong: `require("@ant/html2pptx")` - Use relative path, not package name +- DO NOT call `pptx.addSlide()` directly, `html2pptx` creates a slide for you +- `html2pptx` accepts an `htmlFilePath` and a `pptx` presentation object + - If you pass the wrong arguments, your script will throw errors or time out + +**Your script MUST follow the following example.** + +```javascript +const pptxgen = require("pptxgenjs"); +const { html2pptx } = require("./html2pptx"); + +// Create a new pptx presentation +const pptx = new pptxgen(); +pptx.layout = "LAYOUT_16x9"; // Must match HTML body dimensions + +// Add an HTML-only slide +await html2pptx("slide1.html", pptx); + +// Add a slide with a chart placeholder +const { slide, placeholders } = await html2pptx("slide2.html", pptx); +slide.addChart(pptx.charts.LINE, chartData, placeholders[0]); + +// Save the presentation +await pptx.writeFile("output.pptx"); +``` + +### API Reference + +#### Function Signature + +```javascript +await html2pptx(htmlFilePath, pptxPresentation, options); +``` + +#### Parameters + +- `htmlFilePath` (string): Path to HTML file (absolute or relative) +- `pptxPresentation` (pptxgen): PptxGenJS presentation instance with layout already set +- `options` (object, optional): + - `tmpDir` (string): Temporary directory for generated files (default: `process.env.TMPDIR || '/tmp'`) + +#### Returns + +```javascript +{ + slide: pptxgenSlide, // The created/updated slide + placeholders: [ // Array of placeholder positions + { id: string, x: number, y: number, w: number, h: number }, + ... + ] +} +``` + +### Validation + +The library automatically validates and collects all errors before throwing: + +1. **HTML dimensions must match presentation layout** - Reports dimension mismatches +2. **Content must not overflow body** - Reports overflow with exact measurements +3. **Text element styling** - Reports backgrounds/borders/shadows on text elements (only allowed on block elements) + +**All validation errors are collected and reported together** in a single error message, allowing you to fix all issues at once instead of one at a time. 
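+
+If it helps to see the failure path in one place, the sketch below (with hypothetical slide filenames) catches the combined validation error, prints it, and exits non-zero so the HTML can be fixed and the script re-run. It relies only on the `html2pptx(htmlFilePath, pptx)` behavior described above.
+
+```javascript
+// Minimal sketch: build a deck and surface html2pptx validation errors together.
+// "slides/title.html" and "slides/content.html" are placeholder filenames.
+const pptxgen = require("pptxgenjs");
+const { html2pptx } = require("./html2pptx");
+
+async function build() {
+  const pptx = new pptxgen();
+  pptx.layout = "LAYOUT_16x9";
+
+  try {
+    await html2pptx("slides/title.html", pptx);
+    await html2pptx("slides/content.html", pptx);
+  } catch (err) {
+    // Dimension, overflow, and styling problems arrive in one message,
+    // so print everything, fix the HTML, and re-run.
+    console.error(err.message);
+    process.exit(1);
+  }
+
+  await pptx.writeFile({ fileName: "deck.pptx" });
+}
+
+build();
+```
+
+Run it the same way as the other scripts, with `NODE_PATH="$(npm root -g)"` set.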
+ +### Working with Placeholders + +```javascript +const { slide, placeholders } = await html2pptx("slide.html", pptx); + +// Use first placeholder +slide.addChart(pptx.charts.BAR, data, placeholders[0]); + +// Find by ID +const chartArea = placeholders.find((p) => p.id === "chart-area"); +slide.addChart(pptx.charts.LINE, data, chartArea); +``` + +### Complete Example + +```javascript +const pptxgen = require("pptxgenjs"); +const { html2pptx } = require("./html2pptx"); + +async function createPresentation() { + const pptx = new pptxgen(); + pptx.layout = "LAYOUT_16x9"; + pptx.author = "Your Name"; + pptx.title = "My Presentation"; + + // Slide 1: Title + const { slide: slide1 } = await html2pptx("slides/title.html", pptx); + + // Slide 2: Content with chart + const { slide: slide2, placeholders } = await html2pptx( + "slides/data.html", + pptx + ); + + const chartData = [ + { + name: "Sales", + labels: ["Q1", "Q2", "Q3", "Q4"], + values: [4500, 5500, 6200, 7100], + }, + ]; + + slide2.addChart(pptx.charts.BAR, chartData, { + ...placeholders[0], + showTitle: true, + title: "Quarterly Sales", + showCatAxisTitle: true, + catAxisTitle: "Quarter", + showValAxisTitle: true, + valAxisTitle: "Sales ($000s)", + }); + + // Save + await pptx.writeFile({ fileName: "presentation.pptx" }); + console.log("Presentation created successfully!"); +} + +createPresentation().catch(console.error); +``` + +**Run with:** + +```sh +NODE_PATH="$(npm root -g)" node create-presentation.js 2>&1 +``` + +## Using PptxGenJS + +After converting HTML to slides with `html2pptx`, you'll use PptxGenJS to add dynamic content like charts, images, and additional elements. + +### ⚠️ Critical Rules + +#### Colors + +- **NEVER use `#` prefix** with hex colors in PptxGenJS - causes file corruption +- ✅ Correct: `color: "FF0000"`, `fill: { color: "0066CC" }` +- ❌ Wrong: `color: "#FF0000"` (breaks document) + +### Adding Images + +Always calculate aspect ratios from actual image dimensions: + +```javascript +// Get image dimensions: identify image.png | grep -o '[0-9]* x [0-9]*' +const imgWidth = 1860, + imgHeight = 1519; // From actual file +const aspectRatio = imgWidth / imgHeight; + +const h = 3; // Max height +const w = h * aspectRatio; +const x = (10 - w) / 2; // Center on 16:9 slide + +slide.addImage({ path: "chart.png", x, y: 1.5, w, h }); +``` + +### Adding Text + +```javascript +// Rich text with formatting +slide.addText( + [ + { text: "Bold ", options: { bold: true } }, + { text: "Italic ", options: { italic: true } }, + { text: "Normal" }, + ], + { + x: 1, + y: 2, + w: 8, + h: 1, + } +); +``` + +### Adding Shapes + +```javascript +// Rectangle +slide.addShape(pptx.shapes.RECTANGLE, { + x: 1, + y: 1, + w: 3, + h: 2, + fill: { color: "4472C4" }, + line: { color: "000000", width: 2 }, +}); + +// Circle +slide.addShape(pptx.shapes.OVAL, { + x: 5, + y: 1, + w: 2, + h: 2, + fill: { color: "ED7D31" }, +}); + +// Rounded rectangle +slide.addShape(pptx.shapes.ROUNDED_RECTANGLE, { + x: 1, + y: 4, + w: 3, + h: 1.5, + fill: { color: "70AD47" }, + rectRadius: 0.2, +}); +``` + +### Adding Charts + +**Required for most charts:** Axis labels using `catAxisTitle` (category) and `valAxisTitle` (value). 
+ +**Chart Data Format:** + +- Use **single series with all labels** for simple bar/line charts +- Each series creates a separate legend entry +- Labels array defines X-axis values + +**Time Series Data - Choose Correct Granularity:** + +- **< 30 days**: Use daily grouping (e.g., "10-01", "10-02") - avoid monthly aggregation that creates single-point charts +- **30-365 days**: Use monthly grouping (e.g., "2024-01", "2024-02") +- **> 365 days**: Use yearly grouping (e.g., "2023", "2024") +- **Validate**: Charts with only 1 data point likely indicate incorrect aggregation for the time period + +```javascript +const { slide, placeholders } = await html2pptx("slide.html", pptx); + +// CORRECT: Single series with all labels +slide.addChart( + pptx.charts.BAR, + [ + { + name: "Sales 2024", + labels: ["Q1", "Q2", "Q3", "Q4"], + values: [4500, 5500, 6200, 7100], + }, + ], + { + ...placeholders[0], // Use placeholder position + barDir: "col", // 'col' = vertical bars, 'bar' = horizontal + showTitle: true, + title: "Quarterly Sales", + showLegend: false, // No legend needed for single series + // Required axis labels + showCatAxisTitle: true, + catAxisTitle: "Quarter", + showValAxisTitle: true, + valAxisTitle: "Sales ($000s)", + // Optional: Control scaling (adjust min based on data range for better visualization) + valAxisMaxVal: 8000, + valAxisMinVal: 0, // Use 0 for counts/amounts; for clustered data (e.g., 4500-7100), consider starting closer to min value + valAxisMajorUnit: 2000, // Control y-axis label spacing to prevent crowding + catAxisLabelRotate: 45, // Rotate labels if crowded + dataLabelPosition: "outEnd", + dataLabelColor: "000000", + // Use single color for single-series charts + chartColors: ["4472C4"], // All bars same color + } +); +``` + +#### Scatter Chart + +**IMPORTANT**: Scatter chart data format is unusual - first series contains X-axis values, subsequent series contain Y-values: + +```javascript +// Prepare data +const data1 = [ + { x: 10, y: 20 }, + { x: 15, y: 25 }, + { x: 20, y: 30 }, +]; +const data2 = [ + { x: 12, y: 18 }, + { x: 18, y: 22 }, +]; + +const allXValues = [...data1.map((d) => d.x), ...data2.map((d) => d.x)]; + +slide.addChart( + pptx.charts.SCATTER, + [ + { name: "X-Axis", values: allXValues }, // First series = X values + { name: "Series 1", values: data1.map((d) => d.y) }, // Y values only + { name: "Series 2", values: data2.map((d) => d.y) }, // Y values only + ], + { + x: 1, + y: 1, + w: 8, + h: 4, + lineSize: 0, // 0 = no connecting lines + lineDataSymbol: "circle", + lineDataSymbolSize: 6, + showCatAxisTitle: true, + catAxisTitle: "X Axis", + showValAxisTitle: true, + valAxisTitle: "Y Axis", + chartColors: ["4472C4", "ED7D31"], + } +); +``` + +#### Line Chart + +```javascript +slide.addChart( + pptx.charts.LINE, + [ + { + name: "Temperature", + labels: ["Jan", "Feb", "Mar", "Apr"], + values: [32, 35, 42, 55], + }, + ], + { + x: 1, + y: 1, + w: 8, + h: 4, + lineSize: 4, + lineSmooth: true, + // Required axis labels + showCatAxisTitle: true, + catAxisTitle: "Month", + showValAxisTitle: true, + valAxisTitle: "Temperature (°F)", + // Optional: Y-axis range (set min based on data range for better visualization) + valAxisMinVal: 0, // For ranges starting at 0 (counts, percentages, etc.) 
+ valAxisMaxVal: 60, + valAxisMajorUnit: 20, // Control y-axis label spacing to prevent crowding (e.g., 10, 20, 25) + // valAxisMinVal: 30, // PREFERRED: For data clustered in a range (e.g., 32-55 or ratings 3-5), start axis closer to min value to show variation + // Optional: Chart colors + chartColors: ["4472C4", "ED7D31", "A5A5A5"], + } +); +``` + +#### Pie Chart (No Axis Labels Required) + +**CRITICAL**: Pie charts require a **single data series** with all categories in the `labels` array and corresponding values in the `values` array. + +```javascript +slide.addChart( + pptx.charts.PIE, + [ + { + name: "Market Share", + labels: ["Product A", "Product B", "Other"], // All categories in one array + values: [35, 45, 20], // All values in one array + }, + ], + { + x: 2, + y: 1, + w: 6, + h: 4, + showPercent: true, + showLegend: true, + legendPos: "r", // right + chartColors: ["4472C4", "ED7D31", "A5A5A5"], + } +); +``` + +#### Multiple Data Series + +```javascript +slide.addChart( + pptx.charts.LINE, + [ + { + name: "Product A", + labels: ["Q1", "Q2", "Q3", "Q4"], + values: [10, 20, 30, 40], + }, + { + name: "Product B", + labels: ["Q1", "Q2", "Q3", "Q4"], + values: [15, 25, 20, 35], + }, + ], + { + x: 1, + y: 1, + w: 8, + h: 4, + showCatAxisTitle: true, + catAxisTitle: "Quarter", + showValAxisTitle: true, + valAxisTitle: "Revenue ($M)", + } +); +``` + +### Chart Colors + +**CRITICAL**: Use hex colors **without** the `#` prefix - including `#` causes file corruption. + +**Align chart colors with your chosen design palette**, ensuring sufficient contrast and distinctiveness for data visualization. Adjust colors for: + +- Strong contrast between adjacent series +- Readability against slide backgrounds +- Accessibility (avoid red-green only combinations) + +```javascript +// Example: Ocean palette-inspired chart colors (adjusted for contrast) +const chartColors = ["16A085", "FF6B9D", "2C3E50", "F39C12", "9B59B6"]; + +// Single-series chart: Use one color for all bars/points +slide.addChart( + pptx.charts.BAR, + [ + { + name: "Sales", + labels: ["Q1", "Q2", "Q3", "Q4"], + values: [4500, 5500, 6200, 7100], + }, + ], + { + ...placeholders[0], + chartColors: ["16A085"], // All bars same color + showLegend: false, + } +); + +// Multi-series chart: Each series gets a different color +slide.addChart( + pptx.charts.LINE, + [ + { name: "Product A", labels: ["Q1", "Q2", "Q3"], values: [10, 20, 30] }, + { name: "Product B", labels: ["Q1", "Q2", "Q3"], values: [15, 25, 20] }, + ], + { + ...placeholders[0], + chartColors: ["16A085", "FF6B9D"], // One color per series + } +); +``` + +### Adding Tables + +Tables can be added with basic or advanced formatting: + +#### Basic Table + +```javascript +slide.addTable( + [ + ["Header 1", "Header 2", "Header 3"], + ["Row 1, Col 1", "Row 1, Col 2", "Row 1, Col 3"], + ["Row 2, Col 1", "Row 2, Col 2", "Row 2, Col 3"], + ], + { + x: 0.5, + y: 1, + w: 9, + h: 3, + border: { pt: 1, color: "999999" }, + fill: { color: "F1F1F1" }, + } +); +``` + +#### Table with Custom Formatting + +```javascript +const tableData = [ + // Header row with custom styling + [ + { + text: "Product", + options: { fill: { color: "4472C4" }, color: "FFFFFF", bold: true }, + }, + { + text: "Revenue", + options: { fill: { color: "4472C4" }, color: "FFFFFF", bold: true }, + }, + { + text: "Growth", + options: { fill: { color: "4472C4" }, color: "FFFFFF", bold: true }, + }, + ], + // Data rows + ["Product A", "$50M", "+15%"], + ["Product B", "$35M", "+22%"], + ["Product C", "$28M", "+8%"], +]; + 
+slide.addTable(tableData, { + x: 1, + y: 1.5, + w: 8, + h: 3, + colW: [3, 2.5, 2.5], // Column widths + rowH: [0.5, 0.6, 0.6, 0.6], // Row heights + border: { pt: 1, color: "CCCCCC" }, + align: "center", + valign: "middle", + fontSize: 14, +}); +``` + +#### Table with Merged Cells + +```javascript +const mergedTableData = [ + [ + { + text: "Q1 Results", + options: { + colspan: 3, + fill: { color: "4472C4" }, + color: "FFFFFF", + bold: true, + }, + }, + ], + ["Product", "Sales", "Market Share"], + ["Product A", "$25M", "35%"], + ["Product B", "$18M", "25%"], +]; + +slide.addTable(mergedTableData, { + x: 1, + y: 1, + w: 8, + h: 2.5, + colW: [3, 2.5, 2.5], + border: { pt: 1, color: "DDDDDD" }, +}); +``` + +### Table Options + +Common table options: + +- `x, y, w, h` - Position and size +- `colW` - Array of column widths (in inches) +- `rowH` - Array of row heights (in inches) +- `border` - Border style: `{ pt: 1, color: "999999" }` +- `fill` - Background color (no # prefix) +- `align` - Text alignment: "left", "center", "right" +- `valign` - Vertical alignment: "top", "middle", "bottom" +- `fontSize` - Text size +- `autoPage` - Auto-create new slides if content overflows diff --git a/code_puppy/bundled_skills/Office/pptx/html2pptx.tgz b/code_puppy/bundled_skills/Office/pptx/html2pptx.tgz new file mode 100644 index 00000000..be4a91d7 Binary files /dev/null and b/code_puppy/bundled_skills/Office/pptx/html2pptx.tgz differ diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml.md b/code_puppy/bundled_skills/Office/pptx/ooxml.md new file mode 100644 index 00000000..951b3cf6 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/ooxml.md @@ -0,0 +1,427 @@ +# Office Open XML Technical Reference for PowerPoint + +**Important: Read this entire document before starting.** Critical XML schema rules and formatting requirements are covered throughout. Incorrect implementation can create invalid PPTX files that PowerPoint cannot open. + +## Technical Guidelines + +### Schema Compliance +- **Element ordering in ``**: ``, ``, `` +- **Whitespace**: Add `xml:space='preserve'` to `` elements with leading/trailing spaces +- **Unicode**: Escape characters in ASCII content: `"` becomes `“` +- **Images**: Add to `ppt/media/`, reference in slide XML, set dimensions to fit slide bounds +- **Relationships**: Update `ppt/slides/_rels/slideN.xml.rels` for each slide's resources +- **Dirty attribute**: Add `dirty="0"` to `` and `` elements to indicate clean state + +## Presentation Structure + +### Basic Slide Structure +```xml + + + + + ... + ... 
+ + + + +``` + +### Text Box / Shape with Text +```xml + + + + + + + + + + + + + + + + + + + + + + Slide Title + + + + +``` + +### Text Formatting +```xml + + + + Bold Text + + + + + + Italic Text + + + + + + Underlined + + + + + + + + + + Highlighted Text + + + + + + + + + + Colored Arial 24pt + + + + + + + + + + Formatted text + +``` + +### Lists +```xml + + + + + + + First bullet point + + + + + + + + + + First numbered item + + + + + + + + + + Indented bullet + + +``` + +### Shapes +```xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +``` + +### Images +```xml + + + + + + + + + + + + + + + + + + + + + + + + + + +``` + +### Tables +```xml + + + + + + + + + + + + + + + + + + + + + + + + + + + Cell 1 + + + + + + + + + + + Cell 2 + + + + + + + + + +``` + +### Slide Layouts + +```xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +``` + +## File Updates + +When adding content, update these files: + +**`ppt/_rels/presentation.xml.rels`:** +```xml + + +``` + +**`ppt/slides/_rels/slide1.xml.rels`:** +```xml + + +``` + +**`[Content_Types].xml`:** +```xml + + + +``` + +**`ppt/presentation.xml`:** +```xml + + + + +``` + +**`docProps/app.xml`:** Update slide count and statistics +```xml +2 +10 +50 +``` + +## Slide Operations + +### Adding a New Slide +When adding a slide to the end of the presentation: + +1. **Create the slide file** (`ppt/slides/slideN.xml`) +2. **Update `[Content_Types].xml`**: Add Override for the new slide +3. **Update `ppt/_rels/presentation.xml.rels`**: Add relationship for the new slide +4. **Update `ppt/presentation.xml`**: Add slide ID to `` +5. **Create slide relationships** (`ppt/slides/_rels/slideN.xml.rels`) if needed +6. **Update `docProps/app.xml`**: Increment slide count and update statistics (if present) + +### Duplicating a Slide +1. Copy the source slide XML file with a new name +2. Update all IDs in the new slide to be unique +3. Follow the "Adding a New Slide" steps above +4. **CRITICAL**: Remove or update any notes slide references in `_rels` files +5. Remove references to unused media files + +### Reordering Slides +1. **Update `ppt/presentation.xml`**: Reorder `` elements in `` +2. The order of `` elements determines slide order +3. Keep slide IDs and relationship IDs unchanged + +Example: +```xml + + + + + + + + + + + + + +``` + +### Deleting a Slide +1. **Remove from `ppt/presentation.xml`**: Delete the `` entry +2. **Remove from `ppt/_rels/presentation.xml.rels`**: Delete the relationship +3. **Remove from `[Content_Types].xml`**: Delete the Override entry +4. **Delete files**: Remove `ppt/slides/slideN.xml` and `ppt/slides/_rels/slideN.xml.rels` +5. **Update `docProps/app.xml`**: Decrement slide count and update statistics +6. **Clean up unused media**: Remove orphaned images from `ppt/media/` + +Note: Don't renumber remaining slides - keep their original IDs and filenames. 
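+
+To make the reordering step concrete, here is a rough Python sketch. It assumes the deck has already been unpacked (for example into `unpacked/`) and that `lxml` is available; neither assumption is part of the bundled tooling, so treat this as an illustration of the idea rather than a supported script, and validate and re-pack afterwards as usual.
+
+```python
+# Illustrative sketch: reorder <p:sldId> entries in ppt/presentation.xml.
+# "unpacked/" and the lxml dependency are assumptions, not part of this skill.
+from lxml import etree
+
+P = "http://schemas.openxmlformats.org/presentationml/2006/main"
+
+path = "unpacked/ppt/presentation.xml"
+tree = etree.parse(path)
+sld_id_lst = tree.find(f".//{{{P}}}sldIdLst")
+
+slides = list(sld_id_lst)        # current <p:sldId> elements, in document order
+new_order = [2, 0, 1]            # example: move the third slide to the front
+
+for sld in slides:
+    sld_id_lst.remove(sld)
+for i in new_order:
+    sld_id_lst.append(slides[i]) # slide ids and r:id attributes stay unchanged
+
+tree.write(path, xml_declaration=True, encoding="UTF-8", standalone=True)
+```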
+
+## Common Errors to Avoid
+
+- **Encodings**: Escape unicode characters in ASCII content: `“` becomes `&#8220;`
+- **Images**: Add to `ppt/media/` and update relationship files
+- **Lists**: Omit bullets from list headers
+- **IDs**: Use valid hexadecimal values for UUIDs
+- **Themes**: Check all themes in the `theme` directory for colors
+
+## Validation Checklist for Template-Based Presentations
+
+### Before Packing, Always:
+- **Clean unused resources**: Remove unreferenced media, fonts, and notes directories
+- **Fix Content_Types.xml**: Declare ALL slides, layouts, and themes present in the package
+- **Fix relationship IDs**:
+  - Remove font embed references if not using embedded fonts
+- **Remove broken references**: Check all `_rels` files for references to deleted resources
+
+### Common Template Duplication Pitfalls:
+- Multiple slides referencing the same notes slide after duplication
+- Image/media references from template slides that no longer exist
+- Font embedding references when fonts aren't included
+- Missing slideLayout declarations for layouts 12-25
+- docProps directory may not unpack; this is optional
\ No newline at end of file
[Remaining new files: ISO-IEC29500-4_2016 XSD schemas added under code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/: dml-chart.xsd, dml-chartDrawing.xsd, dml-diagram.xsd, dml-lockedCanvas.xsd, dml-main.xsd, dml-picture.xsd, dml-spreadsheetDrawing.xsd, dml-wordprocessingDrawing.xsd, pml.xsd, shared-additionalCharacteristics.xsd, shared-bibliography.xsd, shared-commonSimpleTypes.xsd, shared-customXmlDataProperties.xsd, shared-customXmlSchemaProperties.xsd, shared-documentPropertiesCustom.xsd, shared-documentPropertiesExtended.xsd, shared-documentPropertiesVariantTypes.xsd, shared-math.xsd, shared-relationshipReference.xsd, sml.xsd]
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-main.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-main.xsd
new file mode 100644
index 00000000..eeb4ef8f
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-main.xsd
@@ -0,0 +1,570 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd
new file mode 100644
index 00000000..ca2575c7
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd
@@ -0,0 +1,509 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd
new file mode 100644
index 00000000..dd079e60
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd
@@ -0,0 +1,12 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd
new file mode 100644
index 00000000..3dd6cf62
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd
@@ -0,0 +1,108 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd
new file mode 100644
index 00000000..f1041e34
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd
@@ -0,0 +1,96 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/wml.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/wml.xsd
new file mode 100644
index 00000000..9c5b7a63
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/wml.xsd
@@ -0,0 +1,3646 @@
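The ISO/IEC 29500 schemas above (wml.xsd in particular) are the files that the validation scripts added later in this diff load for XSD checking. As a minimal sketch of that flow, assuming an already-unpacked document and the bundled schema layout shown here (paths are illustrative; the real implementation lives in `ooxml/scripts/validation/base.py` below):

```python
# Minimal sketch: validate one unpacked WordprocessingML part against wml.xsd.
# Paths are illustrative; validation/base.py (later in this diff) is the real implementation.
from pathlib import Path

import lxml.etree

schemas_dir = Path("ooxml/schemas/ISO-IEC29500-4_2016")  # assumed schema location
part = Path("unpacked/word/document.xml")                 # assumed unpacked part

# Load the schema once, then validate the part and print any errors.
schema = lxml.etree.XMLSchema(lxml.etree.parse(str(schemas_dir / "wml.xsd")))
doc = lxml.etree.parse(str(part))

if not schema.validate(doc):
    for error in schema.error_log:
        print(error.message)
```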
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/xml.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/xml.xsd
new file mode 100644
index 00000000..fbd88768
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ISO-IEC29500-4_2016/xml.xsd
@@ -0,0 +1,116 @@
+ See http://www.w3.org/XML/1998/namespace.html and
+ http://www.w3.org/TR/REC-xml for information about this namespace.
+
+ This schema document describes the XML namespace, in a form
+ suitable for import by other schema documents.
+
+ Note that local names in this namespace are intended to be defined
+ only by the World Wide Web Consortium or its subgroups. The
+ following names are currently defined in this namespace and should
+ not be used with conflicting semantics by any Working Group,
+ specification, or document instance:
+
+ base (as an attribute name): denotes an attribute whose value
+ provides a URI to be used as the base for interpreting any
+ relative URIs in the scope of the element on which it
+ appears; its value is inherited. This name is reserved
+ by virtue of its definition in the XML Base specification.
+
+ lang (as an attribute name): denotes an attribute whose value
+ is a language code for the natural language of the content of
+ any element; its value is inherited. This name is reserved
+ by virtue of its definition in the XML specification.
+
+ space (as an attribute name): denotes an attribute whose
+ value is a keyword indicating what whitespace processing
+ discipline is intended for the content of the element; its
+ value is inherited. This name is reserved by virtue of its
+ definition in the XML specification.
+ Father (in any context at all): denotes Jon Bosak, the chair of
+ the original XML Working Group. This name is reserved by
+ the following decision of the W3C XML Plenary and
+ XML Coordination groups:
+
+ In appreciation for his vision, leadership and dedication
+ the W3C XML Plenary on this 10th day of February, 2000
+ reserves for Jon Bosak in perpetuity the XML name
+ xml:Father
+
+ This schema defines attributes and an attribute group
+ suitable for use by
+ schemas wishing to allow xml:base, xml:lang or xml:space attributes
+ on elements they define.
+
+ To enable this, such a schema must import this schema
+ for the XML namespace, e.g. as follows:
+ <schema . . .>
+ . . .
+ <import namespace="http://www.w3.org/XML/1998/namespace"
+ schemaLocation="http://www.w3.org/2001/03/xml.xsd"/>
+
+ Subsequently, qualified reference to any of the attributes
+ or the group defined below will have the desired effect, e.g.
+
+ <type . . .>
+ . . .
+ <attributeGroup ref="xml:specialAttrs"/>
+
+ will define a type which will schema-validate an instance
+ element with any of those attributes
+
+ In keeping with the XML Schema WG's standard versioning
+ policy, this schema document will persist at
+ http://www.w3.org/2001/03/xml.xsd.
+ At the date of issue it can also be found at
+ http://www.w3.org/2001/xml.xsd.
+ The schema document at that URI may however change in the future,
+ in order to remain compatible with the latest version of XML Schema
+ itself. In other words, if the XML Schema namespace changes, the version
+ of this document at
+ http://www.w3.org/2001/xml.xsd will change
+ accordingly; the version at
+ http://www.w3.org/2001/03/xml.xsd will not change.
+
+ In due course, we should install the relevant ISO 2- and 3-letter
+ codes as the enumerated possible values . . .
+
+ See http://www.w3.org/TR/xmlbase/ for
+ information about this attribute.
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-contentTypes.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-contentTypes.xsd
new file mode 100644
index 00000000..e4c5160e
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-contentTypes.xsd
@@ -0,0 +1,42 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-coreProperties.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-coreProperties.xsd
new file mode 100644
index 00000000..888c0fcd
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-coreProperties.xsd
@@ -0,0 +1,50 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-digSig.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-digSig.xsd
new file mode 100644
index 00000000..73782264
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-digSig.xsd
@@ -0,0 +1,49 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-relationships.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-relationships.xsd
new file mode 100644
index 00000000..762dcbe8
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/ecma/fouth-edition/opc-relationships.xsd
@@ -0,0 +1,33 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/mce/mc.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/mce/mc.xsd
new file mode 100644
index 00000000..ef725457
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/mce/mc.xsd
@@ -0,0 +1,75 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2010.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2010.xsd
new file mode 100644
index 00000000..f65f7777
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2010.xsd
@@ -0,0 +1,560 @@
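The ECMA OPC schemas just above describe the package-level parts ([Content_Types].xml, core properties, and .rels relationship files) that the validators later in this diff walk when checking for broken or unreferenced targets. A minimal sketch of reading relationship entries from an already-unpacked package (the path is illustrative; `validate_file_references()` in `validation/base.py` does the full check):

```python
# Minimal sketch: list the relationships declared for word/document.xml.
# The file path is an assumed example of an unpacked package layout.
import lxml.etree

PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

rels_root = lxml.etree.parse("unpacked/word/_rels/document.xml.rels").getroot()
for rel in rels_root.findall(f".//{{{PKG_RELS_NS}}}Relationship"):
    # Each Relationship carries an Id, a Type URI, and a Target path.
    print(rel.get("Id"), rel.get("Type"), rel.get("Target"))
```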
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2012.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2012.xsd
new file mode 100644
index 00000000..6b00755a
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2012.xsd
@@ -0,0 +1,67 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2018.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2018.xsd
new file mode 100644
index 00000000..f321d333
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-2018.xsd
@@ -0,0 +1,14 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-cex-2018.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-cex-2018.xsd
new file mode 100644
index 00000000..364c6a9b
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-cex-2018.xsd
@@ -0,0 +1,20 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-cid-2016.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-cid-2016.xsd
new file mode 100644
index 00000000..fed9d15b
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-cid-2016.xsd
@@ -0,0 +1,13 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-sdtdatahash-2020.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-sdtdatahash-2020.xsd
new file mode 100644
index 00000000..680cf154
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-sdtdatahash-2020.xsd
@@ -0,0 +1,4 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-symex-2015.xsd b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-symex-2015.xsd
new file mode 100644
index 00000000..89ada908
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/schemas/microsoft/wml-symex-2015.xsd
@@ -0,0 +1,8 @@
diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/pack.py b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/pack.py
new file mode 100644
index 00000000..4a23b67e
--- /dev/null
+++ b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/pack.py
@@ -0,0 +1,160 @@
+#!/usr/bin/env python3
+"""
+Tool to pack a directory into a .docx, .pptx, or .xlsx file with XML formatting undone.
+ +Example usage: + python pack.py [--force] +""" + +import argparse +import shutil +import subprocess +import sys +import tempfile +import zipfile +from pathlib import Path + +import defusedxml.minidom + + +def main(): + parser = argparse.ArgumentParser(description="Pack a directory into an Office file") + parser.add_argument("input_directory", help="Unpacked Office document directory") + parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)") + parser.add_argument("--force", action="store_true", help="Skip validation") + args = parser.parse_args() + + try: + success = pack_document( + args.input_directory, args.output_file, validate=not args.force + ) + + # Show warning if validation was skipped + if args.force: + print("Warning: Skipped validation, file may be corrupt", file=sys.stderr) + # Exit with error if validation failed + elif not success: + print("Contents would produce a corrupt file.", file=sys.stderr) + print("Please validate XML before repacking.", file=sys.stderr) + print("Use --force to skip validation and pack anyway.", file=sys.stderr) + sys.exit(1) + + except ValueError as e: + sys.exit(f"Error: {e}") + + +def pack_document(input_dir, output_file, validate=False): + """Pack a directory into an Office file (.docx/.pptx/.xlsx). + + Args: + input_dir: Path to unpacked Office document directory + output_file: Path to output Office file + validate: If True, validates with soffice (default: False) + + Returns: + bool: True if successful, False if validation failed + """ + input_dir = Path(input_dir) + output_file = Path(output_file) + + if not input_dir.is_dir(): + raise ValueError(f"{input_dir} is not a directory") + if output_file.suffix.lower() not in {".docx", ".pptx", ".xlsx"}: + raise ValueError(f"{output_file} must be a .docx, .pptx, or .xlsx file") + + # Work in temporary directory to avoid modifying original + with tempfile.TemporaryDirectory() as temp_dir: + temp_content_dir = Path(temp_dir) / "content" + shutil.copytree(input_dir, temp_content_dir) + + # Process XML files to remove pretty-printing whitespace + for pattern in ["*.xml", "*.rels"]: + for xml_file in temp_content_dir.rglob(pattern): + condense_xml(xml_file) + + # Create final Office file as zip archive + output_file.parent.mkdir(parents=True, exist_ok=True) + with zipfile.ZipFile(output_file, "w", zipfile.ZIP_DEFLATED) as zf: + for f in temp_content_dir.rglob("*"): + if f.is_file(): + zf.write(f, f.relative_to(temp_content_dir)) + + # Validate if requested + if validate: + if not validate_document(output_file): + output_file.unlink() # Delete the corrupt file + return False + + return True + + +def validate_document(doc_path): + """Validate document by converting to HTML with soffice.""" + # Determine the correct filter based on file extension + match doc_path.suffix.lower(): + case ".docx": + filter_name = "html:HTML" + case ".pptx": + filter_name = "html:impress_html_Export" + case ".xlsx": + filter_name = "html:HTML (StarCalc)" + + with tempfile.TemporaryDirectory() as temp_dir: + try: + result = subprocess.run( + [ + "soffice", + "--headless", + "--convert-to", + filter_name, + "--outdir", + temp_dir, + str(doc_path), + ], + capture_output=True, + timeout=10, + text=True, + ) + if not (Path(temp_dir) / f"{doc_path.stem}.html").exists(): + error_msg = result.stderr.strip() or "Document validation failed" + print(f"Validation error: {error_msg}", file=sys.stderr) + return False + return True + except FileNotFoundError: + print("Warning: soffice not found. 
Skipping validation.", file=sys.stderr) + return True + except subprocess.TimeoutExpired: + print("Validation error: Timeout during conversion", file=sys.stderr) + return False + except Exception as e: + print(f"Validation error: {e}", file=sys.stderr) + return False + + +def condense_xml(xml_file): + """Strip unnecessary whitespace and remove comments.""" + with open(xml_file, "r", encoding="utf-8") as f: + dom = defusedxml.minidom.parse(f) + + # Process each element to remove whitespace and comments + for element in dom.getElementsByTagName("*"): + # Skip w:t elements and their processing + if element.tagName.endswith(":t"): + continue + + # Remove whitespace-only text nodes and comment nodes + for child in list(element.childNodes): + if ( + child.nodeType == child.TEXT_NODE + and child.nodeValue + and child.nodeValue.strip() == "" + ) or child.nodeType == child.COMMENT_NODE: + element.removeChild(child) + + # Write back the condensed XML + with open(xml_file, "wb") as f: + f.write(dom.toxml(encoding="UTF-8")) + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/unpack.py b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/unpack.py new file mode 100644 index 00000000..2ac3909a --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/unpack.py @@ -0,0 +1,30 @@ +#!/usr/bin/env python3 +"""Unpack and format XML contents of Office files (.docx, .pptx, .xlsx)""" + +import random +import sys +import zipfile +from pathlib import Path + +import defusedxml.minidom + +# Get command line arguments +assert len(sys.argv) == 3, "Usage: python unpack.py " +input_file, output_dir = sys.argv[1], sys.argv[2] + +# Extract and format +output_path = Path(output_dir) +output_path.mkdir(parents=True, exist_ok=True) +zipfile.ZipFile(input_file).extractall(output_path) + +# Pretty print all XML files +xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels")) +for xml_file in xml_files: + content = xml_file.read_text(encoding="utf-8") + dom = defusedxml.minidom.parseString(content) + xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="ascii")) + +# For .docx files, suggest an RSID for tracked changes +if input_file.endswith(".docx"): + suggested_rsid = "".join(random.choices("0123456789ABCDEF", k=8)) + print(f"Suggested RSID for edit session: {suggested_rsid}") diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validate.py b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validate.py new file mode 100644 index 00000000..508c5891 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validate.py @@ -0,0 +1,69 @@ +#!/usr/bin/env python3 +""" +Command line tool to validate Office document XML files against XSD schemas and tracked changes. 
+ +Usage: + python validate.py --original +""" + +import argparse +import sys +from pathlib import Path + +from validation import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator + + +def main(): + parser = argparse.ArgumentParser(description="Validate Office document XML files") + parser.add_argument( + "unpacked_dir", + help="Path to unpacked Office document directory", + ) + parser.add_argument( + "--original", + required=True, + help="Path to original file (.docx/.pptx/.xlsx)", + ) + parser.add_argument( + "-v", + "--verbose", + action="store_true", + help="Enable verbose output", + ) + args = parser.parse_args() + + # Validate paths + unpacked_dir = Path(args.unpacked_dir) + original_file = Path(args.original) + file_extension = original_file.suffix.lower() + assert unpacked_dir.is_dir(), f"Error: {unpacked_dir} is not a directory" + assert original_file.is_file(), f"Error: {original_file} is not a file" + assert file_extension in [".docx", ".pptx", ".xlsx"], ( + f"Error: {original_file} must be a .docx, .pptx, or .xlsx file" + ) + + # Run validations + match file_extension: + case ".docx": + validators = [DOCXSchemaValidator, RedliningValidator] + case ".pptx": + validators = [PPTXSchemaValidator] + case _: + print(f"Error: Validation not supported for file type {file_extension}") + sys.exit(1) + + # Run validators + success = True + for V in validators: + validator = V(unpacked_dir, original_file, verbose=args.verbose) + if not validator.validate(): + success = False + + if success: + print("All validations PASSED!") + + sys.exit(0 if success else 1) + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/__init__.py b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/__init__.py new file mode 100644 index 00000000..db092ece --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/__init__.py @@ -0,0 +1,15 @@ +""" +Validation modules for Word document processing. +""" + +from .base import BaseSchemaValidator +from .docx import DOCXSchemaValidator +from .pptx import PPTXSchemaValidator +from .redlining import RedliningValidator + +__all__ = [ + "BaseSchemaValidator", + "DOCXSchemaValidator", + "PPTXSchemaValidator", + "RedliningValidator", +] diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/base.py b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/base.py new file mode 100644 index 00000000..165c3c5c --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/base.py @@ -0,0 +1,968 @@ +""" +Base validator with common validation logic for document files. 
+""" + +import re +from pathlib import Path + +import lxml.etree + + +class BaseSchemaValidator: + """Base validator with common validation logic for document files.""" + + # Elements whose 'id' attributes must be unique within their file + # Format: element_name -> (attribute_name, scope) + # scope can be 'file' (unique within file) or 'global' (unique across all files) + UNIQUE_ID_REQUIREMENTS = { + # Word elements + "comment": ("id", "file"), # Comment IDs in comments.xml + "commentrangestart": ("id", "file"), # Must match comment IDs + "commentrangeend": ("id", "file"), # Must match comment IDs + "bookmarkstart": ("id", "file"), # Bookmark start IDs + "bookmarkend": ("id", "file"), # Bookmark end IDs + # Note: ins and del (track changes) can share IDs when part of same revision + # PowerPoint elements + "sldid": ("id", "file"), # Slide IDs in presentation.xml + "sldmasterid": ("id", "global"), # Slide master IDs must be globally unique + "sldlayoutid": ("id", "global"), # Slide layout IDs must be globally unique + "cm": ("authorid", "file"), # Comment author IDs + # Excel elements + "sheet": ("sheetid", "file"), # Sheet IDs in workbook.xml + "definedname": ("id", "file"), # Named range IDs + # Drawing/Shape elements (all formats) + "cxnsp": ("id", "file"), # Connection shape IDs + "sp": ("id", "file"), # Shape IDs + "pic": ("id", "file"), # Picture IDs + "grpsp": ("id", "file"), # Group shape IDs + } + + # Container elements where ID uniqueness checks should be skipped + # These hold references that intentionally duplicate IDs of elements they reference + # Example: in sectionLst references in sldIdLst + EXCLUDED_ID_CONTAINERS = { + "sectionlst", # PowerPoint sections - sldId elements reference slides by ID + } + + # Mapping of element names to expected relationship types + # Subclasses should override this with format-specific mappings + ELEMENT_RELATIONSHIP_TYPES = {} + + # Unified schema mappings for all Office document types + SCHEMA_MAPPINGS = { + # Document type specific schemas + "word": "ISO-IEC29500-4_2016/wml.xsd", # Word documents + "ppt": "ISO-IEC29500-4_2016/pml.xsd", # PowerPoint presentations + "xl": "ISO-IEC29500-4_2016/sml.xsd", # Excel spreadsheets + # Common file types + "[Content_Types].xml": "ecma/fouth-edition/opc-contentTypes.xsd", + "app.xml": "ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd", + "core.xml": "ecma/fouth-edition/opc-coreProperties.xsd", + "custom.xml": "ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd", + ".rels": "ecma/fouth-edition/opc-relationships.xsd", + # Word-specific files + "people.xml": "microsoft/wml-2012.xsd", + "commentsIds.xml": "microsoft/wml-cid-2016.xsd", + "commentsExtensible.xml": "microsoft/wml-cex-2018.xsd", + "commentsExtended.xml": "microsoft/wml-2012.xsd", + # Chart files (common across document types) + "chart": "ISO-IEC29500-4_2016/dml-chart.xsd", + # Theme files (common across document types) + "theme": "ISO-IEC29500-4_2016/dml-main.xsd", + # Drawing and media files + "drawing": "ISO-IEC29500-4_2016/dml-main.xsd", + } + + # Unified namespace constants + MC_NAMESPACE = "http://schemas.openxmlformats.org/markup-compatibility/2006" + XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace" + + # Common OOXML namespaces used across validators + PACKAGE_RELATIONSHIPS_NAMESPACE = ( + "http://schemas.openxmlformats.org/package/2006/relationships" + ) + OFFICE_RELATIONSHIPS_NAMESPACE = ( + "http://schemas.openxmlformats.org/officeDocument/2006/relationships" + ) + CONTENT_TYPES_NAMESPACE = ( + 
"http://schemas.openxmlformats.org/package/2006/content-types" + ) + + # Folders where we should clean ignorable namespaces + MAIN_CONTENT_FOLDERS = {"word", "ppt", "xl"} + + # All allowed OOXML namespaces (superset of all document types) + OOXML_NAMESPACES = { + "http://schemas.openxmlformats.org/officeDocument/2006/math", + "http://schemas.openxmlformats.org/officeDocument/2006/relationships", + "http://schemas.openxmlformats.org/schemaLibrary/2006/main", + "http://schemas.openxmlformats.org/drawingml/2006/main", + "http://schemas.openxmlformats.org/drawingml/2006/chart", + "http://schemas.openxmlformats.org/drawingml/2006/chartDrawing", + "http://schemas.openxmlformats.org/drawingml/2006/diagram", + "http://schemas.openxmlformats.org/drawingml/2006/picture", + "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing", + "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing", + "http://schemas.openxmlformats.org/wordprocessingml/2006/main", + "http://schemas.openxmlformats.org/presentationml/2006/main", + "http://schemas.openxmlformats.org/spreadsheetml/2006/main", + "http://schemas.openxmlformats.org/officeDocument/2006/sharedTypes", + "http://www.w3.org/XML/1998/namespace", + } + + def __init__(self, unpacked_dir, original_file, verbose=False): + self.unpacked_dir = Path(unpacked_dir).resolve() + self.original_file = Path(original_file) + self.verbose = verbose + + # Set schemas directory + self.schemas_dir = Path(__file__).parent.parent.parent / "schemas" + + # Get all XML and .rels files + patterns = ["*.xml", "*.rels"] + self.xml_files = [ + f for pattern in patterns for f in self.unpacked_dir.rglob(pattern) + ] + + if not self.xml_files: + print(f"Warning: No XML files found in {self.unpacked_dir}") + + def validate(self): + """Run all validation checks and return True if all pass.""" + raise NotImplementedError("Subclasses must implement the validate method") + + def validate_xml(self): + """Validate that all XML files are well-formed.""" + errors = [] + + for xml_file in self.xml_files: + try: + # Try to parse the XML file + lxml.etree.parse(str(xml_file)) + except lxml.etree.XMLSyntaxError as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {e.lineno}: {e.msg}" + ) + except Exception as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Unexpected error: {str(e)}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} XML violations:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All XML files are well-formed") + return True + + def validate_namespaces(self): + """Validate that namespace prefixes in Ignorable attributes are declared.""" + errors = [] + + for xml_file in self.xml_files: + try: + root = lxml.etree.parse(str(xml_file)).getroot() + declared = set(root.nsmap.keys()) - {None} # Exclude default namespace + + for attr_val in [ + v for k, v in root.attrib.items() if k.endswith("Ignorable") + ]: + undeclared = set(attr_val.split()) - declared + errors.extend( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Namespace '{ns}' in Ignorable but not declared" + for ns in undeclared + ) + except lxml.etree.XMLSyntaxError: + continue + + if errors: + print(f"FAILED - {len(errors)} namespace issues:") + for error in errors: + print(error) + return False + if self.verbose: + print("PASSED - All namespace prefixes properly declared") + return True + + def validate_unique_ids(self): + """Validate that specific IDs are unique according to 
OOXML requirements.""" + errors = [] + global_ids = {} # Track globally unique IDs across all files + + for xml_file in self.xml_files: + try: + root = lxml.etree.parse(str(xml_file)).getroot() + file_ids = {} # Track IDs that must be unique within this file + + # Remove all mc:AlternateContent elements from the tree + mc_elements = root.xpath( + ".//mc:AlternateContent", namespaces={"mc": self.MC_NAMESPACE} + ) + for elem in mc_elements: + elem.getparent().remove(elem) + + # Now check IDs in the cleaned tree + for elem in root.iter(): + # Get the element name without namespace + tag = ( + elem.tag.split("}")[-1].lower() + if "}" in elem.tag + else elem.tag.lower() + ) + + # Check if this element type has ID uniqueness requirements + if tag in self.UNIQUE_ID_REQUIREMENTS: + # Skip if element is inside an excluded container + # (e.g., inside is a reference, not a definition) + in_excluded_container = any( + ancestor.tag.split("}")[-1].lower() + in self.EXCLUDED_ID_CONTAINERS + for ancestor in elem.iterancestors() + ) + if in_excluded_container: + continue + + attr_name, scope = self.UNIQUE_ID_REQUIREMENTS[tag] + + # Look for the specified attribute + id_value = None + for attr, value in elem.attrib.items(): + attr_local = ( + attr.split("}")[-1].lower() + if "}" in attr + else attr.lower() + ) + if attr_local == attr_name: + id_value = value + break + + if id_value is not None: + if scope == "global": + # Check global uniqueness + if id_value in global_ids: + prev_file, prev_line, prev_tag = global_ids[ + id_value + ] + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: Global ID '{id_value}' in <{tag}> " + f"already used in {prev_file} at line {prev_line} in <{prev_tag}>" + ) + else: + global_ids[id_value] = ( + xml_file.relative_to(self.unpacked_dir), + elem.sourceline, + tag, + ) + elif scope == "file": + # Check file-level uniqueness + key = (tag, attr_name) + if key not in file_ids: + file_ids[key] = {} + + if id_value in file_ids[key]: + prev_line = file_ids[key][id_value] + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: Duplicate {attr_name}='{id_value}' in <{tag}> " + f"(first occurrence at line {prev_line})" + ) + else: + file_ids[key][id_value] = elem.sourceline + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} ID uniqueness violations:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All required IDs are unique") + return True + + def validate_file_references(self): + """ + Validate that all .rels files properly reference files and that all files are referenced. 
+ """ + errors = [] + + # Find all .rels files + rels_files = list(self.unpacked_dir.rglob("*.rels")) + + if not rels_files: + if self.verbose: + print("PASSED - No .rels files found") + return True + + # Get all files in the unpacked directory (excluding reference files) + all_files = [] + for file_path in self.unpacked_dir.rglob("*"): + if ( + file_path.is_file() + and file_path.name != "[Content_Types].xml" + and not file_path.name.endswith(".rels") + ): # This file is not referenced by .rels + all_files.append(file_path.resolve()) + + # Track all files that are referenced by any .rels file + all_referenced_files = set() + + if self.verbose: + print( + f"Found {len(rels_files)} .rels files and {len(all_files)} target files" + ) + + # Check each .rels file + for rels_file in rels_files: + try: + # Parse relationships file + rels_root = lxml.etree.parse(str(rels_file)).getroot() + + # Get the directory where this .rels file is located + rels_dir = rels_file.parent + + # Find all relationships and their targets + referenced_files = set() + broken_refs = [] + + for rel in rels_root.findall( + ".//ns:Relationship", + namespaces={"ns": self.PACKAGE_RELATIONSHIPS_NAMESPACE}, + ): + target = rel.get("Target") + if target and not target.startswith( + ("http", "mailto:") + ): # Skip external URLs + # Resolve the target path relative to the .rels file location + if rels_file.name == ".rels": + # Root .rels file - targets are relative to unpacked_dir + target_path = self.unpacked_dir / target + else: + # Other .rels files - targets are relative to their parent's parent + # e.g., word/_rels/document.xml.rels -> targets relative to word/ + base_dir = rels_dir.parent + target_path = base_dir / target + + # Normalize the path and check if it exists + try: + target_path = target_path.resolve() + if target_path.exists() and target_path.is_file(): + referenced_files.add(target_path) + all_referenced_files.add(target_path) + else: + broken_refs.append((target, rel.sourceline)) + except (OSError, ValueError): + broken_refs.append((target, rel.sourceline)) + + # Report broken references + if broken_refs: + rel_path = rels_file.relative_to(self.unpacked_dir) + for broken_ref, line_num in broken_refs: + errors.append( + f" {rel_path}: Line {line_num}: Broken reference to {broken_ref}" + ) + + except Exception as e: + rel_path = rels_file.relative_to(self.unpacked_dir) + errors.append(f" Error parsing {rel_path}: {e}") + + # Check for unreferenced files (files that exist but are not referenced anywhere) + unreferenced_files = set(all_files) - all_referenced_files + + if unreferenced_files: + for unref_file in sorted(unreferenced_files): + unref_rel_path = unref_file.relative_to(self.unpacked_dir) + errors.append(f" Unreferenced file: {unref_rel_path}") + + if errors: + print(f"FAILED - Found {len(errors)} relationship validation errors:") + for error in errors: + print(error) + print( + "CRITICAL: These errors will cause the document to appear corrupt. " + + "Broken references MUST be fixed, " + + "and unreferenced files MUST be referenced or removed." + ) + return False + else: + if self.verbose: + print( + "PASSED - All references are valid and all files are properly referenced" + ) + return True + + def validate_all_relationship_ids(self): + """ + Validate that all r:id attributes in XML files reference existing IDs + in their corresponding .rels files, and optionally validate relationship types. 
+ """ + import lxml.etree + + errors = [] + + # Process each XML file that might contain r:id references + for xml_file in self.xml_files: + # Skip .rels files themselves + if xml_file.suffix == ".rels": + continue + + # Determine the corresponding .rels file + # For dir/file.xml, it's dir/_rels/file.xml.rels + rels_dir = xml_file.parent / "_rels" + rels_file = rels_dir / f"{xml_file.name}.rels" + + # Skip if there's no corresponding .rels file (that's okay) + if not rels_file.exists(): + continue + + try: + # Parse the .rels file to get valid relationship IDs and their types + rels_root = lxml.etree.parse(str(rels_file)).getroot() + rid_to_type = {} + + for rel in rels_root.findall( + f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship" + ): + rid = rel.get("Id") + rel_type = rel.get("Type", "") + if rid: + # Check for duplicate rIds + if rid in rid_to_type: + rels_rel_path = rels_file.relative_to(self.unpacked_dir) + errors.append( + f" {rels_rel_path}: Line {rel.sourceline}: " + f"Duplicate relationship ID '{rid}' (IDs must be unique)" + ) + # Extract just the type name from the full URL + type_name = ( + rel_type.split("/")[-1] if "/" in rel_type else rel_type + ) + rid_to_type[rid] = type_name + + # Parse the XML file to find all r:id references + xml_root = lxml.etree.parse(str(xml_file)).getroot() + + # Find all elements with r:id attributes + for elem in xml_root.iter(): + # Check for r:id attribute (relationship ID) + rid_attr = elem.get(f"{{{self.OFFICE_RELATIONSHIPS_NAMESPACE}}}id") + if rid_attr: + xml_rel_path = xml_file.relative_to(self.unpacked_dir) + elem_name = ( + elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag + ) + + # Check if the ID exists + if rid_attr not in rid_to_type: + errors.append( + f" {xml_rel_path}: Line {elem.sourceline}: " + f"<{elem_name}> references non-existent relationship '{rid_attr}' " + f"(valid IDs: {', '.join(sorted(rid_to_type.keys())[:5])}{'...' if len(rid_to_type) > 5 else ''})" + ) + # Check if we have type expectations for this element + elif self.ELEMENT_RELATIONSHIP_TYPES: + expected_type = self._get_expected_relationship_type( + elem_name + ) + if expected_type: + actual_type = rid_to_type[rid_attr] + # Check if the actual type matches or contains the expected type + if expected_type not in actual_type.lower(): + errors.append( + f" {xml_rel_path}: Line {elem.sourceline}: " + f"<{elem_name}> references '{rid_attr}' which points to '{actual_type}' " + f"but should point to a '{expected_type}' relationship" + ) + + except Exception as e: + xml_rel_path = xml_file.relative_to(self.unpacked_dir) + errors.append(f" Error processing {xml_rel_path}: {e}") + + if errors: + print(f"FAILED - Found {len(errors)} relationship ID reference errors:") + for error in errors: + print(error) + print("\nThese ID mismatches will cause the document to appear corrupt!") + return False + else: + if self.verbose: + print("PASSED - All relationship ID references are valid") + return True + + def _get_expected_relationship_type(self, element_name): + """ + Get the expected relationship type for an element. + First checks the explicit mapping, then tries pattern detection. 
+ """ + # Normalize element name to lowercase + elem_lower = element_name.lower() + + # Check explicit mapping first + if elem_lower in self.ELEMENT_RELATIONSHIP_TYPES: + return self.ELEMENT_RELATIONSHIP_TYPES[elem_lower] + + # Try pattern detection for common patterns + # Pattern 1: Elements ending in "Id" often expect a relationship of the prefix type + if elem_lower.endswith("id") and len(elem_lower) > 2: + # e.g., "sldId" -> "sld", "sldMasterId" -> "sldMaster" + prefix = elem_lower[:-2] # Remove "id" + # Check if this might be a compound like "sldMasterId" + if prefix.endswith("master"): + return prefix.lower() + elif prefix.endswith("layout"): + return prefix.lower() + else: + # Simple case like "sldId" -> "slide" + # Common transformations + if prefix == "sld": + return "slide" + return prefix.lower() + + # Pattern 2: Elements ending in "Reference" expect a relationship of the prefix type + if elem_lower.endswith("reference") and len(elem_lower) > 9: + prefix = elem_lower[:-9] # Remove "reference" + return prefix.lower() + + return None + + def validate_content_types(self): + """Validate that all content files are properly declared in [Content_Types].xml.""" + errors = [] + + # Find [Content_Types].xml file + content_types_file = self.unpacked_dir / "[Content_Types].xml" + if not content_types_file.exists(): + print("FAILED - [Content_Types].xml file not found") + return False + + try: + # Parse and get all declared parts and extensions + root = lxml.etree.parse(str(content_types_file)).getroot() + declared_parts = set() + declared_extensions = set() + + # Get Override declarations (specific files) + for override in root.findall( + f".//{{{self.CONTENT_TYPES_NAMESPACE}}}Override" + ): + part_name = override.get("PartName") + if part_name is not None: + declared_parts.add(part_name.lstrip("/")) + + # Get Default declarations (by extension) + for default in root.findall( + f".//{{{self.CONTENT_TYPES_NAMESPACE}}}Default" + ): + extension = default.get("Extension") + if extension is not None: + declared_extensions.add(extension.lower()) + + # Root elements that require content type declaration + declarable_roots = { + "sld", + "sldLayout", + "sldMaster", + "presentation", # PowerPoint + "document", # Word + "workbook", + "worksheet", # Excel + "theme", # Common + } + + # Common media file extensions that should be declared + media_extensions = { + "png": "image/png", + "jpg": "image/jpeg", + "jpeg": "image/jpeg", + "gif": "image/gif", + "bmp": "image/bmp", + "tiff": "image/tiff", + "wmf": "image/x-wmf", + "emf": "image/x-emf", + } + + # Get all files in the unpacked directory + all_files = list(self.unpacked_dir.rglob("*")) + all_files = [f for f in all_files if f.is_file()] + + # Check all XML files for Override declarations + for xml_file in self.xml_files: + path_str = str(xml_file.relative_to(self.unpacked_dir)).replace( + "\\", "/" + ) + + # Skip non-content files + if any( + skip in path_str + for skip in [".rels", "[Content_Types]", "docProps/", "_rels/"] + ): + continue + + try: + root_tag = lxml.etree.parse(str(xml_file)).getroot().tag + root_name = root_tag.split("}")[-1] if "}" in root_tag else root_tag + + if root_name in declarable_roots and path_str not in declared_parts: + errors.append( + f" {path_str}: File with <{root_name}> root not declared in [Content_Types].xml" + ) + + except Exception: + continue # Skip unparseable files + + # Check all non-XML files for Default extension declarations + for file_path in all_files: + # Skip XML files and metadata files (already 
checked above) + if file_path.suffix.lower() in {".xml", ".rels"}: + continue + if file_path.name == "[Content_Types].xml": + continue + if "_rels" in file_path.parts or "docProps" in file_path.parts: + continue + + extension = file_path.suffix.lstrip(".").lower() + if extension and extension not in declared_extensions: + # Check if it's a known media extension that should be declared + if extension in media_extensions: + relative_path = file_path.relative_to(self.unpacked_dir) + errors.append( + f' {relative_path}: File with extension \'{extension}\' not declared in [Content_Types].xml - should add: ' + ) + + except Exception as e: + errors.append(f" Error parsing [Content_Types].xml: {e}") + + if errors: + print(f"FAILED - Found {len(errors)} content type declaration errors:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print( + "PASSED - All content files are properly declared in [Content_Types].xml" + ) + return True + + def validate_file_against_xsd(self, xml_file, verbose=False): + """Validate a single XML file against XSD schema, comparing with original. + + Args: + xml_file: Path to XML file to validate + verbose: Enable verbose output + + Returns: + tuple: (is_valid, new_errors_set) where is_valid is True/False/None (skipped) + """ + # Resolve both paths to handle symlinks + xml_file = Path(xml_file).resolve() + unpacked_dir = self.unpacked_dir.resolve() + + # Validate current file + is_valid, current_errors = self._validate_single_file_xsd( + xml_file, unpacked_dir + ) + + if is_valid is None: + return None, set() # Skipped + elif is_valid: + return True, set() # Valid, no errors + + # Get errors from original file for this specific file + original_errors = self._get_original_file_errors(xml_file) + + # Compare with original (both are guaranteed to be sets here) + assert current_errors is not None + new_errors = current_errors - original_errors + + if new_errors: + if verbose: + relative_path = xml_file.relative_to(unpacked_dir) + print(f"FAILED - {relative_path}: {len(new_errors)} new error(s)") + for error in list(new_errors)[:3]: + truncated = error[:250] + "..." if len(error) > 250 else error + print(f" - {truncated}") + return False, new_errors + else: + # All errors existed in original + if verbose: + print( + f"PASSED - No new errors (original had {len(current_errors)} errors)" + ) + return True, set() + + def validate_against_xsd(self): + """Validate XML files against XSD schemas, showing only new errors compared to original.""" + new_errors = [] + original_error_count = 0 + valid_count = 0 + skipped_count = 0 + + for xml_file in self.xml_files: + relative_path = str(xml_file.relative_to(self.unpacked_dir)) + is_valid, new_file_errors = self.validate_file_against_xsd( + xml_file, verbose=False + ) + + if is_valid is None: + skipped_count += 1 + continue + elif is_valid and not new_file_errors: + valid_count += 1 + continue + elif is_valid: + # Had errors but all existed in original + original_error_count += 1 + valid_count += 1 + continue + + # Has new errors + new_errors.append(f" {relative_path}: {len(new_file_errors)} new error(s)") + for error in list(new_file_errors)[:3]: # Show first 3 errors + new_errors.append( + f" - {error[:250]}..." 
if len(error) > 250 else f" - {error}" + ) + + # Print summary + if self.verbose: + print(f"Validated {len(self.xml_files)} files:") + print(f" - Valid: {valid_count}") + print(f" - Skipped (no schema): {skipped_count}") + if original_error_count: + print(f" - With original errors (ignored): {original_error_count}") + print( + f" - With NEW errors: {len(new_errors) > 0 and len([e for e in new_errors if not e.startswith(' ')]) or 0}" + ) + + if new_errors: + print("\nFAILED - Found NEW validation errors:") + for error in new_errors: + print(error) + return False + else: + if self.verbose: + print("\nPASSED - No new XSD validation errors introduced") + return True + + def _get_schema_path(self, xml_file): + """Determine the appropriate schema path for an XML file.""" + # Check exact filename match + if xml_file.name in self.SCHEMA_MAPPINGS: + return self.schemas_dir / self.SCHEMA_MAPPINGS[xml_file.name] + + # Check .rels files + if xml_file.suffix == ".rels": + return self.schemas_dir / self.SCHEMA_MAPPINGS[".rels"] + + # Check chart files + if "charts/" in str(xml_file) and xml_file.name.startswith("chart"): + return self.schemas_dir / self.SCHEMA_MAPPINGS["chart"] + + # Check theme files + if "theme/" in str(xml_file) and xml_file.name.startswith("theme"): + return self.schemas_dir / self.SCHEMA_MAPPINGS["theme"] + + # Check if file is in a main content folder and use appropriate schema + if xml_file.parent.name in self.MAIN_CONTENT_FOLDERS: + return self.schemas_dir / self.SCHEMA_MAPPINGS[xml_file.parent.name] + + return None + + def _clean_ignorable_namespaces(self, xml_doc): + """Remove attributes and elements not in allowed namespaces.""" + # Create a clean copy + xml_string = lxml.etree.tostring(xml_doc, encoding="unicode") + xml_copy = lxml.etree.fromstring(xml_string) + + # Remove attributes not in allowed namespaces + for elem in xml_copy.iter(): + attrs_to_remove = [] + + for attr in elem.attrib: + # Check if attribute is from a namespace other than allowed ones + if "{" in attr: + ns = attr.split("}")[0][1:] + if ns not in self.OOXML_NAMESPACES: + attrs_to_remove.append(attr) + + # Remove collected attributes + for attr in attrs_to_remove: + del elem.attrib[attr] + + # Remove elements not in allowed namespaces + self._remove_ignorable_elements(xml_copy) + + return lxml.etree.ElementTree(xml_copy) + + def _remove_ignorable_elements(self, root): + """Recursively remove all elements not in allowed namespaces.""" + elements_to_remove = [] + + # Find elements to remove + for elem in list(root): + # Skip non-element nodes (comments, processing instructions, etc.) + if not hasattr(elem, "tag") or callable(elem.tag): + continue + + tag_str = str(elem.tag) + if tag_str.startswith("{"): + ns = tag_str.split("}")[0][1:] + if ns not in self.OOXML_NAMESPACES: + elements_to_remove.append(elem) + continue + + # Recursively clean child elements + self._remove_ignorable_elements(elem) + + # Remove collected elements + for elem in elements_to_remove: + root.remove(elem) + + def _preprocess_for_mc_ignorable(self, xml_doc): + """Preprocess XML to handle mc:Ignorable attribute properly.""" + # Remove mc:Ignorable attributes before validation + root = xml_doc.getroot() + + # Remove mc:Ignorable attribute from root + if f"{{{self.MC_NAMESPACE}}}Ignorable" in root.attrib: + del root.attrib[f"{{{self.MC_NAMESPACE}}}Ignorable"] + + return xml_doc + + def _validate_single_file_xsd(self, xml_file, base_path): + """Validate a single XML file against XSD schema. 
Returns (is_valid, errors_set).""" + schema_path = self._get_schema_path(xml_file) + if not schema_path: + return None, None # Skip file + + try: + # Load schema + with open(schema_path, "rb") as xsd_file: + parser = lxml.etree.XMLParser() + xsd_doc = lxml.etree.parse( + xsd_file, parser=parser, base_url=str(schema_path) + ) + schema = lxml.etree.XMLSchema(xsd_doc) + + # Load and preprocess XML + with open(xml_file, "r") as f: + xml_doc = lxml.etree.parse(f) + + xml_doc, _ = self._remove_template_tags_from_text_nodes(xml_doc) + xml_doc = self._preprocess_for_mc_ignorable(xml_doc) + + # Clean ignorable namespaces if needed + relative_path = xml_file.relative_to(base_path) + if ( + relative_path.parts + and relative_path.parts[0] in self.MAIN_CONTENT_FOLDERS + ): + xml_doc = self._clean_ignorable_namespaces(xml_doc) + + # Validate + if schema.validate(xml_doc): + return True, set() + else: + errors = set() + for error in schema.error_log: + # Store normalized error message (without line numbers for comparison) + errors.add(error.message) + return False, errors + + except Exception as e: + return False, {str(e)} + + def _get_original_file_errors(self, xml_file): + """Get XSD validation errors from a single file in the original document. + + Args: + xml_file: Path to the XML file in unpacked_dir to check + + Returns: + set: Set of error messages from the original file + """ + import tempfile + import zipfile + + # Resolve both paths to handle symlinks (e.g., /var vs /private/var on macOS) + xml_file = Path(xml_file).resolve() + unpacked_dir = self.unpacked_dir.resolve() + relative_path = xml_file.relative_to(unpacked_dir) + + with tempfile.TemporaryDirectory() as temp_dir: + temp_path = Path(temp_dir) + + # Extract original file + with zipfile.ZipFile(self.original_file, "r") as zip_ref: + zip_ref.extractall(temp_path) + + # Find corresponding file in original + original_xml_file = temp_path / relative_path + + if not original_xml_file.exists(): + # File didn't exist in original, so no original errors + return set() + + # Validate the specific file in original + is_valid, errors = self._validate_single_file_xsd( + original_xml_file, temp_path + ) + return errors if errors else set() + + def _remove_template_tags_from_text_nodes(self, xml_doc): + """Remove template tags from XML text nodes and collect warnings. + + Template tags follow the pattern {{ ... }} and are used as placeholders + for content replacement. They should be removed from text content before + XSD validation while preserving XML structure. 
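+        For example, a text node containing {{customer_name}} has the tag
+        stripped before validation and a warning is recorded (the placeholder
+        name here is illustrative, not taken from a real template).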
+ + Returns: + tuple: (cleaned_xml_doc, warnings_list) + """ + warnings = [] + template_pattern = re.compile(r"\{\{[^}]*\}\}") + + # Create a copy of the document to avoid modifying the original + xml_string = lxml.etree.tostring(xml_doc, encoding="unicode") + xml_copy = lxml.etree.fromstring(xml_string) + + def process_text_content(text, content_type): + if not text: + return text + matches = list(template_pattern.finditer(text)) + if matches: + for match in matches: + warnings.append( + f"Found template tag in {content_type}: {match.group()}" + ) + return template_pattern.sub("", text) + return text + + # Process all text nodes in the document + for elem in xml_copy.iter(): + # Skip processing if this is a w:t element + if not hasattr(elem, "tag") or callable(elem.tag): + continue + tag_str = str(elem.tag) + if tag_str.endswith("}t") or tag_str == "t": + continue + + elem.text = process_text_content(elem.text, "text content") + elem.tail = process_text_content(elem.tail, "tail content") + + return lxml.etree.ElementTree(xml_copy), warnings + + +if __name__ == "__main__": + raise RuntimeError("This module should not be run directly.") diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/docx.py b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/docx.py new file mode 100644 index 00000000..ead1f9f6 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/docx.py @@ -0,0 +1,273 @@ +""" +Validator for Word document XML files against XSD schemas. +""" + +import re +import tempfile +import zipfile + +import lxml.etree + +from .base import BaseSchemaValidator + + +class DOCXSchemaValidator(BaseSchemaValidator): + """Validator for Word document XML files against XSD schemas.""" + + # Word-specific namespace + WORD_2006_NAMESPACE = "http://schemas.openxmlformats.org/wordprocessingml/2006/main" + + # Word-specific element to relationship type mappings + # Start with empty mapping - add specific cases as we discover them + ELEMENT_RELATIONSHIP_TYPES = {} + + def validate(self): + """Run all validation checks and return True if all pass.""" + # Test 0: XML well-formedness + if not self.validate_xml(): + return False + + # Test 1: Namespace declarations + all_valid = True + if not self.validate_namespaces(): + all_valid = False + + # Test 2: Unique IDs + if not self.validate_unique_ids(): + all_valid = False + + # Test 3: Relationship and file reference validation + if not self.validate_file_references(): + all_valid = False + + # Test 4: Content type declarations + if not self.validate_content_types(): + all_valid = False + + # Test 5: XSD schema validation + if not self.validate_against_xsd(): + all_valid = False + + # Test 6: Whitespace preservation + if not self.validate_whitespace_preservation(): + all_valid = False + + # Test 7: Deletion validation + if not self.validate_deletions(): + all_valid = False + + # Test 8: Insertion validation + if not self.validate_insertions(): + all_valid = False + + # Test 9: Relationship ID reference validation + if not self.validate_all_relationship_ids(): + all_valid = False + + # Count and compare paragraphs + self.compare_paragraph_counts() + + return all_valid + + def validate_whitespace_preservation(self): + """ + Validate that w:t elements with whitespace have xml:space='preserve'. 
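+
+        For example, <w:t>Hello </w:t> can silently lose its trailing space when
+        the document is rewritten by Word; it should be serialized as
+        <w:t xml:space="preserve">Hello </w:t> (illustrative snippet).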
+ """ + errors = [] + + for xml_file in self.xml_files: + # Only check document.xml files + if xml_file.name != "document.xml": + continue + + try: + root = lxml.etree.parse(str(xml_file)).getroot() + + # Find all w:t elements + for elem in root.iter(f"{{{self.WORD_2006_NAMESPACE}}}t"): + if elem.text: + text = elem.text + # Check if text starts or ends with whitespace + if re.match(r"^\s.*", text) or re.match(r".*\s$", text): + # Check if xml:space="preserve" attribute exists + xml_space_attr = f"{{{self.XML_NAMESPACE}}}space" + if ( + xml_space_attr not in elem.attrib + or elem.attrib[xml_space_attr] != "preserve" + ): + # Show a preview of the text + text_preview = ( + repr(text)[:50] + "..." + if len(repr(text)) > 50 + else repr(text) + ) + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: w:t element with whitespace missing xml:space='preserve': {text_preview}" + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} whitespace preservation violations:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All whitespace is properly preserved") + return True + + def validate_deletions(self): + """ + Validate that w:t elements are not within w:del elements. + For some reason, XSD validation does not catch this, so we do it manually. + """ + errors = [] + + for xml_file in self.xml_files: + # Only check document.xml files + if xml_file.name != "document.xml": + continue + + try: + root = lxml.etree.parse(str(xml_file)).getroot() + + # Find all w:t elements that are descendants of w:del elements + namespaces = {"w": self.WORD_2006_NAMESPACE} + xpath_expression = ".//w:del//w:t" + problematic_t_elements = root.xpath( + xpath_expression, namespaces=namespaces + ) + for t_elem in problematic_t_elements: + if t_elem.text: + # Show a preview of the text + text_preview = ( + repr(t_elem.text)[:50] + "..." 
+ if len(repr(t_elem.text)) > 50 + else repr(t_elem.text) + ) + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {t_elem.sourceline}: found within : {text_preview}" + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} deletion validation violations:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - No w:t elements found within w:del elements") + return True + + def count_paragraphs_in_unpacked(self): + """Count the number of paragraphs in the unpacked document.""" + count = 0 + + for xml_file in self.xml_files: + # Only check document.xml files + if xml_file.name != "document.xml": + continue + + try: + root = lxml.etree.parse(str(xml_file)).getroot() + # Count all w:p elements + paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p") + count = len(paragraphs) + except Exception as e: + print(f"Error counting paragraphs in unpacked document: {e}") + + return count + + def count_paragraphs_in_original(self): + """Count the number of paragraphs in the original docx file.""" + count = 0 + + try: + # Create temporary directory to unpack original + with tempfile.TemporaryDirectory() as temp_dir: + # Unpack original docx + with zipfile.ZipFile(self.original_file, "r") as zip_ref: + zip_ref.extractall(temp_dir) + + # Parse document.xml + doc_xml_path = temp_dir + "/word/document.xml" + root = lxml.etree.parse(doc_xml_path).getroot() + + # Count all w:p elements + paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p") + count = len(paragraphs) + + except Exception as e: + print(f"Error counting paragraphs in original document: {e}") + + return count + + def validate_insertions(self): + """ + Validate that w:delText elements are not within w:ins elements. + w:delText is only allowed in w:ins if nested within a w:del. + """ + errors = [] + + for xml_file in self.xml_files: + if xml_file.name != "document.xml": + continue + + try: + root = lxml.etree.parse(str(xml_file)).getroot() + namespaces = {"w": self.WORD_2006_NAMESPACE} + + # Find w:delText in w:ins that are NOT within w:del + invalid_elements = root.xpath( + ".//w:ins//w:delText[not(ancestor::w:del)]", namespaces=namespaces + ) + + for elem in invalid_elements: + text_preview = ( + repr(elem.text or "")[:50] + "..." 
+ if len(repr(elem.text or "")) > 50 + else repr(elem.text or "") + ) + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: within : {text_preview}" + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} insertion validation violations:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - No w:delText elements within w:ins elements") + return True + + def compare_paragraph_counts(self): + """Compare paragraph counts between original and new document.""" + original_count = self.count_paragraphs_in_original() + new_count = self.count_paragraphs_in_unpacked() + + diff = new_count - original_count + diff_str = f"+{diff}" if diff > 0 else str(diff) + print(f"\nParagraphs: {original_count} → {new_count} ({diff_str})") + + +if __name__ == "__main__": + raise RuntimeError("This module should not be run directly.") diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/pptx.py b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/pptx.py new file mode 100644 index 00000000..66d5b1e2 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/pptx.py @@ -0,0 +1,315 @@ +""" +Validator for PowerPoint presentation XML files against XSD schemas. +""" + +import re + +from .base import BaseSchemaValidator + + +class PPTXSchemaValidator(BaseSchemaValidator): + """Validator for PowerPoint presentation XML files against XSD schemas.""" + + # PowerPoint presentation namespace + PRESENTATIONML_NAMESPACE = ( + "http://schemas.openxmlformats.org/presentationml/2006/main" + ) + + # PowerPoint-specific element to relationship type mappings + ELEMENT_RELATIONSHIP_TYPES = { + "sldid": "slide", + "sldmasterid": "slidemaster", + "notesmasterid": "notesmaster", + "sldlayoutid": "slidelayout", + "themeid": "theme", + "tablestyleid": "tablestyles", + } + + def validate(self): + """Run all validation checks and return True if all pass.""" + # Test 0: XML well-formedness + if not self.validate_xml(): + return False + + # Test 1: Namespace declarations + all_valid = True + if not self.validate_namespaces(): + all_valid = False + + # Test 2: Unique IDs + if not self.validate_unique_ids(): + all_valid = False + + # Test 3: UUID ID validation + if not self.validate_uuid_ids(): + all_valid = False + + # Test 4: Relationship and file reference validation + if not self.validate_file_references(): + all_valid = False + + # Test 5: Slide layout ID validation + if not self.validate_slide_layout_ids(): + all_valid = False + + # Test 6: Content type declarations + if not self.validate_content_types(): + all_valid = False + + # Test 7: XSD schema validation + if not self.validate_against_xsd(): + all_valid = False + + # Test 8: Notes slide reference validation + if not self.validate_notes_slide_references(): + all_valid = False + + # Test 9: Relationship ID reference validation + if not self.validate_all_relationship_ids(): + all_valid = False + + # Test 10: Duplicate slide layout references validation + if not self.validate_no_duplicate_slide_layouts(): + all_valid = False + + return all_valid + + def validate_uuid_ids(self): + """Validate that ID attributes that look like UUIDs contain only hex values.""" + import lxml.etree + + errors = [] + # UUID pattern: 8-4-4-4-12 hex digits with optional braces/hyphens + uuid_pattern = re.compile( + 
r"^[\{\(]?[0-9A-Fa-f]{8}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{12}[\}\)]?$" + ) + + for xml_file in self.xml_files: + try: + root = lxml.etree.parse(str(xml_file)).getroot() + + # Check all elements for ID attributes + for elem in root.iter(): + for attr, value in elem.attrib.items(): + # Check if this is an ID attribute + attr_name = attr.split("}")[-1].lower() + if attr_name == "id" or attr_name.endswith("id"): + # Check if value looks like a UUID (has the right length and pattern structure) + if self._looks_like_uuid(value): + # Validate that it contains only hex characters in the right positions + if not uuid_pattern.match(value): + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: " + f"Line {elem.sourceline}: ID '{value}' appears to be a UUID but contains invalid hex characters" + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} UUID ID validation errors:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All UUID-like IDs contain valid hex values") + return True + + def _looks_like_uuid(self, value): + """Check if a value has the general structure of a UUID.""" + # Remove common UUID delimiters + clean_value = value.strip("{}()").replace("-", "") + # Check if it's 32 hex-like characters (could include invalid hex chars) + return len(clean_value) == 32 and all(c.isalnum() for c in clean_value) + + def validate_slide_layout_ids(self): + """Validate that sldLayoutId elements in slide masters reference valid slide layouts.""" + import lxml.etree + + errors = [] + + # Find all slide master files + slide_masters = list(self.unpacked_dir.glob("ppt/slideMasters/*.xml")) + + if not slide_masters: + if self.verbose: + print("PASSED - No slide masters found") + return True + + for slide_master in slide_masters: + try: + # Parse the slide master file + root = lxml.etree.parse(str(slide_master)).getroot() + + # Find the corresponding _rels file for this slide master + rels_file = slide_master.parent / "_rels" / f"{slide_master.name}.rels" + + if not rels_file.exists(): + errors.append( + f" {slide_master.relative_to(self.unpacked_dir)}: " + f"Missing relationships file: {rels_file.relative_to(self.unpacked_dir)}" + ) + continue + + # Parse the relationships file + rels_root = lxml.etree.parse(str(rels_file)).getroot() + + # Build a set of valid relationship IDs that point to slide layouts + valid_layout_rids = set() + for rel in rels_root.findall( + f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship" + ): + rel_type = rel.get("Type", "") + if "slideLayout" in rel_type: + valid_layout_rids.add(rel.get("Id")) + + # Find all sldLayoutId elements in the slide master + for sld_layout_id in root.findall( + f".//{{{self.PRESENTATIONML_NAMESPACE}}}sldLayoutId" + ): + r_id = sld_layout_id.get( + f"{{{self.OFFICE_RELATIONSHIPS_NAMESPACE}}}id" + ) + layout_id = sld_layout_id.get("id") + + if r_id and r_id not in valid_layout_rids: + errors.append( + f" {slide_master.relative_to(self.unpacked_dir)}: " + f"Line {sld_layout_id.sourceline}: sldLayoutId with id='{layout_id}' " + f"references r:id='{r_id}' which is not found in slide layout relationships" + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {slide_master.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print(f"FAILED - Found {len(errors)} slide layout ID validation 
errors:") + for error in errors: + print(error) + print( + "Remove invalid references or add missing slide layouts to the relationships file." + ) + return False + else: + if self.verbose: + print("PASSED - All slide layout IDs reference valid slide layouts") + return True + + def validate_no_duplicate_slide_layouts(self): + """Validate that each slide has exactly one slideLayout reference.""" + import lxml.etree + + errors = [] + slide_rels_files = list(self.unpacked_dir.glob("ppt/slides/_rels/*.xml.rels")) + + for rels_file in slide_rels_files: + try: + root = lxml.etree.parse(str(rels_file)).getroot() + + # Find all slideLayout relationships + layout_rels = [ + rel + for rel in root.findall( + f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship" + ) + if "slideLayout" in rel.get("Type", "") + ] + + if len(layout_rels) > 1: + errors.append( + f" {rels_file.relative_to(self.unpacked_dir)}: has {len(layout_rels)} slideLayout references" + ) + + except Exception as e: + errors.append( + f" {rels_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + if errors: + print("FAILED - Found slides with duplicate slideLayout references:") + for error in errors: + print(error) + return False + else: + if self.verbose: + print("PASSED - All slides have exactly one slideLayout reference") + return True + + def validate_notes_slide_references(self): + """Validate that each notesSlide file is referenced by only one slide.""" + import lxml.etree + + errors = [] + notes_slide_references = {} # Track which slides reference each notesSlide + + # Find all slide relationship files + slide_rels_files = list(self.unpacked_dir.glob("ppt/slides/_rels/*.xml.rels")) + + if not slide_rels_files: + if self.verbose: + print("PASSED - No slide relationship files found") + return True + + for rels_file in slide_rels_files: + try: + # Parse the relationships file + root = lxml.etree.parse(str(rels_file)).getroot() + + # Find all notesSlide relationships + for rel in root.findall( + f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship" + ): + rel_type = rel.get("Type", "") + if "notesSlide" in rel_type: + target = rel.get("Target", "") + if target: + # Normalize the target path to handle relative paths + normalized_target = target.replace("../", "") + + # Track which slide references this notesSlide + slide_name = rels_file.stem.replace( + ".xml", "" + ) # e.g., "slide1" + + if normalized_target not in notes_slide_references: + notes_slide_references[normalized_target] = [] + notes_slide_references[normalized_target].append( + (slide_name, rels_file) + ) + + except (lxml.etree.XMLSyntaxError, Exception) as e: + errors.append( + f" {rels_file.relative_to(self.unpacked_dir)}: Error: {e}" + ) + + # Check for duplicate references + for target, references in notes_slide_references.items(): + if len(references) > 1: + slide_names = [ref[0] for ref in references] + errors.append( + f" Notes slide '{target}' is referenced by multiple slides: {', '.join(slide_names)}" + ) + for slide_name, rels_file in references: + errors.append(f" - {rels_file.relative_to(self.unpacked_dir)}") + + if errors: + print( + f"FAILED - Found {len([e for e in errors if not e.startswith(' ')])} notes slide reference validation errors:" + ) + for error in errors: + print(error) + print("Each slide may optionally have its own slide file.") + return False + else: + if self.verbose: + print("PASSED - All notes slide references are unique") + return True + + +if __name__ == "__main__": + raise RuntimeError("This module should not be run 
directly.") diff --git a/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/redlining.py b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/redlining.py new file mode 100644 index 00000000..e3bf0f96 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/ooxml/scripts/validation/redlining.py @@ -0,0 +1,279 @@ +""" +Validator for tracked changes in Word documents. +""" + +import subprocess +import tempfile +import zipfile +from pathlib import Path + + +class RedliningValidator: + """Validator for tracked changes in Word documents.""" + + def __init__(self, unpacked_dir, original_docx, verbose=False): + self.unpacked_dir = Path(unpacked_dir) + self.original_docx = Path(original_docx) + self.verbose = verbose + self.namespaces = { + "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main" + } + + def validate(self): + """Main validation method that returns True if valid, False otherwise.""" + # Verify unpacked directory exists and has correct structure + modified_file = self.unpacked_dir / "word" / "document.xml" + if not modified_file.exists(): + print(f"FAILED - Modified document.xml not found at {modified_file}") + return False + + # First, check if there are any tracked changes by Ticca to validate + try: + import xml.etree.ElementTree as ET + + tree = ET.parse(modified_file) + root = tree.getroot() + + # Check for w:del or w:ins tags authored by Ticca + del_elements = root.findall(".//w:del", self.namespaces) + ins_elements = root.findall(".//w:ins", self.namespaces) + + # Filter to only include changes by Ticca + ticca_del_elements = [ + elem + for elem in del_elements + if elem.get(f"{{{self.namespaces['w']}}}author") == "Ticca" + ] + ticca_ins_elements = [ + elem + for elem in ins_elements + if elem.get(f"{{{self.namespaces['w']}}}author") == "Ticca" + ] + + # Redlining validation is only needed if tracked changes by Ticca have been used. 
+ if not ticca_del_elements and not ticca_ins_elements: + if self.verbose: + print("PASSED - No tracked changes by Ticca found.") + return True + + except Exception: + # If we can't parse the XML, continue with full validation + pass + + # Create temporary directory for unpacking original docx + with tempfile.TemporaryDirectory() as temp_dir: + temp_path = Path(temp_dir) + + # Unpack original docx + try: + with zipfile.ZipFile(self.original_docx, "r") as zip_ref: + zip_ref.extractall(temp_path) + except Exception as e: + print(f"FAILED - Error unpacking original docx: {e}") + return False + + original_file = temp_path / "word" / "document.xml" + if not original_file.exists(): + print( + f"FAILED - Original document.xml not found in {self.original_docx}" + ) + return False + + # Parse both XML files using xml.etree.ElementTree for redlining validation + try: + import xml.etree.ElementTree as ET + + modified_tree = ET.parse(modified_file) + modified_root = modified_tree.getroot() + original_tree = ET.parse(original_file) + original_root = original_tree.getroot() + except ET.ParseError as e: + print(f"FAILED - Error parsing XML files: {e}") + return False + + # Remove Ticca's tracked changes from both documents + self._remove_ticca_tracked_changes(original_root) + self._remove_ticca_tracked_changes(modified_root) + + # Extract and compare text content + modified_text = self._extract_text_content(modified_root) + original_text = self._extract_text_content(original_root) + + if modified_text != original_text: + # Show detailed character-level differences for each paragraph + error_message = self._generate_detailed_diff( + original_text, modified_text + ) + print(error_message) + return False + + if self.verbose: + print("PASSED - All changes by Ticca are properly tracked") + return True + + def _generate_detailed_diff(self, original_text, modified_text): + """Generate detailed word-level differences using git word diff.""" + error_parts = [ + "FAILED - Document text doesn't match after removing Ticca's tracked changes", + "", + "Likely causes:", + " 1. Modified text inside another author's or tags", + " 2. Made edits without proper tracked changes", + " 3. 
Didn't nest inside when deleting another's insertion", + "", + "For pre-redlined documents, use correct patterns:", + " - To reject another's INSERTION: Nest inside their ", + " - To restore another's DELETION: Add new AFTER their ", + "", + ] + + # Show git word diff + git_diff = self._get_git_word_diff(original_text, modified_text) + if git_diff: + error_parts.extend(["Differences:", "============", git_diff]) + else: + error_parts.append("Unable to generate word diff (git not available)") + + return "\n".join(error_parts) + + def _get_git_word_diff(self, original_text, modified_text): + """Generate word diff using git with character-level precision.""" + try: + with tempfile.TemporaryDirectory() as temp_dir: + temp_path = Path(temp_dir) + + # Create two files + original_file = temp_path / "original.txt" + modified_file = temp_path / "modified.txt" + + original_file.write_text(original_text, encoding="utf-8") + modified_file.write_text(modified_text, encoding="utf-8") + + # Try character-level diff first for precise differences + result = subprocess.run( + [ + "git", + "diff", + "--word-diff=plain", + "--word-diff-regex=.", # Character-by-character diff + "-U0", # Zero lines of context - show only changed lines + "--no-index", + str(original_file), + str(modified_file), + ], + capture_output=True, + text=True, + ) + + if result.stdout.strip(): + # Clean up the output - remove git diff header lines + lines = result.stdout.split("\n") + # Skip the header lines (diff --git, index, +++, ---, @@) + content_lines = [] + in_content = False + for line in lines: + if line.startswith("@@"): + in_content = True + continue + if in_content and line.strip(): + content_lines.append(line) + + if content_lines: + return "\n".join(content_lines) + + # Fallback to word-level diff if character-level is too verbose + result = subprocess.run( + [ + "git", + "diff", + "--word-diff=plain", + "-U0", # Zero lines of context + "--no-index", + str(original_file), + str(modified_file), + ], + capture_output=True, + text=True, + ) + + if result.stdout.strip(): + lines = result.stdout.split("\n") + content_lines = [] + in_content = False + for line in lines: + if line.startswith("@@"): + in_content = True + continue + if in_content and line.strip(): + content_lines.append(line) + return "\n".join(content_lines) + + except (subprocess.CalledProcessError, FileNotFoundError, Exception): + # Git not available or other error, return None to use fallback + pass + + return None + + def _remove_ticca_tracked_changes(self, root): + """Remove tracked changes authored by Ticca from the XML root.""" + ins_tag = f"{{{self.namespaces['w']}}}ins" + del_tag = f"{{{self.namespaces['w']}}}del" + author_attr = f"{{{self.namespaces['w']}}}author" + + # Remove w:ins elements + for parent in root.iter(): + to_remove = [] + for child in parent: + if child.tag == ins_tag and child.get(author_attr) == "Ticca": + to_remove.append(child) + for elem in to_remove: + parent.remove(elem) + + # Unwrap content in w:del elements where author is "Ticca" + deltext_tag = f"{{{self.namespaces['w']}}}delText" + t_tag = f"{{{self.namespaces['w']}}}t" + + for parent in root.iter(): + to_process = [] + for child in parent: + if child.tag == del_tag and child.get(author_attr) == "Ticca": + to_process.append((child, list(parent).index(child))) + + # Process in reverse order to maintain indices + for del_elem, del_index in reversed(to_process): + # Convert w:delText to w:t before moving + for elem in del_elem.iter(): + if elem.tag == deltext_tag: + elem.tag = 
t_tag + + # Move all children of w:del to its parent before removing w:del + for child in reversed(list(del_elem)): + parent.insert(del_index, child) + parent.remove(del_elem) + + def _extract_text_content(self, root): + """Extract text content from Word XML, preserving paragraph structure. + + Empty paragraphs are skipped to avoid false positives when tracked + insertions add only structural elements without text content. + """ + p_tag = f"{{{self.namespaces['w']}}}p" + t_tag = f"{{{self.namespaces['w']}}}t" + + paragraphs = [] + for p_elem in root.findall(f".//{p_tag}"): + # Get all text elements within this paragraph + text_parts = [] + for t_elem in p_elem.findall(f".//{t_tag}"): + if t_elem.text: + text_parts.append(t_elem.text) + paragraph_text = "".join(text_parts) + # Skip empty paragraphs - they don't affect content validation + if paragraph_text: + paragraphs.append(paragraph_text) + + return "\n".join(paragraphs) + + +if __name__ == "__main__": + raise RuntimeError("This module should not be run directly.") diff --git a/code_puppy/bundled_skills/Office/pptx/scripts/inventory.py b/code_puppy/bundled_skills/Office/pptx/scripts/inventory.py new file mode 100644 index 00000000..edda390e --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/scripts/inventory.py @@ -0,0 +1,1020 @@ +#!/usr/bin/env python3 +""" +Extract structured text content from PowerPoint presentations. + +This module provides functionality to: +- Extract all text content from PowerPoint shapes +- Preserve paragraph formatting (alignment, bullets, fonts, spacing) +- Handle nested GroupShapes recursively with correct absolute positions +- Sort shapes by visual position on slides +- Filter out slide numbers and non-content placeholders +- Export to JSON with clean, structured data + +Classes: + ParagraphData: Represents a text paragraph with formatting + ShapeData: Represents a shape with position and text content + +Main Functions: + extract_text_inventory: Extract all text from a presentation + save_inventory: Save extracted data to JSON + +Usage: + python inventory.py input.pptx output.json +""" + +import argparse +import json +import platform +import sys +from dataclasses import dataclass +from pathlib import Path +from typing import Any, Dict, List, Optional, Tuple, Union + +from PIL import Image, ImageDraw, ImageFont +from pptx import Presentation +from pptx.enum.text import PP_ALIGN +from pptx.shapes.base import BaseShape + +# Type aliases for cleaner signatures +JsonValue = Union[str, int, float, bool, None] +ParagraphDict = Dict[str, JsonValue] +ShapeDict = Dict[ + str, Union[str, float, bool, List[ParagraphDict], List[str], Dict[str, Any], None] +] +InventoryData = Dict[ + str, Dict[str, "ShapeData"] +] # Dict of slide_id -> {shape_id -> ShapeData} +InventoryDict = Dict[str, Dict[str, ShapeDict]] # JSON-serializable inventory + + +def main(): + """Main entry point for command-line usage.""" + parser = argparse.ArgumentParser( + description="Extract text inventory from PowerPoint with proper GroupShape support.", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + python inventory.py presentation.pptx inventory.json + Extracts text inventory with correct absolute positions for grouped shapes + + python inventory.py presentation.pptx inventory.json --issues-only + Extracts only text shapes that have overflow or overlap issues + +The output JSON includes: + - All text content organized by slide and shape + - Correct absolute positions for shapes in groups + - Visual position 
and size in inches + - Paragraph properties and formatting + - Issue detection: text overflow and shape overlaps + """, + ) + + parser.add_argument("input", help="Input PowerPoint file (.pptx)") + parser.add_argument("output", help="Output JSON file for inventory") + parser.add_argument( + "--issues-only", + action="store_true", + help="Include only text shapes that have overflow or overlap issues", + ) + + args = parser.parse_args() + + input_path = Path(args.input) + if not input_path.exists(): + print(f"Error: Input file not found: {args.input}") + sys.exit(1) + + if not input_path.suffix.lower() == ".pptx": + print("Error: Input must be a PowerPoint file (.pptx)") + sys.exit(1) + + try: + print(f"Extracting text inventory from: {args.input}") + if args.issues_only: + print( + "Filtering to include only text shapes with issues (overflow/overlap)" + ) + inventory = extract_text_inventory(input_path, issues_only=args.issues_only) + + output_path = Path(args.output) + output_path.parent.mkdir(parents=True, exist_ok=True) + save_inventory(inventory, output_path) + + print(f"Output saved to: {args.output}") + + # Report statistics + total_slides = len(inventory) + total_shapes = sum(len(shapes) for shapes in inventory.values()) + if args.issues_only: + if total_shapes > 0: + print( + f"Found {total_shapes} text elements with issues in {total_slides} slides" + ) + else: + print("No issues discovered") + else: + print( + f"Found text in {total_slides} slides with {total_shapes} text elements" + ) + + except Exception as e: + print(f"Error processing presentation: {e}") + import traceback + + traceback.print_exc() + sys.exit(1) + + +@dataclass +class ShapeWithPosition: + """A shape with its absolute position on the slide.""" + + shape: BaseShape + absolute_left: int # in EMUs + absolute_top: int # in EMUs + + +class ParagraphData: + """Data structure for paragraph properties extracted from a PowerPoint paragraph.""" + + def __init__(self, paragraph: Any): + """Initialize from a PowerPoint paragraph object. 
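+        Font attributes (name, size, bold, italic, underline, color) are read
+        from the paragraph's first run only, so paragraphs with mixed run
+        formatting report the first run's values.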
+ + Args: + paragraph: The PowerPoint paragraph object + """ + self.text: str = paragraph.text.strip() + self.bullet: bool = False + self.level: Optional[int] = None + self.alignment: Optional[str] = None + self.space_before: Optional[float] = None + self.space_after: Optional[float] = None + self.font_name: Optional[str] = None + self.font_size: Optional[float] = None + self.bold: Optional[bool] = None + self.italic: Optional[bool] = None + self.underline: Optional[bool] = None + self.color: Optional[str] = None + self.theme_color: Optional[str] = None + self.line_spacing: Optional[float] = None + + # Check for bullet formatting + if ( + hasattr(paragraph, "_p") + and paragraph._p is not None + and paragraph._p.pPr is not None + ): + pPr = paragraph._p.pPr + ns = "{http://schemas.openxmlformats.org/drawingml/2006/main}" + if ( + pPr.find(f"{ns}buChar") is not None + or pPr.find(f"{ns}buAutoNum") is not None + ): + self.bullet = True + if hasattr(paragraph, "level"): + self.level = paragraph.level + + # Add alignment if not LEFT (default) + if hasattr(paragraph, "alignment") and paragraph.alignment is not None: + alignment_map = { + PP_ALIGN.CENTER: "CENTER", + PP_ALIGN.RIGHT: "RIGHT", + PP_ALIGN.JUSTIFY: "JUSTIFY", + } + if paragraph.alignment in alignment_map: + self.alignment = alignment_map[paragraph.alignment] + + # Add spacing properties if set + if hasattr(paragraph, "space_before") and paragraph.space_before: + self.space_before = paragraph.space_before.pt + if hasattr(paragraph, "space_after") and paragraph.space_after: + self.space_after = paragraph.space_after.pt + + # Extract font properties from first run + if paragraph.runs: + first_run = paragraph.runs[0] + if hasattr(first_run, "font"): + font = first_run.font + if font.name: + self.font_name = font.name + if font.size: + self.font_size = font.size.pt + if font.bold is not None: + self.bold = font.bold + if font.italic is not None: + self.italic = font.italic + if font.underline is not None: + self.underline = font.underline + + # Handle color - both RGB and theme colors + try: + # Try RGB color first + if font.color.rgb: + self.color = str(font.color.rgb) + except (AttributeError, TypeError): + # Fall back to theme color + try: + if font.color.theme_color: + self.theme_color = font.color.theme_color.name + except (AttributeError, TypeError): + pass + + # Add line spacing if set + if hasattr(paragraph, "line_spacing") and paragraph.line_spacing is not None: + if hasattr(paragraph.line_spacing, "pt"): + self.line_spacing = round(paragraph.line_spacing.pt, 2) + else: + # Multiplier - convert to points + font_size = self.font_size if self.font_size else 12.0 + self.line_spacing = round(paragraph.line_spacing * font_size, 2) + + def to_dict(self) -> ParagraphDict: + """Convert to dictionary for JSON serialization, excluding None values.""" + result: ParagraphDict = {"text": self.text} + + # Add optional fields only if they have values + if self.bullet: + result["bullet"] = self.bullet + if self.level is not None: + result["level"] = self.level + if self.alignment: + result["alignment"] = self.alignment + if self.space_before is not None: + result["space_before"] = self.space_before + if self.space_after is not None: + result["space_after"] = self.space_after + if self.font_name: + result["font_name"] = self.font_name + if self.font_size is not None: + result["font_size"] = self.font_size + if self.bold is not None: + result["bold"] = self.bold + if self.italic is not None: + result["italic"] = self.italic + if self.underline is 
not None: + result["underline"] = self.underline + if self.color: + result["color"] = self.color + if self.theme_color: + result["theme_color"] = self.theme_color + if self.line_spacing is not None: + result["line_spacing"] = self.line_spacing + + return result + + +class ShapeData: + """Data structure for shape properties extracted from a PowerPoint shape.""" + + @staticmethod + def emu_to_inches(emu: int) -> float: + """Convert EMUs (English Metric Units) to inches.""" + return emu / 914400.0 + + @staticmethod + def inches_to_pixels(inches: float, dpi: int = 96) -> int: + """Convert inches to pixels at given DPI.""" + return int(inches * dpi) + + @staticmethod + def get_font_path(font_name: str) -> Optional[str]: + """Get the font file path for a given font name. + + Args: + font_name: Name of the font (e.g., 'Arial', 'Calibri') + + Returns: + Path to the font file, or None if not found + """ + system = platform.system() + + # Common font file variations to try + font_variations = [ + font_name, + font_name.lower(), + font_name.replace(" ", ""), + font_name.replace(" ", "-"), + ] + + # Define font directories and extensions by platform + if system == "Darwin": # macOS + font_dirs = [ + "/System/Library/Fonts/", + "/Library/Fonts/", + "~/Library/Fonts/", + ] + extensions = [".ttf", ".otf", ".ttc", ".dfont"] + else: # Linux + font_dirs = [ + "/usr/share/fonts/truetype/", + "/usr/local/share/fonts/", + "~/.fonts/", + ] + extensions = [".ttf", ".otf"] + + # Try to find the font file + from pathlib import Path + + for font_dir in font_dirs: + font_dir_path = Path(font_dir).expanduser() + if not font_dir_path.exists(): + continue + + # First try exact matches + for variant in font_variations: + for ext in extensions: + font_path = font_dir_path / f"{variant}{ext}" + if font_path.exists(): + return str(font_path) + + # Then try fuzzy matching - find files containing the font name + try: + for file_path in font_dir_path.iterdir(): + if file_path.is_file(): + file_name_lower = file_path.name.lower() + font_name_lower = font_name.lower().replace(" ", "") + if font_name_lower in file_name_lower and any( + file_name_lower.endswith(ext) for ext in extensions + ): + return str(file_path) + except (OSError, PermissionError): + continue + + return None + + @staticmethod + def get_slide_dimensions(slide: Any) -> tuple[Optional[int], Optional[int]]: + """Get slide dimensions from slide object. + + Args: + slide: Slide object + + Returns: + Tuple of (width_emu, height_emu) or (None, None) if not found + """ + try: + prs = slide.part.package.presentation_part.presentation + return prs.slide_width, prs.slide_height + except (AttributeError, TypeError): + return None, None + + @staticmethod + def get_default_font_size(shape: BaseShape, slide_layout: Any) -> Optional[float]: + """Extract default font size from slide layout for a placeholder shape. 
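+        The sz attribute on the layout's defRPr element stores the size in
+        hundredths of a point, so sz="1800" corresponds to an 18 pt default.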
+ + Args: + shape: Placeholder shape + slide_layout: Slide layout containing the placeholder definition + + Returns: + Default font size in points, or None if not found + """ + try: + if not hasattr(shape, "placeholder_format"): + return None + + shape_type = shape.placeholder_format.type # type: ignore + for layout_placeholder in slide_layout.placeholders: + if layout_placeholder.placeholder_format.type == shape_type: + # Find first defRPr element with sz (size) attribute + for elem in layout_placeholder.element.iter(): + if "defRPr" in elem.tag and (sz := elem.get("sz")): + return float(sz) / 100.0 # Convert EMUs to points + break + except Exception: + pass + return None + + def __init__( + self, + shape: BaseShape, + absolute_left: Optional[int] = None, + absolute_top: Optional[int] = None, + slide: Optional[Any] = None, + ): + """Initialize from a PowerPoint shape object. + + Args: + shape: The PowerPoint shape object (should be pre-validated) + absolute_left: Absolute left position in EMUs (for shapes in groups) + absolute_top: Absolute top position in EMUs (for shapes in groups) + slide: Optional slide object to get dimensions and layout information + """ + self.shape = shape # Store reference to original shape + self.shape_id: str = "" # Will be set after sorting + + # Get slide dimensions from slide object + self.slide_width_emu, self.slide_height_emu = ( + self.get_slide_dimensions(slide) if slide else (None, None) + ) + + # Get placeholder type if applicable + self.placeholder_type: Optional[str] = None + self.default_font_size: Optional[float] = None + if hasattr(shape, "is_placeholder") and shape.is_placeholder: # type: ignore + if shape.placeholder_format and shape.placeholder_format.type: # type: ignore + self.placeholder_type = ( + str(shape.placeholder_format.type).split(".")[-1].split(" ")[0] # type: ignore + ) + + # Get default font size from layout + if slide and hasattr(slide, "slide_layout"): + self.default_font_size = self.get_default_font_size( + shape, slide.slide_layout + ) + + # Get position information + # Use absolute positions if provided (for shapes in groups), otherwise use shape's position + left_emu = ( + absolute_left + if absolute_left is not None + else (shape.left if hasattr(shape, "left") else 0) + ) + top_emu = ( + absolute_top + if absolute_top is not None + else (shape.top if hasattr(shape, "top") else 0) + ) + + self.left: float = round(self.emu_to_inches(left_emu), 2) # type: ignore + self.top: float = round(self.emu_to_inches(top_emu), 2) # type: ignore + self.width: float = round( + self.emu_to_inches(shape.width if hasattr(shape, "width") else 0), + 2, # type: ignore + ) + self.height: float = round( + self.emu_to_inches(shape.height if hasattr(shape, "height") else 0), + 2, # type: ignore + ) + + # Store EMU positions for overflow calculations + self.left_emu = left_emu + self.top_emu = top_emu + self.width_emu = shape.width if hasattr(shape, "width") else 0 + self.height_emu = shape.height if hasattr(shape, "height") else 0 + + # Calculate overflow status + self.frame_overflow_bottom: Optional[float] = None + self.slide_overflow_right: Optional[float] = None + self.slide_overflow_bottom: Optional[float] = None + self.overlapping_shapes: Dict[ + str, float + ] = {} # Dict of shape_id -> overlap area in sq inches + self.warnings: List[str] = [] + self._estimate_frame_overflow() + self._calculate_slide_overflow() + self._detect_bullet_issues() + + @property + def paragraphs(self) -> List[ParagraphData]: + """Calculate paragraphs from the shape's 
text frame.""" + if not self.shape or not hasattr(self.shape, "text_frame"): + return [] + + paragraphs = [] + for paragraph in self.shape.text_frame.paragraphs: # type: ignore + if paragraph.text.strip(): + paragraphs.append(ParagraphData(paragraph)) + return paragraphs + + def _get_default_font_size(self) -> int: + """Get default font size from theme text styles or use conservative default.""" + try: + if not ( + hasattr(self.shape, "part") and hasattr(self.shape.part, "slide_layout") + ): + return 14 + + slide_master = self.shape.part.slide_layout.slide_master # type: ignore + if not hasattr(slide_master, "element"): + return 14 + + # Determine theme style based on placeholder type + style_name = "bodyStyle" # Default + if self.placeholder_type and "TITLE" in self.placeholder_type: + style_name = "titleStyle" + + # Find font size in theme styles + for child in slide_master.element.iter(): + tag = child.tag.split("}")[-1] if "}" in child.tag else child.tag + if tag == style_name: + for elem in child.iter(): + if "sz" in elem.attrib: + return int(elem.attrib["sz"]) // 100 + except Exception: + pass + + return 14 # Conservative default for body text + + def _get_usable_dimensions(self, text_frame) -> Tuple[int, int]: + """Get usable width and height in pixels after accounting for margins.""" + # Default PowerPoint margins in inches + margins = {"top": 0.05, "bottom": 0.05, "left": 0.1, "right": 0.1} + + # Override with actual margins if set + if hasattr(text_frame, "margin_top") and text_frame.margin_top: + margins["top"] = self.emu_to_inches(text_frame.margin_top) + if hasattr(text_frame, "margin_bottom") and text_frame.margin_bottom: + margins["bottom"] = self.emu_to_inches(text_frame.margin_bottom) + if hasattr(text_frame, "margin_left") and text_frame.margin_left: + margins["left"] = self.emu_to_inches(text_frame.margin_left) + if hasattr(text_frame, "margin_right") and text_frame.margin_right: + margins["right"] = self.emu_to_inches(text_frame.margin_right) + + # Calculate usable area + usable_width = self.width - margins["left"] - margins["right"] + usable_height = self.height - margins["top"] - margins["bottom"] + + # Convert to pixels + return ( + self.inches_to_pixels(usable_width), + self.inches_to_pixels(usable_height), + ) + + def _wrap_text_line(self, line: str, max_width_px: int, draw, font) -> List[str]: + """Wrap a single line of text to fit within max_width_px.""" + if not line: + return [""] + + # Use textlength for efficient width calculation + if draw.textlength(line, font=font) <= max_width_px: + return [line] + + # Need to wrap - split into words + wrapped = [] + words = line.split(" ") + current_line = "" + + for word in words: + test_line = current_line + (" " if current_line else "") + word + if draw.textlength(test_line, font=font) <= max_width_px: + current_line = test_line + else: + if current_line: + wrapped.append(current_line) + current_line = word + + if current_line: + wrapped.append(current_line) + + return wrapped + + def _estimate_frame_overflow(self) -> None: + """Estimate if text overflows the shape bounds using PIL text measurement.""" + if not self.shape or not hasattr(self.shape, "text_frame"): + return + + text_frame = self.shape.text_frame # type: ignore + if not text_frame or not text_frame.paragraphs: + return + + # Get usable dimensions after accounting for margins + usable_width_px, usable_height_px = self._get_usable_dimensions(text_frame) + if usable_width_px <= 0 or usable_height_px <= 0: + return + + # Set up PIL for text measurement + 
dummy_img = Image.new("RGB", (1, 1)) + draw = ImageDraw.Draw(dummy_img) + + # Get default font size from placeholder or use conservative estimate + default_font_size = self._get_default_font_size() + + # Calculate total height of all paragraphs + total_height_px = 0 + + for para_idx, paragraph in enumerate(text_frame.paragraphs): + if not paragraph.text.strip(): + continue + + para_data = ParagraphData(paragraph) + + # Load font for this paragraph + font_name = para_data.font_name or "Arial" + font_size = int(para_data.font_size or default_font_size) + + font = None + font_path = self.get_font_path(font_name) + if font_path: + try: + font = ImageFont.truetype(font_path, size=font_size) + except Exception: + font = ImageFont.load_default() + else: + font = ImageFont.load_default() + + # Wrap all lines in this paragraph + all_wrapped_lines = [] + for line in paragraph.text.split("\n"): + wrapped = self._wrap_text_line(line, usable_width_px, draw, font) + all_wrapped_lines.extend(wrapped) + + if all_wrapped_lines: + # Calculate line height + if para_data.line_spacing: + # Custom line spacing explicitly set + line_height_px = para_data.line_spacing * 96 / 72 + else: + # PowerPoint default single spacing (1.0x font size) + line_height_px = font_size * 96 / 72 + + # Add space_before (except first paragraph) + if para_idx > 0 and para_data.space_before: + total_height_px += para_data.space_before * 96 / 72 + + # Add paragraph text height + total_height_px += len(all_wrapped_lines) * line_height_px + + # Add space_after + if para_data.space_after: + total_height_px += para_data.space_after * 96 / 72 + + # Check for overflow (ignore negligible overflows <= 0.05") + if total_height_px > usable_height_px: + overflow_px = total_height_px - usable_height_px + overflow_inches = round(overflow_px / 96.0, 2) + if overflow_inches > 0.05: # Only report significant overflows + self.frame_overflow_bottom = overflow_inches + + def _calculate_slide_overflow(self) -> None: + """Calculate if shape overflows the slide boundaries.""" + if self.slide_width_emu is None or self.slide_height_emu is None: + return + + # Check right overflow (ignore negligible overflows <= 0.01") + right_edge_emu = self.left_emu + self.width_emu + if right_edge_emu > self.slide_width_emu: + overflow_emu = right_edge_emu - self.slide_width_emu + overflow_inches = round(self.emu_to_inches(overflow_emu), 2) + if overflow_inches > 0.01: # Only report significant overflows + self.slide_overflow_right = overflow_inches + + # Check bottom overflow (ignore negligible overflows <= 0.01") + bottom_edge_emu = self.top_emu + self.height_emu + if bottom_edge_emu > self.slide_height_emu: + overflow_emu = bottom_edge_emu - self.slide_height_emu + overflow_inches = round(self.emu_to_inches(overflow_emu), 2) + if overflow_inches > 0.01: # Only report significant overflows + self.slide_overflow_bottom = overflow_inches + + def _detect_bullet_issues(self) -> None: + """Detect bullet point formatting issues in paragraphs.""" + if not self.shape or not hasattr(self.shape, "text_frame"): + return + + text_frame = self.shape.text_frame # type: ignore + if not text_frame or not text_frame.paragraphs: + return + + # Common bullet symbols that indicate manual bullets + bullet_symbols = ["•", "●", "○"] + + for paragraph in text_frame.paragraphs: + text = paragraph.text.strip() + # Check for manual bullet symbols + if text and any(text.startswith(symbol + " ") for symbol in bullet_symbols): + self.warnings.append( + "manual_bullet_symbol: use proper bullet 
formatting" + ) + break + + @property + def has_any_issues(self) -> bool: + """Check if shape has any issues (overflow, overlap, or warnings).""" + return ( + self.frame_overflow_bottom is not None + or self.slide_overflow_right is not None + or self.slide_overflow_bottom is not None + or len(self.overlapping_shapes) > 0 + or len(self.warnings) > 0 + ) + + def to_dict(self) -> ShapeDict: + """Convert to dictionary for JSON serialization.""" + result: ShapeDict = { + "left": self.left, + "top": self.top, + "width": self.width, + "height": self.height, + } + + # Add optional fields if present + if self.placeholder_type: + result["placeholder_type"] = self.placeholder_type + + if self.default_font_size: + result["default_font_size"] = self.default_font_size + + # Add overflow information only if there is overflow + overflow_data = {} + + # Add frame overflow if present + if self.frame_overflow_bottom is not None: + overflow_data["frame"] = {"overflow_bottom": self.frame_overflow_bottom} + + # Add slide overflow if present + slide_overflow = {} + if self.slide_overflow_right is not None: + slide_overflow["overflow_right"] = self.slide_overflow_right + if self.slide_overflow_bottom is not None: + slide_overflow["overflow_bottom"] = self.slide_overflow_bottom + if slide_overflow: + overflow_data["slide"] = slide_overflow + + # Only add overflow field if there is overflow + if overflow_data: + result["overflow"] = overflow_data + + # Add overlap field if there are overlapping shapes + if self.overlapping_shapes: + result["overlap"] = {"overlapping_shapes": self.overlapping_shapes} + + # Add warnings field if there are warnings + if self.warnings: + result["warnings"] = self.warnings + + # Add paragraphs after placeholder_type + result["paragraphs"] = [para.to_dict() for para in self.paragraphs] + + return result + + +def is_valid_shape(shape: BaseShape) -> bool: + """Check if a shape contains meaningful text content.""" + # Must have a text frame with content + if not hasattr(shape, "text_frame") or not shape.text_frame: # type: ignore + return False + + text = shape.text_frame.text.strip() # type: ignore + if not text: + return False + + # Skip slide numbers and numeric footers + if hasattr(shape, "is_placeholder") and shape.is_placeholder: # type: ignore + if shape.placeholder_format and shape.placeholder_format.type: # type: ignore + placeholder_type = ( + str(shape.placeholder_format.type).split(".")[-1].split(" ")[0] # type: ignore + ) + if placeholder_type == "SLIDE_NUMBER": + return False + if placeholder_type == "FOOTER" and text.isdigit(): + return False + + return True + + +def collect_shapes_with_absolute_positions( + shape: BaseShape, parent_left: int = 0, parent_top: int = 0 +) -> List[ShapeWithPosition]: + """Recursively collect all shapes with valid text, calculating absolute positions. + + For shapes within groups, their positions are relative to the group. + This function calculates the absolute position on the slide by accumulating + parent group offsets. 
+ + Args: + shape: The shape to process + parent_left: Accumulated left offset from parent groups (in EMUs) + parent_top: Accumulated top offset from parent groups (in EMUs) + + Returns: + List of ShapeWithPosition objects with absolute positions + """ + if hasattr(shape, "shapes"): # GroupShape + result = [] + # Get this group's position + group_left = shape.left if hasattr(shape, "left") else 0 + group_top = shape.top if hasattr(shape, "top") else 0 + + # Calculate absolute position for this group + abs_group_left = parent_left + group_left + abs_group_top = parent_top + group_top + + # Process children with accumulated offsets + for child in shape.shapes: # type: ignore + result.extend( + collect_shapes_with_absolute_positions( + child, abs_group_left, abs_group_top + ) + ) + return result + + # Regular shape - check if it has valid text + if is_valid_shape(shape): + # Calculate absolute position + shape_left = shape.left if hasattr(shape, "left") else 0 + shape_top = shape.top if hasattr(shape, "top") else 0 + + return [ + ShapeWithPosition( + shape=shape, + absolute_left=parent_left + shape_left, + absolute_top=parent_top + shape_top, + ) + ] + + return [] + + +def sort_shapes_by_position(shapes: List[ShapeData]) -> List[ShapeData]: + """Sort shapes by visual position (top-to-bottom, left-to-right). + + Shapes within 0.5 inches vertically are considered on the same row. + """ + if not shapes: + return shapes + + # Sort by top position first + shapes = sorted(shapes, key=lambda s: (s.top, s.left)) + + # Group shapes by row (within 0.5 inches vertically) + result = [] + row = [shapes[0]] + row_top = shapes[0].top + + for shape in shapes[1:]: + if abs(shape.top - row_top) <= 0.5: + row.append(shape) + else: + # Sort current row by left position and add to result + result.extend(sorted(row, key=lambda s: s.left)) + row = [shape] + row_top = shape.top + + # Don't forget the last row + result.extend(sorted(row, key=lambda s: s.left)) + return result + + +def calculate_overlap( + rect1: Tuple[float, float, float, float], + rect2: Tuple[float, float, float, float], + tolerance: float = 0.05, +) -> Tuple[bool, float]: + """Calculate if and how much two rectangles overlap. + + Args: + rect1: (left, top, width, height) of first rectangle in inches + rect2: (left, top, width, height) of second rectangle in inches + tolerance: Minimum overlap in inches to consider as overlapping (default: 0.05") + + Returns: + Tuple of (overlaps, overlap_area) where: + - overlaps: True if rectangles overlap by more than tolerance + - overlap_area: Area of overlap in square inches + """ + left1, top1, w1, h1 = rect1 + left2, top2, w2, h2 = rect2 + + # Calculate overlap dimensions + overlap_width = min(left1 + w1, left2 + w2) - max(left1, left2) + overlap_height = min(top1 + h1, top2 + h2) - max(top1, top2) + + # Check if there's meaningful overlap (more than tolerance) + if overlap_width > tolerance and overlap_height > tolerance: + # Calculate overlap area in square inches + overlap_area = overlap_width * overlap_height + return True, round(overlap_area, 2) + + return False, 0 + + +def detect_overlaps(shapes: List[ShapeData]) -> None: + """Detect overlapping shapes and update their overlapping_shapes dictionaries. + + This function requires each ShapeData to have its shape_id already set. + It modifies the shapes in-place, adding shape IDs with overlap areas in square inches. 
+ + Args: + shapes: List of ShapeData objects with shape_id attributes set + """ + n = len(shapes) + + # Compare each pair of shapes + for i in range(n): + for j in range(i + 1, n): + shape1 = shapes[i] + shape2 = shapes[j] + + # Ensure shape IDs are set + assert shape1.shape_id, f"Shape at index {i} has no shape_id" + assert shape2.shape_id, f"Shape at index {j} has no shape_id" + + rect1 = (shape1.left, shape1.top, shape1.width, shape1.height) + rect2 = (shape2.left, shape2.top, shape2.width, shape2.height) + + overlaps, overlap_area = calculate_overlap(rect1, rect2) + + if overlaps: + # Add shape IDs with overlap area in square inches + shape1.overlapping_shapes[shape2.shape_id] = overlap_area + shape2.overlapping_shapes[shape1.shape_id] = overlap_area + + +def extract_text_inventory( + pptx_path: Path, prs: Optional[Any] = None, issues_only: bool = False +) -> InventoryData: + """Extract text content from all slides in a PowerPoint presentation. + + Args: + pptx_path: Path to the PowerPoint file + prs: Optional Presentation object to use. If not provided, will load from pptx_path. + issues_only: If True, only include shapes that have overflow or overlap issues + + Returns a nested dictionary: {slide-N: {shape-N: ShapeData}} + Shapes are sorted by visual position (top-to-bottom, left-to-right). + The ShapeData objects contain the full shape information and can be + converted to dictionaries for JSON serialization using to_dict(). + """ + if prs is None: + prs = Presentation(str(pptx_path)) + inventory: InventoryData = {} + + for slide_idx, slide in enumerate(prs.slides): + # Collect all valid shapes from this slide with absolute positions + shapes_with_positions = [] + for shape in slide.shapes: # type: ignore + shapes_with_positions.extend(collect_shapes_with_absolute_positions(shape)) + + if not shapes_with_positions: + continue + + # Convert to ShapeData with absolute positions and slide reference + shape_data_list = [ + ShapeData( + swp.shape, + swp.absolute_left, + swp.absolute_top, + slide, + ) + for swp in shapes_with_positions + ] + + # Sort by visual position and assign stable IDs in one step + sorted_shapes = sort_shapes_by_position(shape_data_list) + for idx, shape_data in enumerate(sorted_shapes): + shape_data.shape_id = f"shape-{idx}" + + # Detect overlaps using the stable shape IDs + if len(sorted_shapes) > 1: + detect_overlaps(sorted_shapes) + + # Filter for issues only if requested (after overlap detection) + if issues_only: + sorted_shapes = [sd for sd in sorted_shapes if sd.has_any_issues] + + if not sorted_shapes: + continue + + # Create slide inventory using the stable shape IDs + inventory[f"slide-{slide_idx}"] = { + shape_data.shape_id: shape_data for shape_data in sorted_shapes + } + + return inventory + + +def get_inventory_as_dict(pptx_path: Path, issues_only: bool = False) -> InventoryDict: + """Extract text inventory and return as JSON-serializable dictionaries. + + This is a convenience wrapper around extract_text_inventory that returns + dictionaries instead of ShapeData objects, useful for testing and direct + JSON serialization. 
+ + Args: + pptx_path: Path to the PowerPoint file + issues_only: If True, only include shapes that have overflow or overlap issues + + Returns: + Nested dictionary with all data serialized for JSON + """ + inventory = extract_text_inventory(pptx_path, issues_only=issues_only) + + # Convert ShapeData objects to dictionaries + dict_inventory: InventoryDict = {} + for slide_key, shapes in inventory.items(): + dict_inventory[slide_key] = { + shape_key: shape_data.to_dict() for shape_key, shape_data in shapes.items() + } + + return dict_inventory + + +def save_inventory(inventory: InventoryData, output_path: Path) -> None: + """Save inventory to JSON file with proper formatting. + + Converts ShapeData objects to dictionaries for JSON serialization. + """ + # Convert ShapeData objects to dictionaries + json_inventory: InventoryDict = {} + for slide_key, shapes in inventory.items(): + json_inventory[slide_key] = { + shape_key: shape_data.to_dict() for shape_key, shape_data in shapes.items() + } + + with open(output_path, "w", encoding="utf-8") as f: + json.dump(json_inventory, f, indent=2, ensure_ascii=False) + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Office/pptx/scripts/rearrange.py b/code_puppy/bundled_skills/Office/pptx/scripts/rearrange.py new file mode 100644 index 00000000..2519911f --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/scripts/rearrange.py @@ -0,0 +1,231 @@ +#!/usr/bin/env python3 +""" +Rearrange PowerPoint slides based on a sequence of indices. + +Usage: + python rearrange.py template.pptx output.pptx 0,34,34,50,52 + +This will create output.pptx using slides from template.pptx in the specified order. +Slides can be repeated (e.g., 34 appears twice). +""" + +import argparse +import shutil +import sys +from copy import deepcopy +from pathlib import Path + +import six +from pptx import Presentation + + +def main(): + parser = argparse.ArgumentParser( + description="Rearrange PowerPoint slides based on a sequence of indices.", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + python rearrange.py template.pptx output.pptx 0,34,34,50,52 + Creates output.pptx using slides 0, 34 (twice), 50, and 52 from template.pptx + + python rearrange.py template.pptx output.pptx 5,3,1,2,4 + Creates output.pptx with slides reordered as specified + +Note: Slide indices are 0-based (first slide is 0, second is 1, etc.) + """, + ) + + parser.add_argument("template", help="Path to template PPTX file") + parser.add_argument("output", help="Path for output PPTX file") + parser.add_argument( + "sequence", help="Comma-separated sequence of slide indices (0-based)" + ) + + args = parser.parse_args() + + # Parse the slide sequence + try: + slide_sequence = [int(x.strip()) for x in args.sequence.split(",")] + except ValueError: + print( + "Error: Invalid sequence format. 
Use comma-separated integers (e.g., 0,34,34,50,52)" + ) + sys.exit(1) + + # Check template exists + template_path = Path(args.template) + if not template_path.exists(): + print(f"Error: Template file not found: {args.template}") + sys.exit(1) + + # Create output directory if needed + output_path = Path(args.output) + output_path.parent.mkdir(parents=True, exist_ok=True) + + try: + rearrange_presentation(template_path, output_path, slide_sequence) + except ValueError as e: + print(f"Error: {e}") + sys.exit(1) + except Exception as e: + print(f"Error processing presentation: {e}") + sys.exit(1) + + +def duplicate_slide(pres, index): + """Duplicate a slide in the presentation.""" + source = pres.slides[index] + + # Use source's layout to preserve formatting + new_slide = pres.slides.add_slide(source.slide_layout) + + # Collect all image and media relationships from the source slide + image_rels = {} + for rel_id, rel in six.iteritems(source.part.rels): + if "image" in rel.reltype or "media" in rel.reltype: + image_rels[rel_id] = rel + + # CRITICAL: Clear placeholder shapes to avoid duplicates + for shape in new_slide.shapes: + sp = shape.element + sp.getparent().remove(sp) + + # Copy all shapes from source + for shape in source.shapes: + el = shape.element + new_el = deepcopy(el) + new_slide.shapes._spTree.insert_element_before(new_el, "p:extLst") + + # Handle picture shapes - need to update the blip reference + # Look for all blip elements (they can be in pic or other contexts) + # Using the element's own xpath method without namespaces argument + blips = new_el.xpath(".//a:blip[@r:embed]") + for blip in blips: + old_rId = blip.get( + "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed" + ) + if old_rId in image_rels: + # Create a new relationship in the destination slide for this image + old_rel = image_rels[old_rId] + # get_or_add returns the rId directly, or adds and returns new rId + new_rId = new_slide.part.rels.get_or_add( + old_rel.reltype, old_rel._target + ) + # Update the blip's embed reference to use the new relationship ID + blip.set( + "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed", + new_rId, + ) + + # Copy any additional image/media relationships that might be referenced elsewhere + for rel_id, rel in image_rels.items(): + try: + new_slide.part.rels.get_or_add(rel.reltype, rel._target) + except Exception: + pass # Relationship might already exist + + return new_slide + + +def delete_slide(pres, index): + """Delete a slide from the presentation.""" + rId = pres.slides._sldIdLst[index].rId + pres.part.drop_rel(rId) + del pres.slides._sldIdLst[index] + + +def reorder_slides(pres, slide_index, target_index): + """Move a slide from one position to another.""" + slides = pres.slides._sldIdLst + + # Remove slide element from current position + slide_element = slides[slide_index] + slides.remove(slide_element) + + # Insert at target position + slides.insert(target_index, slide_element) + + +def rearrange_presentation(template_path, output_path, slide_sequence): + """ + Create a new presentation with slides from template in specified order. 
+ + Args: + template_path: Path to template PPTX file + output_path: Path for output PPTX file + slide_sequence: List of slide indices (0-based) to include + """ + # Copy template to preserve dimensions and theme + if template_path != output_path: + shutil.copy2(template_path, output_path) + prs = Presentation(output_path) + else: + prs = Presentation(template_path) + + total_slides = len(prs.slides) + + # Validate indices + for idx in slide_sequence: + if idx < 0 or idx >= total_slides: + raise ValueError(f"Slide index {idx} out of range (0-{total_slides - 1})") + + # Track original slides and their duplicates + slide_map = [] # List of actual slide indices for final presentation + duplicated = {} # Track duplicates: original_idx -> [duplicate_indices] + + # Step 1: DUPLICATE repeated slides + print(f"Processing {len(slide_sequence)} slides from template...") + for i, template_idx in enumerate(slide_sequence): + if template_idx in duplicated and duplicated[template_idx]: + # Already duplicated this slide, use the duplicate + slide_map.append(duplicated[template_idx].pop(0)) + print(f" [{i}] Using duplicate of slide {template_idx}") + elif slide_sequence.count(template_idx) > 1 and template_idx not in duplicated: + # First occurrence of a repeated slide - create duplicates + slide_map.append(template_idx) + duplicates = [] + count = slide_sequence.count(template_idx) - 1 + print( + f" [{i}] Using original slide {template_idx}, creating {count} duplicate(s)" + ) + for _ in range(count): + duplicate_slide(prs, template_idx) + duplicates.append(len(prs.slides) - 1) + duplicated[template_idx] = duplicates + else: + # Unique slide or first occurrence already handled, use original + slide_map.append(template_idx) + print(f" [{i}] Using original slide {template_idx}") + + # Step 2: DELETE unwanted slides (work backwards) + slides_to_keep = set(slide_map) + print(f"\nDeleting {len(prs.slides) - len(slides_to_keep)} unused slides...") + for i in range(len(prs.slides) - 1, -1, -1): + if i not in slides_to_keep: + delete_slide(prs, i) + # Update slide_map indices after deletion + slide_map = [idx - 1 if idx > i else idx for idx in slide_map] + + # Step 3: REORDER to final sequence + print(f"Reordering {len(slide_map)} slides to final sequence...") + for target_pos in range(len(slide_map)): + # Find which slide should be at target_pos + current_pos = slide_map[target_pos] + if current_pos != target_pos: + reorder_slides(prs, current_pos, target_pos) + # Update slide_map: the move shifts other slides + for i in range(len(slide_map)): + if slide_map[i] > current_pos and slide_map[i] <= target_pos: + slide_map[i] -= 1 + elif slide_map[i] < current_pos and slide_map[i] >= target_pos: + slide_map[i] += 1 + slide_map[target_pos] = target_pos + + # Save the presentation + prs.save(output_path) + print(f"\nSaved rearranged presentation to: {output_path}") + print(f"Final presentation has {len(prs.slides)} slides") + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Office/pptx/scripts/replace.py b/code_puppy/bundled_skills/Office/pptx/scripts/replace.py new file mode 100644 index 00000000..8f7a8b1b --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/scripts/replace.py @@ -0,0 +1,385 @@ +#!/usr/bin/env python3 +"""Apply text replacements to PowerPoint presentation. + +Usage: + python replace.py + +The replacements JSON should have the structure output by inventory.py. 
+ALL text shapes identified by inventory.py will have their text cleared +unless "paragraphs" is specified in the replacements for that shape. +""" + +import json +import sys +from pathlib import Path +from typing import Any, Dict, List + +from inventory import InventoryData, extract_text_inventory +from pptx import Presentation +from pptx.dml.color import RGBColor +from pptx.enum.dml import MSO_THEME_COLOR +from pptx.enum.text import PP_ALIGN +from pptx.oxml.xmlchemy import OxmlElement +from pptx.util import Pt + + +def clear_paragraph_bullets(paragraph): + """Clear bullet formatting from a paragraph.""" + pPr = paragraph._element.get_or_add_pPr() + + # Remove existing bullet elements + for child in list(pPr): + if ( + child.tag.endswith("buChar") + or child.tag.endswith("buNone") + or child.tag.endswith("buAutoNum") + or child.tag.endswith("buFont") + ): + pPr.remove(child) + + return pPr + + +def apply_paragraph_properties(paragraph, para_data: Dict[str, Any]): + """Apply formatting properties to a paragraph.""" + # Get the text but don't set it on paragraph directly yet + text = para_data.get("text", "") + + # Get or create paragraph properties + pPr = clear_paragraph_bullets(paragraph) + + # Handle bullet formatting + if para_data.get("bullet", False): + level = para_data.get("level", 0) + paragraph.level = level + + # Calculate font-proportional indentation + font_size = para_data.get("font_size", 18.0) + level_indent_emu = int((font_size * (1.6 + level * 1.6)) * 12700) + hanging_indent_emu = int(-font_size * 0.8 * 12700) + + # Set indentation + pPr.attrib["marL"] = str(level_indent_emu) + pPr.attrib["indent"] = str(hanging_indent_emu) + + # Add bullet character + buChar = OxmlElement("a:buChar") + buChar.set("char", "•") + pPr.append(buChar) + + # Default to left alignment for bullets if not specified + if "alignment" not in para_data: + paragraph.alignment = PP_ALIGN.LEFT + else: + # Remove indentation for non-bullet text + pPr.attrib["marL"] = "0" + pPr.attrib["indent"] = "0" + + # Add buNone element + buNone = OxmlElement("a:buNone") + pPr.insert(0, buNone) + + # Apply alignment + if "alignment" in para_data: + alignment_map = { + "LEFT": PP_ALIGN.LEFT, + "CENTER": PP_ALIGN.CENTER, + "RIGHT": PP_ALIGN.RIGHT, + "JUSTIFY": PP_ALIGN.JUSTIFY, + } + if para_data["alignment"] in alignment_map: + paragraph.alignment = alignment_map[para_data["alignment"]] + + # Apply spacing + if "space_before" in para_data: + paragraph.space_before = Pt(para_data["space_before"]) + if "space_after" in para_data: + paragraph.space_after = Pt(para_data["space_after"]) + if "line_spacing" in para_data: + paragraph.line_spacing = Pt(para_data["line_spacing"]) + + # Apply run-level formatting + if not paragraph.runs: + run = paragraph.add_run() + run.text = text + else: + run = paragraph.runs[0] + run.text = text + + # Apply font properties + apply_font_properties(run, para_data) + + +def apply_font_properties(run, para_data: Dict[str, Any]): + """Apply font properties to a text run.""" + if "bold" in para_data: + run.font.bold = para_data["bold"] + if "italic" in para_data: + run.font.italic = para_data["italic"] + if "underline" in para_data: + run.font.underline = para_data["underline"] + if "font_size" in para_data: + run.font.size = Pt(para_data["font_size"]) + if "font_name" in para_data: + run.font.name = para_data["font_name"] + + # Apply color - prefer RGB, fall back to theme_color + if "color" in para_data: + color_hex = para_data["color"].lstrip("#") + if len(color_hex) == 6: + r = 
int(color_hex[0:2], 16) + g = int(color_hex[2:4], 16) + b = int(color_hex[4:6], 16) + run.font.color.rgb = RGBColor(r, g, b) + elif "theme_color" in para_data: + # Get theme color by name (e.g., "DARK_1", "ACCENT_1") + theme_name = para_data["theme_color"] + try: + run.font.color.theme_color = getattr(MSO_THEME_COLOR, theme_name) + except AttributeError: + print(f" WARNING: Unknown theme color name '{theme_name}'") + + +def detect_frame_overflow(inventory: InventoryData) -> Dict[str, Dict[str, float]]: + """Detect text overflow in shapes (text exceeding shape bounds). + + Returns dict of slide_key -> shape_key -> overflow_inches. + Only includes shapes that have text overflow. + """ + overflow_map = {} + + for slide_key, shapes_dict in inventory.items(): + for shape_key, shape_data in shapes_dict.items(): + # Check for frame overflow (text exceeding shape bounds) + if shape_data.frame_overflow_bottom is not None: + if slide_key not in overflow_map: + overflow_map[slide_key] = {} + overflow_map[slide_key][shape_key] = shape_data.frame_overflow_bottom + + return overflow_map + + +def validate_replacements(inventory: InventoryData, replacements: Dict) -> List[str]: + """Validate that all shapes in replacements exist in inventory. + + Returns list of error messages. + """ + errors = [] + + for slide_key, shapes_data in replacements.items(): + if not slide_key.startswith("slide-"): + continue + + # Check if slide exists + if slide_key not in inventory: + errors.append(f"Slide '{slide_key}' not found in inventory") + continue + + # Check each shape + for shape_key in shapes_data.keys(): + if shape_key not in inventory[slide_key]: + # Find shapes without replacements defined and show their content + unused_with_content = [] + for k in inventory[slide_key].keys(): + if k not in shapes_data: + shape_data = inventory[slide_key][k] + # Get text from paragraphs as preview + paragraphs = shape_data.paragraphs + if paragraphs and paragraphs[0].text: + first_text = paragraphs[0].text[:50] + if len(paragraphs[0].text) > 50: + first_text += "..." + unused_with_content.append(f"{k} ('{first_text}')") + else: + unused_with_content.append(k) + + errors.append( + f"Shape '{shape_key}' not found on '{slide_key}'. 
" + f"Shapes without replacements: {', '.join(sorted(unused_with_content)) if unused_with_content else 'none'}" + ) + + return errors + + +def check_duplicate_keys(pairs): + """Check for duplicate keys when loading JSON.""" + result = {} + for key, value in pairs: + if key in result: + raise ValueError(f"Duplicate key found in JSON: '{key}'") + result[key] = value + return result + + +def apply_replacements(pptx_file: str, json_file: str, output_file: str): + """Apply text replacements from JSON to PowerPoint presentation.""" + + # Load presentation + prs = Presentation(pptx_file) + + # Get inventory of all text shapes (returns ShapeData objects) + # Pass prs to use same Presentation instance + inventory = extract_text_inventory(Path(pptx_file), prs) + + # Detect text overflow in original presentation + original_overflow = detect_frame_overflow(inventory) + + # Load replacement data with duplicate key detection + with open(json_file, "r") as f: + replacements = json.load(f, object_pairs_hook=check_duplicate_keys) + + # Validate replacements + errors = validate_replacements(inventory, replacements) + if errors: + print("ERROR: Invalid shapes in replacement JSON:") + for error in errors: + print(f" - {error}") + print("\nPlease check the inventory and update your replacement JSON.") + print( + "You can regenerate the inventory with: python inventory.py " + ) + raise ValueError(f"Found {len(errors)} validation error(s)") + + # Track statistics + shapes_processed = 0 + shapes_cleared = 0 + shapes_replaced = 0 + + # Process each slide from inventory + for slide_key, shapes_dict in inventory.items(): + if not slide_key.startswith("slide-"): + continue + + slide_index = int(slide_key.split("-")[1]) + + if slide_index >= len(prs.slides): + print(f"Warning: Slide {slide_index} not found") + continue + + # Process each shape from inventory + for shape_key, shape_data in shapes_dict.items(): + shapes_processed += 1 + + # Get the shape directly from ShapeData + shape = shape_data.shape + if not shape: + print(f"Warning: {shape_key} has no shape reference") + continue + + # ShapeData already validates text_frame in __init__ + text_frame = shape.text_frame # type: ignore + + text_frame.clear() # type: ignore + shapes_cleared += 1 + + # Check for replacement paragraphs + replacement_shape_data = replacements.get(slide_key, {}).get(shape_key, {}) + if "paragraphs" not in replacement_shape_data: + continue + + shapes_replaced += 1 + + # Add replacement paragraphs + for i, para_data in enumerate(replacement_shape_data["paragraphs"]): + if i == 0: + p = text_frame.paragraphs[0] # type: ignore + else: + p = text_frame.add_paragraph() # type: ignore + + apply_paragraph_properties(p, para_data) + + # Check for issues after replacements + # Save to a temporary file and reload to avoid modifying the presentation during inventory + # (extract_text_inventory accesses font.color which adds empty elements) + import tempfile + + with tempfile.NamedTemporaryFile(suffix=".pptx", delete=False) as tmp: + tmp_path = Path(tmp.name) + prs.save(str(tmp_path)) + + try: + updated_inventory = extract_text_inventory(tmp_path) + updated_overflow = detect_frame_overflow(updated_inventory) + finally: + tmp_path.unlink() # Clean up temp file + + # Check if any text overflow got worse + overflow_errors = [] + for slide_key, shape_overflows in updated_overflow.items(): + for shape_key, new_overflow in shape_overflows.items(): + # Get original overflow (0 if there was no overflow before) + original = original_overflow.get(slide_key, 
{}).get(shape_key, 0.0) + + # Error if overflow increased + if new_overflow > original + 0.01: # Small tolerance for rounding + increase = new_overflow - original + overflow_errors.append( + f'{slide_key}/{shape_key}: overflow worsened by {increase:.2f}" ' + f'(was {original:.2f}", now {new_overflow:.2f}")' + ) + + # Collect warnings from updated shapes + warnings = [] + for slide_key, shapes_dict in updated_inventory.items(): + for shape_key, shape_data in shapes_dict.items(): + if shape_data.warnings: + for warning in shape_data.warnings: + warnings.append(f"{slide_key}/{shape_key}: {warning}") + + # Fail if there are any issues + if overflow_errors or warnings: + print("\nERROR: Issues detected in replacement output:") + if overflow_errors: + print("\nText overflow worsened:") + for error in overflow_errors: + print(f" - {error}") + if warnings: + print("\nFormatting warnings:") + for warning in warnings: + print(f" - {warning}") + print("\nPlease fix these issues before saving.") + raise ValueError( + f"Found {len(overflow_errors)} overflow error(s) and {len(warnings)} warning(s)" + ) + + # Save the presentation + prs.save(output_file) + + # Report results + print(f"Saved updated presentation to: {output_file}") + print(f"Processed {len(prs.slides)} slides") + print(f" - Shapes processed: {shapes_processed}") + print(f" - Shapes cleared: {shapes_cleared}") + print(f" - Shapes replaced: {shapes_replaced}") + + +def main(): + """Main entry point for command-line usage.""" + if len(sys.argv) != 4: + print(__doc__) + sys.exit(1) + + input_pptx = Path(sys.argv[1]) + replacements_json = Path(sys.argv[2]) + output_pptx = Path(sys.argv[3]) + + if not input_pptx.exists(): + print(f"Error: Input file '{input_pptx}' not found") + sys.exit(1) + + if not replacements_json.exists(): + print(f"Error: Replacements JSON file '{replacements_json}' not found") + sys.exit(1) + + try: + apply_replacements(str(input_pptx), str(replacements_json), str(output_pptx)) + except Exception as e: + print(f"Error applying replacements: {e}") + import traceback + + traceback.print_exc() + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Office/pptx/scripts/thumbnail.py b/code_puppy/bundled_skills/Office/pptx/scripts/thumbnail.py new file mode 100644 index 00000000..5c7fdf19 --- /dev/null +++ b/code_puppy/bundled_skills/Office/pptx/scripts/thumbnail.py @@ -0,0 +1,450 @@ +#!/usr/bin/env python3 +""" +Create thumbnail grids from PowerPoint presentation slides. + +Creates a grid layout of slide thumbnails with configurable columns (max 6). +Each grid contains up to cols×(cols+1) images. For presentations with more +slides, multiple numbered grid files are created automatically. + +The program outputs the names of all files created. + +Output: +- Single grid: {prefix}.jpg (if slides fit in one grid) +- Multiple grids: {prefix}-1.jpg, {prefix}-2.jpg, etc. 
+ +Grid limits by column count: +- 3 cols: max 12 slides per grid (3×4) +- 4 cols: max 20 slides per grid (4×5) +- 5 cols: max 30 slides per grid (5×6) [default] +- 6 cols: max 42 slides per grid (6×7) + +Usage: + python thumbnail.py input.pptx [output_prefix] [--cols N] [--outline-placeholders] + +Examples: + python thumbnail.py presentation.pptx + # Creates: thumbnails.jpg (using default prefix) + # Outputs: + # Created 1 grid(s): + # - thumbnails.jpg + + python thumbnail.py large-deck.pptx grid --cols 4 + # Creates: grid-1.jpg, grid-2.jpg, grid-3.jpg + # Outputs: + # Created 3 grid(s): + # - grid-1.jpg + # - grid-2.jpg + # - grid-3.jpg + + python thumbnail.py template.pptx analysis --outline-placeholders + # Creates thumbnail grids with red outlines around text placeholders +""" + +import argparse +import subprocess +import sys +import tempfile +from pathlib import Path + +from inventory import extract_text_inventory +from PIL import Image, ImageDraw, ImageFont +from pptx import Presentation + +# Constants +THUMBNAIL_WIDTH = 300 # Fixed thumbnail width in pixels +CONVERSION_DPI = 100 # DPI for PDF to image conversion +MAX_COLS = 6 # Maximum number of columns +DEFAULT_COLS = 5 # Default number of columns +JPEG_QUALITY = 95 # JPEG compression quality + +# Grid layout constants +GRID_PADDING = 20 # Padding between thumbnails +BORDER_WIDTH = 2 # Border width around thumbnails +FONT_SIZE_RATIO = 0.12 # Font size as fraction of thumbnail width +LABEL_PADDING_RATIO = 0.4 # Label padding as fraction of font size + + +def main(): + parser = argparse.ArgumentParser( + description="Create thumbnail grids from PowerPoint slides." + ) + parser.add_argument("input", help="Input PowerPoint file (.pptx)") + parser.add_argument( + "output_prefix", + nargs="?", + default="thumbnails", + help="Output prefix for image files (default: thumbnails, will create prefix.jpg or prefix-N.jpg)", + ) + parser.add_argument( + "--cols", + type=int, + default=DEFAULT_COLS, + help=f"Number of columns (default: {DEFAULT_COLS}, max: {MAX_COLS})", + ) + parser.add_argument( + "--outline-placeholders", + action="store_true", + help="Outline text placeholders with a colored border", + ) + + args = parser.parse_args() + + # Validate columns + cols = min(args.cols, MAX_COLS) + if args.cols > MAX_COLS: + print(f"Warning: Columns limited to {MAX_COLS} (requested {args.cols})") + + # Validate input + input_path = Path(args.input) + if not input_path.exists() or input_path.suffix.lower() != ".pptx": + print(f"Error: Invalid PowerPoint file: {args.input}") + sys.exit(1) + + # Construct output path (always JPG) + output_path = Path(f"{args.output_prefix}.jpg") + + print(f"Processing: {args.input}") + + try: + with tempfile.TemporaryDirectory() as temp_dir: + # Get placeholder regions if outlining is enabled + placeholder_regions = None + slide_dimensions = None + if args.outline_placeholders: + print("Extracting placeholder regions...") + placeholder_regions, slide_dimensions = get_placeholder_regions( + input_path + ) + if placeholder_regions: + print(f"Found placeholders on {len(placeholder_regions)} slides") + + # Convert slides to images + slide_images = convert_to_images(input_path, Path(temp_dir), CONVERSION_DPI) + if not slide_images: + print("Error: No slides found") + sys.exit(1) + + print(f"Found {len(slide_images)} slides") + + # Create grids (max cols×(cols+1) images per grid) + grid_files = create_grids( + slide_images, + cols, + THUMBNAIL_WIDTH, + output_path, + placeholder_regions, + slide_dimensions, + ) + + # Print 
saved files + print(f"Created {len(grid_files)} grid(s):") + for grid_file in grid_files: + print(f" - {grid_file}") + + except Exception as e: + print(f"Error: {e}") + sys.exit(1) + + +def create_hidden_slide_placeholder(size): + """Create placeholder image for hidden slides.""" + img = Image.new("RGB", size, color="#F0F0F0") + draw = ImageDraw.Draw(img) + line_width = max(5, min(size) // 100) + draw.line([(0, 0), size], fill="#CCCCCC", width=line_width) + draw.line([(size[0], 0), (0, size[1])], fill="#CCCCCC", width=line_width) + return img + + +def get_placeholder_regions(pptx_path): + """Extract ALL text regions from the presentation. + + Returns a tuple of (placeholder_regions, slide_dimensions). + text_regions is a dict mapping slide indices to lists of text regions. + Each region is a dict with 'left', 'top', 'width', 'height' in inches. + slide_dimensions is a tuple of (width_inches, height_inches). + """ + prs = Presentation(str(pptx_path)) + inventory = extract_text_inventory(pptx_path, prs) + placeholder_regions = {} + + # Get actual slide dimensions in inches (EMU to inches conversion) + slide_width_inches = (prs.slide_width or 9144000) / 914400.0 + slide_height_inches = (prs.slide_height or 5143500) / 914400.0 + + for slide_key, shapes in inventory.items(): + # Extract slide index from "slide-N" format + slide_idx = int(slide_key.split("-")[1]) + regions = [] + + for shape_key, shape_data in shapes.items(): + # The inventory only contains shapes with text, so all shapes should be highlighted + regions.append( + { + "left": shape_data.left, + "top": shape_data.top, + "width": shape_data.width, + "height": shape_data.height, + } + ) + + if regions: + placeholder_regions[slide_idx] = regions + + return placeholder_regions, (slide_width_inches, slide_height_inches) + + +def convert_to_images(pptx_path, temp_dir, dpi): + """Convert PowerPoint to images via PDF, handling hidden slides.""" + # Detect hidden slides + print("Analyzing presentation...") + prs = Presentation(str(pptx_path)) + total_slides = len(prs.slides) + + # Find hidden slides (1-based indexing for display) + hidden_slides = { + idx + 1 + for idx, slide in enumerate(prs.slides) + if slide.element.get("show") == "0" + } + + print(f"Total slides: {total_slides}") + if hidden_slides: + print(f"Hidden slides: {sorted(hidden_slides)}") + + pdf_path = temp_dir / f"{pptx_path.stem}.pdf" + + # Convert to PDF + print("Converting to PDF...") + result = subprocess.run( + [ + "soffice", + "--headless", + "--convert-to", + "pdf", + "--outdir", + str(temp_dir), + str(pptx_path), + ], + capture_output=True, + text=True, + ) + if result.returncode != 0 or not pdf_path.exists(): + raise RuntimeError("PDF conversion failed") + + # Convert PDF to images + print(f"Converting to images at {dpi} DPI...") + result = subprocess.run( + ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf_path), str(temp_dir / "slide")], + capture_output=True, + text=True, + ) + if result.returncode != 0: + raise RuntimeError("Image conversion failed") + + visible_images = sorted(temp_dir.glob("slide-*.jpg")) + + # Create full list with placeholders for hidden slides + all_images = [] + visible_idx = 0 + + # Get placeholder dimensions from first visible slide + if visible_images: + with Image.open(visible_images[0]) as img: + placeholder_size = img.size + else: + placeholder_size = (1920, 1080) + + for slide_num in range(1, total_slides + 1): + if slide_num in hidden_slides: + # Create placeholder image for hidden slide + placeholder_path = temp_dir / 
f"hidden-{slide_num:03d}.jpg" + placeholder_img = create_hidden_slide_placeholder(placeholder_size) + placeholder_img.save(placeholder_path, "JPEG") + all_images.append(placeholder_path) + else: + # Use the actual visible slide image + if visible_idx < len(visible_images): + all_images.append(visible_images[visible_idx]) + visible_idx += 1 + + return all_images + + +def create_grids( + image_paths, + cols, + width, + output_path, + placeholder_regions=None, + slide_dimensions=None, +): + """Create multiple thumbnail grids from slide images, max cols×(cols+1) images per grid.""" + # Maximum images per grid is cols × (cols + 1) for better proportions + max_images_per_grid = cols * (cols + 1) + grid_files = [] + + print( + f"Creating grids with {cols} columns (max {max_images_per_grid} images per grid)" + ) + + # Split images into chunks + for chunk_idx, start_idx in enumerate( + range(0, len(image_paths), max_images_per_grid) + ): + end_idx = min(start_idx + max_images_per_grid, len(image_paths)) + chunk_images = image_paths[start_idx:end_idx] + + # Create grid for this chunk + grid = create_grid( + chunk_images, cols, width, start_idx, placeholder_regions, slide_dimensions + ) + + # Generate output filename + if len(image_paths) <= max_images_per_grid: + # Single grid - use base filename without suffix + grid_filename = output_path + else: + # Multiple grids - insert index before extension with dash + stem = output_path.stem + suffix = output_path.suffix + grid_filename = output_path.parent / f"{stem}-{chunk_idx + 1}{suffix}" + + # Save grid + grid_filename.parent.mkdir(parents=True, exist_ok=True) + grid.save(str(grid_filename), quality=JPEG_QUALITY) + grid_files.append(str(grid_filename)) + + return grid_files + + +def create_grid( + image_paths, + cols, + width, + start_slide_num=0, + placeholder_regions=None, + slide_dimensions=None, +): + """Create thumbnail grid from slide images with optional placeholder outlining.""" + font_size = int(width * FONT_SIZE_RATIO) + label_padding = int(font_size * LABEL_PADDING_RATIO) + + # Get dimensions + with Image.open(image_paths[0]) as img: + aspect = img.height / img.width + height = int(width * aspect) + + # Calculate grid size + rows = (len(image_paths) + cols - 1) // cols + grid_w = cols * width + (cols + 1) * GRID_PADDING + grid_h = rows * (height + font_size + label_padding * 2) + (rows + 1) * GRID_PADDING + + # Create grid + grid = Image.new("RGB", (grid_w, grid_h), "white") + draw = ImageDraw.Draw(grid) + + # Load font with size based on thumbnail width + try: + # Use Pillow's default font with size + font = ImageFont.load_default(size=font_size) + except Exception: + # Fall back to basic default font if size parameter not supported + font = ImageFont.load_default() + + # Place thumbnails + for i, img_path in enumerate(image_paths): + row, col = i // cols, i % cols + x = col * width + (col + 1) * GRID_PADDING + y_base = ( + row * (height + font_size + label_padding * 2) + (row + 1) * GRID_PADDING + ) + + # Add label with actual slide number + label = f"{start_slide_num + i}" + bbox = draw.textbbox((0, 0), label, font=font) + text_w = bbox[2] - bbox[0] + draw.text( + (x + (width - text_w) // 2, y_base + label_padding), + label, + fill="black", + font=font, + ) + + # Add thumbnail below label with proportional spacing + y_thumbnail = y_base + label_padding + font_size + label_padding + + with Image.open(img_path) as img: + # Get original dimensions before thumbnail + orig_w, orig_h = img.size + + # Apply placeholder outlines if enabled + if 
placeholder_regions and (start_slide_num + i) in placeholder_regions: + # Convert to RGBA for transparency support + if img.mode != "RGBA": + img = img.convert("RGBA") + + # Get the regions for this slide + regions = placeholder_regions[start_slide_num + i] + + # Calculate scale factors using actual slide dimensions + if slide_dimensions: + slide_width_inches, slide_height_inches = slide_dimensions + else: + # Fallback: estimate from image size at CONVERSION_DPI + slide_width_inches = orig_w / CONVERSION_DPI + slide_height_inches = orig_h / CONVERSION_DPI + + x_scale = orig_w / slide_width_inches + y_scale = orig_h / slide_height_inches + + # Create a highlight overlay + overlay = Image.new("RGBA", img.size, (255, 255, 255, 0)) + overlay_draw = ImageDraw.Draw(overlay) + + # Highlight each placeholder region + for region in regions: + # Convert from inches to pixels in the original image + px_left = int(region["left"] * x_scale) + px_top = int(region["top"] * y_scale) + px_width = int(region["width"] * x_scale) + px_height = int(region["height"] * y_scale) + + # Draw highlight outline with red color and thick stroke + # Using a bright red outline instead of fill + stroke_width = max( + 5, min(orig_w, orig_h) // 150 + ) # Thicker proportional stroke width + overlay_draw.rectangle( + [(px_left, px_top), (px_left + px_width, px_top + px_height)], + outline=(255, 0, 0, 255), # Bright red, fully opaque + width=stroke_width, + ) + + # Composite the overlay onto the image using alpha blending + img = Image.alpha_composite(img, overlay) + # Convert back to RGB for JPEG saving + img = img.convert("RGB") + + img.thumbnail((width, height), Image.Resampling.LANCZOS) + w, h = img.size + tx = x + (width - w) // 2 + ty = y_thumbnail + (height - h) // 2 + grid.paste(img, (tx, ty)) + + # Add border + if BORDER_WIDTH > 0: + draw.rectangle( + [ + (tx - BORDER_WIDTH, ty - BORDER_WIDTH), + (tx + w + BORDER_WIDTH - 1, ty + h + BORDER_WIDTH - 1), + ], + outline="gray", + width=BORDER_WIDTH, + ) + + return grid + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/Office/xlsx/SKILL.md b/code_puppy/bundled_skills/Office/xlsx/SKILL.md new file mode 100644 index 00000000..2e6a9a0c --- /dev/null +++ b/code_puppy/bundled_skills/Office/xlsx/SKILL.md @@ -0,0 +1,289 @@ +--- +name: xlsx +description: "Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Ticca needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas" +license: Proprietary. LICENSE.txt has complete terms +--- + +# Requirements for Outputs + +## All Excel files + +### Zero Formula Errors +- Every Excel model MUST be delivered with ZERO formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?) 
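+
+As a quick pre-delivery sanity check, a minimal sketch of such an error scan might look like the following (openpyxl only; `model.xlsx` is a placeholder filename, and the bundled `recalc.py` described below remains the full recalculation-and-verification workflow):
+
+```python
+from openpyxl import load_workbook
+
+ERROR_VALUES = ("#REF!", "#DIV/0!", "#VALUE!", "#N/A", "#NAME?", "#NULL!", "#NUM!")
+
+# data_only=True reads the cached results of formulas rather than the formula strings
+wb = load_workbook("model.xlsx", data_only=True)
+bad_cells = [
+    f"{ws.title}!{cell.coordinate}"
+    for ws in wb.worksheets
+    for row in ws.iter_rows()
+    for cell in row
+    if isinstance(cell.value, str) and cell.value.startswith(ERROR_VALUES)
+]
+print(bad_cells or "No formula errors found")
+```
+
+Cached values only exist after an application (Excel or LibreOffice) has calculated the workbook, so run `recalc.py` first on files written by openpyxl.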
+ +### Preserve Existing Templates (when updating templates) +- Study and EXACTLY match existing format, style, and conventions when modifying files +- Never impose standardized formatting on files with established patterns +- Existing template conventions ALWAYS override these guidelines + +## Financial models + +### Color Coding Standards +Unless otherwise stated by the user or existing template + +#### Industry-Standard Color Conventions +- **Blue text (RGB: 0,0,255)**: Hardcoded inputs, and numbers users will change for scenarios +- **Black text (RGB: 0,0,0)**: ALL formulas and calculations +- **Green text (RGB: 0,128,0)**: Links pulling from other worksheets within same workbook +- **Red text (RGB: 255,0,0)**: External links to other files +- **Yellow background (RGB: 255,255,0)**: Key assumptions needing attention or cells that need to be updated + +### Number Formatting Standards + +#### Required Format Rules +- **Years**: Format as text strings (e.g., "2024" not "2,024") +- **Currency**: Use $#,##0 format; ALWAYS specify units in headers ("Revenue ($mm)") +- **Zeros**: Use number formatting to make all zeros "-", including percentages (e.g., "$#,##0;($#,##0);-") +- **Percentages**: Default to 0.0% format (one decimal) +- **Multiples**: Format as 0.0x for valuation multiples (EV/EBITDA, P/E) +- **Negative numbers**: Use parentheses (123) not minus -123 + +### Formula Construction Rules + +#### Assumptions Placement +- Place ALL assumptions (growth rates, margins, multiples, etc.) in separate assumption cells +- Use cell references instead of hardcoded values in formulas +- Example: Use =B5*(1+$B$6) instead of =B5*1.05 + +#### Formula Error Prevention +- Verify all cell references are correct +- Check for off-by-one errors in ranges +- Ensure consistent formulas across all projection periods +- Test with edge cases (zero values, negative numbers) +- Verify no unintended circular references + +#### Documentation Requirements for Hardcodes +- Comment or in cells beside (if end of table). Format: "Source: [System/Document], [Date], [Specific Reference], [URL if applicable]" +- Examples: + - "Source: Company 10-K, FY2024, Page 45, Revenue Note, [SEC EDGAR URL]" + - "Source: Company 10-Q, Q2 2025, Exhibit 99.1, [SEC EDGAR URL]" + - "Source: Bloomberg Terminal, 8/15/2025, AAPL US Equity" + - "Source: FactSet, 8/20/2025, Consensus Estimates Screen" + +# XLSX creation, editing, and analysis + +## Overview + +A user may ask you to create, edit, or analyze the contents of an .xlsx file. You have different tools and workflows available for different tasks. + +## Important Requirements + +**LibreOffice Required for Formula Recalculation**: You can assume LibreOffice is installed for recalculating formula values using the `recalc.py` script. 
The script automatically configures LibreOffice on first run + +## Reading and analyzing data + +### Data analysis with pandas +For data analysis, visualization, and basic operations, use **pandas** which provides powerful data manipulation capabilities: + +```python +import pandas as pd + +# Read Excel +df = pd.read_excel('file.xlsx') # Default: first sheet +all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict + +# Analyze +df.head() # Preview data +df.info() # Column info +df.describe() # Statistics + +# Write Excel +df.to_excel('output.xlsx', index=False) +``` + +## Excel File Workflows + +## CRITICAL: Use Formulas, Not Hardcoded Values + +**Always use Excel formulas instead of calculating values in Python and hardcoding them.** This ensures the spreadsheet remains dynamic and updateable. + +### ❌ WRONG - Hardcoding Calculated Values +```python +# Bad: Calculating in Python and hardcoding result +total = df['Sales'].sum() +sheet['B10'] = total # Hardcodes 5000 + +# Bad: Computing growth rate in Python +growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue'] +sheet['C5'] = growth # Hardcodes 0.15 + +# Bad: Python calculation for average +avg = sum(values) / len(values) +sheet['D20'] = avg # Hardcodes 42.5 +``` + +### ✅ CORRECT - Using Excel Formulas +```python +# Good: Let Excel calculate the sum +sheet['B10'] = '=SUM(B2:B9)' + +# Good: Growth rate as Excel formula +sheet['C5'] = '=(C4-C2)/C2' + +# Good: Average using Excel function +sheet['D20'] = '=AVERAGE(D2:D19)' +``` + +This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes. + +## Common Workflow +1. **Choose tool**: pandas for data, openpyxl for formulas/formatting +2. **Create/Load**: Create new workbook or load existing file +3. **Modify**: Add/edit data, formulas, and formatting +4. **Save**: Write to file +5. **Recalculate formulas (MANDATORY IF USING FORMULAS)**: Use the recalc.py script + ```bash + python recalc.py output.xlsx + ``` +6. 
**Verify and fix any errors**: + - The script returns JSON with error details + - If `status` is `errors_found`, check `error_summary` for specific error types and locations + - Fix the identified errors and recalculate again + - Common errors to fix: + - `#REF!`: Invalid cell references + - `#DIV/0!`: Division by zero + - `#VALUE!`: Wrong data type in formula + - `#NAME?`: Unrecognized formula name + +### Creating new Excel files + +```python +# Using openpyxl for formulas and formatting +from openpyxl import Workbook +from openpyxl.styles import Font, PatternFill, Alignment + +wb = Workbook() +sheet = wb.active + +# Add data +sheet['A1'] = 'Hello' +sheet['B1'] = 'World' +sheet.append(['Row', 'of', 'data']) + +# Add formula +sheet['B2'] = '=SUM(A1:A10)' + +# Formatting +sheet['A1'].font = Font(bold=True, color='FF0000') +sheet['A1'].fill = PatternFill('solid', start_color='FFFF00') +sheet['A1'].alignment = Alignment(horizontal='center') + +# Column width +sheet.column_dimensions['A'].width = 20 + +wb.save('output.xlsx') +``` + +### Editing existing Excel files + +```python +# Using openpyxl to preserve formulas and formatting +from openpyxl import load_workbook + +# Load existing file +wb = load_workbook('existing.xlsx') +sheet = wb.active # or wb['SheetName'] for specific sheet + +# Working with multiple sheets +for sheet_name in wb.sheetnames: + sheet = wb[sheet_name] + print(f"Sheet: {sheet_name}") + +# Modify cells +sheet['A1'] = 'New Value' +sheet.insert_rows(2) # Insert row at position 2 +sheet.delete_cols(3) # Delete column 3 + +# Add new sheet +new_sheet = wb.create_sheet('NewSheet') +new_sheet['A1'] = 'Data' + +wb.save('modified.xlsx') +``` + +## Recalculating formulas + +Excel files created or modified by openpyxl contain formulas as strings but not calculated values. Use the provided `recalc.py` script to recalculate formulas: + +```bash +python recalc.py [timeout_seconds] +``` + +Example: +```bash +python recalc.py output.xlsx 30 +``` + +The script: +- Automatically sets up LibreOffice macro on first run +- Recalculates all formulas in all sheets +- Scans ALL cells for Excel errors (#REF!, #DIV/0!, etc.) +- Returns JSON with detailed error locations and counts +- Works on both Linux and macOS + +## Formula Verification Checklist + +Quick checks to ensure formulas work correctly: + +### Essential Verification +- [ ] **Test 2-3 sample references**: Verify they pull correct values before building full model +- [ ] **Column mapping**: Confirm Excel columns match (e.g., column 64 = BL, not BK) +- [ ] **Row offset**: Remember Excel rows are 1-indexed (DataFrame row 5 = Excel row 6) + +### Common Pitfalls +- [ ] **NaN handling**: Check for null values with `pd.notna()` +- [ ] **Far-right columns**: FY data often in columns 50+ +- [ ] **Multiple matches**: Search all occurrences, not just first +- [ ] **Division by zero**: Check denominators before using `/` in formulas (#DIV/0!) +- [ ] **Wrong references**: Verify all cell references point to intended cells (#REF!) 
+- [ ] **Cross-sheet references**: Use correct format (Sheet1!A1) for linking sheets + +### Formula Testing Strategy +- [ ] **Start small**: Test formulas on 2-3 cells before applying broadly +- [ ] **Verify dependencies**: Check all cells referenced in formulas exist +- [ ] **Test edge cases**: Include zero, negative, and very large values + +### Interpreting recalc.py Output +The script returns JSON with error details: +```json +{ + "status": "success", // or "errors_found" + "total_errors": 0, // Total error count + "total_formulas": 42, // Number of formulas in file + "error_summary": { // Only present if errors found + "#REF!": { + "count": 2, + "locations": ["Sheet1!B5", "Sheet1!C10"] + } + } +} +``` + +## Best Practices + +### Library Selection +- **pandas**: Best for data analysis, bulk operations, and simple data export +- **openpyxl**: Best for complex formatting, formulas, and Excel-specific features + +### Working with openpyxl +- Cell indices are 1-based (row=1, column=1 refers to cell A1) +- Use `data_only=True` to read calculated values: `load_workbook('file.xlsx', data_only=True)` +- **Warning**: If opened with `data_only=True` and saved, formulas are replaced with values and permanently lost +- For large files: Use `read_only=True` for reading or `write_only=True` for writing +- Formulas are preserved but not evaluated - use recalc.py to update values + +### Working with pandas +- Specify data types to avoid inference issues: `pd.read_excel('file.xlsx', dtype={'id': str})` +- For large files, read specific columns: `pd.read_excel('file.xlsx', usecols=['A', 'C', 'E'])` +- Handle dates properly: `pd.read_excel('file.xlsx', parse_dates=['date_column'])` + +## Code Style Guidelines +**IMPORTANT**: When generating Python code for Excel operations: +- Write minimal, concise Python code without unnecessary comments +- Avoid verbose variable names and redundant operations +- Avoid unnecessary print statements + +**For Excel files themselves**: +- Add comments to cells with complex formulas or important assumptions +- Document data sources for hardcoded values +- Include notes for key calculations and model sections \ No newline at end of file diff --git a/code_puppy/bundled_skills/Office/xlsx/recalc.py b/code_puppy/bundled_skills/Office/xlsx/recalc.py new file mode 100644 index 00000000..665dc00a --- /dev/null +++ b/code_puppy/bundled_skills/Office/xlsx/recalc.py @@ -0,0 +1,247 @@ +#!/usr/bin/env python3 +""" +Excel Formula Recalculation Script +Recalculates all formulas in an Excel file using LibreOffice +""" + +import json +import os +import platform +import subprocess +import sys +import time +from pathlib import Path + +from openpyxl import load_workbook + +# Platform-specific LibreOffice macro directory +MACRO_DIR_MACOS = "~/Library/Application Support/LibreOffice/4/user/basic/Standard" +MACRO_DIR_LINUX = "~/.config/libreoffice/4/user/basic/Standard" +MACRO_FILENAME = "Module1.xba" + +# LibreOffice Basic macro for recalculation +RECALCULATE_MACRO = """ + + + Sub RecalculateAndSave() + ThisComponent.calculateAll() + ThisComponent.store() + ThisComponent.close(True) + End Sub +""" + + +def ensure_xvfb_running(): + """Ensure Xvfb is running on display :99 for headless Linux environments""" + # Only needed in headless Linux environments (no DISPLAY set) + if platform.system() != "Linux" or os.environ.get("DISPLAY"): + return + + # Check if Xvfb is already running on :99 + try: + result = subprocess.run( + ["pgrep", "-f", "Xvfb.*:99"], capture_output=True, text=True + ) + if 
result.returncode == 0 and result.stdout.strip(): + os.environ["DISPLAY"] = ":99" + return + except FileNotFoundError: + pass + + # Start Xvfb (leave it running for subsequent calls) + try: + subprocess.Popen( + ["Xvfb", ":99", "-screen", "0", "1024x768x24"], + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + ) + except FileNotFoundError: + raise RuntimeError("Xvfb not found - install with: apt-get install xvfb") + + os.environ["DISPLAY"] = ":99" + + # Wait for Xvfb to be ready (poll socket file) + socket_path = "/tmp/.X11-unix/X99" + for _ in range(20): # Up to 2 seconds + if os.path.exists(socket_path): + return + time.sleep(0.1) + raise RuntimeError("Xvfb started but socket not ready") + + +def has_gtimeout(): + """Check if gtimeout is available on macOS""" + try: + subprocess.run( + ["gtimeout", "--version"], capture_output=True, timeout=1, check=False + ) + return True + except (FileNotFoundError, subprocess.TimeoutExpired): + return False + + +def setup_libreoffice_macro(): + """Setup LibreOffice macro for recalculation if not already configured""" + macro_dir = os.path.expanduser( + MACRO_DIR_MACOS if platform.system() == "Darwin" else MACRO_DIR_LINUX + ) + macro_file = os.path.join(macro_dir, MACRO_FILENAME) + + # Check if macro already exists + if ( + os.path.exists(macro_file) + and "RecalculateAndSave" in Path(macro_file).read_text() + ): + return True + + # Create macro directory if needed + if not os.path.exists(macro_dir): + subprocess.run( + ["soffice", "--headless", "--terminate_after_init"], + capture_output=True, + timeout=10, + ) + os.makedirs(macro_dir, exist_ok=True) + + # Write macro file + try: + Path(macro_file).write_text(RECALCULATE_MACRO) + return True + except Exception: + return False + + +def recalc(filename, timeout=30): + """ + Recalculate formulas in Excel file and report any errors + + Args: + filename: Path to Excel file + timeout: Maximum time to wait for recalculation (seconds) + + Returns: + dict with error locations and counts + """ + if not Path(filename).exists(): + return {"error": f"File {filename} does not exist"} + + abs_path = str(Path(filename).absolute()) + + if not setup_libreoffice_macro(): + return {"error": "Failed to setup LibreOffice macro"} + + # Ensure Xvfb is running for headless Unix environments + ensure_xvfb_running() + + cmd = [ + "soffice", + "--headless", + "--norestore", + "vnd.sun.star.script:Standard.Module1.RecalculateAndSave?language=Basic&location=application", + abs_path, + ] + + # Wrap command with timeout utility if available + if platform.system() == "Linux": + cmd = ["timeout", str(timeout)] + cmd + elif platform.system() == "Darwin" and has_gtimeout(): + cmd = ["gtimeout", str(timeout)] + cmd + + result = subprocess.run(cmd, capture_output=True, text=True) + + if result.returncode != 0 and result.returncode != 124: # 124 is timeout exit code + error_msg = result.stderr or "Unknown error during recalculation" + if "Module1" in error_msg or "RecalculateAndSave" not in error_msg: + return {"error": "LibreOffice macro not configured properly"} + return {"error": error_msg} + + # Check for Excel errors in the recalculated file - scan ALL cells + try: + wb = load_workbook(filename, data_only=True) + + excel_errors = [ + "#VALUE!", + "#DIV/0!", + "#REF!", + "#NAME?", + "#NULL!", + "#NUM!", + "#N/A", + ] + error_details = {err: [] for err in excel_errors} + total_errors = 0 + + for sheet_name in wb.sheetnames: + ws = wb[sheet_name] + # Check ALL rows and columns - no limits + for row in ws.iter_rows(): + for cell 
in row: + if cell.value is not None and isinstance(cell.value, str): + for err in excel_errors: + if err in cell.value: + location = f"{sheet_name}!{cell.coordinate}" + error_details[err].append(location) + total_errors += 1 + break + + wb.close() + + # Build result summary + result = { + "status": "success" if total_errors == 0 else "errors_found", + "total_errors": total_errors, + "error_summary": {}, + } + + # Add non-empty error categories + for err_type, locations in error_details.items(): + if locations: + result["error_summary"][err_type] = { + "count": len(locations), + "locations": locations[:20], # Show up to 20 locations + } + + # Add formula count for context - also check ALL cells + wb_formulas = load_workbook(filename, data_only=False) + formula_count = 0 + for sheet_name in wb_formulas.sheetnames: + ws = wb_formulas[sheet_name] + for row in ws.iter_rows(): + for cell in row: + if ( + cell.value + and isinstance(cell.value, str) + and cell.value.startswith("=") + ): + formula_count += 1 + wb_formulas.close() + + result["total_formulas"] = formula_count + + return result + + except Exception as e: + return {"error": str(e)} + + +def main(): + if len(sys.argv) < 2: + print("Usage: python recalc.py [timeout_seconds]") + print("\nRecalculates all formulas in an Excel file using LibreOffice") + print("\nReturns JSON with error details:") + print(" - status: 'success' or 'errors_found'") + print(" - total_errors: Total number of Excel errors found") + print(" - total_formulas: Number of formulas in the file") + print(" - error_summary: Breakdown by error type with locations") + print(" - #VALUE!, #DIV/0!, #REF!, #NAME?, #NULL!, #NUM!, #N/A") + sys.exit(1) + + filename = sys.argv[1] + timeout = int(sys.argv[2]) if len(sys.argv) > 2 else 30 + + result = recalc(filename, timeout) + print(json.dumps(result, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/code_puppy/bundled_skills/ProductManagement/competitive-analysis/SKILL.md b/code_puppy/bundled_skills/ProductManagement/competitive-analysis/SKILL.md new file mode 100644 index 00000000..9cb3b4c4 --- /dev/null +++ b/code_puppy/bundled_skills/ProductManagement/competitive-analysis/SKILL.md @@ -0,0 +1,201 @@ +--- +name: competitive-analysis +description: Analyze competitors with feature comparison matrices, positioning analysis, and strategic implications. Use when researching a competitor, comparing product capabilities, assessing competitive positioning, or preparing a competitive brief for product strategy. +--- + +# Competitive Analysis Skill + +You are an expert at competitive analysis for product managers. You help analyze competitors, map competitive landscapes, compare features, assess positioning, and derive strategic implications for product decisions. + +## Competitive Landscape Mapping + +### Identifying the Competitive Set +Define competitors at multiple levels: + +**Direct competitors**: Products that solve the same problem for the same users in the same way. +- These are the products your customers actively evaluate against you +- They appear in your deals, in customer comparisons, in review site matchups + +**Indirect competitors**: Products that solve the same problem but differently. +- Different approach to the same user need (e.g., spreadsheets vs dedicated project management tool) +- Include "non-consumption" — sometimes the competitor is doing nothing or using a manual process + +**Adjacent competitors**: Products that do not compete today but could. 
+- Companies with similar technology, customer base, or distribution that could expand into your space +- Larger platforms that could add your functionality as a feature +- Startups attacking a niche that could grow into your core market + +**Substitute solutions**: Entirely different ways users solve the underlying need. +- Hiring a person instead of buying software +- Using a general-purpose tool (Excel, email) instead of a specialized one +- Outsourcing the process entirely + +### Landscape Map +Position competitors on meaningful dimensions: + +**Common axes**: +- Breadth vs depth (suite vs point solution) +- SMB vs enterprise (market segment focus) +- Self-serve vs sales-led (go-to-market approach) +- Simple vs powerful (product complexity) +- Horizontal vs vertical (general purpose vs industry-specific) + +Choose axes that reveal strategic positioning differences relevant to your market. The right axes make competitive dynamics visible. + +### Monitoring the Landscape +Track competitive movements over time: +- Product launches and feature releases (changelogs, blog posts, press releases) +- Pricing and packaging changes +- Funding rounds and acquisitions +- Key hires and job postings (signal strategic direction) +- Customer wins and losses (especially your wins/losses) +- Analyst and review coverage +- Partnership announcements + +## Feature Comparison Matrices + +### Building a Feature Comparison +1. **Define capability areas**: Group features into functional categories that matter to buyers (not your internal architecture). Use the categories buyers use when evaluating. +2. **List specific capabilities**: Under each area, list the specific features or capabilities to compare. +3. **Rate each competitor**: Use a consistent rating scale. + +### Rating Scale Options + +**Simple (recommended for most cases)**: +- Strong: Market-leading capability. Deep functionality, well-executed. +- Adequate: Functional capability. Gets the job done but not differentiated. +- Weak: Exists but limited. Significant gaps or poor execution. +- Absent: Does not have this capability. + +**Detailed (for deep-dive comparisons)**: +- 5: Best-in-class. Defines the standard others aspire to. +- 4: Strong. Fully-featured and well-executed. +- 3: Adequate. Meets basic needs without differentiation. +- 2: Limited. Exists but with significant gaps. +- 1: Minimal. Barely functional or in early beta. +- 0: Absent. Not available. + +### Comparison Matrix Template +``` +| Capability Area | Our Product | Competitor A | Competitor B | +|----------------|-------------|-------------|-------------| +| [Area 1] | | | | +| [Feature 1] | Strong | Adequate | Absent | +| [Feature 2] | Adequate | Strong | Weak | +| [Area 2] | | | | +| [Feature 3] | Strong | Strong | Adequate | +``` + +### Tips for Feature Comparison +- Rate based on real product experience, customer feedback, and reviews — not just marketing claims +- Features exist on a spectrum. "Has feature X" is less useful than "How well does it do X?" +- Weight the comparison by what matters to your target customers, not by total feature count +- Update regularly — feature comparisons get stale fast +- Be honest about where competitors are ahead. A comparison that always shows you winning is not credible. +- Include the "why it matters" for each capability area. Not all features matter equally to buyers. 
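+
+One way to apply the weighting tip above is to convert the simple rating scale to numbers and compute a buyer-weighted score per competitor. The sketch below is a minimal illustration; the capability areas, weights, and ratings are hypothetical placeholders, not a prescribed scoring model:
+
+```python
+# Hypothetical sketch: turn the Strong/Adequate/Weak/Absent scale into numbers
+# and weight each capability area by how much it matters to the target buyer.
+RATING_VALUES = {"Strong": 3, "Adequate": 2, "Weak": 1, "Absent": 0}
+
+# Illustrative weights (sum to 1.0) and ratings; replace with your own areas.
+weights = {"Reporting": 0.5, "Integrations": 0.3, "Permissions": 0.2}
+ratings = {
+    "Our Product":  {"Reporting": "Strong", "Integrations": "Adequate", "Permissions": "Strong"},
+    "Competitor A": {"Reporting": "Adequate", "Integrations": "Strong", "Permissions": "Weak"},
+}
+
+for product, scores in ratings.items():
+    weighted = sum(weights[area] * RATING_VALUES[scores[area]] for area in weights)
+    print(f"{product}: {weighted:.2f} out of {3 * sum(weights.values()):.1f}")
+```
+
+A weighted view like this keeps the conversation focused on what buyers care about rather than raw feature counts.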
+ +## Positioning Analysis Frameworks + +### Positioning Statement Analysis +For each competitor, extract their positioning: + +**Template**: For [target customer] who [need/problem], [Product] is a [category] that [key benefit]. Unlike [competitor/alternative], [Product] [key differentiator]. + +**Sources for positioning**: +- Homepage headline and subheadline +- Product description on app stores or review sites +- Sales pitch decks (sometimes leaked or shared by prospects) +- Analyst briefing materials +- Earnings call language (for public companies) + +### Message Architecture Analysis +How does each competitor communicate value? + +**Level 1 — Category**: What category do they claim? (CRM, project management, collaboration platform) +**Level 2 — Differentiator**: What makes them different within that category? (AI-powered, all-in-one, developer-first) +**Level 3 — Value Proposition**: What outcome do they promise? (Close deals faster, ship products faster, never miss a deadline) +**Level 4 — Proof Points**: What evidence do they provide? (Customer logos, metrics, awards, case studies) + +### Positioning Gaps and Opportunities +Look for: +- **Unclaimed positions**: Value propositions no competitor owns that matter to buyers +- **Crowded positions**: Claims every competitor makes that have lost meaning +- **Emerging positions**: New value propositions driven by market changes (AI, remote work, compliance) +- **Vulnerable positions**: Claims competitors make that they cannot fully deliver on + +## Win/Loss Analysis Methodology + +### Conducting Win/Loss Analysis +Win/loss analysis reveals why you actually win and lose deals. It is the most actionable competitive intelligence. + +**Data sources**: +- CRM notes from sales team (available immediately, but biased) +- Customer interviews shortly after decision (most valuable, least biased) +- Churned customer surveys or exit interviews +- Prospect surveys (for lost deals) + +### Win/Loss Interview Questions +For wins: +- What problem were you trying to solve? +- What alternatives did you evaluate? (Reveals competitive set) +- Why did you choose us over alternatives? +- What almost made you choose someone else? +- What would we need to lose for you to reconsider? + +For losses: +- What problem were you trying to solve? +- What did you end up choosing? Why? +- Where did our product fall short? +- What could we have done differently? +- Would you reconsider us in the future? Under what conditions? + +### Analyzing Win/Loss Data +- Track win/loss reasons over time. Are patterns changing? +- Segment by deal type: enterprise vs SMB, new vs expansion, industry vertical +- Identify the top 3-5 reasons for wins and losses +- Distinguish between product reasons (features, quality) and non-product reasons (pricing, brand, relationship, timing) +- Calculate competitive win rates by competitor: what % of deals involving each competitor do you win? 
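+
+The last calculation above is straightforward once deals are tagged with the competitors involved. A minimal sketch, assuming a hypothetical list of closed deals rather than any particular CRM export:
+
+```python
+# Hypothetical sketch: competitive win rate per competitor =
+# deals won with that competitor present / all closed deals with that competitor present.
+from collections import defaultdict
+
+deals = [  # illustrative data, not real CRM records
+    {"outcome": "won",  "competitors": ["Competitor A"]},
+    {"outcome": "lost", "competitors": ["Competitor A", "Competitor B"]},
+    {"outcome": "won",  "competitors": ["Competitor B"]},
+    {"outcome": "lost", "competitors": []},  # no competitor involved; not counted below
+]
+
+totals, wins = defaultdict(int), defaultdict(int)
+for deal in deals:
+    for competitor in deal["competitors"]:
+        totals[competitor] += 1
+        if deal["outcome"] == "won":
+            wins[competitor] += 1
+
+for competitor in sorted(totals):
+    print(f"{competitor}: {wins[competitor] / totals[competitor]:.0%} win rate "
+          f"({totals[competitor]} competitive deals)")
+```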
+ +### Common Win/Loss Patterns +- **Feature gap**: Competitor has a specific capability you lack that is a dealbreaker +- **Integration advantage**: Competitor integrates with tools the buyer already uses +- **Pricing structure**: Not always cheaper — sometimes different pricing model (per-seat vs usage-based) fits better +- **Incumbent advantage**: Buyer sticks with what they have because switching cost is too high +- **Sales execution**: Better demo, faster response, more relevant case studies +- **Brand/trust**: Buyer chooses the safer or more well-known option + +## Market Trend Identification + +### Sources for Trend Identification +- **Industry analyst reports**: Gartner, Forrester, IDC for market sizing and trends +- **Venture capital**: What are VCs funding? Investment themes signal where smart money sees opportunity. +- **Conference themes**: What are industry events focusing on? What topics draw the biggest audiences? +- **Technology shifts**: New platforms, APIs, or capabilities that enable new product categories +- **Regulatory changes**: New regulations that create requirements or opportunities +- **Customer behavior changes**: How are user expectations evolving? (e.g., mobile-first, AI-assisted, privacy-conscious) +- **Talent movement**: Where are top people going? What skills are in demand? + +### Trend Analysis Framework +For each trend identified: + +1. **What is changing?**: Describe the trend clearly and specifically +2. **Why now?**: What is driving this change? (Technology, regulation, behavior, economics) +3. **Who is affected?**: Which customer segments or market categories? +4. **What is the timeline?**: Is this happening now, in 1-2 years, or 3-5 years? +5. **What is the implication for us?**: How should this influence our product strategy? +6. **What are competitors doing?**: How are competitors responding to this trend? + +### Separating Signal from Noise +- **Signals**: Trends backed by behavioral data, growing investment, regulatory action, or customer demand +- **Noise**: Trends backed only by media hype, conference buzz, or competitor announcements without customer traction +- Test trends against your own customer data: are YOUR customers asking for this or experiencing this change? +- Be wary of "trend of the year" hype cycles. Many trends that dominate industry conversation do not materially affect your customers for years. + +### Strategic Response Options +For each significant trend: +- **Lead**: Invest early and try to define the category or approach. High risk, high reward. +- **Fast follow**: Wait for early signals of customer demand, then move quickly. Lower risk but harder to differentiate. +- **Monitor**: Track the trend but do not invest yet. Set triggers for when to act. +- **Ignore**: Explicitly decide this trend is not relevant to your strategy. Document why. + +The right response depends on: your competitive position, your customer base, your resources, and how fast the trend is moving. diff --git a/code_puppy/bundled_skills/ProductManagement/feature-spec/SKILL.md b/code_puppy/bundled_skills/ProductManagement/feature-spec/SKILL.md new file mode 100644 index 00000000..d9efe2cd --- /dev/null +++ b/code_puppy/bundled_skills/ProductManagement/feature-spec/SKILL.md @@ -0,0 +1,176 @@ +--- +name: feature-spec +description: Write structured product requirements documents (PRDs) with problem statements, user stories, requirements, and success metrics. 
Use when speccing a new feature, writing a PRD, defining acceptance criteria, prioritizing requirements, or documenting product decisions. +--- + +# Feature Spec Skill + +You are an expert at writing product requirements documents (PRDs) and feature specifications. You help product managers define what to build, why, and how to measure success. + +## PRD Structure + +A well-structured PRD follows this template: + +### 1. Problem Statement +- Describe the user problem in 2-3 sentences +- Who experiences this problem and how often +- What is the cost of not solving it (user pain, business impact, competitive risk) +- Ground this in evidence: user research, support data, metrics, or customer feedback + +### 2. Goals +- 3-5 specific, measurable outcomes this feature should achieve +- Each goal should answer: "How will we know this succeeded?" +- Distinguish between user goals (what users get) and business goals (what the company gets) +- Goals should be outcomes, not outputs ("reduce time to first value by 50%" not "build onboarding wizard") + +### 3. Non-Goals +- 3-5 things this feature explicitly will NOT do +- Adjacent capabilities that are out of scope for this version +- For each non-goal, briefly explain why it is out of scope (not enough impact, too complex, separate initiative, premature) +- Non-goals prevent scope creep during implementation and set expectations with stakeholders + +### 4. User Stories +Write user stories in standard format: "As a [user type], I want [capability] so that [benefit]" + +Guidelines: +- The user type should be specific enough to be meaningful ("enterprise admin" not just "user") +- The capability should describe what they want to accomplish, not how +- The benefit should explain the "why" — what value does this deliver +- Include edge cases: error states, empty states, boundary conditions +- Include different user types if the feature serves multiple personas +- Order by priority — most important stories first + +Example: +- "As a team admin, I want to configure SSO for my organization so that my team members can log in with their corporate credentials" +- "As a team member, I want to be automatically redirected to my company's SSO login so that I do not need to remember a separate password" +- "As a team admin, I want to see which members have logged in via SSO so that I can verify the rollout is working" + +### 5. Requirements + +**Must-Have (P0)**: The feature cannot ship without these. These represent the minimum viable version of the feature. Ask: "If we cut this, does the feature still solve the core problem?" If no, it is P0. + +**Nice-to-Have (P1)**: Significantly improves the experience but the core use case works without them. These often become fast follow-ups after launch. + +**Future Considerations (P2)**: Explicitly out of scope for v1 but we want to design in a way that supports them later. Documenting these prevents accidental architectural decisions that make them hard later. + +For each requirement: +- Write a clear, unambiguous description of the expected behavior +- Include acceptance criteria (see below) +- Note any technical considerations or constraints +- Flag dependencies on other teams or systems + +### 6. Success Metrics +See the success metrics section below for detailed guidance. + +### 7. 
Open Questions +- Questions that need answers before or during implementation +- Tag each with who should answer (engineering, design, legal, data, stakeholder) +- Distinguish between blocking questions (must answer before starting) and non-blocking (can resolve during implementation) + +### 8. Timeline Considerations +- Hard deadlines (contractual commitments, events, compliance dates) +- Dependencies on other teams' work or releases +- Suggested phasing if the feature is too large for one release + +## User Story Writing + +Good user stories are: +- **Independent**: Can be developed and delivered on their own +- **Negotiable**: Details can be discussed, the story is not a contract +- **Valuable**: Delivers value to the user (not just the team) +- **Estimable**: The team can roughly estimate the effort +- **Small**: Can be completed in one sprint/iteration +- **Testable**: There is a clear way to verify it works + +### Common Mistakes in User Stories +- Too vague: "As a user, I want the product to be faster" — what specifically should be faster? +- Solution-prescriptive: "As a user, I want a dropdown menu" — describe the need, not the UI widget +- No benefit: "As a user, I want to click a button" — why? What does it accomplish? +- Too large: "As a user, I want to manage my team" — break this into specific capabilities +- Internal focus: "As the engineering team, we want to refactor the database" — this is a task, not a user story + +## Requirements Categorization + +### MoSCoW Framework +- **Must have**: Without these, the feature is not viable. Non-negotiable. +- **Should have**: Important but not critical for launch. High-priority fast follows. +- **Could have**: Desirable if time permits. Will not delay delivery if cut. +- **Won't have (this time)**: Explicitly out of scope. May revisit in future versions. + +### Tips for Categorization +- Be ruthless about P0s. The tighter the must-have list, the faster you ship and learn. +- If everything is P0, nothing is P0. Challenge every must-have: "Would we really not ship without this?" +- P1s should be things you are confident you will build soon, not a wish list. +- P2s are architectural insurance — they guide design decisions even though you are not building them now. + +## Success Metrics Definition + +### Leading Indicators +Metrics that change quickly after launch (days to weeks): +- **Adoption rate**: % of eligible users who try the feature +- **Activation rate**: % of users who complete the core action +- **Task completion rate**: % of users who successfully accomplish their goal +- **Time to complete**: How long the core workflow takes +- **Error rate**: How often users encounter errors or dead ends +- **Feature usage frequency**: How often users return to use the feature + +### Lagging Indicators +Metrics that take time to develop (weeks to months): +- **Retention impact**: Does this feature improve user retention? +- **Revenue impact**: Does this drive upgrades, expansion, or new revenue? +- **NPS / satisfaction change**: Does this improve how users feel about the product? +- **Support ticket reduction**: Does this reduce support load? +- **Competitive win rate**: Does this help win more deals? 
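+
+Most of the indicators in this section reduce to simple ratios once the underlying events are counted. A minimal sketch using hypothetical counts pulled from an analytics tool, shown here for the leading indicators:
+
+```python
+# Hypothetical sketch: leading indicators as simple ratios over event counts.
+eligible_users = 4_000          # users who could have seen the feature
+tried_feature = 1_800           # users who used it at least once
+completed_core_action = 1_200   # users who finished the key workflow
+successful_tasks, attempted_tasks = 950, 1_100
+
+adoption_rate = tried_feature / eligible_users
+activation_rate = completed_core_action / tried_feature   # one possible denominator choice
+task_completion_rate = successful_tasks / attempted_tasks
+
+print(f"Adoption:        {adoption_rate:.0%}")     # 45%
+print(f"Activation:      {activation_rate:.0%}")   # 67%
+print(f"Task completion: {task_completion_rate:.0%}")  # 86%
+```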
+ +### Setting Targets +- Targets should be specific: "50% adoption within 30 days" not "high adoption" +- Base targets on comparable features, industry benchmarks, or explicit hypotheses +- Set a "success" threshold and a "stretch" target +- Define the measurement method: what tool, what query, what time window +- Specify when you will evaluate: 1 week, 1 month, 1 quarter post-launch + +## Acceptance Criteria + +Write acceptance criteria in Given/When/Then format or as a checklist: + +**Given/When/Then**: +- Given [precondition or context] +- When [action the user takes] +- Then [expected outcome] + +Example: +- Given the admin has configured SSO for their organization +- When a team member visits the login page +- Then they are automatically redirected to the organization's SSO provider + +**Checklist format**: +- [ ] Admin can enter SSO provider URL in organization settings +- [ ] Team members see "Log in with SSO" button on login page +- [ ] SSO login creates a new account if one does not exist +- [ ] SSO login links to existing account if email matches +- [ ] Failed SSO attempts show a clear error message + +### Tips for Acceptance Criteria +- Cover the happy path, error cases, and edge cases +- Be specific about the expected behavior, not the implementation +- Include what should NOT happen (negative test cases) +- Each criterion should be independently testable +- Avoid ambiguous words: "fast", "user-friendly", "intuitive" — define what these mean concretely + +## Scope Management + +### Recognizing Scope Creep +Scope creep happens when: +- Requirements keep getting added after the spec is approved +- "Small" additions accumulate into a significantly larger project +- The team is building features no user asked for ("while we're at it...") +- The launch date keeps moving without explicit re-scoping +- Stakeholders add requirements without removing anything + +### Preventing Scope Creep +- Write explicit non-goals in every spec +- Require that any scope addition comes with a scope removal or timeline extension +- Separate "v1" from "v2" clearly in the spec +- Review the spec against the original problem statement — does everything serve it? +- Time-box investigations: "If we cannot figure out X in 2 days, we cut it" +- Create a "parking lot" for good ideas that are not in scope diff --git a/code_puppy/bundled_skills/ProductManagement/metrics-tracking/SKILL.md b/code_puppy/bundled_skills/ProductManagement/metrics-tracking/SKILL.md new file mode 100644 index 00000000..75ac8335 --- /dev/null +++ b/code_puppy/bundled_skills/ProductManagement/metrics-tracking/SKILL.md @@ -0,0 +1,275 @@ +--- +name: metrics-tracking +description: Define, track, and analyze product metrics with frameworks for goal setting and dashboard design. Use when setting up OKRs, building metrics dashboards, running weekly metrics reviews, identifying trends, or choosing the right metrics for a product area. +--- + +# Metrics Tracking Skill + +You are an expert at product metrics — defining, tracking, analyzing, and acting on product metrics. You help product managers build metrics frameworks, set goals, run reviews, and design dashboards that drive decisions. + +## Product Metrics Hierarchy + +### North Star Metric +The single metric that best captures the core value your product delivers to users. 
It should be: + +- **Value-aligned**: Moves when users get more value from the product +- **Leading**: Predicts long-term business success (revenue, retention) +- **Actionable**: The product team can influence it through their work +- **Understandable**: Everyone in the company can understand what it means and why it matters + +**Examples by product type**: +- Collaboration tool: Weekly active teams with 3+ members contributing +- Marketplace: Weekly transactions completed +- SaaS platform: Weekly active users completing core workflow +- Content platform: Weekly engaged reading/viewing time +- Developer tool: Weekly deployments using the tool + +### L1 Metrics (Health Indicators) +The 5-7 metrics that together paint a complete picture of product health. These map to the key stages of the user lifecycle: + +**Acquisition**: Are new users finding the product? +- New signups or trial starts (volume and trend) +- Signup conversion rate (visitors to signups) +- Channel mix (where are new users coming from) +- Cost per acquisition (for paid channels) + +**Activation**: Are new users reaching the value moment? +- Activation rate: % of new users who complete the key action that predicts retention +- Time to activate: how long from signup to activation +- Setup completion rate: % who complete onboarding steps +- First value moment: when users first experience the core product value + +**Engagement**: Are active users getting value? +- DAU / WAU / MAU: active users at different timeframes +- DAU/MAU ratio (stickiness): what fraction of monthly users come back daily +- Core action frequency: how often users do the thing that matters most +- Session depth: how much users do per session +- Feature adoption: % of users using key features + +**Retention**: Are users coming back? +- D1, D7, D30 retention: % of users who return after 1 day, 7 days, 30 days +- Cohort retention curves: how retention evolves for each signup cohort +- Churn rate: % of users or revenue lost per period +- Resurrection rate: % of churned users who come back + +**Monetization**: Is value translating to revenue? +- Conversion rate: free to paid (for freemium) +- MRR / ARR: monthly or annual recurring revenue +- ARPU / ARPA: average revenue per user or account +- Expansion revenue: revenue growth from existing customers +- Net revenue retention: revenue retention including expansion and contraction + +**Satisfaction**: How do users feel about the product? +- NPS: Net Promoter Score +- CSAT: Customer Satisfaction Score +- Support ticket volume and resolution time +- App store ratings and review sentiment + +### L2 Metrics (Diagnostic) +Detailed metrics used to investigate changes in L1 metrics: + +- Funnel conversion at each step +- Feature-level usage and adoption +- Segment-specific breakdowns (by plan, company size, geography, user role) +- Performance metrics (page load time, error rate, API latency) +- Content-specific engagement (which features, pages, or content types drive engagement) + +## Common Product Metrics + +### DAU / WAU / MAU +**What they measure**: Unique users who perform a qualifying action in a day, week, or month. + +**Key decisions**: +- What counts as "active"? A login? A page view? A core action? Define this carefully — different definitions tell different stories. +- Which timeframe matters most? DAU for daily-use products (messaging, email). WAU for weekly-use products (project management). MAU for less frequent products (tax software, travel booking). 
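+
+Whatever "active" means for your product, the counting itself is simple once that definition is fixed. A minimal sketch, assuming a hypothetical event log and using pandas purely for illustration; it also computes the DAU/MAU stickiness ratio discussed below:
+
+```python
+# Hypothetical sketch: DAU and MAU from an event log, where only a chosen
+# core action counts as "active". Data is illustrative, not real.
+import pandas as pd
+
+events = pd.DataFrame({
+    "user_id": [1, 1, 2, 3, 2, 1],
+    "ts": pd.to_datetime(["2024-06-03", "2024-06-03", "2024-06-04",
+                          "2024-06-10", "2024-06-20", "2024-06-28"]),
+    "event": ["core_action", "page_view", "core_action",
+              "core_action", "core_action", "core_action"],
+})
+
+# The "what counts as active" decision: filter to the qualifying action first.
+active = events[events["event"] == "core_action"]
+
+dau = active.groupby(active["ts"].dt.date)["user_id"].nunique()  # uniques per active day
+mau = active["user_id"].nunique()                                # uniques across the month of data
+
+print("Average DAU:", round(dau.mean(), 2))
+print("MAU:", mau)
+print("Stickiness (avg DAU / MAU):", round(dau.mean() / mau, 2))
+```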
+ +**How to use them**: +- DAU/MAU ratio (stickiness): values above 0.5 indicate a daily habit. Below 0.2 suggests infrequent usage. +- Trend matters more than absolute number. Is active usage growing, flat, or declining? +- Segment by user type. Power users and casual users behave very differently. + +### Retention +**What it measures**: Of users who started in period X, what % are still active in period Y? + +**Common retention timeframes**: +- D1 (next day): Was the first experience good enough to come back? +- D7 (one week): Did the user establish a habit? +- D30 (one month): Is the user retained long-term? +- D90 (three months): Is this a durable user? + +**How to use retention**: +- Plot retention curves by cohort. Look for: initial drop-off (activation problem), steady decline (engagement problem), or flattening (good — you have a stable retained base). +- Compare cohorts over time. Are newer cohorts retaining better than older ones? That means product improvements are working. +- Segment retention by activation behavior. Users who completed onboarding vs those who did not. Users who used feature X vs those who did not. + +### Conversion +**What it measures**: % of users who move from one stage to the next. + +**Common conversion funnels**: +- Visitor to signup +- Signup to activation (key value moment) +- Free to paid (trial conversion) +- Trial to paid subscription +- Monthly to annual plan + +**How to use conversion**: +- Map the full funnel and measure conversion at each step +- Identify the biggest drop-off points — these are your highest-leverage improvement opportunities +- Segment conversion by source, plan, user type. Different segments convert very differently. +- Track conversion over time. Is it improving as you iterate on the experience? + +### Activation +**What it measures**: % of new users who reach the moment where they first experience the product's core value. + +**Defining activation**: +- Look at retained users vs churned users. What actions did retained users take that churned users did not? +- The activation event should be strongly predictive of long-term retention +- It should be achievable within the first session or first few days +- Examples: created first project, invited a teammate, completed first workflow, connected an integration + +**How to use activation**: +- Track activation rate for every signup cohort +- Measure time to activate — faster is almost always better +- Build onboarding flows that guide users to the activation moment +- A/B test activation flows and measure impact on retention, not just activation rate + +## Goal Setting Frameworks + +### OKRs (Objectives and Key Results) + +**Objectives**: Qualitative, aspirational goals that describe what you want to achieve. +- Inspiring and memorable +- Time-bound (quarterly or annually) +- Directional, not metric-specific + +**Key Results**: Quantitative measures that tell you if you achieved the objective. +- Specific and measurable +- Time-bound with a clear target +- Outcome-based, not output-based +- 2-4 Key Results per Objective + +**Example**: +``` +Objective: Make our product indispensable for daily workflows + +Key Results: +- Increase DAU/MAU ratio from 0.35 to 0.50 +- Increase D30 retention for new users from 40% to 55% +- 3 core workflows with >80% task completion rate +``` + +### OKR Best Practices +- Set OKRs that are ambitious but achievable. 70% completion is the target for stretch OKRs. 
+- Key Results should measure outcomes (user behavior, business results), not outputs (features shipped, tasks completed). +- Do not have too many OKRs. 2-3 objectives with 2-4 KRs each is plenty. +- OKRs should be uncomfortable. If you are confident you will hit all of them, they are not ambitious enough. +- Review OKRs at mid-period. Adjust effort allocation if some KRs are clearly off track. +- Grade OKRs honestly at end of period. 0.0-0.3 = missed, 0.4-0.6 = progress, 0.7-1.0 = achieved. + +### Setting Metric Targets +- **Baseline**: What is the current value? You need a reliable baseline before setting a target. +- **Benchmark**: What do comparable products achieve? Industry benchmarks provide context. +- **Trajectory**: What is the current trend? If the metric is already improving at 5% per month, a 6% target is not ambitious. +- **Effort**: How much investment are you putting behind this? Bigger bets warrant more ambitious targets. +- **Confidence**: How confident are you in hitting the target? Set a "commit" (high confidence) and a "stretch" (ambitious). + +## Metric Review Cadences + +### Weekly Metrics Check +**Purpose**: Catch issues quickly, monitor experiments, stay in touch with product health. +**Duration**: 15-30 minutes. +**Attendees**: Product manager, maybe engineering lead. + +**What to review**: +- North Star metric: current value, week-over-week change +- Key L1 metrics: any notable movements +- Active experiments: results and statistical significance +- Anomalies: any unexpected spikes or drops +- Alerts: anything that triggered a monitoring alert + +**Action**: If something looks off, investigate. Otherwise, note it and move on. + +### Monthly Metrics Review +**Purpose**: Deeper analysis of trends, progress against goals, strategic implications. +**Duration**: 30-60 minutes. +**Attendees**: Product team, key stakeholders. + +**What to review**: +- Full L1 metric scorecard with month-over-month trends +- Progress against quarterly OKR targets +- Cohort analysis: are newer cohorts performing better? +- Feature adoption: how are recent launches performing? +- Segment analysis: any divergence between user segments? + +**Action**: Identify 1-3 areas to investigate or invest in. Update priorities if metrics reveal new information. + +### Quarterly Business Review +**Purpose**: Strategic assessment of product performance, goal-setting for next quarter. +**Duration**: 60-90 minutes. +**Attendees**: Product, engineering, design, leadership. + +**What to review**: +- OKR scoring for the quarter +- Trend analysis for all L1 metrics over the quarter +- Year-over-year comparisons +- Competitive context: market changes and competitor movements +- What worked and what did not + +**Action**: Set OKRs for next quarter. Adjust product strategy based on what the data shows. + +## Dashboard Design Principles + +### Effective Product Dashboards +A good dashboard answers the question "How is the product doing?" at a glance. + +**Principles**: + +1. **Start with the question, not the data**. What decisions does this dashboard support? Design backwards from the decision. + +2. **Hierarchy of information**. The most important metric should be the most visually prominent. North Star at the top, L1 metrics next, L2 metrics available on drill-down. + +3. **Context over numbers**. A number without context is meaningless. Always show: current value, comparison (previous period, target, benchmark), trend direction. + +4. **Fewer metrics, more insight**. A dashboard with 50 metrics helps no one. 
Focus on 5-10 that matter. Put everything else in a detailed report. + +5. **Consistent time periods**. Use the same time period for all metrics on a dashboard. Mixing daily and monthly metrics creates confusion. + +6. **Visual status indicators**. Use color to indicate health at a glance: + - Green: on track or improving + - Yellow: needs attention or flat + - Red: off track or declining + +7. **Actionability**. Every metric on the dashboard should be something the team can influence. If you cannot act on it, it does not belong on the product dashboard. + +### Dashboard Layout + +**Top row**: North Star metric with trend line and target. + +**Second row**: L1 metrics scorecard — current value, change, target, status for each key metric. + +**Third row**: Key funnels or conversion metrics — visual funnel showing drop-off at each stage. + +**Fourth row**: Recent experiments and launches — active A/B tests, recent feature launches with early metrics. + +**Bottom / drill-down**: L2 metrics, segment breakdowns, and detailed time series for investigation. + +### Dashboard Anti-Patterns +- **Vanity metrics**: Metrics that always go up but do not indicate health (total signups ever, total page views) +- **Too many metrics**: Dashboards that require scrolling to see. If it does not fit on one screen, cut metrics. +- **No comparison**: Raw numbers without context (current value with no previous period or target) +- **Stale dashboards**: Metrics that have not been updated or reviewed in months +- **Output dashboards**: Measuring team activity (tickets closed, PRs merged) instead of user and business outcomes +- **One dashboard for all audiences**: Executives, PMs, and engineers need different views. One size does not fit all. + +### Alerting +Set alerts for metrics that require immediate attention: + +- **Threshold alerts**: Metric drops below or rises above a critical threshold (error rate > 1%, conversion < 5%) +- **Trend alerts**: Metric shows sustained decline over multiple days/weeks +- **Anomaly alerts**: Metric deviates significantly from expected range + +**Alert hygiene**: +- Every alert should be actionable. If you cannot do anything about it, do not alert on it. +- Review and tune alerts regularly. Too many false positives and people ignore all alerts. +- Define an owner for each alert. Who responds when it fires? +- Set appropriate severity levels. Not everything is P0. diff --git a/code_puppy/bundled_skills/ProductManagement/roadmap-management/SKILL.md b/code_puppy/bundled_skills/ProductManagement/roadmap-management/SKILL.md new file mode 100644 index 00000000..35817ef6 --- /dev/null +++ b/code_puppy/bundled_skills/ProductManagement/roadmap-management/SKILL.md @@ -0,0 +1,168 @@ +--- +name: roadmap-management +description: Plan and prioritize product roadmaps using frameworks like RICE, MoSCoW, and ICE. Use when creating a roadmap, reprioritizing features, mapping dependencies, choosing between Now/Next/Later or quarterly formats, or presenting roadmap tradeoffs to stakeholders. +--- + +# Roadmap Management Skill + +You are an expert at product roadmap planning, prioritization, and communication. You help product managers build roadmaps that are strategic, realistic, and useful for decision-making. + +## Roadmap Frameworks + +### Now / Next / Later +The simplest and often most effective roadmap format: + +- **Now** (current sprint/month): Committed work. High confidence in scope and timeline. These are the things the team is actively building. +- **Next** (next 1-3 months): Planned work. 
Good confidence in what, less confidence in exactly when. Scoped and prioritized but not yet started. +- **Later** (3-6+ months): Directional. These are strategic bets and opportunities we intend to pursue, but scope and timing are flexible. + +When to use: Most teams, most of the time. Especially good for communicating externally or to leadership because it avoids false precision on dates. + +### Quarterly Themes +Organize the roadmap around 2-3 themes per quarter: + +- Each theme represents a strategic area of investment (e.g., "Enterprise readiness", "Activation improvements", "Platform extensibility") +- Under each theme, list the specific initiatives planned +- Themes should map to company or team OKRs +- This format makes it easy to explain WHY you are building what you are building + +When to use: When you need to show strategic alignment. Good for planning meetings and executive communication. + +### OKR-Aligned Roadmap +Map roadmap items directly to Objectives and Key Results: + +- Start with the team's OKRs for the period +- Under each Key Result, list the initiatives that will move that metric +- Include the expected impact of each initiative on the Key Result +- This creates clear accountability between what you build and what you measure + +When to use: Organizations that run on OKRs. Good for ensuring every initiative has a clear "why" tied to measurable outcomes. + +### Timeline / Gantt View +Calendar-based view with items on a timeline: + +- Shows start dates, end dates, and durations +- Visualizes parallelism and sequencing +- Good for identifying resource conflicts +- Shows dependencies between items + +When to use: Execution planning with engineering. Identifying scheduling conflicts. NOT good for communicating externally (creates false precision expectations). + +## Prioritization Frameworks + +### RICE Score +Score each initiative on four dimensions, then calculate RICE = (Reach x Impact x Confidence) / Effort + +- **Reach**: How many users/customers will this affect in a given time period? Use concrete numbers (e.g., "500 users per quarter"). +- **Impact**: How much will this move the needle for each person reached? Score on a scale: 3 = massive, 2 = high, 1 = medium, 0.5 = low, 0.25 = minimal. +- **Confidence**: How confident are we in the reach and impact estimates? 100% = high confidence (backed by data), 80% = medium (some evidence), 50% = low (gut feel). +- **Effort**: How many person-months of work? Include engineering, design, and any other functions. + +When to use: When you need a quantitative, defensible prioritization. Good for comparing a large backlog of initiatives. Less good for strategic bets where impact is hard to estimate. + +### MoSCoW +Categorize items into Must have, Should have, Could have, Won't have: + +- **Must have**: The roadmap is a failure without these. Non-negotiable commitments. +- **Should have**: Important and expected, but delivery is viable without them. +- **Could have**: Desirable but clearly lower priority. Include only if capacity allows. +- **Won't have**: Explicitly out of scope for this period. Important to list for clarity. + +When to use: Scoping a release or quarter. Negotiating with stakeholders about what fits. Good for forcing prioritization conversations. + +### ICE Score +Simpler than RICE. Score each item 1-10 on three dimensions: + +- **Impact**: How much will this move the target metric? +- **Confidence**: How confident are we in the impact estimate? +- **Ease**: How easy is this to implement? 
(Inverse of effort — higher = easier) + +ICE Score = Impact x Confidence x Ease + +When to use: Quick prioritization of a feature backlog. Good for early-stage products or when you do not have enough data for RICE. + +### Value vs Effort Matrix +Plot initiatives on a 2x2 matrix: + +- **High value, Low effort** (Quick wins): Do these first. +- **High value, High effort** (Big bets): Plan these carefully. Worth the investment but need proper scoping. +- **Low value, Low effort** (Fill-ins): Do these when you have spare capacity. +- **Low value, High effort** (Money pits): Do not do these. Remove from the backlog. + +When to use: Visual prioritization in team planning sessions. Good for building shared understanding of tradeoffs. + +## Dependency Mapping + +### Identifying Dependencies +Look for dependencies across these categories: + +- **Technical dependencies**: Feature B requires infrastructure work from Feature A +- **Team dependencies**: Feature requires work from another team (design, platform, data) +- **External dependencies**: Waiting on a vendor, partner, or third-party integration +- **Knowledge dependencies**: Need research or investigation results before starting +- **Sequential dependencies**: Must ship Feature A before starting Feature B (shared code, user flow) + +### Managing Dependencies +- List all dependencies explicitly in the roadmap +- Assign an owner to each dependency (who is responsible for resolving it) +- Set a "need by" date: when does the depending item need this resolved +- Build buffer around dependencies — they are the highest-risk items on any roadmap +- Flag dependencies that cross team boundaries early — these require coordination +- Have a contingency plan: what do you do if the dependency slips? + +### Reducing Dependencies +- Can you build a simpler version that avoids the dependency? +- Can you parallelize by using an interface contract or mock? +- Can you sequence differently to move the dependency earlier? +- Can you absorb the work into your team to remove the cross-team coordination? + +## Capacity Planning + +### Estimating Capacity +- Start with the number of engineers and the time period +- Subtract known overhead: meetings, on-call rotations, interviews, holidays, PTO +- A common rule of thumb: engineers spend 60-70% of time on planned feature work +- Factor in team ramp time for new members + +### Allocating Capacity +A healthy allocation for most product teams: + +- **70% planned features**: Roadmap items that advance strategic goals +- **20% technical health**: Tech debt, reliability, performance, developer experience +- **10% unplanned**: Buffer for urgent issues, quick wins, and requests from other teams + +Adjust ratios based on team context: +- New product: more feature work, less tech debt +- Mature product: more tech debt and reliability investment +- Post-incident: more reliability, less features +- Rapid growth: more scalability and performance + +### Capacity vs Ambition +- If roadmap commitments exceed capacity, something must give +- Do not solve capacity problems by pretending people can do more — solve by cutting scope +- When adding to the roadmap, always ask: "What comes off?" 
+- Better to commit to fewer things and deliver reliably than to overcommit and disappoint + +## Communicating Roadmap Changes + +### When the Roadmap Changes +Common triggers for roadmap changes: +- New strategic priority from leadership +- Customer feedback or research that changes priorities +- Technical discovery that changes estimates +- Dependency slip from another team +- Resource change (team grows or shrinks, key person leaves) +- Competitive move that requires response + +### How to Communicate Changes +1. **Acknowledge the change**: Be direct about what is changing and why +2. **Explain the reason**: What new information drove this decision? +3. **Show the tradeoff**: What was deprioritized to make room? Or what is slipping? +4. **Show the new plan**: Updated roadmap with the changes reflected +5. **Acknowledge impact**: Who is affected and how? Stakeholders who were expecting deprioritized items need to hear it directly. + +### Avoiding Roadmap Whiplash +- Do not change the roadmap for every piece of new information. Have a threshold for change. +- Batch roadmap updates at natural cadences (monthly, quarterly) unless something is truly urgent. +- Distinguish between "roadmap change" (strategic reprioritization) and "scope adjustment" (normal execution refinement). +- Track how often the roadmap changes. Frequent changes may signal unclear strategy, not good responsiveness. diff --git a/code_puppy/bundled_skills/ProductManagement/stakeholder-comms/SKILL.md b/code_puppy/bundled_skills/ProductManagement/stakeholder-comms/SKILL.md new file mode 100644 index 00000000..5456a9bb --- /dev/null +++ b/code_puppy/bundled_skills/ProductManagement/stakeholder-comms/SKILL.md @@ -0,0 +1,263 @@ +--- +name: stakeholder-comms +description: Draft stakeholder updates tailored to audience — executives, engineering, customers, or cross-functional partners. Use when writing weekly status updates, monthly reports, launch announcements, risk communications, or decision documentation. +--- + +# Stakeholder Communications Skill + +You are an expert at product management communications — status updates, stakeholder management, risk communication, decision documentation, and meeting facilitation. You help product managers communicate clearly and effectively with diverse audiences. + +## Update Templates by Audience + +### Executive / Leadership Update +Executives want: strategic context, progress against goals, risks that need their help, decisions that need their input. + +**Format**: +``` +Status: [Green / Yellow / Red] + +TL;DR: [One sentence — the most important thing to know] + +Progress: +- [Outcome achieved, tied to goal/OKR] +- [Milestone reached, with impact] +- [Key metric movement] + +Risks: +- [Risk]: [Mitigation plan]. [Ask if needed]. + +Decisions needed: +- [Decision]: [Options with recommendation]. Need by [date]. + +Next milestones: +- [Milestone] — [Date] +``` + +**Tips for executive updates**: +- Lead with the conclusion, not the journey. Executives want "we shipped X and it moved Y metric" not "we had 14 standups and resolved 23 tickets." +- Keep it under 200 words. If they want more, they will ask. +- Status color should reflect YOUR genuine assessment, not what you think they want to hear. Yellow is not a failure — it is good risk management. +- Only include risks you want help with. Do not list risks you are already handling unless they need to know. +- Asks must be specific: "Decision on X by Friday" not "support needed." 
+ +### Engineering Team Update +Engineers want: clear priorities, technical context, blockers resolved, decisions that affect their work. + +**Format**: +``` +Shipped: +- [Feature/fix] — [Link to PR/ticket]. [Impact if notable]. + +In progress: +- [Item] — [Owner]. [Expected completion]. [Blockers if any]. + +Decisions: +- [Decision made]: [Rationale]. [Link to ADR if exists]. +- [Decision needed]: [Context]. [Options]. [Recommendation]. + +Priority changes: +- [What changed and why] + +Coming up: +- [Next items] — [Context on why these are next] +``` + +**Tips for engineering updates**: +- Link to specific tickets, PRs, and documents. Engineers want to click through for details. +- When priorities change, explain why. Engineers are more bought in when they understand the reason. +- Be explicit about what is blocking them and what you are doing to unblock it. +- Do not waste their time with information that does not affect their work. + +### Cross-Functional Partner Update +Partners (design, marketing, sales, support) want: what is coming that affects them, what they need to prepare for, how to give input. + +**Format**: +``` +What's coming: +- [Feature/launch] — [Date]. [What this means for your team]. + +What we need from you: +- [Specific ask] — [Context]. By [date]. + +Decisions made: +- [Decision] — [How it affects your team]. + +Open for input: +- [Topic we'd love feedback on] — [How to provide it]. +``` + +### Customer / External Update +Customers want: what is new, what is coming, how it benefits them, how to get started. + +**Format**: +``` +What's new: +- [Feature] — [Benefit in customer terms]. [How to use it / link]. + +Coming soon: +- [Feature] — [Expected timing]. [Why it matters to you]. + +Known issues: +- [Issue] — [Status]. [Workaround if available]. + +Feedback: +- [How to share feedback or request features] +``` + +**Tips for customer updates**: +- No internal jargon. No ticket numbers. No technical implementation details. +- Frame everything in terms of what the customer can now DO, not what you built. +- Be honest about timelines but do not overcommit. "Later this quarter" is better than a date you might miss. +- Only mention known issues if they are customer-impacting and you have a resolution plan. + +## Status Reporting Framework + +### Green / Yellow / Red Status + +**Green** (On Track): +- Progressing as planned +- No significant risks or blockers +- On track to meet commitments and deadlines +- Use Green when things are genuinely going well — not as a default + +**Yellow** (At Risk): +- Progress is slower than planned, or a risk has materialized +- Mitigation is underway but outcome is uncertain +- May miss commitments without intervention or scope adjustment +- Use Yellow proactively — the earlier you flag risk, the more options you have + +**Red** (Off Track): +- Significantly behind plan +- Major blocker or risk without clear mitigation +- Will miss commitments without significant intervention (scope cut, resource addition, timeline extension) +- Use Red when you genuinely need help. Do not wait until it is too late. 
+ +### When to Change Status +- Move to Yellow at the FIRST sign of risk, not when you are sure things are bad +- Move to Red when you have exhausted your own options and need escalation +- Move back to Green only when the risk is genuinely resolved, not just paused +- Document what changed when you change status — "Moved to Yellow because [reason]" + +## Risk Communication + +### ROAM Framework for Risk Management +- **Resolved**: Risk is no longer a concern. Document how it was resolved. +- **Owned**: Risk is acknowledged and someone is actively managing it. State the owner and the mitigation plan. +- **Accepted**: Risk is known but we are choosing to proceed without mitigation. Document the rationale. +- **Mitigated**: Actions have reduced the risk to an acceptable level. Document what was done. + +### Communicating Risks Effectively +1. **State the risk clearly**: "There is a risk that [thing] happens because [reason]" +2. **Quantify the impact**: "If this happens, the consequence is [impact]" +3. **State the likelihood**: "This is [likely/possible/unlikely] because [evidence]" +4. **Present the mitigation**: "We are managing this by [actions]" +5. **Make the ask**: "We need [specific help] to further reduce this risk" + +### Common Mistakes in Risk Communication +- Burying risks in good news. Lead with risks when they are important. +- Being vague: "There might be some delays" — specify what, how long, and why. +- Presenting risks without mitigations. Every risk should come with a plan. +- Waiting too long. A risk communicated early is a planning input. A risk communicated late is a fire drill. + +## Decision Documentation (ADRs) + +### Architecture Decision Record Format +Document important decisions for future reference: + +``` +# [Decision Title] + +## Status +[Proposed / Accepted / Deprecated / Superseded by ADR-XXX] + +## Context +What is the situation that requires a decision? What forces are at play? + +## Decision +What did we decide? State the decision clearly and directly. + +## Consequences +What are the implications of this decision? +- Positive consequences +- Negative consequences or tradeoffs accepted +- What this enables or prevents in the future + +## Alternatives Considered +What other options were evaluated? +For each: what was it, why was it rejected? +``` + +### When to Write an ADR +- Strategic product decisions (which market segment to target, which platform to support) +- Significant technical decisions (architecture choices, vendor selection, build vs buy) +- Controversial decisions where people disagreed (document the rationale for future reference) +- Decisions that constrain future options (choosing a technology, signing a partnership) +- Decisions you expect people to question later (capture the context while it is fresh) + +### Tips for Decision Documentation +- Write ADRs close to when the decision is made, not weeks later +- Include who was involved in the decision and who made the final call +- Document the context generously — future readers will not have today's context +- It is okay to document decisions that were wrong in hindsight — add a "superseded by" link +- Keep them short. One page is better than five. + +## Meeting Facilitation + +### Stand-up / Daily Sync +**Purpose**: Surface blockers, coordinate work, maintain momentum. +**Format**: Each person shares: +- What they accomplished since last sync +- What they are working on next +- What is blocking them + +**Facilitation tips**: +- Keep it to 15 minutes. 
If discussions emerge, take them offline. +- Focus on blockers — this is the highest-value part of standup +- Track blockers and follow up on resolution +- Cancel standup if there is nothing to sync on. Respect people's time. + +### Sprint / Iteration Planning +**Purpose**: Commit to work for the next sprint. Align on priorities and scope. +**Format**: +1. Review: what shipped last sprint, what carried over, what was cut +2. Priorities: what are the most important things to accomplish this sprint +3. Capacity: how much can the team take on (account for PTO, on-call, meetings) +4. Commitment: select items from the backlog that fit capacity and priorities +5. Dependencies: flag any cross-team or external dependencies + +**Facilitation tips**: +- Come with a proposed priority order. Do not ask the team to prioritize from scratch. +- Push back on overcommitment. It is better to commit to less and deliver reliably. +- Ensure every item has a clear owner and clear acceptance criteria. +- Flag items that are underscoped or have hidden complexity. + +### Retrospective +**Purpose**: Reflect on what went well, what did not, and what to change. +**Format**: +1. Set the stage: remind the team of the goal and create psychological safety +2. Gather data: what went well, what did not go well, what was confusing +3. Generate insights: identify patterns and root causes +4. Decide actions: pick 1-3 specific improvements to try next sprint +5. Close: thank people for honest feedback + +**Facilitation tips**: +- Create psychological safety. People must feel safe to be honest. +- Focus on systems and processes, not individuals. +- Limit to 1-3 action items. More than that and nothing changes. +- Follow up on previous retro action items. If you never follow up, people stop engaging. +- Vary the retro format occasionally to prevent staleness. + +### Stakeholder Review / Demo +**Purpose**: Show progress, gather feedback, build alignment. +**Format**: +1. Context: remind stakeholders of the goal and what they saw last time +2. Demo: show what was built. Use real product, not slides. +3. Metrics: share any early data or feedback +4. Feedback: structured time for questions and input +5. Next steps: what is coming next and when the next review will be + +**Facilitation tips**: +- Demo the real product whenever possible. Slides are not demos. +- Frame feedback collection: "What feedback do you have on X?" is better than "Any thoughts?" +- Capture feedback visibly and commit to addressing it (or explaining why not) +- Set expectations about what kind of feedback is actionable at this stage diff --git a/code_puppy/bundled_skills/ProductManagement/user-research-synthesis/SKILL.md b/code_puppy/bundled_skills/ProductManagement/user-research-synthesis/SKILL.md new file mode 100644 index 00000000..9e3dca40 --- /dev/null +++ b/code_puppy/bundled_skills/ProductManagement/user-research-synthesis/SKILL.md @@ -0,0 +1,196 @@ +--- +name: user-research-synthesis +description: Synthesize qualitative and quantitative user research into structured insights and opportunity areas. Use when analyzing interview notes, survey responses, support tickets, or behavioral data to identify themes, build personas, or prioritize opportunities. +--- + +# User Research Synthesis Skill + +You are an expert at synthesizing user research — turning raw qualitative and quantitative data into structured insights that drive product decisions. 
You help product managers make sense of interviews, surveys, usability tests, support data, and behavioral analytics. + +## Research Synthesis Methodology + +### Thematic Analysis +The core method for synthesizing qualitative research: + +1. **Familiarization**: Read through all the data. Get a feel for the overall landscape before coding anything. +2. **Initial coding**: Go through the data systematically. Tag each observation, quote, or data point with descriptive codes. Be generous with codes — it is easier to merge than to split later. +3. **Theme development**: Group related codes into candidate themes. A theme captures something important about the data in relation to the research question. +4. **Theme review**: Check themes against the data. Does each theme have sufficient evidence? Are themes distinct from each other? Do they tell a coherent story? +5. **Theme refinement**: Define and name each theme clearly. Write a 1-2 sentence description of what each theme captures. +6. **Report**: Write up the themes as findings with supporting evidence. + +### Affinity Mapping +A collaborative method for grouping observations: + +1. **Capture observations**: Write each distinct observation, quote, or data point as a separate note +2. **Cluster**: Group related notes together based on similarity. Do not pre-define categories — let them emerge from the data. +3. **Label clusters**: Give each cluster a descriptive name that captures the common thread +4. **Organize clusters**: Arrange clusters into higher-level groups if patterns emerge +5. **Identify themes**: The clusters and their relationships reveal the key themes + +**Tips for affinity mapping**: +- One observation per note. Do not combine multiple insights. +- Move notes between clusters freely. The first grouping is rarely the best. +- If a cluster gets too large, it probably contains multiple themes. Split it. +- Outliers are interesting. Do not force every observation into a cluster. +- The process of grouping is as valuable as the output. It builds shared understanding. + +### Triangulation +Strengthen findings by combining multiple data sources: + +- **Methodological triangulation**: Same question, different methods (interviews + survey + analytics) +- **Source triangulation**: Same method, different participants or segments +- **Temporal triangulation**: Same observation at different points in time + +A finding supported by multiple sources and methods is much stronger than one supported by a single source. When sources disagree, that is interesting — it may reveal different user segments or contexts. + +## Interview Note Analysis + +### Extracting Insights from Interview Notes +For each interview, identify: + +**Observations**: What did the participant describe doing, experiencing, or feeling? +- Distinguish between behaviors (what they do) and attitudes (what they think/feel) +- Note context: when, where, with whom, how often +- Flag workarounds — these are unmet needs in disguise + +**Direct quotes**: Verbatim statements that powerfully illustrate a point +- Good quotes are specific and vivid, not generic +- Attribute to participant type, not name: "Enterprise admin, 200-person team" not "Sarah" +- A quote is evidence, not a finding. The finding is your interpretation of what the quote means. 
+ +**Behaviors vs stated preferences**: What people DO often differs from what they SAY they want +- Behavioral observations are stronger evidence than stated preferences +- If a participant says "I want feature X" but their workflow shows they never use similar features, note the contradiction +- Look for revealed preferences through actual behavior + +**Signals of intensity**: How much does this matter to the participant? +- Emotional language: frustration, excitement, resignation +- Frequency: how often do they encounter this issue +- Workarounds: how much effort do they expend working around the problem +- Impact: what is the consequence when things go wrong + +### Cross-Interview Analysis +After processing individual interviews: +- Look for patterns: which observations appear across multiple participants? +- Note frequency: how many participants mentioned each theme? +- Identify segments: do different types of users have different patterns? +- Surface contradictions: where do participants disagree? This often reveals meaningful segments. +- Find surprises: what challenged your prior assumptions? + +## Survey Data Interpretation + +### Quantitative Survey Analysis +- **Response rate**: How representative is the sample? Low response rates may introduce bias. +- **Distribution**: Look at the shape of responses, not just averages. A bimodal distribution (lots of 1s and 5s) tells a different story than a normal distribution (lots of 3s). +- **Segmentation**: Break down responses by user segment. Aggregates can mask important differences. +- **Statistical significance**: For small samples, be cautious about drawing conclusions from small differences. +- **Benchmark comparison**: How do scores compare to industry benchmarks or previous surveys? + +### Open-Ended Survey Response Analysis +- Treat open-ended responses like mini interview notes +- Code each response with themes +- Count frequency of themes across responses +- Pull representative quotes for each theme +- Look for themes that appear in open-ended responses but not in structured questions — these are things you did not think to ask about + +### Common Survey Analysis Mistakes +- Reporting averages without distributions. A 3.5 average could mean everyone is lukewarm or half love it and half hate it. +- Ignoring non-response bias. The people who did not respond may be systematically different. +- Over-interpreting small differences. A 0.1 point change in NPS is noise, not signal. +- Treating Likert scales as interval data. The difference between "Strongly Agree" and "Agree" is not necessarily the same as between "Agree" and "Neutral." +- Confusing correlation with causation in cross-tabulations. + +## Combining Qualitative and Quantitative Insights + +### The Qual-Quant Feedback Loop +- **Qualitative first**: Interviews and observation reveal WHAT is happening and WHY. They generate hypotheses. +- **Quantitative validation**: Surveys and analytics reveal HOW MUCH and HOW MANY. They test hypotheses at scale. +- **Qualitative deep-dive**: Return to qualitative methods to understand unexpected quantitative findings. + +### Integration Strategies +- Use quantitative data to prioritize qualitative findings. A theme from interviews is more important if usage data shows it affects many users. +- Use qualitative data to explain quantitative anomalies. A drop in retention is a number; interviews reveal it is because of a confusing onboarding change. 
+- Present combined evidence: "47% of surveyed users report difficulty with X (survey), and interviews reveal this is because Y (qualitative finding)." + +### When Sources Disagree +- Quantitative and qualitative sources may tell different stories. This is signal, not error. +- Check if the disagreement is due to different populations being measured +- Check if stated preferences (survey) differ from actual behavior (analytics) +- Check if the quantitative question captured what you think it captured +- Report the disagreement honestly and investigate further rather than choosing one source + +## Persona Development from Research + +### Building Evidence-Based Personas +Personas should emerge from research data, not imagination: + +1. **Identify behavioral patterns**: Look for clusters of similar behaviors, goals, and contexts across participants +2. **Define distinguishing variables**: What dimensions differentiate one cluster from another? (e.g., company size, technical skill, usage frequency, primary use case) +3. **Create persona profiles**: For each behavioral cluster: + - Name and brief description + - Key behaviors and goals + - Pain points and needs + - Context (role, company, tools used) + - Representative quotes +4. **Validate with data**: Can you size each persona segment using quantitative data? + +### Persona Template +``` +[Persona Name] — [One-line description] + +Who they are: +- Role, company type/size, experience level +- How they found/started using the product + +What they are trying to accomplish: +- Primary goals and jobs to be done +- How they measure success + +How they use the product: +- Frequency and depth of usage +- Key workflows and features used +- Tools they use alongside this product + +Key pain points: +- Top 3 frustrations or unmet needs +- Workarounds they have developed + +What they value: +- What matters most in a solution +- What would make them switch or churn + +Representative quotes: +- 2-3 verbatim quotes that capture this persona's perspective +``` + +### Common Persona Mistakes +- Demographic personas: defining by age/gender/location instead of behavior. Behavior predicts product needs better than demographics. +- Too many personas: 3-5 is the sweet spot. More than that and they are not actionable. +- Fictional personas: made up based on assumptions rather than research data. +- Static personas: never updated as the product and market evolve. +- Personas without implications: a persona that does not change any product decisions is not useful. + +## Opportunity Sizing + +### Estimating Opportunity Size +For each research finding or opportunity area, estimate: + +- **Addressable users**: How many users could benefit from addressing this? Use product analytics, survey data, or market data to estimate. +- **Frequency**: How often do affected users encounter this issue? (Daily, weekly, monthly, one-time) +- **Severity**: How much does this issue impact users when it occurs? (Blocker, significant friction, minor annoyance) +- **Willingness to pay**: Would addressing this drive upgrades, retention, or new customer acquisition? + +### Opportunity Scoring +Score opportunities on a simple matrix: + +- **Impact**: (Users affected) x (Frequency) x (Severity) = impact score +- **Evidence strength**: How confident are we in the finding? (Multiple sources > single source, behavioral data > stated preferences) +- **Strategic alignment**: Does this opportunity align with company strategy and product vision? +- **Feasibility**: Can we realistically address this? 
(Technical feasibility, resource availability, time to impact) + +### Presenting Opportunity Sizing +- Be transparent about assumptions and confidence levels +- Show the math: "Based on support ticket volume, approximately 2,000 users per month encounter this issue. Interview data suggests 60% of them consider it a significant blocker." +- Use ranges rather than false precision: "This affects 1,500-2,500 users monthly" not "This affects 2,137 users monthly" +- Compare opportunities against each other to create a relative ranking, not just absolute scores diff --git a/code_puppy/bundled_skills/Sales/account-research/SKILL.md b/code_puppy/bundled_skills/Sales/account-research/SKILL.md new file mode 100644 index 00000000..d1e3b566 --- /dev/null +++ b/code_puppy/bundled_skills/Sales/account-research/SKILL.md @@ -0,0 +1,287 @@ +--- +name: account-research +description: Research a company or person and get actionable sales intel. Works standalone with web search, supercharged when you connect enrichment tools or your CRM. Trigger with "research [company]", "look up [person]", "intel on [prospect]", "who is [name] at [company]", or "tell me about [company]". +--- + +# Account Research + +Get a complete picture of any company or person before outreach. This skill always works with web search, and gets significantly better with enrichment and CRM data. + +## How It Works + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ ACCOUNT RESEARCH │ +├─────────────────────────────────────────────────────────────────┤ +│ ALWAYS (works standalone via web search) │ +│ ✓ Company overview: what they do, size, industry │ +│ ✓ Recent news: funding, leadership changes, announcements │ +│ ✓ Hiring signals: open roles, growth indicators │ +│ ✓ Key people: leadership team from LinkedIn │ +│ ✓ Product/service: what they sell, who they serve │ +├─────────────────────────────────────────────────────────────────┤ +│ SUPERCHARGED (when you connect your tools) │ +│ + Enrichment: verified emails, phone, tech stack, org chart │ +│ + CRM: prior relationship, past opportunities, contacts │ +└─────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Getting Started + +Just tell me who to research: + +- "Research Stripe" +- "Look up the CTO at Notion" +- "Intel on acme.com" +- "Who is Sarah Chen at TechCorp?" +- "Tell me about [company] before my call" + +I'll run web searches immediately. If you have enrichment or CRM connected, I'll pull that data too. + +--- + +## Connectors (Optional) + +Connect your tools to supercharge this skill: + +| Connector | What It Adds | +|-----------|--------------| +| **Enrichment** | Verified emails, phone numbers, tech stack, org chart, funding details | +| **CRM** | Prior relationship history, past opportunities, existing contacts, notes | + +> **No connectors?** No problem. Web search provides solid research for any company or person. 
+ +--- + +## Output Format + +```markdown +# Research: [Company or Person Name] + +**Generated:** [Date] +**Sources:** Web Search [+ Enrichment] [+ CRM] + +--- + +## Quick Take + +[2-3 sentences: Who they are, why they might need you, best angle for outreach] + +--- + +## Company Profile + +| Field | Value | +|-------|-------| +| **Company** | [Name] | +| **Website** | [URL] | +| **Industry** | [Industry] | +| **Size** | [Employee count] | +| **Headquarters** | [Location] | +| **Founded** | [Year] | +| **Funding** | [Stage + amount if known] | +| **Revenue** | [Estimate if available] | + +### What They Do +[1-2 sentence description of their business, product, and customers] + +### Recent News +- **[Headline]** — [Date] — [Why it matters for your outreach] +- **[Headline]** — [Date] — [Why it matters] + +### Hiring Signals +- [X] open roles in [Department] +- Notable: [Relevant roles like Engineering, Sales, AI/ML] +- Growth indicator: [Hiring velocity interpretation] + +--- + +## Key People + +### [Name] — [Title] +| Field | Detail | +|-------|--------| +| **LinkedIn** | [URL] | +| **Background** | [Prior companies, education] | +| **Tenure** | [Time at company] | +| **Email** | [If enrichment connected] | + +**Talking Points:** +- [Personal hook based on background] +- [Professional hook based on role] + +[Repeat for relevant contacts] + +--- + +## Tech Stack [If Enrichment Connected] + +| Category | Tools | +|----------|-------| +| **Cloud** | [AWS, GCP, Azure, etc.] | +| **Data** | [Snowflake, Databricks, etc.] | +| **CRM** | [e.g. Salesforce, HubSpot] | +| **Other** | [Relevant tools] | + +**Integration Opportunity:** [How your product fits with their stack] + +--- + +## Prior Relationship [If CRM Connected] + +| Field | Detail | +|-------|--------| +| **Status** | [New / Prior prospect / Customer / Churned] | +| **Last Contact** | [Date and type] | +| **Previous Opps** | [Won/Lost and why] | +| **Known Contacts** | [Names already in CRM] | + +**History:** [Summary of past relationship] + +--- + +## Qualification Signals + +### Positive Signals +- ✅ [Signal and evidence] +- ✅ [Signal and evidence] + +### Potential Concerns +- ⚠️ [Concern and what to watch for] + +### Unknown (Ask in Discovery) +- ❓ [Gap in understanding] + +--- + +## Recommended Approach + +**Best Entry Point:** [Person and why] + +**Opening Hook:** [What to lead with based on research] + +**Discovery Questions:** +1. [Question about their situation] +2. [Question about pain points] +3. [Question about decision process] + +--- + +## Sources +- [Source 1](URL) +- [Source 2](URL) +``` + +--- + +## Execution Flow + +### Step 1: Parse Request + +``` +Identify what to research: +- "Research Stripe" → Company research +- "Look up John Smith at Acme" → Person + company +- "Who is the CTO at Notion" → Role-based search +- "Intel on acme.com" → Domain-based lookup +``` + +### Step 2: Web Search (Always) + +``` +Run these searches: +1. "[Company name]" → Homepage, about page +2. "[Company name] news" → Recent announcements +3. "[Company name] funding" → Investment history +4. "[Company name] careers" → Hiring signals +5. "[Person name] [Company] LinkedIn" → Profile info +6. "[Company name] product" → What they sell +7. "[Company name] customers" → Who they serve +``` + +**Extract:** +- Company description and positioning +- Recent news (last 90 days) +- Leadership team +- Open job postings +- Technology mentions +- Customer base + +### Step 3: Enrichment (If Connected) + +``` +If enrichment tools available: +1. 
Enrich company → Firmographics, funding, tech stack +2. Search people → Org chart, contact list +3. Enrich person → Email, phone, background +4. Get signals → Intent data, hiring velocity +``` + +**Enrichment adds:** +- Verified contact info +- Complete org chart +- Precise employee count +- Detailed tech stack +- Funding history with investors + +### Step 4: CRM Check (If Connected) + +``` +If CRM available: +1. Search for account by domain +2. Get related contacts +3. Get opportunity history +4. Get activity timeline +``` + +**CRM adds:** +- Prior relationship context +- What happened before (won/lost deals) +- Who we've talked to +- Notes and history + +### Step 5: Synthesize + +``` +1. Combine all sources +2. Prioritize enrichment data over web (more accurate) +3. Add CRM context if exists +4. Identify qualification signals +5. Generate talking points +6. Recommend approach +``` + +--- + +## Research Variations + +### Company Research +Focus on: Business overview, news, hiring, leadership + +### Person Research +Focus on: Background, role, LinkedIn activity, talking points + +### Competitor Research +Focus on: Product comparison, positioning, win/loss patterns + +### Pre-Meeting Research +Focus on: Attendee backgrounds, recent news, relationship history + +--- + +## Tips for Better Research + +1. **Include the domain** — "research acme.com" is more precise +2. **Specify the person** — "look up Jane Smith, VP Sales at Acme" +3. **State your goal** — "research Stripe before my demo call" +4. **Ask for specifics** — "what's their tech stack?" after initial research + +--- + +## Related Skills + +- **call-prep** — Full meeting prep with this research plus context +- **draft-outreach** — Write personalized message based on research +- **prospecting** — Qualify and prioritize research targets diff --git a/code_puppy/bundled_skills/Sales/call-prep/SKILL.md b/code_puppy/bundled_skills/Sales/call-prep/SKILL.md new file mode 100644 index 00000000..5ca7e2c7 --- /dev/null +++ b/code_puppy/bundled_skills/Sales/call-prep/SKILL.md @@ -0,0 +1,258 @@ +--- +name: call-prep +description: Prepare for a sales call with account context, attendee research, and suggested agenda. Works standalone with user input and web research, supercharged when you connect your CRM, email, chat, or transcripts. Trigger with "prep me for my call with [company]", "I'm meeting with [company] prep me", "call prep [company]", or "get me ready for [meeting]". +--- + +# Call Prep + +Get fully prepared for any sales call in minutes. This skill works with whatever context you provide, and gets significantly better when you connect your sales tools. 
+ +## How It Works + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ CALL PREP │ +├─────────────────────────────────────────────────────────────────┤ +│ ALWAYS (works standalone) │ +│ ✓ You tell me: company, meeting type, attendees │ +│ ✓ Web search: recent news, funding, leadership changes │ +│ ✓ Company research: what they do, size, industry │ +│ ✓ Output: prep brief with agenda and questions │ +├─────────────────────────────────────────────────────────────────┤ +│ SUPERCHARGED (when you connect your tools) │ +│ + CRM: account history, contacts, opportunities, activities │ +│ + Email: recent threads, open questions, commitments │ +│ + Chat: internal discussions, colleague insights │ +│ + Transcripts: prior call recordings, key moments │ +│ + Calendar: auto-find meeting, pull attendees │ +└─────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Getting Started + +When you run this skill, I'll ask for what I need: + +**Required:** +- Company or contact name +- Meeting type (discovery, demo, negotiation, check-in, etc.) + +**Helpful if you have it:** +- Who's attending (names and titles) +- Any context you want me to know (paste prior notes, emails, etc.) + +If you've connected your CRM, email, or other tools, I'll pull context automatically and skip the questions. + +--- + +## Connectors (Optional) + +Connect your tools to supercharge this skill: + +| Connector | What It Adds | +|-----------|--------------| +| **CRM** | Account details, contact history, open deals, recent activities | +| **Email** | Recent threads with the company, open questions, attachments shared | +| **Chat** | Internal chat discussions (e.g. Slack) about the account, colleague insights | +| **Transcripts** | Prior call recordings, topics covered, competitor mentions | +| **Calendar** | Auto-find the meeting, pull attendees and description | + +> **No connectors?** No problem. Just tell me about the meeting and paste any context you have. I'll research the rest. + +--- + +## Output Format + +```markdown +# Call Prep: [Company Name] + +**Meeting:** [Type] — [Date/Time if known] +**Attendees:** [Names with titles] +**Your Goal:** [What you want to accomplish] + +--- + +## Account Snapshot + +| Field | Value | +|-------|-------| +| **Company** | [Name] | +| **Industry** | [Industry] | +| **Size** | [Employees / Revenue if known] | +| **Status** | [New prospect / Active opportunity / Customer] | +| **Last Touch** | [Date and summary] | + +--- + +## Who You're Meeting + +### [Name] — [Title] +- **Background:** [Career history, education if found] +- **LinkedIn:** [URL] +- **Role in Deal:** [Decision maker / Champion / Evaluator / etc.] +- **Last Interaction:** [Summary if known] +- **Talking Point:** [Something personal/professional to reference] + +[Repeat for each attendee] + +--- + +## Context & History + +**What's happened so far:** +- [Key point from prior interactions] +- [Open commitments or action items] +- [Any concerns or objections raised] + +**Recent news about [Company]:** +- [News item 1 — why it matters] +- [News item 2 — why it matters] + +--- + +## Suggested Agenda + +1. **Open** — [Reference last conversation or trigger event] +2. **[Topic 1]** — [Discovery question or value discussion] +3. **[Topic 2]** — [Address known concern or explore priority] +4. **[Topic 3]** — [Demo section / Proposal review / etc.] +5. 
**Next Steps** — [Propose clear follow-up with timeline] + +--- + +## Discovery Questions + +Ask these to fill gaps in your understanding: + +1. [Question about their current situation] +2. [Question about pain points or priorities] +3. [Question about decision process and timeline] +4. [Question about success criteria] +5. [Question about other stakeholders] + +--- + +## Potential Objections + +| Objection | Suggested Response | +|-----------|-------------------| +| [Likely objection based on context] | [How to address it] | +| [Common objection for this stage] | [How to address it] | + +--- + +## Internal Notes + +[Any internal chat context (e.g. Slack), colleague insights, or competitive intel] + +--- + +## After the Call + +Run **call-follow-up** to: +- Extract action items +- Update your CRM +- Draft follow-up email +``` + +--- + +## Execution Flow + +### Step 1: Gather Context + +**If connectors available:** +``` +1. Calendar → Find upcoming meeting matching company name + - Pull: title, time, attendees, description, attachments + +2. CRM → Query account + - Pull: account details, all contacts, open opportunities + - Pull: last 10 activities, any account notes + +3. Email → Search recent threads + - Query: emails with company domain (last 30 days) + - Extract: key topics, open questions, commitments + +4. Chat → Search internal discussions + - Query: company name mentions (last 30 days) + - Extract: colleague insights, competitive intel + +5. Transcripts → Find prior calls + - Pull: call recordings with this account + - Extract: key moments, objections raised, topics covered +``` + +**If no connectors:** +``` +1. Ask user: + - "What company are you meeting with?" + - "What type of meeting is this?" + - "Who's attending? (names and titles if you know)" + - "Any context you want me to know? (paste notes, emails, etc.)" + +2. Accept whatever they provide and work with it +``` + +### Step 2: Research Supplement + +**Always run (web search):** +``` +1. "[Company] news" — last 30 days +2. "[Company] funding" — recent announcements +3. "[Company] leadership" — executive changes +4. "[Company] + [industry] trends" — relevant context +5. Attendee LinkedIn profiles — background research +``` + +### Step 3: Synthesize & Generate + +``` +1. Combine all sources into unified context +2. Identify gaps in understanding → generate discovery questions +3. Anticipate objections based on stage and history +4. Create suggested agenda tailored to meeting type +5. Output formatted prep brief +``` + +--- + +## Meeting Type Variations + +### Discovery Call +- Focus on: Understanding their world, pain points, priorities +- Agenda emphasis: Questions > Talking +- Key output: Qualification signals, next step proposal + +### Demo / Presentation +- Focus on: Their specific use case, tailored examples +- Agenda emphasis: Show relevant features, get feedback +- Key output: Technical requirements, decision timeline + +### Negotiation / Proposal Review +- Focus on: Addressing concerns, justifying value +- Agenda emphasis: Handle objections, close gaps +- Key output: Path to agreement, clear next steps + +### Check-in / QBR +- Focus on: Value delivered, expansion opportunities +- Agenda emphasis: Review wins, surface new needs +- Key output: Renewal confidence, upsell pipeline + +--- + +## Tips for Better Prep + +1. **More context = better prep** — Paste emails, notes, anything you have +2. **Name the attendees** — Even just titles help me research +3. **State your goal** — "I want to get them to agree to a pilot" +4. 
**Flag concerns** — "They mentioned budget is tight" + +--- + +## Related Skills + +- **account-research** — Deep dive on a company before first contact +- **call-follow-up** — Process call notes and execute post-call workflow +- **draft-outreach** — Write personalized outreach after research diff --git a/code_puppy/bundled_skills/Sales/competitive-intelligence/SKILL.md b/code_puppy/bundled_skills/Sales/competitive-intelligence/SKILL.md new file mode 100644 index 00000000..c9f441da --- /dev/null +++ b/code_puppy/bundled_skills/Sales/competitive-intelligence/SKILL.md @@ -0,0 +1,401 @@ +--- +name: competitive-intelligence +description: Research your competitors and build an interactive battlecard. Outputs an HTML artifact with clickable competitor cards and a comparison matrix. Trigger with "competitive intel", "research competitors", "how do we compare to [competitor]", "battlecard for [competitor]", or "what's new with [competitor]". +--- + +# Competitive Intelligence + +Research your competitors extensively and generate an **interactive HTML battlecard** you can use in deals. The output is a self-contained artifact with clickable competitor tabs and an overall comparison matrix. + +## How It Works + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ COMPETITIVE INTELLIGENCE │ +├─────────────────────────────────────────────────────────────────┤ +│ ALWAYS (works standalone via web search) │ +│ ✓ Competitor product deep-dive: features, pricing, positioning │ +│ ✓ Recent releases: what they've shipped in last 90 days │ +│ ✓ Your company releases: what you've shipped to counter │ +│ ✓ Differentiation matrix: where you win vs. where they win │ +│ ✓ Sales talk tracks: how to position against each competitor │ +│ ✓ Landmine questions: expose their weaknesses naturally │ +├─────────────────────────────────────────────────────────────────┤ +│ OUTPUT: Interactive HTML Battlecard │ +│ ✓ Comparison matrix overview │ +│ ✓ Clickable tabs for each competitor │ +│ ✓ Dark theme, professional styling │ +│ ✓ Self-contained HTML file — share or host anywhere │ +├─────────────────────────────────────────────────────────────────┤ +│ SUPERCHARGED (when you connect your tools) │ +│ + CRM: Win/loss data, competitor mentions in closed deals │ +│ + Docs: Existing battlecards, competitive playbooks │ +│ + Chat: Internal intel, field reports from colleagues │ +│ + Transcripts: Competitor mentions in customer calls │ +└─────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Getting Started + +When you run this skill, I'll ask for context: + +**Required:** +- What company do you work for? (or I'll detect from your email) +- Who are your main competitors? (1-5 names) + +**Optional:** +- Which competitor do you want to focus on first? +- Any specific deals where you're competing against them? +- Pain points you've heard from customers about competitors? + +If I already have your seller context from a previous session, I'll confirm and skip the questions. + +--- + +## Connectors (Optional) + +| Connector | What It Adds | +|-----------|--------------| +| **CRM** | Win/loss history against each competitor, deal-level competitor tracking | +| **Docs** | Existing battlecards, product comparison docs, competitive playbooks | +| **Chat** | Internal chat intel (e.g. Slack) — what your team is hearing from the field | +| **Transcripts** | Competitor mentions in customer calls, objections raised | + +> **No connectors?** Web research works great. 
I'll pull everything from public sources — product pages, pricing, blogs, release notes, reviews, job postings.
+
+---
+
+## Output: Interactive HTML Battlecard
+
+The skill generates a **self-contained HTML file** with:
+
+### 1. Comparison Matrix (Landing View)
+Overview comparing you vs. all competitors at a glance:
+- Feature comparison grid
+- Pricing comparison
+- Market positioning
+- Win rate indicators (if CRM connected)
+
+### 2. Competitor Tabs (Click to Expand)
+Each competitor gets a clickable card that expands to show:
+- Company profile (size, funding, target market)
+- What they sell and how they position
+- Recent releases (last 90 days)
+- Where they win vs. where you win
+- Pricing intelligence
+- Talk tracks for different scenarios
+- Objection handling
+- Landmine questions
+
+### 3. Your Company Card
+- Your releases (last 90 days)
+- Your key differentiators
+- Proof points and customer quotes
+
+---
+
+## HTML Structure
+
+A minimal skeleton of the generated file (the tags below are illustrative — the build step fills in one tab button and one card per competitor):
+
+```html
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <title>Battlecard: [Your Company] vs Competitors</title>
+  <style>/* Color system and card styles — see Visual Design below */</style>
+</head>
+<body>
+  <header>
+    <h1>[Your Company] Competitive Battlecard</h1>
+    <p class="meta">Generated: [Date] | Competitors: [List]</p>
+  </header>
+
+  <!-- Tab navigation: Overview plus one tab per competitor -->
+  <nav class="tabs">
+    <button class="tab active" data-target="overview">Overview</button>
+    <button class="tab" data-target="competitor-1">[Competitor 1]</button>
+    <button class="tab" data-target="your-company">[Your Company]</button>
+  </nav>
+
+  <!-- Landing view: comparison matrix -->
+  <section id="overview" class="tab-panel active">
+    <h2>Head-to-Head Comparison</h2>
+    <!-- Feature, pricing, and positioning comparison grid; win-rate indicators if CRM connected -->
+
+    <h2>Quick Win/Loss Guide</h2>
+    <!-- Where you win vs. where they win, at a glance -->
+  </section>
+
+  <!-- One expandable card per competitor -->
+  <section id="competitor-1" class="tab-panel">
+    <!-- Profile, recent releases, where they win / you win, pricing, talk tracks, objections, landmines -->
+  </section>
+
+  <!-- Your company card -->
+  <section id="your-company" class="tab-panel">
+    <!-- Your releases (last 90 days), key differentiators, customer proof points -->
+  </section>
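+  <!-- Hypothetical sketch of the tab-switching behavior; the selectors assume the
+       .tab / .tab-panel classes and data-target attributes used in the skeleton above -->
+  <script>
+    document.querySelectorAll('.tab').forEach(function (tab) {
+      tab.addEventListener('click', function () {
+        // Deactivate everything, then activate the clicked tab and its matching panel
+        document.querySelectorAll('.tab').forEach(function (t) { t.classList.remove('active'); });
+        document.querySelectorAll('.tab-panel').forEach(function (p) { p.classList.remove('active'); });
+        tab.classList.add('active');
+        document.getElementById(tab.dataset.target).classList.add('active');
+      });
+    });
+  </script>
+</body>
+</html>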
+ + + + +``` + +--- + +## Visual Design + +### Color System +```css +:root { + /* Dark theme base */ + --bg-primary: #0a0d14; + --bg-elevated: #0f131c; + --bg-surface: #161b28; + --bg-hover: #1e2536; + + /* Text */ + --text-primary: #ffffff; + --text-secondary: rgba(255, 255, 255, 0.7); + --text-muted: rgba(255, 255, 255, 0.5); + + /* Accent (your brand or neutral) */ + --accent: #3b82f6; + --accent-hover: #2563eb; + + /* Status indicators */ + --you-win: #10b981; + --they-win: #ef4444; + --tie: #f59e0b; +} +``` + +### Card Design +- Rounded corners (12px) +- Subtle borders (1px, low opacity) +- Hover states with slight elevation +- Smooth transitions (200ms) + +### Comparison Matrix +- Sticky header row +- Color-coded winner indicators (green = you, red = them, yellow = tie) +- Expandable rows for detail + +--- + +## Execution Flow + +### Phase 1: Gather Seller Context + +``` +If first time: +1. Ask: "What company do you work for?" +2. Ask: "What do you sell? (product/service in one line)" +3. Ask: "Who are your main competitors? (up to 5)" +4. Store context for future sessions + +If returning user: +1. Confirm: "Still at [Company] selling [Product]?" +2. Ask: "Same competitors, or any new ones to add?" +``` + +### Phase 2: Research Your Company (Always) + +``` +Web searches: +1. "[Your company] product" — current offerings +2. "[Your company] pricing" — pricing model +3. "[Your company] news" — recent announcements (90 days) +4. "[Your company] product updates OR changelog OR releases" — what you've shipped +5. "[Your company] vs [competitor]" — existing comparisons +``` + +### Phase 3: Research Each Competitor (Always) + +``` +For each competitor, run: +1. "[Competitor] product features" — what they offer +2. "[Competitor] pricing" — how they charge +3. "[Competitor] news" — recent announcements +4. "[Competitor] product updates OR changelog OR releases" — what they've shipped +5. "[Competitor] reviews G2 OR Capterra OR TrustRadius" — customer sentiment +6. "[Competitor] vs [alternatives]" — how they position +7. "[Competitor] customers" — who uses them +8. "[Competitor] careers" — hiring signals (growth areas) +``` + +### Phase 4: Pull Connected Sources (If Available) + +``` +If CRM connected: +1. Query closed-won deals with competitor field = [Competitor] +2. Query closed-lost deals with competitor field = [Competitor] +3. Extract win/loss patterns + +If docs connected: +1. Search for "battlecard [competitor]" +2. Search for "competitive [competitor]" +3. Pull existing positioning docs + +If chat connected: +1. Search for "[Competitor]" mentions (last 90 days) +2. Extract field intel and colleague insights + +If transcripts connected: +1. Search calls for "[Competitor]" mentions +2. Extract objections and customer quotes +``` + +### Phase 5: Build HTML Artifact + +``` +1. Structure data for each competitor +2. Build comparison matrix +3. Generate individual battlecards +4. Create talk tracks for each scenario +5. Compile landmine questions +6. Render as self-contained HTML +7. 
Save as [YourCompany]-battlecard-[date].html +``` + +--- + +## Data Structure Per Competitor + +```yaml +competitor: + name: "[Name]" + website: "[URL]" + profile: + founded: "[Year]" + funding: "[Stage + amount]" + employees: "[Count]" + target_market: "[Who they sell to]" + pricing_model: "[Per seat / usage / etc.]" + market_position: "[Leader / Challenger / Niche]" + + what_they_sell: "[Product summary]" + their_positioning: "[How they describe themselves]" + + recent_releases: + - date: "[Date]" + release: "[Feature/Product]" + impact: "[Why it matters]" + + where_they_win: + - area: "[Area]" + advantage: "[Their strength]" + how_to_handle: "[Your counter]" + + where_you_win: + - area: "[Area]" + advantage: "[Your strength]" + proof_point: "[Evidence]" + + pricing: + model: "[How they charge]" + entry_price: "[Starting price]" + enterprise: "[Enterprise pricing]" + hidden_costs: "[Implementation, etc.]" + talk_track: "[How to discuss pricing]" + + talk_tracks: + early_mention: "[Strategy if they come up early]" + displacement: "[Strategy if customer uses them]" + late_addition: "[Strategy if added late to eval]" + + objections: + - objection: "[What customer says]" + response: "[How to handle]" + + landmines: + - "[Question that exposes their weakness]" + + win_loss: # If CRM connected + win_rate: "[X]%" + common_win_factors: "[What predicts wins]" + common_loss_factors: "[What predicts losses]" +``` + +--- + +## Delivery + +```markdown +## ✓ Battlecard Created + +[View your battlecard](file:///path/to/[YourCompany]-battlecard-[date].html) + +--- + +**Summary** +- **Your Company**: [Name] +- **Competitors Analyzed**: [List] +- **Data Sources**: Web research [+ CRM] [+ Docs] [+ Transcripts] + +--- + +**How to Use** +- **Before a call**: Open the relevant competitor tab, review talk tracks +- **During a call**: Reference landmine questions +- **After win/loss**: Update with new intel + +--- + +**Sharing Options** +- **Local file**: Open in any browser +- **Host it**: Upload to Netlify, Vercel, or internal wiki +- **Share directly**: Send the HTML file to teammates + +--- + +**Keep it Fresh** +Run this skill again to refresh with latest intel. Recommended: monthly or before major deals. +``` + +--- + +## Refresh Cadence + +Competitive intel gets stale. Recommended refresh: + +| Trigger | Action | +|---------|--------| +| **Monthly** | Quick refresh — new releases, news, pricing changes | +| **Before major deal** | Deep refresh for specific competitor in that deal | +| **After win/loss** | Update patterns with new data | +| **Competitor announcement** | Immediate update on that competitor | + +--- + +## Tips for Better Intel + +1. **Be honest about weaknesses** — Credibility comes from acknowledging where competitors are strong +2. **Focus on outcomes, not features** — "They have X feature" matters less than "customers achieve Y result" +3. **Update from the field** — Best intel comes from actual customer conversations, not just websites +4. **Plant landmines, don't badmouth** — Ask questions that expose weaknesses; never trash-talk +5. 
**Track releases religiously** — What they ship tells you their strategy and your opportunity + +--- + +## Related Skills + +- **account-research** — Research a specific prospect before reaching out +- **call-prep** — Prep for a call where you know competitor is involved +- **create-an-asset** — Build a custom comparison page for a specific deal diff --git a/code_puppy/bundled_skills/Sales/create-an-asset/QUICKREF.md b/code_puppy/bundled_skills/Sales/create-an-asset/QUICKREF.md new file mode 100644 index 00000000..09a2340f --- /dev/null +++ b/code_puppy/bundled_skills/Sales/create-an-asset/QUICKREF.md @@ -0,0 +1,78 @@ +# Create an Asset — Quick Reference + +## Invoke +``` +/create-an-asset +/create-an-asset [CompanyName] +"Create an asset for [Company]" +``` + +--- + +## Inputs at a Glance + +| Input | What to Provide | +|-------|-----------------| +| **(a) Prospect** | Company, contacts, deal stage, pain points, transcripts | +| **(b) Audience** | Exec / Technical / Ops / Mixed + what they care about | +| **(c) Purpose** | Intro / Follow-up / Deep-dive / Alignment / POC / Close | +| **(d) Format** | Landing page / Deck / One-pager / Workflow demo | + +--- + +## Format Picker + +| If you need... | Choose... | +|----------------|-----------| +| Impressive multi-tab experience | **Interactive landing page** | +| Something to present in a meeting | **Deck-style** | +| Quick summary to leave behind | **One-pager** | +| Visual of how systems connect | **Workflow demo** | + +--- + +## Sample Prompts + +**Basic:** +``` +Create an asset for Acme Corp +``` + +**With context:** +``` +Create an asset for Acme Corp. They're a manufacturing company +struggling with supply chain visibility. Met with their COO +last week. Need something for the exec team. +``` + +**Workflow demo:** +``` +Mock up a workflow for Centric Brands showing how they'd use +our product to monitor contract compliance. Components: our AI, +their Snowflake warehouse, and scanned PDF contracts. +``` + +--- + +## After It's Built + +| Want to... | Say... | +|------------|--------| +| Change colors | "Use our brand colors instead" | +| Add a section | "Add a section on security" | +| Shorten it | "Make it more concise" | +| Fix something | "The CEO's name is wrong, it's Jane Smith" | +| Get PDF | "Give me a print-friendly version" | + +--- + +## Output + +- Self-contained HTML file +- Works offline +- Host anywhere (Netlify, Vercel, GitHub Pages, etc.) +- Password-protect via your hosting provider + +--- + +*That's it. Provide context → answer questions → get asset → iterate.* diff --git a/code_puppy/bundled_skills/Sales/create-an-asset/README.md b/code_puppy/bundled_skills/Sales/create-an-asset/README.md new file mode 100644 index 00000000..c7433d0d --- /dev/null +++ b/code_puppy/bundled_skills/Sales/create-an-asset/README.md @@ -0,0 +1,169 @@ +# Create an Asset + +**For Sales Teams Everywhere** + +Generate professional, customer-ready sales assets in minutes. No design skills required. + +--- + +## What It Does + +This skill creates tailored sales assets by asking you about: +1. **Your prospect** — who they are, what you've discussed +2. **Your audience** — who's viewing, what they care about +3. **Your purpose** — what you want to achieve +4. **Your format** — how you want to present it + +Then it researches, writes, designs, and builds a polished asset you can share with customers. 
+ +--- + +## Supported Formats + +| Format | Best For | Output | +|--------|----------|--------| +| **Interactive Landing Page** | Exec meetings, value prop presentations | Multi-tab page with demos and calculators | +| **Deck-Style** | Formal presentations, large audiences | Linear slides with navigation | +| **One-Pager** | Leave-behinds, quick summaries | Single-scroll executive summary | +| **Workflow / Architecture Demo** | Technical deep-dives, POC proposals | Interactive diagram with animated flow | + +--- + +## Quick Start + +### Option 1: Simple prompt +``` +Create an asset for Acme Corp +``` + +### Option 2: With context +``` +Create an asset for Acme Corp. I met with their VP Engineering +last week - they're struggling with slow release cycles and +want to improve developer productivity. This is for a follow-up +with their technical team. +``` + +### Option 3: Workflow demo +``` +I want to mock up a workflow showing how a customer would use +our product to automate their invoice processing. The flow is: +invoices come in via email → our AI extracts data → validates +against their ERP → flags exceptions for human review. +``` + +--- + +## What Gets Created + +### Interactive Landing Page +- Tabbed navigation +- Company metrics and research +- Use case demos +- ROI calculator +- Professional dark theme with prospect's brand colors + +### Deck-Style +- Title slide with both logos +- Agenda +- Content slides (one message per slide) +- Summary and next steps +- Speaker notes included + +### One-Pager +- Hero statement +- 3 key value points +- Proof point +- Clear CTA + +### Workflow Demo +- Visual component nodes +- Animated data flow +- Step-by-step walkthrough +- Play/pause/step controls +- Sample data at each stage + +--- + +## The Process + +``` +1. You provide context (prospect, audience, purpose) + ↓ +2. Skill researches the prospect company + ↓ +3. Skill asks 3-4 clarifying questions + ↓ +4. You confirm direction + ↓ +5. Skill builds the asset + ↓ +6. You iterate as needed +``` + +--- + +## Sharing Your Asset + +The output is a self-contained HTML file. Share it by: + +- **Static hosting**: Upload to Netlify, Vercel, GitHub Pages, or any web host +- **Password protect**: Most hosts offer simple password protection +- **Direct share**: Email the HTML file — it works offline +- **Embed**: iframe it into other pages or portals + +--- + +## Tips for Best Results + +### Provide Rich Context +The more you share about past conversations, pain points, and stakeholder concerns, the more tailored the asset will be. + +### Upload Transcripts +If you have call recordings, meeting notes, or email threads, upload them. The skill will extract key quotes and priorities. + +### Be Specific About Audience +"Technical team" is good. "IT architects evaluating our security model" is better. + +### Iterate Freely +First draft not quite right? Just say what to change. Colors, sections, messaging, flow — all adjustable. + +--- + +## Examples + +| Scenario | Format | Key Features | +|----------|--------|--------------| +| Post-discovery exec meeting | Interactive page | ROI calculator, their stated priorities, case studies | +| Technical architecture review | Workflow demo | System diagram, data flows, integration points | +| Board presentation leave-behind | One-pager | Strategic alignment, key metrics, clear CTA | +| Large stakeholder meeting | Deck-style | Linear narrative, one point per slide, appendix | + +--- + +## FAQ + +**Q: Does it work for any product/company?** +A: Yes. 
The skill detects what you're selling from your email domain and researches accordingly. + +**Q: How does it know my prospect's brand colors?** +A: It extracts them from the prospect's website or brand guidelines. You can adjust after. + +**Q: Can I use my company's branding instead?** +A: Yes — after the first build, just ask to switch to your brand colors. + +**Q: What if the research is wrong?** +A: Flag it and provide corrections. The skill will regenerate with accurate info. + +**Q: Can I export as PDF?** +A: Yes — ask for a print-optimized version and use your browser's print-to-PDF. + +--- + +## Support + +Questions or feedback? This skill is part of the public sales skills collection. + +--- + +*Built for salespeople who'd rather sell than design slides.* diff --git a/code_puppy/bundled_skills/Sales/create-an-asset/SKILL.md b/code_puppy/bundled_skills/Sales/create-an-asset/SKILL.md new file mode 100644 index 00000000..68206db0 --- /dev/null +++ b/code_puppy/bundled_skills/Sales/create-an-asset/SKILL.md @@ -0,0 +1,867 @@ +--- +name: create-an-asset +description: Generate tailored sales assets (landing pages, decks, one-pagers, workflow demos) from your deal context. Describe your prospect, audience, and goal — get a polished, branded asset ready to share with customers. +--- + +# Create an Asset + +Generate custom sales assets tailored to your prospect, audience, and goals. Supports interactive landing pages, presentation decks, executive one-pagers, and workflow/architecture demos. + +--- + +## Triggers + +Invoke this skill when: +- User says `/create-an-asset` or `/create-an-asset [CompanyName]` +- User asks to "create an asset", "build a demo", "make a landing page", "mock up a workflow" +- User needs a customer-facing deliverable for a sales conversation + +--- + +## Overview + +This skill creates professional sales assets by gathering context about: +- **(a) The Prospect** — company, contacts, conversations, pain points +- **(b) The Audience** — who's viewing, what they care about +- **(c) The Purpose** — goal of the asset, desired next action +- **(d) The Format** — landing page, deck, one-pager, or workflow demo + +The skill then researches, structures, and builds a polished, branded asset ready to share with customers. + +--- + +## Phase 0: Context Detection & Input Collection + +### Step 0.1: Detect Seller Context + +From the user's email domain, identify what company they work for. + +**Actions:** +1. Extract domain from user's email +2. Search: `"[domain]" company products services site:linkedin.com OR site:crunchbase.com` +3. Determine seller context: + +| Scenario | Action | +|----------|--------| +| **Single-product company** | Auto-populate seller context | +| **Multi-product company** | Ask: "Which product or solution is this asset for?" | +| **Consultant/agency/generic domain** | Ask: "What company or product are you representing?" | +| **Unknown/startup** | Ask: "Briefly, what are you selling?" | + +**Store seller context:** +```yaml +seller: + company: "[Company Name]" + product: "[Product/Service]" + value_props: + - "[Key value prop 1]" + - "[Key value prop 2]" + - "[Key value prop 3]" + differentiators: + - "[Differentiator 1]" + - "[Differentiator 2]" + pricing_model: "[If publicly known]" +``` + +**Persist to knowledge base** for future sessions. On subsequent invocations, confirm: "I have your seller context from last time — still selling [Product] at [Company]?" 
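+
+For example, a populated seller context might look like this (the company, product, and values below are purely illustrative, not drawn from any real account):
+
+```yaml
+seller:
+  company: "Acme Analytics"              # hypothetical example
+  product: "Acme Insights Platform"
+  value_props:
+    - "Self-serve dashboards for non-technical teams"
+    - "Native connectors to the customer's existing warehouse"
+    - "Usage-based pricing that scales down for pilots"
+  differentiators:
+    - "Fastest time-to-first-dashboard in its category"
+    - "Deploys inside the customer's own cloud account"
+  pricing_model: "Per-workspace subscription (publicly listed)"
+```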
+ +--- + +### Step 0.2: Collect Prospect Context (a) + +**Ask the user:** + +| Field | Prompt | Required | +|-------|--------|----------| +| **Company** | "Which company is this asset for?" | ✓ Yes | +| **Key contacts** | "Who are the key contacts? (names, roles)" | No | +| **Deal stage** | "What stage is this deal?" | ✓ Yes | +| **Pain points** | "What pain points or priorities have they shared?" | No | +| **Past materials** | "Upload any conversation materials (transcripts, emails, notes, call recordings)" | No | + +**Deal stage options:** +- Intro / First meeting +- Discovery +- Evaluation / Technical review +- POC / Pilot +- Negotiation +- Close + +--- + +### Step 0.3: Collect Audience Context (b) + +**Ask the user:** + +| Field | Prompt | Required | +|-------|--------|----------| +| **Audience type** | "Who's viewing this?" | ✓ Yes | +| **Specific roles** | "Any specific titles to tailor for? (e.g., CTO, VP Engineering, CFO)" | No | +| **Primary concern** | "What do they care most about?" | ✓ Yes | +| **Objections** | "Any concerns or objections to address?" | No | + +**Audience type options:** +- Executive (C-suite, VPs) +- Technical (Architects, Engineers, Developers) +- Operations (Ops, IT, Procurement) +- Mixed / Cross-functional + +**Primary concern options:** +- ROI / Business impact +- Technical depth / Architecture +- Strategic alignment +- Risk mitigation / Security +- Implementation / Timeline + +--- + +### Step 0.4: Collect Purpose Context (c) + +**Ask the user:** + +| Field | Prompt | Required | +|-------|--------|----------| +| **Goal** | "What's the goal of this asset?" | ✓ Yes | +| **Desired action** | "What should the viewer do after seeing this?" | ✓ Yes | + +**Goal options:** +- Intro / First impression +- Discovery follow-up +- Technical deep-dive +- Executive alignment / Business case +- POC proposal +- Deal close + +--- + +### Step 0.5: Select Format (d) + +**Ask the user:** "What format works best for this?" + +| Format | Description | Best For | +|--------|-------------|----------| +| **Interactive landing page** | Multi-tab page with demos, metrics, calculators | Exec alignment, intros, value prop | +| **Deck-style** | Linear slides, presentation-ready | Formal meetings, large audiences | +| **One-pager** | Single-scroll executive summary | Leave-behinds, quick summaries | +| **Workflow / Architecture demo** | Interactive diagram with animated flow | Technical deep-dives, POC demos, integrations | + +--- + +### Step 0.6: Format-Specific Inputs + +#### If "Workflow / Architecture demo" selected: + +**First, parse from user's description.** Look for: +- Systems and components mentioned +- Data flows described +- Human interaction points +- Example scenarios + +**Then ask for any gaps:** + +| If Missing... | Ask... | +|---------------|--------| +| Components unclear | "What systems or components are involved? (databases, APIs, AI, middleware, etc.)" | +| Flow unclear | "Walk me through the step-by-step flow" | +| Human touchpoints unclear | "Where does a human interact in this workflow?" | +| Scenario vague | "What's a concrete example scenario to demo?" | +| Integration specifics | "Any specific tools or platforms to highlight?" 
| + +--- + +## Phase 1: Research (Adaptive) + +### Assess Context Richness + +| Level | Indicators | Research Depth | +|-------|------------|----------------| +| **Rich** | Transcripts uploaded, detailed pain points, clear requirements | Light — fill gaps only | +| **Moderate** | Some context, no transcripts | Medium — company + industry | +| **Sparse** | Just company name | Deep — full research pass | + +### Always Research: + +1. **Prospect basics** + - Search: `"[Company]" annual report investor presentation 2025 2026` + - Search: `"[Company]" CEO strategy priorities 2025 2026` + - Extract: Revenue, employees, key metrics, strategic priorities + +2. **Leadership** + - Search: `"[Company]" CEO CTO CIO 2025` + - Extract: Names, titles, recent quotes on strategy/technology + +3. **Brand colors** + - Search: `"[Company]" brand guidelines` + - Or extract from company website + - Store: Primary color, secondary color, accent + +### If Moderate/Sparse Context, Also Research: + +4. **Industry context** + - Search: `"[Industry]" trends challenges 2025 2026` + - Extract: Common pain points, market dynamics + +5. **Technology landscape** + - Search: `"[Company]" technology stack tools platforms` + - Extract: Current solutions, potential integration points + +6. **Competitive context** + - Search: `"[Company]" vs [seller's competitors]` + - Extract: Current solutions, switching signals + +### If Transcripts/Materials Uploaded: + +7. **Conversation analysis** + - Extract: Stated pain points, decision criteria, objections, timeline + - Identify: Key quotes to reference (use their exact language) + - Note: Specific terminology, acronyms, internal project names + +--- + +## Phase 2: Structure Decision + +### Interactive Landing Page + +| Purpose | Recommended Sections | +|---------|---------------------| +| **Intro** | Company Fit → Solution Overview → Key Use Cases → Why Us → Next Steps | +| **Discovery follow-up** | Their Priorities → How We Help → Relevant Examples → ROI Framework → Next Steps | +| **Technical deep-dive** | Architecture → Security & Compliance → Integration → Performance → Support | +| **Exec alignment** | Strategic Fit → Business Impact → ROI Calculator → Risk Mitigation → Partnership | +| **POC proposal** | Scope → Success Criteria → Timeline → Team → Investment → Next Steps | +| **Deal close** | Value Summary → Pricing → Implementation Plan → Terms → Sign-off | + +**Audience adjustments:** +- **Executive**: Lead with business impact, ROI, strategic alignment +- **Technical**: Lead with architecture, security, integration depth +- **Operations**: Lead with workflow impact, change management, support +- **Mixed**: Balance strategic + tactical; use tabs to separate depth levels + +--- + +### Deck-Style + +Same sections as landing page, formatted as linear slides: + +``` +1. Title slide (Prospect + Seller logos, partnership framing) +2. Agenda +3-N. One section per slide (or 2-3 slides for dense sections) +N+1. Summary / Key takeaways +N+2. Next steps / CTA +N+3. Appendix (optional — detailed specs, pricing, etc.) 
+``` + +**Slide principles:** +- One key message per slide +- Visual > text-heavy +- Use prospect's metrics and language +- Include speaker notes + +--- + +### One-Pager + +Condense to single-scroll format: + +``` +┌─────────────────────────────────────┐ +│ HERO: "[Prospect Goal] with [Product]" │ +├─────────────────────────────────────┤ +│ KEY POINT 1 │ KEY POINT 2 │ KEY POINT 3 │ +│ [Icon + 2-3 │ [Icon + 2-3 │ [Icon + 2-3 │ +│ sentences] │ sentences] │ sentences] │ +├─────────────────────────────────────┤ +│ PROOF POINT: [Metric, quote, or case study] │ +├─────────────────────────────────────┤ +│ CTA: [Clear next action] │ [Contact info] │ +└─────────────────────────────────────┘ +``` + +--- + +### Workflow / Architecture Demo + +**Structure based on complexity:** + +| Complexity | Components | Structure | +|------------|------------|-----------| +| **Simple** | 3-5 | Single-view diagram with step annotations | +| **Medium** | 5-10 | Zoomable canvas with step-by-step walkthrough | +| **Complex** | 10+ | Multi-layer view (overview → detailed) with guided tour | + +**Standard elements:** + +1. **Title bar**: `[Scenario Name] — Powered by [Seller Product]` +2. **Component nodes**: Visual boxes/icons for each system +3. **Flow arrows**: Animated connections showing data movement +4. **Step panel**: Sidebar explaining current step in plain language +5. **Controls**: Play / Pause / Step Forward / Step Back / Reset +6. **Annotations**: Callouts for key decision points and value-adds +7. **Data preview**: Sample payloads or transformations at each step + +--- + +## Phase 3: Content Generation + +### General Principles + +All content should: +- Reference **specific pain points** from user input or transcripts +- Use **prospect's language** — their terminology, their stated priorities +- Map **seller's product** → **prospect's needs** explicitly +- Include **proof points** where available (case studies, metrics, quotes) +- Feel **tailored, not templated** + +--- + +### Section Templates + +#### Hero / Intro +``` +Headline: "[Prospect's Goal] with [Seller's Product]" +Subhead: Tie to their stated priority or top industry challenge +Metrics: 3-4 key facts about the prospect (shows we did homework) +``` + +#### Their Priorities (if discovery follow-up) +``` +Reference specific pain points from conversation: +- Use their exact words where possible +- Show we listened and understood +- Connect each to how we help +``` + +#### Solution Mapping +``` +For each pain point: +├── The challenge (in their words) +├── How [Product] addresses it +├── Proof point or example +└── Outcome / benefit +``` + +#### Use Cases / Demos +``` +3-5 relevant use cases: +├── Visual mockup or interactive demo +├── Business impact (quantified if possible) +├── "How it works" — 3-4 step summary +└── Relevant to their industry/role +``` + +#### ROI / Business Case +``` +Interactive calculator with: +├── Inputs relevant to their business (from research) +│ ├── Number of users/developers +│ ├── Current costs or time spent +│ └── Expected improvement % +├── Outputs: +│ ├── Annual value / savings +│ ├── Cost of solution +│ ├── Net ROI +│ └── Payback period +└── Assumptions clearly stated (editable) +``` + +#### Why Us / Differentiators +``` +├── Differentiators vs. 
alternatives they might consider +├── Trust, security, compliance positioning +├── Support and partnership model +└── Customer proof points (logos, quotes, case studies) +``` + +#### Next Steps / CTA +``` +├── Clear action aligned to Purpose (c) +├── Specific next step (not vague "let's chat") +├── Contact information +├── Suggested timeline +└── What happens after they take action +``` + +--- + +### Workflow Demo Content + +#### Component Definitions + +For each system, define: + +```yaml +component: + id: "snowflake" + label: "Snowflake Data Warehouse" + type: "database" # database | api | ai | middleware | human | document | output + icon: "database" + description: "Financial performance data" + brand_color: "#29B5E8" +``` + +**Component types:** +- `human` — Person initiating or receiving +- `document` — PDFs, contracts, files +- `ai` — AI/ML models, agents +- `database` — Data stores, warehouses +- `api` — APIs, services +- `middleware` — Integration platforms, MCP servers +- `output` — Dashboards, reports, notifications + +#### Flow Steps + +For each step, define: + +```yaml +step: + number: 1 + from: "human" + to: "claude" + action: "Initiates performance review" + description: "Sarah, a Brand Analyst at [Prospect], kicks off the quarterly review..." + data_example: "Review request: Nike brand, Q4 2025" + duration: "~1 second" + value_note: "No manual data gathering required" +``` + +#### Scenario Narrative + +Write a clear, specific walkthrough: + +``` +Step 1: Human Trigger +"Sarah, a Brand Performance Analyst at Centric Brands, needs to review +Q4 performance for the Nike license agreement. She opens the review +dashboard and clicks 'Start Review'..." + +Step 2: Contract Analysis +"Claude retrieves the Nike contract PDF and extracts the performance +obligations: minimum $50M revenue, 12% margin requirement, quarterly +reporting deadline..." + +Step 3: Data Query +"Claude formulates a query and sends it to Workato DataGenie: +'Get Q4 2025 revenue and gross margin for Nike brand from Snowflake'..." + +Step 4: Results & Synthesis +"Snowflake returns the data. Claude compares actuals vs. obligations: +Revenue $52.3M ✓ (exceeded by $2.3M) +Margin 11.2% ⚠️ (0.8% below threshold)..." 
+ +Step 5: Insight Delivery +"Claude synthesizes findings into an executive summary with +recommendations: 'Review promotional spend allocation to improve +margin performance...'" +``` + +--- + +## Phase 4: Visual Design + +### Color System + +```css +:root { + /* === Prospect Brand (Primary) === */ + --brand-primary: #[extracted from research]; + --brand-secondary: #[extracted]; + --brand-primary-rgb: [r, g, b]; /* For rgba() usage */ + + /* === Dark Theme Base === */ + --bg-primary: #0a0d14; + --bg-elevated: #0f131c; + --bg-surface: #161b28; + --bg-hover: #1e2536; + + /* === Text === */ + --text-primary: #ffffff; + --text-secondary: rgba(255, 255, 255, 0.7); + --text-muted: rgba(255, 255, 255, 0.5); + + /* === Accent === */ + --accent: var(--brand-primary); + --accent-hover: var(--brand-secondary); + --accent-glow: rgba(var(--brand-primary-rgb), 0.3); + + /* === Status === */ + --success: #10b981; + --warning: #f59e0b; + --error: #ef4444; +} +``` + +### Typography + +```css +/* Primary: Clean, professional sans-serif */ +font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif; + +/* Headings */ +h1: 2.5rem, font-weight: 700 +h2: 1.75rem, font-weight: 600 +h3: 1.25rem, font-weight: 600 + +/* Body */ +body: 1rem, font-weight: 400, line-height: 1.6 + +/* Captions/Labels */ +small: 0.875rem, font-weight: 500 +``` + +### Visual Elements + +**Cards:** +- Background: `var(--bg-surface)` +- Border: 1px solid rgba(255,255,255,0.1) +- Border-radius: 12px +- Box-shadow: subtle, layered +- Hover: slight elevation, border glow + +**Buttons:** +- Primary: `var(--accent)` background, white text +- Secondary: transparent, accent border +- Hover: brightness increase, subtle scale + +**Animations:** +- Transitions: 200-300ms ease +- Tab switches: fade + slide +- Hover states: smooth, not jarring +- Loading: subtle pulse or skeleton + +### Workflow Demo Specific + +**Component Nodes:** +```css +.node { + background: var(--bg-surface); + border: 2px solid var(--brand-primary); + border-radius: 12px; + padding: 16px; + min-width: 140px; +} + +.node.active { + box-shadow: 0 0 20px var(--accent-glow); + border-color: var(--accent); +} + +.node.human { + border-color: #f59e0b; /* Warm color for humans */ +} + +.node.ai { + background: linear-gradient(135deg, var(--bg-surface), var(--bg-elevated)); + border-color: var(--accent); +} +``` + +**Flow Arrows:** +```css +.arrow { + stroke: var(--text-muted); + stroke-width: 2; + fill: none; + marker-end: url(#arrowhead); +} + +.arrow.active { + stroke: var(--accent); + stroke-dasharray: 8 4; + animation: flowDash 1s linear infinite; +} +``` + +**Canvas:** +```css +.canvas { + background: + radial-gradient(circle at center, var(--bg-elevated) 0%, var(--bg-primary) 100%), + url("data:image/svg+xml,..."); /* Subtle grid pattern */ + overflow: auto; +} +``` + +--- + +## Phase 5: Clarifying Questions (REQUIRED) + +**Before building any asset, always ask clarifying questions.** This ensures alignment and prevents wasted effort. + +### Step 5.1: Summarize Understanding + +First, show the user what you understood: + +``` +"Here's what I'm planning to build: + +**Asset**: [Format] for [Prospect Company] +**Audience**: [Audience type] — specifically [roles if known] +**Goal**: [Purpose] → driving toward [desired action] +**Key themes**: [2-3 main points to emphasize] + +[For workflow demos, also show:] +**Components**: [List of systems] +**Flow**: [Step 1] → [Step 2] → [Step 3] → ... 
+``` + +### Step 5.2: Ask Standard Questions (ALL formats) + +| Question | Why | +|----------|-----| +| "Does this match your vision?" | Confirm understanding | +| "What's the ONE thing this must nail to succeed?" | Focus on priority | +| "Tone preference? (Bold & confident / Consultative / Technical & precise)" | Style alignment | +| "Focused and concise, or comprehensive?" | Scope calibration | + +### Step 5.3: Ask Format-Specific Questions + +#### Interactive Landing Page: +- "Which sections matter most for this audience?" +- "Any specific demos or use cases to highlight?" +- "Should I include an ROI calculator?" +- "Any competitor positioning to address?" + +#### Deck-Style: +- "How long is the presentation? (helps with slide count)" +- "Presenting live, or a leave-behind?" +- "Any specific flow or narrative arc in mind?" + +#### One-Pager: +- "What's the single most important message?" +- "Any specific proof point or stat to feature?" +- "Will this be printed or digital?" + +#### Workflow / Architecture Demo: +- "Let me confirm the components: [list]. Anything missing?" +- "Here's the flow I understood: [steps]. Correct?" +- "Should the demo show realistic sample data, or keep it abstract?" +- "Any integration details to highlight or downplay?" +- "Should viewers be able to click through steps, or auto-play?" + +### Step 5.4: Confirm and Proceed + +After user responds: + +``` +"Got it. I have what I need. Building your [format] now..." +``` + +Or, if still unclear: + +``` +"One more quick question: [specific follow-up]" +``` + +**Max 2 rounds of questions.** If still ambiguous, make a reasonable choice and note: "I went with X — easy to adjust if you prefer Y." + +--- + +## Phase 6: Build & Deliver + +### Build the Asset + +Following all specifications above: +1. Generate structure based on Phase 2 +2. Create content based on Phase 3 +3. Apply visual design based on Phase 4 +4. Ensure all interactive elements work +5. Test responsiveness (if applicable) + +### Output Format + +**All formats**: Self-contained HTML file +- All CSS inline or in `